Simon Mourier's Blog - Advanced HTML Agility Pack usage

Advanced HTML Agility Pack usage

Jan 6, 2011. See the question and my original answer on Stack Overflow.

Well, you have to understand XPATH to really take advantage of the HTML Agility Pack's scraping capabilities :-) You can search for XPATH examples to start with.

Focusing on the screen-scraping question, the tricky part is choosing the XPATH expression that best singles out the information you want. Most of the time there is more than one solution, and you must be prepared to update your code as the target site's HTML evolves.

So it's a trade-off between very simple expressions, which risk matching unwanted text, and overly specific expressions, which do not tolerate changes in the scraped HTML and risk matching nothing at all.
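
To make this trade-off concrete, here is a minimal sketch. The markup, the class name and the `Count` helper are made up for illustration (they are not the HTML from the question), and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class TradeOffDemo
{
    // Hypothetical markup: the company name sits in the first B tag.
    public const string Original =
        "<table><tr><td class='black'><b>Acme Corp</b></td><td><b>Contact</b></td></tr></table>";

    // The same markup after the site wraps it in a DIV.
    public const string Evolved = "<div>" + Original + "</div>";

    // Number of nodes matched by an expression.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int Count(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        string[] xpaths =
        {
            "//b",                            // very simple
            "/table/tr/td[@class='black']/b", // absolute path from the document root
            "//td[@class='black']/b"          // anchored on the class attribute
        };

        foreach (string xpath in xpaths)
        {
            Console.WriteLine(xpath + ": " + Count(Original, xpath) + " match(es) before, "
                + Count(Evolved, xpath) + " after the change");
        }
    }
}
```

On this sample, `//b` is too loose because it also picks up the "Contact" cell, the absolute path stops matching once the wrapper DIV appears, and the class-based expression keeps matching a single node for as long as the class name stays the same.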

As for your specific text, this is a good real-world example, and here is some code that does it:

using System;
using HtmlAgilityPack;

// yourText holds the HTML fragment from the question
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourText);

string companyName = doc.DocumentNode.SelectSingleNode("/td/table/tr/td/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);

// another way
companyName = doc.DocumentNode.SelectSingleNode("//td[@class='black']/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);

// a more advanced XPATH expression, which means
// "Select a TD tag anywhere in the doc that has a preceding sibling of TD type with a B child, with a FONT child with inner text starting with 'Phone Number'"
string phoneNumber = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'Phone Number')]").InnerText;
Console.WriteLine("phone Number=" + phoneNumber);

// same idea, but go one step further down to the A tag inside the matched TD
string email = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'E-mail')]/a").InnerText;
Console.WriteLine("email=" + email);

PS: please note that the HTML Agility Pack always expects tag names used in XPATH expressions to be lowercase, even if they are not lowercase in the original HTML text.
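
The lowercase rule is easy to check with a small sketch. The markup, the class name and the `Matches` helper below are hypothetical, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class TagCaseDemo
{
    // True when the expression matches at least one node in the given HTML.
    // SelectSingleNode returns null when nothing matches.
    public static bool Matches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }

    public static void Main()
    {
        // Hypothetical markup written with uppercase tags.
        string html = "<TABLE><TR><TD>Hello</TD></TR></TABLE>";

        Console.WriteLine("//TD matches: " + Matches(html, "//TD"));
        Console.WriteLine("//td matches: " + Matches(html, "//td"));
    }
}
```

Only the lowercase expression should find the cell, even though the source HTML uses uppercase tags.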

As you can see, the company name is retrieved here using two different expressions. They both work on the sample, but the first one will break if a new tag is added anywhere along the path. The second one is more future-proof, but it relies on a CSS class attribute that may also change. It's always a trade-off.
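
Since either expression can stop matching, and SelectSingleNode returns null in that case (so calling .InnerText on the result throws), one possible way to cope is to try several expressions in order. This is only a sketch: `SelectFirstText`, the class name and the sample markup are hypothetical, not part of the HTML Agility Pack or of the original question.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathFallback
{
    // Hypothetical helper: tries each expression in order and returns the trimmed
    // inner text of the first match, or null when none of them matches.
    public static string SelectFirstText(HtmlNode root, params string[] xpaths)
    {
        foreach (string xpath in xpaths)
        {
            HtmlNode node = root.SelectSingleNode(xpath);
            if (node != null)
                return node.InnerText.Trim();
        }
        return null;
    }

    public static void Main()
    {
        // Hypothetical markup, shaped so that the class-based expression matches.
        string html = "<td class='black'><table><tr><th> Acme Corp </th></tr></table></td>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Most tolerant expression first, strict path as a fallback.
        string companyName = SelectFirstText(doc.DocumentNode,
            "//td[@class='black']/table/tr/th",
            "/td/table/tr/td/table/tr/th");

        Console.WriteLine("company name=" + (companyName ?? "(not found)"));
    }
}
```

Ordering the expressions from most tolerant to most specific is a design choice, not a rule; the point is that a missing match becomes a value you can test instead of an exception.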

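The phone number and e-mail expressions follow the same pattern, but they show the power of XPATH: each one finds a cell by the label in the cell before it, rather than by its position in the document.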
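
The label-based pattern can be wrapped in a small helper. This is a sketch under assumptions: `ValueCellXPath`, `Lookup`, the class name and the sample markup (one label/value pair per table row) are hypothetical and only mimic the structure implied by the expressions above.

```csharp
using System;
using HtmlAgilityPack;

public static class LabelLookup
{
    // Hypothetical helper: builds the expression used in the post for a given label.
    // The label is pasted into the expression as-is, so it must not contain a single quote.
    public static string ValueCellXPath(string label)
    {
        return "//td[starts-with(preceding-sibling::td/b/font/text(), '" + label + "')]";
    }

    // Inner text of the first node matched by the expression, or null when nothing matches.
    public static string Lookup(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        // Hypothetical markup: each row holds a label cell followed by a value cell.
        string html = "<table>"
            + "<tr><td><b><font>Phone Number:</font></b></td><td>555-0100</td></tr>"
            + "<tr><td><b><font>E-mail:</font></b></td>"
            + "<td><a href='mailto:info@example.com'>info@example.com</a></td></tr>"
            + "</table>";

        string phone = Lookup(html, ValueCellXPath("Phone Number"));
        string email = Lookup(html, ValueCellXPath("E-mail") + "/a");

        Console.WriteLine("phone number=" + (phone ?? "(not found)"));
        Console.WriteLine("email=" + (email ?? "(not found)"));
    }
}
```

One caveat: in XPath 1.0, starts-with converts a node set to the string value of its first node in document order, so this sketch relies on there being a single label cell before each value cell in a row.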