See the question and my original answer on StackOverflow

WebDriver always relies on the target browser's XPATH engine. Technically, it's just a fancy bridge to the browser (whether that browser is Firefox or Chrome; note that IE up to version 11 does not support XPATH).

Unfortunately, the DOM (the structure of elements and attributes) that resides in browser memory is not the same as the DOM that you probably provided to the Html Agility Pack. It could be the same if you loaded HAP with the content of the DOM from the browser's memory (the equivalent of document.outerHTML, for example). In general this is not the case, because developers use HAP to scrape sites without a browser, so they feed it from a network stream (from an HTTP GET request) or a raw file.

This problem is easy to demonstrate. For example, if you create a file that contains only this:

<table><tr><td>hello world</td></tr></table>

(no HTML tag, no BODY tag; this is in fact an invalid HTML file)

With HAP you can load it like this:

HtmlDocument doc = new HtmlDocument();
doc.Load(myFile);

And the structure HAP will come up with is simply this:

+table
 +tr
  +td
   'hello world'

HAP is not a browser, it's a parser; it doesn't really know the HTML specification, it just knows how to parse a bunch of tags and build a DOM from them. It doesn't know, for example, that a document should start with an HTML element and contain a BODY, or that browsers always infer a TBODY child for a TABLE element.

In Chrome, though, if you open this file, inspect the TD element, and ask the browser for its XPATH, it will report this:

/html/body/table/tbody/tr/td

Because Chrome just made this up by itself... As you can see, the two systems don't match.
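You can check this mismatch directly with HAP. Here is a minimal sketch (inline HTML instead of a file, and a hypothetical class name; requires the HtmlAgilityPack NuGet package):

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class XPathMismatchDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td>hello world</td></tr></table>");

        // HAP's DOM contains exactly the tags it was given...
        var hapStyle = doc.DocumentNode.SelectSingleNode("/table/tr/td");
        Console.WriteLine(hapStyle != null);    // True

        // ...so the XPATH Chrome reports finds nothing here.
        var chromeStyle = doc.DocumentNode.SelectSingleNode("/html/body/table/tbody/tr/td");
        Console.WriteLine(chromeStyle == null); // True
    }
}
```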

Note that if you have id attributes available in the source HTML, the story gets better. For example, with the following HTML:

<table><tr><td id='hw'>hello world</td></tr></table>

Chrome will report the following XPATH (it will try to use id attributes as much as possible):

//*[@id="hw"]

which can be used in HAP as well. This does not work all the time, though. For example, with the following HTML:

<table id='mytable'><tr><td>hello world</td></tr></table>

Chrome will now produce this XPATH to the TD:

//*[@id="mytable"]/tbody/tr/td

As you can see, this is again not usable in HAP, because of the inferred TBODY.
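This, too, is easy to verify with HAP. A minimal sketch (inline HTML and a hypothetical class name, requiring the HtmlAgilityPack NuGet package):

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class TbodyDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table id='mytable'><tr><td>hello world</td></tr></table>");

        // Chrome's XPATH fails: HAP never inserted a TBODY.
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='mytable']/tbody/tr/td") == null); // True

        // Skipping the inferred TBODY with a recursive // step works in both worlds,
        // because // matches descendants whether or not a TBODY sits in between.
        var td = doc.DocumentNode.SelectSingleNode("//*[@id='mytable']//td");
        Console.WriteLine(td.InnerText); // hello world
    }
}
```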

So, in the end, you can't just blindly use browser-generated XPATH in any context other than those browsers. In other contexts, you will have to find other discriminants.

Actually, I personally think this is somewhat a good thing, because it forces you to write XPATH that is more resistant to changes. But you'll have to think :-)

Now let's get back to your case :)

The following C# console sample should work fine:

  static void Main(string[] args)
  {
      var web = new HtmlWeb();
      var doc = web.Load("http://www2.epa.gov/languages/traditional-chinese");
      var node = doc.DocumentNode.SelectSingleNode("//section[@id='main-content']//div[@class='pane-content']//a");
      Console.WriteLine(node.OuterHtml); // displays <a href="http://www.oehha.ca.gov/fish/pdf/59329_CHINESE.pdf">...etc...</a>
  }

If you look at the structure of the stream or file (or even at what the browser displays, but take care to avoid TBODYs...), the easiest approach is to:

  • find an id (just like browsers do), and/or
  • find unique child or grandchild elements or attributes below it, recursively or not
  • avoid overly precise XPATHs; things like p/p/p/div/a/div/whatever are bad

So, here, after anchoring on the main-content id attribute, we just look (recursively, with //) for a DIV that has a specific class, and then look (again recursively) for the first A element available.
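The same idea can be sketched locally against a trimmed-down, hypothetical snippet standing in for the real page markup (so this runs without network access; requires the HtmlAgilityPack NuGet package):

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class RobustXPathDemo
{
    static void Main()
    {
        // Hypothetical, trimmed-down markup standing in for the real page.
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<section id='main-content'>
                         <div class='wrapper'>
                           <div class='pane-content'><a href='59329_CHINESE.pdf'>fish advisory</a></div>
                         </div>
                       </section>");

        // Anchor on the id, then recurse with // so intermediate wrappers don't matter.
        var node = doc.DocumentNode.SelectSingleNode(
            "//section[@id='main-content']//div[@class='pane-content']//a");
        Console.WriteLine(node.GetAttributeValue("href", "")); // 59329_CHINESE.pdf
    }
}
```

Note that the extra wrapper DIV in the snippet is skipped transparently by the recursive // steps; a fully spelled-out path like section/div/div/a would break as soon as that wrapper changed.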

This XPATH should work in webdriver and in HAP.

Note that this XPATH also works: //div[@class='pane-content']//a, but it looks a bit loose to me. Anchoring on id attributes is often a good idea.