How can I use html agility to grab everything between <b> and <br>
See the question and my original answer on Stack Overflow.
It's not that easy, because the original document is quite unstructured (it uses a flat layout rather than a hierarchical one), but here is how you can extract the main text fields with the Html Agility Pack:
using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("yourDoc.Htm");

// Get A nodes that have an HREF attribute and whose parent is a B element.
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//b/a[@href]");
if (links != null)
{
    foreach (HtmlNode node in links)
    {
        // This will contain the anchor's displayed text
        string title = node.InnerText;
        Console.WriteLine("title=" + title);

        // Get the 1st BR after the parent B, and then its next sibling of TEXT type.
        HtmlNode sizeNode = node.SelectSingleNode("../following-sibling::br[1]/following-sibling::text()");
        Console.WriteLine(" size=" + sizeNode?.InnerText.Trim());

        // Get the 2nd BR, and then its next sibling of TEXT type.
        HtmlNode eanNode = node.SelectSingleNode("../following-sibling::br[2]/following-sibling::text()");
        Console.WriteLine(" ean=" + eanNode?.InnerText.Trim());

        // Get the 3rd BR, and then its next sibling of TEXT type.
        HtmlNode upcNode = node.SelectSingleNode("../following-sibling::br[3]/following-sibling::text()");
        Console.WriteLine(" upc=" + upcNode?.InnerText.Trim());
    }
}
This will display:
title=Peanut Delight Peanut Butter & Grape Jelly
 size=Size: 18 oz
 ean=GTIN/EAN-13: 0041498143909 / 00-41498-14390-9
 upc=UPC-A: 041498143909 / 04149814390
title=Nabisco Nutter Butter Sandwich Cookie Bites Peanut Butter
 size=Size: 10 oz
 ean=GTIN/EAN-13: 0044000046118 / 00-44000-04611-8
 upc=UPC-A: 044000046118 / 04400004611
title=Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct
 size=Size: 12 oz
 ean=GTIN/EAN-13: 0044000003562 / 00-44000-00356-2
 upc=UPC-A: 044000003562 / 04400000356
etc...
NOTE: This isn't 100% finished: you still have to parse the size, ean and upc values using standard string manipulation (IndexOf, Substring, etc.) or Regex, but the HTML side of things is done.
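As a rough sketch of that remaining step, the helper below strips the label from each extracted text field and pulls out the first numeric code. The class name ProductFieldParser and its two methods are hypothetical, not part of the Html Agility Pack. It assumes every field looks like the output above ("Label: value", with codes of at least 8 digits), so adjust it if your document differs.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the Html Agility Pack) for the
// "Label: value" strings extracted above.
public static class ProductFieldParser
{
    // Returns the text after the first ':' (e.g. "18 oz" from "Size: 18 oz"),
    // or the trimmed input when there is no colon.
    public static string ValueAfterLabel(string field)
    {
        int colon = field.IndexOf(':');
        return colon < 0 ? field.Trim() : field.Substring(colon + 1).Trim();
    }

    // Returns the first run of 8 or more digits after the label,
    // or an empty string when none is found.
    // Assumption: the unformatted code comes first, before the " / " variant.
    public static string FirstCode(string field)
    {
        Match m = Regex.Match(ValueAfterLabel(field), @"\d{8,}");
        return m.Success ? m.Value : string.Empty;
    }
}
```

For the first product above, ProductFieldParser.ValueAfterLabel(sizeNode.InnerText) would give "18 oz", and ProductFieldParser.FirstCode(eanNode.InnerText) would give "0041498143909". The regex is applied after the label is removed so that the "13" in "EAN-13" can never be picked up as part of a code.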