Scraping GRO search results using AngleSharp

Python has Scrapy, BeautifulSoup and lxml. Ruby has Mechanize and nokogiri. What about .NET?

Requirements

To scrape websites you generally need a library which will allow you to: navigate the site; log in if necessary; submit search forms; page through results; extract the data. Other considerations include performance and tolerance of malformed markup. Tools usually fall into one of two categories, website crawlers or HTML parsers, with some having elements of both.

Most articles I've found so far on scraping websites with .NET focus only on parsing a single page of results, which is frustrating because there's usually a lot more to do before you get to that part.

Contenders

There appears to be no particular de facto "go to" library for .NET. The options are:

  • HtmlAgilityPack - possibly the most established library for parsing HTML, though it now appears to be under new ownership. The new website seems a bit sparse on details (and a bit broken). It's an HTML parser rather than a full scraping library.
  • ScrapySharp - provides scraping functionality on top of HtmlAgilityPack. Last updated over 18 months ago, precious little documentation.
  • IronWebScraper - this looked promising, though again there is very little "getting started" information, just documentation generated from the source code. I installed the NuGet package and started to use it, but there seemed to be a strange model of saving scraped results to the file system. Abandoned.
  • AngleSharp - seems like development has stalled recently and documentation is scattered and not linked from the main "showcase" website. Nevertheless a bit of digging reveals the crawling and parsing aspects seem reasonably functional.
  • DotnetSpider - referenced from the HtmlAgilityPack third party library page - by all accounts a port of Scrapy / WebMagic - looks interesting, not yet tried.

Supporting Tools

  • Fiddler - when you're scraping and submitting forms and nothing is working, it's useful to check exactly what the scraping library is actually doing. Fiddler is a debugging proxy which lets you see the requests and responses going back and forth on the wire.

The GRO website

The General Register Office website allows searching the historic indexes for births and deaths registered in England or Wales since 1837. The information available is great, but searching is slightly frustrating since you have to specify the gender and can only search within a five year range at a time. I originally developed a scraper in Ruby using Mechanize and nokogiri, but for a recent project in .NET I decided to rewrite the scraper rather than calling the existing Ruby version from the .NET code (I might yet change my mind).
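
Because each search covers one gender and at most five years, collecting every record for a surname means running a grid of searches. The following is a minimal sketch of how that grid could be enumerated. The names SearchWindows and Enumerate, the gender values "M" and "F", and the end year used in Main are all assumptions for illustration, not values taken from the GRO form.

```csharp
using System;
using System.Collections.Generic;

public static class SearchWindows
{
    // Yields one (gender, from, to) tuple per search needed to cover
    // firstYear..lastYear in blocks of at most five years.
    public static IEnumerable<(string Gender, int From, int To)> Enumerate(int firstYear, int lastYear)
    {
        // Hypothetical gender values; the real form may expect something else.
        var genders = new[] { "M", "F" };

        for (var from = firstYear; from <= lastYear; from += 5)
        {
            // The final block may be shorter than five years.
            var to = Math.Min(from + 4, lastYear);

            foreach (var gender in genders)
            {
                yield return (gender, from, to);
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // 1837 is the first year of registration; 1850 is an arbitrary end year for the demo.
        foreach (var search in SearchWindows.Enumerate(1837, 1850))
        {
            Console.WriteLine($"{search.Gender} {search.From}-{search.To}");
        }
    }
}
```

With the arguments shown, the loop produces six searches: three year blocks (1837-1841, 1842-1846, 1847-1850), each run once per gender.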

Scraping all records for one surname

AngleSharp documentation

The GitHub repo homepage has brief getting started instructions, but the best documentation is linked from the wiki area: https://github.com/AngleSharp/AngleSharp/wiki

Set up

Logging in to the GRO site requires submitting a form and thereafter keeping track of cookies. In AngleSharp, this requires setting up and reusing a "browsing context". The brief instructions on the GitHub homepage, in conjunction with an article on codeproject.com, cover the basics.

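// Enable the default loader (so pages can be requested) and cookie handling
var config = Configuration.Default.WithDefaultLoader().WithCookies();
// Reuse this one context for every request so the login session is kept
var context = BrowsingContext.New(config);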

Log in

Having set up the context, browsing to the initial login page is as simple as await context.OpenAsync("login-page-url"). The object context.Active contains details of the current document being "browsed" i.e. the login page here. Various selectors are available to locate the login <form>. The SubmitAsync method can then be used to submit it.
This didn't work for ages. Comparing in Fiddler the requests AngleSharp was submitting with those from a basic HttpClient implementation revealed that the name of the button being clicked wasn't being sent. This was because I was submitting the form rather than "clicking the button". I eventually found an issue from Florian which described the problem and the solution: call SubmitAsync on the button element rather than on the form. Bingo.

var loginPage = "https://www.gro.gov.uk/gro/content/certificates/login.asp";
await context.OpenAsync(loginPage);

// Submit via the button, not the form, so the button's name is sent too
await context.Active
    .QuerySelector<IHtmlFormElement>("form")
    .QuerySelector<IHtmlInputElement>("input.formButton")
    .SubmitAsync(new
    {
        username = _username,
        password = _password
    });
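
For comparison, the kind of basic HttpClient baseline mentioned above might look something like the sketch below. It is an illustration under stated assumptions, not the code used in the original comparison: the class and method names are made up, the button field ("submit" = "Login") is a hypothetical placeholder for whatever name and value the real login button has, and it assumes the form posts back to the login page URL with no hidden fields.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class GroLoginBaseline
{
    // Creates a client whose handler stores cookies, so the session
    // cookie set at login is sent with later requests from the same client.
    public static HttpClient CreateClient()
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = new CookieContainer(),
            UseCookies = true
        };
        return new HttpClient(handler);
    }

    public static async Task<HttpResponseMessage> LoginAsync(
        HttpClient client, string username, string password)
    {
        // Assumption: the form posts back to the login page itself.
        var loginPage = "https://www.gro.gov.uk/gro/content/certificates/login.asp";

        var fields = new Dictionary<string, string>
        {
            ["username"] = username,
            ["password"] = password,
            // Hypothetical button name and value; check the real ones in the page source.
            ["submit"] = "Login"
        };

        // Sends the fields as application/x-www-form-urlencoded, like a browser form post.
        return await client.PostAsync(loginPage, new FormUrlEncodedContent(fields));
    }
}
```

Capturing a request like this in Fiddler alongside the AngleSharp one makes a missing field, such as the button name, easy to spot.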

Select 'Search the GRO Indexes'


There are various ways to find a link and click it. There is no id attribute, so the least fragile option seemed to be to locate the link by its text. The selectors are CSS-based, and CSS isn't good at finding links by their text. Alternatively, AngleSharp collections can be queried with LINQ to locate elements.

var result = context.Active.Links.Single(a => a.TextContent == "Search the GRO Indexes") as IHtmlAnchorElement;
await result.NavigateAsync();

Choose births or deaths

The search form first requires selecting either the Birth or Death index so the remaining search fields can be tailored accordingly.


This time, simply submitting the form with the required radio button value works fine.

await context.Active
    .QuerySelector<IHtmlFormElement>("form")
    .SubmitAsync(new { index = "EW_Birth" });

Fill in and submit search form

Parse first page