Saturday, April 30, 2011

C# Programming: Web Scraping with Regular Expressions (draft)

Overview


If you want to extract data from a public web site that offers no API or data feed, you have to scrape it.

Use Case 1: You need all products from a given online shop catalog to do price comparisons or re-sell them at your own shop.

Use Case 2: You need to know how your site is positioned in search engine result pages (SERPs) for given keywords. To find out, you search for your keywords and scrape the SERPs for the position of your web site.

There are many other use cases where web scraping can be applied. Before scraping a web site, check its terms and conditions, as scraping information from certain sites can raise legal issues.

Crawling before Scraping


The scraping itself is the easiest part of the exercise. The harder part is getting to the pages you want to scrape. There is only one way to do it: like a spider, you need to crawl your way to the desired page, following links from one page to the next.
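
To make the crawling idea concrete, here is a minimal sketch of a breadth-first crawl. The names CrawlSketch, Crawl, fetchHtml and maxPages are my own assumptions for illustration. The fetchHtml delegate stands in for whatever downloads a page (for example a WebClient call), so the crawl logic can be tried without touching the network. Links are taken as they appear in the href attribute; a real crawler would also resolve relative URLs and respect robots.txt.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlSketch
{
    // Breadth-first crawl: start at startUrl, follow href links, and return
    // the URLs in the order they were visited, up to maxPages pages.
    // fetchHtml is a hypothetical delegate that returns the HTML for a URL,
    // or null if the page cannot be loaded.
    public static List<string> Crawl(string startUrl, Func<string, string> fetchHtml, int maxPages)
    {
        var visited = new List<string>();
        var seen = new HashSet<string> { startUrl };
        var queue = new Queue<string>();
        queue.Enqueue(startUrl);

        while (queue.Count > 0 && visited.Count < maxPages)
        {
            string url = queue.Dequeue();
            string html = fetchHtml(url);
            if (html == null) continue; // skip pages that failed to load
            visited.Add(url);

            // Queue every double-quoted href target that has not been seen yet.
            foreach (Match m in Regex.Matches(html, "href\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase))
            {
                string link = m.Groups[1].Value;
                if (seen.Add(link)) queue.Enqueue(link);
            }
        }
        return visited;
    }
}
```

The seen set is what keeps the spider from walking in circles when two pages link to each other, and maxPages is a simple safety limit.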

Use regular expressions


Here I show a simple class that receives an HTML string and extracts all the links and their text into structs. It is fairly fast, but I offer some optimization tips further down. It would be better to store the results in a class rather than a struct and offer methods that act on its contents.
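
The listing itself is not part of this draft, so what follows is only a sketch of what such a link extractor could look like. The names LinkItem, LinkFinder and Find are assumptions for illustration, and the pattern only handles quoted href values; it is not a full HTML parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One extracted link: the href target and the visible text.
public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + " (" + Text + ")";
    }
}

public static class LinkFinder
{
    // Group 1 is the quoted href value, group 2 is everything between <a ...> and </a>.
    // Singleline lets the dot cross newlines inside the link text.
    static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*?href\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any tag, used to strip nested markup such as <b> from the link text.
    static readonly Regex TagRegex = new Regex(@"<[^>]+>");

    public static List<LinkItem> Find(string html)
    {
        var list = new List<LinkItem>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            var item = new LinkItem
            {
                Href = m.Groups[1].Value,
                Text = TagRegex.Replace(m.Groups[2].Value, "").Trim()
            };
            list.Add(item);
        }
        return list;
    }
}
```

Keeping the Regex objects in static readonly fields means the patterns are parsed once rather than on every call.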

Use Singleline mode


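Many C# developers forget to tell the Regex that a match may span several lines, so the dot stops at every newline and tags broken across lines are silently missed. The fix is RegexOptions.Singleline, which treats newlines as regular characters. MSDN states that Singleline "Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n)." Do not confuse it with RegexOptions.Multiline, which only changes the meaning of ^ and $.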
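
A small sketch shows the difference. The class SinglelineDemo and the method AnchorText are hypothetical names of mine; the method returns the text between the anchor tags, or null when the pattern does not match.

```csharp
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Returns the text between <a ...> and </a>, or null when nothing matches.
    public static string AnchorText(string html, RegexOptions options)
    {
        Match m = Regex.Match(html, "<a[^>]*>(.*?)</a>", options);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the input "<a href=\"/p\">First\nline</a>", calling AnchorText with RegexOptions.None finds no match because the dot cannot cross the newline, while RegexOptions.Singleline returns the full two-line link text.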
