
Web Scraping With C#

After reading this post, you'll learn how to create a real-life C# web scraper. Keep in mind that while the examples use C#, you'll be able to adapt this information to all languages supported by the .NET platform, including VB.NET and F#.

Let's get started!

Building a web scraper with C#

As mentioned, you'll learn how to write C# code for scraping public web pages using HTML Agility Pack. In this tutorial, we'll be employing the .NET 5 SDK with Visual Studio Code. This code has been tested with .NET Core 3 and .NET 5, and it should work with other versions of .NET.

We'll follow a hypothetical scenario: scraping a bookstore and collecting book names and prices. Let's set up the development environment before writing a C# web crawler.

Setting up the development environment

For the C# development environment, install Visual Studio Code. Note that Visual Studio and Visual Studio Code are two completely different applications, even though both can be used for writing C# code.

Once Visual Studio Code is installed, install .NET 5.0 or newer. You can also use .NET Core 3.1. After installation is complete, open the terminal and run the following command to verify that the .NET CLI (Command Line Interface) is working properly:

dotnet --version

This should print the version number of the installed .NET SDK.
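For example, with the .NET 5 SDK installed, the output will be something along these lines (the exact version number on your machine will differ):

5.0.301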

Project structure and dependencies

The code will be a part of a .NET project. To keep it simple, create a console application. Then, create a folder where you want to write the C# code. Open the terminal and navigate to that folder. Now, type in this command:

dotnet new console

This command's output should confirm that the console application has been successfully created.

Now, it's time to install the required packages. To use C# for scraping public web pages, HTML Agility Pack will be a good choice. You can install it for this project using this command:

dotnet add package HtmlAgilityPack

You should install one more package so that you can easily export the scraped data to a CSV file:

dotnet add package CsvHelper

If you're using Visual Studio instead of Visual Studio Code, click "File," select "New Solution," and choose "Console Application." To install the dependencies, follow these steps:

  • Choose "Project;"
  • Click on "Manage Project Dependencies." This will open the NuGet Packages window;
  • Search for HtmlAgilityPack and select it;
  • Finally, search for CsvHelper, choose it, and click "Add Packages."


[Image: project structure]

Now that the packages have been installed, you can move on to writing code for web scraping the bookstore.

Download and parse web pages

The first step of any web scraping program is to download the HTML of a web page. This HTML will be a string that you'll need to convert into an object that can be processed further. The latter part is called parsing. Html Agility Pack can read and parse HTML from local files, HTML strings, any URL, and even a browser.
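As a quick, minimal illustration of those other input sources (not needed for this tutorial), an HtmlDocument can also be loaded from an HTML string or a local file; the file name below is just a placeholder:

// Parse an HTML string directly
var docFromString = new HtmlDocument();
docFromString.LoadHtml("<html><body><h1>Hello</h1></body></html>");

// Parse a local file (the path is a hypothetical example)
var docFromFile = new HtmlDocument();
docFromFile.Load("page.html");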

In this case, you only need to get HTML from a URL. Instead of using .NET native functions, Html Agility Pack provides a convenient class, HtmlWeb. This class offers a Load function that takes a URL and returns an instance of the HtmlDocument class, which is also part of the package. With this information, you can write a function that takes a URL and returns an instance of HtmlDocument.

Open the Program.cs file and enter this function in the class Program:

// Parses the URL and returns HtmlDocument object
static HtmlDocument GetDocument(string url)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    return doc;
}

With this, the first step of the code is complete. The next step is to parse the document.

Parsing the HTML: getting book links

You'll extract the required information from the web page in this part of the code. At this stage, the document is an object of type HtmlDocument. This class exposes two functions to select the elements. Both functions accept XPath as input and return HtmlNode or HtmlNodeCollection. Here are the signatures of these two functions:

public HtmlNodeCollection SelectNodes(string xpath);
public HtmlNode SelectSingleNode(string xpath);

Let's discuss SelectNodes first.

For this example C# web scraper, you'll scrape all the book details from the bookstore's mystery category page (http://books.toscrape.com/catalogue/category/books/mystery_3/index.html). First, the page needs to be parsed so that all the links to the books can be extracted. To do that, open the page in the browser, right-click any book link and click "Inspect." This will open the Developer Tools.

After spending some time with the markup, your XPath to select the book links should look something like this:

//h3/a

This XPath can now be passed to the SelectNodes function.

HtmlDocument doc = GetDocument(url);
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//h3/a");

Note that the SelectNodes function is called on the DocumentNode property of the HtmlDocument.

The variable linkNodes is a collection. You can write a foreach loop over it and get the href from each link one by one. There is one tiny problem that you need to take care of: the links on the page are relative. Hence, they need to be converted into absolute URLs before you can scrape these extracted links.

For converting the relative URLs, you can make use of the Uri class. You can use this constructor to get a Uri object with an absolute URL.

Uri(Uri baseUri, string? relativeUri);

Once you have the Uri object, you can simply check the AbsoluteUri property to get the complete URL.

Write all this in a function to keep the code organized.

static List<string> GetBookLinks(string url)
{
    var bookLinks = new List<string>();
    HtmlDocument doc = GetDocument(url);
    HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//h3/a");
    var baseUri = new Uri(url);
    foreach (var link in linkNodes)
    {
        string href = link.Attributes["href"].Value;
        bookLinks.Add(new Uri(baseUri, href).AbsoluteUri);
    }
    return bookLinks;
}

You're starting with an empty List<string> object in this function. In the foreach loop, you need to add all the links to this object and return it.

Now, it's time to modify the Main()function so that you can test the C# code that you've written so far. Modify the function so that it looks like this:

static void Main(string[] args)
{
    var bookLinks = GetBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
    Console.WriteLine("Found {0} links", bookLinks.Count);
}

To run this code, open the terminal and navigate to the directory which contains this file, and type in the following:

dotnet run

The output should be as follows:

Found 20 links

Let's move to the next part, where you'll learn to process all the links to get the book data. If you have any questions before jumping to the next section, write a comment, and we'll answer them as soon as possible.

Parsing the HTML: getting book details

At this point, you should have a list of strings that contain the URLs of the books. You can simply write a loop that will get the document using the GetDocument function you've already written. After that, you'll use the SelectSingleNode function to extract the book's title and price.

To keep the data organized, let's start with a class that represents a book. It will have two properties, Title and Price, and looks like this:

public class Book
{
    public string Title { get; set; }
    public string Price { get; set; }
}

Now, open a book page in the browser and create the XPath for the title, which is simply //h1. Creating an XPath for the price is a little trickier because the additional books at the bottom of the page have the same class applied.



[Image: parsing HTML]

The XPath of the price will be this:

//div[contains(@class,"product_main")]/p[@class="price_color"]

Note that the XPath contains double quotes. In a C# string literal, you'll have to escape these characters by prefixing them with a backslash.
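For example, assigned to a C# string variable, the XPath above would look like this; alternatively, XPath also accepts single quotes, which avoids the escaping entirely:

var priceXPath = "//div[contains(@class,\"product_main\")]/p[@class=\"price_color\"]";
// Equivalent, using single quotes inside the XPath expression:
var priceXPath2 = "//div[contains(@class,'product_main')]/p[@class='price_color']";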

Now, you can use the SelectSingleNode function to get the node and then employ the InnerText property to get the text contained in the element. You can organize everything in a function as follows:

static List<Book> GetBookDetails(List<string> urls)
{
    var books = new List<Book>();
    foreach (var url in urls)
    {
        HtmlDocument document = GetDocument(url);
        var titleXPath = "//h1";
        var priceXPath = "//div[contains(@class,\"product_main\")]/p[@class=\"price_color\"]";
        var book = new Book();
        book.Title = document.DocumentNode.SelectSingleNode(titleXPath).InnerText;
        book.Price = document.DocumentNode.SelectSingleNode(priceXPath).InnerText;
        books.Add(book);
    }
    return books;
}

This function will return a list of Book objects. It's time to update the Main() function as well:

static void Main(string[] args)
{
    var bookLinks = GetBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
    Console.WriteLine("Found {0} links", bookLinks.Count);
    var books = GetBookDetails(bookLinks);
}

The final part of this web scraping project is to export the data to a CSV file.

Exporting data

If you haven't installed CsvHelper yet, you can do so by running the command dotnet add package CsvHelper from the terminal.

The export function is pretty straightforward. First, you need to create a StreamWriter and pass the CSV file name as the parameter. Next, you should use this object to create a CsvWriter. Finally, you can use the WriteRecords function to write all the books in just one line of code.

To ensure that all the resources are closed properly, you can use the using block. You can also wrap everything in a function as follows:

// Requires: using System.IO; using System.Globalization; using CsvHelper;
static void exportToCSV(List<Book> books)
{
    using (var writer = new StreamWriter("./books.csv"))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        csv.WriteRecords(books);
    }
}

Finally, you can call this function from the Main() function:

static void Main(string[] args)
{
    var bookLinks = GetBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
    var books = GetBookDetails(bookLinks);
    exportToCSV(books);
}

That's it! To run this code, open the terminal and run the following command:

dotnet run

Within seconds, you'll have a books.csv file created.
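With CsvHelper's default settings, the file starts with a header row taken from the Book properties, followed by one row per book, roughly like this (the titles and prices shown here are purely illustrative):

Title,Price
A Mystery Novel,£51.77
Another Mystery Novel,£47.82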

Summary

You can use multiple packages to write a web scraper with C#. In this post, we've shown how to employ HTML Agility Pack, a powerful and easy-to-use package. This was a simple example that can be enhanced further; for instance, you could extend this code with logic that handles multiple pages, as sketched below.
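As a starting point, here's a minimal sketch of what that multi-page logic could look like. It assumes the catalogue's pagination is rendered as a "next" link inside an element with the class next (as books.toscrape.com does at the time of writing), and it reuses the GetBookLinks and GetDocument functions from this tutorial:

// Hypothetical helper: collects book links from every page of a category
// by following the "next" pagination link until there isn't one.
static List<string> GetAllBookLinks(string firstPageUrl)
{
    var allLinks = new List<string>();
    var currentUrl = firstPageUrl;
    while (currentUrl != null)
    {
        allLinks.AddRange(GetBookLinks(currentUrl));
        HtmlDocument doc = GetDocument(currentUrl);
        // Assumes pagination markup like <li class="next"><a href="page-2.html">next</a></li>
        HtmlNode nextLink = doc.DocumentNode.SelectSingleNode("//li[@class='next']/a");
        currentUrl = nextLink == null
            ? null
            : new Uri(new Uri(currentUrl), nextLink.Attributes["href"].Value).AbsoluteUri;
    }
    return allLinks;
}

You could then pass the returned list straight into GetBookDetails in Main(). Note that this sketch fetches each page twice for clarity; you could refactor GetBookLinks to also return the next-page link if that matters for your use case.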

If you have any questions about this tutorial or any other web scraping topics, don't hesitate to comment on this post; we'll answer you right away.


Original Link: https://dev.to/oxylabs-io/web-scraping-with-c-48bd
