November 7, 2022 07:20 am GMT

Building a Web Scraper in Golang: Complete Tutorial

Ever wondered how to build a web scraper in Golang? Check out this practical tutorial.

Golang, or Go, is designed to combine the static typing and run-time efficiency of C with the usability of Python and JavaScript, adding high-performance networking and multiprocessing. It's also compiled and excels at concurrency, which makes it fast.

This article will guide you through the step-by-step process of writing a fast and efficient Golang web scraper that can extract public data from a target website.

Installing Go

To start, head over to the Go downloads page. Here you can download all of the common installers, such as the Windows MSI installer, macOS package, and Linux tarball. Go is open-source, meaning that if you wish to compile Go on your own, you can download the source code as well.

A package manager facilitates working with first-party and third-party libraries by helping you to define and download project dependencies. The manager pins down version changes, allowing you to upgrade your dependencies without fear of breaking the established infrastructure.

Installing Go on macOS

If you prefer package managers, you can use Homebrew on macOS. Open the terminal and enter the following:

brew install go

Installing Go on Windows

On Windows, you can use the Chocolatey package manager. Open the command prompt and enter the following:

choco install golang

Installing Go on Linux

Installing Go on Linux requires five simple steps:

1. Remove previous Go installations (if any) using the following command:

rm -rf /usr/local/go

2. Download the Go package for Linux; head over to the Go downloads page, or use:

wget https://go.dev/dl/go1.19.2.linux-amd64.tar.gz

3. Once the .tar.gz file is downloaded, extract the archive into the /usr/local directory:

tar -C /usr/local -xzf go1.19.2.linux-amd64.tar.gz

4. Add the Go binary directory to the PATH environment variable by adding the following line to the $HOME/.profile file or, for a system-wide installation, to the /etc/profile file:

export PATH=$PATH:/usr/local/go/bin

5. Run the source $HOME/.profile command to apply the changes made to the .profile file.

Now, you can use the go version command to verify that Go is installed.
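As a complement to the go version command, the toolchain can also be checked from inside a program. The following is a minimal sketch using only the standard library; the function name toolchainInfo is made up for this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// toolchainInfo reports the Go version that compiled this program,
// followed by the target operating system and CPU architecture.
func toolchainInfo() string {
	return fmt.Sprintf("%s %s/%s", runtime.Version(), runtime.GOOS, runtime.GOARCH)
}

func main() {
	fmt.Println(toolchainInfo())
}
```

If this program compiles and runs with go run, the installation and the PATH setting are most likely working.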

Once Go is installed, you can use any code editor or an integrated development environment (IDE) that supports Go.

How to install Golang in Visual Studio Code?

While you can use virtually any code editor to write a Go program, one of the most commonly used ones is Visual Studio Code.

For Golang to be supported, you'll need to install the Go extension. To do that, select the Extensions icon on the left side, type Go in the search bar, and click Install:



Go extension for Visual Studio Code

Once you've finished installing the Go extension, you'll need to update the Go tools.

Press Ctrl+Shift+P to open the Show All Commands window and search for Go: Install/Update tools.

Take a look at the image below to see how it looks:

Go tools for Visual Studio Code

After selecting all the available Go tools, click on the OK button to install.

We can also use a separate IDE (e.g., GoLand) to write, debug, compile, and run Go projects. Both Visual Studio Code and GoLand are available for Windows, macOS, and Linux.

Web scraping frameworks

Go offers a wide selection of frameworks. Some are simple packages with core functionality, while others, such as Ferret, Gocrawl, Soup, and Hakrawler, provide a complete web scraping infrastructure to simplify data extraction. Let's have a brief overview of these frameworks.

Ferret

Ferret is a fast, portable, and extensible framework for designing Go web scrapers. It's pretty easy to use, as the user simply needs to write a declarative query expressing which data to extract. Ferret handles HTML retrieval and parsing by itself.

Gocrawl

Gocrawl is a web scraping framework written in Go. It gives complete control to visit, inspect, and query different URLs using goquery. This framework allows concurrent execution because it uses goroutines.

Soup

Soup is a small web scraping framework that can be used to implement a Go web scraper. It provides an API for retrieving and parsing the content.

Hakrawler

Hakrawler is a simple and fast web crawler written in Go. It's a simplified version of the most popular Golang web scraping framework, GoColly. It's mainly used to extract URLs and JavaScript file locations.

GoQuery

GoQuery is a framework that provides functionality similar to jQuery in Golang. It uses two basic Go packages: net/html (a Golang HTML parser) and cascadia (a CSS selector library).

Colly

The most popular framework for writing web scrapers in Go is Colly.

Colly is a fast scraping framework that can be used to write any kind of crawler, scraper, or spider. If you want to know more about differentiating a scraper from a crawler, check this article.

Colly has a clean API, handles cookies and sessions automatically, supports caching and robots.txt, and, most importantly, it's fast. Colly offers distributed scraping, HTTP request delays, and concurrency per domain.

In this Golang Colly tutorial, we'll be using Colly to scrape books.toscrape.com. The website is a dummy bookstore for practicing web scraping.

How to import a package in Golang?

As the name suggests, the import directive imports packages into a Golang program. For example, the fmt package provides formatted I/O functions and can be imported using the import directive, as shown in the following snippet:

package main

import "fmt"

func main() {
    fmt.Println("Hello World")
}

The code above first imports the fmt package and then uses its Println function to display the Hello World text in the console.

We can also import multiple packages using a single import directive, as you can see from the example below:

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    fmt.Println("Hello World")
    fmt.Println(rand.Intn(25))
}
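An import can also be given a local alias, which helps when two packages share a name or a name is long. The sketch below is only an illustration: the alias str and the helper shout are names chosen for this example. Note that the Go compiler rejects a file that imports a package without using it.

```go
package main

import (
	"fmt"
	str "strings" // alias: the package is referred to as str in this file
)

// shout upper-cases its input through the aliased package name.
func shout(s string) string {
	return str.ToUpper(s)
}

func main() {
	fmt.Println(shout("hello world"))
}
```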

Parsing HTML with Colly

To easily extract structured data from the URLs and HTML, the first step is to create a project and install Colly.

Create a new directory and navigate there using the terminal. From this directory, run the following command:

go mod init oxylabs.io/web-scraping-with-go

This will create a go.mod file that contains the following lines with the name of the module and the version of Go. In this case, the version of Go is 1.17:

module oxylabs.io/web-scraping-with-go

go 1.17

Next, run the following command to install Colly and its dependencies:

go get github.com/gocolly/colly

This command will also update the go.mod file with all the required dependencies as well as create a go.sum file.

We are now ready to write the web scraper code file. Create a new file, save it as books.go, and enter the following code:

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    // Scraping code here
    fmt.Println("Done")
}

The first line is the name of the package. Next, there are some built-in packages being imported as well as Colly itself. Note that Go refuses to compile a file with unused imports, so this file will only build once the scraping code that uses csv, log, os, and colly is in place.

The main() function is going to be the entry point of the program. This is where we'll write the code for the web scraper.

Sending HTTP requests with Colly

The fundamental component of a Colly web scraper is the Collector. The Collector makes HTTP requests and traverses HTML pages.

The Collector exposes multiple events. We can hook custom functions that execute when these events are raised. These functions are usually anonymous and are passed as parameters.

First, to create a new Collector using default settings, enter this line in your code:

c := colly.NewCollector()

There are many other parameters that can be used to control the behavior of the Collector. In this example, we are going to limit the allowed domains. Change the line as follows:

c := colly.NewCollector(
    colly.AllowedDomains("books.toscrape.com"),
)
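To make the idea of limiting allowed domains concrete, here is a small standard-library sketch of an allow-list check. It only illustrates the concept and is not how Colly implements AllowedDomains; the function name isAllowed and the example.com URL are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/url"
)

// isAllowed reports whether the host of rawURL appears in the allow-list.
// URLs that cannot be parsed are treated as not allowed.
func isAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, domain := range allowed {
		if u.Hostname() == domain {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"books.toscrape.com"}
	fmt.Println(isAllowed("https://books.toscrape.com/catalogue/page-2.html", allowed))
	fmt.Println(isAllowed("https://example.com/", allowed))
}
```

A restriction like this keeps a crawler from wandering off to other sites when it follows links.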

Once the instance is available, the Visit() function can be called to start the scraper. However, before doing so, it's important to hook up to a few events.

The OnRequest event is raised when an HTTP request is sent to a URL. This event is used to track which URL is being visited. A simple anonymous function that prints the URL being requested looks as follows:

c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
})

Note that the anonymous function being sent as a parameter here is a callback function. It means that this function will be called when the event is raised.
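The callback mechanism can be sketched without any scraping library. The miniCollector type below is a hypothetical, stripped-down stand-in written for this illustration; it makes no HTTP requests and is not Colly's implementation. It only shows that registering a callback stores the function, and that the function runs later when the event is raised.

```go
package main

import "fmt"

// miniCollector is a hypothetical stand-in that only stores and fires callbacks.
type miniCollector struct {
	onRequest []func(url string)
}

// OnRequest registers a callback; nothing is executed at this point.
func (m *miniCollector) OnRequest(f func(url string)) {
	m.onRequest = append(m.onRequest, f)
}

// Visit raises the "request" event by calling every registered callback.
// No network traffic happens in this sketch.
func (m *miniCollector) Visit(url string) {
	for _, f := range m.onRequest {
		f(url)
	}
}

func main() {
	m := &miniCollector{}
	m.OnRequest(func(url string) {
		fmt.Println("Visiting", url)
	})
	m.Visit("https://books.toscrape.com/")
}
```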

Similarly, OnResponse can be used to examine the response. The following is one such example:

c.OnResponse(func(r *colly.Response) {
    fmt.Println(r.StatusCode)
})

The OnHTML event can be used to take action when a specific HTML element is found.

Locating HTML elements via CSS selector

The OnHTML event can be hooked using a CSS selector and a function that executes when HTML elements matching the selector are found.

For example, the following function executes when a title tag is encountered:

c.OnHTML("title", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})

This function extracts the text inside the title tag and prints it. Putting together all we have gone through so far, the main() function is as follows:

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("books.toscrape.com"),
    )

    // Print the page title whenever a <title> element is found.
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    // Print the HTTP status code of every response.
    c.OnResponse(func(r *colly.Response) {
        fmt.Println(r.StatusCode)
    })

    // Print every URL before it is requested.
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("https://books.toscrape.com/")
}

This file can be run from the terminal as follows:

go run books.go

The output will be as follows:

Visiting https://books.toscrape.com/

200

All products | Books to Scrape - Sandbox

Extracting the HTML elements

Now that we know how Colly works, let's modify OnHTML to extract the book titles and prices.

The first step is to understand the HTML structure of the page.

The books are in the <article> tags

Each book is contained in an article tag that has a product_pod class. The CSS selector would be .product_pod.

Next, the complete book title is found in the thumbnail image as an alt attribute value. The CSS selector for the book title would be .image_container img.

Finally, the CSS selector for the book price would be .price_color.

The OnHTML can be modified as follows:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
    title := e.ChildAttr(".image_container img", "alt")
    price := e.ChildText(".price_color")
    // Go does not compile unused variables, so print them for now.
    fmt.Println(title, price)
})

This function will execute every time a book is found on the page.

Note the use of the ChildAttr function, which takes two parameters: the CSS selector and the name of the attribute. Keeping the values in loose variables isn't very tidy, though. A better idea would be to create a data structure to hold this information. In this case, we can use a struct as follows:

type Book struct {
    Title string
    Price string
}

The OnHTML will be modified as follows:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
    book := Book{}
    book.Title = e.ChildAttr(".image_container img", "alt")
    book.Price = e.ChildText(".price_color")
    fmt.Println(book.Title, book.Price)
})

For now, this web scraper is simply printing the information to the console, which isn't particularly useful. We'll revisit this function when it's time to save the data to a CSV file.

Handling pagination

First, we need to locate the next button and create a CSS selector. For this particular site, the CSS selector is .next > a. Using the selector, a new function can be added to the OnHTML event. In this function, we'll convert a relative URL to an absolute URL. Then, we'll call the Visit() function to crawl the converted URL:

c.OnHTML(".next > a", func(e *colly.HTMLElement) {
    nextPage := e.Request.AbsoluteURL(e.Attr("href"))
    c.Visit(nextPage)
})

The existing function that scrapes the book information will be called on all of the resulting pages as well. No additional code is needed.
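The relative-to-absolute conversion described above can be tried on its own with the standard library's net/url package. This is a minimal sketch of the idea, not Colly's AbsoluteURL implementation; the helper name resolve and the sample href values are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns href, as found on the page at pageURL, into an absolute URL.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the standard URL resolution rules (RFC 3986).
	return base.ResolveReference(ref).String(), nil
}

func main() {
	next, err := resolve("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next)
}
```

The relative link is resolved against the page it was found on, which is why the same href can point to different absolute URLs on different pages.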

Now that we have the data from all of the pages, it's time to save it to a CSV file.

Writing data to a CSV file

The built-in CSV library can be used to save the structure to CSV files. If you want to save the data in JSON format, you can use the JSON library as well.

To create a new CSV file, enter the following code before creating the Colly collector:

file, err := os.Create("export.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

This will create export.csv and defer closing the file until the surrounding function returns.

Next, add these two lines to create a CSV writer:

writer := csv.NewWriter(file)
defer writer.Flush()

Now, it's time to write the headers:

headers := []string{"Title", "Price"}
writer.Write(headers)

Finally, modify the OnHTML function to write each book as a single row:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
    book := Book{}
    book.Title = e.ChildAttr(".image_container img", "alt")
    book.Price = e.ChildText(".price_color")
    row := []string{book.Title, book.Price}
    writer.Write(row)
})
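The CSV steps above can also be tried in isolation, without a file or a scraper. The sketch below writes to an in-memory buffer and checks the errors that the snippets above ignore for brevity. The helper name booksToCSV and the sample book are assumptions made for this example.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// Book holds the two fields collected for each book.
type Book struct {
	Title string
	Price string
}

// booksToCSV renders a header row plus one row per book and returns the CSV text.
func booksToCSV(books []Book) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"Title", "Price"}); err != nil {
		return "", err
	}
	for _, b := range books {
		if err := w.Write([]string{b.Title, b.Price}); err != nil {
			return "", err
		}
	}
	w.Flush()
	// Flush returns nothing; Error reports any failure from Write or Flush.
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Sample data used only for this illustration.
	out, err := booksToCSV([]Book{{Title: "A Light in the Attic", Price: "£51.77"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```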

That's all! The code for the Golang web scraper is now complete.

Run the file by entering the following in the terminal:

go run books.go

This will create an export.csv file with 1,000 rows of data.
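For the JSON alternative mentioned at the start of this section, a minimal sketch with the standard encoding/json package could look like the following. The struct tags, the helper name booksToJSON, and the sample book are assumptions added for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Book mirrors the struct used for CSV export, with JSON field names set by tags.
type Book struct {
	Title string `json:"title"`
	Price string `json:"price"`
}

// booksToJSON encodes the books as a JSON array.
func booksToJSON(books []Book) (string, error) {
	data, err := json.Marshal(books)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	// Sample data used only for this illustration.
	out, err := booksToJSON([]Book{{Title: "A Light in the Attic", Price: "£51.77"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

In a scraper, the books would first be appended to a slice inside the OnHTML callback and encoded once after the crawl finishes.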

Scheduling tasks with GoCron

For some tasks, you might want to schedule a web scraper to extract data periodically or at a specific time. You can do that by using your OS's schedulers or a high-level scheduling package usually available with the language you're using.

To schedule a Go scraper, you can use OS tools like Cron or Windows Task Scheduler. Alternatively, you can use GoCron, a high-level task scheduling package available for Golang. It's essential to keep in mind that scheduling a scraper through OS-provided schedulers limits the portability of the code. However, the GoCron task scheduler package solves this problem and works well with almost all operating systems.

GoCron is a task scheduling package available in Golang for running specific code at a particular time. It offers functionality similar to Python's job scheduling module named schedule.

Scheduling a task with GoCron requires installing the package, which you can do with the following command:

go get github.com/go-co-op/gocron

The next step is to write a GoCron script to schedule our code. Let's look at the following code example to understand how the GoCron scheduler works:

package main

import (
    "fmt"
    "time"

    "github.com/go-co-op/gocron"
)

func My_Task_1() {
    fmt.Println("Hello Task 1")
}

func main() {
    my_scheduler := gocron.NewScheduler(time.UTC)
    my_scheduler.Every(5).Seconds().Do(My_Task_1)
    // Both start modes are shown for illustration; one of them is enough.
    my_scheduler.StartAsync()
    my_scheduler.StartBlocking()
}

The code above schedules the My_Task_1 function to run every 5 seconds. Moreover, we can start the GoCron scheduler in two modes: asynchronous mode and blocking mode.

StartAsync() will start the scheduler asynchronously, while the StartBlocking() method will start the scheduler in blocking mode by blocking the current execution path.

Side note: The above code example starts the GoCron scheduler in both the asynchronous and the blocking modes. However, we can choose either of these as per our requirements.

Let's schedule our Golang web scraper code example using the GoCron scheduling module.

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "time"

    "github.com/go-co-op/gocron"
    "github.com/gocolly/colly"
)

type Book struct {
    Title string
    Price string
}

func BooksScraper() {
    fmt.Println("Start scraping")

    // Create the output file; it is overwritten on every scheduled run.
    file, err := os.Create("export.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    headers := []string{"Title", "Price"}
    writer.Write(headers)

    c := colly.NewCollector(
        colly.AllowedDomains("books.toscrape.com"),
    )

    // Write one CSV row per book found on the page.
    c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
        book := Book{}
        book.Title = e.ChildAttr(".image_container img", "alt")
        book.Price = e.ChildText(".price_color")
        row := []string{book.Title, book.Price}
        writer.Write(row)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println(r.StatusCode)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("https://books.toscrape.com/")
}

func main() {
    my_scheduler := gocron.NewScheduler(time.UTC)
    my_scheduler.Every(2).Minute().Do(BooksScraper)
    my_scheduler.StartBlocking()
}
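For comparison, a simple periodic job can also be sketched with only the standard library's time.Ticker. This is a different technique from GoCron and does not use its API; the names runEvery and demoInterval are made up for this illustration, and the loop stops after a fixed number of runs so that the example terminates.

```go
package main

import (
	"fmt"
	"time"
)

// demoInterval is kept short so that the example finishes quickly.
const demoInterval = 10 * time.Millisecond

// runEvery calls task once per interval until it has run the given number of times.
func runEvery(interval time.Duration, times int, task func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < times; i++ {
		<-ticker.C // block until the next tick
		task()
	}
}

func main() {
	runEvery(demoInterval, 3, func() {
		fmt.Println("the scraper would run here")
	})
}
```

A ticker covers fixed intervals only; for calendar-style schedules such as a specific time of day, a scheduling package or an OS scheduler is usually the more convenient choice.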

Summary

The code used in this article ran in less than 12 seconds. Executing the same task in Scrapy, which is one of the most optimized modern frameworks for Python, took 22 seconds. If speed is what you prioritize for your web scraping tasks, it's a good idea to consider Golang in tandem with a modern framework such as Colly. You can click here to find the complete code used in this article for your convenience.


Original Link: https://dev.to/oxylabs-io/building-a-web-scraper-in-golang-complete-tutorial-34if
