Web Scraping using Python! Create your own Dataset

Machine Learning requires a lot of data, and it is not always easy to get the data you want. Have you ever wondered how Kaggle and similar websites provide us with huge datasets? The answer is web scraping. So, let us see how we can extract data from the web.
Let's assume we are building a model that requires movie information such as the title, summary, and rating of a number of movies. When it comes to movies, we know IMDB has the largest database. Let us dig into it.

What exactly do we do to scrape a webpage?

There's a pattern in everything. We need to observe and find a pattern in the HTML code of the web page to extract the relevant data. Let's go step by step. We will be doing everything using Python and scrape the data from the following URL:
https://www.imdb.com/search/title?release_date=2019&sort=user_rating,desc&ref_=adv_nxt

1. Install dependencies

# To download the webpage
pip install requests

# To scrape data from the downloaded webpage
pip install beautifulsoup4
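The snippets later in this article create the BeautifulSoup object with the 'lxml' parser, which is a separate package. If you follow along with the same parser, you may also need to install it (BeautifulSoup's built-in 'html.parser' works without this extra install):

# Optional: only needed if you use the 'lxml' parser as in the snippets below
pip install lxml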

2. Download the webpage

Requests is a great HTTP library for making request calls. We will use it to download the webpage at the given URL.

import requests

url = "https://www.imdb.com/search/title?release_date=2019&sort=user_rating,desc&ref_=adv_nxt"

# get() method downloads the entire HTML of the provided url
response = requests.get(url)

# Get the text from the response object
response_text = response.text
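One thing worth checking before parsing is whether the request actually succeeded. Some sites also respond differently to the default requests User-Agent, so sending a browser-like one can help; this is a minimal sketch, and the header value below is an assumption, not something from the original article:

import requests

url = "https://www.imdb.com/search/title?release_date=2019&sort=user_rating,desc&ref_=adv_nxt"

# Hypothetical browser-like User-Agent string; adjust as needed
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers)

# raise_for_status() raises an HTTPError for 4xx/5xx responses
response.raise_for_status()
print(response.status_code)  # 200 if the download succeeded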

3. Inspecting elements and finding the pattern

The data we have downloaded is exactly what you see when you right-click and choose Inspect Element in the browser. Let's right-click on the rating and see how we can extract it.

[Image: medium1.png — inspecting the rating element in the browser]

When we look closely, we will see that the class ratings-bar contains the rating of the movie. If we inspect other movies, we will find that all the movies on that page use the same class name for their ratings. Here, we have found a pattern to extract all the ratings from the page. Similarly, we can extract the summary, title, genre, etc.

You can select specific parts of the HTML code not only by class but also by id, tags, and so on.

Let's jump into the code!

BeautifulSoup allows us to extract data (more precisely, parse data) from HTML using class names, ids, tags, etc. Isn't it beautiful? :-D

from bs4 import BeautifulSoup

# Create a BeautifulSoup object
# response_text -> the downloaded webpage
# lxml -> parser used for processing HTML and XML pages
soup = BeautifulSoup(response_text, 'lxml')

To select content from the page we use CSS selectors. CSS selectors allow us to select different classes, ids, tags, and other HTML elements easily. The CSS selector for a class is "." and for an id is "#". To select a class we prefix "." to the class name we want to extract and, similarly, for an id we prefix "#".
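As a quick illustration of these selectors with BeautifulSoup's select() method (the class and id names here are made up, not taken from the IMDB page):

from bs4 import BeautifulSoup

html = '<div id="main"><p class="note">hello</p><p>world</p></div>'
soup_demo = BeautifulSoup(html, 'lxml')

print(soup_demo.select(".note"))   # by class -> [<p class="note">hello</p>]
print(soup_demo.select("#main"))   # by id    -> [<div id="main">...</div>]
print(soup_demo.select("p"))       # by tag   -> both <p> elements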

# As we saw, the rating's class name was "ratings-bar"
# We prefix "." since it's a class
rating_class_selector = ".ratings-bar"

# Extract all the elements with the ratings class
rating_list = soup.select(rating_class_selector)

This rating_list is a list of objects containing all the <div> elements with ratings-bar as their class name. We need to get the text from within each div element.

Here's what a single rating object looks like:

<div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" data-value="10" name="ir">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>10.0</strong>
    </div>
    ...
</div>

We need to get the rating value from the <strong> tag. We can extract a tag using the find(tag_name) method and get its text using getText().

# This list will store all the ratings
ratings = []

# Iterate through all the rating objects
for rating_object in rating_list:
    # Find the <strong> tag and get its text
    rating_text = rating_object.find('strong').getText()
    # Append the rating to the list
    ratings.append(rating_text)

print(ratings)

And we are done. Similarly, you can extract titles, summaries, and genres using the above method with the appropriate class and tag names.
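For example, a sketch of the same select/find pattern for titles could look like the following. The class name ".lister-item-header" is my reading of IMDB's search-page markup at the time and may have changed, so treat it as an assumption:

# Sketch: extract movie titles with the same select/find pattern
# ".lister-item-header" is an assumed IMDB class name and may differ
title_list = soup.select(".lister-item-header")

titles = []
for title_object in title_list:
    # The title text sits inside an <a> tag within the header element
    titles.append(title_object.find('a').getText())

print(titles)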

You can store the data in a CSV or Excel file and use it for your Machine Learning model.
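A minimal sketch of writing the scraped lists to a CSV file with Python's built-in csv module, assuming the titles and ratings lists were collected as above and align by index:

import csv

# Assumes titles and ratings have the same length and align by index
with open("movies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "rating"])
    for title, rating in zip(titles, ratings):
        writer.writerow([title, rating])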

The full code is available on my GitHub:

https://github.com/prashant2018/Medium-Article-Code-Snippets/tree/master/Web-Scraping-Using-Python

Follow me on Twitter:

https://twitter.com/prash2018

