October 15, 2020 05:20 pm GMT

Build A Web Crawler To Check for Broken Links with Python & BeautifulSoup

In this article, I am going to show you how you can build a simple web crawler with Python and BeautifulSoup that checks for broken links.

Prerequisites

Before we start building our application, we need the following tools installed on our device:

  • Python 3. If you haven't installed it yet, download and install it from the official website.
  • An IDE. You are free to use any IDE/text editor that is available out there. I am going to use PyCharm. If you want to download the free version, make sure you download and install the Community Edition.
  • BeautifulSoup. We need to download and install BeautifulSoup using pip. In your command line (or terminal) you can run the following command: pip install beautifulsoup4
  • requests. This is the last library we need to install. You can also install it by entering this command: pip install requests. See the quick check right after this list to confirm both installs.
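
If you want to be sure both libraries installed correctly, this quick import check (a small sketch of my own, not part of the tutorial's script) will confirm it:

# Quick sanity check: both imports should succeed without errors
import bs4
import requests

print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)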

What is BeautifulSoup?

Beautiful Soup is a library written in Python that extracts data out of HTML and XML files. It works well if you want to get data quickly and saves programmers a lot of time.
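
As a tiny illustration (a minimal sketch, separate from the script we build below), here is Beautiful Soup pulling text and a link out of a snippet of HTML:

from bs4 import BeautifulSoup

# Parse a small HTML snippet with the parser from Python's standard library
html = '<p>Read the <a href="https://example.com/docs">docs</a></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.text)         # Read the docs
print(soup.a.get("href"))  # https://example.com/docs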

Writing our script

The first thing we need to do is create a script. Create an empty file in your IDE and name it verify_response_code.py

The second thing we need to do is to import BeautifulSoup from bs4 (the library we installed in our prerequisites). We also need to import the library requests. Our code looks like this:

from bs4 import BeautifulSoup
import requests

Next, we create a variable named url that stores a prompt asking for the URL we want to retrieve the links from. Our code looks like this:

url = input("Enter your url: ")
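
One caveat: requests needs a full URL including the scheme (http:// or https://), so entering plain example.com will raise a MissingSchema error. As an optional refinement of my own, not part of the original script, you could normalize the input:

# Hypothetical refinement: prepend a scheme if the user omitted it
if not url.startswith(("http://", "https://")):
    url = "https://" + url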

Afterward, we create a variable in which we use the requests library. We call its get method to actually fetch the URL we entered.

page = requests.get(url)
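
Note that requests.get raises an exception if the host is unreachable, and by default it waits indefinitely for a response. If you want the script to fail gracefully, a variation like this (my own sketch, not part of the original tutorial) would replace the plain call above:

# Sketch: fetch with a timeout and catch network-level failures
try:
    page = requests.get(url, timeout=10)
except requests.exceptions.RequestException as e:
    print(f"Could not reach {url}: {e}")
    raise SystemExit(1)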

We now have our URL. Next, we want to retrieve the response code. If our site is available, we get the response code 200. If it isn't available, we get the response code 404. We take the page variable from before and convert its status code to a string using the str() method. Our code looks like this:

response_code = str(page.status_code)
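
With the response code in hand, you could also have the script spell out what it means. This branch is a small optional sketch of my own, not part of the original code:

# Optional sketch: translate the status code into a human-readable verdict
if page.status_code == 200:
    print("Page is reachable")
elif page.status_code == 404:
    print("Page not found")
else:
    print(f"Unexpected status: {page.status_code}")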

Furthermore, our application needs the content of the page itself. To do that, we create a variable called data that stores the text of the page as a string.

data = page.text

The last variable we have to add is soup. We assign it a BeautifulSoup object, passing the data variable as the first argument and "html.parser" as the parser. We do this so we can use the built-in methods of BeautifulSoup.

soup = BeautifulSoup(data, "html.parser")

The last step in our web crawler is adding a for-loop. We use the method find_all with the argument 'a', which finds all a elements on our webpage. After that, we print each URL. We call get on each link to read its href attribute, so that we print only the URL. Next to it, we put our response code. Our code now looks like this:

for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")

If it is correct, your whole code should look like this:

# Import libraries
from bs4 import BeautifulSoup
import requests

# Prompt user to enter the URL
url = input("Enter your url: ")

# Make a request to get the URL
page = requests.get(url)

# Get the response code of the given URL
response_code = str(page.status_code)

# Get the text of the given URL as a string
data = page.text

# Create a BeautifulSoup object so we can use its built-in methods
soup = BeautifulSoup(data, "html.parser")

# Iterate over all links on the given URL with the response code next to each
for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")
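
One thing worth knowing before you run it: response_code here is the status of the page you entered, so every link is printed with that same code. To truly check each link for breakage, you would need to request every href individually. Here is a hedged sketch of that extension (my own variation on the script above, using urljoin to resolve relative links):

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

url = input("Enter your url: ")
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

for link in soup.find_all('a'):
    href = link.get('href')
    if not href or href.startswith(('mailto:', 'javascript:', '#')):
        continue  # Skip anchors and non-HTTP links
    full_url = urljoin(url, href)  # Resolve relative links against the page URL
    try:
        # HEAD is lighter than GET; some servers reject it, in which case
        # requests.get would be a heavier but safer alternative
        status = requests.head(full_url, allow_redirects=True, timeout=10).status_code
    except requests.exceptions.RequestException:
        status = "unreachable"
    print(f"Url: {full_url} | Status Code: {status}")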

Now run the script by typing python verify_response_code.py in your terminal. You are asked to enter a URL. Enter the URL and press Enter. If everything goes well, you should see output like the example below.

[Screenshot: the output, listing each URL with its response code]

That's it! Our small web crawler is done. I hope this article was helpful. If you want to check out more content on my blog, join the newsletter.

Happy coding!

If you want to know more tips about programming, feel free to check out my blog.
Also, feel free to check out my YouTube channel for more tutorials!


Original Link: https://dev.to/arvindmehairjan/build-a-web-crawler-to-check-for-broken-links-with-python-beautifulsoup-39mg
