Build A Web Crawler To Check for Broken Links with Python & BeautifulSoup
In this article, I am going to show you how to build a simple web crawler with Python and BeautifulSoup that checks for broken links.
Prerequisites
Before we make our application, we need the following tools installed on our device:
- Python 3. If you haven't installed it yet, download and install it from the official website.
- An IDE. You are free to use any IDE/text editor that is available out there. I am going to use PyCharm. If you want to use the free version, make sure you download and install the Community Edition.
- BeautifulSoup. We need to install BeautifulSoup using pip. In your command line (or terminal) you can run the following command:
pip install beautifulsoup4
- requests. This is the last library we need to install. You can install it by entering this command:
pip install requests
What is BeautifulSoup?
Beautiful Soup is a library written in Python that extracts data out of HTML and XML files. It works well if you want to get data quickly and saves programmers a lot of time.
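As a quick illustration of that, here is a tiny self-contained example (the HTML snippet is made up for the demo) showing BeautifulSoup pulling text and an attribute out of markup in a couple of lines:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to demonstrate extraction
html = "<html><body><h1>Hello</h1><a href='https://example.com'>Example</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)    # the heading text: Hello
print(soup.a["href"])  # the link target: https://example.com
```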
Writing our script
The first thing we need to do is create a script. Create an empty file in your IDE and give it the name verify_response_code.py
The second thing we need to do is import BeautifulSoup from bs4 (the library we installed in our prerequisites). We also need to import the requests library. Our code looks like this:
from bs4 import BeautifulSoup, SoupStrainer
import requests
Next, we create a variable with the name url that prompts the user to enter the URL we want to retrieve the links from. Our code looks like this:
url = input("Enter your url: ")
Afterward, we create a variable in which we use the requests library. We call its get method to actually fetch the URL we entered.
page = requests.get(url)
We now have our URL. Next, we want to retrieve the response code. If our site is available, we get the response code 200. If it isn't available, we get an error code such as 404. We are going to take the page variable from before and convert its status code to a string using the str method. Our code looks like this:
response_code = str(page.status_code)
Furthermore, our application needs the content of the page itself. To do that we create a variable called data that holds the HTML of the page as a string.
data = page.text
The last variable we have to add is soup. We assign it the result of BeautifulSoup, using the data variable as the argument. We do this so we can use the built-in methods of BeautifulSoup.
soup = BeautifulSoup(data, "html.parser")
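One small note: when no parser is given, BeautifulSoup guesses one and emits a warning, so it is worth naming one explicitly. A minimal sketch using the stdlib "html.parser" backend (the markup here is made up for the demo):

```python
from bs4 import BeautifulSoup

html = "<p>one</p><p>two</p>"  # made-up markup for the demo

# Naming the parser explicitly avoids bs4's GuessedAtParserWarning
# and keeps results consistent across machines
soup = BeautifulSoup(html, "html.parser")
print([p.text for p in soup.find_all("p")])  # → ['one', 'two']
```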
The last step in our web crawler is adding a for loop. We use the method find_all with the argument 'a', which finds all a elements on our webpage. Then we print each URL: we use the get method to read the href attribute of each a element, so we get only the URL. Next to it, we print our response code. Our code now looks like this:
for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")
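To see what find_all('a') and get('href') actually return, here is a self-contained example on a made-up snippet (no network needed):

```python
from bs4 import BeautifulSoup

# Made-up snippet with three anchors, one of which has no href at all
html = """
<a href="https://example.com/one">One</a>
<a>No target</a>
<a href="/relative/page">Two</a>
"""
soup = BeautifulSoup(html, "html.parser")

# get() returns None for the anchor without an href
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)
```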
If it is correct, your whole code should look like this:
# Import libraries
from bs4 import BeautifulSoup, SoupStrainer
import requests

# Prompt user to enter the URL
url = input("Enter your url: ")

# Make a request to get the URL
page = requests.get(url)

# Get the response code of the given URL
response_code = str(page.status_code)

# Get the text of the page as a string
data = page.text

# Parse the page so we can use BeautifulSoup's built-in methods
soup = BeautifulSoup(data, "html.parser")

# Iterate over all links on the given URL with the response code next to it
for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")
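One caveat: as written, the loop prints the status code of the starting page next to every link, not the status of each link itself. To actually flag broken links you would request each href individually. A minimal sketch of that idea (the helper name check_links and the injected fetch_status callable are my own, not from the article):

```python
from bs4 import BeautifulSoup

def check_links(html, fetch_status):
    """Return (href, status) pairs for every absolute link in the HTML.

    fetch_status is any callable mapping a URL to an HTTP status code;
    in real use you would pass something backed by requests.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a"):
        href = link.get("href")
        if href and href.startswith("http"):
            results.append((href, fetch_status(href)))
    return results

# Example with a stubbed fetcher instead of real network requests
html = '<a href="https://example.com/ok">ok</a><a href="https://example.com/gone">gone</a>'
statuses = {"https://example.com/ok": 200, "https://example.com/gone": 404}
for url, code in check_links(html, statuses.get):
    print(f"Url: {url} | Status Code: {code}")
```

In real use you would pass, for example, lambda u: requests.get(u, timeout=10).status_code as the fetcher, so every link gets its own response code.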
Now run the script by typing python verify_response_code.py
in your terminal. You are asked to enter a URL. Enter the URL and press Enter. If everything goes well, you should see each link on the page printed with the status code next to it.
That's it! Our small web crawler is done. I hope this article was helpful. If you want to check out more content on my blog, join the newsletter.
Happy coding!
If you want to know more tips about programming, feel free to check out my blog.
Also, feel free to check out my YouTube channel for more tutorials!
Original Link: https://dev.to/arvindmehairjan/build-a-web-crawler-to-check-for-broken-links-with-python-beautifulsoup-39mg