September 7, 2022 07:23 pm GMT

Scraping with Python

I made a tool based on the beautifulsoup package to scrape web pages in a Pythonic way.

Disclaimer

Scraping websites without authorization is illegal.

Demo

I made a quick and dirty scraper with the library. You can call a URL and get all links or image URLs, for example:

python3 pextractor.py -u https://mytarget.com -e links -o links.txt
python3 pextractor.py -u https://mytarget.com -e img -o images.txt

jmau111 / pextractor.py

Quick and dirty page scraper.

pextractor.py

Simple parser to scrape HTML content (on a single page).

Warning

Scraping websites without authorization is illegal. Whether you use this tool or not, don't do it.

How to use

git clone https://github.com/jmau111/pextractor.py pextractor
cd pextractor
pip install -r requirements.txt
python3 pextractor.py -u https://mytarget.com -e links -o results.txt

Prerequisites

pip install -r requirements.txt

Entities

Entities are specific tags or elements you want to retrieve:

  • links: hrefs in the page (<a> tags)
  • generator: the meta generator that is added by some frameworks and CMS
  • comments: HTML comments
  • img: HTML image URLs
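
As a rough illustration of how these entities could map onto BeautifulSoup calls, here is a minimal sketch. It is not the tool's actual code: the helper name `extract_entity` and the sample HTML are assumptions made for this example, and it parses a string rather than fetching a URL.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page, used instead of a live request.
SAMPLE_HTML = """
<html>
  <head><meta name="generator" content="WordPress 6.0"></head>
  <body>
    <!-- build 42 -->
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No href here</a>
    <img src="/logo.png" alt="Logo">
  </body>
</html>
"""


def extract_entity(html, entity="links"):
    """Return a list of strings for the requested entity (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    if entity == "links":
        # href=True skips <a> tags that have no href attribute.
        return [a["href"] for a in soup.find_all("a", href=True)]
    if entity == "img":
        return [img["src"] for img in soup.find_all("img", src=True)]
    if entity == "comments":
        # HTML comments are exposed as Comment objects in the parse tree.
        found = soup.find_all(string=lambda s: isinstance(s, Comment))
        return [c.strip() for c in found]
    if entity == "generator":
        # "name" clashes with find()'s first parameter, so pass it via attrs.
        meta = soup.find("meta", attrs={"name": "generator"})
        return [meta["content"]] if meta and meta.has_attr("content") else []
    raise ValueError(f"unknown entity: {entity}")


if __name__ == "__main__":
    for name in ("links", "img", "comments", "generator"):
        print(name, extract_entity(SAMPLE_HTML, name))
```

Each branch returns a plain list of strings, which would make it straightforward to print the results or write them to an output file.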

Options

  • -u: the URL to target.
  • -e: the entity to extract; "links" is the default. See Entities for the whole list.
  • -o: outfile. The path to the file where you'd like to save the results (optional)
  • -nc: stands for "no check". Will skip the public IP check.
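
For readers who want to build something similar, the flags above could be declared with the standard `argparse` module roughly as follows. This is a sketch under assumptions: only the flag names and the "links" default come from the list above, while the attribute names (`url`, `entity`, `outfile`, `no_check`), the help strings, and making `-u` required are guesses, and the real tool may parse its arguments differently.

```python
import argparse

# Entity names taken from the Entities list.
ENTITIES = ("links", "generator", "comments", "img")


def build_parser():
    """Build a hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Sketch of a page scraper CLI")
    # Assumption: the URL is mandatory.
    parser.add_argument("-u", dest="url", required=True, help="the URL to target")
    parser.add_argument("-e", dest="entity", choices=ENTITIES, default="links",
                        help="entity to extract (default: links)")
    parser.add_argument("-o", dest="outfile", default=None,
                        help="optional path of the file to save results to")
    parser.add_argument("-nc", dest="no_check", action="store_true",
                        help="skip the public IP check")
    return parser


if __name__ == "__main__":
    demo = build_parser().parse_args(["-u", "https://mytarget.com", "-e", "img"])
    print(demo.url, demo.entity, demo.outfile, demo.no_check)
```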

IP check

The tool

Why?

I wanted to test how "quick" it would be to build my own scraper in Python. I'm not surprised by how easy it was, as the ecosystem is extremely powerful, but the possibilities for reconnaissance, brute-force attacks, and footprinting are amazing.

It's very accurate, and you can target specific elements in the whole HTML tree, and even specific attributes:

python3 pextractor.py -u https://mytarget.com -e generator

The above command gets the meta generator if it finds it.

More advanced usages

There are many more options like prettify, which is helpful when you don't know what you're looking for, or interesting helpers like the one that extracts the raw text. You can also scrape entire websites recursively if you combine it with other packages.
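
To make the two helpers mentioned here concrete, the following sketch calls BeautifulSoup's `prettify()` and `get_text()` on a small, made-up HTML snippet. The snippet and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; a real page would come from an HTTP response.
html = "<div><h1>Title</h1><p>Some <b>bold</b> text.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# prettify() re-indents the tree, one tag or string per line,
# which helps when exploring an unfamiliar page.
pretty = soup.prettify()

# get_text() drops the tags; strip=True trims each text fragment
# and separator joins the fragments with a single space.
raw_text = soup.get_text(separator=" ", strip=True)

print(pretty)
print(raw_text)
```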

Types of scrapers

Here are a few examples:

  • Spiders
  • HTML parsers ^^
  • Screenscrapers (e.g., PhantomJS)
  • Human copycat (plagiarism)

How to protect against unwanted scraping

As you can see, a scraper is relatively easy to set up. In fact, various scripts rely on the library.

Preventing scraping is not an easy task. Many techniques involve special cookies or JavaScript-based solutions that display content conditionally, but they can create more problems than they solve.

Likewise, obfuscating data and using captchas (or honeypots) is not necessarily the best approach and can harm accessibility and user experience.

What you can do is rate-limit requests and keep logs of users' requests (authentication is not required). This way, you may "identify" scrapers or, at least, make their lives significantly harder. If they reach the rate limit, they'll get a 429 error (Too Many Requests).
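
One possible shape for such a rate limit is a sliding window per client, sketched below with the standard library only. The class name, the limits, and the use of an IP address as the client identifier are all assumptions for this example; a production setup would more likely rely on the web server, a reverse proxy, or a framework middleware.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests=5, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def check(self, client_id):
        """Return the HTTP status a server could send: 200 or 429."""
        now = self.clock()
        recent = self.hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return 429  # Too Many Requests
        recent.append(now)
        return 200


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60.0)
    for attempt in range(1, 5):
        print(attempt, limiter.check("203.0.113.7"))
```

The `hits` dictionary doubles as a very small request log: clients that keep hitting the limit stand out when you inspect it.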

You can also target specific IP ranges to block them more effectively and stop those who use known bypasses like IP rotation.
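
Checking whether a client falls inside a blocked range can be done with Python's standard `ipaddress` module. In this sketch, the function name `is_blocked` and the ranges are placeholders (they are reserved documentation networks, not real offenders).

```python
import ipaddress

# Hypothetical blocklist using documentation-only ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(ip):
    """Return True if the address belongs to any blocked network."""
    addr = ipaddress.ip_address(ip)
    # Membership is simply False when the IP versions differ.
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("203.0.113.42", "198.51.100.1", "2001:db8::1"):
        print(candidate, is_blocked(candidate))
```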

There are ways to catch automated traffic (bots), especially with some metrics like high volumes. If your adversaries are stronger than usual, you can try to block the traffic by User-Agent (or empty User-Agent) and more complex patterns.
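
A User-Agent filter along these lines might look like the sketch below. The pattern list and the `should_block` helper are assumptions for illustration, and headers are modeled as a plain dictionary with the exact key "User-Agent". Keep in mind that this header is trivial to spoof, so it only filters out the least careful clients.

```python
import re

# Hypothetical pattern of tool names often seen in default User-Agent strings.
SUSPICIOUS_UA = re.compile(
    r"(python-requests|curl|wget|scrapy|phantomjs)", re.IGNORECASE
)


def should_block(headers):
    """Return True for an empty, missing, or suspicious User-Agent."""
    user_agent = headers.get("User-Agent", "").strip()
    if not user_agent:
        return True
    return bool(SUSPICIOUS_UA.search(user_agent))


if __name__ == "__main__":
    print(should_block({}))
    print(should_block({"User-Agent": "python-requests/2.28.1"}))
    print(should_block({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}))
```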

You may also change your HTML structure from time to time to see whether the scrapers that rely on it keep working.

Last but not least, I would recommend being extra careful with your APIs and other JSON/XML feeds. While they are handy for many purposes, they can also feed scrapers if they're too open or misconfigured.

Wrap up

Web scraping is not evil, but black hats and other malicious actors use it too. Active monitoring is recommended.


Original Link: https://dev.to/jmau111/scraping-with-python-1c3n
