Two Quick Hacks for Web Scraping Pages With Dynamic CSS Class Names

Even with great tools, one of the challenges of crawling and extracting data from pages at large scale is that you don't really know what structure a page is going to have before you get to it.

Here are a couple of quick hacks for scraping data from pages that have dynamic CSS class names:

1. Don't Use Rule-Based Extraction

Rule-based extraction is fine for small-scale scraping, one-off scripts to grab some data, and sites that don't routinely change. But these days, a site with data of any value that isn't dynamic to some degree is relatively rare.

Additionally, writing and maintaining extraction rules for a given domain doesn't scale to multiple domains. Simply ensuring regularly updated web data from a small group of domains routinely requires a whole team to manage the process. And the process still breaks down. Trust me, we hear this a ton in conversations with current or potential clients.

So you have a few choices for following this tip, or at least for avoiding what it's meant to avoid: unscalable or regularly broken scrapers.

The first is that you can build your own form of extraction that isn't centered on rules. There are more free training data sets out there than ever before, out-of-the-box NLP from a handful of providers keeps improving, and, particularly if you want to focus on a small set of domains, you may be able to pull this off.

Second, you can reach out to the small handful of providers who truly offer rule-less web extraction. If you want to extract from a wide range of sites, your sites change regularly, or you're seeking a variety of document types, this is likely the way to go.

Third, you can stick to gathering public web data from particularly well-known sites. At the end of the day, this may simply mean paying someone else to maintain rule-based extractors for you. But, for example, there's a veritable cottage industry around scraping very specific sites like social media networks, whose whole business is to provide up-to-date extractors for things like lists of members of a given Facebook group.

2. If You Have to Use Rule-Based Extraction, Try Out These Advanced Selectors

If you truly can't find a way to extract what you need with one of the options above, there are a few ways you can at least partially future-proof your scraping of dynamic content.

For what it's worth, this type of rule-based selector extraction is how most major extraction services work (like Import.io, plugin web extractors, and Octoparse), and it's also what you're doing if you're rolling your own extractor with something like Selenium or BeautifulSoup.

Now, there are a few scenarios where these selectors become useful. Typically, if a site is well structured, class and ID names make sense, and you have classed elements inside of classed elements, you're fine without these techniques. But if you've spent any time web scraping, don't tell me you haven't occasionally run into a few of these:

<div>
  <a href="https://dev.to/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book ">
    <span class="">Download Our Book</span>
  </a>
</div>

Or

<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw">  <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5">    ...  </div>  <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT">    ...  </div></div>

Both of the above stray from regular class declarations and resist attempts to extract data using typical selectors. They're both examples of irregular markup, but in roughly opposite ways. The first example provides very little traditional markup that could be used for typical CSS selectors. The second contains very specific class names that are dynamically generated by something like React. For both, we can use the same handful of advanced CSS selectors to grab the values we want.

CSS Begins With, Ends With, and Contains

You won't encounter these CSS selectors very often when building your own site, and maybe that's why they're often overlooked in explanations. But many people don't realize that you can do essentially regex-like matching with a subset of CSS selectors: substring matches can be applied to HTML attribute/value selectors.

So in the first example above, something like the following works great:

a[data-link-event*='Our_Book']
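For a concrete, minimal sketch of how you might apply that selector in code, here is a hypothetical BeautifulSoup snippet (the markup is the first example from above; the variable names are just for illustration):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<div>
  <a href="https://dev.to/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book ">
    <span class="">Download Our Book</span>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# *= matches any substring of the attribute value, so the stray whitespace
# in " Our_Book " doesn't matter
link = soup.select_one("a[data-link-event*='Our_Book']")

print(link["href"])               # https://dev.to/some/stuff
print(link.get_text(strip=True))  # Download Our Book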

Within CSS, square brackets are used to filter on attributes, and they follow the general format of:

element[attribute=value]
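For instance, a plain exact-match filter against the first snippet above might look like this (it only matches when the attribute value is exactly the quoted string):

a[href="https://dev.to/some/stuff"]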

This in and of itself doesn't solve either of our issues above; what does is the inclusion of three regex-like operators for "begins with," "ends with," and "contains."

In the example above that grabs Our_Book (note that these selectors are case sensitive), the original markup has extra whitespace on either side of the characters. That's where our friend "contains" comes into play. In short, these selectors work like so:

div[class^="beginsWith"]
div[class$="endsWith"]
div[class*="containsThis"]

Here class can be any attribute, and the quoted string matches the beginning, the end, or any substring of the attribute's full value. (Note that for multi-valued attributes like class, these operators match against the entire attribute string, not each class name individually, so the "contains" form is often the safest bet.)
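To tie it together, here is a sketch (again using BeautifulSoup, though any CSS-selector-based tool behaves the same way) for the second, React-generated example above. The only assumption is that the human-readable prefixes such as RailGeneric__AdviceBox and Layout__RailCell stay stable across builds while the hashed suffixes change:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw">
  <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5">...</div>
  <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT">...</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "begins with": the class attribute string starts with the stable prefix
advice_box = soup.select_one('div[class^="RailGeneric__AdviceBox"]')

# "contains": safer when the stable name is not the first class in the attribute
rail_cell = soup.select_one('div[class*="Layout__RailCell"]')

print(advice_box.get_text(strip=True))
print(rail_cell["class"])  # ['Cell-sc-1abjmm4-0', 'Layout__RailCell-sc-1goy157-1', 'hcxgdw']

The same selector strings work unchanged in Selenium (driver.find_elements(By.CSS_SELECTOR, ...)) or in the query box of most point-and-click extractors.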

Conclusion

Thanks for reading this article; I hope it has helped outline some potential methods for scraping pages with dynamic class names. This is certainly an area that continues to be refined and improved, so be sure to stay up to date as progress is made! Feel free to leave any feedback or questions in the comments section.


Original Link: https://dev.to/scrapehunt/two-quick-hacks-for-web-scraping-pages-with-dynamic-css-class-names-j0b
