Two Quick Hacks for Web Scraping Pages With Dynamic CSS Class Names
Even with great tools, one of the challenges of crawling and extracting data from pages at large scale is that you don't really know what structure a page will have before you get to it.
Here are a couple of quick hacks for scraping data from pages with dynamic CSS class names:
1. Don't Use Rule-Based Extraction
Rule-based extraction is fine for small-scale scraping, one-off scripts to grab some data, and sites that don't routinely change. But these days, a site with data of any value that isn't dynamic to some degree is relatively rare.
Additionally, writing extraction rules for a given domain doesn't scale to multiple domains. Simply keeping web data from even a small group of domains regularly updated routinely requires a whole team to manage the process. And the process still breaks down. Trust me, we hear this constantly in conversations with current and potential clients.
So you have a few choices for following this tip, or at least for avoiding what it's meant to avoid: unscalable or regularly broken scrapers.
First, you can build your own non-rule-based form of extraction. There are more free training data sets out there than ever before, and out-of-the-box NLP from a handful of providers keeps improving. Particularly if you want to focus on a small set of domains, you may be able to pull this off.
Second, you can reach out to the small handful of providers who truly offer rule-less web extraction. If you want to extract from a wide range of sites, your target sites change regularly, or you're after a variety of document types, this is likely the way to go.
Third, you can stick to gathering public web data from particularly well-known sites. At the end of the day, this may simply mean paying someone else to maintain rule-based extractors for you. But, for example, there's a veritable cottage industry around scraping very specific sites like social media platforms, whose whole business is providing up-to-date extractors for things like the member list of a given Facebook group.
2. If You Have To Use Rule-Based Extraction, Try Out These Advanced Selectors
If you truly can't find a way to extract what you need with one of the options above, there are a few ways you can at least future-proof your scraping of dynamic content.
After all, this type of rule-based selector extraction is how most major extraction services work (Import.io, browser-plugin web extractors, Octoparse, or rolling your own extractor with something like Selenium or BeautifulSoup).
Now there are a few scenarios where these selectors become useful. Typically, if a site is well structured, class and ID names make sense, and you have classed elements inside of classed elements, you're fine without these techniques. But if you've spent any time with web scraping, don't tell me you haven't occasionally gotten a few of these:
<div>
  <a href="https://dev.to/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book ">
    <span class="">Download Our Book</span>
  </a>
</div>
Or
<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw">
  <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5"> ... </div>
  <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT"> ... </div>
</div>
Both of the above stray from regular class declarations and resist attempts to extract data with typical selectors. They're both examples of irregular markup, but in opposite ways: the first provides very little traditional markup to hang a typical CSS selector on, while the second contains very specific class names that are dynamically generated by something like React. For both, we can use the same handful of advanced CSS selectors to grab the values we want.
CSS Begins With, Ends With, and Contains
You won't encounter these CSS selectors very often when building your own site, and maybe that's why they're often overlooked in explanations. But many people don't know that you can use regex-like substring matching in a subset of CSS selectors: HTML attribute/value selectors.
So in the first example above, something like the following works great:
a[data-link-event*='Our_Book']
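As a minimal sketch of putting that selector to work, here it is applied with BeautifulSoup's select_one() (which uses CSS selector syntax) against the first markup example above. Note the selector targets the data-link-event attribute, since that's where the " Our_Book " value lives in that markup:

```python
# Sketch: a "contains" (*=) attribute selector via BeautifulSoup,
# applied to the first irregular-markup example above.
from bs4 import BeautifulSoup

html = '''
<div>
  <a href="https://dev.to/some/stuff"
     data-event="ev=filedownload"
     data-link-event=" Our_Book ">
    <span class="">Download Our Book</span>
  </a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# *= matches any substring of the attribute value, so the extra
# whitespace around "Our_Book" in the markup doesn't matter.
link = soup.select_one('a[data-link-event*="Our_Book"]')

print(link["href"])               # https://dev.to/some/stuff
print(link.get_text(strip=True))  # Download Our Book
```

Because the match is on a substring, this keeps working even if the site later pads, prefixes, or suffixes the attribute value.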
Within CSS, square brackets filter on attributes, following the general format:
element[attribute=value]
This in and of itself doesn't solve either of our issues above; what does is the inclusion of the three operators for begins with, ends with, and contains.
In the example above grabbing Our_Book (note these selectors are case sensitive), the original markup has extra whitespace on either side of the value. That's where our friend contains comes into play. In short, these selectors work like so:
div[class^="beginsWith"]
div[class$="endsWith"]
div[class*="containsThis"]
Here class can be any attribute, and the quoted string matches the beginning, the end, or any substring of the attribute's value.
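These operators are exactly what rescues the second example, where the readable part of the class name is stable but the generated hash suffixes change between builds. A sketch, again using BeautifulSoup, anchoring on the stable prefix:

```python
# Sketch: begins-with (^=) and contains (*=) selectors against
# dynamically generated class names (the second example above).
# The hashed suffixes change between builds; the prefix is stable.
from bs4 import BeautifulSoup

html = '''
<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw">
  <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5">rail</div>
  <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT">advice</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# ^= anchors on the human-readable prefix, so the selector survives
# a rebuild that changes the "-sc-..." hash suffix.
rail = soup.select_one('div[class^="RailGeneric__RailBox"]')

# *= is looser still: any substring of the class name will do.
advice = soup.select_one('div[class*="AdviceBox"]')

print(rail.get_text())    # rail
print(advice.get_text())  # advice
```

One design note: prefer begins-with on the most specific stable prefix you can find; a bare contains match on a short string risks catching unrelated elements elsewhere on the page.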
Conclusion
Thanks for reading! I hope this has outlined some useful approaches to scraping pages with dynamic CSS class names. This is an area that continues to evolve, so be sure to keep up to date as tools improve. Feel free to leave feedback or questions in the comments.
Original Link: https://dev.to/scrapehunt/two-quick-hacks-for-web-scraping-pages-with-dynamic-css-class-names-j0b