April 21, 2023 03:04 am GMT
Original Link: https://dev.to/coderhxl/a-flexible-nodejs-multifunctional-crawler-library-x-crawl-3ja7
A flexible Node.js multifunctional crawler library x-crawl
x-crawl
x-crawl is a flexible Node.js multifunctional crawler library. It can crawl pages, crawl interfaces, crawl files, and run polling crawls.
If you like x-crawl, you can give the x-crawl repository a star to support it. Thank you for your support!
Features
- Async/Sync - Switch between asynchronous and synchronous crawling by changing the mode attribute value.
- Multiple functions - It can crawl pages, crawl interfaces, crawl files, and run polling crawls, and it supports crawling single or multiple targets.
- Flexible writing style - Simple target configuration, detailed target configuration, mixed target-array configuration, and advanced configuration: the same crawling API adapts to all of them.
- Device Fingerprinting - Zero configuration or custom configuration to avoid being identified and tracked by fingerprinting across different locations.
- Interval Crawling - No interval, fixed interval, and random interval modes let you generate or avoid highly concurrent crawling.
- Retry on failure - Global settings, local settings, and individual settings avoid crawling failures caused by temporary problems.
- Priority Queue - A crawling target with a higher priority is crawled ahead of other targets.
- Crawl SPA - Crawl an SPA (Single Page Application) to generate pre-rendered content (aka "SSR" (Server Side Rendering)).
- Controlling Pages - Headless browsers can submit forms, send keystrokes, trigger event actions, generate screenshots of pages, and more.
- Capture Record - Capture and record crawling results and other information, with highlighted reminders in the console.
- TypeScript - Ships its own types, with complete typing implemented through generics.
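The mode, retry, interval, and priority features above can be sketched as configuration. This is a minimal sketch, not a verified API reference: `maxRetry` and `intervalTime` appear in the article's own example, while `mode` is named in the feature list and the per-target `priority` field is an assumption based on the "Priority Queue" feature; confirm the exact option names in the official docs.

```typescript
// Configuration sketch for x-crawl (option names partly assumed; see lead-in).
import xCrawl from 'x-crawl'

// Global settings: crawling mode, retry on failure, random crawl interval
const crawler = xCrawl({
  mode: 'sync',                          // switch async/sync crawling ('async' is assumed default)
  maxRetry: 2,                           // global retry-on-failure setting
  intervalTime: { max: 3000, min: 1000 } // random interval between targets
})

// Simple target configuration: plain URL strings
crawler.crawlPage(['https://example.com/a', 'https://example.com/b'])

// Detailed target configuration: per-target overrides, e.g. priority
// (hypothetical field names, illustrating the priority-queue feature)
crawler.crawlPage([
  { url: 'https://example.com/important', priority: 9, maxRetry: 5 },
  { url: 'https://example.com/later', priority: 1 }
])
```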
Example
As an example, let's automatically crawl some pictures of Airbnb Hawaii experiences and Plus listings every day:
// 1. Import module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })

// 3. Set the crawling task
/* Call the startPolling API to start the polling function,
   and the callback function will be called every other day */
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call the crawlPage API to crawl the pages
  const res = await myXCrawl.crawlPage([
    'https://zh.airbnb.com/s/hawaii/experiences',
    'https://zh.airbnb.com/s/hawaii/plus_homes'
  ])

  // Store the image URLs in targets
  const targets = []
  const elSelectorMap = ['.c14whb16', '.a1stauiv']
  for (const item of res) {
    const { id } = item
    const { page } = item.data

    // Get the URLs of the page's carousel image elements
    const boxHandle = await page.$(elSelectorMap[id - 1])
    const urls = await boxHandle!.$$eval('picture img', (imgEls) => {
      return imgEls.map((item) => item.src)
    })
    targets.push(...urls)

    // Close the page
    page.close()
  }

  // Call the crawlFile API to crawl the pictures
  myXCrawl.crawlFile({ targets, storeDir: './upload' })
})
Running result: (demo recording not reproduced here)
Note: Do not crawl sites indiscriminately; check the site's robots.txt protocol before crawling. This example only demonstrates how to use x-crawl.
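Checking robots.txt before crawling can itself be automated. Below is a deliberately minimal, illustrative parser (not part of x-crawl's API): it only handles `User-agent: *` groups and `Disallow` prefix rules, ignoring the many other directives a real robots.txt may contain.

```typescript
// Minimal robots.txt check (illustrative sketch only; real robots.txt
// parsing has more directives and edge cases than this handles).
function isPathAllowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = []
  let appliesToUs = false // true while inside a "User-agent: *" group
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.trim()
    const [key, ...rest] = line.split(':')
    const value = rest.join(':').trim()
    if (key.toLowerCase() === 'user-agent') {
      appliesToUs = value === '*'
    } else if (appliesToUs && key.toLowerCase() === 'disallow' && value) {
      disallowed.push(value)
    }
  }
  // Allowed unless the path matches a Disallow prefix rule
  return !disallowed.some((rule) => path.startsWith(rule))
}

const robots = ['User-agent: *', 'Disallow: /private/'].join('\n')
console.log(isPathAllowed(robots, '/private/data')) // false
console.log(isPathAllowed(robots, '/public/page'))  // true
```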
More
For more detailed documentation, please check: https://github.com/coder-hxl/x-crawl