Scrape images from a search engine with JavaScript and Puppeteer
Introduction
In the previous post of this series, we discovered how to use Node.js and Puppeteer to scrape and search content on web pages. I recommend reading it first if you have never used Puppeteer or need to set up the project.
In this article, we will fetch full-resolution images from a search engine. Our goal is to get a picture of every dog breed.
Script to get the image links
You should have Node.js installed, with Puppeteer added via npm or yarn.
We will use the same methods as in the first part.
We are going to use a simple JSON file as our list of dog breeds, which can be found here: dog breeds dataset
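The exact structure depends on the dataset you download; the script below assumes a flat JSON array of breed-name strings. A minimal sketch of that assumed shape:

```javascript
// Hypothetical excerpt of dog-breeds.json: a flat array of breed names.
// The real dataset linked above may be structured differently.
const raw = `["affenpinscher", "afghan hound", "airedale terrier"]`;

const breeds = JSON.parse(raw);

// Iterate the same way the scraper does.
for (const breed of breeds) {
  console.log("Start for breed:", breed);
}
```

If your dataset nests breed names inside objects, map it down to an array of strings first so the rest of the script works unchanged.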
As for the search engine, we will scrape DuckDuckGo because it makes it easy to get images at full resolution, which is trickier on Google Images.
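The search URL is built by replacing spaces in the query with "+" and appending DuckDuckGo's image-search parameters. A small helper shows the idea (the buildSearchUrl name is mine; the query parameters are the same ones used in the script below):

```javascript
// Build a DuckDuckGo image-search URL for a query string.
// buildSearchUrl is an illustrative helper name; the parameters
// (iar/iax/ia=images) switch DuckDuckGo to the images tab.
function buildSearchUrl(query) {
  const q = query.replaceAll(" ", "+");
  return `https://duckduckgo.com/?q=${q}&va=b&t=hc&iar=images&iax=images&ia=images`;
}

console.log(buildSearchUrl("golden retriever"));
// https://duckduckgo.com/?q=golden+retriever&va=b&t=hc&iar=images&iax=images&ia=images
```

Note that String.prototype.replaceAll requires Node.js 15 or later; on older versions use a global-regex replace instead.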
```javascript
const puppeteer = require("puppeteer")
const data = require("./dog-breeds.json")

const script = async () => {
  // this opens a visible Chromium window, which is useful to see what
  // is going on and to test things before finalizing the script
  const browser = await puppeteer.launch({ headless: false, slowMo: 100 })
  const page = await browser.newPage()

  // loop over every breed
  for (let dogBreed of data) {
    console.log("Start for breed:", dogBreed)
    const url = `https://duckduckgo.com/?q=${dogBreed.replaceAll(
      " ",
      "+"
    )}&va=b&t=hc&iar=images&iax=images&ia=images`
    // in case we encounter a page without images or an error
    try {
      await page.goto(url)
      // make sure the page is loaded and contains our targeted element
      await page.waitForSelector(".tile--img__media")
      await page.evaluate(() => {
        const firstImage = document.querySelector(".tile--img__media")
        // open the panel that contains the image info
        firstImage.click()
      })
      // get the link of the image from the panel
      await page.waitForSelector(".detail__pane a")
      const link = await page.evaluate(() => {
        const links = document.querySelectorAll(".detail__pane a")
        // "fichier" matches the French DuckDuckGo UI label; adjust it
        // to your locale if the panel is displayed in another language
        const linkImage = Array.from(links).find((item) =>
          item.innerText.includes("fichier")
        )
        return linkImage?.getAttribute("href")
      })
      console.log("link successfully retrieved:", link)
      console.log("=====")
    } catch (e) {
      console.log(e)
    }
  }
  await browser.close()
}

script()
```
After running the script with `node scrapeImages.js`, you should see each breed and its retrieved image link logged to the console.
Download and optimize the images
We now have the link of every image, but some of them are quite heavy (>1 MB).
Fortunately, we can use another Node.js library, sharp, to compress them with minimal loss of quality.
It is a massively used library (2M+ weekly downloads) for converting, resizing and optimizing images.
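Resizing to "a maximum width or height of 1000px" means scaling down while preserving the aspect ratio, which is what sharp's `resize(max, max, { fit: "inside", withoutEnlargement: true })` does. The helper below (fitInside is my own illustrative name) shows the underlying arithmetic:

```javascript
// Compute the output dimensions for fitting (width, height) inside a
// max-by-max box while preserving aspect ratio, without enlarging
// images that already fit. fitInside is an illustrative helper, not
// part of sharp's API.
function fitInside(width, height, max) {
  if (width <= max && height <= max) return { width, height }
  const scale = max / Math.max(width, height)
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  }
}

console.log(fitInside(3000, 2000, 1000)) // { width: 1000, height: 667 }
```

sharp performs this computation internally; you only pass the bounding box and the fit option.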
You can add this inside the loop, after the link is retrieved, to fill a folder with the optimized images. Don't forget to install sharp and to require fs, https, and sharp at the top of the script:
```javascript
const fs = require("fs")
const https = require("https")
const sharp = require("sharp")

// ...inside the loop, after the link is retrieved:
const stream = fs.createWriteStream(dogBreed + ".jpg")
https.get(link, (response) => {
  response.pipe(stream)
  stream.on("finish", async () => {
    stream.close()
    console.log("Download Completed")
    // resize to a maximum width or height of 1000px,
    // preserving the aspect ratio
    await sharp(`./${dogBreed}.jpg`)
      .resize(1000, 1000, { fit: "inside" })
      .toFile(`./${dogBreed}-small.jpg`)
  })
})
```
Note that the sharp call must run inside the "finish" handler: https.get does not return a promise, so awaiting it would not wait for the download, and resizing before the file is fully written would fail.
Conclusion
You can adapt this script to fetch pretty much anything, and you don't have to limit yourself to the first image for each query: you can grab every image. As for myself, I used this script to get the initial images for a tool I'm working on: https://dreamclimate.city
Thanks for reading! This article is part of a series; to get notified about the next one, follow me on Twitter, where I also share tips on development and design and document my journey of creating my own startup studio.
Original Link: https://dev.to/antoine_m/scrape-images-from-a-search-engine-with-javascript-and-puppeteer-3dlh