November 4, 2022 04:37 pm GMT

Scrape images from a search engine with JavaScript and Puppeteer

Introduction

In the previous post of this series, we saw how to use Node.js and Puppeteer to scrape and search content on web pages. I recommend reading it first if you have never used Puppeteer or need to set up the project.

In this article, we will fetch full-resolution images from a search engine. Our goal this time is to get a picture of every dog breed.

Script to get the image links

You should have Node.js and Puppeteer installed with npm or yarn.
We will use the same methods as in the first part.
We are going to use a simple JSON file as our list of dog breeds, which can be found here: dog breeds dataset
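Since the script calls string methods on each entry, the dataset is presumably a flat JSON array of breed names. As a minimal sketch under that assumption, the list could be cleaned before looping over it. The helper name `normalizeBreeds` and the sample values are hypothetical and not part of the original script.

```javascript
// Hypothetical helper: assumes the dataset is a flat JSON array of strings.
// It trims names, drops empty or non-string entries, and removes duplicates
// (compared case-insensitively) so that no search is run twice.
function normalizeBreeds(raw) {
  if (!Array.isArray(raw)) {
    throw new TypeError("Expected the dataset to be an array of breed names")
  }
  const seen = new Set()
  const breeds = []
  for (const entry of raw) {
    if (typeof entry !== "string") continue
    const name = entry.trim()
    const key = name.toLowerCase()
    if (name === "" || seen.has(key)) continue
    seen.add(key)
    breeds.push(name)
  }
  return breeds
}

// Made-up sample standing in for require("./dog-breeds.json")
const sampleBreeds = ["Akita", " Beagle ", "akita", "", 42, "Border Collie"]
console.log(normalizeBreeds(sampleBreeds))
```

In the script, the loop would then iterate over `normalizeBreeds(data)` instead of `data`.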

As for the search engine, we will scrape DuckDuckGo because it makes it easy to get the images at full resolution, which can be trickier on Google Images.
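The script in this article builds the search URL by replacing spaces with "+" in the breed name. A slightly more defensive sketch could rely on `URLSearchParams`, which also escapes characters such as "&" or "/". The query parameters are the ones used in the script; the helper name `buildImageSearchUrl` is hypothetical.

```javascript
// Hypothetical helper: builds the DuckDuckGo image search URL for one query.
// The parameter names and values (va, t, iar, iax, ia) are copied from the
// script in this article; their exact meaning is not documented here.
function buildImageSearchUrl(query) {
  const url = new URL("https://duckduckgo.com/")
  url.search = new URLSearchParams({
    q: query,
    va: "b",
    t: "hc",
    iar: "images",
    iax: "images",
    ia: "images",
  }).toString()
  return url.toString()
}

console.log(buildImageSearchUrl("German Shepherd"))
```

`URLSearchParams` encodes spaces as "+", so for simple breed names the result matches the URL built with `replaceAll`.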

const puppeteer = require("puppeteer")
const data = require("./dog-breeds.json")

const script = async () => {
  //headless: false opens a visible Chromium window, which is useful to see what is going on while testing
  const browser = await puppeteer.launch({ headless: false, slowMo: 100 })
  const page = await browser.newPage()
  //loop on every breed
  for (const dogBreed of data) {
    console.log("Start for breed:", dogBreed)
    const url = `https://duckduckgo.com/?q=${dogBreed.replaceAll(" ", "+")}&va=b&t=hc&iar=images&iax=images&ia=images`
    //in case we encounter a page without images or an error
    try {
      //page.goto already waits for the page to load, so no extra waitForNavigation is needed
      await page.goto(url)
      //make sure the page contains our targeted element
      await page.waitForSelector(".tile--img__media")
      //open the panel that contains the image info
      await page.evaluate(() => {
        document.querySelector(".tile--img__media").click()
      })
      //get the link of the image from the panel
      await page.waitForSelector(".detail__pane a")
      const link = await page.evaluate(() => {
        const links = document.querySelectorAll(".detail__pane a")
        //"fichier" is the French word for "file": adapt it to the language of your browser
        const linkImage = Array.from(links).find((item) =>
          item.innerText.includes("fichier")
        )
        return linkImage?.getAttribute("href")
      })
      console.log("link successfully retrieved:", link)
      console.log("=====")
    } catch (e) {
      console.log(e)
    }
  }
  //close the browser once every breed has been processed
  await browser.close()
}

script()
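The script finds the image link by looking for the word "fichier", which only works when the page is displayed in French. As a rough sketch, the lookup could accept several labels. The "file" label is an assumption about the English interface, and `findImageLink` is a hypothetical name. The function works on plain objects here so that it can be tried without a browser.

```javascript
// Assumed labels: "fichier" comes from the script, "file" is a guess for an
// English interface. Check the actual link text in your own browser.
const FILE_LINK_LABELS = ["fichier", "file"]

// Hypothetical helper: returns the href of the first link whose text contains
// one of the labels (case-insensitive), or undefined when nothing matches.
function findImageLink(links, labels = FILE_LINK_LABELS) {
  const match = links.find((item) => {
    const text = (item.innerText || "").toLowerCase()
    return labels.some((label) => text.includes(label))
  })
  return match ? match.href : undefined
}

// Made-up stand-ins for the anchors found in the detail panel
const sampleLinks = [
  { innerText: "Visit page", href: "https://example.com/page" },
  { innerText: "View file", href: "https://example.com/dog.jpg" },
]
console.log(findImageLink(sampleLinks))
```

Functions defined in Node.js are not available inside `page.evaluate`, so in the real script the same logic would have to be written inside the callback, with the labels passed as an argument.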

After running the script with node scrapeImages.js you should get something like this:

Gif scraping puppeteer
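The script only logs each link to the console. If you want to keep the links for a later step, one option is to collect them in an object and write it to a JSON file. This is a minimal sketch: the helper name `saveLinks`, the output file name and the sample links are all hypothetical.

```javascript
const fs = require("fs")
const os = require("os")
const path = require("path")

// Hypothetical helper: writes { breed: link } pairs to a JSON file and skips
// the breeds for which no link was found (undefined or null values).
function saveLinks(linksByBreed, filePath) {
  const found = Object.fromEntries(
    Object.entries(linksByBreed).filter(([, link]) => typeof link === "string")
  )
  fs.writeFileSync(filePath, JSON.stringify(found, null, 2))
  return found
}

// Made-up data; in the script the object would be filled inside the loop
const outputFile = path.join(os.tmpdir(), "dog-breed-links.json")
const saved = saveLinks(
  { Akita: "https://example.com/akita.jpg", Beagle: undefined },
  outputFile
)
console.log(saved)
```

Keeping the links in a file also means the download step can be rerun without scraping the search engine again.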

Download and optimize the images

We now have the link to every image, but some of them are quite heavy (>1 MB).
Fortunately, we can use another Node.js library to reduce their size with minimal loss of quality: sharp

It is a very widely used library (2M+ weekly downloads) to convert, resize and optimize images.

You can add this inside the loop, right after the link has been retrieved, to get a folder filled with the optimized images:

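//these three lines go at the top of the file
const fs = require("fs")
const https = require("https")
const sharp = require("sharp")

//https.get does not return a promise, so we wrap the download in one
//to make sure sharp only runs once the file is fully written
await new Promise((resolve, reject) => {
  const stream = fs.createWriteStream(`./${dogBreed}.jpg`)
  stream.on("error", reject)
  stream.on("finish", () => {
    console.log("Download completed")
    resolve()
  })
  https.get(link, (response) => response.pipe(stream)).on("error", reject)
})
//resize to a maximum width or height of 1000px, keeping the aspect ratio
await sharp(`./${dogBreed}.jpg`)
  .resize(1000, 1000, { fit: "inside", withoutEnlargement: true })
  .toFile(`./${dogBreed}-small.jpg`)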
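The snippet above uses the breed name directly as the file name, which can be fragile when a name contains spaces, dots or slashes. A small sketch of a sanitizing helper follows. The name `toFileName` is hypothetical, and the rule of keeping only ASCII letters and digits is an arbitrary choice that would replace accented characters as well.

```javascript
// Hypothetical helper: turns a breed name into a safe, lowercase file name.
// Every run of characters other than a-z and 0-9 becomes a single dash,
// and dashes at the start or end are removed.
function toFileName(breed) {
  return breed
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "")
}

console.log(toFileName("German Shepherd"))
console.log(toFileName("St. Bernard"))
```

The download and resize steps could then use `toFileName(dogBreed)` wherever they currently use `dogBreed` in a path.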

Conclusion

You can adapt this script to get pretty much anything. You are also not limited to the first image for each query: you could collect every image instead. As for myself, I used this script to get the initial images for a tool I'm working on: https://dreamclimate.city

screenshot dream climate city personal project

Thanks for reading! This article is part of a series. To get notified about the next one, follow me on Twitter, where I also share tips on development and design, as well as my journey to create my own startup studio.


Original Link: https://dev.to/antoine_m/scrape-images-from-a-search-engine-with-javascript-and-puppeteer-3dlh
