2025-02-25
Hello! Have you ever wanted to have confidence in the links on your website? Or maybe you're just curious about web scraping and want to give it a try? Well, you're in the right place!
This guide is intended for beginners or anyone who has never tried web scraping before. We will walk through creating a web scraper using Puppeteer to validate links on a website. I'll explain each section of the code and offer some tips along the way.
For those who would just like to look at the code, you can find the full example script at the end of this guide.
In this tutorial, you'll learn how to:
- Launch a browser with Puppeteer and navigate to a page
- Extract every link from that page
- Validate those links with fetch
This guide is fairly straightforward, so let's make a Dex check and jump right in!
Before we begin, make sure you have Node.js installed (version 18 or later, so the built-in fetch is available). You'll also need Puppeteer, which you can install with npm install puppeteer.
First things first, let's set up Puppeteer to visit a page. We'll need to use the browser and page objects to navigate to a URL and extract a page's content. We'll do this by launching a browser instance and opening a blank page. From there, we can navigate to our target URL. Once the page is loaded, we'll extract the HTML content and log it to the console.
I hope that you're comfortable with asynchronous JavaScript, as we'll be using async, await, and try/catch throughout this tutorial. You can thank me later for the practice!
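If you need a quick refresher, the general shape of that pattern looks something like this (a generic sketch with a made-up fetchSomething function, not part of the scraper yet):
const run = async () => {
  try {
    // Wait for the asynchronous work to finish
    const result = await fetchSomething()
    console.log(result)
  } catch (error) {
    // Handle anything that went wrong along the way
    console.log('Something went wrong:', error)
  }
}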
Here's a basic example to get you started:
const puppeteer = require('puppeteer')

const crawl = async () => {
  // Launch a new browser instance
  const browser = await puppeteer.launch()
  // Create a new page within the browser
  const page = await browser.newPage()
  // Navigate to the target URL (for this example I'll use example.com)
  await page.goto('https://example.com/')
  // Extract the HTML content of the page and log it
  console.log(await page.evaluate(() => document.body.innerHTML))
  // Close the browser
  await browser.close()
}

crawl()
If you run this script using Node.js, you should see the HTML content of the example.com homepage logged to the console.
For my next trick, I'll extract all the links from a page. This involves querying the DOM for any and all anchor (<a>) elements and retrieving their href attributes. Let's update the page evaluation step from the previous example with a way to extract the links.
page.evaluate is a method that'll allow you to run code within the context of the page. You can use it to interact with the DOM, extract data, and a lot more! In this case, we'll use it to query all anchor elements and return an array of their href attributes.
// Replace console.log(await page.evaluate(() => document.body.innerHTML))
const links = await page.evaluate(() => {
  // Locate all anchor elements on the page
  const anchors = Array.from(document.querySelectorAll('a'))
  // Extract the href attribute from each anchor element
  return anchors.map((anchor) => anchor.href)
})

console.log(links)
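One detail worth knowing: anchor.href gives you the fully resolved, absolute URL, so the array may also include things like mailto: or tel: links that fetch can't check. If you only want regular web links, a quick filter helps (an optional sketch):
// Keep only links that point at http(s) URLs
const httpLinks = links.filter((link) => link.startsWith('http'))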
Now that we're able to extract every link from a page, it's time to check if they actually work! Let's use Node's built-in fetch for validation. We'll loop through each link and make a request to it. If the request is successful, we'll log a success message. If not, we'll log an error message.
Just to be clear, we're not going to be checking the content of the links. We're only checking that each link is valid and returns a successful status code (anything in the 200 range). If you want to check the content of the links, you'll need to navigate to them using the page.goto method and extract the content.
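If you ever do want to go that far, a rough sketch of the idea might look like this (not something we'll use in the final script):
for (const link of links) {
  // Visit the link itself instead of just requesting it
  await page.goto(link)
  // Pull something from the page, like its title, to inspect what loaded
  console.log(`${await page.title()} - ${link}`)
}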
Note: Did you know that you could use emojis in your code? It's a fun way to add some personality to your scripts! 🎉
for (const link of links) {
  try {
    // Make a request to the link
    const res = await fetch(link)
    // Log a result message based on the response
    const message = `${res.ok ? '✅' : '❌'} - ${link}`
    console.log(message)
  } catch (error) {
    // Include the link so you know which request failed
    console.log(`❌ - ${link} (${error.message})`)
  }
}
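Since we only care about the status code, you could also ask the server for just the headers by passing method: 'HEAD' to fetch. This is optional, and some servers handle HEAD requests poorly, so treat it as an experiment:
// Request only the headers instead of downloading the whole response body
const res = await fetch(link, { method: 'HEAD' })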
Now that we have all the pieces, let's put them together into a single script. This script will visit a page, extract all links from it, and then validate them.
Be aware that this script will only work for simple websites. If you're dealing with a more complex website, you may need to adjust the script to handle JavaScript rendering, rate limiting, and other issues.
const puppeteer = require('puppeteer')

const crawl = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://example.com/')

  // Gather every link on the page
  const links = await page.evaluate(() => {
    const anchors = Array.from(document.querySelectorAll('a'))
    return anchors.map((anchor) => anchor.href)
  })

  // Validate each link and log the result
  for (const link of links) {
    try {
      const res = await fetch(link)
      const message = `${res.ok ? '✅' : '❌'} - ${link}`
      console.log(message)
    } catch (error) {
      console.log(`❌ - ${link} (${error.message})`)
    }
  }

  await browser.close()
}

crawl()
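Remember the caveat about more complex websites? If a page builds its links with JavaScript, you may need to give it time to finish rendering before reading the DOM. One small, optional adjustment is Puppeteer's waitUntil option:
// Wait until network activity has mostly settled before treating the page as loaded
await page.goto('https://example.com/', { waitUntil: 'networkidle2' })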
That was surprisingly easy, right? You've just built a web scraper using Puppeteer to validate links on your website! You can take this script and expand upon it to suit your needs. And believe me, this is just "scraping" the surface of what you can do with web scraping! (Pun intended)
If you're looking for some ideas on how to expand upon this script, here are a few suggestions:
- Follow each link with page.goto and check the content of the page, not just its status code
- Add handling for rate limiting, timeouts, and redirects
- Deduplicate the links before checking them (see the one-liner below)
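That last one is quick to try. Assuming links is the array we extracted earlier, a Set drops any duplicate URLs (a minimal sketch):
// Keep only one copy of each URL before validating
const uniqueLinks = [...new Set(links)]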
This is Part 1 of a two-part series on web scraping. In the next part, we'll explore how to recursively crawl a website and validate links across multiple pages. Stay tuned!
Enjoy my content? You can read more on my blog at The Glitched Goblet or follow me on BlueSky at kaemonisland.bsky.social. I'll be updating older posts and adding new ones each week, so be sure to check back often!