
Building a Web Scraper with Puppeteer for Link Validation

2025-02-25

Introduction

Hello! Have you ever wanted to have confidence in the links on your website? Or maybe you're just curious about web scraping and want to give it a try? Well, you're in the right place!

This guide is intended for beginners or anyone who's never tried web scraping before. We'll walk through creating a web scraper with Puppeteer to validate the links on a website. I'll explain each section of the code and offer some tips along the way.

If you'd rather jump straight to the code, you can find the full example script at the end of this guide.

In this tutorial, you'll learn how to:

  • Use Puppeteer to scrape web pages.
  • Extract links from the DOM.
  • Validate links using Node's built-in fetch API.

This guide is fairly straightforward, so let's make a Dex check and jump right in!

Setup

Before we begin, make sure you have Node.js installed. You'll also need:

  • Puppeteer, for headless browsing. You can install it with npm install puppeteer.
  • A way to make HTTP requests. Node.js 18 and later ships with a built-in fetch API, which is what we'll use in this tutorial.

Visiting a Page

First things first, let's set up Puppeteer to visit a page. We'll need to use the browser and page objects to navigate to a URL and extract a page's content. We'll do this by launching a browser instance and opening a blank page. From there, we can navigate to our target URL. Once the page is loaded, we'll extract the HTML content and log it to the console.

I hope that you're comfortable with asynchronous JavaScript, as we'll be using async, await, and try/catch throughout this tutorial. You can thank me later for the practice!

Here's a basic example to get you started:

const puppeteer = require('puppeteer')

const crawl = async () => {
  // Launch a new browser instance
  const browser = await puppeteer.launch()
  // Create a new page within the browser
  const page = await browser.newPage()

  // Navigate to the target URL (For this example I'll use example.com)
  await page.goto('https://example.com/')
  // Extract the HTML content of the page and log it
  console.log(await page.evaluate(() => document.body.innerHTML))
  // Close the browser
  await browser.close()
}

crawl()

If you run this script using Node.js, you should see the HTML content of the example.com homepage logged to the console.

Extracting Links

For my next trick, I'll extract all the links from a page. This involves querying the DOM for every anchor (<a>) element and retrieving its href attribute. Let's update the page evaluation step from the previous example with a way to extract the links.

page.evaluate is a method that'll allow you to run code within the context of the page. You can use it to interact with the DOM, extract data, and a lot more! In this case, we'll use it to query all anchor elements and return an array of their href attributes.

// Replace console.log(await page.evaluate(() => document.body.innerHTML))
const links = await page.evaluate(() => {
  // Locate all anchor elements on the page
  const anchors = Array.from(document.querySelectorAll('a'))
  // Extract the href attribute from each anchor element
  return anchors.map((anchor) => anchor.href)
})

console.log(links)
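
One thing to watch out for: anchor.href can include values that can't be validated with an HTTP request, such as mailto: or tel: links. If your site has those, a quick filter keeps only the HTTP(S) URLs. This is just a sketch; adjust it to whatever your pages actually link to.

// Keep only links that can be checked with an HTTP request
const httpLinks = links.filter((link) => link.startsWith('http'))

console.log(httpLinks)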

Validating Links

Now that we're able to extract every link from a page, it's time to check whether they actually work! Let's use Node's built-in fetch API for validation. We'll loop through each link and make a request to it. If the request succeeds, we'll log a success message. If not, we'll log an error message.

Just to be clear, we're not going to be checking the content of the links. We're only checking whether each link is reachable and returns a successful status code (anything in the 2xx range, which is what res.ok reports). If you want to check the content of the links, you'll need to navigate to them with the page.goto method and extract the content yourself.
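
If you're curious what checking content could look like, here's a rough sketch (not part of the final script) that navigates to each link with the same page object and reads its title. It's much slower than a plain request, which is why we'll stick to status codes here.

// A rough sketch of content checking (not used in the final script)
for (const link of links) {
  // Navigate the existing page to the link itself
  await page.goto(link)
  // Grab something from the rendered page, like its title
  const title = await page.evaluate(() => document.title)
  console.log(`${link} -> ${title}`)
}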

Note: Did you know that you can use emojis in your code? It's a fun way to add some personality to your scripts! 🎉

for (const link of links) {
  try {
    // Make a request to the link
    const res = await fetch(link)
    // Log a result message based on the response
    const message = `${res.ok ? '✅' : '❌'} - ${link}`
    console.log(message)
  } catch (error) {
    // Log the link along with the error so we know which request failed
    console.log(`❌ - ${link} (${error.message})`)
  }
}
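
If you'd rather not download full response bodies just to read a status code, a HEAD request is often enough. Some servers handle HEAD differently than GET, so treat this as a sketch rather than a drop-in replacement:

for (const link of links) {
  try {
    // Request only the response headers, not the body
    const res = await fetch(link, { method: 'HEAD' })
    console.log(`${res.ok ? '✅' : '❌'} - ${link}`)
  } catch (error) {
    console.log(`❌ - ${link} (${error.message})`)
  }
}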

Putting it All Together

Now that we have all the pieces, let's put them together into a single script. This script will visit a page, extract all links from it, and then validate them.

Be aware that this script is geared toward simple websites. If you're dealing with a more complex site, you may need to adjust it to wait for client-side rendering to finish, respect rate limits, and handle other issues.

const puppeteer = require('puppeteer')

const crawl = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()

  await page.goto('https://example.com/')
  const links = await page.evaluate(() => {
    const anchors = Array.from(document.querySelectorAll('a'))
    return anchors.map((anchor) => anchor.href)
  })

  for (const link of links) {
    try {
      const res = await fetch(link)
      const message = `${res.ok ? '✅' : '❌'} - ${link}`
      console.log(message)
    } catch (error) {
      // Log the link along with the error so we know which request failed
      console.log(`❌ - ${link} (${error.message})`)
    }
  }
  await browser.close()
}

crawl()

Conclusion

That was surprisingly easy, right? You've just built a web scraper using Puppeteer to validate links on your website! You can take this script and expand upon it to suit your needs. And believe me, this is just "scraping" the surface of what you can do with web scraping! (Pun intended)

Want to Gain Inspiration?

If you're looking for some ideas on how to expand upon this script, here are a few suggestions:

  • Save the results to a file.
  • Add a delay between requests to avoid rate limiting (see the sketch after this list).
  • Use a headless browser to interact with pages that require JavaScript to render.
  • Implement a recursive function to crawl multiple pages.
  • Add a user agent to mimic a real browser.
  • Use batching to make requests in parallel.
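
To give you a head start on the delay suggestion, here's a minimal sketch built around a small sleep helper. The 500 ms value is arbitrary; tune it for the site you're crawling.

// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

for (const link of links) {
  // ...validate the link exactly as before...
  await sleep(500) // Pause before making the next request
}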

Plugs

This is Part 1 of a two-part series on web scraping. In the next part, we'll explore how to recursively crawl a website and validate links across multiple pages. Stay tuned!

Enjoy my content? You can read more on my blog at The Glitched Goblet or follow me on BlueSky at kaemonisland.bsky.social. I'll be updating older posts and adding new ones each week, so be sure to check back often!