
Building a Web Scraper for Link Validation: Part 2

2025-03-07

Introduction

Welcome back! In Part 1, we built a simple web scraper that extracts and validates links from a single page. Now we're taking it to the next level. In this tutorial, we'll take that scraper and add the following:

  1. Recursive Crawling: So we can follow links and keep scraping.
  2. Setting a Crawl Depth: To control how deep our scraper goes. (We can't let it get existential)
  3. Adding a Delay: To avoid hammering the server.
  4. Batching Requests: To validate links in groups for better performance.

Grab your favorite beverage and let’s level up our scraper!

Just in case you missed the first tutorial, this is the web scraper we built and will be enhancing. I've also realized there are a few updates we need to make to the code to make it more efficient and robust, so I've added comments to the areas that were updated:

const puppeteer = require('puppeteer')

const crawl = async () => {
  // Add a headless option to the launch function. It makes the crawler run in the background.
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()

  // Add a waitUntil option to the goto function to wait for the page to load.
  await page.goto('https://example.com/', { waitUntil: 'networkidle2' })
  const links = await page.evaluate(() => {
    const anchors = Array.from(document.querySelectorAll('a'))
    return anchors.map((anchor) => anchor.href)
  })

  for (const link of links) {
    try {
      const res = await fetch(link)
      const message = `${res.ok ? '✅' : '❌'} - ${link}`
      console.log(message)
    } catch (error) {
      console.log('Error:', error)
    }
  }
  await browser.close()
}

crawl()

Setup

Now, before we really get into this, we need to make a small change to the website we're using to test this scraper. www.example.com is a single page with just one link to verify, and we need more than that to test our new features. So, I've built a simple website with multiple pages and links for you to use. You can clone it from this GitHub repository.

Once you clone the repo, you just need to install the dependencies using npm install or yarn install and start the server with npm run dev or yarn dev. The website will be available at http://localhost:3000. Going forward, we'll use this website to test our scraper.

Recursive Crawling & Setting a Crawl Depth

Implementing recursive crawling is a powerful feature that will allow the scraper to follow links and scrape multiple pages. However, with great power comes great responsibility. We'll need to implement a way to control how deep the scraper goes to avoid getting stuck in an infinite loop.

In this case we won't actually be using recursion, but rather an iterative depth-first search to crawl the website. We'll do this by creating a toVisit stack (a plain array) that keeps track of URLs that still need to be visited and a visited set to keep track of URLs we've already crawled.

We can then pull URLs off the toVisit stack in a while loop until it's empty. For each URL, we'll visit the page, extract its links, and add any new internal ones back onto the stack. We'll also add a depth parameter to control how deep we want to crawl. You can think of depth as the number of clicks away from the starting URL.
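
To make that concrete, each entry in toVisit pairs a URL with its depth. Here's the shape those entries will take (the URLs below are just for illustration):

// Each entry records how many clicks a URL is from the starting page
const toVisit = [
  { url: 'http://localhost:3000', depth: 0 }, // the starting page
  { url: 'http://localhost:3000/about', depth: 1 }, // one click away
]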

To implement this feature, we'll need to introduce a few changes to our scraper.

  1. We'll pass the starting URL and the maximum depth to the crawl function.
  2. We'll add some new variables to keep track of the URLs we need to visit and the ones we've already visited.
    1. We'll keep track of depth by adding a depth property to each URL in the toVisit stack.
  3. We'll move the actions of visiting a page, extracting links, and validating them into a loop that runs until the toVisit stack is empty.
  4. After validating each link, we'll check whether we should add it to the toVisit stack based on the current depth.

Here's what the updated scraper looks like. I've also added some helpful logging so you can watch the crawler as it runs:

const puppeteer = require('puppeteer')

// Check if a link is already scheduled for a visit
const isScheduled = (toVisit, link) => toVisit.some((l) => l.url === link)

// 1. Add parameters for the starting URL and the maximum depth
const crawl = async (baseUrl = 'http://localhost:3000', maxDepth = 1) => {
  const browser = await puppeteer.launch({ headless: true })

  // 2. Initialize the "toVisit" and "visited" variables
  const toVisit = [{ url: baseUrl, depth: 0 }]
  const visited = new Set()

  // 3. Loop over the "toVisit" set until it's empty
  while (toVisit.length > 0) {
    // Grab the next URL to visit and add it to the visited set
    const { url, depth } = toVisit.pop()
    visited.add(url)

    const page = await browser.newPage()
    console.log('--------------------------------------')
    console.log(`Crawling: ${url}`)

    await page.goto(url, { waitUntil: 'networkidle2' })

    const links = await page.evaluate(() => {
      const anchors = Array.from(document.querySelectorAll('a'))
      return anchors.map((anchor) => anchor.href)
    })

    console.log(`Found ${links.length} links.`)

    for (const link of links) {
      try {
        const res = await fetch(link)
        const message = `${res.ok ? '✅' : '❌'} - ${link}`
        console.log(message)
      } catch (error) {
        console.log('Error:', error)
      }

      // 4. Check if the link should be added to "toVisit" based on the following:
      if (
        link.startsWith(baseUrl) && // If it's internal (same domain as the base URL)
        !isScheduled(toVisit, link) && // If we haven't already added it to the stack
        !visited.has(link) && // If we haven't visited it already
        depth < maxDepth // If we haven't reached the maximum depth
      ) {
        console.log(`Adding "${link}" to stack.`)
        toVisit.push({ url: link, depth: depth + 1 })
      }
    }

    await page.close()
  }

  await browser.close()
}

crawl('http://localhost:3000', 1)

Now when you run the crawler, assuming you're using the example site, you should see something like this:

Crawling: http://localhost:3000
Found 4 links.
✅ - http://localhost:3000/
Adding "http://localhost:3000/" to stack.
✅ - http://localhost:3000/about
Adding "http://localhost:3000/about" to stack.
✅ - http://localhost:3000/contact
Adding "http://localhost:3000/contact" to stack.
✅ - http://localhost:3000/articles
Adding "http://localhost:3000/articles" to stack.
--------------------------------------
Crawling: http://localhost:3000/about
Found 5 links.
✅ - http://localhost:3000/
✅ - http://localhost:3000/about
✅ - http://localhost:3000/contact
✅ - http://localhost:3000/articles
❌ - http://localhost:3000/nowhere

We can see that the crawler is now following links and adding them to the stack based on the depth we've set. Try changing the depth parameter to see how the crawler behaves. Within each article page there is another link to comments. If you set the maxDepth to 3, you should see the crawler visit the comments page.
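
For example, to let the crawler reach those comment pages, bump the depth when you call it:

crawl('http://localhost:3000', 3)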

!!! Warning !!!

Be careful when running this script. If you run it on a large website or with a high maxDepth value, you could end up crawling thousands of pages and potentially get blocked by the website. Always be mindful of the impact your scraper has on the site you're crawling.

For more information on ethical web scraping, check out this wiki page on robots.txt.
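
If you want your crawler to check a site's robots.txt before crawling, a rough sketch of the idea could look like the following. Note that isAllowed is a hypothetical helper, not part of the tutorial code, and real robots.txt files support wildcards and per-user-agent rules, so reach for a proper parser for anything serious:

// Naive illustration only: real robots.txt parsing is more involved than this
const isAllowed = async (baseUrl, path) => {
  try {
    const res = await fetch(`${baseUrl}/robots.txt`)
    if (!res.ok) return true // no robots.txt found, assume crawling is allowed
    const text = await res.text()
    const disallowed = text
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line.toLowerCase().startsWith('disallow:'))
      .map((line) => line.slice('disallow:'.length).trim())
    return !disallowed.some((rule) => rule && path.startsWith(rule))
  } catch (error) {
    return true // if robots.txt can't be fetched, err on the side of allowing
  }
}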

Batching Requests & Adding a Delay

When validating links, it's a good idea to batch requests together to avoid overwhelming the server. This helps prevent rate limiting and keeps your scraper a good web citizen. We'll also add a delay between batches and page visits to further reduce the load on the server. The example site only has a few links, so these changes aren't strictly necessary here, but for a site with hundreds or even thousands of links, like Wikipedia, they'll save you from getting blocked.

For batching, we'll group links into groups of n and validate each group together. We still make the same number of requests overall, but only n of them are in flight at any one time, which keeps the load on the server manageable. We'll also add a delay of 1000ms between batches to give the server some breathing room.

A simple batching function could look like this:

const validateLinksInBatches = async (links, batchSize = 5) => {
  // Iterate over the list of links in batches of `batchSize`
  for (let i = 0; i < links.length; i += batchSize) {
    const batch = links.slice(i, i + batchSize)
    await Promise.all(batch.map((link) => validateLink(link)))
  }
}

For now, let's set our batch size to 2 and add a delay of 1000ms between each batch and after each page visit. We'll also need to break the link validation out into a separate function.

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

const validateLink = async (link) => {
  try {
    const res = await fetch(link)
    console.log(`${res.ok ? '✅' : '❌'} - ${link}`)
  } catch (error) {
    console.log('Error:', error)
  }
}

Putting It All Together

Below is a complete example that combines the crawling loop, depth control, batching, and delays between batches and page visits. Notice how everything is modular, making it easier to adjust individual components as needed:

const puppeteer = require('puppeteer')

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

const validateLink = async (link) => {
  try {
    const res = await fetch(link)
    console.log(`${res.ok ? '✅' : '❌'} - ${link}`)
  } catch (error) {
    console.log('Error:', error)
  }
}

const validateLinksInBatches = async (links, batchSize = 5) => {
  for (let i = 0; i < links.length; i += batchSize) {
    const batch = links.slice(i, i + batchSize)
    console.log(`Validating ${batch.length} links...`)
    await delay(1000)
    await Promise.all(batch.map((link) => validateLink(link)))
  }
}

// Check if a link is already scheduled for a visit
const isScheduled = (toVisit, link) => toVisit.some((l) => l.url === link)

const crawl = async (baseUrl = 'http://localhost:3000', maxDepth = 1) => {
  const browser = await puppeteer.launch({ headless: true })

  const toVisit = [{ url: baseUrl, depth: 0 }]
  const visited = new Set()

  while (toVisit.length > 0) {
    // Grab the next URL to visit and add it to the visited set
    const { url, depth } = toVisit.pop()
    visited.add(url)

    const page = await browser.newPage()
    console.log('--------------------------------------')
    console.log(`Crawling: ${url}`)

    await page.goto(url, { waitUntil: 'networkidle2' })

    const links = await page.evaluate(() => {
      const anchors = Array.from(document.querySelectorAll('a'))
      return anchors.map((anchor) => anchor.href)
    })

    console.log(`Found ${links.length} links.`)

    await validateLinksInBatches(links, 2)
    await delay(1000)

    for (const link of links) {
      if (
        link.startsWith(baseUrl) &&
        !isScheduled(toVisit, link) &&
        !visited.has(link) &&
        depth < maxDepth
      ) {
        console.log(`Adding "${link}" to stack.`)
        toVisit.push({ url: link, depth: depth + 1 })
      }
    }

    await page.close()
  }

  await browser.close()
}

crawl('http://localhost:3000', 1)

Now when you run the crawler, you'll notice that it validates links in batches of 2 and waits an extra second between batches and after each page visit. Congratulations! You've just built a more robust web scraper that can crawl multiple pages, control depth, and validate links efficiently. What's next? Let's wrap this up.

Conclusion

Great job! You’ve just expanded your web scraper to handle recursive crawling, controlled depth, delays, and batching. These additions make your scraper more robust and production-ready. Remember, as you build more complex scrapers, always be mindful of the ethical and legal considerations of web scraping.

Feel free to experiment with the code: maybe add error logging, or customize delays based on server response times. Your scraper is now a solid foundation to build on!

Here's some inspiration for further enhancements:

  • Save the results to a file in an organized manner.
  • Create a validated set to avoid re-validating the same links.
  • Implement a retry mechanism for failed requests (see the sketch after this list).
  • Add parallel processing with concurrency limits.
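
To give you a head start on the retry idea, here's a minimal sketch. It assumes the delay helper from earlier, and fetchWithRetry is a made-up name; the retry count and backoff are just arbitrary starting points:

// Hypothetical sketch: retry a failed request a few times before giving up
const fetchWithRetry = async (link, retries = 3) => {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fetch(link)
    } catch (error) {
      if (attempt === retries) throw error // out of attempts, surface the error
      await delay(500 * attempt) // simple backoff before the next attempt
    }
  }
}

You could then swap fetchWithRetry in for the plain fetch call inside validateLink.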

Happy coding, and may your links always validate! If you have any questions or need further tweaks, just give me a shout. I'm here to help you improve and refine your skills.

Plugs

Enjoy my content? You can read more on my blog at The Glitched Goblet or follow me on BlueSky at kaemonisland.bsky.social. I'll be updating older posts and adding new ones each week, so be sure to check back often!