2025-03-07
Welcome back! In Part 1, we built a simple web scraper that extracts and validates links from a single page. Now we're taking it to the next level. In this tutorial, we'll take the web scraper built in the first tutorial and add the following:
- Recursive crawling, so the scraper can follow links across multiple pages
- Depth control, so it doesn't wander off forever
- Batching and delays, so link validation doesn't overwhelm the server
Grab your favorite beverage and let’s level up our scraper!
Just in case you missed the first tutorial, this is the web scraper we built and will be enhancing. I've also realized that there are a few updates we need to make to the code to make it more efficient and robust. I've added comments to the areas that need updating:
const puppeteer = require('puppeteer')

const crawl = async () => {
  // Add a headless option to the launch function. It makes the crawler run in the background.
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()

  // Add a waitUntil option to the goto function to wait for the page to load.
  await page.goto('https://example.com/', { waitUntil: 'networkidle2' })

  const links = await page.evaluate(() => {
    const anchors = Array.from(document.querySelectorAll('a'))
    return anchors.map((anchor) => anchor.href)
  })

  for (const link of links) {
    try {
      const res = await fetch(link)
      const message = `${res.ok ? '✅' : '❌'} - ${link}`
      console.log(message)
    } catch (error) {
      console.log('Error:', error)
    }
  }

  await browser.close()
}

crawl()
Now, before we really get into this, we need to make a small change to the website we're using to test this scraper. www.example.com only has a single page and only one link to verify, and we need more than that to test our new features. So I've built a simple website with multiple pages and links for you to use. You can clone it from this GitHub repository.
Once you clone the repo, you just need to install the dependencies using npm install or yarn install and start the server with npm run dev or yarn dev. The website will be available at http://localhost:3000. Going forward, we'll use this website to test our scraper.
Recursive crawling is a powerful feature that allows the scraper to follow links and scrape multiple pages. However, with great power comes great responsibility: we'll need a way to control how deep the scraper goes so it doesn't end up crawling forever.
In this case we won't actually be using recursion, but rather a depth-first search to crawl the website. We'll do this by creating a toVisit list (used as a stack) that keeps track of URLs that still need to be visited and a visited set that keeps track of URLs we've already seen.
We can then process the toVisit list in a while loop until it's empty. For each URL, we'll visit the page, extract the links, and add any new internal links to the toVisit list. We'll also attach a depth value to each URL to control how deep we want to crawl. You can think of depth as the number of clicks away from the starting URL.
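Before we wire this into Puppeteer, here's a minimal sketch of the idea in plain JavaScript. The getLinks function is a hypothetical stand-in for "load the page and return its links"; the part that matters is the stack of { url, depth } entries and the visited set:

// Minimal sketch of depth-limited crawling with a stack and a visited set.
// `getLinks` is a hypothetical placeholder, not part of the real scraper below.
const crawlSketch = async (startUrl, maxDepth, getLinks) => {
  const toVisit = [{ url: startUrl, depth: 0 }]
  const visited = new Set()

  while (toVisit.length > 0) {
    const { url, depth } = toVisit.pop() // treat the array as a stack (depth-first)
    if (visited.has(url)) continue
    visited.add(url)

    const links = await getLinks(url)
    for (const link of links) {
      // Only schedule links we haven't seen, and only while under the depth limit
      if (!visited.has(link) && depth < maxDepth) {
        toVisit.push({ url: link, depth: depth + 1 })
      }
    }
  }
}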
To implement this feature, we'll need to introduce a few changes to our scraper:
1. Add parameters for the starting URL and the maximum depth to the crawl function.
2. Attach a depth property to each URL in the toVisit list.
3. Loop over the toVisit list until it's empty.
4. Add newly discovered links to the toVisit list based on the current depth.
Here's what the updated scraper looks like. I've also added some helpful logging to visualize the crawler as it runs:
const puppeteer = require('puppeteer')

// Check if a link is already scheduled for a visit
const isScheduled = (toVisit, link) => toVisit.some((l) => l.url === link)

// 1. Add parameters for the starting URL and the maximum depth
const crawl = async (baseUrl = 'http://localhost:3000', maxDepth = 1) => {
  const browser = await puppeteer.launch({ headless: true })

  // 2. Initialize the "toVisit" and "visited" variables
  const toVisit = [{ url: baseUrl, depth: 0 }]
  const visited = new Set()

  // 3. Loop over the "toVisit" list until it's empty
  while (toVisit.length > 0) {
    // Grab the next URL to visit and add it to the visited set
    const { url, depth } = toVisit.pop()
    visited.add(url)

    const page = await browser.newPage()
    console.log('--------------------------------------')
    console.log(`Crawling: ${url}`)
    await page.goto(url, { waitUntil: 'networkidle2' })

    const links = await page.evaluate(() => {
      const anchors = Array.from(document.querySelectorAll('a'))
      return anchors.map((anchor) => anchor.href)
    })
    console.log(`Found ${links.length} links.`)

    for (const link of links) {
      try {
        const res = await fetch(link)
        const message = `${res.ok ? '✅' : '❌'} - ${link}`
        console.log(message)
      } catch (error) {
        console.log('Error:', error)
      }

      // 4. Check if the link should be added to "toVisit" based on the following:
      if (
        link.startsWith(baseUrl) && // If it's internal, meaning it's on the same domain
        !isScheduled(toVisit, link) && // If we haven't already added it to the stack
        !visited.has(link) && // If we haven't visited it already
        depth < maxDepth // If we haven't reached the maximum depth
      ) {
        console.log(`Adding "${link}" to stack.`)
        toVisit.push({ url: link, depth: depth + 1 })
      }
    }

    await page.close()
  }

  await browser.close()
}

crawl('http://localhost:3000', 1)
Now when you run the crawler, assuming you're using the example site, you should see something like this:
Crawling: http://localhost:3000
Found 4 links.
✅ - http://localhost:3000/
Adding "http://localhost:3000/" to stack.
✅ - http://localhost:3000/about
Adding "http://localhost:3000/about" to stack.
✅ - http://localhost:3000/contact
Adding "http://localhost:3000/contact" to stack.
✅ - http://localhost:3000/articles
Adding "http://localhost:3000/articles" to stack.
--------------------------------------
Crawling: http://localhost:3000/about
Found 5 links.
✅ - http://localhost:3000/
✅ - http://localhost:3000/about
✅ - http://localhost:3000/contact
✅ - http://localhost:3000/articles
❌ - http://localhost:3000/nowhere
We can see that the crawler is now following links and adding them to the stack based on the depth we've set. Try changing the maxDepth argument to see how the crawler behaves. Within each article page there is another link to comments. If you set maxDepth to 3, you should see the crawler visit the comments page.
Be careful when running this script: if you run it on a large website or with a high maxDepth value, you could end up crawling thousands of pages and potentially get blocked by the website. Always be mindful of the impact your scraper has on the site you're crawling.
For more information on ethical web scraping, check out this wiki page on robots.txt.
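If you want the crawler to respect robots.txt, here's a rough sketch of one way to start. This is a naive, hand-rolled check for illustration only; fetchDisallowedPaths and isAllowed are hypothetical helpers, not part of the scraper above, and a real crawler would use a proper robots.txt parser:

// Naive robots.txt check (illustrative only; real-world rules are more nuanced).
const fetchDisallowedPaths = async (baseUrl) => {
  try {
    const res = await fetch(`${baseUrl}/robots.txt`)
    if (!res.ok) return []
    const text = await res.text()
    // Collect every "Disallow:" path, ignoring user-agent grouping for simplicity
    return text
      .split('\n')
      .filter((line) => line.toLowerCase().startsWith('disallow:'))
      .map((line) => line.slice('disallow:'.length).trim())
      .filter(Boolean)
  } catch {
    return []
  }
}

const isAllowed = (url, baseUrl, disallowedPaths) =>
  !disallowedPaths.some((path) => url.startsWith(`${baseUrl}${path}`))

You could call fetchDisallowedPaths once at the start of crawl and skip any link where isAllowed returns false before visiting or validating it.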
When validating links, it's a good idea to batch requests together to avoid overwhelming the server. This helps prevent rate limiting and keeps your scraper a good web citizen. We'll also add a delay between requests to further reduce the load on the server. The example site only has a few links, so these changes aren't strictly necessary here, but on a site with hundreds or even thousands of links, like Wikipedia, they'll save you from getting blocked.
For batching, we'll group links into batches of n and validate each batch together. This way we limit how many requests are in flight at once and reduce the load on the server. We'll also add a delay of 1000ms between batches to give the server some breathing room.
A simple batching function could look like this:
const validateLinksInBatches = async (links, batchSize = 5) => {
  // Iterate over the list of links in batches of `batchSize`
  for (let i = 0; i < links.length; i += batchSize) {
    const batch = links.slice(i, i + batchSize)
    await Promise.all(batch.map((link) => validateLink(link)))
  }
}
For now, let's set our batch size to 2 and add a delay of 1000ms between each batch and after each page visit. We'll also need to break the link validation out into a separate function.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

const validateLink = async (link) => {
  try {
    const res = await fetch(link)
    console.log(`${res.ok ? '✅' : '❌'} - ${link}`)
  } catch (error) {
    console.log('Error:', error)
  }
}
Below is a complete example that combines recursive crawling, depth control, delay between requests, and batching. Notice how everything is modular, making it easier to adjust individual components as needed:
const puppeteer = require('puppeteer')

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

const validateLink = async (link) => {
  try {
    const res = await fetch(link)
    console.log(`${res.ok ? '✅' : '❌'} - ${link}`)
  } catch (error) {
    console.log('Error:', error)
  }
}

const validateLinksInBatches = async (links, batchSize = 5) => {
  for (let i = 0; i < links.length; i += batchSize) {
    const batch = links.slice(i, i + batchSize)
    console.log(`Validating ${batch.length} links...`)
    await delay(1000)
    await Promise.all(batch.map((link) => validateLink(link)))
  }
}

// Check if a link is already scheduled for a visit
const isScheduled = (toVisit, link) => toVisit.some((l) => l.url === link)

const crawl = async (baseUrl = 'http://localhost:3000', maxDepth = 1) => {
  const browser = await puppeteer.launch({ headless: true })
  const toVisit = [{ url: baseUrl, depth: 0 }]
  const visited = new Set()

  while (toVisit.length > 0) {
    // Grab the next URL to visit and add it to the visited set
    const { url, depth } = toVisit.pop()
    visited.add(url)

    const page = await browser.newPage()
    console.log('--------------------------------------')
    console.log(`Crawling: ${url}`)
    await page.goto(url, { waitUntil: 'networkidle2' })

    const links = await page.evaluate(() => {
      const anchors = Array.from(document.querySelectorAll('a'))
      return anchors.map((anchor) => anchor.href)
    })
    console.log(`Found ${links.length} links.`)

    await validateLinksInBatches(links, 2)
    await delay(1000)

    for (const link of links) {
      if (
        link.startsWith(baseUrl) &&
        !isScheduled(toVisit, link) &&
        !visited.has(link) &&
        depth < maxDepth
      ) {
        console.log(`Adding "${link}" to stack.`)
        toVisit.push({ url: link, depth: depth + 1 })
      }
    }

    await page.close()
  }

  await browser.close()
}

crawl('http://localhost:3000', 1)
Now when you run the crawler, you'll notice that it validates links in batches of 2 and waits an extra second between batches and after each page visit. Congratulations! You've just built a more robust web scraper that can crawl multiple pages, control depth, and validate links efficiently. What's next? Let's wrap this up.
Great job! You’ve just expanded your web scraper to handle recursive crawling, controlled depth, delays, and batching. These additions make your scraper more robust and production-ready. Remember, as you build more complex scrapers, always be mindful of the ethical and legal considerations of web scraping.
Feel free to experiment with the code: maybe add error logging, or customize delays based on server response times. Your scraper is now a solid foundation to build upon!
Here's some inspiration for a further enhancement: add a validated set so the same links aren't re-validated on every page.
Happy coding, and may your links always validate! If you have any questions or need further tweaks, just give me a shout. I'm here to help you improve and refine your skills.
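As a hint for that idea, here's a minimal sketch of what a validated cache might look like, reusing the validateLink function from above. The validateLinkOnce name is just one I'm using here; treat this as a starting point rather than a finished implementation:

// Keep track of links we've already validated so repeated links are skipped.
const validated = new Set()

const validateLinkOnce = async (link) => {
  if (validated.has(link)) return // already checked on a previous page
  validated.add(link)
  await validateLink(link)
}

// Then, inside validateLinksInBatches, map over validateLinkOnce instead of validateLink:
// await Promise.all(batch.map((link) => validateLinkOnce(link)))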
Enjoy my content? You can read more on my blog at The Glitched Goblet or follow me on BlueSky at kaemonisland.bsky.social. I'll be updating older posts and adding new ones each week, so be sure to check back often!