Oliver Nassar

So you want to build a web crawler (and maintain an index)?

September 03, 2025

A few months ago I set off on what many devs have done before: building a crawler.

In 2025, there are plenty of libraries, frameworks and third-party services that can help devs do this (taking into consideration complicated things like rate-limiting, CAPTCHAs, etc.), but what was different this time was that I wanted to maintain an index of the crawled content.

At first, I didn't think that distinction would matter all that much. Well... it turned out to be a major challenge.

First, what do I mean by an "index"?


Maintaining a website index

Crawling a resource (e.g. a webpage, image or PDF), a group of resources, a hostname or a group of hostnames presents a series of challenges: the metadata, content, headers and, most importantly, the relationships to other resources all need to be maintained.

And the relationships between these resources change over time (e.g. linking to new / different resources, redirects to new / old resources, etc.). That set of relationships between resources and their content is what I refer to as the "index". And it's a tough challenge.
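
To make "index" concrete, here's a minimal sketch of the kind of records it needs to track. This is illustrative Python; the field names are assumptions for the sake of this post, not my actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Resource:
    """A single crawled resource (webpage, image, PDF, etc.)."""
    url: str                                  # normalized URL, used as the key
    canonical_url: str | None = None          # canonical URL the resource declares, if any
    status_code: int | None = None            # HTTP status from the most recent crawl
    content_hash: str | None = None           # hash of the body, used to detect changes
    headers: dict[str, str] = field(default_factory=dict)
    metadata: dict[str, str] = field(default_factory=dict)   # title, og:image, robots, ...
    last_crawled_at: datetime | None = None

@dataclass
class Link:
    """A directed relationship between two resources, as seen at crawl time."""
    source_url: str                           # resource the relationship was found on
    target_url: str                           # resource it points to
    kind: str                                 # "anchor", "redirect", "canonical", "og:image", ...
    seen_at: datetime | None = None
```

Storing these records is the easy part; the hard part is keeping the Link rows truthful as redirects, canonicals and content change between crawls.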

But instead of detailing every challenge, edge case or rabbit hole, I'm going to list the questions below that needed to be asked, and answered, in order to accurately maintain an index.

The point being: crawling is tough, but maintaining an index of the crawled data is tougher. And while it's a "solved problem", I think reading through the list of questions below would have been helpful before I set off on this task. And mind you: the questions below are the more complicated ones; I'll be leaving out the more obvious ones (e.g. resolving relative paths when a page has a base tag defined).


Questions that you'll need to ask (and answer):

  1. How will you handle long-polling for crawler payloads while your codebase is being updated, possibly changing your caching keys?
  2. How will you handle pages that have multiple robots tags with possibly different values?
  3. If you want to show distinct open graph image thumbnails in search results, how will you tell which open graph images are unique / distinct from those of other resources?
  4. How should you handle a URL that appears the same as its declared canonical URL but with its URL params in a different order (see the normalization sketch after this list)?
  5. How should redirect loops be handled (see the loop-detection sketch after this list)?
  6. How should open-graph redirect loops be handled (e.g. a web resource resolves and loads, but it defines a different canonical URL that redirects back to the current page)?
  7. What is the maximum content payload size your server should process (being careful to handle things like zip bombs)?
  8. When a website has its content duplicated across two hostnames (e.g. @ and www), which hostname should be given priority?
  9. How will PDF content be extracted (and verified not to be corrupt)?
  10. If the crawler attempts to crawl pages sequentially, how will silent failures be detected (e.g. a page doesn't resolve, and the crawler doesn't know it hasn't resolved due to a timeout)?
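
As an example of the kind of answer question 4 demands: normalize URLs before comparing them against their declared canonical (lower-case the scheme and host, drop the fragment, sort the query params, and so on). A rough Python sketch; the normalization rules here are illustrative assumptions, not a spec:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Normalize a URL so equivalent URLs compare as equal strings."""
    parts = urlsplit(url)
    # Sort query params so ?b=2&a=1 and ?a=1&b=2 normalize identically
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Strip a trailing slash on the path (illustrative; not always safe to do)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

def matches_canonical(url: str, canonical_url: str) -> bool:
    """True if a URL and its declared canonical are the same after normalization."""
    return normalize_url(url) == normalize_url(canonical_url)

# These compare as equal despite the parameter order (and trailing slash) differing:
assert matches_canonical(
    "https://example.com/page?b=2&a=1",
    "https://example.com/page/?a=1&b=2",
)
```

Questions 5 and 6 boil down to cycle detection: follow the redirect (or canonical) hops yourself, remember what you've already visited, and cap the chain length. A hedged sketch, reusing normalize_url from above; fetch stands in for whatever HTTP client you use and is assumed to return a status code and Location header:

```python
from urllib.parse import urljoin

def resolve_redirects(fetch, start_url: str, max_hops: int = 10) -> str | None:
    """Follow a redirect chain manually, bailing out on loops or excessive length."""
    seen: set[str] = set()
    url = normalize_url(start_url)
    for _ in range(max_hops):
        if url in seen:
            return None                    # loop detected: give up on this chain
        seen.add(url)
        status, location = fetch(url)      # hypothetical HTTP call: (status_code, Location header)
        if status not in (301, 302, 303, 307, 308) or not location:
            return url                     # final destination reached
        url = normalize_url(urljoin(url, location))   # Location may be relative
    return None                            # chain too long: treat as unresolved
```

The same shape works for the canonical case in question 6: treat the declared canonical as one more hop and run it through the same visited-set check.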

Intention

The intention of this post is, largely, to document the tough problems that come with maintaining an index of crawled content.

While the crawler + index I've built is functional and being used in production (on this very site via ⌘k), the number of unexpected challenges ("unknown unknowns") that needed to be considered (and resolved) surprised me.

And I still haven't resolved them all.