A few months ago I set off on what many devs have done before: building a crawler.
In 2025 there are plenty of libraries, frameworks and third-party services that can help devs do this (taking care of complicated things like rate limiting, captchas, etc.), but what was different this time was that I wanted to maintain an index of the crawled content.
At first, I didn't think that distinction would matter much. It turned out to be a major challenge.
First, what do I mean by an "index"?
Maintaining a website index
Crawling a resource (e.g. a webpage, image or PDF), a group of resources, a hostname or a group of hostnames presents a series of challenges: the metadata, content, headers and, most importantly, the relationships to other resources all need to be maintained.
And those relationships change over time (e.g. links to new or different resources, redirects to new or old resources). That collection of resources, their content and the relationships between them is what I refer to as the "index". And maintaining it is a tough challenge.
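To make that concrete, here's a rough sketch (in TypeScript) of the kind of record an index might keep per resource. The shape and field names are illustrative, not my actual schema:

```typescript
// An illustrative shape for one indexed resource. Field names are
// hypothetical; a real index tracks considerably more than this.
interface IndexedResource {
  url: string;                      // normalized URL of the resource
  canonicalUrl?: string;            // canonical URL it declares, if any
  contentType: string;              // e.g. "text/html", "application/pdf"
  contentHash: string;              // hash of the extracted content, for change detection
  headers: Record<string, string>;  // response headers worth keeping
  outboundLinks: string[];          // URLs this resource links to
  redirectsTo?: string;             // where this URL redirects, if it does
  lastCrawledAt: Date;              // when the crawler last saw it
}
```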
But instead of detailing every challenge, edge-case or rabbit-hole, I'm going to write a list of questions below that needed to be asked, and answered, in order to accurately maintain an index.
The point being: crawling is tough, but maintaining an index of the crawled data is tougher. And while it's a "solved problem", I think reading through the list of questions below would have been helpful before I set off on this task. And mind you: the questions below are the more complicated ones; I'll be leaving out the more obvious ones (e.g. resolving relative paths when a page has a <base> tag defined).
Questions that you'll need to ask (and answer):
- How will you handle long-polling for crawler payloads while your codebase is being updated, possibly changing your caching keys?
- How will you handle pages that have multiple robots tags with possibly different values?
- If you want to show distinct Open Graph image thumbnails in search results, how will you tell which Open Graph images are actually distinct from those of other resources?
- How should you handle a URL that appears the same as its defined canonical URL, but with its URL params in a different order (see the normalization sketch after this list)?
- How should redirect loops be handled (see the loop-detection sketch after this list)?
- How should Open Graph redirect loops be handled (e.g. a web resource resolves and loads, but it defines a different canonical URL that redirects back to the current page)?
- What is the maximum content payload size your server should process, being careful to handle things like zip bombs (see the fetch-guardrails sketch after this list)?
- When a website has its content duplicated across two hostnames (e.g. the apex @ and the www subdomain), which hostname should be given priority?
- How will PDF content be extracted (and verified not to be corrupt)?
- If the crawler attempts to crawl pages sequentially, how will silent failures be detected (e.g. a page doesn't resolve, and the crawler doesn't know it hasn't resolved due to a timeout)? The fetch-guardrails sketch after this list touches on this too.
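A few of the questions above lend themselves to small code sketches. For the canonical-URL question, one approach (a sketch, not necessarily what my crawler does) is to normalize both URLs before comparing them, for example by sorting the query parameters:

```typescript
// Normalize a URL so that query-param order doesn't affect equality checks.
// A sketch only: real normalization also has to consider trailing slashes,
// default ports, tracking params, case sensitivity, etc.
function normalizeUrl(input: string): string {
  const url = new URL(input);
  const sorted = new URLSearchParams(
    [...url.searchParams.entries()].sort(([a], [b]) => a.localeCompare(b))
  );
  url.search = sorted.toString();
  return url.toString();
}

// These compare equal once normalized, despite the differing param order.
console.log(
  normalizeUrl("https://example.com/page?b=2&a=1") ===
    normalizeUrl("https://example.com/page?a=1&b=2") // true
);
```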
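For redirect loops (including the canonical variety), a common defence is to follow redirects manually, remember every URL seen in the chain, and stop on a repeat or a hop limit. This sketch assumes a fetch implementation that exposes the Location header when redirect: "manual" is set (Node 18+ does):

```typescript
// Follow redirects manually so loops and overly long chains can be detected.
async function resolveWithLoopDetection(start: string, maxHops = 10): Promise<string> {
  const seen = new Set<string>();
  let current = start;
  for (let hop = 0; hop < maxHops; hop++) {
    if (seen.has(current)) throw new Error(`Redirect loop at ${current}`);
    seen.add(current);
    const res = await fetch(current, { redirect: "manual" });
    const location = res.headers.get("location");
    if (res.status >= 300 && res.status < 400 && location) {
      current = new URL(location, current).toString(); // resolve relative Location headers
      continue;
    }
    return current; // final, non-redirecting URL
  }
  throw new Error(`Too many redirects starting from ${start}`);
}
```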
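And for the payload-size and silent-failure questions, the rough idea is to put explicit guardrails around every request: an abort timeout, so a hung request becomes an error the crawler can record and retry, and a byte cap enforced while streaming the body, so an oversized (or zip-bomb-like) response is cut off early. The limits below are illustrative, not recommendations, and AbortSignal.timeout needs Node 18+ or a modern browser:

```typescript
// Fetch with an explicit timeout and a hard cap on downloaded bytes.
async function fetchWithGuardrails(
  url: string,
  timeoutMs = 15_000,
  maxBytes = 5 * 1024 * 1024
): Promise<Uint8Array> {
  const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  if (!res.ok || !res.body) throw new Error(`Fetch failed: ${res.status} for ${url}`);

  // Stream the body, bailing out as soon as the cap is exceeded.
  const chunks: Uint8Array[] = [];
  let received = 0;
  const reader = res.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    received += value.byteLength;
    if (received > maxBytes) {
      await reader.cancel();
      throw new Error(`Response exceeded ${maxBytes} bytes for ${url}`);
    }
    chunks.push(value);
  }

  // Concatenate the streamed chunks into one buffer.
  const out = new Uint8Array(received);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```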
Intention
The intention of this post is, largely, to document the tough problems that come with maintaining an index of crawled content.
While the crawler + index I've built is functional and being used in production (on this very site, via ⌘k), the number of unexpected ("unknown unknowns") challenges that needed to be considered (and resolved) surprised me.
And I still haven't resolved them all.