web.onassar.com Archive

I can be reached at onassar@gmail.com.

For my open source work, check out github.com/onassar

Craigslist is 403-ing my requests

I was having trouble crawling (a.k.a. scraping) content from Craigslist.
At first, I was routing my requests through PHP Curler and thought the problem might be the headers I was passing. But alas, it seems to be unrelated to how the request is formed. And what's left? My IP address.

I tried running the following:

wget 'http://berlin.en.craigslist.de/search/sub?zoomToPosting=&query=april&srchType=T&minAsk=&maxAsk=600&bedrooms=1&hasPic=1'

Run from the Ubuntu VM on my OS X machine, that command goes through without a problem. The file is saved.

Running it from my AWS EC2 instance? I get 403'd with the response:

Connecting to berlin.en.craigslist.de (berlin.en.craigslist.de)||:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-03-12 18:12:24 ERROR 403: Forbidden.
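
For a quicker side-by-side check, here is a rough sketch that prints only the status code, so the same script can be run on each machine and the results compared. It assumes bash and curl are available on both boxes. The helper names fetch_status and describe_status are made up for illustration, and the "possibly an IP-based block" verdict is my guess, not something the response itself confirms.

```shell
#!/usr/bin/env bash
# Sketch: report the HTTP status a URL returns from the current machine.
# Run it on each host (local VM, EC2 instance) and compare the output.

# Single quotes keep the shell from treating '&' as a control operator.
URL='http://berlin.en.craigslist.de/search/sub?zoomToPosting=&query=april&srchType=T&minAsk=&maxAsk=600&bedrooms=1&hasPic=1'

# Hypothetical helper: discard the body and print only the status code.
# curl prints 000 when no HTTP response was received at all.
fetch_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Hypothetical helper: map a status code to a short verdict.
describe_status() {
    case "$1" in
        2??) echo "ok" ;;
        403) echo "forbidden (possibly an IP-based block)" ;;
        000) echo "no response (network or DNS failure)" ;;
        *)   echo "other ($1)" ;;
    esac
}

# Usage (performs a real request):
#   describe_status "$(fetch_status "$URL")"
```

If the headers and URL are identical and only the machine differs, a different verdict from each box points at the source IP rather than at the request itself.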

I'm guessing Craigslist blocks a range of IP addresses, EC2's among them (probably with good reason). A heads-up to anyone out there hoping to curl Craigslist from an EC2 box.