I was having trouble crawling (a.k.a. scraping) content from Craigslist.
At first I was routing my requests through PHP Curler and thought the problem might be the headers I was passing. But alas, it seems to be unrelated to how the request is formed. So what's left? My IP address.
I tried running the following:
wget 'http://berlin.en.craigslist.de/search/sub?zoomToPosting=&query=april&srchType=T&minAsk=&maxAsk=600&bedrooms=1&hasPic=1'
Run from the Ubuntu VM on my OS X machine, that request goes through without a problem and the file is saved.
Running it from my AWS EC2 instance? I get 403'd with the response:
Connecting to berlin.en.craigslist.de (berlin.en.craigslist.de)|208.82.236.225|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-03-12 18:12:24 ERROR 403: Forbidden.
I'm guessing Craigslist blocks whole ranges of IP addresses, EC2's among them (probably with good reason). A heads-up to anyone out there hoping to curl Craigslist from an EC2 box.
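If the guess about blocked ranges is right, one way to sanity-check a given machine is to test whether its public IP falls inside a known list of CIDR blocks. Below is a minimal Python sketch of that membership test. The function name `in_any_range` and the list `HYPOTHETICAL_BLOCKED_RANGES` are my own inventions for illustration, and the ranges are placeholder documentation addresses, not anything Craigslist is known to block. AWS publishes its address ranges as a JSON file, which could be used as a real input, but this sketch does not fetch it.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges (RFC 5737).
# These are NOT real EC2 ranges and NOT a list Craigslist is known to block;
# swap in real prefixes before drawing any conclusions.
HYPOTHETICAL_BLOCKED_RANGES = ["203.0.113.0/24", "198.51.100.0/24"]


def in_any_range(ip, cidr_blocks):
    """Return True if `ip` falls inside at least one of the CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(block) for block in cidr_blocks)


if __name__ == "__main__":
    # Check one address inside the placeholder ranges and one outside them.
    for candidate in ("203.0.113.42", "192.0.2.7"):
        print(candidate, in_any_range(candidate, HYPOTHETICAL_BLOCKED_RANGES))
```

A match would only show that the address belongs to a listed range. It would not prove that the 403 is caused by an IP block, since only Craigslist knows what it actually filters on.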