I'm working on a scraper for web pages that basically curls a page and checks the source for a title, description, favicon and its images. Sounds simple enough, but I've spent maybe 30-35 hours on it so far. I even had a head start, as I had some old crap-code lying around. But in rewriting it, I found a few good tutorials on regexes. Here's how I capture a favicon, plus a couple of good links.
Grabbing the favicon from some (X)HTML source seems like it should be simple, but it takes a bit more than you'd expect. Here's the finished code first of all:
/**
 * _parseFavicon function.
 *
 * @access private
 * @final
 * @return string
 */
final private function _parseFavicon()
{
    // generate default
    $parsed = parse_url($this->_url);
    $default = $parsed['scheme'] . '://' . $parsed['host'] . '/favicon.ico';

    // get the page links (icon attribute value leading);
    // [^-] skips hyphenated values such as apple-touch-icon
    preg_match_all('/<link.+[^-]icon.+href=[\'"](.+)[\'"]/imU',
        $this->_response, $links);
    if (empty($links[1])) {
        // get the page links (icon attribute value trailing)
        preg_match_all('/<link.+href=[\'"](.+)[\'"].+[^-]icon/imU',
            $this->_response, $links);
        if (empty($links[1])) {
            return $default;
        }
    }

    // resolve full path
    $favicon = array_pop($links[1]);
    $favicon = trim($favicon);
    $favicon = $this->_resolveFullPath($favicon, $this->getBase());
    $favicon = str_replace(PHP_EOL, '', $favicon);
    return $favicon;
}
As a quick walkthrough, here's what I'm doing:
- Generate and store the default favicon path for a URL
- Run a regex that checks for a <link> tag with an attribute like rel="icon" leading (that is, before the href)
- If none found, do the same but search for the rel="icon" attribute trailing
- If none found, return the default favicon
- Grab the last favicon found from one of the previous searches
- Trim any whitespace from it
- Resolve the full path to the favicon (in case it was referenced in the style: href="../../fav.gif")
- Strip any newlines from the path
- Return it
Two sweet advanced RegEx tutorials that I used elsewhere in my scraper are as follows: