GitHub Project: PHP-MetaParser

June 14, 2012

Oh PHP-MetaParser, how many hours I spent on you... years ago.

What is this?

PHP-MetaParser was inspired by how Facebook pulled data from links when sharing them. Specifically, it would parse out the title, description and images. That's the goal of PHP-MetaParser.

It lets you provide a body of text (generally a webpage's markup) and receive metadata back once it has been parsed.

Why did I develop it?

I developed it in an attempt to duplicate what Facebook was able to do. On a long-buried project (Clearmix), I had user profiles and, naturally, comment walls where users could post comments, images, videos and links.

I wanted to add context to the links as Facebook did, so I created a backend script that would cURL the URL provided and then parse its contents for relevant information.

I believe that I looked at PHP's DOM functionality, but settled on regular expressions to extract the above-mentioned information, as well as the page's base property, its favicon and its OpenGraph tags.
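
As a rough sketch of what the regular-expression approach can look like, the snippet below pulls the title and the meta tags (description, keywords, OpenGraph) out of a string of markup. The function names and patterns are assumptions made for illustration; they are not the library's own code, and regular expressions like these will miss unusual markup, such as attribute values that contain a ">" character.

```php
<?php
// Illustrative only: these names and patterns are assumptions, not the library's own.

/**
 * Return the text of the first <title> element, or null if none is found.
 */
function sketchTitle(string $markup): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $markup, $m) === 1) {
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

/**
 * Collect every <meta> tag into an array keyed by its (lowercased)
 * property or name attribute, with the content attribute as the value.
 */
function sketchMetaTags(string $markup): array
{
    $tags = [];
    preg_match_all('/<meta\b[^>]*>/i', $markup, $metas);
    foreach ($metas[0] as $tag) {
        // Pull attr="value" pairs out of the tag, in whatever order they appear.
        preg_match_all('/([a-z:-]+)\s*=\s*(["\'])(.*?)\2/is', $tag, $attrs, PREG_SET_ORDER);
        $pairs = [];
        foreach ($attrs as $attr) {
            $pairs[strtolower($attr[1])] = $attr[3];
        }
        $key = $pairs['property'] ?? $pairs['name'] ?? null;
        if ($key !== null && isset($pairs['content'])) {
            $tags[strtolower($key)] = html_entity_decode($pairs['content'], ENT_QUOTES, 'UTF-8');
        }
    }
    return $tags;
}
```

With this sketch, OpenGraph values are simply the entries whose keys start with "og:", so they could be filtered out of the returned array afterwards.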

What's included?

This project includes one class which does not take care of the CURL action itself. For this, please see my PHP-Curler library.

This instantiable class includes the following public methods:

getBase: Returns the base property for the page. Useful for parsing links in a document.
getDescription: Returns the meta tag description for the page, if found.
getDetails: Returns an array of all the data that could be parsed from the document.
getFavicon: Returns the path to the favicon for the page. If one is not explicitly defined, it makes a best guess based on the host and base property.
getKeywords: Returns the page's meta tag keywords, if defined.
getOpenGraph: Returns OpenGraph details for the page, if defined.
getTitle: Returns the page's title, if defined.
getURL: Returns the parsed URL for the page.
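
To make the "best guess" idea behind getFavicon concrete, here is a minimal sketch of one way such a fallback could work: use the icon declared in the markup if there is one, otherwise assume /favicon.ico on the page's host. The function name and the exact fallback rule are assumptions for illustration. This sketch also ignores the base property and does not resolve relative paths, both of which the method description above mentions or implies.

```php
<?php
// Illustrative only: the function name and fallback rule are assumptions.

/**
 * Return the favicon path declared in the markup, or guess /favicon.ico
 * on the page's host when no <link rel="icon"> tag is present.
 */
function sketchFavicon(string $markup, string $pageUrl): ?string
{
    preg_match_all('/<link\b[^>]*>/i', $markup, $links);
    foreach ($links[0] as $tag) {
        // Accept both rel="icon" and the older rel="shortcut icon".
        $isIcon = preg_match('/\brel\s*=\s*(["\'])(?:shortcut\s+)?icon\1/i', $tag) === 1;
        if ($isIcon && preg_match('/\bhref\s*=\s*(["\'])(.*?)\1/i', $tag, $m) === 1) {
            return $m[2]; // may be relative; resolving it is left out of this sketch
        }
    }

    // Nothing declared: fall back to the conventional location on the host.
    $parts = parse_url($pageUrl);
    if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    return $parts['scheme'] . '://' . $parts['host'] . $port . '/favicon.ico';
}
```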

How do I use it?

This library expects that you already have the body of text that is to be parsed. By creating an instance of a MetaParser object and passing in the body of text and the URL, you can then access all of the data through the getDetails method on that instance.
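
The call shape described above can be sketched as follows. To keep the example self-contained, it uses a tiny stand-in class rather than the real MetaParser; the class name StandInMetaParser, its internals and the array keys it returns are all assumptions, and only the pattern of passing in markup and a URL and then calling getDetails comes from the description above.

```php
<?php
// Stand-in for illustration only. It mimics the call shape described above
// (markup and URL in, details array out); it is not the library's code.
final class StandInMetaParser
{
    private string $markup;
    private string $url;

    public function __construct(string $markup, string $url)
    {
        $this->markup = $markup;
        $this->url = $url;
    }

    /** Return everything this stand-in can find; the key names are assumptions. */
    public function getDetails(): array
    {
        $title = null;
        if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $this->markup, $m) === 1) {
            $title = trim($m[1]);
        }
        return ['title' => $title, 'url' => $this->url];
    }
}

// The markup would normally come from a separate HTTP fetch.
$markup = '<html><head><title>Example page</title></head></html>';
$parser = new StandInMetaParser($markup, 'http://example.com/');
$details = $parser->getDetails();
```

Keeping the fetch outside the class means the same parsing code can be fed markup from any source, including a cached copy or a test fixture.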

Why did I abstract it out?

That's an especially relevant question with this library, as it's been decoupled from the actual CURLing.

I abstracted this library out to do just the parsing because I found I was performing cURL calls elsewhere in my codebase, and I didn't want the parsing to be an inherent part of those calls.

I thought about extending the CURL library for the MetaParser class, but for a reason I can't recall right now, it didn't make sense programmatically or from a business-logic perspective.

PHP-Gravatar, in its beautiful simplicity, is next.