While writing my page parser, I needed to extract the title tag from a page. You wouldn't think the regex would be complicated, but there were definitely a few twists and turns. Here's the regex I'm using, with a walkthrough (this is mostly for me, so that when I look at it later I'll understand wtf I was thinking).
preg_match('/<title[^>]*>([^<]+)<\/title>/im', $this->_response, $titles);
The first thing, obviously, is to match the opening title tag. Inside the tag, [^>]* leaves room for attributes and their values (I didn't expect attributes on a title tag, but global ones such as lang or id are allowed, and I've found them on various sites).
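To see the opening-tag part in isolation, here is a small sketch. The sample strings are made up for illustration and are not taken from any real site.

```php
<?php
$pattern = '/<title[^>]*>([^<]+)<\/title>/i';

// Hypothetical inputs: a bare tag, one with an attribute, one in upper case.
$samples = [
    '<title>Plain</title>',
    '<title id="page-title">With attribute</title>',
    '<TITLE lang="en">Upper case</TITLE>',
];

$found = [];
foreach ($samples as $html) {
    // [^>]* consumes everything up to the ">" that ends the opening tag.
    if (preg_match($pattern, $html, $m)) {
        $found[] = $m[1];
    }
}

print_r($found); // all three titles are captured
```

One caveat with this approach: [^>]* does not require a space after "title", so a made-up tag like <titlebar> would also match as an opening tag.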
Then I capture the title itself. Why don't I use (.*) instead of ([^<]+)? (.*) looks like it will capture everything, but by default the dot does not match newlines. Many pages split the title tag across three lines: the first holds the opening tag, the second the title copy/string, and the third the closing tag. The expression I'm using, however, is a negated character class, which matches any character except <, newlines included :)
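A quick way to check the difference is to run both patterns against a three-line title. This is a minimal sketch and the $html fragment is a made-up example.

```php
<?php
// Hypothetical page fragment with the title spread over three lines.
$html = "<title>\n  My Page\n</title>";

// Without the s flag the dot stops at "\n", so this pattern cannot reach </title>.
$dot = preg_match('/<title[^>]*>(.*)<\/title>/i', $html, $a);

// The negated class accepts any character except "<", newlines included.
$neg = preg_match('/<title[^>]*>([^<]+)<\/title>/i', $html, $b);

var_dump($dot);        // int(0)
var_dump($neg);        // int(1)
var_dump(trim($b[1])); // string(7) "My Page"
```

Note that the raw capture still carries the surrounding newlines and indentation, so it needs a trim() before use.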
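Following this I search for the closing title tag. The i flag on the end makes the match case-insensitive, so <TITLE> works too. The m flag only changes how ^ and $ behave, and since the pattern uses neither, it has no effect here.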
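Putting the pieces together, the whole match could be wrapped up roughly like this. extract_title() is a hypothetical helper name, not part of the parser above, and the whitespace cleanup is an extra step I'm assuming you'd want for multi-line titles. The nullable return type needs PHP 7.1 or later.

```php
<?php
/**
 * Hypothetical helper: returns the page title, or null when no
 * title tag is found.
 */
function extract_title(string $html): ?string
{
    if (!preg_match('/<title[^>]*>([^<]+)<\/title>/i', $html, $m)) {
        return null;
    }
    // Collapse runs of whitespace (newlines included) into single spaces.
    return trim(preg_replace('/\s+/', ' ', $m[1]));
}

var_dump(extract_title("<TITLE>\n  Hello   World\n</Title>")); // string(11) "Hello World"
var_dump(extract_title('<p>no title here</p>'));                // NULL
```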
So why didn't I just add the s flag, which makes the dot match every character, newlines included? Well, I tried that, and in some cases it failed to work. I really don't know why.
My guess was that it was being applied to the first expression and then skipping the actual copy/string, although the s flag only changes what the dot matches, so I can't say for sure. All in all, though, I think this should capture the title nicely.
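One possible explanation for the s-flag failures, offered as an assumption rather than a diagnosis of the pages that actually failed: once the dot can cross newlines, a greedy (.*) runs all the way to the last closing title tag in the document instead of the first. The sketch below uses a made-up page with a second title inside inline SVG to show the effect, along with the lazy (.*?) variant that avoids it.

```php
<?php
// Hypothetical input: a second <title> appears later, inside inline SVG.
$html = "<title>\nPage\n</title><svg><title>Icon</title></svg>";

// Greedy: with the s flag the dot crosses newlines and runs to the LAST </title>.
preg_match('/<title[^>]*>(.*)<\/title>/is', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </title>.
preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $lazy);

var_dump(trim($greedy[1])); // includes the markup between the two titles
var_dump(trim($lazy[1]));   // string(4) "Page"
```

The negated class ([^<]+) sidesteps the issue entirely, since it can never run past the first "<".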
Note
Ignore the space in the closing title tag. It's my crappy word-wrapper throwing a space in there. I'll be fixing that shortly and will follow up with how insanely difficult word-wrapping really is if you want to get it perfect.