I ran into a curious bug yesterday while trying to crawl an Amazon page. The error, specifically:
[Sat Dec 01 01:54:33 2012] [error] [client 10.211.55.2] htmlentities(): Invalid multibyte sequence in argument in ...
Here's the flow I was running that presented the bug:
htmlentities($str, ENT_QUOTES, "UTF-8", false);
That's when the bug popped up.
It was confusing, because I'd crawled other pages without incident.
Here's what I figured out. The Amazon page's encoding was set to ISO-8859-1. While this alone shouldn't have caused an encoding error, it's possible that they misencoded (is that a word?) some characters along the way, which then broke my call to htmlentities.
A way around this is to convert the string from one character encoding to another using PHP's iconv function: specifically, from ISO-8859-1 to UTF-8, and then run htmlentities on the result.
While googling, I stumbled on the post PHP htmlspecialchars()/htmlentities() invalid multibyte/UTF-8 gotcha with display_errors=true, which described a way to suppress the error. That hinted that the error was in fact a legitimate one.
Finally, I found the Stack Overflow post htmlentities, htmlspecialchars, and "invalid multibyte sequence", which hinted at the conversion I needed to make.
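The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code from my library: the helper name encode_latin1_page is made up for this example, and it assumes the input really is ISO-8859-1. One likely reason the original call failed is that a byte above 0x7F in ISO-8859-1 text (such as \xE9 for "é") is generally not a valid UTF-8 sequence on its own, so telling htmlentities the string is UTF-8 trips the multibyte check.

```php
<?php
// Hypothetical helper (name invented for this sketch): convert a string
// assumed to be ISO-8859-1 into UTF-8, then entity-encode it.
function encode_latin1_page($str)
{
    // iconv() returns false if the conversion fails.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $str);
    if ($utf8 === false) {
        throw new RuntimeException('iconv conversion failed');
    }

    // Same htmlentities() arguments as in the call that raised the error.
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8', false);
}

// "café" as ISO-8859-1 bytes; the lone \xE9 byte is not valid UTF-8.
$raw = "caf\xE9";
echo encode_latin1_page($raw), "\n"; // prints "caf&eacute;"
```

Hard-coding the source encoding keeps the sketch short. In a real crawler, the source encoding would more plausibly come from the page's Content-Type header or meta tag.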
I made the modifications in my short PHP-Security functions library. Long term, I think I may need to update my encode function in that library to accommodate multiple possible character encodings, but for now, I'm okay with just