I ran into a curious bug yesterday while trying to crawl an Amazon page. The error, specifically:
[Sat Dec 01 01:54:33 2012] [error] [client 10.211.55.2] htmlentities(): Invalid multibyte sequence in argument in ...
Here's the flow I was running that presented the bug:
- I would curl an Amazon page (using my PHP-Curl library)
- I would then want to insert that data into my database, so I would encode it first using the PHP htmlentities function (specifically, htmlentities($str, ENT_QUOTES, "UTF-8", false))
That's when the bug popped up.
It was confusing, because I'd crawled other pages without incident.
Here's what I figured out. The Amazon page's encoding was set to ISO-8859-1. That shouldn't cause a problem by itself, but I was telling htmlentities to treat the string as UTF-8. An ISO-8859-1 byte above 0x7F (an accented character, say) is generally not a valid UTF-8 sequence, which is most likely what was breaking my call to htmlentities. It's also possible that Amazon misencoded some characters on their end.
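A quick way to check this is to test a string against the charset you are about to hand to htmlentities. The snippet below is only a sketch: the sample string is made up, and it assumes the mbstring extension is loaded so that mb_check_encoding is available.

```php
<?php
// "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// 0xE9 opens a three-byte UTF-8 sequence, but no continuation bytes follow,
// so the string is not valid UTF-8. (Assumes the mbstring extension.)
$isValidUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Same call as in the post. Current PHP versions return an empty string for
// invalid input unless ENT_IGNORE or ENT_SUBSTITUTE is set.
$encoded = htmlentities($latin1, ENT_QUOTES, 'UTF-8', false);

var_dump($isValidUtf8);
var_dump($encoded);
```

If the first value is false, the bytes are not what the "UTF-8" argument claims they are, and the string needs converting before it is encoded.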
A way around this is to convert the string from one character encoding to another using the PHP iconv function, specifically from ISO-8859-1 to UTF-8, and then run the htmlentities function.
Success :)
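In code, the convert-then-encode step looks roughly like this. It is a minimal sketch rather than the actual library code: the function name encode_latin1 is hypothetical, and it assumes the input really is ISO-8859-1 and that the iconv extension is available.

```php
<?php
/**
 * Hypothetical helper: converts ISO-8859-1 input to UTF-8, then encodes it.
 * Returns false if the conversion fails.
 */
function encode_latin1($str)
{
    // Re-encode the ISO-8859-1 bytes as UTF-8. iconv returns false on failure.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $str);
    if ($utf8 === false) {
        return false;
    }

    // The string is now valid UTF-8, so the charset argument matches the data.
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8', false);
}

// prints: caf&eacute; &quot;au lait&quot;
echo encode_latin1("caf\xE9 \"au lait\""), "\n";
```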
While googling, I stumbled on the post "PHP htmlspecialchars()/htmlentities() invalid multibyte/UTF-8 gotcha with display_errors=true", which describes a way to suppress the error and hinted that it was in fact a legitimate error.
Finally, the Stack Overflow post "htmlentities, htmlspecialchars, and 'invalid multibyte sequence'" hinted at the conversion I needed to make.
I made the modifications in my short PHP-Security functions library. Long term, I may need to update the encode function in that library to accommodate multiple possible character encodings, but for now I'm okay with handling just ISO-8859-1.
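One possible shape for that longer-term change is to let the caller say which charset the input is in. This is only a sketch under assumptions: encode_with_charset and its $fromCharset parameter are hypothetical names, not the library's current API, and the charset names accepted depend on the iconv implementation installed.

```php
<?php
/**
 * Hypothetical variant of the encode function that takes the source charset.
 * Returns false if iconv cannot perform the conversion.
 */
function encode_with_charset($str, $fromCharset = 'ISO-8859-1')
{
    // Input that is already UTF-8 needs no conversion.
    if (strcasecmp($fromCharset, 'UTF-8') !== 0) {
        $str = iconv($fromCharset, 'UTF-8', $str);
        if ($str === false) {
            return false;
        }
    }

    return htmlentities($str, ENT_QUOTES, 'UTF-8', false);
}

// Both calls print: caf&eacute;
echo encode_with_charset("caf\xE9"), "\n";               // ISO-8859-1 bytes
echo encode_with_charset("caf\xC3\xA9", 'UTF-8'), "\n";  // UTF-8 bytes
```

Keeping ISO-8859-1 as the default preserves the current behaviour, while pages in other encodings can pass their own charset, for example one read from the response's Content-Type header.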