Multibyte error with character set encoding

December 02, 2012

I ran into a curious bug yesterday while trying to crawl an Amazon page. The error, specifically:

[Sat Dec 01 01:54:33 2012] [error] [client 10.211.55.2] htmlentities(): Invalid multibyte sequence in argument in ...

Here's the flow I was running that presented the bug:

1. I would curl an Amazon page (using my PHP-Curl library).
2. I would then want to insert that data into my database, so I would encode it first using the PHP htmlentities function (specifically, htmlentities($str, ENT_QUOTES, "UTF-8", false)).

That's when the bug popped up.
It was confusing, because I'd crawled other pages without incident.

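Here's what I figured out. The Amazon page's encoding was set to ISO-8859-1, but I was telling htmlentities to treat the string as UTF-8. In ISO-8859-1 an accented character is a single byte (é is 0xE9, for example), and a byte like that on its own is not a valid UTF-8 sequence, which is what was breaking my call to htmlentities. It's also possible that some characters on the page were misencoded (is that a word?) to begin with, but the mismatch alone is enough to trigger the error. The other pages I'd crawled were presumably UTF-8 or plain ASCII, which would explain why they went through without incident.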
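To illustrate why ISO-8859-1 bytes trip a function that expects UTF-8, here is a minimal sketch. The helper name is_valid_utf8 is hypothetical (it is not from any library mentioned in this post), and the example assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of any library: reports whether $bytes
// is well-formed UTF-8. Assumes the mbstring extension is loaded.
function is_valid_utf8($bytes)
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// "café" as ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";
// The same word as UTF-8: the é is the two bytes 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)
```

A check like this can be run on crawled data before encoding it, to spot pages that need converting first.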

A way around this is to convert the string from ISO-8859-1 to UTF-8 using the iconv PHP function, and then run htmlentities on the converted string.
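The conversion described above might look roughly like this. The function name encode_latin1 is made up for illustration; only iconv and htmlentities are real PHP functions, and the htmlentities arguments are the ones quoted earlier in the post.

```php
<?php
// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8, then encodes
// HTML entities with the same arguments as the call quoted in the post.
function encode_latin1($str)
{
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $str);
    if ($utf8 === false) {
        return false; // iconv could not convert the input
    }
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8', false);
}

// "café" in ISO-8859-1, followed by a double-quoted word.
echo encode_latin1("caf\xE9 \"Amazon\""); // caf&eacute; &quot;Amazon&quot;
```

Converting first means htmlentities only ever sees valid UTF-8, so the charset argument it is given matches the data.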

Success :)

While googling, I stumbled on the post PHP htmlspecialchars()/htmlentities() invalid multibyte/UTF-8 gotcha with display_errors=true, which describes a way to suppress the error. That hinted that it was in fact a legitimate error.

Finally, I found the Stack Overflow post htmlentities, htmlspecialchars, and "invalid multibyte sequence", which hinted at the conversion I needed to make.

I made the modifications within my short PHP-Security functions library. Long term, I may need to update the encode function in that library to accommodate other possible character encodings, but for now I'm okay with handling just ISO-8859-1.
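One possible shape for a version that handles more than one source encoding is sketched below. This is not the encode function from the PHP-Security library (its code isn't shown here); the name encode_from and the $fromCharset parameter are assumptions made for illustration. The idea is that the caller passes in whatever charset the crawled page declared.

```php
<?php
// Hypothetical sketch, not the library's actual encode function.
// $fromCharset is the charset the crawled page declared, for example
// in its Content-Type header or <meta> tag.
function encode_from($str, $fromCharset = 'ISO-8859-1')
{
    // Skip the conversion when the input is already declared as UTF-8.
    if (strcasecmp($fromCharset, 'UTF-8') !== 0) {
        $str = iconv($fromCharset, 'UTF-8', $str);
        if ($str === false) {
            return false; // iconv could not convert the input
        }
    }
    return htmlentities($str, ENT_QUOTES, 'UTF-8', false);
}

echo encode_from("caf\xE9", 'ISO-8859-1'), "\n";  // caf&eacute;
echo encode_from("caf\xC3\xA9", 'UTF-8'), "\n";   // caf&eacute;
```

Taking the charset as a parameter avoids guessing at the encoding from the bytes, at the cost of trusting whatever the page declares.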