UTF8, JSON, Perl, and MySQL (guess where this is heading..)

Posted on January 27th, 2009 in Javascript, Perl by Patrick

Today I learned a little bit about UTF8 in the context of JSON, Perl and MySQL.

Recently one of my clients who was using the Survey wobject in WebGUI noticed their modperl server grinding to a halt. Funny how one little decision can trigger such a catastrophic failure. Funny in the sense that it takes you a few days to repair the damage and track down the root cause of the problem.

The reason modperl was grinding to a halt was easy enough to find – child processes chewing up as much memory as they could whenever a web request touched a Survey instance. The cause of that was also quickly apparent.. one of the text fields in a Survey instance was filled with ~150Mb of rubbish data. The smoking gun looked like this:

In contrast to obsessions, your worries don\x{c3}\x{83}\x{c2}\x{83}\x{c3}\x{82}\x{c2}…

An apostrophe in the text and then BAM! everything goes haywire (that \x{blah} unicode text goes on for 150Mb..)

Turns out the user who triggered the problem was cutting and pasting text from Microsoft Word. As anyone who has worked with people who cut and paste text from Word into content management systems knows, Word uses non-ASCII characters for “smart quotes”, apostrophes, dashes, triple periods, etc.. Those non-ASCII characters usually show up in HTML as garbage characters. In this case it looked like those characters were causing the Survey internals to break.

On closer inspection Survey was breaking as follows: any non-ASCII text was converted to a few garbage characters. On the next view/edit, each of these garbage characters were turned into a few more garbage characters, and so on.. which meant that in no time at all the web server had ground to a halt and the Survey table was weighing in at hundreds of Mb. Fun!

Now, I knew that JSON was supposed to be UTF8 encoded, and Survey serialises its data to JSON, so somewhere along the way the JSON string/Perl String/MySQL string encoding was breaking.

Best practices in WebGUI dictate that you work with JSON as non-encoded, non-binary, Perl UTF8 strings and then let wg modules such as WebGUI::Session::Form do the encoding for you (via the Encode module). Most of the time you don’t need to care if a string in Perl is UTF8 or not, and doing things this way means that most of the time you can remain blissfully unaware of what’s happening under the hood.

In accordance with this, Survey uses JSON::to_json/from_json everywhere. On a hunch I switched all the from_json calls to decode_json, which expects a binary, UTF8-encoded string instead of a non-binary string, and voila, the problem disappeared. So somewhere along the way our JSON-encoded string was being converted to a binary string. The question was: where?

The plot thickened when I noticed that calling Encode::decode_utf8() immediately after reading the string from mysql also removed the issue (decode_utf8 turns a binary UTF8 string into a non-binary utf8 string).

With the help of perlmonkey2 and perlDreamer on #webgui, we found the root cause.. the json-encoded string was being stored in a mysql field of type longblob as opposed to a non-binary type (such as text, mediumtext, etc..). Something was transparently converting it to a binary UTF-encoded string (perhaps DBD::mysql?), and when  Survey then pulled this binary string out the database, it was trying to use from_json() to decode it (which expects a non-binary string).

A quick change to the field type:

alter table Survey modify surveyJSON long

and all was well again with the universe, but if you’d asked me 4 hours ago what mysql field type should be used for json my first thought would probably have been a binary type, because all I knew was the JSON is UTF8 encoded. A little bit of knowledge can be a dangerous thing..