the utf-8 saga

In the ongoing series about web development, this is a pretty unhappy installment.

In 2007 it’s clear that your complete web application should be utf-8. This is a way to encode characters for use by a computer [yes I know, strictly speaking it's an encoding of the Unicode character set]. There are several character sets in use. Letters a-z are encoded in something called ASCII, which has been in use for decades. More or less commonly used accented characters can be encoded in extended ASCII.

Regular ASCII only allows for 7 bits of information which translates to 128 characters. No accented stuff is allowed. So ASCII was extended to include an additional 128 characters. Of course, that wasn’t enough, so more character sets were thought up. One of these is iso-8859-1 and this is the one that was most popular in the 90s. It’s used all throughout the Western world.

But Eastern languages have lots more characters and the space for iso-8859-1 wasn’t big enough, so utf-8 entered the arena. This one offers a huge space and has provisions to extend that space without much trouble. So, we’re all set. utf-8 is the shit and everyone should use it.
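To make the size difference concrete, here is a small Python sketch (Python is used purely for illustration) that counts how many bytes a few characters take in each encoding. The sample characters and the helper name `encoded_length` are made up for the example.

```python
# 7 bits give 2**7 = 128 code points (ASCII); 8 bits give 256 (iso-8859-1).
ASCII_SIZE = 2 ** 7
LATIN1_SIZE = 2 ** 8

ENCODINGS = ("ascii", "iso-8859-1", "utf-8")
SAMPLES = ["a", "ä", "€", "日"]  # arbitrary sample characters


def encoded_length(ch, encoding):
    """Return how many bytes ch needs in encoding, or None if it has no slot there."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# One row per character, one column per encoding.
table = {ch: {enc: encoded_length(ch, enc) for enc in ENCODINGS} for ch in SAMPLES}

for ch, row in table.items():
    print(ch, row)
```

A plain "a" costs one byte everywhere, "ä" fits in iso-8859-1 but not in ASCII, and the last two samples only fit in utf-8, where they take several bytes each.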

Enter reality. The most popular open source database, MySQL (this site runs on it), did not offer functional utf-8 support until version 4.1, which was only released a few years ago. Only modern browsers and modern e-mail clients support utf-8.

The single largest problem, however, is working with data that evolved over many years, in the case of GO Magazine since November 2000. Back then, iso was the shit and thus the database was created in the iso format.

The smarter people reading will think “well bud, just run a converter script and be done with it”. True, in theory. But there are many factors at play. Switching from one character set to another is an all-or-nothing operation. The entire web stack needs to be converted, top to bottom. Many elements in software have default character sets, the text in them often has varying character sets and, in an imperfect world such as this one, there is always nonconforming data, i.e. stuff that simply has an illegal encoding but got stored with the good stuff anyway.

There are good pointers on the web, in particular this one from the developer of cdbaby.com. And I am managing fine, but it’s a depressing job. I dumped the MySQL database and ran iconv across it, which converted the strings. I then changed the default character sets on the db server and imported the whole lot. I remembered to set the ;charset=utf-8 header in Rails, and things would work.
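The dump-and-convert step boils down to reading bytes as iso-8859-1 and writing them back out as utf-8. Here is a minimal Python sketch of that one conversion, assuming the whole dump really is iso-8859-1. The function names are hypothetical, and this is not how iconv itself is implemented.

```python
def recode_bytes(raw, source="iso-8859-1", target="utf-8"):
    """Interpret raw bytes in the source encoding and re-encode them in the target."""
    return raw.decode(source).encode(target)


def recode_file(src_path, dst_path, source="iso-8859-1", target="utf-8"):
    """Stream a text file line by line from one encoding to another."""
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, "r", encoding=source, newline="") as src, \
         open(dst_path, "w", encoding=target, newline="") as dst:
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    # 0xE4 is "ä" in iso-8859-1; in utf-8 it becomes the two bytes 0xC3 0xA4.
    print(recode_bytes(b"M\xe4rz"))
```

One catch: decoding as iso-8859-1 never raises an error, because every byte value maps to some character. Data that was actually in another encoding passes through without complaint and comes out wrong.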

Or did they?

Upon closer examination some items weren’t converted. These turned out not to be in the iso character set (which one they were in I’m not sure). I needed to bluntly search and replace based on cold hex codes or \235 escaped bytes inside strings. Bah.
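That blunt search and replace can be sketched in Python roughly like this. Bytes 0x80 to 0x9F are invisible control codes in iso-8859-1, so in ordinary text they are a decent hint that something came from another character set. The override table is entirely hypothetical: the three quote mappings are what those bytes mean in windows-1252, and dropping \235 (octal for 0x9D) is just an example decision, not what the real data needed.

```python
# Hypothetical hand-made table: byte value -> the text it should become.
OVERRIDES = {
    0x92: "\u2019",  # right single quote in windows-1252
    0x93: "\u201c",  # left double quote in windows-1252
    0x94: "\u201d",  # right double quote in windows-1252
    0o235: "",       # the \235 byte (0x9D); dropped here as an example
}


def find_suspicious(raw):
    """Return (offset, byte value) pairs for bytes in the 0x80-0x9F range."""
    return [(i, b) for i, b in enumerate(raw) if 0x80 <= b <= 0x9F]


def decode_with_overrides(raw):
    """Decode byte by byte: use the override table first, iso-8859-1 otherwise."""
    out = []
    for b in raw:
        if b in OVERRIDES:
            out.append(OVERRIDES[b])
        else:
            out.append(bytes([b]).decode("iso-8859-1"))
    return "".join(out)


if __name__ == "__main__":
    sample = b"caf\xe9 \x93quoted\x94\235"
    print(find_suspicious(sample))
    print(decode_with_overrides(sample).encode("utf-8"))
```

Decoding in a single pass matters here. Replacing bytes in place with their utf-8 equivalents, one table entry after another, can corrupt earlier replacements, because the utf-8 form of a curly quote itself contains bytes in the 0x80 to 0x9F range.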

But then a particular problem dawned. I had converted the db, set the header in Rails but I had forgotten to set the character-set-server for MySQL! D’oh, that must be a problem right? So I put it in /etc/my.cnf and restarted the whole lot.

This did not improve things.

In fact everything that had been right was now displayed all wrong. A normal ä would now be displayed as two or even three characters, or question marks. Sigh. I switched the database default character set back to iso and the problems were gone.
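Symptoms like these are typical of bytes being stored in one encoding and read back in another. The Python sketch below reproduces that general mechanism; it is one plausible explanation, not a diagnosis of what MySQL actually did in this setup, and the function names are made up.

```python
def misread(text, stored_as="utf-8", read_as="iso-8859-1"):
    """Encode text one way, then decode the same bytes the other way."""
    return text.encode(stored_as).decode(read_as)


def repair(garbled, stored_as="utf-8", read_as="iso-8859-1"):
    """Undo misread(), provided no bytes were lost along the way."""
    return garbled.encode(read_as).decode(stored_as)


# utf-8 bytes read as iso-8859-1: one character turns into two or three.
two_chars = misread("ä")
three_chars = misread("€")

# iso-8859-1 bytes read as utf-8: the lone byte is invalid and gets replaced
# by U+FFFD, which many programs show as a question mark.
replaced = "ä".encode("iso-8859-1").decode("utf-8", errors="replace")

# Re-encoding the misread text as utf-8 doubles the damage: 4 bytes instead of 2.
double_encoded = misread("ä").encode("utf-8")

if __name__ == "__main__":
    print(len(two_chars), len(three_chars), replaced == "\ufffd", double_encoded)
```

As long as the garbling was a clean misread, `repair` gets the original back. Once a layer has substituted question marks, the original bytes are gone.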

So now I’m stuck in the middle and hesitant to go further. Clearly, many fields inside the database are utf-8. Ruby’s .is_utf8? method tells me so. The http header is set to utf-8, all browsers see the characters, inputting them works peachy too.
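A check in the spirit of `.is_utf8?` can be sketched in Python as a strict decode attempt. This is not the Rails implementation, just the same idea, and `is_utf8` and the sample values are hypothetical.

```python
def is_utf8(raw):
    """Return True if raw (bytes) is a well-formed utf-8 sequence."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical field values as they might come out of a database.
fields = [
    b"plain ascii",       # ASCII is a subset of utf-8, so this passes
    "ä".encode("utf-8"),  # proper utf-8
    b"\xe4",              # "ä" in iso-8859-1: not valid utf-8
]
results = [is_utf8(f) for f in fields]

if __name__ == "__main__":
    print(results)
```

A passing check only says the bytes are well-formed. Pure ASCII passes, and so does text that was double-encoded somewhere along the way, so it cannot prove that a field contains what was originally typed.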

But shouldn’t the MySQL character-set-server be utf-8 too?

I’m not sure why my setup works this way but I know this: I come from MySQL 3.0 and have gone through many, yes MANY migrations. GO is running on its fifth server I believe, coming from Linux, then Solaris (twice) and then Linux again. I’ve seen three major MySQL releases and each one has shuffled charsets around. Early versions didn’t even support the stuff I have now. Then the web stuff: what used to be various J2EE servlet engines is now Ruby on Rails, but I used to run custom middleware in Java too. I know for a fact it wasn’t multibyte safe [I know Java is Unicode internally, but this particular home-brewed piece of glue used a very old mysql-j].

So here I am, more or less functioning utf-8, but with some quirkiness. Hence the saga. I am sure there will be followup posts…