Ruby 1.9's charsets FTW

I come from a Java background. For all its flaws, Java’s support for character sets is very good. All strings are Unicode (UTF-16 internally) and source files can be ASCII, ISO-8859-1 or UTF-8 as well. Ruby 1.8 is only so-so, and the earliest versions of Rails, with their iso-8859-1 default, did not help either. Fortunately the upcoming 1.9 release of Ruby will make things right with a vengeance.

Yes, I know that strings in Ruby 1.8 are plain byte sequences and can therefore contain any binary data, including UTF-8 encoded text. But you really need to set the external encoding with -K, and that same switch also sets the internal encoding, that of the actual .rb source files.

Ruby 1.9 does away with the confusion.

Internal encoding

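This is the character set used in the .rb files themselves. (Ruby’s own documentation calls this the source encoding and reserves the term internal encoding for Encoding.default_internal.) By default Ruby 1.9 assumes good old US-ASCII. New in 1.9 is the # coding: utf-8 comment. It must appear on the first line of your script (or on the second if there is a ‘shebang’ line). This is where you declare that the file is UTF-8.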

This is really helpful when you need or want to use literal UTF-8 strings. I can see this being great when I can’t be bothered to look up the escape sequence for things like é in a quick script.
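A quick sketch of what that looks like in practice. The variable name word is just something I made up for the illustration; __ENCODING__, String#encoding, String#length and String#bytesize are standard Ruby 1.9.

```ruby
# coding: utf-8
# The magic comment above has to be the first line (or the second, after a shebang).

word = "café"                 # literal UTF-8 text, no escape sequence needed

puts __ENCODING__             # the encoding Ruby assumes for this source file
puts word.encoding            # string literals take on the encoding of the file
puts word.length              # number of characters
puts word.bytesize            # number of bytes, larger because é is multi-byte
puts word == "caf\u00e9"      # compare against the escaped spelling
```

Without the magic comment, Ruby 1.9 refuses to parse the file because of the non-ASCII character in the literal.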

External encoding

This is the golden nugget of 1.9, as it defines the encoding of the IO and streams whose data ends up in your Ruby variables. Instead of -Ku (which would also set the internal encoding), the new command line option to use is -E or its verbose brother --encoding. Instead of a single-letter code you specify the full name of the encoding: --encoding=utf-8 will work great. Hopefully, this is what Rails will use when 1.9 is officially supported.
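To see the external encoding at work, here is a minimal sketch. It assumes a throwaway temp file, and read_with_encoding is a hypothetical helper name of my own, not part of Ruby. It prints the default that -E would set, then applies the same idea to a single stream through the "r:utf-8" open mode.

```ruby
require "tempfile"

# Read a whole file and tag its contents with the given external encoding.
# (read_with_encoding is a made-up helper for this example.)
def read_with_encoding(path, encoding)
  File.open(path, "r:#{encoding}") { |io| io.read }
end

# The process-wide default; this is what -E / --encoding changes.
puts Encoding.default_external

# "café" as raw UTF-8 bytes, built without relying on the source encoding.
utf8_bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")

text = nil
Tempfile.open("encoding-demo") do |file|
  file.binmode                # write the bytes exactly as given
  file.write(utf8_bytes)
  file.flush
  text = read_with_encoding(file.path, "utf-8")
end

puts text.encoding            # the encoding attached to the data we read
puts text.length              # characters, not bytes
```

Running the script as ruby -E utf-8 script.rb should make UTF-8 the default for every stream, so the per-file "r:utf-8" becomes unnecessary.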