PHP and Multibyte

ever messed around with umlauts or other non [a-z] letters? it’s quite horrible.

for the german speaking region there are mainly two encoding types: iso8859-1 and utf-8. the former encodes each letter with one byte by extending old 7-bit ascii with 127 more letters, amongst others also umlauts. utf-8 includes up to 32,640 more letters (ascii 0x80-0xff are used to select the range of the following byte). this is established by allowing multi-byte characters. in the case of utf-8 the maximum is two letters, but there exist utf-16 and utf-32 with up to 4 bytes per char.

so, what’s the problem? with bandnews we have different sources for our data, meaning that we receive many pages with many different encodings and have to deliver a page that follows only one encoding. we chose to use utf-8 now, because a wide range of letters from many other encodings can be displayed which are not included in iso8859-1.

now it is important that you stop using strlen and substr because it can easily happen that you split an utf-8 character into parts, and forget comparing it to anything, then. alterenatives are mb_strlen and mb_substr and all other sorts of mb_* functions. well… this does not work out of the box, you need to specify what encoding is to be expected. this can be done like this:

mb_internal_encoding("UTF-8");

all mb_* commands use this encoding if no other is specified.

still, non-utf-8 code can come through to the browser, e.g. if you receive it from the database. but there is a chance to get around this quite comfortably:

mb_http_output("UTF-8");
ob_start("mb_output_handler");

the output buffer is cleared from wrong charactes by the mb_output_handler. it is also easily possible to have the output converted to iso8859-1, just by specifying it with the mb_http_output command.
a drawback is, though, that no other output filter can be applied, such as for output compression

ob_start("ob_gzhandler");

the manual states that instead zlib compression should be used, as specified in the php.ini file or via ini_set:

ini_set ('zlib.output_compression', 'on');
ini_set ('zlib.output_handler', 'mb_output_handler');
ob_start();

note that the output-handler for ob_start has to be empty and it is moved to the config option. this sounds great, but i was not able to get it to work. well, i must admit that i did not put so much time into it because i simply decided to move the responsibility to apache: mod_deflate. you might want to modify the configuration line, as i did:

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/javascript text/css

have fun with character encoding. it works after some while. but its a lot of trial and error.