PHP and Multibyte

ever messed around with umlauts or other non [a-z] letters? it's quite horrible.

for the german speaking region there are mainly two encoding types: iso8859-1 and utf-8. the former encodes each letter with one byte by extending old 7-bit ascii with 127 more letters, amongst others also umlauts. utf-8 includes up to 32,640 more letters (ascii 0x80-0xff are used to select the range of the following byte). this is established by allowing multi-byte characters. in the case of utf-8 the maximum is two letters, but there exist utf-16 and utf-32 with up to 4 bytes per char.

so, what's the problem? with bandnews we have different sources for our data, meaning that we receive many pages with many different encodings and have to deliver a page that follows only one encoding. we chose to use utf-8 now, because a wide range of letters from many other encodings can be displayed which are not included in iso8859-1.

now it is important that you stop using strlen and substr because it can easily happen that you split an utf-8 character into parts, and forget comparing it to anything, then. alterenatives are mb_strlen and mb_substr and all other sorts of mb_* functions. well… this does not work out of the box, you need to specify what encoding is to be expected. this can be done like this:


all mb_* commands use this encoding if no other is specified.

still, non-utf-8 code can come through to the browser, e.g. if you receive it from the database. but there is a chance to get around this quite comfortably:


the output buffer is cleared from wrong charactes by the mb_output_handler. it is also easily possible to have the output converted to iso8859-1, just by specifying it with the mb_http_output command.
a drawback is, though, that no other output filter can be applied, such as for output compression


the manual states that instead zlib compression should be used, as specified in the php.ini file or via ini_set:

ini_set ('zlib.output_compression', 'on');
ini_set ('zlib.output_handler', 'mb_output_handler');

note that the output-handler for ob_start has to be empty and it is moved to the config option. this sounds great, but i was not able to get it to work. well, i must admit that i did not put so much time into it because i simply decided to move the responsibility to apache: mod_deflate. you might want to modify the configuration line, as i did:

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/javascript text/css

have fun with character encoding. it works after some while. but its a lot of trial and error.