Just a quick note: be careful when using the whitespace character class \s in
preg_match or preg_replace when operating on UTF-8 strings.
Suppose you have a string containing a dagger symbol (†, U+2020, encoded in UTF-8 as the three bytes e2 80 a0). When you try to strip all whitespace from the string like this, you end up with an invalid UTF-8 sequence:
$ php -r 'echo preg_replace("#\s#", "", "†");' | xxd
0000000: e280
(On a side note: xxd displays all bytes in hexadecimal representation.)
The resulting string consists of only the two bytes e2 and 80, because
\s stripped away the a0 byte. I was unaware that this byte was included in the whitespace list, but read as a single byte (for example in Latin-1) it represents the non-breaking space.
So use the u (PCRE_UTF8) modifier, which makes the pattern treat the subject as UTF-8 and therefore aware of the
a0 byte “belonging” to the dagger:
$ php -r 'echo preg_replace("#\s#u", "", "†");' | xxd
0000000: e280 a0
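Whether \s matches the a0 byte without the u modifier can depend on the PCRE build and locale, so the following is only a sketch of the same effect that does not rely on it: it removes the byte a0 explicitly in byte mode and compares that with \s in UTF-8 mode. The helper is_valid_utf8() is a hypothetical name introduced for illustration; it relies on the fact that a pattern with the u modifier fails on invalid UTF-8 subjects.

```php
<?php
// The dagger (U+2020) written as raw bytes, so the file encoding does not matter.
$dagger = "\xe2\x80\xa0";

// Byte mode (no u modifier): the pattern matches the single byte 0xa0,
// mimicking what \s did in the example above.
$broken = preg_replace('/\xa0/', '', $dagger);

// UTF-8 mode: the subject is handled as code points, so \s only ever
// sees U+2020 as a whole and leaves it alone.
$intact = preg_replace('/\s/u', '', $dagger);

// Hypothetical helper: preg_match() with the u modifier returns false for
// invalid UTF-8 subjects, so an empty pattern works as a validity check.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

echo bin2hex($broken), ' ', var_export(is_valid_utf8($broken), true), "\n"; // e280 false
echo bin2hex($intact), ' ', var_export(is_valid_utf8($intact), true), "\n"; // e280a0 true
```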
By the way,
trim() doesn’t strip non-breaking spaces and can therefore safely be used on UTF-8 strings. (If you still want to trim non-breaking spaces with
trim, read this comment on PHP.net.)
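As one possible alternative (not necessarily what the PHP.net comment suggests), non-breaking spaces can be trimmed with a UTF-8-aware regular expression. This is a minimal sketch; trim_with_nbsp() is a hypothetical name, and U+00A0 is listed explicitly in the character class so the result does not depend on whether \s is Unicode-aware in the PCRE build at hand.

```php
<?php
// Hypothetical helper: trims whitespace plus U+00A0 from both ends of a UTF-8 string.
function trim_with_nbsp(string $s): string
{
    $result = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $s);
    // preg_replace() returns null for invalid UTF-8 input; fall back to the input.
    return $result === null ? $s : $result;
}

$nbsp   = "\xc2\xa0";      // U+00A0 encoded as UTF-8
$dagger = "\xe2\x80\xa0";  // U+2020, whose last byte is a0

// The non-breaking spaces are removed, the dagger stays intact.
echo bin2hex(trim_with_nbsp($nbsp . ' foo' . $dagger . $nbsp . "\n")), "\n"; // 666f6fe280a0

// Plain trim() leaves the dagger's bytes untouched.
echo bin2hex(trim($dagger)), "\n"; // e280a0
```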
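Finally, here you can see the single-byte characters matched by
\s, first without and then with the u modifier. (With u, a lone byte of 128 or above is not valid UTF-8, so preg_replace returns NULL and nothing is printed for it.)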
$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#", "", chr($i));' | xxd
0000000: 090a 0c0d 2085 a0
$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#u", "", chr($i));' | xxd
0000000: 090a 0c0d 20
Functions operating only on ASCII characters (byte values below 128) are generally safe, because every byte of a multi-byte UTF-8 character has its leading bit set (and is therefore 128 or above).
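The leading-bit property can be checked directly. The following sketch lists the bytes of a small sample string that are 128 or above and then applies a byte-oriented function to an ASCII character only; the sample string is an assumption chosen for illustration.

```php
<?php
$text = "a\xe2\x80\xa0b";  // "a†b" with the dagger as raw UTF-8 bytes

// Collect the bytes at or above 0x80; these all belong to the dagger.
$highBytes = [];
foreach (str_split($text) as $byte) {
    if (ord($byte) >= 0x80) {
        $highBytes[] = bin2hex($byte);
    }
}
echo implode(' ', $highBytes), "\n"; // e2 80 a0

// str_replace() works on bytes, but since it only searches for the ASCII
// byte "b" here, it cannot hit the middle of the multi-byte dagger.
$withoutB = str_replace('b', '', $text);
echo bin2hex($withoutB), "\n"; // 61e280a0
```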