October 2011 – Alex Kirk

preg_match, UTF-8 and whitespace

October 1, 2011 · October 3, 2011

Just a quick note: be careful when using the whitespace character class \s in preg_match when operating on UTF-8 strings.

Suppose you have a string containing a dagger symbol (†). When you try to strip all whitespace from the string like this, you will end up with an invalid UTF-8 sequence:

$ php -r 'echo preg_replace("#\s#", "", "†");' | xxd
0000000: e280

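(On a side note: xxd displays all bytes in hexadecimal representation. The dagger is encoded in UTF-8 as the three bytes e2 80 a0; the resulting string here consists of only the two bytes e2 and 80.)

\s stripped away the a0 byte. I was unaware that this byte was included in the whitespace list, but on its own it actually represents the non-breaking space (in Latin-1, for example). Without the u modifier, the pattern is matched byte by byte, so it cannot know that a0 is part of a longer character.

So use the u (PCRE_UTF8) modifier, as it makes the pattern aware of the a0 “belonging” to the dagger: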

$ php -r 'echo preg_replace("#\s#u", "", "†");' | xxd
0000000: e280 a0
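The same comparison can be written as a small script, which avoids depending on the terminal's encoding. This is a minimal sketch: the dagger is spelled out as raw bytes, and bin2hex() stands in for xxd. Whether the byte-wise variant really strips a0 seems to depend on the PCRE build and locale, so only the result with the u modifier should be treated as fixed.

```php
<?php
// U+2020 DAGGER, written as its three UTF-8 bytes: e2 80 a0
$dagger = "\xE2\x80\xA0";

// Byte-wise matching: every byte is treated as a character of its own.
$bytewise = preg_replace('/\s/', '', $dagger);

// UTF-8 aware matching: the three bytes form one (non-whitespace) character.
$utf8 = preg_replace('/\s/u', '', $dagger);

echo bin2hex($bytewise), "\n"; // e280 on affected setups, e280a0 otherwise
echo bin2hex($utf8), "\n";     // e280a0
```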

By the way, trim() doesn’t strip non-breaking spaces and can therefore safely be used for UTF-8 strings. (If you still want to trim non-breaking spaces with trim, read this comment on PHP.net)
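If non-breaking spaces should be trimmed from a UTF-8 string as well, one option is a regular expression with the u modifier. The following is only a sketch: trim_with_nbsp is a hypothetical helper name, not a PHP built-in, and it handles just ASCII whitespace plus U+00A0.

```php
<?php
// Hypothetical helper: trims ASCII whitespace and U+00A0 from both ends.
function trim_with_nbsp(string $text): string
{
    // With the u modifier, \x{00A0} refers to the code point, not the byte a0.
    $result = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text);

    // preg_replace returns null if $text is not valid UTF-8; keep the input then.
    return $result ?? $text;
}

$nbsp  = "\xC2\xA0";                         // U+00A0 encoded as UTF-8
$input = $nbsp . " \xE2\x80\xA0 " . $nbsp;   // NBSP, space, dagger, space, NBSP

echo bin2hex(trim_with_nbsp($input)), "\n";  // e280a0
echo bin2hex(trim($input)), "\n";            // c2a020e280a020c2a0
```

Plain trim() leaves the input untouched here because neither the first byte (c2) nor the last byte (a0) is in its default character list.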

Finally, here you can see which single-byte characters are matched by \s, first without and then with the u modifier:

$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#", "", chr($i));' | xxd
0000000: 090a 0c0d 2085 a0
$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#u", "", chr($i));' | xxd
0000000: 090a 0c0d 20
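One likely reason the bytes 85 and a0 are missing from the second dump: a single byte above 127 is not valid UTF-8 by itself, and with the u modifier preg_replace rejects such input and returns null, which echo prints as nothing. A minimal sketch to check this for one byte:

```php
<?php
$byte = chr(0xA0); // a lone continuation byte, not valid UTF-8 on its own

// Same pattern as in the one-liner above, applied to a single byte.
$result = preg_replace('/[^\s]/u', '', $byte);

var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```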

Functions operating just on the ASCII characters (with a byte value below 128) are generally safe, as every byte of a multi-byte UTF-8 character has its highest bit set (and is therefore 128 or above).
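To illustrate, the following sketch lists the bytes of a short string and then removes ASCII spaces with the byte-oriented str_replace. The sample string is an assumption chosen for the demonstration.

```php
<?php
$text = "a \xE2\x80\xA0 b"; // "a † b" as UTF-8 bytes

// Every byte of the dagger is 128 or above; the ASCII characters are below.
foreach (str_split($text) as $byte) {
    $kind = ord($byte) < 128 ? 'ASCII' : 'part of a multi-byte character';
    printf("%02x %s\n", ord($byte), $kind);
}

// Removing an ASCII byte can never hit one of the dagger's bytes.
$stripped = str_replace(' ', '', $text);
echo bin2hex($stripped), "\n"; // 61e280a062
```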
