Add a Rate Limit to Your Website

Suppose you have a resource on the web (for example an API) that either generates a lot of load or is prone to abuse through excessive use. You want to rate-limit it, that is, allow only a certain number of requests per time period.

A possible way to do this is to use Memcache to record the number of requests received in a given time period.

Task: Only allow 1000 requests per 5 minutes

First attempt:
The naive approach would be to have a key rate-limit-1.2.3.4 (where 1.2.3.4 is the client's IP address) with an expiration time of 5 minutes (i.e. 300 seconds) and to increment it with every request. But consider this:

10:00: 250 reqs -> value 250
10:02: 500 reqs -> value 750
10:04: 250 reqs -> value 1000
10:06: 100 reqs -> value 1100 -> fails! (though there were only 850 requests in the last 5 minutes)

What's the problem?

Memcache renews the expiration time with every set, so the counter never expires while requests keep coming in.

Second attempt:
Have a new key every 5 minutes: rate-limit-1.2.3.4-${floor(minutes / 5)}. This circumvents the key expiration problem but creates another one:

10:00: 250 reqs -> value 250
10:02: 500 reqs -> value 750
10:04: 250 reqs -> value 1000
10:06: 300 reqs -> value 300 -> doesn't fail! (though there were 1050 requests in the last 5 minutes)

Solution:
Store the value for each minute separately: rate-limit-1.2.3.4-$hour$minute. When checking, query all the keys in the last 5 minutes to calculate the requests in the last 5 minutes.

Sample code:


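// Sum up the counters of all per-minute keys inside the window.
$requests = 0;
foreach ($this->getKeys($minutes) as $key) {
    // get() returns false for a missing key; the cast turns that into 0.
    $requests += (int) $this->memcache->get($key);
}

// $key still holds the last key of the loop, assumed to be the current minute's key.
$this->memcache->increment($key, 1);

if ($requests > $allowedRequests) {
    throw new RateExceededException;
}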
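To make the per-minute idea easier to try out, here is a minimal, self-contained sketch. It is not the code from php-ratelimiter: the classes ArrayCounterStore and SlidingWindowLimiter and the replay() helper are hypothetical names, and a plain array stands in for Memcache so that it runs without a server. It also rejects with >= so that exactly the allowed number of requests passes.

```php
<?php
// Hypothetical in-memory stand-in for Memcache, so the sketch runs on its own.
class ArrayCounterStore
{
    private $data = [];

    public function get(string $key): int
    {
        return $this->data[$key] ?? 0;
    }

    public function increment(string $key): void
    {
        $this->data[$key] = $this->get($key) + 1;
    }
}

class RateExceededException extends Exception {}

class SlidingWindowLimiter
{
    private $store;
    private $limit;
    private $minutes;

    public function __construct(ArrayCounterStore $store, int $limit, int $minutes)
    {
        $this->store = $store;
        $this->limit = $limit;
        $this->minutes = $minutes;
    }

    // Key for the minute that contains the timestamp, e.g. rate-limit-1.2.3.4-1006.
    private function key(string $ip, int $time): string
    {
        return 'rate-limit-' . $ip . '-' . gmdate('Hi', $time);
    }

    // Records one request; throws if the window was already full.
    public function hit(string $ip, int $now): void
    {
        // Sum the counters of the current minute and the minutes before it.
        $requests = 0;
        for ($i = 0; $i < $this->minutes; $i++) {
            $requests += $this->store->get($this->key($ip, $now - 60 * $i));
        }

        // Like the sample above, rejected requests are counted as well.
        $this->store->increment($this->key($ip, $now));

        if ($requests >= $this->limit) {
            throw new RateExceededException();
        }
    }
}

// Sends $count requests at the given time and returns how many were rejected.
function replay(SlidingWindowLimiter $limiter, int $time, int $count): int
{
    $rejected = 0;
    for ($i = 0; $i < $count; $i++) {
        try {
            $limiter->hit('1.2.3.4', $time);
        } catch (RateExceededException $e) {
            $rejected++;
        }
    }
    return $rejected;
}

$limiter = new SlidingWindowLimiter(new ArrayCounterStore(), 1000, 5);
$tenOClock = 10 * 3600; // 10:00 UTC on an arbitrary day

// The timeline from the first attempt: 850 requests in the last 5 minutes.
$rejected = [
    replay($limiter, $tenOClock, 250),       // 10:00
    replay($limiter, $tenOClock + 120, 500), // 10:02
    replay($limiter, $tenOClock + 240, 250), // 10:04
    replay($limiter, $tenOClock + 360, 100), // 10:06
];
echo implode(', ', $rejected), "\n";
```

With real Memcache you would additionally give every per-minute key an expiration time somewhat longer than the window, so that old counters clean themselves up.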

For your convenience I have open-sourced my code on GitHub: php-ratelimiter.

preg_match, UTF-8 and whitespace

Just a quick note: be careful when using the whitespace character class \s in preg_match when operating on UTF-8 strings.

Suppose you have a string containing a dagger symbol. When you try to strip all whitespace from the string like this, you will end up with an invalid UTF-8 character:

$ php -r 'echo preg_replace("#\s#", "", "†");' | xxd
0000000: e280

(On a side note: xxd displays all bytes in hexadecimal representation. The resulting string here consists of two bytes e2 and 80)

\s stripped away the a0 byte. I was unaware that this byte was included in the whitespace list, but in single-byte encodings such as ISO-8859-1 it actually represents the non-breaking space.

So use the u (PCRE_UTF8) modifier, as it makes the regex aware that the a0 "belongs" to the dagger:

$ php -r 'echo preg_replace("#\s#u", "", "†");' | xxd
0000000: e280 a0

By the way, trim() doesn't strip non-breaking spaces and can therefore safely be used for UTF-8 strings. (If you still want to trim non-breaking spaces with trim, read this comment on PHP.net)
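If you do want to trim non-breaking spaces from a UTF-8 string, one possible approach (not necessarily the one from the PHP.net comment) is a regex with the u modifier that names U+00A0 explicitly. The helper name trimWithNbsp is made up for this sketch.

```php
<?php
// Hypothetical helper: trims ASCII whitespace and U+00A0 from both ends.
function trimWithNbsp(string $text): string
{
    // With the u modifier, pattern and subject are treated as UTF-8, so
    // \x{00A0} matches the two-byte sequence c2 a0 as one character and
    // never touches a lone a0 byte inside another character.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text);
}

// Escapes are used so the file encoding does not matter (PHP 7+).
$input = "\u{00A0} \u{2020} dagger \u{00A0}";
$trimmed = trimWithNbsp($input);

// Plain trim() leaves the non-breaking space alone.
$plain = trim("\u{00A0}x");

echo bin2hex($trimmed), "\n";
```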

Finally, here you can see the single-byte characters matched by \s, first without and then with the u modifier.

$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#", "", chr($i));' | xxd
0000000: 090a 0c0d 2085 a0
$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#u", "", chr($i));' | xxd
0000000: 090a 0c0d 20

Functions operating just on the ASCII characters (with a byte value below 128) are generally safe, as every byte of a multi-byte UTF-8 character has its highest bit set (and is therefore 128 or above).
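A small sketch to check this for the dagger from above. The variable names are only for illustration.

```php
<?php
// U+2020 DAGGER, written as an escape so the file encoding does not matter (PHP 7+).
$dagger = "\u{2020}";

// Numeric value of every byte in the UTF-8 encoding of the dagger.
$bytes = array_map('ord', str_split($dagger));

// Check that every byte has its highest bit set (value >= 0x80).
$allHigh = true;
foreach ($bytes as $byte) {
    if ($byte < 0x80) {
        $allHigh = false;
    }
}

// An ASCII-only replacement therefore cannot hit a part of the dagger.
$clean = str_replace(" ", "", "a " . $dagger . " b");

echo bin2hex($clean), "\n"; // prints 61e280a062
```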

Debugging PHP on Mac OS X

I have been using Mac OS X as my primary operating system for a few years now, and only today did I find a very neat way to debug PHP code the way it is commonly done for application code (i.e. stepping through the code).

The solution is a combination of Xdebug and MacGDBp.

I have been using the PHP package by Marc Liyanage almost ever since I started working on OS X, because it's far more flexible than the PHP shipped with OS X.

Unfortunately, installing Xdebug the usual way (pecl install xdebug) doesn't work. But on the internetz you can find a solution to this problem.

Basically you need to download the source tarball and use the magic command CFLAGS='-arch x86_64' ./configure --enable-xdebug for configuring it. (The same works for installing APC by the way)


/usr/local/php5/php.d $ cat 50-extension-xdebug.ini
[xdebug]
zend_extension=/usr/local/php5/lib/php/extensions/no-debug-non-zts-20060613/xdebug.so

xdebug.remote_autostart=on
xdebug.remote_enable=on
xdebug.remote_handler=dbgp
xdebug.remote_mode=req
xdebug.remote_host=localhost
xdebug.remote_port=9000

Now you can use MacGDBp. There is an article on Particletree that describes the interface in a little more detail.

I really enjoy this method: I only fire up this external program when I want to debug some PHP code and can otherwise continue to use my small editor, so I don't have to switch to a huge IDE to accomplish the same.

Eclipse Everywhere. Buah.

It's been a little quiet lately. This is because I am working on a cute little project that I will be able to present soon. More when the time is ready.

There has been a rumor lately that Zend (developer of PHP) will release a PHP Framework. This is nothing new, there has been an IDE (Zend ) for a long time now. But it will be based on Eclipse.

Also Macromedia announced that their new Flex 2.0 environment (Flashbuilder) will be based on Eclipse.

Why on earth Eclipse?! I think this is one of the slowest IDEs available. It's based on Java, which makes it incredibly slow already, and it's so bloated that it's unbelievable.

I just can't understand why developers would use such a tool. I am not willing to buy a GHz monster PC just to have an editor running there. That's a pure waste of money and electricity. Emacs is kinda slow already but it runs on a few MHz.

Can anyone explain to me why to use such a monster?

I thought that maybe everything changed for the better by now and downloaded the whole thing. That's 100MB already. This already shows how much memory it will consume. Ok, I still started it. It took more than 2 minutes on my Powerbook G4. Hello? The features it provides are so not worth that.

I can recommend TextMate (best completion) and EditPlus (best integrated (S)FTP). These are fast, neat text editors. That's what I want.

Better code downloading with AJAX

I've been playing with Code downloading (or Javascript on Demand) a little more.

Michael Mahemoff pointed me at his great Ajaxpatterns in which he suggests a different solution:

if (self.uploadMessages) { // Already exists
return;
}
var head = document.getElementsByTagName("head")[0];
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = "upload.js";
head.appendChild(script);

Via DOM manipulation a new script tag is added to our document, loading the new script via the 'src' attribute. I have put a working example here. As you can see, this does not even need an XMLHttpRequest (XHR from here on), so it will also work in browsers that do not support it.

So why use this approach and not mine? Initially I thought that it was not as good as doing it via XHR, because with XHR you receive direct feedback (i.e. a function call) when the script has been loaded. This is per se not possible with this technique. But as in the good ol' times, a simple function call at the end of the script file will do the same job (compare the source code of the last example and this one (plus load.js)).

Using this method to load code later on also provides another "feature" (thanks to Erik Arvidsson for the hint): unlike with XHRs, Firefox also caches scripts loaded that way. There seems to be a disagreement about whether this is a bug or a feature (people complain that IE caches such requests, while it could be quite useful in this scenario).

When using dynamically generated JavaScript code you will also have to keep your HTTP headers in mind (scripts don't send them by default). The headers Cache-Control and Last-Modified will usually do (see section 6.1.2 of my thesis).
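As a rough sketch of what that could look like when the script is generated by PHP: the helper scriptCacheHeaders and the max-age of one hour are assumptions for illustration, not values from the thesis.

```php
<?php
// Hypothetical helper: builds the two header lines for a generated script.
function scriptCacheHeaders(int $lastModified, int $maxAge): array
{
    return [
        'Cache-Control: public, max-age=' . $maxAge,
        // HTTP dates are always expressed in GMT.
        'Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT',
    ];
}

// Only send headers and the script body when served by a web server.
if (PHP_SAPI !== 'cli') {
    // Assumption: the generated code changes whenever this file changes.
    foreach (scriptCacheHeaders(filemtime(__FILE__), 3600) as $line) {
        header($line);
    }
    echo 'var uploadMessages = {};';
}
```

Building the header lines in a separate function keeps them easy to check without a web server.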

The method above is also the one used by Dojo, as one of its developers (David Schontzler) commented. He says that Dojo only loads what the programmer needs, so little overhead can be expected from this project.

Also Alex Russell from Dojo left a comment about bloated JavaScript libraries. He has some good points to make about script size (read for yourself); I just want to quote the best point of his posting:

So yes, large libraries are a problem, but developers need some of the capabilities they provide. The best libraries, though, should make you only pay for what you use. Hopefully Dojo and JSAN will make this the defacto way of doing things.

So hang on for Dojo, they seem to be on a good way (coverage of Dojo to follow).

Finally I want to thank you all for your great and insightful comments!

PHP and Multibyte

ever messed around with umlauts or other non [a-z] letters? it's quite horrible.

for the german speaking region there are mainly two encoding types: iso8859-1 and utf-8. the former encodes each letter with one byte by extending old 7-bit ascii with 128 more characters, amongst others also umlauts. utf-8 covers all of unicode by allowing multi-byte characters: ascii stays one byte, and every other character is encoded as a sequence of two to four bytes, all in the range 0x80-0xff (the first byte tells how many bytes follow). umlauts take two bytes in utf-8. there also exist utf-16 (two or four bytes per char) and utf-32 (always four bytes per char).

so, what's the problem? with bandnews we have different sources for our data, meaning that we receive many pages with many different encodings and have to deliver a page that follows only one encoding. we chose to use utf-8 now, because a wide range of letters from many other encodings can be displayed which are not included in iso8859-1.

now it is important that you stop using strlen and substr because it can easily happen that you split a utf-8 character into parts, and then you can forget about comparing it to anything. alternatives are mb_strlen and mb_substr and all other sorts of mb_* functions. well… this does not work out of the box, you need to specify what encoding is to be expected. this can be done like this:

mb_internal_encoding("UTF-8");

all mb_* commands use this encoding if no other is specified.
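a small sketch of the difference, assuming the mbstring extension is loaded. the word is written with an escape (php 7+) so the encoding of the source file does not matter.

```php
<?php
mb_internal_encoding("UTF-8");

// "für": the ü is U+00FC and takes two bytes in utf-8.
$word = "f\u{00FC}r";

$bytes = strlen($word);    // counts bytes
$chars = mb_strlen($word); // counts characters

// substr cuts after the second byte, right in the middle of the ü.
$brokenCut = substr($word, 0, 2);

// mb_substr cuts after the second character.
$safeCut = mb_substr($word, 0, 2);

echo $bytes, " bytes, ", $chars, " characters\n"; // prints "4 bytes, 3 characters"
echo mb_check_encoding($brokenCut, "UTF-8") ? "valid" : "invalid", "\n"; // prints "invalid"
```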

still, non-utf-8 code can come through to the browser, e.g. if you receive it from the database. but there is a chance to get around this quite comfortably:

mb_http_output("UTF-8");
ob_start("mb_output_handler");

the output buffer is cleaned of wrong characters by the mb_output_handler. it is also easily possible to have the output converted to iso8859-1, just by specifying it with the mb_http_output command.
a drawback is, though, that no other output filter can be applied, such as the one for output compression:

ob_start("ob_gzhandler");

the manual states that instead zlib compression should be used, as specified in the php.ini file or via ini_set:

ini_set ('zlib.output_compression', 'on');
ini_set ('zlib.output_handler', 'mb_output_handler');
ob_start();

note that the output-handler for ob_start has to be empty and it is moved to the config option. this sounds great, but i was not able to get it to work. well, i must admit that i did not put so much time into it because i simply decided to move the responsibility to apache: mod_deflate. you might want to modify the configuration line, as i did:

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/javascript text/css

have fun with character encoding. it works after a while, but it's a lot of trial and error.