Subversion: The Magic of Merging

When programming professionally, Subversion is a must-have. The same goes for system administration: it's a good idea to keep your configuration files (e.g. on Linux the whole /etc/ directory) as a Subversion checkout.

So the goal of Subversion (or any other source control system) is to let you do what Apple will introduce with its new Leopard operating system as Time Machine: go back in time and restore a version of a file as it was on day x.

Using Subversion on a daily basis is quite easy. Just check in (svn ci) your changes after you have completed a certain task. When you work collaboratively and someone else has committed changes, you run svn up and their changes are applied to your working copy.

That's all you basically need. But how can you go back in time now?

So you poke around a bit and find that svn up has a parameter -r which lets you put your checkout into the state it was in at a certain revision.

Let's suppose we know that something was OK on Monday and is not today. So let's use the command from above to see what it looked like.

~/project/trunk$ svn up -r {2006-10-09} app.php
U app.php

Voila, there it is. Now we decide to use that code and throw away all changes that have been committed since. We modify the file a bit and check it in:

~/project/trunk$ svn ci -m "revert to monday" app.php
Sending app.php
svn: Commit failed (details follow):
svn: Your file or directory 'app.php' is probably out-of-date
svn: The version resource does not correspond to the resource within the transaction. Either the requested version resource is out of date (needs to be updated), or the requested version resource is newer than the transaction root (restart the commit).

Uh... OK. You probably know that error message already. It is also returned when you try to check in a file that someone else has changed since your last svn up.

When you check something into a Subversion repository, one of the basic rules is that the file you want to commit is "up to date", i.e. the revision number of your local file (updated by svn up) equals the number in the repository (on the server).

OK, so let's update our checkout so we can re-run the check-in.

~/project/trunk$ svn up
G app.php

So you discover that the changes made since Monday have been merged back into the file. Subversion may even have alerted you to a conflict, because you changed some lines that have also been modified since Monday.

Great! Basically we are back to where we started.

Let's not give up here, but rather use the appropriate command: svn merge. That command is mostly known for merging changes from one branch of development to another. But it can also help you to go back in time.

svn merge takes a revision range, which defines the changes to be merged, and a source, which defines the part of the Subversion repository that is searched for those changes.

Usually one would find this command used in a way like:

~/project/trunk$ svn merge -r 15:26 ../branches/first_release/
G app.php

So with two revisions specified you define a range of changes that should be merged into the current checkout. So how does this help us here?

You can also specify the revisions backwards to go back in time. So to undo the command from before you can write:

~/project/trunk$ svn merge -r 26:15 ../branches/first_release/
G app.php

To put it simply, Subversion generates a diff behind the scenes that contains the changes between the given revisions. The changes are then merged into the files in the same way the patch command (Linux, Unix, OS X, …) does it. Going back in time corresponds to the -R parameter of patch, which applies the patch in the reverse direction. Voila.

So as a final solution this leaves us with:

~/project/trunk$ svn merge -r head:{2006-10-09} .
U app.php
~/project/trunk$ svn ci -m "revert to monday" app.php
Sending app.php
Transmitting file data .
Committed revision 27.

For further questions, the Subversion FAQ is a good starting point when you know exactly what you want (i.e. the correct terminology). For example, in Subversion "reverting" does not mean going back to a previous version of a file, but rather discarding the changes you made locally.

There is also the Subversion book (also published by O'Reilly), of which the Guided Tour is a good starting point.

The process I described above by trial and error is also described in that book under Undoing changes.

Also, OSCON: Subversion Best Practices, Brad Choate's transcript of a talk given by the Subversion creators (Ben Collins-Sussman and Brian W. Fitzpatrick), has some good tips.

Have fun :)

JavaScript Tricks And Good Programming Style

Note that this is an updated version. Original version can be found here.

Thanks to the commenters I have updated this post with some better tricks.

In a loose series I'd like to point out a few of the habits that I consider good practice. As I am currently mostly programming in JavaScript, I will write most of my samples in that language; some of the tricks I mention also only apply to JavaScript. But most of them apply to most programming languages around.

Optional parameter and default value #
When defining a function in PHP you can declare optional parameters by giving them a default value (something like function myfunc($optional = "default value") {}).

In JavaScript it works a bit differently:

var myfunc = function(optional) {
    if (typeof optional == "undefined") {
        optional = "default value";
    }
    alert(optional);
};

This is a clean way to do it. In general I recommend using the typeof operator for checks like this.

update
Michael Geary (his comment) pointed out this solution that I like.


var myfunc = function(optional) {
    if (optional === undefined) {
        optional = "default value";
    }
    alert(optional);
};

The other solutions mentioned (if (!optional), optional = optional || "default value", and the like) have problems when you pass 0 (zero) or null as an argument: both are falsy, so they are replaced by the default.

Commenters said that the 0/null problem is not a real one because you would not use the pattern in such a situation. I disagree. In an AJAX world where you serialize data back to a server/database, a 0/1 to false/true mapping often has to be established, so for default values this matters.

In case you just need to make sure that an object is not null, I do prefer the mentioned

myobject = myobject || { animal: "dog" };

end update
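To make the difference between the variants discussed in this update concrete, here is a small sketch. The function names withOr and withUndefinedCheck are made up for the illustration, and each returns the value it ends up with instead of calling alert, so the result can be inspected.

```javascript
// Variant 1: the "||" shortcut replaces every falsy argument.
function withOr(optional) {
    optional = optional || "default value";
    return optional;
}

// Variant 2: only a missing (undefined) argument gets the default.
function withUndefinedCheck(optional) {
    if (optional === undefined) {
        optional = "default value";
    }
    return optional;
}

console.log(withOr(0));                 // "default value" (the 0 is lost)
console.log(withUndefinedCheck(0));     // 0
console.log(withUndefinedCheck(null));  // null
console.log(withUndefinedCheck());      // "default value"
```

So the || form is fine for objects, but for numbers and booleans the explicit check is the safer choice.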

Parameters Hints #
The larger your app gets, the more functions you get which you would use throughout the app. It also creates a problem with maintenance. As each function can contain multiple arguments it is not unlikely that you forget what those parameters were for (especially for boolean variables) or mix up their sequence (I am especially gifted for that).

So what I do is this (update: the variables are now replaced with comments):

var myfunc2 = function(title, enable_notify) {
// [...]
}
myfunc2(/* title */ "test", /* enable_notify */ true);

The original version of this trick wrote assignments into the call (myfunc2(title = "test", enable_notify = true)), which relies on the fact that in most programming languages the return value of an assignment is the assigned value. (This is something that you should also maintain in your own app: database storage calls, for example, should return the stored value. It's minimal effort and you might be glad at some point that you did it.)

If you do this you can see at any point in the code what parameters the function takes. Of course this is not always useful, but for functions with many parameters it becomes very useful.

Search JavaScript documentation #
When I need some documentation for JavaScript I use the Mozilla Developer Center (MDC). To quickly search for toLocaleString, I use Google: http://google.com/search?q=toLocaleString+mdc

As I am a German speaker I also use the excellent (though a bit outdated) JavaScript section of SelfHTML. I use the downloaded version on my own computer for even faster access.

The self variable #

update
… should be avoided. Even if someone like Douglas Crockford (creator of JSON) uses it and calls it that.

Let me quote Jack Slocum who put it best:

// used to fix "this" prob with Function.apply to give call proper scope
// nice method to put in your lib
function delegate(instance, method) {
    return function() {
        return method.apply(instance, arguments);
    };
}

function Animal(name) {
    this.name = name;
    this.hello = function() {
        alert("hello " + this.name);
    };
}

var dog = new Animal("Jake");
var button = {
    onclick : delegate(dog, dog.hello)
};
button.onclick();

I removed my code as it can be considered obsolete by this.
end update
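A quick way to see what delegate changes, without needing a browser, is the following sketch. The greeting method and the two button objects are hypothetical names for this illustration, and the method returns a string instead of alerting.

```javascript
function delegate(instance, method) {
    return function () {
        return method.apply(instance, arguments);
    };
}

function Animal(name) {
    this.name = name;
    this.greeting = function () {
        return "hello " + this.name;
    };
}

var dog = new Animal("Jake");

// Assigned directly, the method runs with "this" set to the button object.
var plainButton = { name: "button", onclick: dog.greeting };
// Wrapped with delegate, "this" is always the dog.
var delegatedButton = { name: "button", onclick: delegate(dog, dog.greeting) };

console.log(plainButton.onclick());      // "hello button"
console.log(delegatedButton.onclick());  // "hello Jake"
```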

Reduce indentation amount #

update
I have removed the code because it leads people into believing something different than I meant. So let me put it differently:

What I am opposing is white space deserts. If you have many levels of indentation then probably something is wrong.

If a for loop only applies to a handful of cases, don't indent the whole loop in an if clause but rather catch the other cases at the top.
Often it is advisable to move longer functionality to a function (there is a good reason for that name) that you call throughout a loop.
end update
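As a rough sketch of both points (catching the other cases at the top, and moving the per-item logic into a function), something like the following could work. The names greetingFor and collectGreetings are invented for this example.

```javascript
// The per-item logic lives in its own function and bails out early.
function greetingFor(animal) {
    if (typeof animal !== "string") return null;
    if (animal !== "cat") return null;
    return "hello " + animal;
}

// The loop skips uninteresting items at the top instead of nesting.
function collectGreetings(animals) {
    var greetings = [];
    for (var i = 0, ln = animals.length; i < ln; i++) {
        var greeting = greetingFor(animals[i]);
        if (greeting === null) continue;
        greetings.push(greeting);
    }
    return greetings;
}

console.log(collectGreetings(["dog", "cat", undefined]));  // [ 'hello cat' ]
```

Neither function needs more than two levels of indentation, no matter how many special cases are added at the top.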

That's all for now, to be continued. Further readings on this blog:

update
Even though some commenters disagreed with what I said, I think posts like this are very much needed in the blogosphere. Even if they are not free of errors on the first take, great people can help improve them. I would appreciate it if more people took that risk.
end update

JavaScript Tricks And Good Programming Style – Original Version

Note that there is an updated version

I have been programming for about 10 years now, and I am always looking to improve my code. Over time I have picked up a few habits that I consider to be good practices and that increase the quality of my code.

In a loose series I'd like to point out a few of them. As I am currently mostly programming in JavaScript, I will write most of my samples in that language; also some of the tricks I mention only apply to JavaScript. But most of them apply to most programming languages around.

Optional parameter and default value #
When defining a function in PHP you can declare optional parameters by giving them a default value (something like function myfunc($optional = "default value") {}).

In JavaScript it works a bit differently:

var myfunc = function(optional) {
    if (typeof optional == "undefined") {
        optional = "default value";
    }
    alert(optional);
};

This is a clean way to do it. In general I recommend using the typeof operator. Some people would do the above with an if (!optional), but my version works cross-browser (e.g. Safari will throw an error when you try to negate null).

Parameters Hints #
The larger your app gets, the more functions you get which you would use throughout the app. It also creates a problem with maintenance. As each function can contain multiple arguments it is not unlikely that you forget what those parameters were for (especially for boolean variables) or mix up their sequence (I am especially gifted for that).

So what I do is this:

var myfunc2 = function(title, enable_notify) {
// [...]
}
myfunc2(title = "test", enable_notify = true);

This piece of code relies on the functionality of programming languages that the return value of an assignment is the assigned value. (This is something that you should also maintain in your app, for example with database storage calls, give the assignment value as a return value. It's minimal effort and you might be glad at some point that you did it).

If you do this you can see at any point in the code, what parameters the function takes. Of course this is not always useful, but especially for functions with many parameters it gets very useful.

Also be careful: the assignments overwrite variables of the same name in the scope from which you are calling the function. You might mini-namespace the variables, e.g. with letter+underscore (p_title, p_enable_notify).

Search JavaScript documentation #
When I need some documentation for JavaScript I use the Mozilla Developer Center (MDC). To quickly search for toLocaleString, I use Google: http://google.com/search?q=toLocaleString+mdc

As I am a German speaker I also use the excellent (though a bit outdated) JavaScript section of SelfHTML. I use the downloaded version on my own computer for even faster access.

The self variable #
This technique comes from Private Members in JavaScript by Douglas Crockford. By assigning a value in a function to this.value it will be publicly accessible afterwards.

function Animal(name) {
    this.name = name;
    var self = this;
    this.hello = function() {
        alert("hello " + self.name);
        //alert("hello " + this.name); // would fail
    };
}
var dog = new Animal("Jake");
var button = {
    onclick: dog.hello
};
button.onclick();

The cause of this problem is that the this keyword receives different values in different contexts. See here for a closer explanation.

The problem with this solution is that I am not absolutely sure whether it creates a memory leak in Internet Explorer.

Reduce indentation amount #
One of the most annoying things I find in other people's code is this: (multiple) nested if clauses. Something like this:

var arr = ["dog", "cat"];
var action = 'greet';
for (i = 0, ln = arr.length; i < ln; i++) {
    animal = arr[i];
    if (animal == "cat") {
        alert("hello " + animal);
    }
}

This is only a short example, but I often saw this going deep into 10 levels of nested clauses. I suggest using the break and continue (and next in Perl):

var arr = ["dog", "cat"];
for (i = 0, ln = arr.length; i < ln; i++) {
    animal = arr[i];
    if (animal != "cat") continue;
    alert("hello " + animal);
}

This accomplishes the same with only one level of indentation. One more example for a function:

function greet_animal(animal) {
    if (typeof animal == "undefined") return;
    if (animal != "cat") return;
    alert("hello " + animal);
}

JavaScript is one of the few languages where you can leave the return value empty (i.e. typeof greet_animal() == "undefined"). You might prefer to use return false so that you can easily determine whether the function failed for some reason.

Firefox 1.5, XmlHttpRequest, req.responseXML and document.domain

Recently I have been working on a web application, extending it with an iframe on another subdomain.

When you set up communication with an iframe on another subdomain, it works by setting document.domain in both pages. Pretty nice and straightforward.
But it can mess up the rest of your page.

Once you have set document.domain you should still be able to do an XHR to your original domain according to the same-origin policy.

This will work in IE, Safari, and Opera.
This will not work in Firefox 1.0. This is very awkward but at least it has been fixed in 1.5.
So it will work in Firefox 1.5. But:

The responseXML object is useless. You can't access it: you receive a Permission Denied when trying to access its content (e.g. documentElement). Very annoying.
Even stranger is that responseText is still readable. What's the reason for this? Is there some security risk I am unaware of, or is it a plain bug?

As the responseText is available there is a pretty simple fix: re-parse the XML, which is kind of stupid and CPU-intensive if you have a lot of responses. Something like:

var doc = (new DOMParser()).parseFromString(req.responseText, "text/xml");
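A slightly more defensive sketch of the same workaround could try responseXML first and only re-parse when access fails. The helper name responseDocument and its createParser argument are my own invention for this illustration; in a browser you would pass function () { return new DOMParser(); }.

```javascript
// Returns a usable XML document for a finished request.
// "req" is the XMLHttpRequest; "createParser" returns a DOMParser-like object.
function responseDocument(req, createParser) {
    try {
        var doc = req.responseXML;
        if (doc && doc.documentElement) {
            return doc;
        }
    } catch (e) {
        // Access was denied (the Firefox 1.5 case described above): fall through.
    }
    // responseText is still readable, so parse it again.
    return createParser().parseFromString(req.responseText, "text/xml");
}
```

This way browsers that handle responseXML correctly do not pay for the second parse.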

I have some sample code available here.

Apparently a bug report has been filed against 1.5.0.1. No response from developers. Great.
Unfortunately it has only been filed for OS X, but it also affects Firefox on Windows.

Mozilla guys, fix this ASAP.

Update 2007-06-21: Things seem to start moving, we will likely have a fix for Firefox 3.

Misuse of the Array Object in JavaScript

There is a very good post about Associative Arrays considered harmful by Andrew Dupont.

The title is a bit misleading but correct. When you come across a piece of JavaScript like this
foo["test"] = 1;
there is nothing wrong with it. It's the basic usage pattern of associative arrays. Or should I rather say objects?

In languages such as PHP, using arrays like this, $foo = array("test" => 1);, is perfectly correct.

In JavaScript
var foo = new Array();
foo["test"] = 1;

works but does not do what you want.

I don't need to repeat Andrew's really great post, but basically you should use Object instead of Array.

var foo = new Object(); // same as var foo = {};
foo["test"] = 1; // same as foo.test = 1;
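To see why the Array version "does not do what you want", a short sketch helps. The variable names list and map and the helper countOwn are made up for this example.

```javascript
var list = new Array();
list["test"] = 1;   // sets a property on the array object, not an element

var map = {};
map["test"] = 1;

// Counts the object's own enumerable properties.
function countOwn(obj) {
    var count = 0;
    for (var key in obj) {
        if (Object.prototype.hasOwnProperty.call(obj, key)) {
            count++;
        }
    }
    return count;
}

console.log(list.length);           // 0, the array still has no elements
console.log(JSON.stringify(list));  // [], the "test" entry is not serialized
console.log(countOwn(map));         // 1
console.log(JSON.stringify(map));   // {"test":1}
```

The array-specific features (length, serialization as a list) simply ignore the string key, so nothing is gained by using Array here.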

Now go and read Andrew's post.

via Erik Arvidsson.

btw: that post led me to Object.prototype is verboten, which explains to me why my for (i in myvar) {} loops never worked correctly. I was using prototype.js version < 1.4 (which messed with Object.prototype).

A better understanding of JavaScript

I've been working with JavaScript for years. It was my replacement for a server side language when I couldn't afford to buy web space in the mid-90's. Still, as the language became popular again, I realized that I understood the basics but that there was much more to the language.

So I dug into the topic a little deeper. I can highly recommend reading the blogs of all the great JavaScript guys like Alex Russell (of Dojo), Aaron Boodman, Erik Arvidsson (both at Google), Douglas Crockford (at Yahoo). (Give me more in the comments ;)

So, JavaScript is easy to start with. You can take a procedural approach like in C. You declare a function, you call a function.

A Survey of the JavaScript Programming Language (by Douglas Crockford) does an amazing job at explaining the notable aspects of the language on a quite short page.

I want to point out the most interesting points for me:

Subscript and dot notation
You can access a member of an object by using two different notations:

var y = { p: 1 };
alert(y["p"]); // subscript notation
alert(y.p); // dot notation

The great difference is that with subscript notation you can also access members whose names are reserved words (of which there are quite a few in JavaScript). Dot notation is shorter and more convenient.
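Besides reserved words, subscript notation also covers member names that are not valid identifiers at all, and names that are only known at run time. A small sketch (the settings object and the read helper are hypothetical):

```javascript
var settings = { "font-size": 12, color: "red" };

// Names that are not valid identifiers only work with subscript notation.
console.log(settings["font-size"]);    // 12

// Subscript notation also accepts a name computed at run time.
function read(obj, name) {
    return obj[name];
}
console.log(read(settings, "color"));  // "red"

// For plain names both notations give the same result.
console.log(settings.color);           // "red"
```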

Different meanings of the this keyword
Consider this piece of code creating a small object.

<a href="#" id="link">click here</a>
<script type="text/javascript">
var myobject = {
    id: 'obj',
    method: function() {
        alert(this.id);
    }
};
myobject.method();
var l = document.getElementById("link");
l.onclick = myobject.method;
</script>

When you call myobject.method();, this points to the current object and you receive an alert box with the text 'obj'. But there are exceptions:

If the function is called from within an HTML page via an onclick event, this refers to the calling object (i.e. the link). You will therefore receive an alert box containing 'link' as message.

This can be useful in many cases, but if you want to access "your" object, you can't. Aaron Boodman proposed a function that was eventually named hitch:

function hitch(obj, meth) {
return function() { return obj[meth].apply(obj, arguments); }
}

You'd use it like this: l.onclick=hitch(myobject, 'method'); Now the this keyword points at the correct object.
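One detail of hitch that is easy to miss: because it takes the method name as a string, the method is looked up on every call. The following sketch shows this without a browser; the describe method and the plain link object standing in for the DOM element are assumptions for the illustration.

```javascript
function hitch(obj, meth) {
    return function () { return obj[meth].apply(obj, arguments); };
}

var myobject = {
    id: "obj",
    describe: function (prefix) { return prefix + this.id; }
};

// A plain object standing in for the link element.
var link = { id: "link" };
link.onclick = hitch(myobject, "describe");
console.log(link.onclick("id: "));  // "id: obj"

// Replacing the method later is picked up, because hitch looks it up by name on each call.
myobject.describe = function (prefix) { return prefix + this.id.toUpperCase(); };
console.log(link.onclick("id: "));  // "id: OBJ"
```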

You could also change the method to something like this and still use the previous notation:

myobject.method = function() {
    if (this != myobject) { return myobject.method.apply(myobject, arguments); }
    alert(this.id);
};

Creating objects with new
I was always wondering how to create objects from a class as I am used to from other programming languages, where instantiating creates an object according to the "building instructions" of a class.

Douglas shows this in more detail on his Private Members in JavaScript page.

I've quickly hacked together this example:

var x = function () {
    var created = new Date();
    this.when = function () { alert(created); };
};
var p, u = new x();
window.setTimeout("n()", 1000);
function n () {
    p = new x();
    p.when();
    u.when();
    alert(typeof p.created);
}

You receive 2 objects p and u that have different creation times. They also have a private variable created which is only accessible via the public function when (because specified via this).

So when you create an object using the new Object() or {} notation, you only get a single static object. If you want to instantiate objects, you need to define a function and call it with new.
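The same pattern, reduced to a sketch that runs without alert or timers: each call with new gets its own private variable. The Counter constructor is a made-up example, not taken from Douglas's page.

```javascript
function Counter() {
    var count = 0;                   // private, one per instance
    this.increment = function () {   // public, but can see the private variable
        count += 1;
        return count;
    };
}

var first = new Counter();
var second = new Counter();

first.increment();
console.log(first.increment());    // 2
console.log(second.increment());   // 1, the instances do not share state
console.log(typeof first.count);   // "undefined", not reachable from outside
```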

Closures
The example above already demonstrated closures. It is only the existence of closures in JavaScript that makes private variables possible.

A closure is, to put it simply, a function within another function. The inner function has access to its parent's variables but not the other way round.

All in all, a function is just another data type that can be assigned to a variable. Therefore these two notations can be used interchangeably:

function test() { alert(new Date()); }
var test = function() { alert(new Date()); }
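One difference between the two notations is worth knowing: a function declaration is available in its whole scope, while the variable of a function expression only holds the function after the assignment has run. A minimal sketch (the names declared and expressed are invented):

```javascript
// Both lines run before the definitions below are reached.
var declaredEarly = typeof declared;    // "function": declarations are hoisted
var expressedEarly = typeof expressed;  // "undefined": only the variable exists so far

function declared() { return "declared"; }
var expressed = function () { return "expressed"; };

console.log(declaredEarly, expressedEarly);  // function undefined
console.log(declared(), expressed());        // declared expressed
```

Once both definitions have been executed, the two functions behave the same.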

The ominous prototype "object" is a way of using the this keyword from "outside".
Modifying the piece of code from before:

var x = function () {
var created = new Date();
}
x.prototype.when = function () { alert(created); }

But there's a pitfall. The created variable is private. Even though the function when now is a member of the object x it does not "see" the variable created. So in the original example the function when had privileged access (see Private Members in JavaScript).
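Here is the pitfall as a runnable sketch. The Stamp constructor and its members are hypothetical names; the point is only which functions can see the private variable.

```javascript
function Stamp(label) {
    var secret = "private " + label;   // private: lives only in this closure
    this.label = label;                // public
    this.reveal = function () {        // privileged: defined inside, sees "secret"
        return secret;
    };
}

// Defined outside the constructor: reaches public members through "this" only.
Stamp.prototype.describe = function () {
    return "label: " + this.label;
};
Stamp.prototype.peek = function () {
    return typeof secret;              // there is no such variable in this scope
};

var s = new Stamp("a");
console.log(s.describe());  // "label: a"
console.log(s.reveal());    // "private a"
console.log(s.peek());      // "undefined"
```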

Concluding
All in all I see that JavaScript is a powerful language. Many things can be accomplished in an elegant (and sometimes quite unusual) way. (Curried JavaScript even demonstrates how to use it as a functional programming language.)

I realize that there is a nice and clean solution for almost every problem you come across. This is where libraries come into play. The downside: you can quickly add tons of libraries, leading to large page sizes and memory consumption.

dojo, for example, is a really great library that provides you with numerous well-thought-out functions, making your life a lot easier. But the size is 132 KB, just for the basic functions, and more than a megabyte all in all. It avoids the need to load everything up front with an on-demand loading mechanism (dojo.require).

In my opinion we'd need something like a local library storage. A Firefox extension would be a nice first step.
As far as I have looked into that topic, though, there are some difficulties. Foremost there is a problem with namespaces. Firefox clearly separates the JS code of extensions from code coming from the web. A good thing, security-wise, but a hindrance in this case.

Maybe some Firefox guru can tell a way how to circumvent this, I think it might be worth a shot.

10 Realistic Steps to a Faster Web Site

I complained before about bad guides to improve the performance of your website.

I'd like to give you a more realistic guide on how to achieve the goal. I have written my master's thesis in computer science on this topic and will refer to it throughout the guide.

1. Determine the bottleneck
When you want to improve the speed of your website, you feel that it's somehow slow. There are various points that can affect the performance of your page. Here are the most common ones.

Before we move on, you should always remember that you answer each question with your target audience in mind.

1.1. File Size
How much data is the user required to load before (s)he can use the page?

How much data your web page is allowed to have is a frequent question. You cannot answer it unless you know your target audience.

In the early years of the internet one would suggest a size of 30k max for the whole page (including images, etc.). Now that many people have a broadband connection, I think we can push the level to a value between 60k and 100k. Although, you should consider lowering the size if you also target modem users.

Still, the less data you require to download, the faster your page will appear.

1.2. Latency
The time it takes between your request to the server and when the data reaches your PC.

This time is the sum of twice the network latency (which depends on the uplink of the hosting provider, the geographical distance between server and user, and some other factors) and the time the server takes to produce the output.

Network latency can hardly be optimized without moving the server, so this guide will not cover it.
The processing time of the server is made up of several complex factors and most often leaves much room for improvement.

2. Reducing the file size
First, you need to know how large your page really is. There are some useful tools out there. I picked Web Page Analyzer which does a nice job at this.

I suggest not spending too much time on this unless your page size is larger than 100 KB. If it is smaller, skip to step 3.

Large page sizes are nowadays often caused by large JavaScript libraries. Often you only need a small part of their functionality, so you could use a cut-down version of it. For example when using prototype.js just for Ajax, you could use pt.ajax.js (also see moo.ajax), or the moo.fx as a script.aculo.us replacement.

Digg, for example, used to weigh in at about 290 KB; they have now reduced the size to 160 KB by leaving out unnecessary libraries.

Large images can also cause large file sizes, often because of the wrong image format. A rule of thumb: JPG for photos, PNG for most other things, especially if plain colors are involved. Also use PNG for screenshots; JPGs are not only larger but also look ugly. You can use GIF instead of PNG when the image has only a few colors and/or you want to create an animation.

Large images are also often scaled down via the HTML width and height attributes. You should scale them in your graphics editor instead. This will also reduce the size.

Old HTML style can also cause large file size. There is no need for thousands of tags anymore. Use XHTML and CSS!

A further important step towards a smaller size is on-the-fly compression of your content. Almost all browsers already support gzip compression. For an Apache 2 web server, for example, the mod_deflate module can do this transparently for you.

If you don't have access to your server's configuration, you can use zlib for PHP; for Django (Python) there is GZipMiddleware, and Ruby on Rails has a gzip plugin, too.

Beware of compressing JavaScript: there are quite a few bugs with Internet Explorer.

And for heaven's sake, you can also strip the white space after you've completed the previous steps.

3. Check what's causing a high latency
As mentioned, the latency can be caused by two large factors.

3.1. Is it the network latency?
To determine whether the network latency is the blocking factor you can ping your server. This can be done from the command line via the command ping servername.com

If your server admin has disabled ping responses you can also use a traceroute, which uses another method to determine the time: tracert servername.com (Windows) or traceroute servername.com (Unix).

If you address an audience that is geographically not very close to you, you can also use a service such as Just Ping which pings the given address from 12 different locations in the world.

3.2. Does it take too long to generate the page?
If the ping times are ok, it might take too long to generate the page. Note that this applies to dynamic pages, for example written in a scripting language such as PHP. Static pages are usually served very quickly.

You can measure the time it takes to generate the page quite easily. You just need to save a time stamp at the beginning of the page and subtract it from the time stamp taken when the page has been generated. For example in PHP you do it like this:

<?php
// Start of the page
$start_time = explode(' ', microtime());
$start_time = $start_time[1] + $start_time[0];
?>

and at the end of the page:

<?php
$end_time = explode(' ', microtime());
$total_time = $end_time[0] + $end_time[1] - $start_time;
printf('Page loaded in %.3f seconds.', $total_time);
?>

The time needed to generate the page is now displayed at the bottom of it.

You can also compare the time between loading a static page (often a file ending in .html) and a dynamic one. I'd advise to use the first method because you are going to need that method to go on optimizing the page.

You can also use a Profiler which usually offers even more information on the generation process.

For PHP you can, as a first easy step, enable Output Buffering and restart the test.

Also you should consider testing your page with a benchmarking program such as ApacheBench (ab). This will stress the server via requesting several copies at once.

It is difficult to say what time suffices for generating a web page. It depends on your own requirements. You should try to keep the generation time under 1 second, as this is a delay which users usually can cope with.

3.3. Is it the rendering performance?
This plays only a minor role in my guide, but still this can be a reason why your page takes long to load.

If you use a complex table structure (which can render slowly), you are most probably using old-style HTML. Try to switch to XHTML and CSS.

Don't use overly complex JavaScript: slow scripts in combination with onmousemove events, for example, make a page really sluggish. If your JavaScript makes the page load slowly (you can use a similar technique as the PHP time measuring, using (new Date()).getTime()), you are doing something wrong. Rethink your concept.

4. Determine the lagging component(s)
As your page usually consists of more than one component (such as header, login window, navigation, footer, etc.) you should next check which one needs tuning. You can do this by integrating a few of the measuring fragments to the page which will show you several split times throughout the page.
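The idea of split times can be sketched as follows. The sketch is in JavaScript to match the other examples on this blog, and the same structure carries over to PHP; createSplitTimer, mark and slowest are names I made up, and the optional now argument only exists so the clock can be replaced in a test.

```javascript
// Records named split times; "now" can be replaced for testing.
function createSplitTimer(now) {
    now = now || function () { return new Date().getTime(); };
    var last = now();
    var splits = [];
    return {
        // Stores the time elapsed since the previous mark under the given label.
        mark: function (label) {
            var current = now();
            splits.push({ label: label, ms: current - last });
            last = current;
        },
        // Returns the split that took longest, or null if nothing was marked.
        slowest: function () {
            var worst = null;
            for (var i = 0; i < splits.length; i++) {
                if (worst === null || splits[i].ms > worst.ms) {
                    worst = splits[i];
                }
            }
            return worst;
        },
        splits: splits
    };
}

var timer = createSplitTimer();
// ... generate the header here ...
timer.mark("header");
// ... generate the navigation here ...
timer.mark("navigation");
console.log(timer.splits);
```

You call mark after each component and look at slowest (or the whole list) once the page is done.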

The following steps can now be applied to the slowest parts of the page.

5. Enable a Compiler Cache
Scripting languages recompile their script upon each request. As there are far more requests than changes to the script, it makes no sense to compile the script over and over (especially when core development has finished).

For PHP there is, amongst others, APC (which will probably be integrated into PHP 6); Python stores a compiled version by itself.

6. Look at the DB Queries
At university most complex queries with lots of JOINs and GROUPs are taught, but in real life it can often be useful to avoid JOINs between (especially large) tables. Instead you do multiple selects which can be cached by the SQL server. This is especially true if you don't need the joined data for every row. It really depends on your application, but trying without a JOIN is often worth it.

Ensure that you use query folding (also called query cache; such as the MySQL Query Cache), because in a web environment the same SELECT statements are executed over and over. This almost screams for a cache (and explains why avoiding JOINs can be much faster).

7. Send the correct Modification Data
Dynamic Web pages often make one big mistake: They don't have their date of last modification set. This means that the browser always has to load the whole page from the server and cannot use its cache.

In HTTP there are various headers important for caching: for 1.0 there is the Last-Modified header which plays together with the browser-sent If-Modified-Since (see specification). HTTP 1.1 uses the ETag (so called Entity Tag) which allows different last modification dates for the same page (e.g. for different languages). Other relevant headers are Cache-Control and Expires.
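How these headers play together can be sketched as a small decision function. This is a simplified illustration, not a complete implementation: conditionalResponse and the shape of the resource object are my own assumptions, header names are expected in lower case, and the ETag comparison handles only a single exact tag (no lists, no "*", no weak tags).

```javascript
// resource: { etag: string, lastModified: Date, body: string }
// requestHeaders: object with lower-case header names
function conditionalResponse(requestHeaders, resource) {
    var headers = {
        "ETag": resource.etag,
        "Last-Modified": resource.lastModified.toUTCString()
    };
    var notModified = { status: 304, headers: headers, body: "" };
    var full = { status: 200, headers: headers, body: resource.body };

    // HTTP 1.1: the entity tag takes precedence when the client sends one.
    var ifNoneMatch = requestHeaders["if-none-match"];
    if (ifNoneMatch !== undefined) {
        return ifNoneMatch === resource.etag ? notModified : full;
    }

    // HTTP 1.0: compare dates at one-second resolution.
    var since = Date.parse(requestHeaders["if-modified-since"]);
    var modified = Math.floor(resource.lastModified.getTime() / 1000) * 1000;
    if (!isNaN(since) && modified <= since) {
        return notModified;
    }
    return full;
}
```

A 304 response carries no body, so the browser uses its cached copy and only the headers travel over the network.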

Read on about how to set the headers correctly and respond to them (1.0) and 1.1.

8. Consider Component Caching (advanced)
If optimizing the database does not improve your generation time enough, you are most likely doing something complex ;)
So for public pages it's very likely that you will present two users with the same content (at least for a specific component). So instead of doing complex database queries, you can store a pre-rendered copy and use that when needed, to save time.

This is a rather complex topic, but it can be the ultimate solution to your performance problems. You need to make sure that you don't deliver a stale copy to the client, and you need to think about how to organize your cache files so you can invalidate them quickly.

Most web frameworks give you a hand when doing component caching: for PHP there is Smarty's template caching, Perl has Mason's Data Caching, Ruby on Rails has Page Caching, and Django supports it as well.

This technique can eventually lead to a state where loading your page does not need any request to the database. This is a favorable result, as the connection to the database is often the most obvious bottleneck.

If your page is not that complex, you could also consider just caching the whole page. This is easier but usually makes the page feel less up to date.

One more thing: If you have enough RAM you should also consider storing the cache files in a RAM drive. As the data is discardable (as it can be re-generated at any time) a loss when rebooting would not matter. Keeping disk I/O low can boost the speed once again.
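To make the idea of a pre-rendered component copy concrete, here is a minimal file-based sketch in PHP. It is not how Smarty or any of the frameworks above implement it; the function names, the md5-based file naming and the time-to-live parameter are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: returns the cached HTML of a component if a fresh copy
// exists, otherwise renders it via $render and stores the result.
function cachedComponent(string $key, int $ttl, callable $render, string $cacheDir): string
{
    $file = $cacheDir . '/' . md5($key) . '.html';
    clearstatcache(true, $file); // make sure we see the current file state

    // Serve the stored copy while it is younger than $ttl seconds.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise do the expensive work once and keep the result.
    $html = $render();
    file_put_contents($file, $html, LOCK_EX);
    return $html;
}

// Hypothetical helper: removes a cached copy, e.g. after the data changed.
function invalidateComponent(string $key, string $cacheDir): void
{
    $file = $cacheDir . '/' . md5($key) . '.html';
    if (is_file($file)) {
        unlink($file);
    }
}
```

The callback passed as $render would contain the database queries and template code. Pointing $cacheDir at a RAM drive gives you the setup described above; calling invalidateComponent whenever the underlying data changes keeps stale copies away from the client.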

9. Reducing the Server Load
Suppose your page loads quickly and everything looks alright, but when too many users access the page, it suddenly becomes slow.

This is most likely due to a lack of resources on the server. You cannot add an unlimited amount of CPU power or RAM to the server, but you can handle what you've got more carefully.

9.1. Use a Reverse Proxy (needs access to the server)
Whenever a request needs to be handled, a whole copy (or child process) of the web server executable needs to be held in memory. Not only for the time of generating the page but also until the page has been transferred to the client. Slow clients can cost performance. When you have many users connecting, you can be sure that quite a few slow ones will block the line for somebody else just for transferring back the data.

There is a solution for this. The well-known Squid proxy has an HTTP Acceleration mode which handles communication with the client. It's like a secretary that handles all communication.

It waits patiently until the client has filed its request, asks the web server to respond, quickly receives the response (so the web server can move on to the next request) and then patiently returns the file to the client.

Also the Squid server is small, lightweight, and specialized for that task. Therefore you need less RAM for more clients which allows a higher throughput (regarding served clients per time unit).

9.2. Take a lightweight HTTP Server (needs access to the server)
Often people also say that Apache is quite huge and does not do its work quickly enough. Personally I am satisfied with its performance, but when it comes to dealing with scripting languages that handle their web server communication via the (Fast)CGI interface, Apache is easily trumped by a lightweight alternative.

It's called LightTPD (pronounced "lighty") and does a good job at doing that special task very quickly. You can already see from a configuration file that it keeps things simple.

I suggest testing both scenarios to see whether you gain from using LightTPD or should stay with your old web server. The Apache Web Server is stable and is built on long-lasting experience in the web server business, but LightTPD is taking its chance.

10. Server Scaling (extreme technique)
Once you have gone through all steps and your page still does not load fast enough (most likely because of too many concurrent users), you can now duplicate your hardware. Because of the previous steps there isn't too much work left.

The Reverse Proxy can act as a load balancer by sending its requests to one of the web servers, either in turn (Round Robin) or based on server load.
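The two selection strategies can be illustrated with a small PHP sketch. In practice the reverse proxy makes this decision for you; the function names and the load figures below are purely hypothetical.

```php
<?php
// Hypothetical helper: Round Robin hands requests to the servers in turn.
function pickRoundRobin(array $servers, int $requestNumber): string
{
    return $servers[$requestNumber % count($servers)];
}

// Hypothetical helper: load-driven selection picks the server that currently
// reports the lowest load. $loadByServer maps server name => load value.
function pickLeastLoaded(array $loadByServer): string
{
    $lowest = min($loadByServer);
    return array_keys($loadByServer, $lowest)[0];
}
```

Round Robin needs no knowledge about the servers, while the load-driven variant requires the balancer to collect a load value from each web server.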

Conclusion
All in all, the main strategy for a large page is a combination of caching and intelligent handling of resources. While the first 7 steps apply to any page, the last 3 points are usually only useful (and needed) at sites with many concurrent users.

The guide shows that you don't need a special server to withstand slashdotting or digging.

Further Reading
For more detail on each step I recommend taking a look at my diploma thesis.

MySQL tuning is nicely described in Jeremy Zawodny's High Performance MySQL. A presentation about how Yahoo tunes its Apache Servers. Some tips for Websites running on Java. George Schlossnagle gives some good tips for caching in his Advanced PHP Programming. His tips are not restricted to PHP as a scripting language.

Speed up your page, but how?

Today I ran across the blog entry by Marcelo Calbucci, called "Web Developers: Speed up your pages!".

It's a typical example of a good idea with bad execution. Most of the points he mentions are really bad practice.

He suggests reducing traffic (and therefore loading time) by removing whitespace from the source code, writing all code in lower case (for better compression?!?), reducing code by writing invalid XHTML, and keeping JavaScript function names and variables short. This is nit-picking and results in a maintenance nightmare.

For big sites, e.g. Google, the whitespace reduction tricks make sense. But they have enormous numbers of page impressions. Saving 200 bytes tops by stripping whitespace is nearly worthless for smaller sites and not worth the trouble. Additionally I bet that Google does not maintain that page as such, but has created some kind of conversion script.

Other thoughts are quite nice but commonplace. Most of the comments (e.g. by Sarah) posted at that article reflect my opinion quite well and deal with each point in more detail.

For most dynamic pages the bottleneck for responding to a client request is the script loading (or running) time. I suggest that the writer read some articles about server caching (my thesis also deals with that topic) and optimization.

Often the latency between client and server can also be held responsible for considerable delays. As the client has to parse the HTML file to decide what files to load next, delays can add up.

All in all, it's a good idea to deal with the loading time of a page. But you have to look in the right place.

Using a Feedback Form for Spam

Have you ever received weird spam via the feedback form of your site? Something with your own address as sender or with some MIME stuff in the body? Your form is likely being misused for spamming.

How does it work?

For PHP, for example, there is the mail function that can be used to easily send an e-mail. Most probably you'd use some code like this to send the message from your feedback form.

<?php
$msg  = "Name: " . $_POST["name"] . "\n";
$msg .= "E-Mail: " . $_POST["email"] . "\n";
$msg .= $_POST["msg"];
mail("my.e.mail@addr.es", "feedback from my site", $msg);
?>

That's simple and works well, but it's a little annoying if you want to answer that e-mail. You click the e-mail address to open a new message and have to paste the whole message into the new window for quoting. There's an easy solution: Pretend that the e-mail comes from the customer requesting some info. This can be simply done via the additional_headers parameter of mail.

<?php
$sender = "From: " . $_POST["email"] . "\r\n";
// even nicer, shows the name, not the address:
$sender = "From: " . $_POST["name"] . " <" . $_POST["email"] . ">\r\n";
mail("my.e.mail@addr.es", "feedback from my site", $msg, $sender);
?>

Well. We've just introduced 2 potential spamming opportunities. Why? Let's see. For mail transport we use SMTP. Our outgoing mail might look like this (generated by mail).

From: tester <test@test.com>
To: my.e.mail@addr.es
Subject: feedback from my site

this is my message

(Before, the From would have looked something like From: webserver@mydomain.com)
So if the spammer manages to insert another field (like To, CC, or BCC), not only would we receive that e-mail but so would the person entered as CC. This works by inserting a line break into the name or e-mail address. For example, for a given name such as

Alex
CC: other@e.mail.addr.es

that would be the case.
Although this is usually not possible through a normal text box, a POST request containing that line break and the malicious CC can easily be constructed.

So be sure to strip out at least the characters \r and \n from the name and e-mail address, or just strip out any non-Latin characters (people with German umlauts in their names, for example, will have to live with that).

So a quite good method would be to use this piece of code:

$name = preg_replace("|[^a-z0-9 \-.,]|i", "", $_POST["name"]);
$email = preg_replace("|[^a-z0-9@.]|i", "", $_POST["email"]);
$sender = "From: " . $name . " <" . $email . ">\r\n";
mail("my.e.mail@addr.es", "feedback from my site", $msg, $sender);
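A somewhat stricter variant rejects suspicious input instead of silently cleaning it. The following is a sketch under assumptions: the function name buildFromHeader is made up, and it relies on PHP's built-in filter_var with FILTER_VALIDATE_EMAIL to check the address. The name is still limited to plain ASCII, as in the code above.

```php
<?php
// Hypothetical helper: returns a safe "From" header line, or null when the
// input looks like a header injection attempt or the address is invalid.
function buildFromHeader(string $name, string $email): ?string
{
    // A line break in either field is the injection attack described above.
    if (preg_match('/[\r\n]/', $name . $email)) {
        return null;
    }

    // Let PHP's built-in validator decide whether the address is plausible.
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return null;
    }

    // Keep only harmless ASCII characters in the display name.
    $name = trim(preg_replace('/[^a-z0-9 .\-]/i', '', $name));

    return 'From: ' . $name . ' <' . $email . ">\r\n";
}
```

If the function returns null, the form handler can show an error message or fall back to sending the mail without a custom From header.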

The conclusion is simple (and always the same one): Never trust any data you receive from a user.
Verify all data you receive and strip potentially harmful characters. Common bad characters are:

  • for mails: \r, \n,
  • for HTML: <, > (you could use htmlspecialchars for that),
  • for URLs: &, =,
  • complete the list in the comments ;)

Ah, the conclusion. Never trust any data you receive from a user.

Eclipse Everywhere. Buah.

It's been a little quiet lately. This is because I am working on a cute little project that I will be able to present soon. More when the time is right.

There has been a rumor lately that Zend (developer of PHP) will release a PHP Framework. This is nothing new; there has been an IDE (Zend ) for a long time now. But it will be based on Eclipse.

Also Macromedia announced that their new Flex 2.0 environment (Flashbuilder) will be based on Eclipse.

Why on earth Eclipse?! I think this is one of the slowest IDEs available. It's based on Java, which makes it incredibly slow already, and it's so bloated that it's unbelievable.

I just can't understand why developers would use such a tool. I am not willing to buy a GHz monster PC just to have an editor running there. That's a pure waste of money and electricity. Emacs is kinda slow already but it runs on a few MHz.

Can anyone explain to me why to use such a monster?

I thought that maybe everything changed for the better by now and downloaded the whole thing. That's 100MB already. This already shows how much memory it will consume. Ok, I still started it. It took more than 2 minutes on my Powerbook G4. Hello? The features it provides are so not worth that.

I can recommend TextMate (best completion) and EditPlus (best integrated (S)FTP). These are fast, neat text editors. That's what I want.

Caching of Downloaded Code: Testing Results

Today I did some experimenting with the caching of downloaded code (or On-Demand Javascript, whatever you want to call it).

I've set up a small test suite that currently tests 3 different ways of downloading code: script-tag insertion via DOM, XMLHttpRequest (XHR) as a GET, and XHR as a POST.

These are my results for now:

Method      IE6         Firefox 1.07  Firefox 1.5b2  Safari 2.0  Opera 8.5
script_dom  cached      cached        cached         cached      cached
xhr_post    not cached  not cached    not cached     not cached  not cached
xhr_get     cached      not cached    cached         not cached  not cached

(Results are the same for Win and OS X where both browsers are available (FF & Opera))

[Figure: Safari Code Downloading Cache Test]

This gives an interesting picture: Firefox does not seem to cache any scripts, neither the ones loaded via DOM nor those loaded via XHR. Only IE loads an XHR GET request from cache.

I've got the script in my public testing area, so you can test it for your own browser. Please do so and correct my values if you receive different results.

The sources of my tests are available, too: index.phps and js.phps. I ran my tests using the latest prototype.js library. Maybe I will try it later on with another library (e.g. with dojo.io.bind).

I'd be interested in more ways to download code (especially via document.write, since I haven't been able to include this properly in my tests) and in your results for other browsers. Just leave a comment.

UPDATE: I have now included the Expires header field with the JavaScript file. Now Firefox caches the script with script_dom in both versions; in version 1.5b2 it also caches XHR with GET requests.
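Sending such an Expires header from a PHP script that delivers the JavaScript could look roughly like this. It is only a sketch: the helper name and the one-hour lifetime are assumptions, not taken from the actual test sources.

```php
<?php
// Hypothetical helper: builds an Expires header that lies $lifetimeSeconds
// after $now, in the date format HTTP expects.
function expiresHeader(int $now, int $lifetimeSeconds): string
{
    return 'Expires: ' . gmdate('D, d M Y H:i:s', $now + $lifetimeSeconds) . ' GMT';
}
```

The script that outputs the JavaScript would call header(expiresHeader(time(), 3600)) before sending any content, which tells the browser it may reuse the file for one hour.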
