Chapter 4
Tools

This chapter describes the tools used in this thesis. The order does not reflect their later use; it was chosen to make the material easier to follow. Unless stated otherwise, the tools are Open Source and licensed under the GNU General Public License (GPL).

4.1 Apache

This thesis uses the Apache HTTP Server as its web server. According to [Net05] it is the web server software used on most hosts today.

4.1.1 History

The development of Apache started in April 1995 as an evolution of the public domain HTTP daemon developed by Rob McCool at the National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign.

Originally Apache was a collection of patches for the NCSA daemon, only a few of which were developed by the newly founded Apache Group. It soon became evident, though, that this basis lacked extensibility, so the server was redesigned from scratch. Apache 1.0 was released on December 1, 1995.

By April 1996 the Apache web server had already moved to first place in web server popularity.

The versions of Apache in use today are 1.3 and 2.0. While 1.3 evolved from the first version and was extended with various modules, version 2.0 is once again a new design, intended to meet the requirements placed on web servers in today's World Wide Web. Among its features are better integration of the POSIX thread system and native IPv6 support.

Apache runs on several platforms, including Unix-based systems and Windows NT. Still, when referring to an Apache web server, one commonly means a Unix or, even more often, a Linux system. The term LAMP (Linux, Apache, MySQL, PHP; the environment used in this thesis) was coined to describe this very common configuration.

The Apache Group has meanwhile taken on several other projects that are very important in the field of Open Source software. The most important among them are Jakarta Tomcat (a server for Java-based web applications), Ant (a build system, not only for Java projects), and Struts (a framework for Java web applications).

There is also an Apache License (which also applies to the HTTP server) that is primarily based on the BSD license. See [Hub04].

Today the market share of the Apache web server is very close to 70%.

4.1.2 Features

Basically the Apache web server is designed to serve data via the HTTP protocol (versions 1.0 and 1.1). Its functionality can be enhanced by a great variety of modules.

Apache HTTP Server 1.3 implements a so-called pre-forking model. The term forking describes the creation of child processes by a parent process that controls them. The web server creates a certain number of child processes in advance, without any immediate need for them. When several requests arrive, no time has to be spent forking a new process because the existing processes can be used. If there are not enough child processes, more can be created.

This pre-forking model can be seen as a substitute for threading. Version 2.0 of the Apache web server implements threads, which can increase speed in many scenarios. This is because current operating systems offer strong support for threads as an alternative to forking; threads can be distributed across different CPUs in multiprocessor (SMP) systems.

Third-party modules are also supported, which gives the web server a large variety of additional capabilities. For example, PHP (see 4.2) is commonly integrated as a module, allowing higher performance than CGI.

Other important modules include authentication modules of all sorts (via LDAP, MySQL, DBM, etc.), highly configurable logging modules, and rewriting modules, which modify the request URI before it is processed.

4.1.3 Alternatives

For a Linux system there are very few alternatives. The greatest rival according to [Net05] is Microsoft’s IIS which is only available for the desktop monopoly operating system Windows (NT).

The only real alternative to Apache under Linux is the Zeus Web Server (developed by Zeus Technology Ltd., receiving some coverage in [Mid02]), claiming to be the fastest web server available. As it is not available on a free basis (and is far from Open Source) it was not taken into greater consideration.

4.2 PHP

The web application is implemented in the programming language PHP, which stands for “PHP: Hypertext Preprocessor”. It is now a widely used language for creating dynamic web pages.

4.2.1 History

Development of PHP was started by Rasmus Lerdorf in 1995. At that time it was called “Personal Home Page Tools”, a set of Perl programs that tracked accesses to his homepage. Later (1997) he re-coded it in C to provide more features (e.g. easy access to databases) and called it PHP/FI (“Personal Home Page / Forms Interpreter”).

PHP 3 (released in June 1998) was the first version similar to the PHP most web sites use today. It was highly extensible and provided a solid infrastructure for many databases and protocols. By the end of 1998, hundreds of thousands of web servers reported having PHP installed, which was approximately 10% of the WWW’s servers.

The breakthrough for PHP came with version 4 (released in 2000 and built on the “Zend Engine”, which was introduced in 1999 and named after its authors Zeev Suraski and Andi Gutmans). Many more web servers were supported, and programmers gained new features such as HTTP session support and output buffering as well as more secure methods for receiving user data (“magic quotes”). Several million sites report today that they use PHP 4 (about 20% of the WWW’s servers).

One of the biggest drawbacks of PHP was fixed with version 5, released only in 2004. It provides full-featured OOP, whereas earlier versions had only rudimentary support for it: inheritance was supported, for example, but encapsulation was not.
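As a minimal sketch of what encapsulation means in this context, the following PHP 5 code hides a counter behind the private keyword while still using inheritance. The class names Counter and StepCounter are hypothetical and chosen only for illustration; they are not part of the application examined in this thesis.

```php
<?php
// Hypothetical example classes, for illustration only.
class Counter {
    // private: only methods of Counter itself may access this property (PHP 5 and later).
    private $count = 0;

    public function increment() {
        $this->count++;
        return $this->count;
    }

    public function value() {
        return $this->count;
    }
}

// Inheritance: StepCounter reuses the public methods of Counter,
// but cannot touch the private property directly.
class StepCounter extends Counter {
    public function incrementBy($steps) {
        for ($i = 0; $i < $steps; $i++) {
            $this->increment();
        }
        return $this->value();
    }
}

$counter = new StepCounter();
$total = $counter->incrementBy(3);
echo $total . "\n";
```

Code outside the class can change the counter only through increment() and incrementBy(); a direct access such as $counter->count is rejected by PHP 5.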

4.2.2 Language Basics and Structure

PHP is a language specialized in delivering web pages. PHP code is therefore simply embedded in (existing) HTML pages. A simple “Hello World” script would look like the one shown in listing 4.1.


Listing 4.1: Hello World in PHP – helloworld.php

<html><head><title>Hello World Example</title></head>
<body><?php echo "Hello World!"; ?></body></html>

The code is marked as PHP by special tags (<?php and ?>; similar to ASP’s <% and %>). The text between the opening and the closing tag is compiled and interpreted, while the HTML code is sent to the client untouched.

PHP is a scripting language. This means that there is no need for the programmer to compile the script before it can be executed. This enables quick prototyping which matches the requirements of web applications: modifications have to be integrated quickly.

In PHP, variables (marked by a dollar sign, e.g. $variable) do not have to be declared before they are used. From the programmer’s point of view they are type-free, so the same variable can be used for calculations (e.g. as an integer) and for output (as a string) without further care.
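The following short sketch illustrates this behaviour. The variable names are hypothetical; the point is only that the same value is used once in arithmetic and once in string concatenation without any declaration or explicit conversion.

```php
<?php
// No declaration is needed; $count starts out holding a string.
$count = "5";

// Used in arithmetic, the numeric string is converted to a number.
$sum = $count + 3;

// Used in string concatenation, the number is converted to a string.
$message = "Total: " . $sum;

echo $message . "\n";
```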

Reflecting PHP's specialization for the web, a few predefined variables make processing web pages much easier. The $_GET and $_POST variables automatically contain (in the form of an array) the values received from a URL or an HTML form, depending on the HTTP method with which the data was sent to the server.

Given the URL http://localhost/test.php?hello=world the $_GET variable would be a one-element array consisting of the key/value pair [hello] => "world".
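Outside of a real web request $_GET is empty, so the following sketch only emulates what PHP does for the URL above: it extracts the query string with parse_url() and decodes it with parse_str(). The variable $params is a stand-in for $_GET and is an assumption made for this illustration.

```php
<?php
// Example URL from the text.
$url = "http://localhost/test.php?hello=world";

// Extract the query string part of the URL ("hello=world").
$query = parse_url($url, PHP_URL_QUERY);

// Decode the query string into an array of key/value pairs,
// comparable to what PHP places in $_GET during a request.
$params = array();
parse_str($query, $params);

print_r($params);
```

Within a script served by the web server, the same value would simply be read as $_GET["hello"].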

The $_COOKIE variable contains the values of HTTP cookies (small portions of data to be stored on the client side). $_SERVER contains server-set data such as the file system path of the script being executed, or the IP address of the remote client. The $_SESSION variable is suitable for storing data which is persistent throughout multiple requests of the same client. With this variable in use, PHP takes care of generating a so-called session id (for identifying the user) and setting a cookie containing this id. If clients do not support cookies the session id is appended to each link (URL rewriting) and a hidden field containing this id is added to each form on the delivered page, too.

In PHP, arrays play an important role. Every variable mentioned so far is by definition an array, i.e. a data structure that maintains key/value pairs and is implemented internally as a hash table. PHP provides many functions and constructs for arrays (e.g. foreach cycles through each element), since everyday programming mostly deals with structured data. A database query commonly returns an array that represents the data in a natural and intuitive way. Multi-dimensional arrays can be created at will (just by specifying an array as a value), which makes it easy to juggle data.

What makes the language quite special is the great number of functions provided. Compared to other languages there are few internal functions; these are mainly used for variable manipulation (for example for cropping strings). The majority of functions are provided by third-party libraries that are bundled with PHP and whose extensive functionality is easy to access because of this integration.

The range of functions extends from database wrappers for a great many database types (the most important being MySQL, PostgreSQL, Oracle, and Berkeley DB), through a library for image manipulation (GD), compression libraries (such as gzip or bzip2), and encryption libraries (mcrypt), to libraries for accessing remote services such as file upload and download, SOAP calls, or XML-RPC. A large part of PHP's role is therefore to act as a framework for third-party libraries.
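A minimal sketch of these array features follows. The data is hypothetical and only imitates the shape a database query result might have: an outer array of rows, each row being an associative array.

```php
<?php
// Hypothetical result set: a two-dimensional array, created simply
// by using arrays as the values of the outer array.
$rows = array(
    array("id" => 1, "name" => "Alice"),
    array("id" => 2, "name" => "Bob"),
);

// foreach cycles through each element and exposes key and value.
$names = array();
foreach ($rows as $index => $row) {
    $names[] = $row["name"];
    echo $index . ": " . $row["name"] . "\n";
}

// Elements of nested arrays are addressed by chaining the keys.
$second = $rows[1]["name"];
```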

4.2.3 Integration with the web server

As an external product, PHP needs to be integrated with a web server. Usually this is Apache. Two scenarios are possible:

The CGI variant is available for all web servers that support a cgi-bin directory, such as IIS on the Windows platform. Generally speaking, this approach should be avoided when integrating PHP, as a web server module is available that provides an enormous gain in speed.

4.2.4 Additional Libraries

The popularity of PHP has given rise to a large number of third-party tools that provide even more functionality.

On the one hand there is a “semi-official” database of tools called PEAR (PHP Extension and Application Repository). It covers various topics such as protocol implementations, abstraction layers for databases, and reference implementations of algorithms (e.g. cryptographic algorithms). The number of maintainers is limited, though. As most of them are experienced programmers, this ensures a high quality of the included components. Additionally a QA (Quality Assurance) team checks for a high standard. Documentation is provided for each project, often together with extensive tutorials.

On the other hand, many source code repositories exist (e.g. Hotscripts.com) whose content is usually user-contributed. This has both good and bad sides: these repositories contain thousands of pieces of source code, so there are not many “common problems” left that have not been solved yet. But as anybody, experienced programmer or not, can contribute anything, the quality of the code is naturally highly diverse. What is more critical: documentation is commonly poor.

4.2.5 Alternatives

There are many projects on the market that have either developed their own programming language or modified an existing language for use with the WWW. The alternatives can be classified into two categories: scripted and compiled languages.

4.3 MySQL

MySQL was chosen as the DBMS (Database Management System; strictly speaking an RDBMS, with R standing for Relational). Throughout the thesis it is also referred to as the “database”, which is a common misnomer.

MySQL was originally designed for high performance and is believed to be one of the fastest DBMSs currently on the market. MySQL is licensed under the GPL, which makes it open source and therefore freely available.

An API for PHP is provided; it is commonly bundled with PHP and adds to its pool of functions. In fact this integration is the way MySQL is most commonly used today, and the rise of PHP also helped MySQL to emerge.

4.3.1 PEAR::DB

PEAR::DB is not part of the MySQL distribution, nor does it depend solely on MySQL. Rather, it is an abstraction layer from the PEAR repository (see section 4.2.4) that provides a DBMS-independent interface for retrieving data.

Apart from the SQL itself, which has to be understood by the DBMS in use (many systems, MySQL included, use their own flavour of SQL), switching the DBMS is as easy as changing the DSN (Data Source Name) string. The methods for retrieving data (as an associative array, as a “normal” array, etc.) do not differ either.
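To illustrate how little changes when the DBMS is switched, the sketch below compares two DSN strings of the form phptype://username:password@hostspec/database. The credentials, host, and database name are hypothetical. PEAR::DB interprets the DSN itself; parse_url() and the helper describe_dsn() are used here only as stand-ins so that the example runs without PEAR installed.

```php
<?php
// Hypothetical DSN strings; only the leading driver name differs.
$dsnMysql = "mysql://user:secret@localhost/shop";
$dsnPgsql = "pgsql://user:secret@localhost/shop";

// Split a DSN into its components (stand-in for the library's own parsing).
function describe_dsn($dsn) {
    $parts = parse_url($dsn);
    return array(
        "phptype"  => $parts["scheme"],
        "username" => $parts["user"],
        "hostspec" => $parts["host"],
        "database" => ltrim($parts["path"], "/"),
    );
}

$mysql = describe_dsn($dsnMysql);
$pgsql = describe_dsn($dsnPgsql);

// List the components whose values differ between the two DSNs.
$changed = array_keys(array_diff_assoc($mysql, $pgsql));
print_r($changed);
```

The application code that issues queries and fetches rows would stay the same; only the DSN passed when connecting is exchanged.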

4.3.2 Query Cache

Recent versions of MySQL (since 4.0.1) provide a so-called query cache, also referred to as Query Folding [Qia96]. SELECT statements are stored together with their results, which allows very fast responses when the same query (the exact same string has to be used) is executed a second time.

It is quite common for a database used behind a web server that tables do not change very frequently and that the same queries are executed over and over. Activating the query cache can therefore be expected to yield a large increase in speed.
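The principle can be sketched in a few lines of PHP. This is a simplified stand-in, not MySQL's implementation: the class QueryCacheSketch and the function fake_database() are hypothetical, and the sketch ignores the invalidation that becomes necessary when table data changes. It shows only that results are stored under the exact query string, so a query that differs merely in letter case is not found in the cache.

```php
<?php
// Simplified stand-in for a query cache; not MySQL's implementation.
class QueryCacheSketch {
    private $results = array();
    public $hits = 0;
    public $misses = 0;

    // $execute is a callback that runs the query against the actual database.
    public function select($sql, $execute) {
        // The cache key is the exact query string, byte for byte.
        if (array_key_exists($sql, $this->results)) {
            $this->hits++;
            return $this->results[$sql];
        }
        $this->misses++;
        $this->results[$sql] = call_user_func($execute, $sql);
        return $this->results[$sql];
    }
}

// Hypothetical database access returning a fixed result set.
function fake_database($sql) {
    return array(array("id" => 1));
}

$cache = new QueryCacheSketch();
$cache->select("SELECT id FROM users", "fake_database"); // miss: first execution
$cache->select("SELECT id FROM users", "fake_database"); // hit: identical string
$cache->select("select id from users", "fake_database"); // miss: string differs
echo $cache->hits . " hit(s), " . $cache->misses . " miss(es)\n";
```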

4.3.3 Alternatives

There are some alternatives to MySQL that provide additional features which may make them more favourable for certain uses.

MySQL was chosen for its widespread use in Open Source projects and for its speed. Although licensing problems regarding its use with PHP arose in 2004, it can now be recommended, as a special license exception for this use case has been published.

4.4 Smarty

Another tool used in this diploma thesis is the Smarty Template Engine. It is a tool – written in PHP, created by Monte Ohrt and Andrei Zmievski in 2001 – to separate program logic, i.e. the PHP code, from design, stored in so-called template files.



Figure 4.1: The MVC design pattern

Figure 4.1 shows the MVC (Model-View-Controller [KP88]) design pattern, which developers often try to apply to web applications. Using Smarty, this pattern can be implemented with a separation into these components:

In a scenario without Smarty, View and Model are mixed. This would not only violate the design pattern but also reduce the reusability of source code [Par04]. The use of Smarty thus runs contrary to the design goal PHP originally pursued. In fact Smarty only acts as a layer within a PHP script, which is quite obvious as it is written in PHP itself.



Figure 4.2: Three-tier architecture

This also corresponds to the common three-tier architecture (see Figure 4.2). It is desirable (in other areas of information engineering as well) to separate the data (first tier), the business logic (second tier), and the presentation (third tier). The MVC model is a corresponding design pattern. Further benefits of the three-tier architecture are discussed in [Swe01].

In a company the roles of programmer and layout designer are usually separate. Smarty supports and even encourages this separation, because designer and programmer can work on the same page concurrently: the designer changes the appropriate .tpl file while the programmer makes changes to the PHP code. Therefore, the use of Smarty is highly recommended.

4.4.1 Template Basics

Template files are quite similar to “normal” PHP files in that they embed their logic in HTML. A Hello World example using Smarty in combination with PHP would look like this:


Listing 4.2: Hello World in Smarty – hello.tpl

<html><head>
<title>Hello World Example with Smarty</title>
</head><body>{$hello}</body></html>

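Listing 4.3: Hello World in Smarty – hello.php

<?php
include("Smarty.class.php");         // load the Smarty library
$smarty = new Smarty();
$hello  = "Hello World!";
$smarty->assign("hello", $hello);    // make $hello available to the template
$smarty->display("hello.tpl");       // render the template and send the output
?>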

In this example, the variable $hello is displayed within the template file simply by putting it in curly brackets, which are the default delimiters for logic and variables in .tpl files. Template variables are separate from PHP variables: a PHP variable has to be assigned to Smarty explicitly (line 5 of hello.php) to make it accessible in hello.tpl. After that, the Smarty command for displaying the template file is called.


Listing 4.4: Highlighting alternating lines – alternate.tpl

<html><head><title>Alternate Backgrounds</title></head>
<body><table>
{section name=d loop=$data}
<tr><td
{if $smarty.section.d.first}
  bgcolor="#CC0000"
{elseif $smarty.section.d.index is even}
  bgcolor="#CCCCCC"
{else}
  bgcolor="#DDDDDD"
{/if}
>{$data[d]}</td></tr>
{/section}
</table></body></html>

For outputting arrays assigned from PHP, Smarty offers the helper functions foreach and section. A section traverses an array with keys from 0 to n. foreach acts the same way as in PHP, providing access to the key and value of each array entry. While looping, the $smarty.section variable (or $smarty.foreach, respectively) is filled with values that can be used for design purposes. As an example, listings 4.4 and 4.5 show how a table with alternating background colors is generated (see Figure 4.3 for a screenshot of a web browser displaying the page).


Listing 4.5: Highlighting alternating lines – alternate.php

<?php
include("Smarty.class.php");         // load the Smarty library
$smarty = new Smarty();
$data   = array();
for ($i = 0; $i < 10; $i++) {
  $data[] = "value" . $i;            // fill the array with value0 ... value9
}
$smarty->assign("data", $data);      // pass the array to the template
$smarty->display("alternate.tpl");   // render the table
?>


Figure 4.3: Screenshot of the output of the alternating backgrounds example

This short example shows several more aspects of Smarty and PHP. The if/elseif/else construct is available in Smarty for simple tasks (it is intended for design-oriented conditionals such as the one in the example above). The $smarty.section array provides common loop states, for instance the current index or whether the current iteration is the first or the last one. When using sections, array values are accessed in a PHP-like form (index within square brackets).

Finally, it is important to state that templates should not contain logic that affects anything other than (visual) design.

4.4.2 Alternatives

The idea of templating PHP is quite common and various such projects exist.

Smarty was chosen for its features and its steady improvement.

4.5 Squid

A proxy server is a program that acts on behalf of a client by requesting data and returning it to the client. In computer security this would be referred to as a kind of man-in-the-middle. Proxies exist for various applications. In this diploma thesis Squid is used as “a full-featured Web proxy cache”, i.e. it is capable of proxying requests for the protocols HTTP and FTP.

4.5.1 Use cases

Squid is a proxy server for use on Unix/Linux systems (Windows NT is only supported via cygwin). Usually it is installed on a server that acts as a gateway for a (local area) network. Several configurations are possible:

4.5.2 HTTP Acceleration

None of these configurations really matches the topic of this diploma thesis, the caching of web applications. However, there is another configuration called “HTTP acceleration”, in which requests are forwarded to a web server residing on the same machine. This is also commonly known as reverse proxying.

The idea of using a proxy on the same server as a web server has to do with design goals of the two programs.

A web server has to provide several features for processing files to be served (in the case of PHP, for example, the interpreter is commonly integrated with the Apache web server via module). For each request a copy of the executable must be held in memory. Therefore, the larger the executable is, the higher the memory consumption will be for a number of requests.

Proxy servers are designed to be very light-weight programs that primarily serve the goal of collecting requests and – this is a crucial point – then do their proxying: contact the web server.

Consider several clients accessing the web server at the same time: for each request an executable has to be loaded and held in memory until the request is completed. With HTTP acceleration, requests are collected by the small proxy program, which consumes particularly little memory and can therefore accept many more requests than the web server itself. Only after a request has been transmitted completely is the web server contacted to produce the page.

4.5.3 Alternatives

Squid was chosen because it is commonly used in production environments and is shipped as standard with most Linux distributions.

4.6 Advanced PHP Cache

APC is a tool that speeds up the execution of PHP scripts by caching the compiled script in the intermediate language that is eventually executed. It was written in 2000 by George Schlossnagle, Daniel Cowgill and Rasmus Lerdorf.



Figure 4.4: PHP script execution

The idea of reducing execution time is based on the way PHP executes a script (see Figure 4.4). This is basically done in two steps:

  1. The source file is read, parsed and converted to intermediate language (“compiled”).
  2. PHP, i.e. the Zend Engine virtual machine, executes the intermediate code.

These two steps have to be carried out every time a script is requested, because the compiled result is discarded after execution. The same goes for each file that is included during execution.

4.6.1 Concept

While this procedure is by design and fits the requirement of a scripting language that changes to a file take effect without further ado, the number of executions commonly exceeds the number of changes by far. What is more, for many scripts, especially those with many “includes”, it often takes PHP longer to convert the script into intermediate language than to execute it.

In fact, the result of step 1 stays the same for most requests (except when a modification was made). APC implements the idea of caching the compilation results until a modification is made to (one of) the PHP source file(s).
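The caching idea can be sketched in plain PHP as follows. This is not APC's implementation: the class CompileCacheSketch and the function compile_source() are hypothetical, the "compilation" is a trivial placeholder, and it is an assumption of this sketch that a change is detected by comparing the file's modification time.

```php
<?php
// Hypothetical placeholder for step 1 (parsing and compiling a source file).
function compile_source($source) {
    return strtoupper($source);
}

// Keeps the compiled form of each file until the file is modified.
class CompileCacheSketch {
    private $entries = array();
    public $compilations = 0;

    public function get($path) {
        clearstatcache(true, $path);       // make sure a fresh mtime is read
        $mtime = filemtime($path);
        if (isset($this->entries[$path]) && $this->entries[$path]["mtime"] === $mtime) {
            return $this->entries[$path]["code"];   // unchanged: skip step 1
        }
        // New or modified file: compile again and remember the result.
        $this->compilations++;
        $code = compile_source(file_get_contents($path));
        $this->entries[$path] = array("mtime" => $mtime, "code" => $code);
        return $code;
    }
}

// Demonstration with a temporary file standing in for a PHP script.
$path = tempnam(sys_get_temp_dir(), "apc");
file_put_contents($path, "echo 1;");
$cache = new CompileCacheSketch();
$cache->get($path);                  // first request: compiled
$cache->get($path);                  // second request: served from the cache
file_put_contents($path, "echo 2;"); // the source file is modified
touch($path, time() + 10);           // force a different modification time
$latest = $cache->get($path);        // compiled again
unlink($path);
echo $cache->compilations . " compilation(s)\n";
```

Three requests thus cause only two compilations; with many requests between modifications, the saving grows accordingly.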

APC works as a loadable module for PHP which is simply integrated by specifying it in php.ini:

extension = /usr/lib/php4/apc.so



Figure 4.5: Script execution with compiler cache

It starts working as soon as PHP or, respectively, the web server is restarted. By default a total of 30 megabytes is reserved for caching compiled scripts. Figure 4.5 shows how the cache is used. Grey boxes show where the cache repository is accessed.

4.6.2 Alternatives

There are quite a few other compiler caches around that are also worth a try.

The author's choice was APC because of its ongoing development and its PHP open source license.

4.7 Advanced PHP Debugger

The tool called APD is primarily a debugger that can be integrated with PHP. Mainly it provides functions and tools for debugging and profiling. APD acts as a PHP module and is activated and controlled using PHP functions which are provided by the module.

4.7.1 Debugging

The debugging functions of this tool cover the “standard” range offered by commonly used debuggers. This includes setting breakpoints, debugging output, printing stacks and currently used variables, and overriding or renaming functions.

For this diploma thesis debugging is not used extensively, because a working and approved application is being tested; it is assumed that no bugs affect the caching procedure (and if they do, only to a minor degree).

4.7.2 Profiling

Profiling is an important tool for reaching the goal of this diploma thesis. It can be used to spot inefficiencies in source code by measuring the amount of time the processor spends in each function. While the script is being executed, a trace file is generated that contains the information collected about the current execution.

Afterwards this trace file can be processed with the included tool pprof to extract the recorded information. The output of the tool can be tailored to the needs of the analysis through several options. For example, a call tree can be printed that shows the functions called, including their dependencies. The tool can also list per-function totals such as the time and memory consumed or the number of calls.
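The kind of per-function totals described here can be illustrated with a hand-written sketch in plain PHP. It does not use APD and does not produce a trace file; the wrapper profiled(), the array $profile, and the sample function slow_sum() are hypothetical and serve only to show how call counts and elapsed time per function can be accumulated.

```php
<?php
// Accumulated totals per function name: number of calls and seconds spent.
$profile = array();

// Call $callback with $args and record how long the call took.
function profiled($name, $callback, $args = array()) {
    global $profile;
    $start   = microtime(true);
    $result  = call_user_func_array($callback, $args);
    $elapsed = microtime(true) - $start;
    if (!isset($profile[$name])) {
        $profile[$name] = array("calls" => 0, "seconds" => 0.0);
    }
    $profile[$name]["calls"]++;
    $profile[$name]["seconds"] += $elapsed;
    return $result;
}

// Hypothetical function under observation: sums the integers 1..$n.
function slow_sum($n) {
    $total = 0;
    for ($i = 1; $i <= $n; $i++) {
        $total += $i;
    }
    return $total;
}

$a = profiled("slow_sum", "slow_sum", array(1000));
$b = profiled("slow_sum", "slow_sum", array(10));

// Print the totals, comparable to a very small profiler summary.
foreach ($profile as $name => $totals) {
    printf("%s: %d call(s), %.6f s\n", $name, $totals["calls"], $totals["seconds"]);
}
```

A profiler integrated into the interpreter gathers such figures for every function automatically, without the code having to be wrapped by hand.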

4.7.3 Alternatives

Choosing the right tool for this thesis was hard, as all three of the tools introduced have good and distinct features. If debugging alone had been the goal, a combination of all tools for different cases would have been the best choice. As mainly profiling is done, APD provides the best functions; in particular, its tool for processing trace files sets it apart from the others.

4.8 ApacheBench ab

To obtain measurable results, a load generation tool is used to measure the number of requests a server is able to process in a given time. ab is such a tool. It ships together with the Apache web server. Adam Twiss of Zeus Technology Ltd started its development in 1996; it was continued from 1998 by the Apache Software Foundation.

ApacheBench is a tool that just does its task of load generation, not much more. The most important settings used are the number of requests (specified by command line option -n) and the number of concurrent connections (-c).

4.8.1 Alternatives