A benchmark is a documented procedure that will measure the time needed by a computer system to execute a well-defined computing task. It is assumed that this time is related to the performance of the computer system and that somehow the same procedure can be applied to other systems, so that comparisons can be made between different hardware/software configurations.
From the definition of a benchmark, one can easily deduce that there are two basic procedures for benchmarking:
If a single iteration of our test code takes a long time to execute, procedure 1 will be preferred. On the other hand, if the system being tested is able to execute thousands of iterations of our test code per second, procedure 2 should be chosen.
Both procedures 1 and 2 will yield final results in the form "seconds/iteration" or "iterations/second" (these two forms are interchangeable). One could imagine other algorithms, e.g. self-modifying code or measuring the time needed to reach a steady state of some sort, but this would increase the complexity of the code and produce results that would probably be next to impossible to analyze and compare.
Sometimes, figures obtained from standard benchmarks on a system being tested are compared with the results obtained on a reference machine. The reference machine's results are called the baseline results. If we divide the results of the system under examination by the baseline results, we obtain a performance index. Obviously, the performance index for the reference machine is 1.0. An index has no units; it is just a relative measurement.
The final result of any benchmarking procedure is always a set of numerical results which we can call speed or performance (for that particular aspect of our system effectively tested by the piece of code).
Under certain conditions we can combine results from similar tests or various indices into a single figure, and the term metric will be used to describe the "units" of performance for this benchmarking mix.
Time measurements for benchmarking purposes are usually taken by defining a starting time and an ending time, the difference between the two being the elapsed wall-clock time. Wall-clock means we are not considering just CPU time, but the "real" time usually provided by an internal asynchronous real-time clock source in the computer or an external clock source (your wrist-watch for example). Some tests, however, make use of CPU time: the time effectively spent by the CPU of the system being tested in running the specific benchmark, and not other OS routines.
Resolution and precision both measure the information provided by a data point, but should not be confused.
Resolution is the minimum time interval that can be (easily) measured on a given system. In Linux running on i386 architectures I believe this is 1/100 of a second, provided by the GNU C system library function times (see /usr/include/time.h - not very clear, BTW). Another term used with the same meaning is "granularity". David C. Niemi has developed an interesting technique to lower granularity to very low (sub-millisecond) levels on Linux systems; I hope he will contribute an explanation of his algorithm in the next article.
Precision is a measure of the total variability in the results for any given benchmark. Computers are deterministic systems and should always provide the same, identical benchmark results if running under identical conditions. However, since Linux is a multi-tasking, multi-user system, some tasks will be running in the background and will eventually influence the benchmark results.
This "random" error can be expressed as a time measurement (e.g. 20 seconds + or - 0.2 s) or as a percentage of the figure obtained by the benchmark considered (e.g. 20 seconds + or - 1%). Other terms sometimes used to describe variations in results are "variance", "noise", or "jitter".
Note that whereas resolution is system dependent, precision is a characteristic of each benchmark. Ideally, a well-designed benchmark will have a precision smaller than or equal to the resolution of the system being tested. It is very important to identify the sources of noise for any particular benchmark, since this provides an indication of possibly erroneous results.
A synthetic benchmark is a program or program suite specifically designed to measure the performance of a subsystem (hardware, software, or a combination of both). Whetstone is an example of a synthetic benchmark.
In an application benchmark, a commonly executed application is chosen and the time to execute a given task with this application is used as a benchmark. Application benchmarks try to measure the performance of computer systems for some category of real-world computing task. Measuring the time your Linux box takes to compile the kernel can be considered as a sort of application benchmark.
A benchmark or its results are said to be irrelevant when they fail to effectively measure the performance characteristic the benchmark was designed for. Conversely, benchmark results are said to be relevant when they allow an accurate prediction of real-life performance or meaningful comparisons between different systems.
The performance of a Linux system may be measured by all sorts of different benchmarks:
Etc...
Floating-point (FP) instructions are among the least used while running Linux. They probably represent < 0.001% of the instructions executed on an average Linux box, unless one deals with scientific computations. Besides, if you really want to know how well designed the FPU in your processor is, it's easier to have a look at its data sheet and check how many clock cycles it takes to execute a given FPU instruction. But there are more benchmarks that measure FPU performance than anything else. Why?
Etc...
The original Whetstone benchmark was designed in the 60's by Brian Wichmann at the National Physical Laboratory, in England, as a test for an ALGOL 60 compiler for a hypothetical machine. The compilation system was named after the small town of Whetstone, where it was designed, and the name seems to have stuck to the benchmark itself.
The first practical implementation of the Whetstone benchmark was written by Harold Curnow in FORTRAN in 1972 (Curnow and Wichmann together published a paper on the Whetstone benchmark in 1976 for The Computer Journal). Historically it is the first major synthetic benchmark. It is designed to measure the execution speed of a variety of FP instructions (+, *, sin, cos, atan, sqrt, log, exp) on scalar and vector data, but also contains some integer code. Results are provided in MWIPS (Millions of Whetstone Instructions Per Second). The meaning of the expression "Whetstone Instructions" is not clear, though, at least after close examination of the C source code.
During the late 80's and early 90's it was recognized that Whetstone would not adequately measure the FP performance of parallel multiprocessor supercomputers (e.g. Cray and other mainframes dedicated to scientific computations). This spawned the development of various modern benchmarks, many of them with names like Fhoostone, as a humorous reference to Whetstone. Whetstone however is still widely used, because it provides a very reasonable metric as a measure of uniprocessor FP performance.
Whetstone has other interesting qualities for Linux users:
The version of the Whetstone benchmark that we are going to use for this example was slightly modified by Al Aburto and can be downloaded from his excellent FTP site dedicated to benchmarks. After downloading the file whets.c, you will have to edit the source slightly: a) Uncomment the "#define POSIX1" directive (this enables the Linux-compatible timer routine). b) Uncomment the "#define DP" directive (since we are only interested in the Double Precision results).
This benchmark is extremely sensitive to compiler optimization options. Here is the line I used to compile it: cc whets.c -o whets -O2 -fomit-frame-pointer -ffast-math -fforce-addr -fforce-mem -m486 -lm.
Note that some compiler options are buggy in some versions of gcc: most notably, any of -O, -O2, -O3, ... combined with -funroll-loops can cause gcc to emit incorrect code on a Linux box. You can test your gcc with a short test program available at Uwe Mayer's site. Of course, if your compiler is buggy, then any test results are not written in stone, to say the least (pun intended). In short, don't use -funroll-loops to compile this benchmark, and try to stick to the optimization options listed above.
Just execute whets. Whetstone will display its results on standard output and also write a whets.res file if you give it the information it requests. Run it a few times to confirm that variations in the results are very small.
Some motherboards allow you to disable the L1 (internal) or L2 (external) caches through the BIOS configuration menus (take a look at the motherboard's manual; the ASUS P55T2P4 motherboard, for example, allows disabling both caches separately or together). You may want to experiment with these settings and/or main memory (DRAM) timing settings.
You can try to compile whets.c without any special optimization options, just to verify that compiler quality and compiler optimization options do influence benchmark results.
The Whetstone benchmark main loop executes in a few milliseconds on an average modern machine, so its designers decided to provide a calibration procedure that will first execute 1 pass, then 5, then 25 passes, etc., until the calibration takes more than 2 seconds, and then guess a number of passes xtra that will result in an approximate running time of 100 seconds. It will then execute xtra passes of each one of the 8 sections of the main loop, measure the running time for each (for a total running time very near to 100 seconds) and calculate a rating in MWIPS, the Whetstone metric. This is an interesting variation on the two basic procedures described in Section 1.
The main loop consists of 8 sections each containing a mix of various instructions representative of some type of computational task. Each section is itself a very short, very small loop, and has its own timing calculation. The code that gets looped through for section 8 for example is a single line of C code:
x = sqrt(exp(log(x)/t1)); where x = 0.75 and t1 = 0.50000025, both defined as doubles.
Compiling as specified above with gcc 2.7.2.1, the resulting ELF executable whets is 13 096 bytes long on my system. It calls libc and of course libm for the trigonometric and transcendental math functions, but these should get compiled to very short executable code sequences since all modern CPUs have FPUs with these functions wired-in.
Now that we have an FPU performance figure for our machine, the next step is comparing it to other CPUs. Have you noticed all the data that whets.c asked you for after you had run it for the first time? Well, Al Aburto has collected Whetstone results for your convenience at his site; you may want to download the data file and have a look at it. This kind of benchmarking data repository is very important, because it allows comparisons between various different machines. More on this topic in one of my next articles.
Whetstone is not a Linux-specific test, or even an OS-specific test, but it certainly is a good test for the FPU in your Linux box, and also gives an indication of compiler efficiency for specific kinds of applications that involve FP calculations.
I hope this gave you a taste of what benchmarking is all about.