Guide to performing meaningful computer benchmarking tests for productivity

Here you will learn how to see past “benchmarketing” hype and get real results, so that you have a clear picture of what you’re about to buy.

Benchmark programs try to analyze the “performance” of a computer on a well-defined test bed, which is normally not even close to the final configuration of the computer. The benchmarker can take advantage of the limits to be confronted (system resources, virtual software environment, applications, processing load simulation, user load simulation, and system use simulation, among others) and do whatever is needed to surpass those limits and deliver a questionable result. Now, “having an industry standard benchmark is always a good thing, if the benchmarks are not gamed. Sadly, many industry benchmarks are gamed. […] Yet many organizations seem to buy hardware based on benchmarks rather than understand their workloads and buy hardware based on their requirements and, of course, budget.” (Newman, 2015) This is why the customer needs to know exactly what is meant to be measured. A benchmark program will never be able to reflect the so-called real-world experience; it will only reflect how well that program runs on the intended computer. It is therefore the customer’s responsibility to determine how the result will be applied in its productive environment.

Surprisingly, benchmarking correctly is really difficult, because there are many ways to produce bad or misleading results and to omit things. The white paper “A Nine Year Study of File System and Storage Benchmarking” summarizes this:

In this article we survey 415 file system and storage benchmarks from 106 recent papers. We found that most popular benchmarks are flawed and many research papers do not provide a clear indication of true performance. (Traeger, Zadok, Joukov, & Wright, 2008)

That white paper states that benchmarks should explain what is to be tested and why, and that they should carry out (or make easier) some analysis of the expected system performance.

The “Performance Anti-Patterns” article highlights what benchmarks should be run for and how they should be run. According to it, a good benchmark should be:

Repeatable, so experiments of comparison can be conducted relatively easily and with a reasonable degree of precision.
Observable, so if poor performance is seen, the developer has a place to start looking. Nothing is more frustrating than a complex benchmark that delivers a single number, leaving the developer with no additional information as to where the problem might lie.
Portable, so that comparisons are possible with your main competitors (even if they are your own previous releases). Maintaining a history of the performance of previous releases is a valuable aid to understanding your own development process.
Easily presented, so that everyone can understand the comparisons in a brief presentation.
Realistic, so that measurements reflect customer-experienced realities.
Runnable, so that all developers can quickly ascertain the effects of their changes. If it takes days to get performance results, it will not happen very often.

Not all benchmarks selected will meet all of these criteria, but it is important that some of them do. Smaalders recommends choosing benchmarks that really represent the customer’s needs; otherwise all the effort will end up optimizing for the wrong behavior. He also encourages resisting the temptation to optimize for the benchmark with the goal of winning the contest at any price. Such behavior produces fake results that could undermine confidence in the brand or the provider. Generally, a benchmark will highlight the aspect it is optimized for, at the expense of other aspects that are not being measured (and could be important for the customer). (Smaalders, 2006)

There is another factor when comparing systems with the intention of buying them: the price/performance ratio. This ratio can be quantified by including the 5-year capital cost of the equipment. (Anon & Gray, 1985)
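As a rough illustration of that ratio, the sketch below divides a five-year cost by an averaged benchmark score to get a cost per benchmark point. The function name, the system names and all figures are hypothetical, and treating a lower cost per point as better is an assumption of this sketch rather than a formula taken from Anon & Gray.

```python
def price_performance(five_year_cost: float, score: float) -> float:
    """Return the cost per benchmark point (lower is better in this sketch)."""
    if score <= 0:
        raise ValueError("score must be positive")
    return five_year_cost / score


# Hypothetical offers: (five-year cost in currency units, averaged benchmark score).
offers = {
    "System A": (1200.0, 3000.0),
    "System B": (1500.0, 4000.0),
}

ratios = {
    name: price_performance(cost, score)
    for name, (cost, score) in offers.items()
}

# The offer with the lowest cost per point has the best ratio.
best = min(ratios, key=ratios.get)
```

In this made-up case the more expensive system still comes out ahead, because its higher score more than offsets its higher cost.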

Simple testing with benchmark software does not amount to a full performance analysis. Benchmark programs normally work in a controlled environment, so they can be manipulated to get a desired rate, namely the highest possible. Nevertheless, such a rate will NEVER correlate properly with actual customer experience. The metrics only provide a score that represents how well the computer runs a particular benchmark. A big problem is that the user never really knows what is tested, how aspects of a benchmark score are measured, which flags were set in the compiler, which code or libraries are underneath the program, and, most importantly, whether what the program tests reflects the real intended use of the computer. Some points are therefore worth remarking:

A benchmark will never reflect real-world use. Being a fully automated program, it will open, write, set, read, save, show, look, navigate, move, calculate, assign, and do many other tasks in a fully automated way. That will never reflect the pace at which a human does his or her job. Therefore, any benchmark program that promises results based on real-world usage is lying.
Not all benchmarks evaluate multitasking performance. The vast majority of benchmarks only evaluate serially executed tasks: they test one thing and then the next. They will rarely make one application do something while another application is doing something else (and, when they do, they normally try two or, at most, three instances). In real life, users run multiple applications concurrently. Most users open many applications at a time (and the operating system also executes many tasks while we carry out our daily work). It is worth mentioning that the result of a benchmark that exercises just one process or application does not scale linearly to the second, third, and remaining processes or applications. Therefore, with modern multicore technology, it is important to take the results of a benchmark program with a grain of salt.
“Rate” is not the same as “performance”. Rate is just a number given by the program. Performance is something more complicated. The rate depends on the benchmark program and the scale set by its developers. Performance is the accomplishment of a given task measured against preset known standards of accuracy, completeness, cost, and speed (The Business Dictionary, 2017). Therefore, no benchmark program rate reflects a performance index.
Benchmarks are inaccurate. The variation in the rate of a benchmark program can be up to 10% (sometimes more). Benchmark rates are therefore subjective to a degree, which is contrary to the objective principles of science and technology. That is why a benchmark program should be run at least 3 times to calculate a mean of the given rate. After determining the mean, a variation of at least ±3% around the result should be expected (the actual variation can be calculated from the results gathered).
Benchmark results can be manipulated. There are ways to manipulate the results of a benchmark program and offer artificially high rates just to impress the user. This can be done through many techniques, like excessive tweaking, BIOS settings, hardware alteration, use of special drivers, code manipulation, and others. Benchmark rates can be obtained with settings that are unrealistic compared to how the computer will be used in practice: the system is required to be free of programs and processes and to have specific values that depend on the particular benchmark, and this does not reflect the way the computer will be used for productivity. As Henry Newman stated: “What a benchmark tell us is: 1) How much hardware a vendor can cram into a box, 2) How well the vendor team can optimize software [for the benchmark], and, 3) How badly the vendor wants the deal.” (Olds & OrionX, 2011) If the customer does not clearly state a benchmark protocol, the door is open to any trick the benchmarker can use to reach (or surpass) the desired results and impress the customer.
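The point about repeated runs can be sketched as follows: take at least three rates, compute their mean, and derive the variation from the gathered results. Measuring the variation as the largest deviation from the mean, in percent, is an assumption of this sketch; the function name and the sample rates are hypothetical.

```python
from statistics import mean


def summarize_runs(rates: list[float]) -> tuple[float, float]:
    """Return (mean rate, largest deviation from the mean in percent)."""
    if len(rates) < 3:
        raise ValueError("run the benchmark at least 3 times")
    average = mean(rates)
    # Assumption: the +/- variation is the worst-case distance from the mean.
    deviation_pct = max(abs(rate - average) for rate in rates) / average * 100
    return average, deviation_pct


# Hypothetical rates from three runs of the same benchmark on one computer.
average, deviation_pct = summarize_runs([2950.0, 3000.0, 3050.0])
```

If the deviation computed this way is much larger than the few percent mentioned above, the runs are probably being disturbed by something and are worth repeating before any comparison is made.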

In the end, a benchmark program is not a precise tool and should be used with care. As Henry Newman is quoted as saying in a personal communication, “Comparing system performance tools […] with […] benchmarks is like comparing apples with flying pigs.” (Carrier, 2012)

Scores provided by a benchmark should be used in conjunction with other benchmarks and additional details to derive a full understanding of overall system performance. The benchmark, in the best of cases, is only “measuring the speed” of a computer, but there are many other aspects that need to be taken into account: commercial standards, military standards, functionality, features, certifications, and price, among others.

If the computer being evaluated for purchase needs to meet a variety of different user requirements, it is best to determine the intended use cases in advance so that a detailed set of appropriate qualification criteria can be established. In most cases, the computer will be used in one or more of the following contexts:

Use with current graphical operating environments (Windows 10, or GNU/Linux distributions at the least)
- Or use with previous versions of operating systems, like Windows 7 or Windows 8.1, or a previous GNU/Linux distribution.
Basic Productivity (No more than 5 applications running, including):
- Antivirus
- Word processing
- Email
- Light use of spreadsheets
- Web-based applications
- Web browsing with no more than 6 tabs open.
Standard Productivity (5 to 10 applications running, including):
- Same as Basic Productivity, and also…
- Office applications (Word processing with image manipulation, spreadsheets with formulas and some scripts, Presentations, Basic database management)
- Web browsing with up to 15 tabs
- Web conferencing
- Viewing and simple editing of images and videos
- Education
- Can assume copious use of videos, images, animations, web access and applications that currently use accelerated computing.
Power User (More than 10 applications running, including):
- Same as Standard Productivity, and also…
- Application development
  - Using programming languages and environments
  - Creating, managing and testing databases,
  - Testbed with virtualization
  - Controlled testing environments
- Programs for scientific research
  - Specialized applications
  - Engineering and scientific applications
  - Virtual Reality
Just for general knowledge, games are currently considered high performance computing (Stevenson, Le Du, & El Afrit, 2011)
Other criteria
- Is low power consumption required?
- Is the space occupied by the computer important or limited?
- Is mobility required?
- Is battery life important?
- Is weight important?
- Certain other uses that imply portable use in harsh, dusty or noisy environments
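The three productivity tiers above can be turned into a simple rule of thumb, sketched below. The function name is hypothetical, and two boundary choices are assumptions: exactly five applications is treated as Basic, and anything beyond the Standard limits is treated as Power User.

```python
def usage_profile(running_apps: int, browser_tabs: int) -> str:
    """Map a workload to one of the three productivity tiers listed above."""
    if running_apps < 0 or browser_tabs < 0:
        raise ValueError("counts must be non-negative")
    # Assumption: exactly 5 applications counts as Basic, because the list
    # gives 'no more than 5' for Basic and '5 to 10' for Standard.
    if running_apps <= 5 and browser_tabs <= 6:
        return "Basic Productivity"
    if running_apps <= 10 and browser_tabs <= 15:
        return "Standard Productivity"
    # Assumption: anything beyond the Standard limits is a Power User load.
    return "Power User"
```

For example, `usage_profile(8, 12)` returns "Standard Productivity". A real qualification would also look at which applications are running, not only how many.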

The term “perceptual benchmark” may seem odd, but it names an issue that is generally ignored or neglected. It refers to the following question: What response time really matters to the user? Speed and performance are actually relative terms. Ilya Grigorik proposes an interesting concept of what the word “performance” means.

“Performance is not just milliseconds, frames, and megabytes. It’s also how these milliseconds, frames, and megabytes translate to how the user perceives the application.” (Grigorik, 2014)

Each application prescribes its own set of requirements according to business criteria, context, user expectations, and perceptual processing time constants entirely oriented to the user. Again, user expectations relate to Maister’s first law: “Satisfaction equals perception minus expectation.” (Maister, 1985) No matter how much life speeds up, or at least is perceived to accelerate, our reaction times remain constant. If we consider that a user can see around 15 frames per second (about one frame every 66 ms) according to traditional studies (Thorpe, Fize, & Marlot, 1996), the following table (based on the Military Standard 1472G) gives a clear idea of the response time a user normally expects. This holds regardless of the type of application (installed on the computer or online) or device (laptop, desktop or mobile device).

Actual Time and User Perception (Seow, 2008)

0 – 100 ms: Instantaneous
100 – 500 ms: Immediate
500 – 1000 ms: Fast
1 – 10 s: A lag is perceived, but the user does not lose focus.
+10 s: The computer is too slow to hold the user’s attention.

For the response to a user request to be perceived as fast, it must arrive in under one second. If it takes one second or more, the user may perceive a certain lag, but her attention will not be called away from the task. After 10 seconds, unless the program delivers some type of information to the user, the task will normally be abandoned and the user will experience annoyance. If the computer being evaluated can deliver responses within these times, and in any case in under 10 seconds, the user will get more out of the computer. These thresholds are much more useful in the real world than the results of benchmark programs, which do not offer clear guidance as to their meaning.
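The thresholds in the table can be expressed as a small lookup, sketched below, for labelling response times measured on a candidate computer. The function name is hypothetical. The table's ranges share their end points, so assigning exactly 100 ms and 500 ms to the faster category is an assumption; one second or more counts as a perceived lag, as stated in the text.

```python
def perceived_speed(response_ms: float) -> str:
    """Label a response time with the user perception from the table above."""
    if response_ms < 0:
        raise ValueError("response time must be non-negative")
    if response_ms <= 100:
        return "Instantaneous"
    if response_ms <= 500:
        return "Immediate"
    if response_ms < 1000:
        return "Fast"
    if response_ms <= 10_000:
        return "Lag perceived, focus kept"
    return "Too slow to hold attention"
```

For example, `perceived_speed(300)` returns "Immediate", while a 12-second wait falls into the last category.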

That said, for a benchmark to be effective the customer should know how it will be applied, and which analysis and conclusions will be obtained from it. For the analysis, it is important to understand or provide:

A threshold or reference for the minimum expected result
What is to be tested
What the limiting factors are
Any perturbation that could affect the results
The details of the system tested
The price (at least, the average) of the system tested
What conclusions the results are expected to support

The threshold can be obtained from public sites (like Futuremark), or it can be created locally from a well-configured computer currently in use, in order to set the baseline for each benchmark against which the computers on offer will be compared. Jot down the configuration details of the baseline computer (processor, memory [amount, speed, configuration, timings], storage, graphics and monitor) to have a clear idea of it.

The analysis of benchmarks requires time and experience to be done correctly. As said, the most important part is to determine what is to be measured and whether the results obtained are meaningful for the intended use of the computers.

The following is a proposed benchmarking protocol intended to ensure, as far as possible, fair and realistic results from the selected or applied benchmarks.

It is recommended not to leave the benchmarker alone during the configuration and benchmarking processes, especially if the benchmarker is a third party. The customer must assign a witness to jot down everything the benchmarker does to configure the machine for the benchmark process. The witness should also note any changes or modifications the benchmarker makes after each type of benchmark test. If you, as the customer, have a well-defined protocol, the benchmarker must not violate it just to win the benchmark process. The benchmarker and the witness should not be the same person, again especially if the benchmarker is a third party.

No matter the brand or the hardware offered, all the computers should meet the configuration criteria:

If you asked for physical quad-core processors, all of them should have four physical cores.
If you asked for a certain amount, speed and configuration of RAM, check that all the computers have the same configuration. Jot down any differences:
- Size
- Speed (MT/s)
- Timings and latencies (CAS, RAS, tRAS, tRC, Frequency).
- Single channel vs Dual Channel
If you asked for a certain type of storage, check that all the computers include that kind of storage. Jot down any differences found (throughput, seek times and write times):
- Standard Rotational Hard Disk (5400RPM, 7200RPM, 10000RPM, SSHD)
- SSD
The video card should support Shader Model 6.1 (DirectX 12.1) for Windows 10 or Shader Model 5 (DirectX 11) for previous versions of Windows.
The monitor should have the same features on every computer
- Refresh rate
- Response times (ms)
- Resolution (higher resolutions could bring lower benchmark numbers)
- Color depth
The operating system should be the same version and build.
- In Windows you can see the version and build by pressing Windows+R to open the Run window (or by using the Cortana search box), typing winver and pressing Return.
Every computer should have only the current drivers approved by the computer manufacturer installed. Note: Don’t allow the use of special or tweaked drivers supplied by a component manufacturer, as they could produce fake results. Avoid any drivers other than those validated and publicly available on the web site or setup tool of the computer manufacturer.
Configure the computer in Balanced mode. This is the way the customer is expected to use the computers, and it is the most faithful way to get results from the benchmarks.
Install the customer’s commonly used applications. Even if they will not be used in the test, this is the way the computers will be used.
Install any other software (like antivirus and tools) required by the customer. This brings the configuration as close as possible to the one the end user will be using.
Install the benchmark programs.
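The checklist above comes down to comparing what was requested with what each computer actually has and jotting down every difference. A minimal sketch of that comparison follows. The `MachineConfig` fields cover only part of the checklist, and every value, including the build string, is a hypothetical placeholder.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class MachineConfig:
    """A subset of the checklist items; extend it with the rest as needed."""
    physical_cores: int
    ram_gb: int
    ram_speed_mts: int
    dual_channel: bool
    storage_type: str
    os_build: str
    power_mode: str


def config_differences(requested: MachineConfig, offered: MachineConfig) -> dict:
    """Return {field: (requested value, offered value)} for every mismatch."""
    wanted, found = asdict(requested), asdict(offered)
    return {
        field: (wanted[field], found[field])
        for field in wanted
        if wanted[field] != found[field]
    }


# Hypothetical request, and an offer that differs in its RAM setup.
requested = MachineConfig(4, 16, 3200, True, "SSD", "example-build", "Balanced")
offered = replace(requested, dual_channel=False, ram_speed_mts=2666)
differences = config_differences(requested, offered)
```

An empty result means the offered computer matches the request on every recorded field. Anything else is what the witness should write down before the benchmarks start.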

Once the computers are set up, hot benchmarking is recommended. Hot benchmarking requires an engineer (an engineer, not a technician) to take note of everything that happens during the benchmark process (skipped parts of the test, missing steps, screen misbehaviors, badly drawn figures or images, and so on). Such anomalies should be noted and reported. It is advisable to avoid cold benchmarking (running the benchmark, walking away from the computer and returning only to jot down the result), because no evidence of strange behavior during the benchmark will be recorded. As stated, there are ways to manipulate benchmark rates, and cold benchmarking is the best way to miss such practices.

As benchmark rates can vary by 5 to 15%, it is recommended to run each test at least 3 times. Before each run, reboot the computer, wait about 5 minutes after the desktop appears and then run the benchmark. After each run, take a screen capture of the result to save the evidence. This should be done for each benchmark program selected.

It is up to the customer to decide whether the different rates obtained in each run of each benchmark will be averaged, or whether the highest or lowest value will be taken. Whatever the customer decides, it is recommended to apply the same rule to all the benchmarks used. As a suggestion, go for the average.
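One way to keep that decision consistent is to choose the rule once and apply it to every benchmark in a single pass, as in the sketch below. The policy names, the function name and the benchmark names and rates are hypothetical.

```python
from statistics import mean

# The three options mentioned above, keyed by illustrative names.
AGGREGATORS = {"average": mean, "highest": max, "lowest": min}


def aggregate_all(runs_by_benchmark: dict, policy: str = "average") -> dict:
    """Apply one aggregation policy to the runs of every benchmark."""
    if policy not in AGGREGATORS:
        raise ValueError(f"unknown policy: {policy}")
    aggregate = AGGREGATORS[policy]
    return {name: aggregate(rates) for name, rates in runs_by_benchmark.items()}


# Hypothetical rates from three runs of two benchmarks on one computer.
runs = {
    "office": [100.0, 110.0, 120.0],
    "media": [200.0, 190.0, 210.0],
}
averaged = aggregate_all(runs)
highest = aggregate_all(runs, policy="highest")
```

Because the policy is a single argument, it cannot quietly change between benchmarks or between the computers being compared.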

Once obtained, the results can be normalized (as in the body of this document) and then converted linearly into “times”, that is, multiples of the baseline. Together with the price of the computers, this also lets the customer evaluate which system gives the best price/performance ratio.
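A minimal sketch of the normalization step follows, assuming that a higher rate is better for every benchmark and that the baseline computer defines 1.0. The function name, benchmark names and rates are hypothetical, and reading "times" as multiples of the baseline is this sketch's interpretation.

```python
def normalize(rates: dict, baseline: dict) -> dict:
    """Express each rate as a multiple of the baseline rate (1.0 = baseline)."""
    return {name: rates[name] / baseline[name] for name in baseline}


# Hypothetical aggregated rates for the baseline and one candidate computer.
baseline = {"office": 2000.0, "media": 1000.0}
candidate = {"office": 3000.0, "media": 900.0}

normalized = normalize(candidate, baseline)

# Benchmarks where the candidate falls below the baseline threshold.
below_threshold = [name for name, ratio in normalized.items() if ratio < 1.0]
```

Here the candidate is 1.5 times the baseline on the office test but falls short on the media test, which would be hidden if only one combined number were reported.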

A benchmarking process requires time, experience and patience. If it is done well, benchmarking can give a good idea of the expected performance of the computer. Everything the witness jots down will be useful to determine what the contestant will need to provide if selected. The configuration (processor, RAM [amount, speed, timings, channel mode], storage [type, throughput, capacity], monitor, form factor, etc.) recorded during the benchmarking process will be useful to ensure that the computer delivered is configured exactly as tested. This is important because some abusive providers can deliver a specially configured computer just for the testing process, and a very different one in the end. Therefore, this will help the customer to get exactly what was tested.

From all the preceding discussion, the following can be concluded:

The customer is purchasing computers, not just a processor or a certain component. It is therefore necessary to take the overall system into account (holistically) when making a buying decision.
The performance of a computer derives from all its elements. This includes hardware and software. The overall performance of the computer will always be limited by the performance of its slowest element.
It is necessary to take into account energy-saving measures, heat generation, the stability of the computer, its certifications for business use and the security services it offers. Beyond benchmark-based speed, current technology calls for reduced energy consumption and built-in security services.
It is important to be aware that the new technologies integrated into applications and operating systems leverage much more than just the processor. They rely on the CPU together with other components such as the GPGPU, buses, RAM speed and disk transfer rate.

A true measure of computing power is obtained when the computer is measured holistically, not just in the area of serial processing or the CPU. The customer who uses office tools, web browsers, file compression, video players, teleconferencing tools, web-based applications and the like will take advantage of all the features of these new technologies with heterogeneous architecture.

A final note: a threshold or reference is always needed in order to get a better idea of the performance improvement you are about to receive. If you don’t want to use the base measure offered by Futuremark for a “Reference Office PC” (adjusted to the current date), you can measure a base computer in your own office according to your own criteria (perhaps a computer whose standard performance you feel comfortable with). Once you have run the benchmark and obtained its results, you can use them as a threshold that sets the minimum expected rates for the solutions offered. One more thing to note is that “rate” is not the same as “performance”. A “rate” just gives a score for the processes executed by the benchmark program. “Performance” is the real result you will obtain as you use the computer for your own tasks.

Anon, E. A., & Gray, J. (February, 1985). A Measure of Transaction Processing Power. Retrieved March 22, 2015, from Internet Archive: https://archive.org/details/bitsavers_ta...

Carrier, J. (April 24, 2012). HPCS I/O Scenarios. Retrieved from OpenSFS: http://cdn.opensfs.org/wp-content/upload...

Computerhope. (March 15, 2015). Thrashing. Retrieved from Computer hope: http://www.computerhope.com/jargon/t/thr...

Gregg, B. (2014). Systems Performance: Enterprise and the Cloud (1 ed.). USA: Pearson Education.

Grigorik, I. (March 12, 2014). Speed, Performance, and Human Perception. (Fluent, Ed.) San Francisco, CA, USA. Retrieved March 22, 2015, from https://www.youtube.com/watch?v=7ubJzEi3...

Hoff. (December 30, 2006). Multicore, SMP and SMT Processors. Retrieved March 22, 2015, from HoffmanLabs: http://labs.hoffmanlabs.com/node/13

Maister, D. (1985). The Psychology of Waiting Lines. (T. S. Encounter, Ed.) Retrieved March 22, 2015, from David Maister: Professional Business, Professional Life: http://davidmaister.com/wp-content/theme...

Mallik, A. (2007). Holistic Computer Architectures based on Application, User, and Process Characteristics. Evanston, Illinois, USA: UMI.

Newman, H. (2015). Data Storage Issues: Big Data Benchmarking. Retrieved from InfoStor: http://www.infostor.com/index/blogs_new/...

Olds, D., & OrionX. (December 19, 2011). Benchmarks are $%#&@!! Retrieved from The Register: http://www.theregister.co.uk/2011/12/19/...

Osterhage, W. (2013). Computer Performance Optimization (1 ed.). (Springer-Verlag, Trans.) Niederbachem, Germany: Springer-Verlag Berlin Heidelberg.

Seow, S. (2008). Designing and Engineering Time. Boston, USA: Prentice Hall.

Smaalders, B. (February 23, 2006). Performance Anti-Patterns. doi:1542-7790/06/0200

Stevenson, A., Le Du, Y., & El Afrit, M. (March, 2011). High Performance Computing on Gamer PCs. Retrieved from ArsTechnica: http://arstechnica.com/science/2011/03/h...

The Business Dictionary. (2017). Performance. Retrieved from The Business Dictionary: http://www.businessdictionary.com/defini...

Thorpe, S., Fize, D., & Marlot, C. (June 6, 1996). Speed of processing in the human visual system. Nature, 381, 520-522. Retrieved from Quora: http://cns.bu.edu/Profiles/Mingolla.html...

Traeger, A., Zadok, E., Joukov, N., & Wright, C. (May, 2008). A Nine Year Study of File System and Storage Benchmarking. Retrieved March 22, 2015, from File systems and Storage Lab (FSL): http://www.fsl.cs.sunysb.edu/docs/fsbenc...

Vieira, L. (October 3, 2011). The Perception of Performance. Retrieved March 22, 2015, from Sitepoint: http://www.sitepoint.com/the-perception-...

Author

Encom


Meaningful Computer Benchmarking Process

Background

Benchmarking results and analysis

Intended Use Cases

The Perceptual Benchmark

Benchmarking Protocol

Set who will do and who will witness the benchmarking process

Set common configurations

Run the benchmarks

Normalize and analyze the results

Final Notes

References
