The Center for Education and Research in
Information Assurance and Security (CERIAS)

CERIAS - Case Study: Empty Optimizations

By Dr G. Jan Wilms

"Virtually every major graphics board vendor has been in the PC Labs doghouse at sometime over the past several years, including ATI, Diamond, Matrox, and Number Nine. They weren't cheating; let's just say they were trying to push their advantages a bit too far." Bill Howard. 'Frontiers of Benchmark Testing'. PC Magazine, June 25th 1996.

Michael M, Editor-in-Chief of PC Magazine, was looking at the executive report on the latest graphics benchmarks, which were to appear in the June 29th issue. As he got deeper into the summary, his face took on a baffled look. He picked up the phone to call Bill M, Vice President for Technology, and asked him to come by his office with the detailed test results. Five minutes later, they were poring over the data on Bill's laptop.

What had Michael so puzzled was the Graphics Winmark score of the WinBench performance test for the Pegasus Plumbago card. Its top score outperformed the nearest competitor by a wide enough margin to raise a red flag, because the informal hands-on tests Michael had done with real applications on the testing machines had given the impression of middle-of-the-road performance. Again he wished that there were a way to produce meaningful test scores using application-based testing; past experiments had shown, however, that such testing is very susceptible to variations in software version and program modules used. Hence PC Magazine had developed a synthetic test which repeatedly runs through some of the common graphics functions supported by MS Windows. Programs have to use this API since the multi-tasking Windows operating system doesn't allow them to write directly to the display. The benchmark then reports a weighted average of these iterations as a single score, which the lab hoped was a good approximation of in-depth performance.
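As a rough illustration of how a single score can be derived from several sub-tests, here is a minimal sketch of a weighted average. The sub-test names and weights are hypothetical; the case does not give the actual WinBench formula.

```python
def weighted_score(results: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-function results into one score using a weighted average.

    `results` maps a sub-test name to its measured throughput;
    `weights` maps the same names to their (hypothetical) importance.
    """
    total_weight = sum(weights[name] for name in results)
    weighted_sum = sum(results[name] * weights[name] for name in results)
    return weighted_sum / total_weight


# Hypothetical sub-tests and weights, for illustration only.
example_results = {"area_fill": 120.0, "line_draw": 80.0, "bitblt": 200.0}
example_weights = {"area_fill": 5, "line_draw": 3, "bitblt": 2}
```

With these made-up numbers, `weighted_score(example_results, example_weights)` gives 124.0. Because everything collapses into one figure, an inflated result on a single sub-test raises the overall score without showing where the gain came from.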

"This isn't another case of Chang modifications to the hardware, is it?" he mused aloud. In the early 80s, Chang, an engineer in Taiwan, had found a way to "fool" CPU benchmark tests by using faster clock crystals so that the benchmark code seemingly ran faster. "Those 286es were utter failures in keeping time, but their test results looked great - sometimes as much as 50 percent faster," remembered Bill. Some graphics vendors had recently tried similar "stunts" by patching the Windows Graphics Device Interface (GDI). Such "hot-rodding" of the boards, as Billy H, a fellow editor, calls it, does lead to a 15 percent boost in graphics performance, but it makes system calls less reliable and is thus a questionable tactic. "No," replied the VP for Technology, "we had Billy look into this. He suspects it may be a case of caching by the device driver."

"Isn't that a form of optimization?" asked Michael. "Yes," was the reply. "I had Billy explain it to me." And he related that, unlike the old framebuffer boards, new graphics adaptors implement some of the common internal Windows graphics functions, such as area fills and line drawing, directly on the video card; the manufacturer's display driver redirects these calls to the specialized hardware instead of running them on the system's CPU. "That must be the unfair acceleration some vendors complained to me about," interjected Michael. "No," was the response, "they pretty much all do that nowadays. The letters you got expressed concern about optimizations from ATI and Weitek that involve caching parts of the screen image in unused video memory." He went on to explain that bypassing system memory, and thus avoiding repeated trips across the system bus, provided substantial savings, especially with BitBlt operations, which involve moving an image across the screen. But it is controversial because it benefits only a select category of applications, such as Paint.

"Hmm," grunted Bill, "this could tempt some vendors to perform empty optimizations." He meant that an unscrupulous manufacturer could program the device driver to watch the board for an in-place redraw request of the same bitmapped image, and to return from the function call immediately without actually redrawing anything. Such a scenario would only occur in a benchmark testing situation. "Why don't you set up a conference call with Pegasus? Let's see if they can explain this surprise finish."
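The idea of an "empty optimization" can be sketched in a few lines. The classes below are a hypothetical model, not real driver code: `pixels_written` stands in for the actual drawing work, and the benchmark loop simply redraws one image in place, as a synthetic test might.

```python
class HonestDriver:
    """Hypothetical driver model: every blit request is actually drawn."""

    def __init__(self) -> None:
        self.pixels_written = 0

    def blit(self, bitmap: bytes, x: int, y: int) -> None:
        # Stand-in for the real work of copying the image to the screen.
        self.pixels_written += len(bitmap)


class EmptyOptimizingDriver(HonestDriver):
    """Skips the redraw when the same bitmap is sent to the same position."""

    def __init__(self) -> None:
        super().__init__()
        self._last_request = None

    def blit(self, bitmap: bytes, x: int, y: int) -> None:
        request = (bitmap, x, y)
        if request == self._last_request:
            return  # report success without touching the screen
        self._last_request = request
        super().blit(bitmap, x, y)


def run_benchmark(driver: HonestDriver, iterations: int = 1000) -> int:
    """Redraw one 64-byte image in place and return the work actually done."""
    image = bytes(64)
    for _ in range(iterations):
        driver.blit(image, 10, 10)
    return driver.pixels_written
```

In this model the honest driver performs 64,000 units of work over 1,000 iterations, while the "optimizing" driver performs 64 and would therefore appear far faster on a timer, even though a real application rarely repeats an identical redraw a thousand times.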

When John T. was paged by his secretary that PC Labs wanted to talk to him, he didn't know whether he should be worried or relieved. He had been expecting this call for a while now. When he was made Project Manager of the Plumbago card sixteen months earlier, the Chief Engineer had made it clear to him that the future of the company might be riding on the success of this card, and that he expected an end-product that would do the same for the 3-D market as their Vodoun card did for the 2-D line. The Plumbago was to be built around the AGX-014 chip manufactured by IIT, but others were using the same chipset, so John was looking for an added advantage to outperform the competition. The obvious place to concentrate on was the graphics device driver - the extension of the operating system that controls the video card and takes advantage of its proprietary opcodes and memory. Whereas a good driver can do very little to overcome the limitations of the hardware, a badly designed driver can cause an excellent device to underperform by failing to take advantage of its power. Device drivers often continue to be tweaked long after the hardware is released.

The Holy Grail of endorsements is the coveted PC Magazine Editor's Choice Award. It will often translate into millions of dollars in sales. This was the goal John set for himself, and to this end he ran the WinBench benchmark test suite each time the hardware was changed or the driver adjusted. He compared the results not only to existing products on the market, but also to the few pre-production models from the competition he could get his hands on. And while he was able to steadily improve the performance of the card, after about 13 months of development he reached a plateau: very solid numbers, but nothing that made the board stand out from similar cards. And the scores were always lower than those of boards from Number Nine Corporation.

Desperate, he had a Number Nine board and its driver reverse-engineered. His team discovered something curious: unlike bitmap caching, which does benefit some types of applications, the Number Nine driver caches text strings; this improves performance, but only in the very limited case where the same string is drawn over and over, i.e. in tests only. Because time was running out, John couldn't add the same "intelligence" to his own driver, but because he had previously disassembled the Graphics Winmark test, he knew exactly what string PC Magazine used to test the cards. So under duress, he ordered that a routine be added to the device driver which would "cache" the hardcoded string. Although nothing else significant was altered, this one modification gave an immediate boost to the benchmark scores.
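A minimal sketch of what such a hardcoded shortcut might look like follows. Everything here is hypothetical: the case does not disclose the actual test string or any driver code, so `BENCHMARK_STRING`, `rasterize`, and `RiggedTextDriver` are illustrative names only.

```python
# Hypothetical: the real benchmark string is not given in the case.
BENCHMARK_STRING = "The quick brown fox"


def rasterize(text: str) -> list[int]:
    """Stand-in for the expensive step of turning characters into pixels."""
    return [ord(ch) for ch in text]


class RiggedTextDriver:
    """Text output that special-cases one string known in advance."""

    def __init__(self) -> None:
        self.glyphs_rasterized = 0
        # Prepared once at start-up, outside any timed test loop.
        self._canned = rasterize(BENCHMARK_STRING)

    def draw_text(self, text: str) -> list[int]:
        if text == BENCHMARK_STRING:
            return self._canned  # the hardcoded "cache" hit
        self.glyphs_rasterized += len(text)
        return rasterize(text)
```

In this model, drawing `BENCHMARK_STRING` any number of times leaves `glyphs_rasterized` at zero, whereas any other text, which is what real applications would send, pays the full cost. Unlike a general string cache, the shortcut can never help anything except the test itself.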

With a heavy heart, John went to the conference room, and picked up the extension. "John speaking," he said.

Preliminary Analysis Questions

1) How should John react?

Should he volunteer the details about the deception?
Should he wait until the Labs team figures it out?
Should he categorically deny the accusation?
Should he mention what he discovered about Number Nine Corporation?
Should he join the other vendors who complain that all caching, including bitmap caching which PC Labs does accept, artificially inflates scores by going around the test?

2) Do you agree with PC Labs' decision not to disqualify drivers that use bitmap caching because there are some limited applications that can benefit from it?

3) What can PC Labs do to make the tests more realistic and more immune to "empty optimizations"?

4) If this "shortcut" only appears in Pegasus's 3-D boards (and the company agrees to ship drivers without the "patch"), and their 2-D board does show superior performance and bang for the buck, should they honor the Vodoun card with the Editor's Choice?

5) Should GDI patching (hot-rodding) be allowed? How do you feel about the fact that these manufacturers deliberately waited to introduce these modifications until after they received Microsoft Windows compatibility certification?

Implications for practice

Discuss how far developers should go in building "intelligence" into their code that would eliminate "redundancies".
Discuss the value of benchmark testing. Can abuses always be prevented and detected? Can these synthetic numbers reflect real-world performance? What other features should be included in ranking products?

Case Overview

This case revolves around a real confrontation in which a leading computer magazine caught a graphics card vendor rigging its device driver to artificially inflate its scores on the benchmark test. The company (whose real name is Hercules) denied the charge but henceforth shipped drivers without the hardcoded string. The case hints at other, more subtle ways in which other vendors have tried to make "empty optimizations". When caught, their response often amounted to "we were planning to tell you, but we were waiting for you to call and ask, because we wanted to know how you felt about it" (Bill Howard. "Frontiers of Benchmark Testing". PC Magazine, June 25th 1996). PC Labs has had to make some controversial calls in allowing some optimizations while forbidding others, and has struggled with the decision to give the Matrox company its coveted award for one product while chastising it for its practices in another. In response to the Hercules debacle, PC Magazine started using random strings for its testing, and publishing both synthetic and application-based scores.
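The random-string countermeasure can be sketched as follows. This is an illustrative assumption about how such test input might be generated, not PC Magazine's actual method; the function names, the alphabet, and the sample "known" string are all hypothetical.

```python
import random
import string


def random_test_strings(count: int, length: int, seed=None) -> list[str]:
    """Generate fresh test strings so no driver can know them in advance.

    A seed may be supplied to make one particular run reproducible.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(count)]


def count_shortcut_hits(test_strings: list[str], known_string: str) -> int:
    """Count how many test strings a hardcoded shortcut would match."""
    return sum(1 for text in test_strings if text == known_string)
```

For example, `count_shortcut_hits(random_test_strings(100, 20), "The quick brown fox")` returns 0, so a driver that special-cases that one string gains nothing. Random input does not by itself defeat a general string cache, which is one reason to publish application-based scores alongside the synthetic ones.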

Case Objectives

After analyzing and discussing this case, students should be able to:

Identify the pressures that the marketplace imposes on engineers, and appropriate and inappropriate responses to these demands.
Differentiate between synthetic and application-based benchmark testing and know their respective advantages and disadvantages.
Recognize the role a device driver plays in the performance of a hardware peripheral like a video adaptor.
Define the features that should be included in comparing and ranking competing hardware products.

Suggested References

"Waking Up Windows". PC Magazine, April 13th 1993.
Machrone, Bill. "A Little Help from our Friends? Testing and Benchmarking Computer Equipment". PC Magazine, April 27th 1993.
Miller, Michael. "Testing in the Real World - Graphics Adapter Benchmark Testing". PC Magazine, June 29th 1993.
Howard, Bill. "Memo to Vendors: Cut it Out!". PC Magazine, April 26th 1994.
Howard, Bill. "Frontiers of Benchmark Testing". PC Magazine, June 25th 1996.

Given the importance of benchmarking because of its impact on sales and vendors' bragging rights, it is not surprising that there have been several other controversies. This case can easily be broadened to some of the following:

Another example of "rigging" a hardware device to defeat a specific test can be found in the case of rigged gas pumps that employ a special chip that dispenses less than ordered, except for purchases of 5 or 10 gallons "because those are the quantities official inspectors test for."

http://abcnews.go.com/onair/2020/2020_000510_gastest_feature.html#graphic