"Virtually every major graphics board vendor has been in the PC Labs doghouse at sometime over the past several years, including ATI, Diamond, Matrox, and Number Nine. They weren't cheating; let's just say they were trying to push their advantages a bit too far." Bill Howard. 'Frontiers of Benchmark Testing'. PC Magazine, June 25th 1996.
Michael M, Editor-in-Chief of PC Magazine, was looking at the executive report on the latest graphics benchmarks, which were to appear in the June 29th issue. As he got deeper into the summary, his face took on a baffled look. He picked up the phone to call Bill M, Vice President for Technology, and asked him to come by his office with the detailed test results. Five minutes later, they were poring over the data on Bill's laptop.
What had Michael so puzzled was the Graphics Winmark score of the WinBench performance test for the Pegasus Plumbago card: its top score outperformed the nearest competitor by a margin wide enough to raise a red flag, because the informal hands-on tests Michael had done with real applications on the testing machines had given the impression of middle-of-the-road performance. Again he wished that there were a way to produce meaningful test scores using application-based testing; past experiments had shown, however, that such testing is very susceptible to variations in software versions and the program modules used. Hence PC Magazine had developed a synthetic test which repeatedly runs through some of the common graphics functions supported by MS Windows. Programs have to use this API since the multi-tasking Windows operating system doesn't allow them to write directly to the display. The benchmark then reports a weighted average of these iterations as a single score, which the lab hoped was a good approximation of real-world performance.
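The mechanics of such a synthetic test can be sketched in a few lines: time repeated calls to each graphics primitive, convert to a rate, and blend the rates into one weighted score. This is only an illustrative model, not the actual WinBench code; the operation names, weights, and stand-in workloads below are all hypothetical.

```python
import time

def time_op(op, iterations):
    """Time `iterations` repetitions of one drawing operation."""
    start = time.perf_counter()
    for _ in range(iterations):
        op()
    return time.perf_counter() - start

def synthetic_score(ops, weights, iterations=10_000):
    """Report a single weighted-average score (operations per second)."""
    rates = {name: iterations / time_op(op, iterations)
             for name, op in ops.items()}
    return sum(weights[name] * rates[name] for name in ops) / sum(weights.values())

# Hypothetical stand-ins for GDI primitives such as LineTo, FillRect, BitBlt:
ops = {
    "line_draw": lambda: sum(range(50)),
    "area_fill": lambda: [0] * 200,
    "bitblt":    lambda: bytes(256),
}
weights = {"line_draw": 0.3, "area_fill": 0.3, "bitblt": 0.4}
print(f"Graphics score: {synthetic_score(ops, weights):,.0f}")
```

The single blended number is what makes such tests convenient to publish, and also what makes them worth gaming: any trick that speeds up one heavily weighted primitive moves the whole score.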
"This isn't another case of Chang's modifications to the hardware, is it?" he mused aloud. In the early '80s, Chang, an engineer in Taiwan, had found a way to "fool" CPU benchmark tests by using faster clock crystals so the benchmark code seemingly ran faster. "Those 286s were utter failures at keeping time, but their test results looked great, sometimes as much as 50 percent faster," remembered Bill. Some graphics vendors had recently tried similar "stunts" by patching the Windows Graphical Device Interface. Such "hot-rodding" of the boards, as Billy H, a fellow editor, calls it, can yield a 15 percent boost in graphics performance, but it makes system calls less reliable and is thus a questionable tactic. "No," replied the VP for Technology, "we had Billy look into this. He suspects it may be a case of caching by the device driver."
"Isn't that a form of optimization?" asked Michael. "Yes," was the reply. "I had Billy explain it to me." And he related that unlike the old framebuffer boards, new graphics adaptors implement some of the common internal Windows graphics functions, such as area fills and line drawing, directly on the video card; the manufacturer's display driver redirects these calls to the specialized hardware instead of running them on the system's CPU. "That must be the unfair acceleration some vendors complained to me about," interjected Michael. "No," was the response, "they pretty much all do that nowadays. The letters you got expressed concern about optimizations from ATI and Weitek that involve caching parts of the screen image in unused video memory." He went on to explain that bypassing system memory, and thus repeated trips across the system bus, provided substantial savings, especially with BitBlt operations, which involve moving an image across the screen. But the technique is controversial because it benefits only a select category of applications, such as Paint.
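The off-screen caching Bill describes can be modeled as follows. This is a toy sketch, not a real driver: the cycle costs are made-up relative numbers chosen only to show why avoiding the system bus pays off. Note that every blit is still actually performed; only its source memory changes.

```python
class CachingDriver:
    """Sketch of legitimate off-screen bitmap caching: the first BitBlt of
    an image hauls the pixels across the (slow) system bus into spare video
    memory; later blits of the same image are fast on-card copies.
    The costs are hypothetical relative cycle counts, not hardware figures."""
    BUS_COST = 10   # assumed cost of a blit from system memory
    VRAM_COST = 1   # assumed cost of an on-card copy

    def __init__(self):
        self.vram_cache = set()
        self.cycles = 0

    def bitblt(self, bitmap):
        key = hash(bitmap)
        if key in self.vram_cache:
            self.cycles += self.VRAM_COST   # copy within video memory
        else:
            self.vram_cache.add(key)
            self.cycles += self.BUS_COST    # cross the system bus once

driver = CachingDriver()
for _ in range(10):
    driver.bitblt(b"toolbar icon")          # same image redrawn 10 times
print(driver.cycles)                        # 10 + 9*1 = 19, versus 100 uncached
```

The controversy in the case follows directly from the model: an application that redraws the same images repeatedly (a paint program, or a looping benchmark) sees a large win, while one that draws ever-changing images sees none.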
"Hmm," grunted Bill; "this could tempt some vendors to perform empty optimizations." He meant that an unscrupulous manufacturer could program the device driver to watch for an in-place redraw request of the same bitmapped image, and to immediately return from the function call without actually doing the rewrite. Such a scenario would occur only in a benchmark-testing situation. "Why don't you set up a conference call with Pegasus? Let's see if they can explain this surprise finish."
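The distinction Bill draws is narrow but crucial: legitimate caching still performs every copy, just from faster memory, whereas an "empty optimization" skips the work entirely. A minimal sketch of the latter, with all names hypothetical:

```python
class CheatingDriver:
    """Sketch of an 'empty optimization': the driver remembers what was last
    drawn at each destination and silently returns when asked to redraw the
    identical bitmap in place. A looping benchmark flies; the board never
    actually repaints."""
    def __init__(self):
        self.last_drawn = {}
        self.real_blits = 0

    def bitblt(self, dest, bitmap):
        if self.last_drawn.get(dest) == bitmap:
            return                      # claim success without doing the work
        self.last_drawn[dest] = bitmap
        self.real_blits += 1            # ...here the real pixel push would go

driver = CheatingDriver()
for _ in range(1000):                   # a benchmark's tight redraw loop
    driver.bitblt((0, 0), b"same image")
print(driver.real_blits)                # 1: the other 999 redraws never happened
```

Real applications almost never ask for a byte-identical in-place redraw, so the short-circuit path is exercised essentially only by a benchmark's repeat loop.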
When John T. was paged by his secretary with the news that PC Labs wanted to talk to him, he didn't know whether he should be worried or relieved. He had been expecting this call for a while now. When he was made Project Manager of the Plumbago card sixteen months earlier, the Chief Engineer had made it clear to him that the future of the company might be riding on the success of this card, and that he expected an end-product that would do the same for the 3-D market as their Vodoun card had done for the 2-D line. The Plumbago was to be built around the AGX-014 chip manufactured by IIT, but others were using the same chipset, so John was looking for an added advantage to outperform the competition. The obvious place to concentrate on was the graphics device driver: the extension of the operating system that controls the video card and takes advantage of its proprietary opcodes and memory. While a good driver can do very little to overcome the limitations of the hardware, a badly designed driver can cause an excellent device to underperform by failing to take advantage of its power. Device drivers often continue to be tweaked long after the hardware is released.
The Holy Grail of endorsements is the coveted PC Magazine Editor's Choice Award, which often translates into millions of dollars in sales. This was the goal John set for himself, and to this end he ran the WinBench benchmark test suite each time the hardware was changed or the driver adjusted. He compared the results not only to existing products on the market, but also to a few pre-production models from the competition that he could get his hands on. And while he was able to steadily improve the performance of the card, after about 13 months of development he reached a plateau: very solid numbers, but nothing that made the board stand out from similar cards. And the scores were always lower than those of boards from Number Nine Corporation.
Desperate, he had a Number Nine board and its driver reverse-engineered. His team discovered something curious: unlike bitmap caching, which does benefit some types of applications, the Number Nine driver caches text strings; this improves performance, but only in the very limited case where the same string is drawn over and over, i.e., essentially only in benchmark tests. With time running out, John couldn't add the same "intelligence" to his own driver, but because he had previously disassembled the Graphics Winmark test, he knew exactly what string PC Magazine used to test the cards. So, under duress, he ordered that a routine be added to the device driver which would "cache" the hardcoded string. Although nothing else significant was altered, this one modification gave an immediate boost to the benchmark scores.
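John's shortcut can be modeled as a driver that special-cases one known string. Everything below is a hypothetical stand-in; `BENCH_STRING` is not the actual string from the Winmark test, and the glyph "rendering" is a placeholder.

```python
BENCH_STRING = "the quick brown fox"    # stand-in for the hardcoded string
                                        # lifted from the disassembled test

class RiggedTextDriver:
    """Sketch of the hardcoded-string 'cache': rasterization is skipped
    whenever the string matches the one the benchmark is known to use."""
    def __init__(self):
        self.cached_glyphs = None
        self.rasterizations = 0

    def draw_text(self, s):
        if s == BENCH_STRING and self.cached_glyphs is not None:
            return self.cached_glyphs   # benchmark string: reuse stale glyphs
        glyphs = self._rasterize(s)     # every other string pays full price
        if s == BENCH_STRING:
            self.cached_glyphs = glyphs
        return glyphs

    def _rasterize(self, s):
        self.rasterizations += 1
        return s.encode()               # stand-in for real glyph rendering

driver = RiggedTextDriver()
for _ in range(1000):
    driver.draw_text(BENCH_STRING)      # the benchmark's repeated string
print(driver.rasterizations)            # 1: the test flies...
driver.draw_text("any other phrase")
driver.draw_text("any other phrase")
print(driver.rasterizations)            # 3: real applications gain nothing
```

The model also shows why the cheat is fragile: a test that draws strings the driver has not hardcoded never hits the fast path, which is exactly how randomized test strings defeat it.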
With a heavy heart, John went to the conference room, and picked up the extension. "John speaking," he said.
Preliminary Analysis Questions
1) How should John react?
- Should he volunteer the details about the deception?
- Should he wait until the Labs team figures it out?
- Should he categorically deny the accusation?
- Should he mention what he discovered about Number Nine Corporation?
- Should he join the other vendors who complain that all caching, including the bitmap caching which PC Labs does accept, artificially inflates scores by going around the test?
Implications for Practice
- Discuss how far developers should go in building "intelligence" into their code to eliminate "redundancies".
- Discuss the value of benchmark testing. Can abuses always be prevented and detected? Can synthetic numbers reflect real-world performance? What other features should be included in ranking products?
Case Overview
This case revolves around a real confrontation in which a leading computer magazine caught a graphics card vendor rigging its device driver to artificially inflate its scores on the benchmark test. The company (whose real name is Hercules) denied the charge but henceforth shipped drivers without the hardcoded string. The case hints at other, more subtle ways that other vendors have tried to make "empty optimizations". When caught, their response often amounted to "we were planning to tell you, but we were waiting for you to call and ask, because we wanted to know how you felt about it" (Bill Howard. "Frontiers of Benchmark Testing". PC Magazine, June 25th 1996). PC Labs has had to make some controversial calls in allowing some optimizations while forbidding others, and has struggled with the decision to award the Matrox company its coveted award for one product while chastising it for its practices in another. In response to the Hercules debacle, PC Magazine started using random strings for its testing, and publishing both synthetic and application-based scores.
Case Objectives
After analyzing and discussing this case, students should be able to:
- Identify the pressures that the marketplace imposes on engineers, and the appropriate and inappropriate responses to these demands.
- Differentiate between synthetic and application-based benchmark testing, and know their respective advantages and disadvantages.
- Recognize the role a device driver plays in the performance of a hardware peripheral like a video adaptor.
- Define the features that should be included in comparing and ranking competing hardware products.
References
- "Waking Up Windows". PC Magazine, April 13th 1993.
- Machrone, Bill. "A Little Help from our Friends? Testing and Benchmarking Computer Equipment". PC Magazine, April 27th 1993.
- Miller, Michael. "Testing in the Real World - Graphics Adapter Benchmark Testing". PC Magazine, June 29th 1993.
- Howard, Bill. "Memo to Vendors: Cut it Out! ". PC Magazine, April 26th 1994.
- Howard, Bill. "Frontiers of Benchmark Testing". PC Magazine, June 25th 1996.
- Veitch, Martin. "Sun Accused of Cheating Java Benchmarks". PC Magazine, November 5th 1997.
- Foster, Ed. "Is it OK for Microsoft to bar benchmark results?" InfoWorld, April 17th 2001.