Performance Results
Impressive runtime results for parallel [scientific computing] programs are hard to achieve, and reporting them in scientific papers is trickier still. Authors often resort to attention-deflecting methods when presenting the performance of parallel programs on distributed systems. These methods are obvious to a trained eye, but readers new to the area can easily miss them. Of the various tricks for making reported performance look better, I found these amusing and useful:
- Data movement generally accounts for around 40% of an application's runtime. To show the results in the best possible light, only the time taken by the "core" part of the program is reported, as if it were the complete runtime.
- Inline assembly code or directly emitted machine code is used but left unreported.
- Because 64-bit floating-point arithmetic runs slower, only 32-bit performance results are shown.
- When comparing against a conventional supercomputer implementation, the parallel algorithm is heavily modified to suit the underlying machine architecture.
- The parallel version is compared against unoptimized, non-vectorized code.
- Performance results are projected to a full system rather than measured on real hardware.
- Problems whose size grows with the number of processors are selected. The choice has to be made carefully, because most problems do not scale well in size as processors are added.
- Runtimes of parallel code on new hardware are compared against runtimes on an obsolete system.
- Runtimes of parallel code on dedicated hardware are compared against runtimes of sequential code on shared hardware.
- Utility is brought into the equation: instead of "performance is X MFLOPS", the claim becomes "performance is X MFLOPS per dollar". This suggests that the processors are fully utilized all of the time, even though they may be busy with context switching or synchronization.
- The paper or presentation is splattered with cute pictures and never mentions performance; typically the convergence rate is reported instead of the runtime.
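
To make the first trick in the list concrete, here is a minimal Python sketch of how leaving data movement out of the measurement inflates a speedup figure. All timings and names (`sequential_total`, `parallel_core`, `parallel_data_movement`) are hypothetical, invented for illustration and chosen so that data movement is 40% of the parallel runtime; they are not measurements from any real system.

```python
def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Return the ratio of baseline runtime to parallel runtime."""
    return baseline_seconds / parallel_seconds


# Hypothetical timings in seconds (assumptions, not real measurements).
sequential_total = 100.0        # complete sequential run
parallel_core = 6.0             # parallel compute kernel only
parallel_data_movement = 4.0    # communication, I/O and transfers

# An honest report uses the complete parallel runtime.
parallel_total = parallel_core + parallel_data_movement

# The flattering report divides by the "core" time alone.
core_only_speedup = speedup(sequential_total, parallel_core)
end_to_end_speedup = speedup(sequential_total, parallel_total)

# Share of the parallel runtime that the flattering report hides.
data_movement_share = parallel_data_movement / parallel_total

print(f"core-only speedup:   {core_only_speedup:.1f}x")
print(f"end-to-end speedup:  {end_to_end_speedup:.1f}x")
print(f"data movement share: {data_movement_share:.0%}")
```

With these made-up numbers the core-only figure is noticeably higher than the end-to-end one, even though nothing about the program changed. When reading a paper, it helps to check which of the two the authors actually timed.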