<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>Polly - Performance</title>
  <link type="text/css" rel="stylesheet" href="menu.css">
  <link type="text/css" rel="stylesheet" href="content.css">
</head>
<body>
<div id="box">
<!--#include virtual="menu.html.incl"-->
<div id="content">
<h1>Performance</h1>

<p>To evaluate the performance benefits Polly currently provides, we compiled the
<a href="https://web.cse.ohio-state.edu/~pouchet.2/software/polybench/">Polybench
2.0</a> benchmark suite. Each benchmark was run with double precision floating
point values on an Intel Core Xeon X5670 CPU @ 2.93GHz (12 cores, 24 threads)
system. We used <a href="https://sourceforge.net/projects/pocc/files/">PoCC</a> and the included <a
href="http://pluto-compiler.sf.net">Pluto</a> transformations to optimize the
code. The source code of Polly and LLVM/clang was checked out on
25/03/2011.</p>

<p>The results shown were created fully automatically without manual
interaction. We have not yet spent any time tuning the results. Hence
further improvements may be achieved by tuning the code generated by Polly,
tuning the heuristics used by Pluto, or investigating whether more code could
be optimized. As Pluto was never used at such a low level, its heuristics are
probably far from perfect. Another area where we expect larger performance
improvements is SIMD vector code generation. At the moment, it rarely yields
performance improvements, as we have not yet included vectorization in our
heuristics.
By changing this we should be able to significantly increase the
number of test cases that show improvements.</p>

<p>The Polybench test suite contains computation kernels from linear algebra
routines, stencil computations, image processing and data mining. Polly
recognizes the majority of them and is able to show good speedup. However,
to show similar speedup on larger examples like the SPEC CPU benchmarks, Polly
still lacks support for integer casts, variable-sized multi-dimensional arrays
and probably several other constructs. This support is necessary because such
constructs appear in larger programs, but not in our limited test suite.</p>

<h2> Sequential runs</h2>

For the sequential runs we used Polly to create a program structure that is
optimized for data locality. One of the major optimizations performed is tiling.
The speedups shown do not use any multi-core parallelism. No
additional hardware is used, but the single available core is used more
efficiently.
<h3> Small data size</h3>
<img src="images/performance/sequential-small.png" /><br />
<h3> Large data size</h3>
<img src="images/performance/sequential-large.png" />
<h2> Parallel runs</h2>
For the parallel runs we used Polly to expose parallelism and to add calls to an
OpenMP runtime library. With OpenMP we can use all 12 hardware cores
instead of the single core that was used before. In several
cases we obtain more than linear speedup. This additional speedup is due to
improved data locality.
<h3> Small data size</h3>
<img src="images/performance/parallel-small.png" /><br />
<h3> Large data size</h3>
<img src="images/performance/parallel-large.png" />
</div>
</div>
</body>
</html>