Fwd: [Developers] Re: [CactusMaint] Profiling of Bench_BSSN_PUGH benchmark (Long mail). (fwd)
goodale at cct.lsu.edu
Wed Mar 15 06:46:19 CST 2006
This message from John seems to have disappeared down a black hole.
Begin forwarded message:
> From: John Shalf <jshalf at lbl.gov>
> Date: March 14, 2006 1:12:19 PM PST
> To: developers at cactuscode.org
> Subject: Re: [Developers] Re: [CactusMaint] Profiling of
> Bench_BSSN_PUGH benchmark (Long mail).
> Some comments below.
> On Mar 14, 2006, at 12:28 PM, David Rideout wrote:
>> On 3/14/06, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>> On Mar 13, 2006, at 22:51:26, smadiraju at cct.lsu.edu wrote:
>>>> Hi all,
>>>> I am working on optimizing the Cactus "Bench_BSSN_PUGH" benchmark
>>>> using different profiling tools.
>>>> 1) After using the IPM profilers it was found that the benchmark
>>>> spends most of its time in load/store operations. It would be good
>>>> if we could reduce the number of loads/stores and thereby improve
>>>> the performance of the benchmark.
> Many of the loads and stores are present because of register spilling.
> The typical remedy for register spilling is loop splitting.
> However attempts in the past by compiler experts such as Mary
> Hall's group at ISI failed to gain significant performance
> improvements from loop splitting. Perhaps that conversation is
> archived somewhere, but the number of temporary arrays required to
> carry over results from the loop splitting turned out to increase
> bandwidth requirements. Perhaps the loop splitting approach could
> be revisited if applied only to the innermost loops. The temporary
> registers would only need to be 3-planes of the overall 3D data
> block rather than attempting to split on the outermost loop.
> However, register spilling was not the problem on the Itanium
> (register pressure was very low, but it still performed many
> redundant load/store instructions). It is very useful and
> instructive to look at the work that the Intel developers did on
> the extracted BSSN kernel for the Itanium 1 processor. I have
> attached NewStagLeap.F and StagLeap.F to this message. The
> NewStagLeap contains a fully itemized list of hand-coded assembly
> optimizations that Intel applied that brought the kernel
> performance from 30Mflops up to 2GFlops (single precision) on the
> early-release Itanium-1 systems many years ago. You may gain some
> insight from their optimizations that might be applicable outside
> of the IA64 architecture (don't know for certain).
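> To make the plane-temporary idea concrete, here is a minimal C sketch
> (a toy 3D stencil, not the actual BSSN kernel; all names and sizes
> are illustrative): the inner j/i loops are split in two, with the
> intermediate carried in a single 2D plane buffer rather than a full
> 3D temporary array.

```c
#include <stdlib.h>

#define NX 16
#define NY 16
#define NZ 16
#define IDX(i, j, k) (((k) * NY + (j)) * NX + (i))

/* Toy kernel, fused: the intermediate 'lap' and the final update live
 * in the same inner loop, so all of their inputs are simultaneously
 * live -- this is what drives register pressure in a real kernel. */
void kernel_fused(const double *u, double *out)
{
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++) {
                double lap = u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)]
                           + u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)]
                           + u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]
                           - 6.0 * u[IDX(i, j, k)];
                out[IDX(i, j, k)] = u[IDX(i, j, k)] + 0.25 * lap * lap;
            }
}

/* Split version: pass 1 stores the intermediate into one NX*NY plane,
 * pass 2 consumes it.  The temporary is a plane, not a 3D array, so
 * the extra bandwidth cost stays small. */
void kernel_split(const double *u, double *out)
{
    double *plane = malloc(sizeof(double) * NX * NY);
    for (int k = 1; k < NZ - 1; k++) {
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++)
                plane[j * NX + i] =
                      u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)]
                    + u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)]
                    + u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]
                    - 6.0 * u[IDX(i, j, k)];
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++) {
                double lap = plane[j * NX + i];
                out[IDX(i, j, k)] = u[IDX(i, j, k)] + 0.25 * lap * lap;
            }
    }
    free(plane);
}
```

> The two versions compute identical results; whether the split
> actually wins depends on the processor, as the earlier attempts
> showed.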
>>>> 2) I started profiling the benchmark on different machines
>>>> available to me. I worked on Xeon and Opteron architectures.
>>>> IBM-based architectures do not have valgrind or oprofile; I was
>>>> unable to do any profiling on a Power5 for this reason.
> On Power5 systems running AIX5.3, you should be able to use
> HPMCOUNT to get hardware performance profiling data. For systems
> like bassi that are running AIX5.2, the PAPI timers work just
> fine. On bassi, just do
> module load papi
> to get access to PAPI-3 timers.
>>>> 3) Cachegrind produces output giving the total number of
>>>> instructions, the data read cache misses at L1 and L2, the data
>>>> write cache misses at L1 and L2, and the instruction cache misses
>>>> at L1 and L2. I wrote a small Perl script which parses this
>>>> information and pulls out those functions which are responsible
>>>> for the majority of cache misses at level 2.
> so the L2 cache miss rates indicate to me that some cache line
> aliasing is happening. I know that Tom wrote some heuristics into
> the data allocation procedures to stagger the arrays to minimize
> aliasing, but you should definitely check to make sure they are working
> as expected. The procedure for staggering these arrays should
> perhaps be re-examined.
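> For reference, the staggering idea looks roughly like this -- a
> hedged sketch, not Tom's actual allocator; the cache-line size and
> the single-slab layout are assumptions.  Each grid function gets one
> extra cache line of offset, so same-index elements of
> power-of-two-sized arrays no longer map to the same cache set.

```c
#include <stdlib.h>
#include <stddef.h>

#define LINE 64 /* assumed cache-line size in bytes */

/* Lay out `count` arrays of `n` doubles in one slab, giving each array
 * one extra cache line of stagger.  Without the stagger, arrays whose
 * byte size is a power of two all start at the same cache-set
 * alignment, and arr[g][i] for every g competes for the same L1/L2
 * sets.  Returns the slab pointer (free() it when done). */
void *alloc_staggered(double **arr, int count, size_t n)
{
    size_t stride = n * sizeof(double);
    stride = (stride + LINE - 1) / LINE * LINE; /* round up to a line */
    char *slab = malloc((size_t)count * (stride + LINE));
    if (!slab)
        return NULL;
    size_t off = 0;
    for (int g = 0; g < count; g++) {
        arr[g] = (double *)(slab + off);
        off += stride + LINE; /* the "+ LINE" is the stagger */
    }
    return slab;
}
```

> Checking that the real allocator still produces distinct set
> alignments like this would confirm the heuristics are doing their
> job.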
>>>> 4) The following are the functions found to cause cache misses from
>>>> runs with varying numbers of iterations and problem sizes. Any
>>>> comments regarding the details of these functions are welcome.
>>> The function Mol_RK3Add is from the time integrator. It essentially
>>> adds large arrays to each other. It is clear that this function has
>>> many cache misses, and there is no way to optimise this function.
>>> Although this function has many cache misses, it is usually very
>>> fast, so it is not clear whether it is worth optimising this
>>> function. If this function should be optimised, then the whole time
>>> integration API in Cactus needs to be changed. This would only
>>> happen if there is a clear performance win demonstrated.
> I agree completely.
>>> The function ADM_BSSN_Sources is the main work routine in this
>>> benchmark. It is likely that this function could benefit from cache
>>> optimisations, such as tiling.
> I agree, although attempts to tile in the past (aside from the z-
> slicing that David Rideout described) have actually increased run
> time. It turns out to be difficult to find reasonable blocking
> strategies. We wrote a paper on cache-blocking considerations last
> There are very limited circumstances where Rivera blocking may work
> for you.
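> For what it's worth, the Rivera-style variant meant here blocks only
> the i-j plane and streams through k (toy stencil and illustrative
> tile sizes below; whether it helps on a given machine is exactly the
> open question):

```c
#define NX 32
#define NY 32
#define NZ 8
#define IDX(i, j, k) (((k) * NY + (j)) * NX + (i))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Straightforward sweep, for reference. */
void sweep_naive(const double *u, double *out)
{
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++)
                out[IDX(i, j, k)] =
                    (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                     u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                     u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
}

/* Rivera blocking: tile i and j only, stream through k inside each
 * tile, so the three k-planes a tile touches stay resident in cache. */
void sweep_blocked(const double *u, double *out, int ti, int tj)
{
    for (int jj = 1; jj < NY - 1; jj += tj)
        for (int ii = 1; ii < NX - 1; ii += ti)
            for (int k = 1; k < NZ - 1; k++)
                for (int j = jj; j < MIN(jj + tj, NY - 1); j++)
                    for (int i = ii; i < MIN(ii + ti, NX - 1); i++)
                        out[IDX(i, j, k)] =
                            (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                             u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                             u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
}
```

> Measuring both L2 misses *and* wallclock time for each tile size is
> the important part, per the MSP2005 findings discussed below.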
>>> The function ADM_BSSN_RemoveTrA is
>>> also a work routine. It traverses its arrays only once, so again
>>> this function cannot be optimised on its own. It may be possible to
>>> fold this routine into the main loop of ADM_BSSN_Sources, which
>>> may improve cache performance. Note that ADM_BSSN_Sources traverses its
>>> arrays twice -- combining these two loops may improve performance.
> Combining the two loops may also increase register spilling. It
> may be useful to combine the outer-loop though and do some loop
> splitting on the inner-loops. That way, you have much less state
> to carry between iterations.
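> A sketch of that structure, with two toy pointwise passes standing in
> for ADM_BSSN_Sources and ADM_BSSN_RemoveTrA (the arithmetic is made
> up): fuse at the k loop so each plane of the first pass's output is
> consumed while still hot in cache, but keep the j/i loop nests
> separate so neither carries the other's live values.

```c
#define NX 16
#define NY 16
#define NZ 16
#define IDX(i, j, k) (((k) * NY + (j)) * NX + (i))

/* Two full sweeps: pass 1 writes w, pass 2 rescales w in place,
 * walking the whole 3D array twice. */
void two_sweeps(const double *u, double *w)
{
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                w[IDX(i, j, k)] = 2.0 * u[IDX(i, j, k)] + 1.0;
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                w[IDX(i, j, k)] = w[IDX(i, j, k)] / 3.0;
}

/* Fused at the outer k loop only: each k-plane of w is consumed by
 * the second pass right after it is produced, yet the two inner loop
 * nests stay separate, so per-iteration register pressure does not
 * grow. */
void fused_outer(const double *u, double *w)
{
    for (int k = 0; k < NZ; k++) {
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                w[IDX(i, j, k)] = 2.0 * u[IDX(i, j, k)] + 1.0;
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                w[IDX(i, j, k)] = w[IDX(i, j, k)] / 3.0;
    }
}
```

> This fusion is only valid as written if the second pass is pointwise
> (no stencil), which is the RemoveTrA case.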
>>> The function NewRad is a boundary condition routine. It traverses
>>> only the boundary of the three-dimensional arrays, and it is
>>> not possible to optimise this routine on its own. It is possible to
>>> fold this routine into the main loop, but this would complicate the
>>> code very much, and this would only be accepted if there is a proven
>>> performance win.
> From the data I have, it's only worth optimizing NewRad for the
> vector machines. It doesn't appear to account for much of the
> runtime on other systems, but it is the most computationally
> expensive routine on the Cray X1e because it doesn't vectorize.
> While this should be of little concern for a microprocessor, it is
> important to point out that code that doesn't vectorize also does
> not software-pipeline (SWP) that well. SWP is important for
> getting performance on superscalar processors and critical for
> achieving good performance on VLIW machines like the IA64 (that is
> what the IA64 register windowing is about...).
>>> The most promising performance improvement comes probably -- but
>>> is just a guess -- from cache optimisations in ADM_BSSN_Sources.
>>> This function is quite large and can handle many different cases
>>> (differencing stencils, shift storage, etc.). You could remove the
>>> code for the cases that are not used in your benchmark.
> Likewise, it may be important to comment out the conditional
> statements for some of your work in order to ensure that the
> conditionals are not confusing the compiler optimizer. The
> standalone BSSN should probably be updated. The entire purpose of
> the StagLeap (standalone ADM) and Standalone_BSSN was to support
> more aggressive optimization on these kernels without having to
> deal with the full Cactus infrastructure. It will definitely speed
> up your optimization work to refresh the Standalone version of the
> code with the current code base and use that as your platform for
> experimentation.
> I also highly recommend looking at the BSSN kernel in GNU GIMPLE form.
> Gimple and SSA provide a much simpler normalized syntax for the
> loop basic blocks (all single-assignment expressions) that are much
> more amenable to processing using Perl. It's easy to get the
> compiler to emit 'gimplified' code for you to work on. Attached is
> StagLeap in gimple form.
> An added bonus is that Gimple uses the same normalized syntax for
> both C and Fortran input files.
>>> you could implement a tiling optimisation and see whether the number
>>> of L2 cache misses changes significantly. You could also experiment
>>> with combining the two loops into one -- the loops are independent,
>>> so this is very easy.
> The most interesting discovery we made in the MSP2005 paper was
> that you can reduce the number of L2 cache misses, but actually see
> the wallclock time go *up* for the code. Indeed, we found a number
> of different tiling strategies were actually bogus because the
> authors reported L2 miss rates, but failed to report the wallclock
> time improvements (indeed, there were no wallclock time
> improvements in many cases). This is because not all L2 cache
> misses are created equal. Some cost much more than others (we have
> an analytic model in the paper that describes that case in detail).
>>> The main loops are located in the file Sources.F. They are three
>>> nested loops for the three dimensions, starting with the "do
>>> k=2,nz-1" and extending up to the corresponding "end do" statements.
>>> Peter Diener can help you find your way in this code, and so can I
>>> once I'm back in two weeks from now.