Fwd: [Developers] Re: [CactusMaint] Profiling of Bench_BSSN_PUGH benchmark (Long mail). (fwd)

Tom Goodale goodale at cct.lsu.edu
Wed Mar 15 06:46:19 CST 2006

This message from John seems to have disappeared down a black hole.

Begin forwarded message:
> From: John Shalf <jshalf at lbl.gov>
> Date: March 14, 2006 1:12:19 PM PST
> To: developers at cactuscode.org
> Subject: Re: [Developers] Re: [CactusMaint] Profiling of
> Bench_BSSN_PUGH benchmark (Long mail).
> Some comments below.
> On Mar 14, 2006, at 12:28 PM, David Rideout wrote:
>> On 3/14/06, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>> On Mar 13, 2006, at 22:51:26, smadiraju at cct.lsu.edu wrote:
>>>> Hi all,
>>>> I am working on optimizing the Cactus "Bench_BSSN_PUGH" benchmark
>>>> using
>>>> different profiling tools.
>>>> 1) After using the IPM profiler, it was found that the benchmark
>>>> Bench_BSSN_PUGH spends most of its time in load/store operations.
>>>> It would be good if we could reduce the number of loads and stores
>>>> and thereby improve the performance of the code.
> Many of the loads and stores are present because of register spilling.
> The typical remedy for register spilling is loop splitting.
> However, attempts in the past by compiler experts such as Mary
> Hall's group at ISI failed to gain significant performance
> improvements from loop splitting.  Perhaps that conversation is
> archived somewhere, but the number of temporary arrays required to
> carry results across the split loops turned out to increase
> bandwidth requirements.  Perhaps the loop-splitting approach could
> be revisited if applied only to the innermost loops.  The temporary
> arrays would then only need to hold three planes of the overall 3D
> data block, rather than attempting to split on the outermost loop.
> However, register spilling was not the problem on the Itanium
> (register pressure was very low, but it still performed many
> redundant load/store instructions).  It is very useful and
> instructive to look at the work that the Intel developers did on
> the extracted BSSN kernel for the Itanium 1 processor.  I have
> attached NewStagLeap.F and StagLeap.F to this message.  The
> NewStagLeap contains a fully itemized list of hand-coded assembly
> optimizations that Intel applied that brought the kernel
> performance from 30Mflops up to 2GFlops (single precision) on the
> early-release Itanium-1 systems many years ago.  You may gain some
> insight from their optimizations that might be applicable outside
> of the IA64 architecture (don't know for certain).
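The inner-loop splitting John describes -- carrying intermediate results in plane-sized temporaries rather than full 3D arrays -- has roughly this structure.  This is a toy Python sketch for illustration only; the stencil and the arithmetic are assumptions, not the actual BSSN kernel:

```python
# Structural sketch of inner-loop splitting with a plane-sized temporary.
# The fused version computes derivative + update in one loop body; the
# split version stages the derivative into a single 2D k-plane so only
# one plane of intermediate state, not a full 3D array, is carried.

def fused_sweep(u, nx, ny, nz):
    """One big loop body: derivative and nonlinear update per point."""
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                d = (u[k][j][i + 1] - u[k][j][i - 1]
                     + u[k][j + 1][i] - u[k][j - 1][i]
                     + u[k + 1][j][i] - u[k - 1][j][i])
                out[k][j][i] = d * d + u[k][j][i]
    return out

def split_sweep(u, nx, ny, nz):
    """Same arithmetic, but the derivative stage is computed into a
    2D plane temporary before the update stage consumes it."""
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    plane = [[0.0] * nx for _ in range(ny)]   # one k-plane of state
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                plane[j][i] = (u[k][j][i + 1] - u[k][j][i - 1]
                               + u[k][j + 1][i] - u[k][j - 1][i]
                               + u[k + 1][j][i] - u[k - 1][j][i])
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                out[k][j][i] = plane[j][i] * plane[j][i] + u[k][j][i]
    return out
```

In Fortran the plane temporary would be a small 2D work array inside the k loop; the point of the transformation is that the compiler has far less live state per inner iteration.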
>>>> 2) I started profiling the benchmark on the different machines
>>>> available to me.  I worked on Xeon and Opteron architectures.
>>>> IBM-based architectures do not have Valgrind or OProfile, so I
>>>> was unable to do any profiling on a Power5.
> On Power5 systems running AIX5.3, you should be able to use
> HPMCOUNT to get hardware performance profiling data.  For systems
> like bassi that are running AIX5.2, the PAPI timers work just
> fine.  On bassi, just do
> 	module load papi
> to get access to PAPI-3 timers.
>>>> 3) Cachegrind produces output giving the total number of
>>>> instructions, the data-read cache misses at L1 and L2, the
>>>> data-write cache misses at L1 and L2, and the instruction cache
>>>> misses at L1 and L2.  I wrote a small Perl script which parses
>>>> this information and pulls out the functions responsible for the
>>>> majority of the cache misses at level 2.
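For illustration, a minimal Python analogue of such a script might look like the following.  The per-function column layout in the sample (function name, Ir, D2mr, D2mw) is an assumption for the sketch, not the exact format your cachegrind version emits:

```python
# Rank functions by their L2 data-cache misses in cachegrind-style
# per-function output.  The assumed columns are: function name,
# instruction reads (Ir), L2 data-read misses (D2mr), and L2
# data-write misses (D2mw).

def rank_by_l2_misses(lines, top=3):
    """Return (function, total L2 data misses) pairs, worst first."""
    ranked = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue                      # skip blanks and malformed rows
        func, _ir, d2mr, d2mw = fields
        try:
            misses = int(d2mr) + int(d2mw)
        except ValueError:
            continue                      # skip the header row
        ranked.append((func, misses))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top]

sample = """\
fn Ir D2mr D2mw
ADM_BSSN_Sources 900000 5200 3100
Mol_RK3Add 120000 4100 4000
NewRad 80000 150 90
"""
print(rank_by_l2_misses(sample.splitlines()))
# → [('ADM_BSSN_Sources', 8300), ('Mol_RK3Add', 8100), ('NewRad', 240)]
```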
> OK, so the L2 cache miss rates indicate to me that some cache-line
> aliasing is happening.  I know that Tom wrote some heuristics into
> the data allocation procedures to stagger the arrays to minimize
> aliasing, but you should definitely check to make sure they are
> working as expected.  The procedure for staggering these arrays
> should perhaps be re-examined.
>>>> 4) The following are the functions found to cause cache misses in
>>>> different runs with varying numbers of iterations and problem
>>>> sizes.  Any input regarding the details of these functions is
>>>> welcome.
>>> The function Mol_RK3Add is from the time integrator.  It essentially
>>> adds large arrays to each other.  It is clear that this function has
>>> many cache misses, and there is no way to optimise this function.
>>> Although this function has many cache misses, it is usually very
>>> fast, so it is not clear whether it is worth optimising this
>>> function.  If this function should be optimised, then the whole time
>>> integration API in Cactus needs to be changed.  This would only
>>> happen if there is a clear performance win demonstrated.
> I agree completely.
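To make Erik's point concrete: the MoL add step is essentially a streaming linear combination of large arrays, where every element is loaded once and stored once, so the loop is bound by memory bandwidth rather than arithmetic.  A sketch (the signature and coefficients are placeholders, not the actual MoL API):

```python
# Structural sketch of what a MoL "add" step does: an AXPY-like
# streaming linear combination of large arrays.  Each element is
# touched exactly once, so there is no data reuse for a cache
# optimization to exploit.

def mol_rk_add(u, v, rhs, a, b, c, dt):
    """In place: u[n] <- a*u[n] + b*v[n] + c*dt*rhs[n]."""
    for n in range(len(u)):
        u[n] = a * u[n] + b * v[n] + c * dt * rhs[n]
    return u
```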
>>> The function ADM_BSSN_Sources is the main work routine in this
>>> benchmark.  It is likely that this function could benefit from cache
>>> optimisations, such as tiling.
> I agree, although attempts to tile in the past (aside from the
> z-slicing that David Rideout described) have actually increased run
> time.  It turns out to be difficult to find reasonable blocking
> strategies.  We wrote a paper on cache-blocking considerations last
> year:
> 	http://crd.lbl.gov/~oliker/papers/msp_2005.pdf
> There are very limited circumstances where Rivera blocking may work
> for you.
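For reference, Rivera-style blocking tiles only the i-j plane of a 3D stencil sweep and streams through k, so a small tile of a few adjacent planes stays resident in cache.  A structural sketch in Python (the stencil and tile sizes are illustrative, not taken from the benchmark):

```python
# Sketch of Rivera blocking for a 3D 7-point stencil: tile the i and j
# dimensions, stream over k inside each tile.  The working set per
# tile is three k-planes of a TI x TJ patch instead of three full
# nx x ny planes.

def blocked_stencil(u, nx, ny, nz, ti=4, tj=4):
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for jj in range(1, ny - 1, tj):
        for ii in range(1, nx - 1, ti):
            for k in range(1, nz - 1):            # stream over k
                for j in range(jj, min(jj + tj, ny - 1)):
                    for i in range(ii, min(ii + ti, nx - 1)):
                        out[k][j][i] = (u[k][j][i + 1] + u[k][j][i - 1]
                                        + u[k][j + 1][i] + u[k][j - 1][i]
                                        + u[k + 1][j][i] + u[k - 1][j][i]
                                        - 6.0 * u[k][j][i])
    return out
```

As the MSP2005 paper discusses, whether this wins depends on whether the untiled working set actually overflows L2; measure wallclock time, not just miss counts.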
>>> The function ADM_BSSN_RemoveTrA is
>>> also a work routine.  It traverses its arrays only once, so again
>>> this function cannot be optimised on its own.  It may be possible to
>>> fold this routine into the main loop of ADM_BSSN_Sources, which
>>> could
>>> improve cache performance.  Note that ADM_BSSN_Sources traverses its
>>> arrays twice -- combining these two loops may improve performance.
> Combining the two loops may also increase register spilling.  It
> may be useful to combine the outer loops, though, and do some loop
> splitting on the inner loops.  That way, you have much less state
> to carry between iterations.
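The shape of that transformation -- merging two sweeps at the outer loop while keeping their inner nests split -- can be sketched as follows.  The two stage bodies here are placeholders standing in for the two traversals in ADM_BSSN_Sources:

```python
# Outer-loop fusion with inner loops kept split.  Stage 2 of plane k
# only needs stage 1's result on the same plane, so fusing at the k
# level is legal here and keeps plane k hot in cache between stages,
# without bloating the per-iteration register state.

def two_passes(a, b, n):
    """Baseline: two full sweeps over the data."""
    for k in range(n):
        for j in range(n):
            a[k][j] = a[k][j] * 2.0          # stage 1
    for k in range(n):
        for j in range(n):
            b[k][j] = b[k][j] + a[k][j]      # stage 2
    return a, b

def fused_outer(a, b, n):
    """Outer loops merged; inner loops kept separate."""
    for k in range(n):
        for j in range(n):
            a[k][j] = a[k][j] * 2.0          # stage 1, plane k
        for j in range(n):
            b[k][j] = b[k][j] + a[k][j]      # stage 2, same plane
    return a, b
```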
>>> The function NewRad is a boundary condition routine.  It traverses
>>> only the boundary of the three-dimensional arrays, and it is
>>> probably
>>> not possible to optimise this routine on its own.  It is possible to
>>> fold this routine into the main loop, but this would complicate the
>>> code very much, and this would only be accepted if there is a proven
>>> performance win.
> From the data I have, it's only worth optimizing NewRad for the
> vector machines.  It doesn't appear to account for much of the
> runtime on other systems, but it is the most computationally
> expensive routine on the Cray X1E because it doesn't vectorize.
> While this should be of little concern for a microprocessor, it is
> important to point out that code that doesn't vectorize also does
> not software-pipeline (SWP) that well.  SWP is important for
> getting performance on superscalar processors and critical for
> achieving good performance on VLIW machines like the IA64 (that is
> what the IA64 register windowing is about...).
>>> The most promising performance improvement comes probably -- but
>>> this
>>> is just a guess -- from cache optimisations in ADM_BSSN_Sources.
>>> This function is quite large and can handle many different cases
>>> (differencing stencils, shift storage, etc.).  You could remove the
>>> code for the cases that are not used in your benchmark.
> Likewise, it may be important to comment out the conditional
> statements for some of your work in order to ensure that the
> conditionals are not confusing the compiler optimizer.  The
> standalone BSSN should probably be updated.  The entire purpose of
> the StagLeap (standalone ADM) and Standalone_BSSN was to support
> more aggressive optimization on these kernels without having to
> deal with the full Cactus infrastructure.  It will definitely speed
> up your optimization work to refresh the Standalone version of the
> code with the current code base and use that as your platform for
> optimization.
> I also highly recommend looking at the BSSN kernel in GNU GIMPLE form.
> 	http://en.wikibooks.org/wiki/GNU_C_Compiler_Internals/GNU_C_Compiler_Architecture
> GIMPLE and SSA provide a much simpler normalized syntax for the
> loop basic blocks (all single-assignment expressions) that is much
> more amenable to processing with Perl.  It's easy to get the
> compiler to emit 'gimplified' code for you to work on (e.g. with
> gcc's -fdump-tree-gimple option).  Attached is StagLeap in GIMPLE
> form.
> An added bonus is that Gimple uses the same normalized syntax for
> both C and Fortran input files.
>>> Afterwards,
>>> you could implement a tiling optimisation and see whether the number
>>> of L2 cache misses changes significantly.  You could also experiment
>>> with combining the two loops into one -- the loops are independent,
>>> so this is very easy.
> The most interesting discovery we made in the MSP2005 paper was
> that you can reduce the number of L2 cache misses, but actually see
> the wallclock time go *up*.  Indeed, we found that a number of
> different tiling strategies were actually bogus because the
> authors reported improved L2 miss rates but failed to report
> wallclock times (indeed, there was no wallclock improvement in
> many cases).  This is because not all L2 cache
> misses are created equal.  Some cost much more than others (we have
> an analytic model in the paper that describes that case in detail).
>>> The main loops are located in the file Sources.F.  They are three
>>> nested loops for the three dimensions, starting with the "do
>>> k=2,nz-1" and extending up to the corresponding "end do" statements.
>>> Peter Diener can help you find your way in this code, and so can I
>>> once I'm back in two weeks from now.
