
Re: [patches] Possible PowerPC LIBC optimization

To: Mark Mitchell <mark@xxxxxxxxxxxxxxxx>
Subject: Re: [patches] Possible PowerPC LIBC optimization
From: Steven Munroe <munroesj@xxxxxxxxxx>
Date: Wed, 16 May 2007 21:31:33 -0500

Mark Mitchell wrote:

Steven Munroe wrote:

My concern is that what works for 750 may not be best for 8540 or 970.
The powerpc-cpu add-on handles this by adding an additional
<cpu-type>-specific directory layer to the make system search order
(e.g. ./sysdeps/powerpc/powerpc32/750) that can override the base. This
is selected at configure time via --with-cpu=750 or your favorite
<cpu-type>.


As an aside, do you think that any of these kinds of optimizations are
worth doing dynamically, based on the CPU type that we actually have?
So that, for example, "strcpy" can be optimized for your CPU,
independently of what CPU was used when configuring GLIBC?

In some situations, you may know for sure what CPU you're targeting, and
you want to build everything for that.  But, it might also be useful to
be able to dynamically adjust; I've wondered whether we might be able to
get 80% of the bang of a multilib by dynamically choosing a few
performance-critical routines (like memcpy, etc.).  My guess was that
I/O-bound routines (like printf) would be only slightly affected by the
particular CPU for which they were built, and so dynamically selecting a
few CPU-bound routines could make a big difference.  I don't know enough
about GLIBC implementation details to know how possible it would be to
efficiently implement that dynamic selection, though.

It depends. There is more than one type of dynamic selection and a number of trade-offs to consider.

It is possible to use a dynamic inline test to enable processor-specific optimizations based on some flag (for example the AT_HWCAP bits). The trick is that the test costs cycles, and you have to make sure the performance gain more than offsets the cost of the test. Even assuming that the aux vector is scanned once and the flags are cached in a static variable, this is still a significant cost.
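
As a rough illustration of the inline-test approach, here is a minimal C sketch. It assumes a Linux system whose libc provides getauxval() (a later glibc interface, not something this thread refers to). The names demo_memcpy, copy_generic, copy_tuned and DEMO_FEATURE_BIT are hypothetical, and the two copy routines are plain stand-ins rather than real tuned code.

```c
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP (Linux-specific) */

/* Assumption: the bit to test. This value matches PPC_FEATURE_HAS_ALTIVEC
   on Linux/PowerPC; on other architectures the same bit means something
   else, so treat it purely as a placeholder. */
#define DEMO_FEATURE_BIT 0x10000000UL

/* The aux vector is read once and the flags are cached in statics. */
static unsigned long cached_hwcap;
static int hwcap_valid;

static unsigned long get_hwcap(void)
{
    if (!hwcap_valid) {
        cached_hwcap = getauxval(AT_HWCAP);
        hwcap_valid = 1;
    }
    return cached_hwcap;
}

/* Hypothetical stand-in for the baseline routine. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical stand-in for a CPU-tuned routine. */
static void *copy_tuned(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Every call pays for the load of the cached flags plus a compare and
   branch before any copying starts. */
void *demo_memcpy(void *dst, const void *src, size_t n)
{
    if (get_hwcap() & DEMO_FEATURE_BIT)
        return copy_tuned(dst, src, n);
    return copy_generic(dst, src, n);
}
```

Note that the reads of cached_hwcap and hwcap_valid are static accesses, so in a -fpic build even this small wrapper pays the per-call overhead in question.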

On PPC32, static -fpic access requires establishing the GOT address (a bl, mflr, addis, addi sequence). See powerpc32 setjmp/_longjmp for examples. This is a dependent sequence that does not schedule well for small functions. Also, the ABI allows leaf routines (most of the mem* and str* functions) to skip setting up the GOT and stacking a frame. Accessing a static requires stacking a frame and addressing the GOT. There is also the question of how many different platforms you can optimize this way (each platform is another compare/branch). So this technique is only applicable to a small number of high-value optimizations.

The other end of the spectrum is to optimize the entire library for a specific platform and use the dynamic linker's dl_procinfo to select from multiple cpu-tuned libraries: <http://sources.redhat.com/ml/libc-alpha/2006-01/msg00094.html>.

This allows the maximum optimization via gcc (-mcpu=) and cpu-specific optimizations (selected via --with-cpu=). The tradeoff is the requirement to build multiple complete libraries.
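
To make the whole-library approach a little more concrete, here is a simplified C sketch of the idea: take the platform string the kernel passes in the aux vector and try a platform-specific directory before the base one. This is not the dynamic linker's actual code. The real search also involves hwcap subdirectories and the ld.so cache, and the /lib/<platform>/ layout and the helper names build_candidates and current_platform are assumptions made for illustration.

```c
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval(), AT_PLATFORM (Linux-specific) */

#define CANDIDATE_MAX 256

/* Fill `out` with the paths to try, most specific first.
   Returns the number of candidates written (1 or 2). */
int build_candidates(const char *platform, const char *libname,
                     char out[2][CANDIDATE_MAX])
{
    int n = 0;

    /* The platform-tuned copy, if we know the platform name. */
    if (platform != NULL && platform[0] != '\0')
        snprintf(out[n++], CANDIDATE_MAX, "/lib/%s/%s", platform, libname);

    /* The base library is always the fallback. */
    snprintf(out[n++], CANDIDATE_MAX, "/lib/%s", libname);
    return n;
}

/* Platform string from the aux vector, or NULL if none was provided. */
const char *current_platform(void)
{
    return (const char *)getauxval(AT_PLATFORM);
}
```

A caller would pass current_platform() and a name such as "libc.so.6" to build_candidates() and open the first candidate that exists. The selection happens once at load time, so the tuned routines themselves need no runtime test.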

