Mark Mitchell wrote:
> Steven Munroe wrote:
>> My concern is that what works for 750 may not be best for 8540 or 970. The powerpc-cpu add-on handles this by adding additional <cpu-type> specific directory layer to the make system search order (i.e. ./sysdeps/powerpc/powerpc32/750) that can override the base. This is selected at configure time via --with-cpu=750 or your favorite <cpu-type>.
>
> As an aside, do you think that any of these kinds of optimizations are worth doing dynamically, based on the CPU type that we actually have? So that, for example, "strcpy" can be optimized for your CPU, independently of what CPU was used when configuring GLIBC?
>
> In some situations, you may know for sure what CPU you're targeting, and you want to build everything for that. But, it might also be useful to be able to dynamically adjust; I've wondered whether we might be able to get 80% of the bang of a multilib by dynamically choosing a few performance-critical routines (like memcpy, etc.). My guess was that I/O-bound routines (like printf) would be only slightly affected by the particular CPU for which they were built, and so dynamically selecting a few CPU-bound routines could make a big difference. I don't know enough about GLIBC implementation details to know how possible it would be to efficiently implement that dynamic selection, though.

It depends and there is more than one type of dynamic selection and a number of trade-offs to consider.
It is possible to use a dynamic inline test to enable processor-specific optimizations based on some flag (for example the AT_HWCAP bits). The trick is that the test costs cycles, and you have to make sure the performance gain more than offsets the cost of the test. Even assuming that the aux vector is scanned once and the flags are cached in a static variable, this is still a significant cost.
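As a rough illustration of the inline-test approach, here is a minimal sketch in C. It assumes a Linux system with a glibc that provides getauxval() (added in glibc 2.16, so later than this thread). The feature bit HYPOTHETICAL_FAST_COPY_BIT and the functions my_copy, copy_generic and copy_tuned are made up for the example; real AT_HWCAP bit assignments are platform-specific, and this is not how glibc itself is structured.

```c
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP (Linux, glibc 2.16 or later) */

/* Hypothetical feature bit, for illustration only. Real AT_HWCAP bits
   are defined per platform by the kernel headers. */
#define HYPOTHETICAL_FAST_COPY_BIT (1UL << 0)

/* Cached copy of the AT_HWCAP word. This is the static variable whose
   access cost the text above is concerned about. The lazy fill is not
   synchronized, which is acceptable for a sketch only. */
static unsigned long cached_hwcap;
static int hwcap_valid;

static unsigned long get_hwcap(void)
{
    if (!hwcap_valid) {
        cached_hwcap = getauxval(AT_HWCAP);  /* look up the aux vector once */
        hwcap_valid = 1;
    }
    return cached_hwcap;
}

/* Baseline byte-at-a-time copy. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a CPU-tuned variant; here it simply defers to memcpy. */
static void *copy_tuned(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* The dynamic inline test: one load, one mask and one branch on every
   call, paid before any useful work is done. */
void *my_copy(void *dst, const void *src, size_t n)
{
    if (get_hwcap() & HYPOTHETICAL_FAST_COPY_BIT)
        return copy_tuned(dst, src, n);
    return copy_generic(dst, src, n);
}
```

Each additional platform variant would add another compare and branch to my_copy, so the per-call overhead grows with the number of CPUs supported this way.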
On PPC32, accessing a static variable from -fpic code requires establishing the GOT address (a bl, mflr, addis, addi sequence). See powerpc32 setjmp/_longjmp for examples. This is a dependent sequence that does not schedule well for small functions. Also, the ABI allows leaf routines (most of the mem* and str* functions) to skip setting up the GOT and stacking a frame, whereas accessing a static requires stacking a frame and addressing the GOT. There is also the question of how many different platforms you can optimize this way (each platform is another compare/branch). So this technique is only applicable to a small number of high-value optimizations.
The other end of the spectrum is to optimize the entire library for a specific platform and use the dynamic linker's dl_procinfo to select from multiple cpu-tuned libraries. <http://sources.redhat.com/ml/libc-alpha/2006-01/msg00094.html>.
This allows the maximum optimization, via both the gcc option (-mcpu=) and the cpu-specific optimizations (selected via --with-cpu=). The trade-off is the requirement to build multiple complete libraries.
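To make the library-selection idea more concrete, here is a small sketch of how a loader-like component might map the running CPU's platform string to a cpu-tuned library directory. It is only an illustration and not the dynamic linker's actual code: tuned_lib_path and running_platform are hypothetical helpers, the <libdir>/<platform>/ layout is an assumption, and getauxval(AT_PLATFORM) assumes Linux with glibc 2.16 or later.

```c
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval, AT_PLATFORM (Linux, glibc 2.16 or later) */

/* Hypothetical helper: build "<libdir>/<platform>/<name>" when a platform
   string is available, otherwise fall back to "<libdir>/<name>".
   Returns the value of snprintf, i.e. the length the full path needs. */
int tuned_lib_path(char *out, size_t outlen, const char *libdir,
                   const char *platform, const char *name)
{
    if (platform != NULL && platform[0] != '\0')
        return snprintf(out, outlen, "%s/%s/%s", libdir, platform, name);
    return snprintf(out, outlen, "%s/%s", libdir, name);
}

/* Hypothetical helper: the platform string the kernel placed in the aux
   vector, or NULL if there is none. */
const char *running_platform(void)
{
    return (const char *) getauxval(AT_PLATFORM);
}
```

A real loader would also have to check that the tuned file exists and fall back to the base library when it does not. The point here is only that the choice is made once per library at load time, so the mem* and str* routines themselves carry no per-call test.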