
[patches] Re: Possible PowerPC LIBC optimization



There are some further optimizations I'm playing with. I have added one more dcbt to the public_free function. I am also playing around with reorganizing the mp_ data structure to take into account cache filling.

Currently I am having trouble benchmarking my changes by doing a glibc build: I cannot get consistent results, and I know that disk access is affecting them. I have written the following program, which flushes the L1 data cache of any glibc data and calls malloc and free with a repeatable random sequence of sizes. It is biased towards smaller memory allocations. Please suggest any improvements to my test.

#include <stdio.h>
#include <memory.h>
#include <stdlib.h>
/* For flushing the MPC750 32 KB data cache: 8192 4-byte ints = 32 KB,
   plus one extra element so that index (i+1)*8 stays in bounds. */
int data_cache_flush_array[8193];

/*
Read and/or write to each cache line in the array (8 ints = one
32-byte line). Call in alternating directions to keep as much data
in the cache between malloc/free calls via the LRU discard policy.
*/
int thrash_the_cache(int up_down)
{
  int i;
  if (up_down) {
    i = 0;
    do {
      data_cache_flush_array[i*8] = data_cache_flush_array[(i+1)*8];
    } while (i++ < 1022);
  } else {
    i = 1023;
    do {
      data_cache_flush_array[(i+1)*8] = data_cache_flush_array[i*8];
    } while (i-- > 0);
  }
  return i;
}

int main(void)
{
  int x, y, z;
  char *mem;
  int x1, y1, z1;
  char *mem1;
  int x2, y2, z2;
  char *mem2;
  int x3, y3, z3;
  char *mem3;

  /* touch every element of the flush array once before the test */
  z = 8192;
  do {
    data_cache_flush_array[z] = z;
  } while (z--);

  srandom(7);  /* fixed seed gives a repeatable sequence of sizes */
  x = 65535;   /* upper bound on allocation size at each nesting level */
  x1 = 512;
  x2 = 4096;
  x3 = 256;
  /* The innermost body runs y * 1000 * 1000 * 100 times in total;
     scale the counts down for a quick run. */
  y = 100;
  do {
    thrash_the_cache(1);
    z = random() % x;
    mem = malloc(z);
    if (mem && z) mem[z-1] = 0x3f;
    y1 = 1000;  /* reset the counter for each pass of the enclosing loop */
    do {
      thrash_the_cache(0);
      z1 = random() % x1;
      mem1 = malloc(z1);
      if (mem1 && z1) mem1[z1-1] = 0x3f;
      y2 = 1000;
      do {
        thrash_the_cache(1);
        z2 = random() % x2;
        mem2 = malloc(z2);
        if (mem2 && z2) mem2[z2-1] = 0x3f;
        y3 = 100;
        do {
          thrash_the_cache(0);
          z3 = random() % x3;
          mem3 = malloc(z3);
          if (mem3 && z3) mem3[z3-1] = 0x3f;
          thrash_the_cache(1);
          free(mem3);
        } while (--y3);
        thrash_the_cache(0);
        free(mem2);
      } while (--y2);
      thrash_the_cache(1);
      free(mem1);
    } while (--y1);
    thrash_the_cache(0);
    free(mem);
  } while (--y);

  return 0;
}





Mark Hatle writes:
In the PowerPC community Conn Clark has been doing some interesting
optimization work in glibc.  Much of it, however, doesn't seem to be
acceptable to mainline glibc because it is very processor and
architecture specific.
The following information describes a simple change that made a large
performance improvement on the PPC 750 processor, and is believed to
make similar improvements on other PowerPC processors that support the
dcbt instruction.
From Conn Clark:
To see where I made the changes just search for "dcbt". The first two
dcbt's in the functions _int_malloc and _int_free are the ones that
make the biggest difference. The rest seem to help but they fall
within the noise margin of my test (a compile of glibc).

Attached is the patch that Clark sent me against glibc-2.5.  I don't
think it is directly applicable to glibc as it stands; however, the idea
behind it appears to be sound.
There are points in malloc/free where preloading the cache (at least
on PPC) makes sense.  So adding hooks in these locations may allow us to
configure in processor-specific items that could dramatically improve
performance on various processors.
--Mark


Conn
---------------------------------------
Conn Clark
Electronic Systems Technology
415 N. Quay Street Building B1     (509)-735-9092 ext 117
Kennewick, WA. 99336
Observation: In formal computer science advances are made
by standing on the shoulders of giants. Linux has proved
that if there are enough of you, you can advance just as
far by standing on each other's toes.