
[patches] Possible PowerPC LIBC optimization



In the PowerPC community, Conn Clark has been doing some interesting
optimization work in glibc.  Much of it, however, doesn't seem to be
acceptable to mainline glibc because it is very processor- and
architecture-specific.

The following describes a simple change that made a large performance
improvement on the PPC 750 processor and is believed to yield similar
improvements on other PowerPC processors that implement the dcbt
instruction.
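
For readers unfamiliar with it, dcbt (Data Cache Block Touch) is a
prefetch hint: it asks the processor to pull the cache block containing
an effective address into the data cache without stalling execution.
Below is a minimal sketch of how it can be issued from GNU C;
prefetch_read is just an illustrative name, and __builtin_prefetch is
shown as a portable fallback, neither of which is part of the patch:

/* Sketch only: issue a dcbt prefetch hint on PowerPC, falling back to
   GCC's generic __builtin_prefetch elsewhere.  With RA = 0, the
   effective address of "dcbt RA, RB" is simply RB.  */
static inline void prefetch_read (const void *addr)
{
#if defined (__PPC__) || defined (__powerpc__)
  __asm__ __volatile__ ("dcbt 0, %0" :: "r" (addr));
#else
  __builtin_prefetch (addr, 0 /* read */, 3 /* high temporal locality */);
#endif
}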

From Conn Clark:
> To see where I made the changes just search for "dcbt". The first two
> dcbt's in the functions _int_malloc and _int_free are the ones that
> make the biggest difference. The rest seem to help, but they fall
> within the noise margin of my test (a compile of glibc).

Attached is the patch that Clark sent me against glibc-2.5.  I don't
think it is directly applicable to glibc as it stands; however, the
idea behind it appears to be sound.

There are points in malloc/free where preloading the cache (at least
on PPC) makes sense.  Adding hooks at these locations may allow us to
configure in processor-specific code that could dramatically improve
performance on various processors; a sketch of what such a hook might
look like follows.
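
This is only a sketch of the idea, not a proposed patch.
MALLOC_PREFETCH is a hypothetical name, not anything that exists in
glibc; each architecture would supply its own definition, with a no-op
default so the call sites stay architecture-neutral:

/* Hypothetical hook: expands to a dcbt hint on PowerPC and to nothing
   everywhere else.  */
#if defined (__PPC__) || defined (__powerpc__)
# define MALLOC_PREFETCH(addr) \
    __asm__ __volatile__ ("dcbt 0, %0" :: "r" (addr))
#else
# define MALLOC_PREFETCH(addr) ((void) 0)
#endif

/* _int_malloc and _int_free would then begin with, e.g.:
     MALLOC_PREFETCH (&global_max_fast);  */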

--Mark
--- malloc/malloc.c	2007-02-26 11:19:23.000000000 -0600
+++ /tmp/malloc.c	2007-05-15 11:53:34.000000000 -0500
@@ -3980,6 +3981,8 @@
 Void_t*
 _int_malloc(mstate av, size_t bytes)
 {
+ asm __volatile__ ("dcbt 0, %0" ::   "r" (&global_max_fast));
+
   INTERNAL_SIZE_T nb;               /* normalized request size */
   unsigned int    idx;              /* associated bin index */
   mbinptr         bin;              /* associated bin */
@@ -4009,6 +4012,8 @@
   */
 
   checked_request2size(bytes, nb);
+asm __volatile__ ("dcbt 0, %0" ::   "r" (av->fastbins));
+
 
   /*
     If the size qualifies as a fastbin, first check corresponding bin.
@@ -4181,6 +4186,7 @@
 	    assert((fwd->size & NON_MAIN_ARENA) == 0);
             while ((unsigned long)(size) < (unsigned long)(fwd->size)) {
               fwd = fwd->fd;
+              asm __volatile__ ("dcbt %0, %1" :: "r" (sizeof(size_t)),  "r" (fwd->fd));
 	      assert((fwd->size & NON_MAIN_ARENA) == 0);
 	    }
             bck = fwd->bk;
@@ -4414,6 +4420,7 @@
 void
 _int_free(mstate av, Void_t* mem)
 {
+ asm __volatile__ ("dcbt 0, %0" ::   "r" (&global_max_fast));
   mchunkptr       p;           /* chunk corresponding to mem */
   INTERNAL_SIZE_T size;        /* its size */
   mfastbinptr*    fb;          /* associated fastbin */
@@ -4427,6 +4434,8 @@
   const char *errstr = NULL;
 
   p = mem2chunk(mem);
+ asm __volatile__ ("dcbt 0, %0" ::   "r" (av->fastbins)); 
+
   size = chunksize(p);
 
   /* Little security check which won't hurt performance: the
@@ -4650,6 +4659,7 @@
 static void malloc_consolidate(av) mstate av;
 #endif
 {
+
   mfastbinptr*    fb;                 /* current fastbin being consolidated */
   mfastbinptr*    maxfb;              /* last fastbin (for loop control) */
   mchunkptr       p;                  /* current chunk being consolidated */
@@ -4673,6 +4683,7 @@
 
   if (get_max_fast () != 0) {
     clear_fastchunks(av);
+ asm __volatile__ ("dcbt 0, %0" ::   "r" (av->fastbins)); 
 
     unsorted_bin = unsorted_chunks(av);
 
@@ -4693,6 +4704,7 @@
         do {
           check_inuse_chunk(av, p);
           nextp = p->fd;
+asm __volatile__ ("dcbt 0, %0" ::  "r" (nextp));
 
           /* Slightly streamlined version of consolidation code in free() */
           size = p->size & ~(PREV_INUSE|NON_MAIN_ARENA);