Idea #9228: Clearing stale transtable entries

sje · Post by **sje** » Wed Apr 08, 2015 3:36 pm

Multi-threaded table clearing now working

I've finished implementing the multi-threaded memory blaster and it's passed the first set of tests.

There is a speed gain, but it's difficult to quantify because so much depends upon core count, cache issues, possible hyper-threading, table size, memory channel count, memory layout, etc.

The only non-trivial part of the code is the section which partitions the memory region across the threads. This is handled in two passes; the first calculates the start address of each thread's memory segment, and the second determines via subtraction the exact byte count for each segment. There is a need for some care here because of integer division truncation; not all segments are guaranteed to be of the same length.

I recommend multi-threaded memory blasting to any author whose program has a need to clear large tables.

Code: Select all

typedef unsigned int ui;
typedef unsigned char ui8;

typedef struct
&#123;
  pthread_t  pthread;     // Thread ID
  void      *baseptr;     // Memory region start
  size_t     bytelength;  // Memory region length
&#125; MemZapRec;

void MemClearST&#40;void * const baseptr, const size_t bytelength&#41;
&#123;
  // Memory clear&#58; single threaded

  memset&#40;baseptr, 0, bytelength&#41;;
&#125;

void *MemClearTask&#40;void *ptr&#41;
&#123;
  // Called only via pthread_create&#40;)

  const MemZapRec * const mzrptr = &#40;MemZapRec *) ptr;
  
  MemClearST&#40;mzrptr->baseptr, mzrptr->bytelength&#41;;
  return 0;
&#125;

void MemClearMT&#40;void * const baseptr, const size_t bytelength&#41;
&#123;
  // Memory clear&#58; multi threaded

  const ui corecount = FetchCoreCount&#40;);
  MemZapRec memzapvec&#91;CpuCoreLen&#93;;  // CpuCoreLen is the maximum core count for a CPU
  
  // For each thread, set the segment start address

  for &#40;ui index = 0; index < corecount; index++)
    memzapvec&#91;index&#93;.baseptr = (&#40;ui8 *) baseptr&#41; + &#40;index * &#40;bytelength / corecount&#41;);
  
  // For each thread, set the segment length
  
  for &#40;ui index = 0; index < corecount; index++)
  &#123;
    if &#40;index < &#40;corecount - 1&#41;)
      memzapvec&#91;index&#93;.bytelength = &#40;ui8 *) memzapvec&#91;index + 1&#93;.baseptr - &#40;ui8 *) memzapvec&#91;index&#93;.baseptr;
    else
      memzapvec&#91;index&#93;.bytelength = &#40;ui8 *) baseptr + bytelength - &#40;ui8 *) memzapvec&#91;index&#93;.baseptr;
  &#125;;
  
  // For each thread, create/run
  
  for &#40;ui index = 0; index < corecount; index++)
    pthread_create&#40;&memzapvec&#91;index&#93;.pthread, 0, MemClearTask, &#40;void *) &memzapvec&#91;index&#93;);
  
  // For each thread, wait until finish
  
  for &#40;ui index = 0; index < corecount; index++)
    pthread_join&#40;memzapvec&#91;index&#93;.pthread, 0&#41;;
&#125;

void MemClear&#40;void * const baseptr, const size_t bytelength&#41;
&#123;
  // This is the externally visible routine for clearnig a memory region

  if &#40;FetchCoreCount&#40;) == 1&#41;
    MemClearST&#40;baseptr, bytelength&#41;;  // No threading used if only one core
  else
    MemClearMT&#40;baseptr, bytelength&#41;;  // At least two threads will be used
&#125;

sje · Post by **sje** » Thu Apr 09, 2015 7:58 pm

Speed test and revised code

On a four core box, the new multi-threaded code cleared 12 GiB of RAM in about 2.4 seconds compared to about 4.5 seconds used by the old single threaded code.

It is likely that nearly all the gain comes from just having a second thread. It might be possible to save a bit more time overall if the threads are created once at program start time and then reused as needed.

The code has been tested over various permutations of cores (1, 2, 4, 8), word lengths (32, 64), OS (Linux, Mac OS/X), and CPU brand (PowerPC, AMD, Intel).

Revised source for the MemClearMT() routine:

Code: Select all

void MemClearMT&#40;void * const baseptr, const size_t bytelength&#41;
&#123;
  // Fetch the number of CPU cores; this is also the number of threads to be launched

  const ui corecount = FetchCoreCount&#40;);

  // Calculate the memory segment start address separtion

  const size_t separation = bytelength / corecount;

  // The vector of memory zap thread records; element count equals the maximum possible core count

  MemZapRec memzapvec&#91;CpuCoreLen&#93;;

  // For each thread, initialize its parameter block and start the thread

  for &#40;ui index = 0; index < corecount; index++)
  &#123;
    // Calculate the start address of the segment the thread will clear

    void * const segbaseptr = (&#40;ui8 *) baseptr&#41; + &#40;index * separation&#41;;
    size_t segbytelength;

    // Each thread gets the same segment length except the last which may be rounded up

    if &#40;index < &#40;corecount - 1&#41;)
      segbytelength = separation;
    else
      segbytelength = &#40;ui8 *) baseptr + bytelength - &#40;ui8 *) segbaseptr;

    // Set the segment start address and its length in the thread's parameter block

    memzapvec&#91;index&#93;.baseptr = segbaseptr;
    memzapvec&#91;index&#93;.bytelength = segbytelength;

    // Start the thread

    pthread_create&#40;&memzapvec&#91;index&#93;.pthread, 0, MemClearTask, &#40;void *) &memzapvec&#91;index&#93;);
  &#125;;

  // For each thread, wait for it to finish and go away

  for &#40;ui index = 0; index < corecount; index++)
    pthread_join&#40;memzapvec&#91;index&#93;.pthread, 0&#41;;
&#125;

All free for you to use.

sje · Post by **sje** » Fri Apr 10, 2015 1:27 am

On a more modern (2014 vs 2006) quad core machine (3.4 GHz Intel Core i5), the multi-threaded blaster consistently cleared 24 GiB of memory in 0.86 seconds. This includes thread creation/destruction overhead, slightly less than one millisecond per thread.

On the same machine, blasting a 1/16th size segment (1,536 MiB) should take well under 100 milliseconds, a reasonable time even for blitz.

Idea #9228: Clearing stale transtable entries

Multi-threaded table clearing now working

Speed test and revised code

Time needed to clear 24 GiB