    mm: size of quicklists shouldn't be proportional to the number of CPUs · b9541852
    KOSAKI Motohiro authored
    
    
    Quicklists cache pages on each CPU.  (Each CPU can cache up to
    node_free_pages/16 pages.)

    They are used as a page table cache.  exit() increases the cache size,
    while fork() consumes it.
    
    So, for example, if an apache-style application runs (one parent and many
    children), one CPU will fork() while the other CPUs handle the
    middleware work and exit().

    At that point, the CPU on which the parent runs has no page table cache
    at all, while the others (on which the children run) have full caches.
    
    	QList_max = (#ofCPUs - 1) x Free / 16
    	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)
    
    So, how much quicklist memory is used in the worst case?

    It is proportional to the number of CPUs, because the per-CPU quicklist
    cache limit doesn't take the number of CPUs into account.

    The above calculation means
    
    	 Number of CPUs per node            2    4    8   16
    	 ==============================  ====================
    	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%
    
    Wow! Quicklists can consume about 50% of memory in the worst case.
    
    Here is my demonstration program:
    --------------------------------------------------------------------------------
    #define _GNU_SOURCE
    
    #include <stdio.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    
    #define BUFFSIZE 512
    
    int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
    {
      FILE *fd;
      char *ret, buffer[BUFFSIZE];
      int cpu = 1;
    
      fd = fopen("/proc/cpuinfo", "r");
      if (fd == NULL) {
        perror("fopen(/proc/cpuinfo)");
        exit(EXIT_FAILURE);
      }
      while (1) {
        ret = fgets(buffer, BUFFSIZE, fd);
        if (ret == NULL)
          break;
        if (!strncmp(buffer, "processor", 9))
          cpu = atoi(strchr(buffer, ':') + 2);
      }
      fclose(fd);
      return cpu;
    }
    
    void cpu_bind(int cpu)	/* bind current process to one cpu */
    {
      cpu_set_t mask;
      int ret;
    
      CPU_ZERO(&mask);
      CPU_SET(cpu, &mask);
      ret = sched_setaffinity(0, sizeof(mask), &mask);
      if (ret == -1) {
        perror("sched_setaffinity()");
        exit(EXIT_FAILURE);
      }
      sched_yield();	/* not necessary */
    }
    
    #define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
    #define FORK_INTERVAL 1	/* 1 second */
    
    int main(int argc, char *argv[])
    {
      int cpu_max, nextcpu;
      long pagesize;
      pid_t pid;
    
      /* set max number of logical cpu */
      if (argc > 1)
        cpu_max = atoi(argv[1]) - 1;
      else
        cpu_max = max_cpu();
    
      /* get the page size */
      pagesize = sysconf(_SC_PAGESIZE);
      if (pagesize == -1) {
        perror("sysconf(_SC_PAGESIZE)");
        exit(EXIT_FAILURE);
      }
    
      /* prepare parent process */
      cpu_bind(0);
      nextcpu = cpu_max;
    
    loop:
    
      /* select destination cpu for child process by round-robin rule */
      if (++nextcpu > cpu_max)
        nextcpu = 1;
    
      pid = fork();
    
      if (pid == 0) { /* child action */
    
        char *p;
        int i;
    
        /* consume page tables */
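        /* anonymous mappings take fd -1 for portability */
        p = mmap(NULL, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
          perror("mmap()");
          exit(EXIT_FAILURE);
        }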
        i = MMAP_SIZE / pagesize;
        while (i-- > 0) {
          *p = 1;
          p += pagesize;
        }
    
        /* move to other cpu */
        cpu_bind(nextcpu);
    /*
        printf("a child moved to cpu%d after mmap().\n", nextcpu);
        fflush(stdout);
     */
    
        /* back page tables to pgtable_quicklist */
        exit(0);
    
      } else if (pid > 0) { /* parent action */
    
        sleep(FORK_INTERVAL);
        waitpid(pid, NULL, WNOHANG);
    
      } else { /* fork() failed */

        perror("fork()");
        exit(EXIT_FAILURE);

      }
    
      goto loop;
    }
    ----------------------------------------
    
    When the above program, which migrates tasks between CPUs, runs, my 8GB
    box spends 800MB of memory on quicklists.  This is not a memory leak,
    but it doesn't seem good.
    
    % cat /proc/meminfo
    
    MemTotal:        7701568 kB
    MemFree:         4724672 kB
    (snip)
    Quicklists:       844800 kB
    
    because
    
    - My machine spec is
    	number of numa nodes: 2
    	number of cpus:       8 (4 CPUs x 2 nodes)
    	total mem:            8GB (4GB x 2 nodes)
    	free mem:             about 5GB

    - Then, (4.7GB free + 0.8GB quicklists) x 16% ~= 880MB.
      So, quicklists can use up to about 880MB, which matches the 800MB
      observed.
    
    So, if a machine with the following spec runs that program

       CPUs: 64 (8 CPUs x 8 nodes)
       Mem:  1TB (128GB x 8 nodes)

    then quicklists can waste 300GB (= 1TB x 30%).  That is too much.
    
    So, I don't like cache policies whose size is proportional to the
    number of CPUs.
    
    My patch changes the per-CPU cache limit
    from:
       per-cpu-cache-amount = memory_on_node / 16
    to:
       per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node.
    
    Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
    Acked-by: Christoph Lameter <cl@linux-foundation.org>
    Tested-by: David Miller <davem@davemloft.net>
    Acked-by: Mike Travis <travis@sgi.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>