1. 25 Feb, 2013 1 commit
  2. 24 Feb, 2013 2 commits
  3. 23 Feb, 2013 15 commits
    • Zhang Yanfei's avatar
      ia64: use %ld to print pages calculated in nr_free_buffer_pages · 6434b94a
      Zhang Yanfei authored
      Now the function nr_free_buffer_pages returns unsigned long, so use %ld
      to print its return value.
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6434b94a
    • Shaohua Li's avatar
      swap: add per-partition lock for swapfile · ec8acf20
      Shaohua Li authored
      swap_lock is heavily contended when I test swap to 3 fast SSD (even
      slightly slower than swap to 2 such SSD).  The main contention comes
      from swap_info_get().  This patch tries to fix the gap with adding a new
      per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
      theory, it's possible get_swap_page() finds no swap pages but actually
      there are free swap pages.  But sounds not a big problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags need hold swap_lock and
      swap_info_struct.lock, because scan_scan_map() will check it.  read the
      flags is ok with either the locks hold.
      
      If both swap_lock and swap_info_struct.lock must be hold, we always hold
      the former first to avoid deadlock.
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity get_swap_page() still holds swap_lock().  But in practice,
      swap_lock() isn't heavily contended in my test with this patch (or I can
      say there are other much more heavier bottlenecks like TLB flush).  And
      BTW, looks get_swap_page() doesn't really need the lock.  We never free
      swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
      lock is we could swapout to some low priority swap, but we can quickly
      recover after several rounds of swap, so sounds not a big deal to me.
      But I'd prefer to fix this if it's a real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec8acf20
    • Tang Chen's avatar
      acpi, memory-hotplug: support getting hotplug info from SRAT · 01a178a9
      Tang Chen authored
      We now provide an option for users who don't want to specify physical
      memory address in kernel commandline.
      
               /*
                * For movablemem_map=acpi:
                *
                * SRAT:                |_____| |_____| |_________| |_________| ......
                * node id:                0       1         1           2
                * hotpluggable:           n       y         y           n
                * movablemem_map:              |_____| |_________|
                *
                * Using movablemem_map, we can prevent memblock from allocating memory
                * on ZONE_MOVABLE at boot time.
                */
      
      So user just specify movablemem_map=acpi, and the kernel will use
      hotpluggable info in SRAT to determine which memory ranges should be set
      as ZONE_MOVABLE.
      
      If all the memory ranges in SRAT is hotpluggable, then no memory can be
      used by kernel.  But before parsing SRAT, memblock has already reserve
      some memory ranges for other purposes, such as for kernel image, and so
      on.  We cannot prevent kernel from using these memory.  So we need to
      exclude these ranges even if these memory is hotpluggable.
      
      Furthermore, there could be several memory ranges in the single node
      which the kernel resides in.  We may skip one range that have memory
      reserved by memblock, but if the rest of memory is too small, then the
      kernel will fail to boot.  So, make the whole node which the kernel
      resides in un-hotpluggable.  Then the kernel has enough memory to use.
      
      NOTE: Using this way will cause NUMA performance down because the
            whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
            on it.  If users don't want to lose NUMA performance, just don't use
            it.
      
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: use strcmp()]
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01a178a9
    • Tang Chen's avatar
      acpi, memory-hotplug: extend movablemem_map ranges to the end of node · 27168d38
      Tang Chen authored
      When implementing movablemem_map boot option, we introduced an array
      movablemem_map.map[] to store the memory ranges to be set as
      ZONE_MOVABLE.
      
      Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
      the whole node memory range, we need to extend it to the node end so
      that we can use it to prevent memblock from allocating memory in the
      ranges user didn't specify.
      
      We now implement movablemem_map boot option like this:
      
              /*
               * For movablemem_map=nn[KMG]@ss[KMG]:
               *
               * SRAT:                |_____| |_____| |_________| |_________| ......
               * node id:                0       1         1           2
               * user specified:                |__|                 |___|
               * movablemem_map:                |___| |_________|    |______| ......
               *
               * Using movablemem_map, we can prevent memblock from allocating memory
               * on ZONE_MOVABLE at boot time.
               *
               * NOTE: In this case, SRAT info will be ingored.
               */
      
      [akpm@linux-foundation.org: clean up code, fix build warning]
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27168d38
    • Tang Chen's avatar
      acpi, memory-hotplug: parse SRAT before memblock is ready · e8d19552
      Tang Chen authored
      On linux, the pages used by kernel could not be migrated.  As a result,
      if a memory range is used by kernel, it cannot be hot-removed.  So if we
      want to hot-remove memory, we should prevent kernel from using it.
      
      The way now used to prevent this is specify a memory range by
      movablemem_map boot option and set it as ZONE_MOVABLE.
      
      But when the system is booting, memblock will allocate memory, and
      reserve the memory for kernel.  And before we parse SRAT, and know the
      node memory ranges, memblock is working.  And it may allocate memory in
      ranges to be set as ZONE_MOVABLE.  This memory can be used by kernel,
      and never be freed.
      
      So, let's parse SRAT before memblock is called first.  And it is early
      enough.
      
      The first call of memblock_find_in_range_node() is in:
      
        setup_arch()
          |-->setup_real_mode()
      
      so, this patch add a function early_parse_srat() to parse SRAT, and call
      it before setup_real_mode() is called.
      
      NOTE:
      
      1) early_parse_srat() is called before numa_init(), and has initialized
         numa_meminfo.  So DO NOT clear numa_nodes_parsed in numa_init() and DO
         NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
         numa info.
      
      2) I don't know why using count of memory affinities parsed from SRAT
         as a return value in original acpi_numa_init().  So I add a static
         variable srat_mem_cnt to remember this count and use it as the return
         value of the new acpi_numa_init()
      
      [mhocko@suse.cz: parse SRAT before memblock is ready fix]
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8d19552
    • Yasuaki Ishimatsu's avatar
      x86: get pg_data_t's memory from other node · 4d59a751
      Yasuaki Ishimatsu authored
      During the implementation of SRAT support, we met a problem.  In
      setup_arch(), we have the following call series:
      
       1) memblock is ready;
       2) some functions use memblock to allocate memory;
       3) parse ACPI tables, such as SRAT.
      
      Before 3), we don't know which memory is hotpluggable, and as a result,
      we cannot prevent memblock from allocating hotpluggable memory.  So, in
      2), there could be some hotpluggable memory allocated by memblock.
      
      Now, we are trying to parse SRAT earlier, before memblock is ready.  But
      I think we need more investigation on this topic.  So in this v5, I
      dropped all the SRAT support, and v5 is just the same as v3, and it is
      based on 3.8-rc3.
      
      As we planned, we will support getting info from SRAT without users'
      participation at last.  And we will post another patch-set to do so.
      
      And also, I think for now, we can add this boot option as the first step
      of supporting movable node.  Since Linux cannot migrate the direct
      mapped pages, the only way for now is to limit the whole node containing
      only movable memory.
      
      Using SRAT is one way.  But even if we can use SRAT, users still need an
      interface to enable/disable this functionality if they don't want to
      loose their NUMA performance.  So I think, a user interface is always
      needed.
      
      For now, users can disable this functionality by not specifying the boot
      option.  Later, we will post SRAT support, and add another option value
      "movablecore_map=acpi" to using SRAT.
      
      This patch:
      
      If system can create movable node which all memory of the node is
      allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
      the node's pg_data_t.  So, use memblock_alloc_try_nid() instead of
      memblock_alloc_nid() to retry when the first allocation fails.
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarJiang Liu <jiang.liu@huawei.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d59a751
    • Wen Congyang's avatar
      cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node · e13fe869
      Wen Congyang authored
      When the node is offlined, there is no memory/cpu on the node.  If a
      sleep task runs on a cpu of this node, it will be migrated to the cpu on
      the other node.  So we can clear cpu-to-node mapping.
      
      [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e13fe869
    • Wen Congyang's avatar
      cpu_hotplug: clear apicid to node when the cpu is hotremoved · c4c60524
      Wen Congyang authored
      When a cpu is hotpluged, we call acpi_map_cpu2node() in
      _acpi_map_lsapic() to store the cpu's node and apicid's node.  But we
      don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
      hotremoved.  If the node is also hotremoved, we will get the following
      messages:
      
        kernel BUG at include/linux/gfp.h:329!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
        RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
        RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
        RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
        R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
        FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
        Call Trace:
          new_slab+0x30/0x1b0
          __slab_alloc+0x358/0x4c0
          kmem_cache_alloc_node_trace+0xb4/0x1e0
          alloc_fair_sched_group+0xd0/0x1b0
          sched_create_group+0x3e/0x110
          sched_autogroup_create_attach+0x4d/0x180
          sys_setsid+0xd4/0xf0
          system_call_fastpath+0x16/0x1b
        Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
        RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
         RSP <ffff88078a049cf8>
        ---[ end trace adf84c90f3fea3e5 ]---
      
      The reason is that the cpu's node is not NUMA_NO_NODE, we will call
      alloc_pages_exact_node() to alloc memory on the node, but the node is
      offlined.
      
      If the node is onlined, we still need cpu's node.  For example: a task
      on the cpu is sleeped when the cpu is hotremoved.  We will choose
      another cpu to run this task when it is waked up.  If we know the cpu's
      node, we will choose the cpu on the same node first.  So we should clear
      cpu-to-node mapping when the node is offlined.
      
      This patch only clears apicid-to-node mapping when the cpu is
      hotremoved.
      
      [akpm@linux-foundation.org: fix section error]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4c60524
    • Tang Chen's avatar
      memory-hotplug: remove memmap of sparse-vmemmap · 0197518c
      Tang Chen authored
      Introduce a new API vmemmap_free() to free and remove vmemmap
      pagetables.  Since pagetable implements are different, each architecture
      has to provide its own version of vmemmap_free(), just like
      vmemmap_populate().
      
      Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.
      
      [mhocko@suse.cz: fix implicit declaration of remove_pagetable]
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0197518c
    • Tang Chen's avatar
      memory-hotplug: remove page table of x86_64 architecture · bbcab878
      Tang Chen authored
      Search a page table about the removed memory, and clear page table for
      x86_64 architecture.
      
      [akpm@linux-foundation.org: make kernel_physical_mapping_remove() static]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: default avatarJiang Liu <jiang.liu@huawei.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbcab878
    • Wen Congyang's avatar
      memory-hotplug: common APIs to support page tables hot-remove · ae9aae9e
      Wen Congyang authored
      When memory is removed, the corresponding pagetables should alse be
      removed.  This patch introduces some common APIs to support vmemmap
      pagetable and x86_64 architecture direct mapping pagetable removing.
      
      All pages of virtual mapping in removed memory cannot be freed if some
      pages used as PGD/PUD include not only removed memory but also other
      memory.  So this patch uses the following way to check whether a page
      can be freed or not.
      
      1) When removing memory, the page structs of the removed memory are
         filled with 0FD.
      
      2) All page structs are filled with 0xFD on PT/PMD, PT/PMD can be
         cleared.  In this case, the page used as PT/PMD can be freed.
      
      For direct mapping pages, update direct_pages_count[level] when we freed
      their pagetables.  And do not free the pages again because they were
      freed when offlining.
      
      For vmemmap pages, free the pages and their pagetables.
      
      For larger pages, do not split them into smaller ones because there is
      no way to know if the larger page has been split.  As a result, there is
      no way to decide when to split.  We deal the larger pages in the
      following way:
      
      1) For direct mapped pages, all the pages were freed when they were
         offlined.  And since menmory offline is done section by section, all
         the memory ranges being removed are aligned to PAGE_SIZE.  So only need
         to deal with unaligned pages when freeing vmemmap pages.
      
      2) For vmemmap pages being used to store page_struct, if part of the
         larger page is still in use, just fill the unused part with 0xFD.  And
         when the whole page is fulfilled with 0xFD, then free the larger page.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [tangchen@cn.fujitsu.com: do not calculate direct mapping pages when freeing vmemmap pagetables]
      [tangchen@cn.fujitsu.com: do not free direct mapping pages twice]
      [tangchen@cn.fujitsu.com: do not free page split from hugepage one by one]
      [tangchen@cn.fujitsu.com: do not split pages when freeing pagetable pages]
      [akpm@linux-foundation.org: use pmd_page_vaddr()]
      [akpm@linux-foundation.org: fix used-uninitialised bug]
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae9aae9e
    • Yasuaki Ishimatsu's avatar
      memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap · 46723bfa
      Yasuaki Ishimatsu authored
      For removing memmap region of sparse-vmemmap which is allocated bootmem,
      memmap region of sparse-vmemmap needs to be registered by
      get_page_bootmem().  So the patch searches pages of virtual mapping and
      registers the pages by get_page_bootmem().
      
      NOTE: register_page_bootmem_memmap() is not implemented for ia64,
            ppc, s390, and sparc.  So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
            and revert register_page_bootmem_info_node() when platform doesn't
            support it.
      
            It's implemented by adding a new Kconfig option named
            CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
            by memory-hotplug feature fully supported archs(currently only on
            x86_64).
      
            Since we have 2 config options called MEMORY_HOTPLUG and
            MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately,
            and codes in function register_page_bootmem_info_node() are only
            used for collecting infomation for hot-remove, so reside it under
            MEMORY_HOTREMOVE.
      
            Besides page_isolation.c selected by MEMORY_ISOLATION under
            MEMORY_HOTPLUG is also such case, move it too.
      
      [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
      [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
      [mhocko@suse.cz: remove the arch specific functions without any implementation]
      [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
      [rientjes@google.com: fix defined but not used warning]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: default avatarWu Jianguo <wujianguo@huawei.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarLin Feng <linfeng@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46723bfa
    • Wen Congyang's avatar
      memory-hotplug: introduce new arch_remove_memory() for removing page table · 24d335ca
      Wen Congyang authored
      For removing memory, we need to remove page tables.  But it depends on
      architecture.  So the patch introduce arch_remove_memory() for removing
      page table.  Now it only calls __remove_pages().
      
      Note: __remove_pages() for some archtecuture is not implemented
            (I don't know how to implement it for s390).
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24d335ca
    • Michel Lespinasse's avatar
      mm: remove flags argument to mmap_region · c22c0d63
      Michel Lespinasse authored
      After the MAP_POPULATE handling has been moved to mmap_region() call
      sites, the only remaining use of the flags argument is to pass the
      MAP_NORESERVE flag.  This can be just as easily handled by
      do_mmap_pgoff(), so do that and remove the mmap_region() flags
      parameter.
      
      [akpm@linux-foundation.org: remove double parens]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c22c0d63
    • Phileas Fogg's avatar
      powerpc/kexec: Disable hard IRQ before kexec · 8520e443
      Phileas Fogg authored
      Disable hard IRQ before kexec a new kernel image.
      Not doing it can result in corrupted data in the memory segments
      reserved for the new kernel.
      Signed-off-by: default avatarPhileas Fogg <phileas-fogg@mail.ru>
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8520e443
  4. 22 Feb, 2013 2 commits
    • Konrad Rzeszutek Wilk's avatar
      x86-64, xen, mmu: Provide an early version of write_cr3. · 0cc9129d
      Konrad Rzeszutek Wilk authored
      With commit 8170e6be ("x86, 64bit: Use a #PF handler to materialize
      early mappings on demand") we started hitting an early bootup crash
      where the Xen hypervisor would inform us that:
      
          (XEN) d7:v0: unhandled page fault (ec=0000)
          (XEN) Pagetable walk from ffffea000005b2d0:
          (XEN)  L4[0x1d4] = 0000000000000000 ffffffffffffffff
          (XEN) domain_crash_sync called from entry.S
          (XEN) Domain 7 (vcpu#0) crashed on cpu#3:
          (XEN) ----[ Xen-4.2.0  x86_64  debug=n  Not tainted ]----
      
      .. that Xen was unable to context switch back to dom0.
      
      Looking at the calling stack we find:
      
          [<ffffffff8103feba>] xen_get_user_pgd+0x5a  <--
          [<ffffffff8103feba>] xen_get_user_pgd+0x5a
          [<ffffffff81042d27>] xen_write_cr3+0x77
          [<ffffffff81ad2d21>] init_mem_mapping+0x1f9
          [<ffffffff81ac293f>] setup_arch+0x742
          [<ffffffff81666d71>] printk+0x48
      
      We are trying to figure out whether we need to up-date the user PGD as
      well.  Please keep in mind that under 64-bit PV guests we have a limited
      amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
      and user-space.  As such the Linux pvops'fied version of write_cr3
      checks if it has to update the user-space cr3 as well.
      
      That clearly is not needed during early bootup.  The recent changes (see
      above git commit) streamline the x86 page table allocation to be much
      simpler (And also incidentally the #PF handler ends up in spirit being
      similar to how the Xen toolstack sets up the initial page-tables).
      
      The fix is to have an early-bootup version of cr3 that just loads the
      kernel %cr3.  The later version - which also handles user-page
      modifications will be used after the initial page tables have been
      setup.
      
      [ hpa: removed a redundant #ifdef and made the new function __init.
        Also note that x86-32 already has such an early xen_write_cr3. ]
      Tested-by: default avatar"H. Peter Anvin" <hpa@zytor.com>
      Reported-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.comSigned-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0cc9129d
    • Linus Torvalds's avatar
      x86-64: don't set the early IDT to point directly to 'early_idt_handler' · ac630dd9
      Linus Torvalds authored
      The code requires the use of the proper per-exception-vector stub
      functions (set up as the early_idt_handlers[] array - note the 's') that
      make sure to set up the error vector number.  This is true regardless of
      whether CONFIG_EARLY_PRINTK is set or not.
      
      Why? The stack offset for the comparison of __KERNEL_CS won't be right
      otherwise, nor will the new check (from commit 8170e6be: "x86,
      64bit: Use a #PF handler to materialize early mappings on demand") for
      the page fault exception vector.
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac630dd9
  5. 21 Feb, 2013 8 commits
  6. 20 Feb, 2013 12 commits