1. 30 Nov, 2012 1 commit
  2. 28 Nov, 2012 1 commit
  3. 26 Nov, 2012 1 commit
  4. 02 Nov, 2012 1 commit
    • Olaf Hering's avatar
      xen PVonHVM: use E820_Reserved area for shared_info · 9d02b43d
      Olaf Hering authored
      This is a respin of 00e37bdb
      ("xen PVonHVM: move shared_info to MMIO before kexec").
      Currently kexec in a PVonHVM guest fails with a triple fault because the
      new kernel overwrites the shared info page. The exact failure depends on
      the size of the kernel image. This patch moves the pfn from RAM into an
      E820 reserved memory area.
      The pfn containing the shared_info is located somewhere in RAM. This will
      cause trouble if the current kernel is doing a kexec boot into a new
      kernel. The new kernel (and its startup code) can not know where the pfn
      is, so it can not reserve the page. The hypervisor will continue to update
      the pfn, and as a result memory corruption occours in the new kernel.
      The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
      E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
      starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
      will usually not allocate areas up to FE700000, so FE700000 is expected
      to work also with older toolstacks.
      In Xen3 there is no reserved area at a fixed location. If the guest is
      started on such old hosts the shared_info page will be placed in RAM. As
      a result kexec can not be used.
      Signed-off-by: default avatarOlaf Hering <olaf@aepfle.de>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  5. 19 Oct, 2012 1 commit
  6. 12 Oct, 2012 2 commits
    • Konrad Rzeszutek Wilk's avatar
      xen/bootup: allow {read|write}_cr8 pvops call. · 1a7bbda5
      Konrad Rzeszutek Wilk authored
      We actually do not do anything about it. Just return a default
      value of zero and if the kernel tries to write anything but 0
      we BUG_ON.
      This fixes the case when an user tries to suspend the machine
      and it blows up in save_processor_state b/c 'read_cr8' is set
      to NULL and we get:
      kernel BUG at /home/konrad/ssd/linux/arch/x86/include/asm/paravirt.h:100!
      invalid opcode: 0000 [#1] SMP
      Pid: 2687, comm: init.late Tainted: G           O 3.6.0upstream-00002-gac264ac-dirty #4 Bochs Bochs
      RIP: e030:[<ffffffff814d5f42>]  [<ffffffff814d5f42>] save_processor_state+0x212/0x270
      .. snip..
      Call Trace:
       [<ffffffff810733bf>] do_suspend_lowlevel+0xf/0xac
       [<ffffffff8107330c>] ? x86_acpi_suspend_lowlevel+0x10c/0x150
       [<ffffffff81342ee2>] acpi_suspend_enter+0x57/0xd5
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • Konrad Rzeszutek Wilk's avatar
      xen/bootup: allow read_tscp call for Xen PV guests. · cd0608e7
      Konrad Rzeszutek Wilk authored
      The hypervisor will trap it. However without this patch,
      we would crash as the .read_tscp is set to NULL. This patch
      fixes it and sets it to the native_read_tscp call.
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  7. 24 Sep, 2012 1 commit
  8. 19 Sep, 2012 1 commit
    • Konrad Rzeszutek Wilk's avatar
      xen/boot: Disable BIOS SMP MP table search. · bd49940a
      Konrad Rzeszutek Wilk authored
      As the initial domain we are able to search/map certain regions
      of memory to harvest configuration data. For all low-level we
      use ACPI tables - for interrupts we use exclusively ACPI _PRT
      (so DSDT) and MADT for INT_SRC_OVR.
      The SMP MP table is not used at all. As a matter of fact we do
      not even support machines that only have SMP MP but no ACPI tables.
      Lets follow how Moorestown does it and just disable searching
      for BIOS SMP tables.
      This also fixes an issue on HP Proliant BL680c G5 and DL380 G6:
      9f->100 for 1:1 PTE
      Freeing 9f-100 pfn range: 97 pages freed
      1-1 mapping on 9f->100
      .. snip..
      e820: BIOS-provided physical RAM map:
      Xen: [mem 0x0000000000000000-0x000000000009efff] usable
      Xen: [mem 0x000000000009f400-0x00000000000fffff] reserved
      Xen: [mem 0x0000000000100000-0x00000000cfd1dfff] usable
      .. snip..
      Scan for SMP in [mem 0x00000000-0x000003ff]
      Scan for SMP in [mem 0x0009fc00-0x0009ffff]
      Scan for SMP in [mem 0x000f0000-0x000fffff]
      found SMP MP-table at [mem 0x000f4fa0-0x000f4faf] mapped at [ffff8800000f4fa0]
      (XEN) mm.c:908:d0 Error getting mfn 100 (pfn 5555555555555555) from L1 entry 0000000000100461 for l1e_owner=0, pg_owner=0
      (XEN) mm.c:4995:d0 ptwr_emulate: could not get_page_from_l1e()
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff81ac07e2>] xen_set_pte_init+0x66/0x71
      . snip..
      Pid: 0, comm: swapper Not tainted 3.6.0-rc6upstream-00188-gb6fb969-dirty #2 HP ProLiant BL680c G5
      .. snip..
      Call Trace:
       [<ffffffff81ad31c6>] __early_ioremap+0x18a/0x248
       [<ffffffff81624731>] ? printk+0x48/0x4a
       [<ffffffff81ad32ac>] early_ioremap+0x13/0x15
       [<ffffffff81acc140>] get_mpc_size+0x2f/0x67
       [<ffffffff81acc284>] smp_scan_config+0x10c/0x136
       [<ffffffff81acc2e4>] default_find_smp_config+0x36/0x5a
       [<ffffffff81ac3085>] setup_arch+0x5b3/0xb5b
       [<ffffffff81624731>] ? printk+0x48/0x4a
       [<ffffffff81abca7f>] start_kernel+0x90/0x390
       [<ffffffff81abc356>] x86_64_start_reservations+0x131/0x136
       [<ffffffff81abfa83>] xen_start_kernel+0x65f/0x661
      (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
      which is that ioremap would end up mapping 0xff using _PAGE_IOMAP
      (which is what early_ioremap sticks as a flag) - which meant
      we would get MFN 0xFF (pte ff461, which is OK), and then it would
      also map 0x100 (b/c ioremap tries to get page aligned request, and
      it was trying to map 0xf4fa0 + PAGE_SIZE - so it mapped the next page)
      as _PAGE_IOMAP. Since 0x100 is actually a RAM page, and the _PAGE_IOMAP
      bypasses the P2M lookup we would happily set the PTE to 1000461.
      Xen would deny the request since we do not have access to the
      Machine Frame Number (MFN) of 0x100. The P2M[0x100] is for example
      CC: stable@vger.kernel.org
      Fixes-Oracle-Bugzilla: https://bugzilla.oracle.com/bugzilla/show_bug.cgi?id=13665
      Acked-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  9. 23 Aug, 2012 2 commits
    • Konrad Rzeszutek Wilk's avatar
      xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything. · 3699aad0
      Konrad Rzeszutek Wilk authored
      We don't need to return the new PGD - as we do not use it.
      Acked-by: default avatarStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • Konrad Rzeszutek Wilk's avatar
      Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and... · 51faaf2b
      Konrad Rzeszutek Wilk authored
      Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
      This reverts commit 806c312e and
      commit 59b29440
      And also documents setup.c and why we want to do it that way, which
      is that we tried to make the the memblock_reserve more selective so
      that it would be clear what region is reserved. Sadly we ran
      in the problem wherein on a 64-bit hypervisor with a 32-bit
      initial domain, the pt_base has the cr3 value which is not
      neccessarily where the pagetable starts! As Jan put it: "
      Actually, the adjustment turns out to be correct: The page
      tables for a 32-on-64 dom0 get allocated in the order "first L1",
      "first L2", "first L3", so the offset to the page table base is
      indeed 2. When reading xen/include/public/xen.h's comment
      very strictly, this is not a violation (since there nothing is said
      that the first thing in the page table space is pointed to by
      pt_base; I admit that this seems to be implied though, namely
      do I think that it is implied that the page table space is the
      range [pt_base, pt_base + nt_pt_frames), whereas that
      range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
      which - without a priori knowledge - the kernel would have
      difficulty to figure out)." - so lets just fall back to the
      easy way and reserve the whole region.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  10. 21 Aug, 2012 3 commits
    • Konrad Rzeszutek Wilk's avatar
      xen/apic/xenbus/swiotlb/pcifront/grant/tmem: Make functions or variables static. · b8b0f559
      Konrad Rzeszutek Wilk authored
      There is no need for those functions/variables to be visible. Make them
      static and also fix the compile warnings of this sort:
      drivers/xen/<some file>.c: warning: symbol '<blah>' was not declared. Should it be static?
      Some of them just require including the header file that
      declares the functions.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • Konrad Rzeszutek Wilk's avatar
      xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain. · 806c312e
      Konrad Rzeszutek Wilk authored
      If a 64-bit hypervisor is booted with a 32-bit initial domain,
      the hypervisor deals with the initial domain as "compat" and
      does some extra adjustments (like pagetables are 4 bytes instead
      of 8). It also adjusts the xen_start_info->pt_base incorrectly.
      When booted with a 32-bit hypervisor (32-bit initial domain):
      (XEN)  Start info:    cf831000->cf83147c
      (XEN)  Page tables:   cf832000->cf8b5000
      [    0.000000] PT: cf832000 (f832000)
      [    0.000000] Reserving PT: f832000->f8b5000
      And with a 64-bit hypervisor:
      (XEN)  Start info:    00000000cf831000->00000000cf8314b4
      (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
      [    0.000000] PT: cf834000 (f834000)
      [    0.000000] Reserving PT: f834000->f8b8000
      To deal with this, we keep keep track of the highest physical
      address we have reserved via memblock_reserve. If that address
      does not overlap with pt_base, we have a gap which we reserve.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • Konrad Rzeszutek Wilk's avatar
      xen/x86: Use memblock_reserve for sensitive areas. · 59b29440
      Konrad Rzeszutek Wilk authored
      instead of a big memblock_reserve. This way we can be more
      selective in freeing regions (and it also makes it easier
      to understand where is what).
      [v1: Move the auto_translate_physmap to proper line]
      [v2: Per Stefano suggestion add more comments]
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  11. 16 Aug, 2012 1 commit
  12. 14 Sep, 2012 1 commit
  13. 19 Jul, 2012 7 commits
  14. 07 Jun, 2012 1 commit
  15. 31 May, 2012 1 commit
    • Andre Przywara's avatar
      xen/setup: filter APERFMPERF cpuid feature out · 5e626254
      Andre Przywara authored
      Xen PV kernels allow access to the APERF/MPERF registers to read the
      effective frequency. Access to the MSRs is however redirected to the
      currently scheduled physical CPU, making consecutive read and
      compares unreliable. In addition each rdmsr traps into the hypervisor.
      So to avoid bogus readouts and expensive traps, disable the kernel
      internal feature flag for APERF/MPERF if running under Xen.
      This will
      a) remove the aperfmperf flag from /proc/cpuinfo
      b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to
         use the feature to improve scheduling (by default disabled)
      c) not mislead the cpufreq driver to use the MSRs
      This does not cover userland programs which access the MSRs via the
      device file interface, but this will be addressed separately.
      Signed-off-by: default avatarAndre Przywara <andre.przywara@amd.com>
      Cc: stable@vger.kernel.org # v3.0+
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  16. 30 May, 2012 1 commit
  17. 07 May, 2012 5 commits
  18. 01 May, 2012 1 commit
  19. 26 Apr, 2012 1 commit
    • Konrad Rzeszutek Wilk's avatar
      xen/enlighten: Disable MWAIT_LEAF so that acpi-pad won't be loaded. · df88b2d9
      Konrad Rzeszutek Wilk authored
      There are exactly four users of __monitor and __mwait:
       - cstate.c (which allows acpi_processor_ffh_cstate_enter to be called
         when the cpuidle API drivers are used. However patch
         "cpuidle: replace xen access to x86 pm_idle and default_idle"
         provides a mechanism to disable the cpuidle and use safe_halt.
       - smpboot (which allows mwait_play_dead to be called). However
         safe_halt is always used so we skip that.
       - intel_idle (same deal as above).
       - acpi_pad.c. This the one that we do not want to run as we
         will hit the below crash.
      Why do we want to expose MWAIT_LEAF in the first place?
      We want it for the xen-acpi-processor driver - which uploads
      C-states to the hypervisor. If MWAIT_LEAF is set, the cstate.c
      sets the proper address in the C-states so that the hypervisor
      can benefit from using the MWAIT functionality. And that is
      the sole reason for using it.
      Without this patch, if a module performs mwait or monitor we
      get this:
      invalid opcode: 0000 [#1] SMP
      CPU 2
      .. snip..
      Pid: 5036, comm: insmod Tainted: G           O 3.4.0-rc2upstream-dirty #2 Intel Corporation S2600CP/S2600CP
      RIP: e030:[<ffffffffa000a017>]  [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
      RSP: e02b:ffff8801c298bf18  EFLAGS: 00010282
      RAX: ffff8801c298a010 RBX: ffffffffa03b2000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff8801c29800d8 RDI: ffff8801ff097200
      RBP: ffff8801c298bf18 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
      R13: ffffffffa000a000 R14: 0000005148db7294 R15: 0000000000000003
      FS:  00007fbb364f2700(0000) GS:ffff8801ff08c000(0000) knlGS:0000000000000000
      CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 000000000179f038 CR3: 00000001c9469000 CR4: 0000000000002660
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process insmod (pid: 5036, threadinfo ffff8801c298a000, task ffff8801c29cd7e0)
       ffff8801c298bf48 ffffffff81002124 ffffffffa03b2000 00000000000081fd
       000000000178f010 000000000178f030 ffff8801c298bf78 ffffffff810c41e6
       00007fff3fb30db9 00007fff3fb30db9 00000000000081fd 0000000000010000
      Call Trace:
       [<ffffffff81002124>] do_one_initcall+0x124/0x170
       [<ffffffff810c41e6>] sys_init_module+0xc6/0x220
       [<ffffffff815b15b9>] system_call_fastpath+0x16/0x1b
      Code: <0f> 01 c8 31 c0 0f 01 c9 c9 c3 00 00 00 00 00 00 00 00 00 00 00 00
      RIP  [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
       RSP <ffff8801c298bf18>
      ---[ end trace 16582fc8a3d1e29a ]---
      Kernel panic - not syncing: Fatal exception
      With this module (which is what acpi_pad.c would hit):
      MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
      static int __init mwait_check_init(void)
      	__monitor((void *)&current_thread_info()->flags, 0, 0);
      	__mwait(0, 0);
      	return 0;
      static void __exit mwait_check_exit(void)
      Reported-by: default avatarLiu, Jinsong <jinsong.liu@intel.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  20. 28 Mar, 2012 1 commit
  21. 10 Mar, 2012 1 commit
    • Konrad Rzeszutek Wilk's avatar
      xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it. · 73c154c6
      Konrad Rzeszutek Wilk authored
      For the hypervisor to take advantage of the MWAIT support it needs
      to extract from the ACPI _CST the register address. But the
      hypervisor does not have the support to parse DSDT so it relies on
      the initial domain (dom0) to parse the ACPI Power Management information
      and push it up to the hypervisor. The pushing of the data is done
      by the processor_harveset_xen module which parses the information that
      the ACPI parser has graciously exposed in 'struct acpi_processor'.
      For the ACPI parser to also expose the Cx states for MWAIT, we need
      to expose the MWAIT capability (leaf 1). Furthermore we also need to
      expose the MWAIT_LEAF capability (leaf 5) for cstate.c to properly
      The hypervisor could expose these flags when it traps the XEN_EMULATE_PREFIX
      operations, but it can't do it since it needs to be backwards compatible.
      Instead we choose to use the native CPUID to figure out if the MWAIT
      capability exists and use the XEN_SET_PDC query hypercall to figure out
      if the hypervisor wants us to expose the MWAIT_LEAF capability or not.
      Note: The XEN_SET_PDC query was implemented in c/s 23783:
      "ACPI: add _PDC input override mechanism".
      With this in place, instead of
       C3 ACPI IOPORT 415
      we get now
      Note: The cpu_idle which would be calling the mwait variants for idling
      never gets set b/c we set the default pm_idle to be the hypercall variant.
      Acked-by: default avatarJan Beulich <JBeulich@suse.com>
      [v2: Fix missing header file include and #ifdef]
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  22. 20 Feb, 2012 2 commits
    • Konrad Rzeszutek Wilk's avatar
      xen/pat: Disable PAT support for now. · 8eaffa67
      Konrad Rzeszutek Wilk authored
      [Pls also look at https://lkml.org/lkml/2012/2/10/228]
      Using of PAT to change pages from WB to WC works quite nicely.
      Changing it back to WB - not so much. The crux of the matter is
      that the code that does this (__page_change_att_set_clr) has only
      limited information so when it tries to the change it gets
      the "raw" unfiltered information instead of the properly filtered one -
      and the "raw" one tell it that PSE bit is on (while infact it
      is not).  As a result when the PTE is set to be WB from WC, we get
      tons of:
      :WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0()
      :Hardware name: HP xw4400 Workstation
      .. snip..
      :Pid: 27, comm: kswapd0 Tainted: G        W    3.2.2-1.fc16.x86_64 #1
      :Call Trace:
      : [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0
      : [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20
      : [<ffffffff81005a17>] xen_make_pte+0x67/0xa0
      : [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e
      : [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00
      : [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0
      : [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190
      : [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0
      : [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0
      : [<ffffffff8100a9b2>] ? check_events+0x12/0x20
      : [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm]
      : [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm]
      : [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm]
      : [<ffffffff8112ba04>] shrink_slab+0x154/0x310
      : [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0
      : [<ffffffff8112f4b8>] kswapd+0x178/0x3d0
      : [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0
      : [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50
      : [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0
      : [<ffffffff8108fb6c>] kthread+0x8c/0xa0
      for every page. The proper fix for this is has been posted
      and is https://lkml.org/lkml/2012/2/10/228
      "x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations."
      along with a detailed description of the problem and solution.
      But since that posting has gone nowhere I am proposing
      this band-aid solution so that at least users don't get
      the page corruption (the pages that are WC don't get changed to WB
      and end up being recycled for filesystem or other things causing
      mysterious crashes).
      The negative impact of this patch is that users of WC flag
      (which are InfiniBand, radeon, nouveau drivers) won't be able
      to set that flag - so they are going to see performance degradation.
      But stability is more important here.
      Fixes RH BZ# 742032, 787403, and 745574
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • Konrad Rzeszutek Wilk's avatar
      xen/setup: Remove redundant filtering of PTE masks. · 416d7214
      Konrad Rzeszutek Wilk authored
      commit 7347b408
       "xen: Allow
      unprivileged Xen domains to create iomap pages" added a redundant
      line in the early bootup code to filter out the PTE. That
      filtering is already done a bit earlier so this extra processing
      is not required.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  23. 24 Jan, 2012 1 commit
  24. 08 Dec, 2011 1 commit
    • Tejun Heo's avatar
      memblock: Kill memblock_init() · fe091c20
      Tejun Heo authored
      memblock_init() initializes arrays for regions and memblock itself;
      however, all these can be done with struct initializers and
      memblock_init() can be removed.  This patch kills memblock_init() and
      initializes memblock with struct initializer.
      The only difference is that the first dummy entries don't have .nid
      set to MAX_NUMNODES initially.  This doesn't cause any behavior
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
  25. 16 Nov, 2011 1 commit