- Jul 24, 2011
-
-
Sasha Levin authored
Current documentation referred to the old method of handling augmented trees. Update documentation to correspond with the changes done in commit b945d6b2 ("rbtree: Undo augmented trees performance damage and regression"). Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by:
Ingo Molnar <mingo@elte.hu> Acked-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Sasha Levin <levinsasha928@gmail.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Xiao Guangrong authored
The idea is from Avi: | Maybe it's time to kill off bypass_guest_pf=1. It's not as effective as | it used to be, since unsync pages always use shadow_trap_nonpresent_pte, | and since we convert between the two nonpresent_ptes during sync and unsync. Signed-off-by:
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by:
Avi Kivity <avi@redhat.com>
-
Glauber Costa authored
This patch implements the kvm bits of the steal time infrastructure. The most important part of it, is the steal time clock. It is an continuous clock that shows the accumulated amount of steal time since vcpu creation. It is supposed to survive cpu offlining/onlining. [marcelo: fix build with CONFIG_KVM_GUEST=n] Signed-off-by:
Glauber Costa <glommer@redhat.com> Acked-by:
Rik van Riel <riel@redhat.com> Tested-by:
Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Avi Kivity <avi@redhat.com> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by:
Avi Kivity <avi@redhat.com> Signed-off-by:
Marcelo Tosatti <mtosatti@redhat.com>
-
Martin Schwidefsky authored
After git commit 66ceed5a removed the tape block device driver, remove its documentation as well. Signed-off-by:
Martin Schwidefsky <schwidefsky@de.ibm.com>
-
- Jul 23, 2011
-
-
Borislav Petkov authored
Refresh sysctl/kernel.txt. More specifically, - drop stale index entries - sync and sort index and entries - reflow sticking out paragraphs to colwidth 72 - correct typos - cleanup whitespace Signed-off-by:
Borislav Petkov <bp@alien8.de> Signed-off-by:
Randy Dunlap <rdunlap@xenotime.net> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Wanlong Gao authored
Only the root cpuset has cpuset.memory_pressure_enabled flag, but not the only one. Signed-off-by:
Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by:
Randy Dunlap <rdunlap@xenotime.net> Acked-by:
Paul Menage <menage@google.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Wanlong Gao authored
Must echo a task id to the cgroups' tasks file, but not to a directory. Signed-off-by:
Wanlong Gao <gaowanlong@cn.fujitsu.com> Acked-by:
Paul Menage <menage@google.com> Signed-off-by:
Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jul 22, 2011
-
-
Mike Frysinger authored
Signed-off-by:
Mike Frysinger <vapier@gentoo.org>
-
- Jul 21, 2011
-
-
Rusty Russell authored
Also removes a long-unused #define and an extraneous semicolon. Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au>
-
Rusty Russell authored
We used to notify the Host every time we updated a device's status. However, it only really needs to know when we're resetting the device, or failed to initialize it, or when we've finished our feature negotiation. In particular, we used to wait for VIRTIO_CONFIG_S_DRIVER_OK in the status byte before starting the device service threads. But this corresponds to the successful finish of device initialization, which might (like virtio_blk's partition scanning) use the device. So we had a hack, if they used the device before we expected we started the threads anyway. Now we hook into the finalize_features hook in the Guest: at that point we tell the Launcher that it can rely on the features we have acked. On the Launcher side, we look at the status at that point, and start servicing the device. Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au>
-
Sakari Ailus authored
Do not exit on some non-fatal errors: - writev() fails in net_output(). The result is a lost packet or packets. - writev() fails in console_output(). The result is partially lost console output. - readv() fails in net_input(). The result is a lost packet or packets. Rather than bringing the guest down, this patch ignores e.g. an allocation failure on the host side. Example: lguest: page allocation failure. order:4, mode:0x4d0 Pid: 4045, comm: lguest Tainted: G W 2.6.36 #1 Call Trace: [<c138d614>] ? printk+0x18/0x1c [<c106a4e2>] __alloc_pages_nodemask+0x4d2/0x570 [<c1087954>] cache_alloc_refill+0x2a4/0x4d0 [<c1305149>] ? __netif_receive_skb+0x189/0x270 [<c1087c5a>] __kmalloc+0xda/0xf0 [<c12fffa5>] __alloc_skb+0x55/0x100 [<c1305519>] ? net_rx_action+0x79/0x100 [<c12fafed>] sock_alloc_send_pskb+0x18d/0x280 [<c11fda25>] ? _copy_from_user+0x35/0x130 [<c13010b6>] ? memcpy_fromiovecend+0x56/0x80 [<c12a74dc>] tun_chr_aio_write+0x1cc/0x500 [<c108a125>] do_sync_readv_writev+0x95/0xd0 [<c11fda25>] ? _copy_from_user+0x35/0x130 [<c1089fa8>] ? rw_copy_check_uvector+0x58/0x100 [<c108a7bc>] do_readv_writev+0x9c/0x1d0 [<c12a7310>] ? tun_chr_aio_write+0x0/0x500 [<c108a93a>] vfs_writev+0x4a/0x60 [<c108aa21>] sys_writev+0x41/0x80 [<c138f061>] syscall_call+0x7/0xb Mem-Info: DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 Normal per-cpu: CPU 0: hi: 186, btch: 31 usd: 0 HighMem per-cpu: CPU 0: hi: 186, btch: 31 usd: 0 active_anon:134651 inactive_anon:50543 isolated_anon:0 active_file:96881 inactive_file:132007 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:91374 slab_reclaimable:6300 slab_unreclaimable:2802 mapped:2281 shmem:9 pagetables:330 bounce:0 DMA free:3524kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:8kB active_file:8760kB inactive_file:2760kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15868kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:0kB slab_reclaimable:88kB slab_unreclaimable:148kB kernel_stack:40kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 865 2016 2016 Normal free:150100kB min:3728kB low:4660kB high:5592kB active_anon:6224kB inactive_anon:15772kB active_file:324084kB inactive_file:325944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:885944kB mlocked:0kB dirty:12kB writeback:0kB mapped:1520kB shmem:0kB slab_reclaimable:25112kB slab_unreclaimable:11060kB kernel_stack:1888kB pagetables:1320kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 9207 9207 HighMem free:211872kB min:512kB low:1752kB high:2992kB active_anon:532380kB inactive_anon:186392kB active_file:54680kB inactive_file:199324kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1178504kB mlocked:0kB dirty:0kB writeback:0kB mapped:7588kB shmem:36kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 DMA: 3*4kB 65*8kB 35*16kB 18*32kB 11*64kB 9*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3524kB Normal: 35981*4kB 344*8kB 158*16kB 28*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 150100kB HighMem: 5732*4kB 5462*8kB 2826*16kB 1598*32kB 84*64kB 10*128kB 7*256kB 1*512kB 1*1024kB 1*2048kB 9*4096kB = 211872kB 231237 total pagecache pages 2340 pages in swap cache Swap cache stats: add 160060, delete 157720, find 189017/194106 Free swap = 4179840kB Total swap = 4194300kB 524271 pages RAM 296946 pages HighMem 5668 pages reserved 867664 pages shared 82155 pages non-shared Signed-off-by:
Sakari Ailus <sakari.ailus@iki.fi> Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au>
-
Giuseppe CAVALLARO authored
This patch adds new information for the driver especially about its platform structure fields. Signed-off-by:
Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Per Forlin authored
Documentation about the background and the design of mmc non-blocking. Host driver guidelines to minimize request preparation overhead. Signed-off-by:
Per Forlin <per.forlin@linaro.org> Acked-by:
Randy Dunlap <rdunlap@xenotime.net> Signed-off-by:
Chris Ball <cjb@laptop.org>
-
- Jul 20, 2011
-
-
Josef Bacik authored
Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Josef Bacik authored
This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out using fiemap in things like cp cause more problems than it solves, so lets try and give userspace an interface that doesn't suck. We need to match solaris here, and the definitions are *o* If /whence/ is SEEK_HOLE, the offset of the start of the next hole greater than or equal to the supplied offset is returned. The definition of a hole is provided near the end of the DESCRIPTION. *o* If /whence/ is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset. So in the generic case the entire file is data and there is a virtual hole at the end. That means we will just return i_size for SEEK_HOLE and will return the same offset for SEEK_DATA. This is how Solaris does it so we have to do it the same way. Thanks, Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Dave Chinner authored
Now that the per-sb shrinker is responsible for shrinking 2 or more caches, increase the batch size to keep econmies of scale for shrinking each cache. Increase the shrinker batch size to 1024 objects. To allow for a large increase in batch size, add a conditional reschedule to prune_icache_sb() so that we don't hold the LRU spin lock for too long. This mirrors the behaviour of the __shrink_dcache_sb(), and allows us to increase the batch size without needing to worry about problems caused by long lock hold times. To ensure that filesystems using the per-sb shrinker callouts don't cause problems, document that the object freeing method must reschedule appropriately inside loops. Signed-off-by:
Dave Chinner <dchinner@redhat.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Dave Chinner authored
Now we have a per-superblock shrinker implementation, we can add a filesystem specific callout to it to allow filesystem internal caches to be shrunk by the superblock shrinker. Rather than perpetuate the multipurpose shrinker callback API (i.e. nr_to_scan == 0 meaning "tell me how many objects freeable in the cache), two operations will be added. The first will return the number of objects that are freeable, the second is the actual shrinker call. Signed-off-by:
Dave Chinner <dchinner@redhat.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- Jul 19, 2011
-
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
not used by the instances anymore. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
not used in the instances anymore. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Takashi Iwai authored
Signed-off-by:
Takashi Iwai <tiwai@suse.de>
-
- Jul 15, 2011
-
-
Stefan Richter authored
of firewire-core and firewire-sbp2. Signed-off-by:
Stefan Richter <stefanr@s5r6.in-berlin.de>
-
Stefan Richter authored
Add overview documentation in Documentation/ABI/stable/firewire-cdev. Improve the inline reference documentation in firewire-cdev.h: - Add /* available since kernel... */ comments to event numbers consistent with the comments on ioctl numbers. - Shorten some documentation on an event and an ioctl that are less interesting to current programming because there are newer preferable variants. - Spell Configuration ROM (name of an IEEE 1212 register) in upper case. - Move the dummy FW_CDEV_VERSION out of the reader's field of vision. We should remove it from the header next year or so. Signed-off-by:
Stefan Richter <stefanr@s5r6.in-berlin.de>
-
Nishanth Menon authored
cpufreq table allocated by opp_init_cpufreq_table is better freed by OPP layer itself. This allows future modifications to the table handling to be transparent to the users. Signed-off-by:
Nishanth Menon <nm@ti.com> Acked-by:
Kevin Hilman <khilman@ti.com> Signed-off-by:
Rafael J. Wysocki <rjw@sisk.pl>
-
Masami Hiramatsu authored
To support probing module init functions, kprobe-tracer allows user to define a probe on non-existed function when it is given with a module name. This also enables user to set a probe on a function on a specific module, even if a same name (but different) function is locally defined in another module. The module name must be in the front of function name and separated by a ':'. e.g. btrfs:btrfs_init_sysfs Signed-off-by:
Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: http://lkml.kernel.org/r/20110627072656.6528.89970.stgit@fedora15 Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
-
Olof Johansson authored
Allocate one bit in the available extra cell to indicate if the gpio should be considered logically inverted. Signed-off-by:
Olof Johansson <olof@lixom.net> Acked-by:
Stephen Warren <swarren@nvidia.com> Signed-off-by:
Grant Likely <grant.likely@secretlab.ca>
-
- Jul 14, 2011
-
-
Andy Lutomirski authored
It turns out that parsing the vDSO is nontrivial if you don't already have an ELF dynamic loader around. So document it in Documentation/ABI and add a reference CC0-licenced parser. This code is dedicated to Go issue 1933: http://code.google.com/p/go/issues/detail?id=1933 Signed-off-by:
Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/a315a9514cd71bcf29436cc31e35aada21a5ff21.1310563276.git.luto@mit.edu Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Shawn Guo authored
It adds device tree probe support for spi-imx driver. Signed-off-by:
Shawn Guo <shawn.guo@linaro.org> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by:
Grant Likely <grant.likely@secretlab.ca>
-
- Jul 13, 2011
-
-
Ryusuke Konishi authored
Resize feature was supported by the commit 4e33f9ea but it was not reflected to the list of unsupported features in nilfs2.txt file. This updates the list to fix discrepancy. Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
-
- Jul 12, 2011
-
-
Michał Mirosław authored
v2: incorporated suggestions from Randy Dunlap Signed-off-by:
Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Darren Hart authored
The real-mode kernel header init_size field is located at 0x260 per the field listing in th e"REAL-MODE KERNEL HEADER" section. It is listed as 0x25c in the "DETAILS OF HEADER FIELDS" section, which overlaps with pref_address. Correct the details listing to 0x260. Signed-off-by:
Darren Hart <dvhart@linux.intel.com> Link: http://lkml.kernel.org/r/541cf88e2dfe5b8186d8b96b136d892e769a68c1.1310441260.git.dvhart@linux.intel.com CC: H. Peter Anvin <hpa@zytor.com> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Glauber Costa authored
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time This patch contains the headers for it. I am keeping it separate to facilitate backports to people who wants to backport the kernel part but not the hypervisor, or the other way around. Signed-off-by:
Glauber Costa <glommer@redhat.com> Acked-by:
Rik van Riel <riel@redhat.com> Tested-by:
Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by:
Avi Kivity <avi@redhat.com>
-
Paul Mackerras authored
This adds infrastructure which will be needed to allow book3s_hv KVM to run on older POWER processors, including PPC970, which don't support the Virtual Real Mode Area (VRMA) facility, but only the Real Mode Offset (RMO) facility. These processors require a physically contiguous, aligned area of memory for each guest. When the guest does an access in real mode (MMU off), the address is compared against a limit value, and if it is lower, the address is ORed with an offset value (from the Real Mode Offset Register (RMOR)) and the result becomes the real address for the access. The size of the RMA has to be one of a set of supported values, which usually includes 64MB, 128MB, 256MB and some larger powers of 2. Since we are unlikely to be able to allocate 64MB or more of physically contiguous memory after the kernel has been running for a while, we allocate a pool of RMAs at boot time using the bootmem allocator. The size and number of the RMAs can be set using the kvm_rma_size=xx and kvm_rma_count=xx kernel command line options. KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability of the pool of preallocated RMAs. The capability value is 1 if the processor can use an RMA but doesn't require one (because it supports the VRMA facility), or 2 if the processor requires an RMA for each guest. This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the pool and returns a file descriptor which can be used to map the RMA. It also returns the size of the RMA in the argument structure. Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION ioctl calls from userspace. To cope with this, we now preallocate the kvm->arch.ram_pginfo array when the VM is created with a size sufficient for up to 64GB of guest memory. Subsequently we will get rid of this array and use memory associated with each memslot instead. This moves most of the code that translates the user addresses into host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level to kvmppc_core_prepare_memory_region. Also, instead of having to look up the VMA for each page in order to check the page size, we now check that the pages we get are compound pages of 16MB. However, if we are adding memory that is mapped to an RMA, we don't bother with calling get_user_pages_fast and instead just offset from the base pfn for the RMA. Typically the RMA gets added after vcpus are created, which makes it inconvenient to have the LPCR (logical partition control register) value in the vcpu->arch struct, since the LPCR controls whether the processor uses RMA or VRMA for the guest. This moves the LPCR value into the kvm->arch struct and arranges for the MER (mediated external request) bit, which is the only bit that varies between vcpus, to be set in assembly code when going into the guest if there is a pending external interrupt request. Signed-off-by:
Paul Mackerras <paulus@samba.org> Signed-off-by:
Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by:
Paul Mackerras <paulus@samba.org> Signed-off-by:
Alexander Graf <agraf@suse.de>
-
David Gibson authored
This improves I/O performance for guests using the PAPR paravirtualization interface by making the H_PUT_TCE hcall faster, by implementing it in real mode. H_PUT_TCE is used for updating virtual IOMMU tables, and is used both for virtual I/O and for real I/O in the PAPR interface. Since this moves the IOMMU tables into the kernel, we define a new KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The ioctl returns a file descriptor which can be used to mmap the newly created table. The qemu driver models use them in the same way as userspace managed tables, but they can be updated directly by the guest with a real-mode H_PUT_TCE implementation, reducing the number of host/guest context switches during guest IO. There are certain circumstances where it is useful for userland qemu to write to the TCE table even if the kernel H_PUT_TCE path is used most of the time. Specifically, allowing this will avoid awkwardness when we need to reset the table. More importantly, we will in the future need to write the table in order to restore its state after a checkpoint resume or migration. Signed-off-by:
David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
Paul Mackerras <paulus@samba.org> Signed-off-by:
Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by:
Paul Mackerras <paulus@samba.org> Signed-off-by:
Alexander Graf <agraf@suse.de>
-
Scott Wood authored
This is a shared page used for paravirtualization. It is always present in the guest kernel's effective address space at the address indicated by the hypercall that enables it. The physical address specified by the hypercall is not used, as e500 does not have real mode. Signed-off-by:
Scott Wood <scottwood@freescale.com> Signed-off-by:
Alexander Graf <agraf@suse.de>
-
Avi Kivity authored
When CR0.WP=0, we sometimes map user pages as kernel pages (to allow the kernel to write to them). Unfortunately this also allows the kernel to fetch from these pages, even if CR4.SMEP is set. Adjust for this by also setting NX on the spte in these circumstances. Signed-off-by:
Avi Kivity <avi@redhat.com>
-
Jan Kiszka authored
The documented behavior did not match the implemented one (which also never changed). Signed-off-by:
Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by:
Avi Kivity <avi@redhat.com>
-
Jan Kiszka authored
Neither host_irq nor the guest_msi struct are used anymore today. Tag the former, drop the latter to avoid confusion. Signed-off-by:
Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by:
Avi Kivity <avi@redhat.com>
-