1. 09 Nov, 2015 1 commit
    • vfio/pci: make an array larger · 222e684c
      Dan Carpenter authored
      Smatch complains about a possible out of bounds error:
      
      	drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
      	error: buffer overflow 'pci_cap_length' 20 <= 20
      
      The problem is that pci_cap_length[] was defined as large enough to
      hold "PCI_CAP_ID_AF + 1" elements.  The code in vfio_cap_init() assumes
      it has PCI_CAP_ID_MAX + 1 elements.  Originally, PCI_CAP_ID_AF and
      PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
      commit f80b0ba9 ("PCI: Add Enhanced Allocation register entries")
      so now the array is too small.
      
      Let's fix this by making the array size PCI_CAP_ID_MAX + 1.  And let's
      make a similar change to pci_ext_cap_length[] for consistency.  Also
      both these arrays can be made const.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
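A minimal userspace sketch of the sizing rule described in this commit, assuming hypothetical stand-ins (CAP_ID_AF, CAP_ID_EA, CAP_ID_MAX, cap_length, cap_len_lookup) for the kernel names. It is not the vfio code itself; it only shows why sizing the table from the MAX constant keeps it in step when a new ID is added.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability IDs standing in for the PCI_CAP_ID_* constants. */
enum {
	CAP_ID_AF  = 0x13,       /* previously the highest ID */
	CAP_ID_EA  = 0x14,       /* the newly introduced ID */
	CAP_ID_MAX = CAP_ID_EA   /* always tracks the highest ID */
};

/*
 * Sized from CAP_ID_MAX rather than from a specific ID, so adding a new ID
 * grows the table automatically. Entries without an initializer are zero.
 */
static const unsigned char cap_length[CAP_ID_MAX + 1] = {
	[CAP_ID_AF] = 6,         /* illustrative length only */
};

/* Compile-time check that the table covers every valid ID. */
static_assert(sizeof(cap_length) == CAP_ID_MAX + 1,
	      "cap_length must have CAP_ID_MAX + 1 entries");

/* Return the table entry for cap, or 0 for an ID outside the table. */
unsigned char cap_len_lookup(unsigned int cap)
{
	if (cap > CAP_ID_MAX)
		return 0;
	return cap_length[cap];
}
```

With the table sized from CAP_ID_AF + 1 instead, a lookup of CAP_ID_EA would read one element past the end, which is the pattern Smatch reported.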
  2. 04 Nov, 2015 2 commits
    • vfio: Include No-IOMMU mode · 033291ec
      Alex Williamson authored
      
      
      There is really no way to safely give a user full access to a DMA
      capable device without an IOMMU to protect the host system.  There is
      also no way to provide DMA translation, for use cases such as device
      assignment to virtual machines.  However, there are still those users
      that want userspace drivers even under those conditions.  The UIO
      driver exists for this use case, but does not provide the degree of
      device access and programming that VFIO has.  In an effort to avoid
      code duplication, this introduces a No-IOMMU mode for VFIO.
      
      This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
      the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
      should make it very clear that this mode is not safe.  Additionally,
      CAP_SYS_RAWIO privileges are necessary to work with groups and
      containers using this mode.  Groups making use of this support are
      named /dev/vfio/noiommu-$GROUP and can only make use of the special
      VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
      binding a device without a native IOMMU group to a VFIO bus driver
      will taint the kernel and should therefore not be considered
      supported.  This patch includes no-iommu support for the vfio-pci bus
      driver only.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
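As a small illustration of the group naming described above, the following sketch builds the device node path for a group. The helper name vfio_group_path() is hypothetical and exists only for this example; the /dev/vfio/noiommu-$GROUP form is taken from the commit message, and /dev/vfio/$GROUP is the usual name for a regular group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the path of a VFIO group node.
 * Returns the value of snprintf(), i.e. the length the full path needs.
 */
int vfio_group_path(char *buf, size_t len, int group, bool noiommu)
{
	assert(buf != NULL);
	return snprintf(buf, len, "/dev/vfio/%s%d",
			noiommu ? "noiommu-" : "", group);
}
```

A userspace driver would still need CAP_SYS_RAWIO and the module option named in the commit message before such a node can be used; the sketch covers the naming only.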
    • vfio: Fix bug in vfio_device_get_from_name() · e324fc82
      Joerg Roedel authored
      The vfio_device_get_from_name() function might return a
      non-NULL pointer, when called with a device name that is not
      found in the list. This causes undefined behavior, in my
      case calling an invalid function pointer later on:
      
       kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
       BUG: unable to handle kernel paging request at ffff8800cb3ddc08
      
      [...]
      
       Call Trace:
        [<ffffffffa03bd733>] ? vfio_group_fops_unl_ioctl+0x253/0x410 [vfio]
        [<ffffffff811efc4d>] do_vfs_ioctl+0x2cd/0x4c0
        [<ffffffff811f9657>] ? __fget+0x77/0xb0
        [<ffffffff811efeb9>] SyS_ioctl+0x79/0x90
        [<ffffffff81001bb0>] ? syscall_return_slowpath+0x50/0x130
        [<ffffffff8167f776>] entry_SYSCALL_64_fastpath+0x16/0x75
      
      Fix the issue by returning NULL when there is no device with
      the requested name in the list.
      
      Cc: stable@vger.kernel.org # v4.2+
      Fixes: 4bc94d5d ("vfio: Fix lockdep issue")
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
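The class of bug fixed here can be sketched in userspace, assuming a simplified circular list with a sentinel head. The types and the names node, device, to_device and device_get_from_name below are illustrative, not the kernel's list API. The point is that the loop cursor is not a valid result after a miss, so a separate pointer carries the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Circular, sentinel-headed list link (a simplified model). */
struct node {
	struct node *next;
};

struct device {
	const char *name;
	struct node link;
};

/* Recover the containing device from its embedded link. */
#define to_device(n) \
	((struct device *)((char *)(n) - offsetof(struct device, link)))

/* Return the device called name, or NULL when the list has no such device. */
struct device *device_get_from_name(struct node *head, const char *name)
{
	struct device *found = NULL;
	struct node *n;

	assert(head != NULL && name != NULL);
	for (n = head->next; n != head; n = n->next) {
		struct device *d = to_device(n);

		if (strcmp(d->name, name) == 0) {
			found = d;
			break;
		}
	}
	/*
	 * Returning to_device(n) here would be wrong on a miss: n equals the
	 * sentinel head, so the result would be a non-NULL pointer that does
	 * not refer to any device.
	 */
	return found;
}
```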
  3. 03 Nov, 2015 11 commits
  4. 27 Oct, 2015 3 commits
    • VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ · 1276ece3
      Eric Auger authored
      
      
      The vfio platform driver currently sets IRQ_NOAUTOEN before calling
      request_irq() to properly handle user masking. However, it does not
      clear the flag when de-assigning the IRQ. This causes problems when
      the native driver is loaded again, since it may not explicitly enable
      the IRQ. This problem was observed with the xgbe driver.
      Signed-off-by: Eric Auger <eric.auger@linaro.org>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • vfio/pci: Use kernel VPD access functions · 4e1a6355
      Alex Williamson authored
      The PCI VPD capability operates on a set of window registers in PCI
      config space.  Writing to the address register triggers either a read
      or write, depending on the setting of the PCI_VPD_ADDR_F bit within
      the address register.  The data register provides either the source
      for writes or the target for reads.
      
      This model is susceptible to being broken by concurrent access, for
      which the kernel has adopted a set of access functions to serialize
      these registers.  Additionally, commits like 932c435c ("PCI: Add
      dev_flags bit to access VPD through function 0") and 7aa6ca4d
      ("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
      that VPD registers can be shared between functions on multifunction
      devices creating dependencies between otherwise independent devices.
      
      Fortunately it's quite easy to emulate the VPD registers, simply
      storing copies of the address and data registers in memory and
      triggering a VPD read or write on writes to the address register.
      This allows vfio users to avoid seeing spurious register changes from
      accesses on other devices and enables the use of shared quirks in the
      host kernel.  We can theoretically still race with access through
      sysfs, but the window of opportunity is much smaller.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Mark Rustad <mark.d.rustad@intel.com>
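The emulation idea can be sketched in userspace as follows. This is a simplified model under stated assumptions: the struct vpd_emul, its byte-array backing store and the helper names are hypothetical, and the real driver would call the kernel's serialized VPD accessors instead of memcpy(). The constants mirror the values of PCI_VPD_ADDR_MASK and PCI_VPD_ADDR_F. The flag is toggled on completion, as hardware would do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VPD_ADDR_MASK 0x7fffu  /* mirrors PCI_VPD_ADDR_MASK */
#define VPD_ADDR_F    0x8000u  /* mirrors PCI_VPD_ADDR_F */
#define VPD_SIZE      64u      /* hypothetical backing store size */

struct vpd_emul {
	uint16_t addr;             /* shadow copy of the address register */
	uint32_t data;             /* shadow copy of the data register */
	uint8_t  store[VPD_SIZE];  /* stand-in for the device's VPD storage */
};

/* A write to the data register only updates the shadow copy. */
void vpd_write_data(struct vpd_emul *v, uint32_t val)
{
	assert(v != NULL);
	v->data = val;
}

/*
 * A write to the address register triggers the transfer: F set means
 * "write data to VPD", F clear means "read VPD into data". On success the
 * flag is toggled to signal completion. Returns 0 on success, -1 if the
 * access is out of range (the flag is then left as written).
 */
int vpd_write_addr(struct vpd_emul *v, uint16_t val)
{
	uint16_t off = val & VPD_ADDR_MASK;

	assert(v != NULL);
	v->addr = val;
	if (off + sizeof(v->data) > VPD_SIZE)
		return -1;

	if (val & VPD_ADDR_F)
		memcpy(&v->store[off], &v->data, sizeof(v->data));
	else
		memcpy(&v->data, &v->store[off], sizeof(v->data));

	v->addr = val ^ VPD_ADDR_F;
	return 0;
}
```

Because the user only ever sees the shadow registers, accesses made by other functions of the same device do not show up as spurious register changes.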
    • vfio: Whitelist PCI bridges · 5f096b14
      Alex Williamson authored
      
      
      When determining whether a group is viable, we already allow devices
      bound to pcieport.  Generalize this to include any PCI bridge device.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  5. 01 Oct, 2015 1 commit
  6. 24 Jul, 2015 1 commit
    • vfio: Fix lockdep issue · 4bc94d5d
      Alex Williamson authored
      
      
      When we open a device file descriptor, we currently have the
      following:
      
      vfio_group_get_device_fd()
        mutex_lock(&group->device_lock);
          open()
          ...
          if (ret)
            release()
      
      If we hit that error case, we call the backend driver release path,
      which for vfio-pci looks like this:
      
      vfio_pci_release()
        vfio_pci_disable()
          vfio_pci_try_bus_reset()
            vfio_pci_get_devs()
              vfio_device_get_from_dev()
                vfio_group_get_device()
                  mutex_lock(&group->device_lock);
      
      Whoops, we've stumbled back onto group.device_lock and created a
      deadlock.  There's a low likelihood of ever seeing this play out, but
      obviously it needs to be fixed.  To do that we can use a reference to
      the vfio_device for vfio_group_get_device_fd() rather than holding the
      lock.  There was a loop in this function, theoretically allowing
      multiple devices with the same name, but in practice we don't expect
      such a thing to happen and the code is already aborting from the loop
      with break on any sort of error rather than continuing and only
      parsing the first match anyway, so the loop was effectively unused
      already.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Fixes: 20f30017 ("vfio/pci: Fix racy vfio_device_get_from_dev() call")
      Reported-by: Joerg Roedel <joro@8bytes.org>
      Tested-by: Joerg Roedel <jroedel@suse.de>
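The shape of the fix can be sketched with a single-threaded model, assuming a boolean that stands in for the non-recursive device_lock and a plain integer reference count. All names below are illustrative. The device is pinned with a reference while the lock is held, the lock is dropped, and only then is the backend called, so the release path may take the lock again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Models a non-recursive mutex: taking it twice would deadlock. */
struct group {
	bool device_lock_held;
};

struct device {
	struct group *group;
	int refcount;
};

static void group_lock(struct group *g)
{
	assert(!g->device_lock_held);  /* a second acquisition is the deadlock */
	g->device_lock_held = true;
}

static void group_unlock(struct group *g)
{
	g->device_lock_held = false;
}

/* Backend release path: it ends up taking device_lock again. */
void backend_release(struct device *d)
{
	group_lock(d->group);
	group_unlock(d->group);
}

/* Returns 0 on success, -1 if open failed and the release path ran. */
int get_device_fd(struct device *d, bool open_fails)
{
	assert(d != NULL);
	group_lock(d->group);
	d->refcount++;            /* pin the device while the lock is held */
	group_unlock(d->group);   /* drop the lock before calling the backend */

	if (open_fails) {
		backend_release(d);   /* safe: device_lock is not held here */
		d->refcount--;
		return -1;
	}
	return 0;
}
```

Calling backend_release() between group_lock() and group_unlock() in this model trips the assertion, which corresponds to the deadlock described above.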
  7. 22 Jun, 2015 4 commits
  8. 17 Jun, 2015 1 commit
  9. 10 Jun, 2015 15 commits
    • vfio: powerpc/spapr: Support Dynamic DMA windows · e633bc86
      Alexey Kardashevskiy authored
      
      
      This adds create/remove window ioctls to create and remove DMA windows.
      sPAPR defines a Dynamic DMA windows capability which allows
      para-virtualized guests to create additional DMA windows on a PCI bus.
      Existing Linux kernels use this new window to map the entire guest
      memory and switch to direct DMA operations, saving time on map/unmap
      requests which would otherwise happen in large numbers.
      
      This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
      VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
      Up to 2 windows are supported now by the hardware and by this driver.
      
      This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return
      additional information such as the number of supported windows and the
      maximum number of TCE table levels.
      
      DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
      as we still want to support v2 on platforms which cannot do DDW for
      the sake of TCE acceleration in KVM (coming soon).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
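The create/remove bookkeeping with a two-window limit can be modelled as below. This is a sketch only: the container layout, the function names and the error codes are assumptions made for the example, and the real ioctls take window parameters and return a bus address rather than an index.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WINDOWS 2  /* "Up to 2 windows are supported" */

/* Hypothetical per-container window bookkeeping. */
struct container {
	bool window_used[MAX_WINDOWS];
};

/* Create a window; returns its index, or -ENOSPC when all are in use. */
int window_create(struct container *c)
{
	assert(c != NULL);
	for (int i = 0; i < MAX_WINDOWS; i++) {
		if (!c->window_used[i]) {
			c->window_used[i] = true;
			return i;
		}
	}
	return -ENOSPC;
}

/* Remove a window; returns 0, or -EINVAL for an unknown or unused index. */
int window_remove(struct container *c, int idx)
{
	assert(c != NULL);
	if (idx < 0 || idx >= MAX_WINDOWS || !c->window_used[idx])
		return -EINVAL;
	c->window_used[idx] = false;
	return 0;
}
```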
    • vfio: powerpc/spapr: Register memory and define IOMMU v2 · 2157e7b8
      Alexey Kardashevskiy authored
      
      
      The existing implementation accounts the whole DMA window in
      the locked_vm counter. This is going to be worse with multiple
      containers and huge DMA windows. Also, real-time accounting would require
      additional tracking of accounted pages due to the page size difference -
      IOMMU uses 4K pages and system uses 4K or 64K pages.
      
      Another issue is that actual pages pinning/unpinning happens on every
      DMA map/unmap request. This does not affect the performance much now as
      we spend way too much time now on switching context between
      guest/userspace/host but this will start to matter when we add in-kernel
      DMA map/unmap acceleration.
      
      This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
      New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
      2 new ioctls to register/unregister DMA memory -
      VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
      which receive user space address and size of a memory region which
      needs to be pinned/unpinned and counted in locked_vm.
      New IOMMU splits physical pages pinning and TCE table update
      into 2 different operations. It requires:
      1) guest pages to be registered first
      2) subsequent map/unmap requests to work only with pre-registered memory.
      For the default single window case this means that the entire guest
      (instead of 2GB) needs to be pinned before using VFIO.
      When a huge DMA window is added, no additional pinning will be
      required, otherwise it would be guest RAM + 2GB.
      
      The new memory registration ioctls are not supported by
      VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
      will require memory to be preregistered in order to work.
      
      The accounting is done per the user process.
      
      This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
      can do with v1 or v2 IOMMUs.
      
      In order to support memory pre-registration, we need a way to track
      the use of every registered memory region and only allow unregistration
      if a region is not in use anymore. So we need a way to tell which
      region a just-cleared TCE came from.
      
      This adds a userspace view of the TCE table into iommu_table struct.
      It contains userspace address, one per TCE entry. The table is only
      allocated when the ownership over an IOMMU group is taken which means
      it is only used from outside of the powernv code (such as VFIO).
      
      As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
      DDW API), this creates a default DMA window for IODA2 for consistency.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
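The two-step model (register first, then map only within registered memory, and refuse to unregister a region that is still in use) can be sketched as follows. The structures, function names, fixed-size region array and error codes are assumptions for illustration; the real driver tracks usage per TCE and accounts pages in locked_vm.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 4  /* hypothetical fixed capacity */

struct region {
	uint64_t start, size;
	unsigned int in_use;  /* number of live mappings inside the region */
	int valid;
};

struct container {
	struct region regions[MAX_REGIONS];
};

/* Step 1: register (pin and account) a memory region. */
int mem_register(struct container *c, uint64_t start, uint64_t size)
{
	assert(c != NULL);
	for (int i = 0; i < MAX_REGIONS; i++) {
		struct region *r = &c->regions[i];

		if (!r->valid) {
			r->start = start;
			r->size = size;
			r->in_use = 0;
			r->valid = 1;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Find the registered region that fully contains [addr, addr + len). */
static struct region *region_lookup(struct container *c, uint64_t addr,
				    uint64_t len)
{
	for (int i = 0; i < MAX_REGIONS; i++) {
		struct region *r = &c->regions[i];

		if (r->valid && addr >= r->start && len <= r->size &&
		    addr - r->start <= r->size - len)
			return r;
	}
	return NULL;
}

/* Step 2: map/unmap work only with pre-registered memory. */
int dma_map(struct container *c, uint64_t addr, uint64_t len)
{
	struct region *r = region_lookup(c, addr, len);

	if (!r)
		return -ENXIO;
	r->in_use++;
	return 0;
}

int dma_unmap(struct container *c, uint64_t addr, uint64_t len)
{
	struct region *r = region_lookup(c, addr, len);

	if (!r || r->in_use == 0)
		return -ENXIO;
	r->in_use--;
	return 0;
}

/* Unregistration is allowed only once nothing in the region is mapped. */
int mem_unregister(struct container *c, uint64_t start, uint64_t size)
{
	assert(c != NULL);
	for (int i = 0; i < MAX_REGIONS; i++) {
		struct region *r = &c->regions[i];

		if (r->valid && r->start == start && r->size == size) {
			if (r->in_use)
				return -EBUSY;
			r->valid = 0;
			return 0;
		}
	}
	return -ENOENT;
}
```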
    • vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control · 46d3e1e1
      Alexey Kardashevskiy authored
      
      
      Before the IOMMU user (VFIO) would take control over the IOMMU table
      belonging to a specific IOMMU group. This approach did not allow sharing
      tables between IOMMU groups attached to the same container.
      
      This introduces a new IOMMU ownership flavour where the user can not
      only control the existing IOMMU table but also remove/create tables on demand.
      If an IOMMU implements take/release_ownership() callbacks, this lets
      the user have full control over the IOMMU group. When the ownership
      is taken, the platform code removes all the windows so the caller must
      create them.
      Before returning the ownership back to the platform code, VFIO
      unprograms and removes all the tables it created.
      
      This changes IODA2's ownership handler to remove the existing table
      rather than manipulating the existing one. From now on,
      iommu_take_ownership() and iommu_release_ownership() are only called
      from the vfio_iommu_spapr_tce driver.
      
      Old-style ownership is still supported allowing VFIO to run on older
      P5IOC2 and IODA IO controllers.
      
      No change in userspace-visible behaviour is expected. Since it recreates
      TCE tables on each ownership change, related kernel traces will appear
      more often.
      
      This adds a pnv_pci_ioda2_setup_default_config() which is called
      when PE is being configured at boot time and when the ownership is
      passed from VFIO to the platform code.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API · 4793d65d
      Alexey Kardashevskiy authored
      
      
      This extends iommu_table_group_ops by a set of callbacks to support
      dynamic DMA windows management.
      
      create_table() creates a TCE table with specific parameters.
      It receives iommu_table_group to know the nodeid in order to allocate
      TCE table memory closer to the PHB. The exact format of the allocated
      multi-level table might also be specific to the PHB model (not
      the case now though).
      This callback calculates the DMA window offset on a PCI bus from @num
      and stores it in the just-created table.
      
      set_window() sets the window at specified TVT index + @num on PHB.
      
      unset_window() unsets the window from specified TVT.
      
      This adds a free() callback to iommu_table_ops to free the memory
      (potentially a tree of tables) allocated for the TCE table.
      
      create_table() and free() are supposed to be called once per
      VFIO container and set_window()/unset_window() are supposed to be
      called for every group in a container.
      
      This adds IOMMU capabilities to iommu_table_group such as default
      32bit window parameters and others. This makes use of new values in
      vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
      advertise pagemasks to the userspace.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu/powernv: Release replaced TCE · 05c6cfb9
      Alexey Kardashevskiy authored
      
      
      At the moment writing a new TCE value to the IOMMU table fails with EBUSY
      if there is a valid entry already. However the PAPR specification allows
      the guest to write a new TCE value without clearing it first.
      
      Another problem this patch is addressing is the use of pool locks for
      external IOMMU users such as VFIO. The pool locks are to protect
      DMA page allocator rather than entries and since the host kernel does
      not control what pages are in use, there is no point in pool locks and
      exchange()+put_page(oldtce) is sufficient to avoid possible races.
      
      This adds an exchange() callback to iommu_table_ops which does the same
      thing as set() plus it returns replaced TCE and DMA direction so
      the caller can release the pages afterwards. The exchange() receives
      a physical address unlike set() which receives linear mapping address;
      and returns a physical address as the clear() does.
      
      This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
      for a platform to have exchange() implemented in order to support VFIO.
      
      This replaces iommu_tce_build() and iommu_clear_tce() with
      a single iommu_tce_xchg().
      
      This makes sure that TCE permission bits are not set in TCE passed to
      IOMMU API as those are to be calculated by platform code from
      DMA direction.
      
      This moves SetPageDirty() to the IOMMU code to make it work for both
      the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
      available later).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
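The exchange semantics can be sketched in userspace under simplifying assumptions: a flat array stands in for the TCE table, a value of 0 means "no mapping", the DMA direction is omitted, and the swap is not atomic (the kernel callback would need to be). The names tce_table, tce_xchg and tce_map, and the counter that stands in for put_page(), are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TCE_ENTRIES 8  /* hypothetical table size */

struct tce_table {
	uint64_t entry[TCE_ENTRIES];  /* 0 means "no mapping" in this model */
};

/* Store new_tce at idx and hand the replaced value back via *old_tce. */
int tce_xchg(struct tce_table *t, unsigned int idx, uint64_t new_tce,
	     uint64_t *old_tce)
{
	assert(t != NULL && old_tce != NULL);
	if (idx >= TCE_ENTRIES)
		return -EINVAL;
	*old_tce = t->entry[idx];
	t->entry[idx] = new_tce;
	return 0;
}

/*
 * Caller side: overwrite an entry without clearing it first, then release
 * whatever page was mapped there before.
 */
int tce_map(struct tce_table *t, unsigned int idx, uint64_t hpa,
	    unsigned int *pages_released)
{
	uint64_t old;
	int ret = tce_xchg(t, idx, hpa, &old);

	if (ret)
		return ret;
	if (old)
		(*pages_released)++;  /* stand-in for put_page() on the old page */
	return 0;
}
```

Returning the replaced entry is what lets the caller release the old page afterwards instead of failing with EBUSY when an entry is already valid.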
    • vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control · f87a8864
      Alexey Kardashevskiy authored
      
      
      This adds tce_iommu_take_ownership() and tce_iommu_release_ownership()
      which call iommu_take_ownership()/iommu_release_ownership() in a loop
      for every table in the group. As there is just one now, no change in
      behaviour is expected.
      
      At the moment the iommu_table struct has a set_bypass() which enables/
      disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
      which calls this callback when external IOMMU users such as VFIO are
      about to take over a PHB.

      The set_bypass() callback is not really an iommu_table function but
      an IOMMU/PE function. This introduces an iommu_table_group_ops struct and
      adds take_ownership()/release_ownership() callbacks to it which are
      called when an external user takes/releases control over the IOMMU.

      This replaces set_bypass() with ownership callbacks as it is not
      necessarily just bypass enabling, it can be something else/more,
      so let's give it a more generic name.

      The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
      IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
      The following patches will replace iommu_take_ownership/
      iommu_release_ownership calls in IODA2 with full IOMMU table release/
      create.

      As we are here and touching bypass control, this removes
      pnv_pci_ioda2_setup_bypass_pe() as it does not do much
      more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
      initialization to pnv_pci_ioda2_setup_dma_pe.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group · 0eaf4def
      Alexey Kardashevskiy authored
      
      
      So far one TCE table could only be used by one IOMMU group. However
      IODA2 hardware allows programming the same TCE table address to
      multiple PEs, allowing tables to be shared.

      This replaces a single pointer to a group in an iommu_table struct
      with a linked list of groups which provides a way of invalidating
      the TCE cache for every PE when an actual TCE table is updated. This adds
      pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
      helpers to manage the list. However without VFIO, it is still going
      to be a single IOMMU group per iommu_table.

      This changes iommu_add_device() to add a device to the first group
      from the group list of a table as it is only called from the platform
      init code or PCI bus notifier and at these moments there is only
      one group per table.
      
      This does not change TCE invalidation code to loop through all
      attached groups in order to simplify this patch and because
      it is not really needed in most cases. IODA2 is fixed in a later
      patch.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/spapr: vfio: Replace iommu_table with iommu_table_group · b348aa65
      Alexey Kardashevskiy authored
      
      
      Modern IBM POWERPC systems support multiple (currently two) TCE tables
      per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
      for TCE tables. Right now just one table is supported.
      
      This defines iommu_table_group struct which stores pointers to
      iommu_group and iommu_table(s). This replaces iommu_table with
      iommu_table_group where iommu_table was used to identify a group:
      - iommu_register_group();
      - iommudata of generic iommu_group;
      
      This removes @data from iommu_table as it_table_group provides
      same access to pnv_ioda_pe.
      
      For IODA, instead of embedding iommu_table, the new iommu_table_group
      keeps pointers to those. The iommu_table structs are allocated
      dynamically.
      
      For P5IOC2, both iommu_table_group and iommu_table are embedded into
      PE struct. As there is no EEH and SRIOV support for P5IOC2,
      iommu_free_table() should not be called on iommu_table struct pointers
      so we can keep it embedded in pnv_phb::p5ioc2.
      
      For pSeries, this replaces multiple calls of kzalloc_node() with a new
      iommu_pseries_alloc_group() helper and stores the table group struct
      pointer into the pci_dn struct. For release, an iommu_table_free_group()
      helper is added.
      
      This moves iommu_table struct allocation from SR-IOV code to
      the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
      pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
      This change is here because those lines had to be changed anyway.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: Rework groups attaching · 22af4859
      Alexey Kardashevskiy authored
      
      
      This is to make extended ownership and multiple groups support patches
      simpler for review.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: Moving pinning/unpinning to helpers · 649354b7
      Alexey Kardashevskiy authored
      
      
      This is a pretty mechanical patch to make next patches simpler.
      
      The new tce_iommu_unuse_page() helper does put_page() now but it might skip
      that after the memory registering patch is applied.
      
      As we are here, this removes unnecessary checks for a value returned
      by pfn_to_page() as it cannot possibly return NULL.
      
      This moves tce_iommu_disable() later to let tce_iommu_clear() know if
      the container has been enabled because if it has not been, then
      put_page() must not be called on TCEs from the TCE table. This situation
      is not yet possible but it will be after the KVM acceleration patchset is
      applied.
      
      This changes code to work with physical addresses rather than linear
      mapping addresses for better code readability. Following patches will
      add an xchg() callback for an IOMMU table which will accept/return
      physical addresses (unlike current tce_build()) which will eliminate
      redundant conversions.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: Disable DMA mappings on disabled container · 3c56e822
      Alexey Kardashevskiy authored
      
      
      At the moment DMA map/unmap requests are handled irrespective of
      the container's state. This allows user space to pin memory which
      it might not be allowed to pin.
      
      This adds checks to MAP/UNMAP that the container is enabled, otherwise
      -EPERM is returned.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
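A minimal sketch of the added check, assuming a hypothetical container structure with an enabled flag and a page counter. Only the -EPERM result for a disabled container is taken from the commit message; everything else is illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container state. */
struct container {
	bool enabled;
	unsigned long mapped_pages;
};

/* Map npages; refused with -EPERM until the container is enabled. */
int container_map(struct container *c, unsigned long npages)
{
	assert(c != NULL);
	if (!c->enabled)
		return -EPERM;
	c->mapped_pages += npages;
	return 0;
}

/* Unmap npages; also refused with -EPERM on a disabled container. */
int container_unmap(struct container *c, unsigned long npages)
{
	assert(c != NULL);
	if (!c->enabled)
		return -EPERM;
	if (npages > c->mapped_pages)
		return -EINVAL;
	c->mapped_pages -= npages;
	return 0;
}
```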
    • vfio: powerpc/spapr: Move locked_vm accounting to helpers · 2d270df8
      Alexey Kardashevskiy authored
      
      
      This moves locked pages accounting to helpers.
      Later they will be reused for Dynamic DMA windows (DDW).
      
      This reworks debug messages to show the current value and the limit.
      
      This stores the locked pages number in the container so when unlocking
      the iommu table pointer won't be needed. This does not have an effect
      now but it will with multiple tables per container, as then we will
      allow attaching/detaching groups on the fly and we may end up having
      a container with no group attached but with the counter incremented.
      
      While we are here, update the comment explaining why RLIMIT_MEMLOCK
      might be required to be bigger than the guest RAM. This also prints
      pid of the current process in pr_warn/pr_debug.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
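The helper pair can be sketched as below, assuming a simplified model in which mm_model stands in for the process accounting state (locked_vm and a limit, both in pages) and the container remembers how many pages it accounted. Names are hypothetical and the capability override that the kernel applies is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the per-process accounting state, in pages. */
struct mm_model {
	unsigned long locked_vm;
	unsigned long limit;  /* models RLIMIT_MEMLOCK expressed in pages */
};

/* The container remembers what it accounted, so no table is needed later. */
struct container {
	unsigned long locked_pages;
};

/* Account npages against the limit; returns 0 or -ENOMEM. */
int locked_vm_inc(struct mm_model *mm, struct container *c,
		  unsigned long npages)
{
	assert(mm != NULL && c != NULL);
	if (mm->locked_vm > mm->limit || npages > mm->limit - mm->locked_vm)
		return -ENOMEM;
	mm->locked_vm += npages;
	c->locked_pages += npages;
	return 0;
}

/* Undo the accounting using only the count stored in the container. */
void locked_vm_dec(struct mm_model *mm, struct container *c)
{
	unsigned long n;

	assert(mm != NULL && c != NULL);
	n = c->locked_pages;
	if (n > mm->locked_vm)
		n = mm->locked_vm;  /* never let the counter underflow */
	mm->locked_vm -= n;
	c->locked_pages = 0;
}
```

Keeping the count in the container is what allows the decrement to work even when no group, and therefore no table, is attached any more.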
    • vfio: powerpc/spapr: Use it_page_size · 00663d4e
      Alexey Kardashevskiy authored
      
      
      This makes use of the it_page_size from the iommu_table struct
      as page size can differ.
      
      This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
      as recently introduced IOMMU_PAGE_XXX macros do not include
      IOMMU_PAGE_SHIFT.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page · e432bc7e
      Alexey Kardashevskiy authored
      
      
      This checks that the TCE table page size is not bigger than the size of
      the page we just pinned and whose physical address we are going to put
      into the table.
      
      Otherwise the hardware gets unwanted access to physical memory between
      the end of the actual page and the end of the aligned up TCE page.
      
      Since compound_order() and compound_head() work correctly on non-huge
      pages, there is no need for additional check whether the page is huge.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
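The containment check reduces to a comparison of shifts. The sketch below assumes a 64K system page (SYS_PAGE_SHIFT is a stand-in for the kernel's PAGE_SHIFT) and takes the compound order of the pinned page as a plain parameter; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SYS_PAGE_SHIFT 16u  /* assumed 64K system page for this example */

/*
 * compound_order is 0 for a normal page and N for a compound page made of
 * 2^N system pages. The IOMMU page fits if it is no larger than that.
 */
bool tce_page_is_contained(unsigned int compound_order,
			   unsigned int iommu_page_shift)
{
	assert(iommu_page_shift < 64 && compound_order < 48);
	return iommu_page_shift <= SYS_PAGE_SHIFT + compound_order;
}
```

For example, a 16M IOMMU page (shift 24) is rejected for a single 64K page but accepted for a compound page of order 8, since 16 + 8 = 24.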
    • vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver · 9b14a1ff
      Alexey Kardashevskiy authored
      
      
      This moves page pinning (get_user_pages_fast()/put_page()) code out of
      the platform IOMMU code and puts it into the VFIO IOMMU driver where it
      belongs, as the platform code does not deal with page pinning.
      
      This makes iommu_take_ownership()/iommu_release_ownership() deal with
      the IOMMU table bitmap only.
      
      This removes page unpinning from iommu_take_ownership() as the actual
      TCE table might contain garbage and doing put_page() on it is undefined
      behaviour.
      
      Besides the last part, the rest of the patch is mechanical.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  10. 09 Jun, 2015 1 commit
    • vfio/pci: Fix racy vfio_device_get_from_dev() call · 20f30017
      Alex Williamson authored
      
      
      Testing the driver for a PCI device is racy: it can be all but
      complete in the release path and still report the driver as ours.
      Therefore we can't trust drvdata to be valid.  This race can sometimes
      be seen when one port of a multifunction device is being unbound from
      the vfio-pci driver while another function is being released by the
      user and attempting a bus reset.  The device in the remove path is
      found as a dependent device for the bus reset of the release path
      device, the driver is still set to vfio-pci, but the drvdata has
      already been cleared, resulting in a null pointer dereference.
      
      To resolve this, fix vfio_device_get_from_dev() to not take the
      dev_get_drvdata() shortcut and instead traverse through the
      iommu_group, vfio_group, vfio_device path to get a reference we
      can trust.  Once we have that reference, we know the device isn't
      in transition and we can test to make sure the driver is still what
      we expect, so that we don't interfere with devices we don't own.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>