1. 20 Nov, 2008 1 commit
    • Tony Luck's avatar
      [IA64] Rationalize kernel mode alignment checking · b704882e
      Tony Luck authored
      Itanium processors can handle some misaligned data accesses. They
      also provide a mode where all such accesses are forced to trap. The
      kernel was schizophrenic about use of this mode:
      * Base kernel code ran in permissive mode where the only traps
        generated were from those cases that the h/w could not handle.
      * Interrupt, syscall and trap code ran in strict mode where all
        unaligned accesses caused traps to the 0x5a00 unaligned reference
      Use strict alignment checking throughout the kernel, but make
      sure that we continue to let user mode use more relaxed mode
      as the default.
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
  2. 06 Oct, 2008 1 commit
  3. 25 Jul, 2008 1 commit
  4. 27 May, 2008 1 commit
    • Isaku Yamahata's avatar
      [IA64] pvops: paravirtualize entry.S · 4df8d22b
      Isaku Yamahata authored
      paravirtualize ia64_swtich_to, ia64_leave_syscall and ia64_leave_kernel.
      They include sensitive or performance critical privileged instructions
      so that they need paravirtualization.
      To paravirtualize them by single source and multi compile
      they are converted into indirect jump. And define each pv instances.
      Cc: Keith Owens <kaos@ocs.com.au>
      Cc: "Dong, Eddie" <eddie.dong@intel.com>
      Signed-off-by: default avatarIsaku Yamahata <yamahata@valinux.co.jp>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
  5. 14 May, 2008 2 commits
  6. 22 Apr, 2008 1 commit
    • Hidetoshi Seto's avatar
      [IA64] disable interrupts on exit of ia64_trace_syscall · 38477ad7
      Hidetoshi Seto authored
      While testing with CONFIG_VIRT_CPU_ACCOUNTING=y, I found that
      I occasionally get very huge system time in some threads.
      So I dug the issue and finally noticed that it was caused
      because of an interrupt which interrupt in the following window:
      > [arch/ia64/kernel/entry.S: (!CONFIG_PREEMPT && CONFIG_VIRT_CPU_ACCOUNTING)]
      > ENTRY(ia64_leave_syscall)
      >    :
      > (pUStk) rsm psr.i
      >         cmp.eq pLvSys,p0=r0,r0          // pLvSys=1: leave from syscall
      > (pUStk) cmp.eq.unc p6,p0=r0,r0          // p6 <- pUStk
      > .work_processed_syscall:
      >         adds r2=PT(LOADRS)+16,r12
      > (pUStk) mov.m r22=ar.itc                        // fetch time at leave
      >         adds r18=TI_FLAGS+IA64_TASK_SIZE,r13
      >         ;;
      > <<< window: from here >>>
      > (p6)    ld4 r31=[r18]  // load current_thread_info()->flags
      >         ld8 r19=[r2],PT(B6)-PT(LOADRS)
      >         adds r3=PT(AR_BSPSTORE)+16,r12
      >         ;;
      >         mov r16=ar.bsp
      >         ld8 r18=[r2],PT(R9)-PT(B6)
      > (p6)    and r15=TIF_WORK_MASK,r31  // any work other than TIF_SYSCALL_TRACE?
      >         ;;
      >         ld8 r23=[r3],PT(R11)-PT(AR_BSPSTORE)
      > (p6)    cmp4.ne.unc p6,p0=r15, r0               // any special work pending?
      > (p6)    br.cond.spnt .work_pending_syscall
      >         ;;
      >         ld8 r9=[r2],PT(CR_IPSR)-PT(R9)
      >         ld8 r11=[r3],PT(CR_IIP)-PT(R11)
      > (pNonSys) break 0 // bug check: we shouldn't be here if pNonSys is TRUE!
      >         ;;
      >         invala
      > <<< window: to here >>>
      >         rsm psr.i | psr.ic // turn off interrupts and interruption collection
      If pUStk is true, it means we are going to return user mode, hence we fetch
      ar.itc to get time at leave from system.
      It seems that it is not possible to interrupt the window if pUStk is true,
      because interrupts are disabled early.  And also disabling interrupt makes
      sense because it is safe for referring current_thread_info()->flags.
      However interrupting the window while pUStk is true was possible.
      The route was:
      -> .work_pending_syscall_end
      -> .work_processed_syscall
      Only in case entering the window from this route, interrupts are enabled
      during in the window even if pUStk is true.  I suppose interrupts must be
      disabled here anyway if pUStk is true.
      I'm not sure but afraid that what kind of bad effect were there, other
      than crazy system time which I found.
      FYI, there was a commit 6f6d7582
      points out a bug at same point(exit of ia64_trace_syscall) in 2006.
      It can be said that there was an another bug.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
  7. 20 Feb, 2008 1 commit
    • Hidetoshi Seto's avatar
      [IA64] VIRT_CPU_ACCOUNTING (accurate cpu time accounting) · b64f34cd
      Hidetoshi Seto authored
      This patch implements VIRT_CPU_ACCOUNTING for ia64,
      which enable us to use more accurate cpu time accounting.
      The VIRT_CPU_ACCOUNTING is an item of kernel config, which s390
      and powerpc arch have.  By turning this config on, these archs
      change the mechanism of cpu time accounting from tick-sampling
      based one to state-transition based one.
      The state-transition based accounting is done by checking time
      (cycle counter in processor) at every state-transition point,
      such as entrance/exit of kernel, interrupt, softirq etc.
      The difference between point to point is the actual time consumed
      during in the state. There is no doubt about that this value is
      more accurate than that of tick-sampling based accounting.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
  8. 08 Feb, 2008 1 commit
  9. 05 Feb, 2008 1 commit
    • Davide Libenzi's avatar
      timerfd: new timerfd API · 4d672e7a
      Davide Libenzi authored
      This is the new timerfd API as it is implemented by the following patch:
      int timerfd_create(int clockid, int flags);
      int timerfd_settime(int ufd, int flags,
      		    const struct itimerspec *utmr,
      		    struct itimerspec *otmr);
      int timerfd_gettime(int ufd, struct itimerspec *otmr);
      The timerfd_create() API creates an un-programmed timerfd fd.  The "clockid"
      parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
      The timerfd_settime() API give new settings by the timerfd fd, by optionally
      retrieving the previous expiration time (in case the "otmr" parameter is not
      The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
      is set in the "flags" parameter.  Otherwise it's a relative time.
      The timerfd_gettime() API returns the next expiration time of the timer, or
      {0, 0} if the timerfd has not been set yet.
      Like the previous timerfd API implementation, read(2) and poll(2) are
      supported (with the same interface).  Here's a simple test program I used to
      exercise the new timerfd APIs:
      [akpm@linux-foundation.org: coding-style cleanups]
      [akpm@linux-foundation.org: fix ia64 build]
      [akpm@linux-foundation.org: fix m68k build]
      [akpm@linux-foundation.org: fix mips build]
      [akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
      [heiko.carstens@de.ibm.com: fix s390]
      [akpm@linux-foundation.org: fix powerpc build]
      [akpm@linux-foundation.org: fix sparc64 more]
      Signed-off-by: default avatarDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  10. 19 Jul, 2007 1 commit
  11. 14 May, 2007 1 commit
  12. 10 May, 2007 1 commit
  13. 08 May, 2007 2 commits
  14. 06 Feb, 2007 1 commit
    • Chen, Kenneth W's avatar
      [IA64] remove per-cpu ia64_phys_stacked_size_p8 · a0776ec8
      Chen, Kenneth W authored
      It's not efficient to use a per-cpu variable just to store
      how many physical stack register a cpu has.  Ever since the
      incarnation of ia64 up till upcoming Montecito processor, that
      variable has "glued" to 96. Having a variable in memory means
      that the kernel is burning an extra cacheline access on every
      syscall and kernel exit path.  Such "static" value is better
      served with the instruction patching utility exists today.
      Convert ia64_phys_stacked_size_p8 into dynamic insn patching.
      This also has a pleasant side effect of eliminating access to
      per-cpu area while psr.ic=0 in the kernel exit path. (fixable
      for per-cpu DTC work, but why bother?)
      There are some concerns with the default value that the instruc-
      tion encoded in the kernel image.  It shouldn't be concerned.
      The reasons are:
      (1) cpu_init() is called at CPU initialization.  In there, we
          find out physical stack register size from PAL and patch
          two instructions in kernel exit code.  The code in question
          can not be executed before the patching is done.
      (2) current implementation stores zero in ia64_phys_stacked_size_p8,
          and that's what the current kernel exit path loads the value with.
          With the new code, it is equivalent that we store reg size 96
          in ia64_phys_stacked_size_p8, thus creating a better safety net.
          Given (1) above can never fail, having (2) is just a bonus.
      All in all, this patch allow one less memory reference in the kernel
      exit path, thus reducing syscall and interrupt return latency; and
      avoid polluting potential useful data in the CPU cache.
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
  15. 05 Feb, 2007 1 commit
  16. 07 Dec, 2006 1 commit
    • Zou Nan hai's avatar
      [IA64] IA64 Kexec/kdump · a7956113
      Zou Nan hai authored
      Changes and updates.
      1. Remove fake rendz path and related code according to discuss with Khalid Aziz.
      2. fc.i offset fix in relocate_kernel.S.
      3. iospic shutdown code eoi and mask race fix from Fujitsu.
      4. Warm boot hook in machine_kexec to SN SAL code from Jack Steiner.
      5. Send slave to SAL slave loop patch from Jay Lan.
      6. Kdump on non-recoverable MCA event patch from Jay Lan
      7. Use CTL_UNNUMBERED in kdump_on_init sysctl.
      Signed-off-by: default avatarZou Nan hai <nanhai.zou@intel.com>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
  17. 03 Oct, 2006 1 commit
  18. 02 Oct, 2006 1 commit
    • Arnd Bergmann's avatar
      [PATCH] rename the provided execve functions to kernel_execve · 3db03b4a
      Arnd Bergmann authored
      Some architectures provide an execve function that does not set errno, but
      instead returns the result code directly.  Rename these to kernel_execve to
      get the right semantics there.  Moreover, there is no reasone for any of these
      architectures to still provide __KERNEL_SYSCALLS__ or _syscallN macros, so
      remove these right away.
      [akpm@osdl.org: build fix]
      [bunk@stusta.de: build fix]
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Andi Kleen <ak@muc.de>
      Acked-by: default avatarPaul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  19. 26 Sep, 2006 1 commit
  20. 08 Sep, 2006 1 commit
  21. 30 Jun, 2006 1 commit
  22. 23 Jun, 2006 1 commit
    • Christoph Lameter's avatar
      [PATCH] page migration: sys_move_pages(): support moving of individual pages · 742755a1
      Christoph Lameter authored
      move_pages() is used to move individual pages of a process. The function can
      be used to determine the location of pages and to move them onto the desired
      node. move_pages() returns status information for each page.
      long move_pages(pid, number_of_pages_to_move,
      		nodes[] or NULL,
      The addresses of pages is an array of void * pointing to the
      pages to be moved.
      The nodes array contains the node numbers that the pages should be moved
      to. If a NULL is passed instead of an array then no pages are moved but
      the status array is updated. The status request may be used to determine
      the page state before issuing another move_pages() to move pages.
      The status array will contain the state of all individual page migration
      attempts when the function terminates. The status array is only valid if
      move_pages() completed successfullly.
      Possible page states in status[]:
      0..MAX_NUMNODES	The page is now on the indicated node.
      -ENOENT		Page is not present
      -EACCES		Page is mapped by multiple processes and can only
      		be moved if MPOL_MF_MOVE_ALL is specified.
      -EPERM		The page has been mlocked by a process/driver and
      		cannot be moved.
      -EBUSY		Page is busy and cannot be moved. Try again later.
      -EFAULT		Invalid address (no VMA or zero page).
      -ENOMEM		Unable to allocate memory on target node.
      -EIO		Unable to write back page. The page must be written
      		back in order to move it since the page is dirty and the
      		filesystem does not provide a migration function that
      		would allow the moving of dirty pages.
      -EINVAL		A dirty page cannot be moved. The filesystem does not provide
      		a migration function and has no ability to write back pages.
      The flags parameter indicates what types of pages to move:
      MPOL_MF_MOVE	Move pages that are only mapped by the process.
      MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
      		Requires sufficient capabilities.
      Possible return codes from move_pages()
      -ENOENT		No pages found that would require moving. All pages
      		are either already on the target node, not present, had an
      		invalid address or could not be moved because they were
      		mapped by multiple processes.
      -EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
      		to migrate pages in a kernel thread.
      -EPERM		MPOL_MF_MOVE_ALL specified without sufficient priviledges.
      		or an attempt to move a process belonging to another user.
      -EACCES		One of the target nodes is not allowed by the current cpuset.
      -ENODEV		One of the target nodes is not online.
      -ESRCH		Process does not exist.
      -E2BIG		Too many pages to move.
      -ENOMEM		Not enough memory to allocate control array.
      -EFAULT		Parameters could not be accessed.
      A test program for move_pages() may be found with the patches
      on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
      From: Christoph Lameter <clameter@sgi.com>
        Detailed results for sys_move_pages()
        Pass a pointer to an integer to get_new_page() that may be used to
        indicate where the completion status of a migration operation should be
        placed.  This allows sys_move_pags() to report back exactly what happened to
        each page.
        Wish there would be a better way to do this. Looks a bit hacky.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  23. 26 Apr, 2006 1 commit
    • Jens Axboe's avatar
      [PATCH] Add support for the sys_vmsplice syscall · 912d35f8
      Jens Axboe authored
      sys_splice() moves data to/from pipes with a file input/output. sys_vmsplice()
      moves data to a pipe, with the input being a user address range instead.
      This uses an approach suggested by Linus, where we can hold partial ranges
      inside the pages[] map. Hopefully this will be useful for network
      receive support as well.
      Signed-off-by: default avatarJens Axboe <axboe@suse.de>
  24. 11 Apr, 2006 1 commit
    • Jens Axboe's avatar
      [PATCH] splice: add support for sys_tee() · 70524490
      Jens Axboe authored
      Basically an in-kernel implementation of tee, which uses splice and the
      pipe buffers as an intelligent way to pass data around by reference.
      Where the user space tee consumes the input and produces a stdout and
      file output, this syscall merely duplicates the data inside a pipe to
      another pipe. No data is copied, the output just grabs a reference to the
      input pipe data.
      Signed-off-by: default avatarJens Axboe <axboe@suse.de>
  25. 06 Apr, 2006 1 commit
  26. 04 Apr, 2006 1 commit
  27. 30 Mar, 2006 1 commit
    • Jens Axboe's avatar
      [PATCH] Introduce sys_splice() system call · 5274f052
      Jens Axboe authored
      This adds support for the sys_splice system call. Using a pipe as a
      transport, it can connect to files or sockets (latter as output only).
      From the splice.c comments:
         "splice": joining two ropes together by interweaving their strands.
         This is the "extended pipe" functionality, where a pipe is used as
         an arbitrary in-memory buffer. Think of a pipe as a small kernel
         buffer that you can use to transfer data from one end to the other.
         The traditional unix read/write is extended with a "splice()" operation
         that transfers data buffers to or from a pipe buffer.
         Named by Larry McVoy, original implementation from Linus, extended by
         Jens to support splicing to files and fixing the initial implementation
      Signed-off-by: default avatarJens Axboe <axboe@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  28. 16 Feb, 2006 1 commit
  29. 08 Feb, 2006 1 commit
  30. 06 Feb, 2006 1 commit
  31. 26 Jan, 2006 1 commit
  32. 08 Jan, 2006 1 commit
    • Christoph Lameter's avatar
      [PATCH] Swap Migration V5: sys_migrate_pages interface · 39743889
      Christoph Lameter authored
      sys_migrate_pages implementation using swap based page migration
      This is the original API proposed by Ray Bryant in his posts during the first
      half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.
      The intent of sys_migrate is to migrate memory of a process.  A process may
      have migrated to another node.  Memory was allocated optimally for the prior
      context.  sys_migrate_pages allows to shift the memory to the new node.
      sys_migrate_pages is also useful if the processes available memory nodes have
      changed through cpuset operations to manually move the processes memory.  Paul
      Jackson is working on an automated mechanism that will allow an automatic
      migration if the cpuset of a process is changed.  However, a user may decide
      to manually control the migration.
      This implementation is put into the policy layer since it uses concepts and
      functions that are also needed for mbind and friends.  The patch also provides
      a do_migrate_pages function that may be useful for cpusets to automatically
      move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
      The current code here is based on the swap based page migration capability and
      thus is not able to preserve the physical layout relative to it containing
      nodeset (which may be a cpuset).  When direct page migration becomes available
      then the implementation needs to be changed to do a isomorphic move of pages
      between different nodesets.  The current implementation simply evicts all
      pages in source nodeset that are not in the target nodeset.
      Patch supports ia64, i386 and x86_64.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  33. 16 Sep, 2005 1 commit
  34. 09 Sep, 2005 2 commits
  35. 07 Sep, 2005 1 commit
  36. 01 Aug, 2005 1 commit
    • Ingo Molnar's avatar
      [PATCH] remove sys_set_zone_reclaim() · 6cb54819
      Ingo Molnar authored
      This removes sys_set_zone_reclaim() for now.  While i'm sure Martin is
      trying to solve a real problem, we must not hard-code an incomplete and
      insufficient approach into a syscall, because syscalls are pretty much
      for eternity.  I am quite strongly convinced that this syscall must not
      hit v2.6.13 in its current form.
      Firstly, the syscall lacks basic syscall design: e.g. it allows the
      global setting of VM policy for unprivileged users. (!) [ Imagine an
      Oracle installation and a SAP installation on the same NUMA box fighting
      over the 'optimal' setting for this flag. What will they do? Will they
      try to set the flag to their own preferred value every second or so? ]
      Secondly, it was added based on a single datapoint from Martin:
      where Martin characterizes the numbers the following way:
       ' Run-to-run variability for "make -j" is huge, so these numbers aren't
         terribly useful except to see that with reclaim the benchmark still
         finishes in a reasonable amount of time. '
      in other words: the fundamental problem has likely not been solved, only
      a tendential move into the right direction has been observed, and a
      handful of numbers were picked out of a set of hugely variable results,
      without showing the variability data. How much variance is there
      I'd really suggest to first walk the walk and see what's needed to get
      stable & predictable kernel compilation numbers on that NUMA box, before
      adding random syscalls to tune a particular aspect of the VM ... which
      approach might not even matter once the whole picture has been analyzed
      and understood!
      The third, most important point is that the syscall exposes VM tuning
      internals in a completely unstructured way. What sense does it make to
      have a _GLOBAL_ per-node setting for 'should we go to another node for
      reclaim'? If then it might make sense to do this per-app, via numalib or
      The change is minimalistic in that it doesnt remove the syscall and the
      underlying infrastructure changes, only the user-visible changes.  We
      could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka
      /proc/sys/vm/swappiness, but even that looks quite counterproductive
      when the generic approach is that we are trying to reduce the number of
      external factors in the VM balance picture.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  37. 27 Jul, 2005 1 commit