1. 18 Nov, 2010 1 commit
  2. 13 Nov, 2010 1 commit
  3. 11 Oct, 2010 1 commit
    • ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes. · 7bdb0d18
      Tristan Ye authored

      Currently, the default behavior of O_DIRECT writes is to allow
      concurrent writes among nodes to the same file, with no cluster
      coherency guaranteed (no EX lock held).  This can leave stale data in
      the cache for buffered reads on other nodes.

      The new mount option lets the administrator choose between two
      behaviors for O_DIRECT writes:
      
          * coherency=full, the default, disallows concurrent
                            O_DIRECT writes by taking EX locks.

          * coherency=buffered allows concurrent O_DIRECT writes
                               among nodes without taking EX locks,
                               gaining performance at the risk of
                               stale data on other nodes.
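
      As a rough illustration of the choice (a userspace model with made-up
      names, not the patch's actual code), the decision boils down to one
      predicate:

          #include <stdio.h>

          /* Hypothetical model of the coherency= decision. */
          enum coherency_mode { COHERENCY_FULL, COHERENCY_BUFFERED };

          /* Returns 1 if an O_DIRECT write must take the cluster EX lock. */
          static int odirect_needs_ex_lock(enum coherency_mode mode, int is_odirect)
          {
              /* Buffered coherency skips the EX lock only for O_DIRECT
               * writes; everything else keeps full coherency. */
              return !(mode == COHERENCY_BUFFERED && is_odirect);
          }

          int main(void)
          {
              printf("full, O_DIRECT     -> EX? %d\n",
                     odirect_needs_ex_lock(COHERENCY_FULL, 1));     /* 1 */
              printf("buffered, O_DIRECT -> EX? %d\n",
                     odirect_needs_ex_lock(COHERENCY_BUFFERED, 1)); /* 0 */
              return 0;
          }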
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  4. 07 Oct, 2010 1 commit
    • ocfs2: Add support for heartbeat=global mount option · 2c442719
      Sunil Mushran authored

      Adds support for the heartbeat=global mount option. It ensures that the
      heartbeat mode passed in matches the one enabled on disk.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
  5. 09 Oct, 2010 1 commit
    • ocfs2: Add an incompat feature flag OCFS2_FEATURE_INCOMPAT_CLUSTERINFO · 98f486f2
      Sunil Mushran authored

      OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
      both the userspace and o2cb cluster stacks. It also allows us to extend the
      cluster info to include stack flags.

      This patch also adds stack flags to sb->s_cluster_info and introduces a
      cluster info flag, OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT, to denote that
      global heartbeat mode is enabled.

      The incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
      cluster info flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.
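
      A hedged sketch of how the two flag levels relate (the constants and
      struct layout here are stand-ins; the real definitions live in the
      ocfs2 headers):

          #include <stdint.h>
          #include <stdio.h>

          #define INCOMPAT_CLUSTERINFO           0x0001  /* superblock incompat bit */
          #define CLUSTER_O2CB_GLOBAL_HEARTBEAT  0x01    /* cluster info stack flag */

          struct cluster_info {   /* simplified stand-in for sb->s_cluster_info */
              uint8_t ci_stackflags;
          };

          static int global_heartbeat_enabled(uint32_t incompat,
                                              const struct cluster_info *ci)
          {
              /* Stack flags are only meaningful once the CLUSTERINFO
               * incompat feature is set on the superblock. */
              return (incompat & INCOMPAT_CLUSTERINFO) &&
                     (ci->ci_stackflags & CLUSTER_O2CB_GLOBAL_HEARTBEAT);
          }

          int main(void)
          {
              struct cluster_info ci = { CLUSTER_O2CB_GLOBAL_HEARTBEAT };

              printf("%d\n", global_heartbeat_enabled(INCOMPAT_CLUSTERINFO, &ci));
              return 0;
          }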
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
  6. 10 Sep, 2010 2 commits
    • ocfs2: Cache system inodes of other slots. · b4d693fc
      Tao Ma authored

      During orphan scan, if we are in slot 0 and we are replaying
      orphan_dir:0001, the general process for every file in this dir is:
      1. We iget orphan_dir:0001. Since there is no inode for it in memory,
         we have to create an inode and read it from the disk.
      2. Do the normal work, such as delete_inode, and remove the file from
         the dir if it is allowed.
      3. Call iput on orphan_dir:0001 when we are done. In this case,
         since we have no dcache for this inode, i_count will
         reach 0, and VFS will have to call clear_inode; in
         ocfs2_clear_inode we will checkpoint the inode, which will let
         ocfs2_cmt and journald begin to work.
      4. We loop back to 1 for the next file.
      
      So, for every deleted file, we actually have to read the
      orphan dir from the disk and checkpoint the journal. This is very
      time consuming and causes a lot of journal checkpoint I/O.
      A better solution is to hold another reference to these
      inodes in ocfs2_super. Then, if there is no race among
      nodes (which would make dlmglue checkpoint the inode),
      clear_inode won't be called in step 3, and in step 1 we only need to
      read the inode the first time. This is a big win for us.
      
      So this patch will try to cache system inodes of other slots so
      that we will have one more reference for these inodes and avoid
      the extra inode read and journal checkpoint.
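
      The idea in miniature (a userspace model, not the kernel code; the
      cache here is just an array holding one extra reference per slot):

          #include <stdio.h>
          #include <stdlib.h>

          #define NUM_SLOTS 4

          struct inode { int i_count; int slot; };

          static struct inode *cached[NUM_SLOTS]; /* per-slot system inode cache */

          static struct inode *disk_iget(int slot) /* stands in for a disk read */
          {
              struct inode *ino = calloc(1, sizeof(*ino));

              ino->slot = slot;
              printf("reading orphan_dir:%04d from disk\n", slot);
              return ino;
          }

          static struct inode *get_system_inode(int slot)
          {
              if (!cached[slot]) {
                  cached[slot] = disk_iget(slot);
                  /* The cache's own reference keeps i_count from hitting
                   * zero on the caller's final iput. */
                  cached[slot]->i_count++;
              }
              cached[slot]->i_count++;  /* reference for the caller */
              return cached[slot];
          }

          int main(void)
          {
              get_system_inode(1);  /* reads from "disk" */
              get_system_inode(1);  /* cache hit: no read, no checkpoint */
              return 0;
          }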
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • Reorganize data elements to reduce struct sizes · 83fd9c7f
      Goldwyn Rodrigues authored

      Thanks for the comments. I have incorporated them all.
      
      With CONFIG_OCFS2_FS_STATS enabled and CONFIG_DEBUG_LOCK_ALLOC disabled,
      the struct sizes now look like this (bytes before - bytes after = saved):
      ocfs2_write_ctxt:    2144 - 2136 = 8
      ocfs2_inode_info:    1960 - 1848 = 112
      ocfs2_journal:        168 - 160  = 8
      ocfs2_lock_res:       336 - 304  = 32
      ocfs2_refcount_tree:  512 - 472  = 40
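
      The savings come from ordinary struct padding rules. A generic
      illustration (not the ocfs2 structs themselves), assuming an LP64 ABI
      where pointers are 8 bytes and ints are 4:

          #include <stdio.h>

          struct before {        /* 8 + 4 + 4(pad) + 8 + 1 + 7(pad) = 32 bytes */
              void *p1;
              int   i;
              void *p2;
              char  c;
          };

          struct after {         /* 8 + 8 + 4 + 1 + 3(pad) = 24 bytes */
              void *p1;
              void *p2;
              int   i;
              char  c;
          };

          int main(void)
          {
              printf("before: %zu bytes, after: %zu bytes\n",
                     sizeof(struct before), sizeof(struct after));
              return 0;
          }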
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  7. 05 May, 2010 4 commits
    • ocfs2: Add dir_resv_level mount option · 83f92318
      Mark Fasheh authored

      The default behavior for directory reservations stays the same, but we add a
      mount option so people can tweak the size of directory reservations
      according to their workloads.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: increase the default size of local alloc windows · 6b82021b
      Mark Fasheh authored

      I have observed that the current size of 8M gives us pretty poor
      fragmentation on multi-threaded workloads which do lots of writes.
      
      Generally, I can increase the size of local alloc windows and observe a
      marked decrease in fragmentation, even up to and beyond window sizes of 512
      megabytes. This makes sense for a couple of reasons - a larger local alloc
      means more room for reservation windows. On multi-node workloads the larger
      local alloc helps as well because we don't have to do window slides as often.
      
      Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
      longer used and the comment above it was out of date.
      
      To test fragmentation, I used a workload which launched 4 threads that did
      4k writes into a series of about 140 alternating files.
      
      With resv_level=2 and a 4k/4k file system, I observed the following average
      fragmentation for various localalloc= values (in megabytes):
      
      localalloc=	avg. fragmentation
      	8		48
      	32		16
      	64		10
      	120		7
      
      On larger cluster sizes, the difference is more dramatic.
      
      The new default size tops out at 256M, which we'll only get for cluster
      sizes of 32K and above.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: clean up localalloc mount option size parsing · 73c8a800
      Mark Fasheh authored

      This patch pulls the local alloc sizing code into localalloc.c and provides
      a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
      except that I correctly calculate the maximum local alloc size. The old code
      in ocfs2_parse_options() calculated the max size as:
      
      ocfs2_local_alloc_size(sb) * 8
      
      which is correct, in bits. Unfortunately, though, the option passed in is
      in megabytes. Ultimately, this bug made no real difference - the shrink
      code would catch a too-large size and bring it down to something
      reasonable. Still, it's less than efficient as-is.
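
      To see the unit mismatch concretely, here is a back-of-the-envelope
      model (the bitmap size and 4k geometry are illustrative, not ocfs2's
      exact on-disk layout):

          #include <stdio.h>

          int main(void)
          {
              unsigned long bitmap_bytes = 3834;      /* bitmap space in one 4k block */
              unsigned long bits = bitmap_bytes * 8;  /* where the old code stopped */
              unsigned long cluster_size = 4096;      /* one bit covers one cluster */
              unsigned long max_mb = bits * cluster_size / (1024 * 1024);

              printf("bits (clusters): %lu\n", bits);   /* 30672 -- not megabytes */
              printf("max size in MB:  %lu\n", max_mb); /* ~119 MB */
              return 0;
          }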
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: allocation reservations · d02f00cc
      Mark Fasheh authored

      This patch improves Ocfs2 allocation policy by allowing an inode to
      reserve a portion of the local alloc bitmap for itself. The reserved
      portion (allocation window) is advisory in that other allocation
      windows might steal it if the local alloc bitmap becomes
      full. Otherwise, the reservations are honored and guaranteed to be
      free. When the local alloc window is moved to a different portion of
      the bitmap, existing reservations are discarded.
      
      Reservation windows are represented internally by a red-black
      tree. Within that tree, each node represents the reservation window of
      one inode. An LRU of active reservations is also maintained. When new
      data is written, we allocate it from the inode's window. When all bits
      in a window are exhausted, we allocate a new one as close to the
      previous one as possible. Should we not find free space, an existing
      reservation is pulled off the LRU and cannibalized.
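
      A toy model of one window (the kernel keeps these in the red-black
      tree described above; a bare struct is enough to show the allocation
      path):

          #include <stdio.h>

          /* A reservation is a [r_start, r_start + r_len) range of bits in
           * the local alloc bitmap. */
          struct resv_window {
              unsigned int r_start;  /* first reserved bit */
              unsigned int r_len;    /* number of reserved bits */
              unsigned int r_used;   /* bits already handed out */
          };

          /* Allocate one bit from an inode's window, if any remain. */
          static int resv_alloc_bit(struct resv_window *w, unsigned int *bit)
          {
              if (w->r_used == w->r_len)
                  return -1;  /* exhausted: find or cannibalize a new window */
              *bit = w->r_start + w->r_used++;
              return 0;
          }

          int main(void)
          {
              struct resv_window w = { 100, 3, 0 };
              unsigned int bit;

              while (resv_alloc_bit(&w, &bit) == 0)
                  printf("allocated bit %u\n", bit); /* 100, 101, 102 */
              return 0;
          }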
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
  8. 23 Mar, 2010 1 commit
    • ocfs2: Clear undo bits when local alloc is freed · b4414eea
      Mark Fasheh authored

      When the local alloc file changes windows, unused bits are freed back to the
      global bitmap. By definition, those bits cannot be in use by any file. Also,
      the local alloc will never have been able to allocate those bits if they
      were part of a previous truncate. Therefore it makes sense that we should
      clear unused local alloc bits in the undo buffer so that they can be used
      immediately.
      
      [ Modified to call it ocfs2_release_clusters() -- Joel ]
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  9. 22 Apr, 2010 1 commit
  10. 13 Apr, 2010 1 commit
  11. 02 Mar, 2010 1 commit
  12. 26 Feb, 2010 2 commits
  13. 03 Feb, 2010 1 commit
    • ocfs2: Prevent a livelock in dlmglue · a1912826
      Sunil Mushran authored

      There is a possibility of a livelock in __ocfs2_cluster_lock(). If a node
      gets an ast for an upconvert request, followed immediately by a bast,
      there is a small window where the fs may downconvert the lock before the
      process requesting the upconvert is able to take the lock.
      
      This patch adds a new flag to indicate that the upconvert is still in
      progress and that the dc thread should not downconvert it right now.
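
      In outline, the downconvert thread's new check looks something like
      this (the flag name is a stand-in for the lockres flag the patch adds):

          #include <stdio.h>

          #define LOCK_UPCONVERT_FINISHING 0x01  /* granted, not yet picked up */

          /* Skip the downconvert while a granted upconvert is still being
           * picked up by the process that requested it. */
          static int can_downconvert(unsigned int lockres_flags)
          {
              return !(lockres_flags & LOCK_UPCONVERT_FINISHING);
          }

          int main(void)
          {
              printf("%d\n", can_downconvert(0));                        /* 1 */
              printf("%d\n", can_downconvert(LOCK_UPCONVERT_FINISHING)); /* 0 */
              return 0;
          }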
      
      Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker
      <joel.becker@oracle.com> contributed heavily to this patch.
      Reported-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  14. 13 Nov, 2009 1 commit
  15. 29 Oct, 2009 1 commit
    • ocfs2: Make acl use the default · 5297aad8
      Jan Kara authored

      Change the acl mount option handling to match that of XFS and BTRFS;
      hopefully it is also easier to use now. When the admin does not specify
      any acl mount option, acls are enabled if and only if the filesystem has
      the xattr feature enabled. If the admin specifies the 'acl' mount option,
      we fail the mount if the filesystem does not have the xattr feature and
      thus acls cannot be enabled.
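
      The resulting decision table, written out as a small model (not the
      actual option parser):

          #include <stdio.h>

          enum acl_opt { ACL_UNSPECIFIED, ACL_ON, ACL_OFF };

          /* Returns 1 = enable acls, 0 = disable, -1 = fail the mount. */
          static int acl_mount_decision(enum acl_opt opt, int has_xattr_feature)
          {
              switch (opt) {
              case ACL_ON:
                  return has_xattr_feature ? 1 : -1; /* 'acl' needs xattr */
              case ACL_OFF:
                  return 0;
              default: /* nothing specified: follow the xattr feature */
                  return has_xattr_feature ? 1 : 0;
              }
          }

          int main(void)
          {
              printf("%d\n", acl_mount_decision(ACL_UNSPECIFIED, 1)); /*  1 */
              printf("%d\n", acl_mount_decision(ACL_ON, 0));          /* -1 */
              return 0;
          }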
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  16. 22 Sep, 2009 3 commits
  17. 04 Sep, 2009 5 commits
    • ocfs2: Pass struct ocfs2_caching_info to the journal functions. · 0cf2f763
      Joel Becker authored

      The next step in divorcing metadata I/O management from struct inode is
      to pass struct ocfs2_caching_info to the journal functions.  Thus the
      journal locks a metadata cache with the cache io_lock function.  It can
      also compare ci_last_trans and ci_created_trans directly.
      
      This is a large patch because of all the places we change
      ocfs2_journal_access..(handle, inode, ...) to
      ocfs2_journal_access..(handle, INODE_CACHE(inode), ...).
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: move ip_created_trans to struct ocfs2_caching_info · 292dd27e
      Joel Becker authored

      Similar to ip_last_trans, ip_created_trans tracks the creation of a
      journal-managed inode.  Specifically, it tracks which transaction created
      the inode.  This is so the code can know if the inode has ever been
      written to disk.
      
      This behavior is desirable for any journal managed object.  We move it
      to struct ocfs2_caching_info as ci_created_trans so that any object
      using ocfs2_caching_info can rely on this behavior.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: move ip_last_trans to struct ocfs2_caching_info · 66fb345d
      Joel Becker authored

      We have the read side of metadata caching isolated to struct
      ocfs2_caching_info, now we need the write side.  This means the journal
      functions.  The journal only does a couple of things with struct inode.
      
      This change moves the ip_last_trans field onto struct
      ocfs2_caching_info as ci_last_trans.  This field tells the journal
      whether a pending journal flush is required.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Change metadata caching locks to an operations structure. · 6e5a3d75
      Joel Becker authored

      We don't really want to cart around too many new fields on the
      ocfs2_caching_info structure.  So let's wrap all our access of the
      parent object in a set of operations.  One pointer on caching_info, and
      more flexibility to boot.
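
      The pattern itself is the classic ops table; a minimal sketch with
      invented names (the real ops and fields are in the ocfs2 headers):

          #include <stdio.h>

          struct caching_info;

          struct caching_ops {
              void (*io_lock)(struct caching_info *ci);
              void (*io_unlock)(struct caching_info *ci);
          };

          struct caching_info {
              const struct caching_ops *ci_ops;  /* the one new pointer */
          };

          static void inode_io_lock(struct caching_info *ci)
          {
              (void)ci;
              printf("inode cache locked\n");
          }

          static void inode_io_unlock(struct caching_info *ci)
          {
              (void)ci;
              printf("inode cache unlocked\n");
          }

          static const struct caching_ops inode_caching_ops = {
              inode_io_lock,
              inode_io_unlock,
          };

          int main(void)
          {
              struct caching_info ci = { &inode_caching_ops };

              /* Callers never need to know the parent is an inode. */
              ci.ci_ops->io_lock(&ci);
              ci.ci_ops->io_unlock(&ci);
              return 0;
          }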
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Make the ocfs2_caching_info structure self-contained. · 47460d65
      Joel Becker authored

      We want to use the ocfs2_caching_info structure in places that are not
      inodes.  To do that, it can no longer rely on referencing the inode
      directly.
      
      This patch moves the flags to ocfs2_caching_info->ci_flags, stores
      pointers to the parent's locks on the ocfs2_caching_info, and renames
      the constants and flags to reflect its independent state.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  18. 21 Jul, 2009 1 commit
    • ocfs2: Fix deadlock on umount · f7b1aa69
      Jan Kara authored

      In commit ea455f8a, we moved the dentry lock put process into ocfs2_wq.
      This causes problems during umount because ocfs2_wq can drop references
      to inodes while they are being invalidated by invalidate_inodes(),
      causing all sorts of nasty things (invalidate_inodes() ending in an
      infinite loop, "Busy inodes after umount" messages, etc.).
      
      We fix the problem by stopping ocfs2_wq from doing any further releasing
      of inode references on the superblock being unmounted, waiting until it
      finishes the current round of releasing, and finally cleaning up all the
      references in dentry_lock_list from ocfs2_put_super().
      
      The issue was tracked down by Tao Ma <tao.ma@oracle.com>.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  19. 22 Jun, 2009 2 commits
  20. 03 Jun, 2009 3 commits
    • ocfs2: Add statistics for the checksum and ecc operations. · 73be192b
      Joel Becker authored

      It would be nice to know how often we get checksum failures.  Even
      better, how many of them we can fix with the single bit ecc.  So, we add
      a statistics structure.  The structure can be installed into debugfs
      wherever the user wants.
      
      For ocfs2, we'll put it in the superblock-specific debugfs directory and
      pass it down from our higher-level functions.  The stats are only
      registered with debugfs when the filesystem supports metadata ecc.
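
      The structure amounts to a few counters; a stand-in sketch (the field
      names are assumed, and in the kernel these would be exported through
      debugfs rather than printed):

          #include <stdio.h>

          struct blockcheck_stats {
              unsigned long b_check_count;   /* blocks we validated */
              unsigned long b_failure_count; /* checksum failures seen */
              unsigned long b_recover_count; /* failures fixed by 1-bit ecc */
          };

          int main(void)
          {
              struct blockcheck_stats st = { 1024, 3, 2 };

              printf("checked %lu, failed %lu, recovered %lu\n",
                     st.b_check_count, st.b_failure_count,
                     st.b_recover_count);
              return 0;
          }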
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2 patch to track delayed orphan scan timer statistics · 15633a22
      Srinivas Eeda authored

      Patch to track delayed orphan scan timer statistics.
      
      Modifies ocfs2_osb_dump to print the following:
        Orphan Scan=> Local: 10  Global: 21  Last Scan: 67 seconds ago
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: timer to queue scan of all orphan slots · 83273932
      Srinivas Eeda authored

      When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
      before moving the dentry to the orphan directory. Other nodes that have
      this dentry in cache have a PR on the same dentry lock.  When the EX is
      requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
      during downconvert.  The inode is finally deleted when the last node to iput
      the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.
      
      A problem arises if a node is forced to free dentry locks because of memory
      pressure. If this happens, the node will no longer get downconvert
      notifications for the dentries that have been unlinked on another node.
      If that node also happens to be actively using the corresponding inode and
      to be the one performing the last iput on it, it will fail to delete the
      inode, as it will not have the MAYBE_ORPHANED flag set.
      
      This patch fixes this shortcoming by introducing a periodic scan of the
      orphan directories to delete such inodes. Care has been taken to distribute
      the workload across the cluster so that no one node has to perform the task
      all the time.
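
      A rough model of the distribution scheme (the kernel coordinates
      through a cluster lock; a shared sequence number plays that role here,
      and the names are invented):

          #include <stdio.h>

          static unsigned long global_seq;  /* stand-in for cluster-wide state */

          /* Every node's timer fires periodically, but a node only scans if
           * nobody else has scanned since it last looked. */
          static void orphan_scan_timer(const char *node, unsigned long *local_seq)
          {
              if (*local_seq == global_seq) {
                  global_seq++;
                  printf("%s: scanning all orphan slots\n", node);
              } else {
                  printf("%s: another node scanned; skipping\n", node);
              }
              *local_seq = global_seq;
          }

          int main(void)
          {
              unsigned long a = 0, b = 0;

              orphan_scan_timer("node A", &a);  /* scans */
              orphan_scan_timer("node B", &b);  /* skips */
              return 0;
          }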
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  21. 03 Apr, 2009 6 commits
    • ocfs2: recover orphans in offline slots during recovery and mount · 9140db04
      Srinivas Eeda authored

      During recovery, a node recovers orphans in its slot and the dead node(s).
      But if the dead nodes were holding orphans in offline slots, they will be
      left unrecovered.

      If the dead node is the last one to die, is holding orphans in other slots,
      and is the first one to mount again, then it only recovers its own slot,
      which leaves orphans in the offline slots.
      
      This patch queues complete_recovery to clean orphans for all offline slots
      during mount and node recovery.
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Acked-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: fix rare stale inode errors when exporting via nfs · 6ca497a8
      Wengang Wang authored

      For nfs exporting, ocfs2_get_dentry() returns the dentry for an fh.
      ocfs2_get_dentry() may read from disk when the inode is not in memory,
      without any cross-cluster lock. This can lead to the file system loading
      a stale inode.

      This patch fixes the above problem.

      The solution is that, when the inode is not in memory, we take the
      cluster lock (PR) on the alloc inode from which the inode in question
      was allocated (this makes the node on which the deletion was done sync
      the alloc inode) before reading out the inode itself. Then we check the
      bitmap in the group the inode in question was allocated from to see if
      the bit is clear. If it's clear, the inode is stale. If the bit is set,
      we then check the generation as the existing code does.

      We have to read the inode in question from disk first to know its alloc
      slot and alloc bit. If it's not stale, we read it out using ocfs2_iget();
      the second read should then come from cache.

      We also have to add a per-superblock nfs_sync_lock to cover the lock on
      the alloc inode and the one on the inode in question, because
      ocfs2_get_dentry() and ocfs2_delete_inode() lock them in reverse order.
      nfs_sync_lock is taken in EX mode in ocfs2_get_dentry() and in PR mode
      in ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can
      run concurrently in the normal case.
      
      [mfasheh@suse.com: build warning fixes and comment cleanups]
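
      The check order reduces to a small predicate (a model with invented
      names, not the patch code):

          #include <stdio.h>

          /* First the alloc bitmap bit, then the generation. */
          static int inode_is_stale(int alloc_bit_set,
                                    unsigned int gen_on_disk,
                                    unsigned int gen_in_fh)
          {
              if (!alloc_bit_set)
                  return 1;  /* bit clear: the inode was freed */
              return gen_on_disk != gen_in_fh;  /* reused: generation moved on */
          }

          int main(void)
          {
              printf("%d\n", inode_is_stale(0, 5, 5)); /* 1: bit clear */
              printf("%d\n", inode_is_stale(1, 6, 5)); /* 1: stale generation */
              printf("%d\n", inode_is_stale(1, 5, 5)); /* 0: valid */
              return 0;
          }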
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Acked-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: Optimize inode group allocation by recording last used group. · feb473a6
      Tao Ma authored

      In ocfs2, the block group search looks for the "emptiest" group
      to allocate from. So if the allocator has many equally (or almost
      equally) empty groups, new allocations will tend to get spread
      out amongst them.

      So we add osb_inode_alloc_group in ocfs2_super to record the last
      used inode allocation group. For more details, please see
      http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

      I have done some basic tests and the results are a ten-times improvement
      on some cold-cache stat workloads.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: fix leaf start calculation in ocfs2_dx_dir_rebalance() · 1d46dc08
      Mark Fasheh authored

      ocfs2_dx_dir_rebalance() is passed the block offset of a dx leaf which
      needs rebalancing. Since we rebalance an entire cluster at a time,
      however, this function needs to calculate the beginning of that cluster,
      in blocks. The calculation was wrong, which would result in a read of
      non-leaf blocks. Fix the calculation by adding
      ocfs2_block_to_cluster_start(), which is a more straightforward way of
      determining this.
      Reported-by: Tristan Ye <tristan.ye@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: Increase max links count · 198a1ca3
      Mark Fasheh authored

      Since we've now got a directory format capable of handling a large number of
      entries, we can increase the maximum link count supported. This only gets
      increased if the directory indexing feature is turned on.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Acked-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Add a name indexed b-tree to directory inodes · 9b7895ef
      Mark Fasheh authored

      This patch makes use of Ocfs2's flexible btree code to add an additional
      tree to directory inodes. The new tree stores an array of small,
      fixed-length records in each leaf block. Each record stores a hash value
      and a pointer to a block in the traditional (unindexed) directory tree
      where a dirent with the given name hash resides. Lookup exclusively uses
      this tree to find dirents, thus providing us with constant time name
      lookups.
      
      Some of the hashing code was copied from ext3. Unfortunately, it has lots of
      unfixed checkpatch errors. I left that as-is so that tracking changes would
      be easier.
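
      The lookup path in miniature (the hash below is djb2 rather than the
      ext3-derived hash the patch uses, and the modulo "index" stands in for
      the real leaf records):

          #include <stdio.h>

          #define NUM_DIR_BLOCKS 8

          static unsigned int name_hash(const char *name)
          {
              unsigned int h = 5381;  /* djb2, for illustration only */

              while (*name)
                  h = h * 33 + (unsigned char)*name++;
              return h;
          }

          /* dx index: name hash -> block in the unindexed directory tree.
           * Only that one block then needs to be searched for the dirent. */
          static unsigned int dx_lookup(unsigned int hash)
          {
              return hash % NUM_DIR_BLOCKS;
          }

          int main(void)
          {
              const char *name = "hello.txt";

              printf("dirent for \"%s\" lives in block %u\n",
                     name, dx_lookup(name_hash(name)));
              return 0;
          }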
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Acked-by: Joel Becker <joel.becker@oracle.com>