  1. 26 Jul, 2016 1 commit
  2. 11 Jul, 2016 1 commit
  3. 15 Jun, 2016 1 commit
    • tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy authored
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of 399 as before.
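
The figures quoted above can be checked with a few lines of C. This is only a sketch of the arithmetic, not code from the patch: the function names are made up for illustration, and the split into sqrt(N) - 1 local domain members plus an equal number of remote domain heads is an assumption chosen to match the 38-of-400 example.

```c
#include <assert.h>

/*
 * Integer square root by linear search. Adequate for small inputs such
 * as cluster sizes; not overflow-safe for n close to UINT_MAX.
 */
static unsigned int isqrt(unsigned int n)
{
    unsigned int r = 0;

    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}

/* Full mesh: every node actively monitors all of its N - 1 peers. */
static unsigned int full_mesh_monitored(unsigned int n)
{
    assert(n > 0);
    return n - 1;
}

/*
 * Ring supervision: sqrt(N) - 1 local domain members plus (assumed
 * here) the same number of remote domain heads.
 */
static unsigned int ring_monitored(unsigned int n)
{
    assert(n > 0);
    return 2 * (isqrt(n) - 1);
}
```

With N = 400 the integer square root is 20, so ring_monitored() yields 2 * 19 = 38, against 399 for the full mesh.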
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its domain record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the members of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (the default
        threshold is 32), the new algorithm is effectively disabled. A patch
        set for altering the threshold value and for listing the table
        contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
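
The local-domain selection step and the threshold switch described in this commit message might be sketched as follows. This is not the kernel implementation: all names are hypothetical, the sorted list is assumed to include the node itself, and "grows beyond" is read as strictly greater than the threshold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integer square root by linear search; fine for cluster-sized inputs. */
static size_t isqrt(size_t n)
{
    size_t r = 0;

    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}

/*
 * Fill 'domain' with the ids of the M = sqrt(N) - 1 nodes that follow
 * position 'self' in the ascending, circular list 'ids'. Returns M.
 * 'domain' must have room for at least sqrt(N) - 1 entries.
 */
static size_t select_local_domain(const uint32_t *ids, size_t n,
                                  size_t self, uint32_t *domain)
{
    size_t m, i;

    assert(n > 0 && self < n);
    m = isqrt(n) - 1;
    for (i = 0; i < m; i++)
        domain[i] = ids[(self + 1 + i) % n];   /* wrap around the ring */
    return m;
}

/* Ring monitoring applies only once the cluster outgrows the threshold. */
static int use_ring_monitoring(size_t cluster_size, size_t threshold)
{
    return cluster_size > threshold;
}
```

In a 9-node list with ids 1 to 9, the node at index 7 (id 8) gets M = 2 and selects ids 9 and 1, which shows the wrap-around at the end of the list.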
  4. 13 Apr, 2016 1 commit
  5. 20 Nov, 2015 1 commit
  6. 24 Oct, 2015 4 commits
    • tipc: clean up unused code and structures · 2af5ae37
      Jon Paul Maloy authored
      After the previous changes in this series, we can now remove some
      unused code and structures in the broadcast, link aggregation and
      link code.
      
      There are no functional changes in this commit.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: let neighbor discoverer transmit consumable buffers · 60852d67
      Jon Paul Maloy authored
      The neighbor discovery function currently uses the function
      tipc_bearer_send() for transmitting packets, assuming that the
      sent buffers are not consumed by the called function.
      
      We want to change this, in order to avoid unnecessary buffer cloning
      elsewhere in the code.
      
      This commit introduces a new function tipc_bearer_skb() which consumes
      the sent buffers, and lets the discoverer functions use this new call
      instead. The discoverer now performs the cloning itself when that is
      necessary.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
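
The ownership change described above can be illustrated with a small user-space model. It does not use the kernel's buffer API: struct buf and every function below are hypothetical stand-ins, and cloning is modelled as a plain copy. The point is only that a consuming send releases the buffer it is given, so a caller that wants to keep its message has to pass a clone.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    size_t len;
    unsigned char data[64];
};

static int live_bufs;   /* number of buffers currently allocated */

static struct buf *buf_alloc(const void *src, size_t len)
{
    struct buf *b;

    assert(len <= sizeof(b->data));
    b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    memcpy(b->data, src, len);
    b->len = len;
    live_bufs++;
    return b;
}

static struct buf *buf_clone(const struct buf *b)
{
    return buf_alloc(b->data, b->len);
}

/* Consuming send: takes ownership of 'b' and releases it when done. */
static void send_consume(struct buf *b)
{
    /* ... the bytes would be handed to the medium here ... */
    free(b);
    live_bufs--;
}

/*
 * A sender that wants to reuse the same message later keeps its
 * original and passes a clone to the consuming call.
 */
static int send_keep_original(const struct buf *orig)
{
    struct buf *copy = buf_clone(orig);

    if (!copy)
        return -1;
    send_consume(copy);
    return 0;
}
```

Callers that do not need the buffer afterwards call send_consume() directly and pay for no copy at all, which is the saving the commit message is after.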
    • tipc: introduce jumbo frame support for broadcast · 959e1781
      Jon Paul Maloy authored
      Until now, we have only been supporting a fixed MTU size of 1500 bytes
      for all broadcast media, irrespective of their actual capability.
      
      We now make the broadcast MTU adaptable to the carrying media, i.e.,
      we use the smallest MTU supported by any of the interfaces attached
      to TIPC.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
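
The rule stated above, that the broadcast MTU is the smallest MTU of any attached interface, amounts to a minimum over a list. A minimal sketch, with a hypothetical function name and the interface MTUs passed in as a plain array:

```c
#include <assert.h>
#include <stddef.h>

/* Return the smallest MTU among the n attached interfaces. */
static unsigned int bcast_mtu(const unsigned int *mtus, size_t n)
{
    unsigned int min;
    size_t i;

    assert(n > 0);
    min = mtus[0];
    for (i = 1; i < n; i++)
        if (mtus[i] < min)
            min = mtus[i];
    return min;
}
```

If every interface supports jumbo frames the broadcast MTU follows; a single 1500-byte interface keeps it at 1500.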
    • tipc: simplify bearer level broadcast · b06b281e
      Jon Paul Maloy authored
      Until now, we have been keeping track of the exact set of broadcast
      destinations through the helper structure tipc_node_map. This leads us to
      have to maintain a whole infrastructure for supporting this, including
      a pseudo-bearer and a number of functions to manipulate both the bearers
      and the node map correctly. Apart from the complexity, this approach is
      also limiting, as struct tipc_node_map only can support cluster local
      broadcast if we want to avoid it becoming excessively large. We want to
      eliminate this limitation, in order to enable introduction of scoped
      multicast in the future.
      
      A closer analysis reveals that maintaining this "full set" overview is
      unnecessary; it is sufficient to keep a counter per bearer, indicating
      how many nodes can be reached via this bearer at the moment. The protocol
      is now robust enough to handle transitional discrepancies between the
      nominal number of reachable destinations, as expected by the broadcast
      protocol itself, and the number which is actually reachable at the
      moment. The initial broadcast synchronization, in conjunction with the
      retransmission mechanism, ensures that all packets will eventually be
      acknowledged by the correct set of destinations.
      
      This commit introduces these changes.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
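
Replacing the node map with a per-bearer counter could look roughly like the sketch below. The struct and function names are hypothetical, and using the counter to decide whether a broadcast on that bearer is worthwhile is an assumed use, not something the commit message spells out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bearer stub: only the reachable-destination counter. */
struct bearer_stub {
    unsigned int dests;
};

/* A link over this bearer came up: one more node is reachable. */
static void stub_add_dest(struct bearer_stub *b)
{
    b->dests++;
}

/* A link over this bearer went down. */
static void stub_remove_dest(struct bearer_stub *b)
{
    assert(b->dests > 0);
    b->dests--;
}

/* Assumed use: skip broadcasting when nobody is reachable via 'b'. */
static bool stub_should_bcast(const struct bearer_stub *b)
{
    return b->dests > 0;
}
```

The counter says how many nodes are reachable but not which ones; as the message notes, the broadcast protocol's own synchronization and retransmission are relied on to cover transient mismatches.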
  7. 20 Jul, 2015 1 commit
    • tipc: make media xmit call outside node spinlock context · af9b028e
      Jon Paul Maloy authored
      Currently, message sending is performed through a deep call chain,
      where the node spinlock is grabbed and held during a significant
      part of the transmission time. This is clearly detrimental to
      overall throughput performance; it would be better if we could send
      the message after the spinlock has been released.
      
      In this commit, we instead let the call chain return up the stack after
      the buffer chain has been added to the transmission queue, whereafter
      clones of the buffers are transmitted to the device layer outside the
      spinlock scope.
      
      As a further step in our effort to separate the roles of the node
      and link entities we also move the function tipc_link_xmit() to
      node.c, and rename it to tipc_node_xmit().
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
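
The pattern described above, queue under the lock and transmit after releasing it, can be modelled in user space as below. This is a single-threaded sketch under stated assumptions: the lock is a plain flag standing in for the node spinlock, packets are integers, "clones" are local copies, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN 8

static int lock_held;        /* stand-in for the node spinlock */
static int transmq[QLEN];    /* transmission queue, kept for retransmit */
static size_t transmq_len;
static size_t delivered;     /* packets handed to the device layer */

static void node_lock(void)   { assert(!lock_held); lock_held = 1; }
static void node_unlock(void) { assert(lock_held);  lock_held = 0; }

/* Device-layer transmit; must never run with the node lock held. */
static void media_xmit(int pkt)
{
    (void)pkt;
    assert(!lock_held);
    delivered++;
}

static void node_xmit(const int *pkts, size_t n)
{
    int xmitq[QLEN];         /* local copies, playing the role of clones */
    size_t i, cnt = 0;

    /* Short critical section: only queue manipulation happens here. */
    node_lock();
    for (i = 0; i < n && transmq_len < QLEN; i++) {
        transmq[transmq_len++] = pkts[i];
        xmitq[cnt++] = pkts[i];
    }
    node_unlock();

    /* The slow part runs after the lock has been dropped. */
    for (i = 0; i < cnt; i++)
        media_xmit(xmitq[i]);
}
```

The assertion in media_xmit() makes the invariant explicit: any attempt to transmit while the lock is held would abort the program.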
  8. 14 May, 2015 1 commit
    • tipc: simplify include dependencies · a6bf70f7
      Jon Paul Maloy authored
      When we try to add new inline functions in the code, we sometimes
      run into circular include dependencies.
      
      The main problem is that the file core.h, which really should be at
      the root of the dependency chain, instead is a leaf. I.e., core.h
      includes a number of header files that themselves should be allowed
      to include core.h. In reality this is unnecessary, because core.h does
      not need to know the full signature of any of the structs it refers to,
      only their type declaration.
      
      In this commit, we remove all dependencies from core.h towards any
      other tipc header file.
      
      As a consequence of this change, we can now move the function
      tipc_own_addr(net) from addr.c to addr.h, and make it inline.
      
      There are no functional changes in this commit.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
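
The observation that a root header only needs type declarations, not full definitions, is ordinary C: a forward declaration is enough as long as the struct is only used through a pointer. The single-file sketch below imitates the header split with comments; the struct and function names are invented for illustration and are not the actual contents of core.h or addr.h.

```c
#include <assert.h>

/* --- what a root header needs: a forward declaration only --- */
struct bearer_stub;                 /* no #include of the bearer header */

struct core_ctx {
    struct bearer_stub *bearer;     /* pointer to incomplete type is fine */
    unsigned int own_addr;
};

/* Small accessor that can live inline in a header. */
static inline unsigned int ctx_own_addr(const struct core_ctx *c)
{
    return c->own_addr;
}

/* --- full definition, as it would appear in the dependent header --- */
struct bearer_stub {
    unsigned int mtu;
};
```

Because the root header no longer includes the dependent one, the dependent header is free to include the root header without creating a cycle.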
  9. 05 Mar, 2015 1 commit
  10. 27 Feb, 2015 2 commits
  11. 09 Feb, 2015 3 commits
  12. 12 Jan, 2015 4 commits
  13. 26 Nov, 2014 1 commit
  14. 21 Nov, 2014 5 commits
  15. 14 May, 2014 1 commit
    • tipc: improve and extend media address conversion functions · 38504c28
      Jon Paul Maloy authored
      TIPC currently handles two media specific addresses: Ethernet MAC
      addresses and InfiniBand addresses. Those are kept in three different
      formats:
      
      1) A "raw" format as obtained from the device. This format is known
         only by the media specific adapter code in eth_media.c and
         ib_media.c.
      2) A "generic" internal format, in the form of struct tipc_media_addr,
         which can be referenced and passed around by the generic media-
         unaware code.
      3) A serialized version of the latter, to be conveyed in neighbor
         discovery messages.
      
      Conversion between the three formats can only be done by the media
      specific code, so we have function pointers for this purpose in
      struct tipc_media. Here, the media adapters can install their own
      conversion functions at startup.
      
      We now introduce a new such function, 'raw2addr()', whose purpose
      is to convert from format 1 to format 2 above. We also try, as far
      as possible, to make the commenting, variable names and usage of these
      functions uniform, with the purpose of making them more comprehensible.
      
      We can now also remove the function tipc_l2_media_addr_set(), whose
      job is done better by the new function.
      
      Finally, we expand the field for serialized addresses (format 3)
      in discovery messages from 20 to 32 bytes. This is permitted
      according to the spec, and reduces the risk of problems when we
      add new media in the future.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
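
The arrangement of per-media conversion functions installed as function pointers might be sketched like this. The layout is an assumption made for illustration: struct media_addr and struct media_ops are not the kernel's struct tipc_media_addr and struct tipc_media, the media id value is arbitrary, and only the 32-byte field size and the raw2addr name are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDR_FIELD_LEN 32   /* serialized field size named in the text */
#define ETH_RAW_LEN    6    /* length of an Ethernet MAC address */
#define MEDIA_ID_ETH   1    /* arbitrary id, chosen for this sketch */

struct media_addr {         /* format 2: generic internal form */
    uint8_t value[ADDR_FIELD_LEN];
    uint8_t media_id;
};

struct media_ops {          /* filled in by each media adapter */
    uint8_t media_id;
    int (*raw2addr)(struct media_addr *dst, const uint8_t *raw);
    int (*addr2msg)(uint8_t *msg, const struct media_addr *src);
};

/* Format 1 (raw MAC) to format 2 (generic). */
static int eth_raw2addr(struct media_addr *dst, const uint8_t *raw)
{
    memset(dst, 0, sizeof(*dst));
    memcpy(dst->value, raw, ETH_RAW_LEN);
    dst->media_id = MEDIA_ID_ETH;
    return 0;
}

/* Format 2 (generic) to format 3 (serialized, zero padded). */
static int eth_addr2msg(uint8_t *msg, const struct media_addr *src)
{
    memset(msg, 0, ADDR_FIELD_LEN);
    memcpy(msg, src->value, ETH_RAW_LEN);
    return 0;
}

static const struct media_ops eth_ops = {
    .media_id = MEDIA_ID_ETH,
    .raw2addr = eth_raw2addr,
    .addr2msg = eth_addr2msg,
};
```

Generic code only ever calls through the ops table, so adding a new medium means supplying another table rather than touching the callers.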
  16. 22 Apr, 2014 3 commits
  17. 28 Mar, 2014 1 commit
  18. 27 Mar, 2014 2 commits
  19. 13 Feb, 2014 2 commits
    • tipc: remove bearer_lock from tipc_bearer struct · a8304529
      Ying Xue authored
      After the earlier commits ("tipc: remove 'links' list from
      tipc_bearer struct") and ("tipc: introduce new spinlock to protect
      struct link_req"), there is no longer any need to protect struct
      link_req or any link list by use of bearer_lock. Furthermore,
      we have eliminated the need for using bearer_lock during downcalls
      (send) from the link to the bearer, since we have ensured that
      bearers always have a longer life cycle than their associated links,
      and always contain valid data.
      
      So, the only need now for a lock protecting bearers is for guaranteeing
      consistency of the bearer list itself. For this, it is sufficient, at
      least for the time being, to continue applying 'net_lock' in write mode.
      
      By removing bearer_lock we also pre-empt introduction of issue b) described
      in the previous commit "tipc: remove 'links' list from tipc_bearer struct":
      
      "b) When the outer protection from net_lock is gone, taking
          bearer_lock and node_lock in opposite order of method 1) and 2)
          will become an obvious deadlock hazard".
      
      Therefore, we now eliminate the bearer_lock spinlock.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: remove 'links' list from tipc_bearer struct · c61dd61d
      Ying Xue authored
      In our ongoing effort to simplify the TIPC locking structure,
      we see a need to remove the linked list for tipc_links
      in the bearer. This can be explained as follows.
      
      Currently, we have three different ways to access a link,
      via three different lists/tables:
      
      1: Via a node hash table:
         Used by the time-critical outgoing/incoming data paths.
         (e.g. link_send_sections_fast() and tipc_recv_msg() ):
      
      grab net_lock(read)
         find node from node hash table
         grab node_lock
             select link
             grab bearer_lock
                send_msg()
             release bearer_lock
         release node_lock
      release net_lock
      
      2: Via a global linked list for nodes:
         Used by configuration commands (link_cmd_set_value())
      
      grab net_lock(read)
         find node and link from global node list (using link name)
         grab node_lock
             update link
         release node_lock
      release net_lock
      
      (Same locking order as above. No problem.)
      
      3: Via the bearer's linked link list:
         Used by notifications from interface (e.g. tipc_disable_bearer() )
      
      grab net_lock(write)
         grab bearer_lock
            get link ptr from bearer's link list
            get node from link
            grab node_lock
               delete link
            release node_lock
         release bearer_lock
      release net_lock
      
      (Different order from above, but works because we grab the
      outer net_lock in write mode first, excluding all other access.)
      
      The first major goal in our simplification effort is to get rid
      of the "big" net_lock, replacing it with rcu-locks when accessing
      the node list and node hash array. This will come in a later patch
      series.
      
      But to get there we first need to rewrite access methods #2 and #3,
      since removal of net_lock would introduce three major problems:
      
      a) In access method #2, we access the link before taking the
         protecting node_lock. This will not work once net_lock is gone,
         so we will have to change the access order. We will deal with
         this in a later commit in this series, "tipc: add node lock
         protection to link found by link_find_link()".
      
      b) When the outer protection from net_lock is gone, taking
         bearer_lock and node_lock in opposite order of method 1) and 2)
         will become an obvious deadlock hazard. This is fixed in the
         commit ("tipc: remove bearer_lock from tipc_bearer struct")
         later in this series.
      
      c) Similar to what is described in problem a), access method #3
         starts with using a link pointer that is unprotected by node_lock,
         in order to via that pointer find the correct node struct and
         lock it. Before we remove net_lock, this access order must be
         altered. This is what we do with this commit.
      
      We can avoid introducing problem c) by using the global node list
      here as well to find the node before accessing its links. When we
      loop through the node list we use the bearer's own identity as the
      search criterion, thus easily finding the links that are associated
      with the resetting/disabling bearer. It should be noted that although this
      method is somewhat slower than the current list traversal, it is in
      no way time critical. This is only about resetting or deleting links,
      something that must be considered relatively infrequent events.
      
      As a bonus, we can get rid of the mutual pointers between links and
      bearers. After this commit, pointer dependencies go in one direction
      only: from the link to the bearer.
      
      This commit pre-empts introduction of problem c) as described above.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
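
The access order this commit moves to, finding and locking the node first and only then touching the links that belong to a given bearer, can be sketched as follows. It is a single-threaded model under stated assumptions: the node lock is a flag, each node has a fixed two-slot link array, and all struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LINKS_PER_NODE 2

struct link_stub {
    int bearer_id;          /* one-way reference: link to bearer */
    int up;
};

struct node_stub {
    int locked;             /* stand-in for node_lock */
    struct link_stub links[LINKS_PER_NODE];
};

static void node_stub_lock(struct node_stub *n)
{
    assert(!n->locked);
    n->locked = 1;
}

static void node_stub_unlock(struct node_stub *n)
{
    assert(n->locked);
    n->locked = 0;
}

/*
 * Take down every link that runs over 'bearer_id'. Each node is found
 * through the node list and locked first; its links are only touched
 * under that lock. Returns the number of links taken down.
 */
static size_t reset_bearer_links(struct node_stub *nodes, size_t n_nodes,
                                 int bearer_id)
{
    size_t i, j, cnt = 0;

    for (i = 0; i < n_nodes; i++) {
        node_stub_lock(&nodes[i]);
        for (j = 0; j < LINKS_PER_NODE; j++) {
            struct link_stub *l = &nodes[i].links[j];

            if (l->up && l->bearer_id == bearer_id) {
                l->up = 0;
                cnt++;
            }
        }
        node_stub_unlock(&nodes[i]);
    }
    return cnt;
}
```

The walk visits every node even when only a few have links on the bearer, which is the "somewhat slower" cost the message accepts for an infrequent operation.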
  20. 07 Jan, 2014 1 commit
  21. 04 Jan, 2014 1 commit
  22. 10 Dec, 2013 2 commits
    • tipc: eliminate code duplication in media layer · e4d050cb
      Ying Xue authored
      Currently TIPC supports two L2 media types, Ethernet and Infiniband.
      Because both these media are accessed through the common net_device API,
      several functions in the two media adaptation files turn out to be
      fully or almost identical, leading to unnecessary code duplication.
      
      In this commit we extract this common code from the two media files
      and move it to the generic bearer.c. Additionally, we change
      the function names to reflect their real role: to access L2 media,
      irrespective of type.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: relocate common functions from media to bearer · 6e967adf
      Ying Xue authored
      Currently, registering a TIPC stack handler in the network device layer
      is done twice, once each for Ethernet (eth_media) and Infiniband
      (ib_media). But, as this registration is not media specific, we can
      avoid some code duplication by moving the registering function to
      the generic bearer layer, to the file bearer.c, and call it only once.
      The same is true for the network device event notifier.
      
      As a side effect, the two workqueues we are using for setting up/
      cleaning up media can now be eliminated. Furthermore, the array for
      storing the specific media type structs, media_array[], can be entirely
      deleted.
      
      Note that the eth_started and ib_started flags were removed during the
      code relocation.  There is now only one call to bearer_setup and
      bearer_cleanup, and these can logically not race against each other.
      
      Despite its size, this cleanup work incurs no functional changes in TIPC.
      In particular, it should be noted that the sequence ordering of received
      packets is unaffected by this change, since packet reception never was
      subject to any work queue handling in the first place.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>