1. 01 Sep, 2016 1 commit
    • Parthasarathy Bhuvaragan's avatar
      tipc: fix random link resets while adding a second bearer · d2f394dc
      Parthasarathy Bhuvaragan authored
      In a dual bearer configuration, if the second tipc link becomes
      active while the first link still has pending nametable "bulk"
      updates, it randomly leads to reset of the second link.
      
      When a link is established, the function named_distribute(),
      fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
      with NAME_DISTRIBUTOR message for each PUBLICATION.
      However, the function named_distribute() allocates the buffer by
      increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
      This consumes the space allocated for TUNNEL_PROTOCOL.
      
      When establishing the second link, the link shall tunnel all the
      messages in the first link queue including the "bulk" update.
      As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
      the link mtu the transmission fails (-EMSGSIZE).
      
      Thus, the synch point based on the message count of the tunnel
      packets is never reached leading to link timeout.
      
      In this commit, we adjust the size of name distributor message so that
      they can be tunnelled.
      Reviewed-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d2f394dc
  2. 25 Aug, 2016 1 commit
  3. 15 Aug, 2016 1 commit
    • Vegard Nossum's avatar
      tipc: fix NULL pointer dereference in shutdown() · d2fbdf76
      Vegard Nossum authored
      tipc_msg_create() can return a NULL skb and if so, we shouldn't try to
      call tipc_node_xmit_skb() on it.
      
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          CPU: 3 PID: 30298 Comm: trinity-c0 Not tainted 4.7.0-rc7+ #19
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          task: ffff8800baf09980 ti: ffff8800595b8000 task.ti: ffff8800595b8000
          RIP: 0010:[<ffffffff830bb46b>]  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
          RSP: 0018:ffff8800595bfce8  EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003023b0e0
          RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffff83d12580
          RBP: ffff8800595bfd78 R08: ffffed000b2b7f32 R09: 0000000000000000
          R10: fffffbfff0759725 R11: 0000000000000000 R12: 1ffff1000b2b7f9f
          R13: ffff8800595bfd58 R14: ffffffff83d12580 R15: dffffc0000000000
          FS:  00007fcdde242700(0000) GS:ffff88011af80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fcddde1db10 CR3: 000000006874b000 CR4: 00000000000006e0
          DR0: 00007fcdde248000 DR1: 00007fcddd73d000 DR2: 00007fcdde248000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
          Stack:
           0000000000000018 0000000000000018 0000000041b58ab3 ffffffff83954208
           ffffffff830bb400 ffff8800595bfd30 ffffffff8309d767 0000000000000018
           0000000000000018 ffff8800595bfd78 ffffffff8309da1a 00000000810ee611
          Call Trace:
           [<ffffffff830c84a3>] tipc_shutdown+0x553/0x880
           [<ffffffff825b4a3b>] SyS_shutdown+0x14b/0x170
           [<ffffffff8100334c>] do_syscall_64+0x19c/0x410
           [<ffffffff83295ca5>] entry_SYSCALL64_slow_path+0x25/0x25
          Code: 90 00 b4 0b 83 c7 00 f1 f1 f1 f1 4c 8d 6d e0 c7 40 04 00 00 00 f4 c7 40 08 f3 f3 f3 f3 48 89 d8 48 c1 e8 03 c7 45 b4 00 00 00 00 <80> 3c 30 00 75 78 48 8d 7b 08 49 8d 75 c0 48 b8 00 00 00 00 00
          RIP  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
           RSP <ffff8800595bfce8>
          ---[ end trace 57b0484e351e71f1 ]---
      
      I feel like we should maybe return -ENOMEM or -ENOBUFS, but I'm not sure
      userspace is equipped to handle that. Anyway, this is better than a GPF
      and looks somewhat consistent with other tipc_msg_create() callers.
      Signed-off-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Acked-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d2fbdf76
  4. 10 Aug, 2016 1 commit
  5. 30 Jul, 2016 1 commit
  6. 26 Jul, 2016 5 commits
  7. 11 Jul, 2016 3 commits
    • Jon Paul Maloy's avatar
      tipc: reset all unicast links when broadcast send link fails · 1fc07f3e
      Jon Paul Maloy authored
      In test situations with many nodes and a heavily stressed system we have
      observed that the transmission broadcast link may fail due to an
      excessive number of retransmissions of the same packet. In such
      situations we need to reset all unicast links to all peers, in order to
      reset and re-synchronize the broadcast link.
      
      In this commit, we add a new function tipc_bearer_reset_all() to be used
      in such situations. The function scans across all bearers and resets all
      their pertaining links.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1fc07f3e
    • Jon Paul Maloy's avatar
      tipc: ensure correct broadcast send buffer release when peer is lost · a71eb720
      Jon Paul Maloy authored
      After a new receiver peer has been added to the broadcast transmission
      link, we allow immediate transmission of new broadcast packets, trusting
      that the new peer will not accept the packets until it has received the
      previously sent unicast broadcast initialiation message. In the same
      way, the sender must not accept any acknowledges until it has itself
      received the broadcast initialization from the peer, as well as
      confirmation of the reception of its own initialization message.
      
      Furthermore, when a receiver peer goes down, the sender has to produce
      the missing acknowledges from the lost peer locally, in order ensure
      correct release of the buffers that were expected to be acknowledged by
      the said peer.
      
      In a highly stressed system we have observed that contact with a peer
      may come up and be lost before the above mentioned broadcast initial-
      ization and confirmation have been received. This leads to the locally
      produced acknowledges being rejected, and the non-acknowledged buffers
      to linger in the broadcast link transmission queue until it fills up
      and the link goes into permanent congestion.
      
      In this commit, we remedy this by temporarily setting the corresponding
      broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
      state to true before we issue the local acknowledges. This ensures that
      those acknowledges will always be accepted. The mentioned state values
      are restored immediately afterwards when the link is reset.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a71eb720
    • Jon Paul Maloy's avatar
      tipc: extend broadcast link initialization criteria · 2d18ac4b
      Jon Paul Maloy authored
      At first contact between two nodes, an endpoint might sometimes have
      time to send out a LINK_PROTOCOL/STATE packet before it has received
      the broadcast initialization packet from the peer, i.e., before it has
      received a valid broadcast packet number to add to the 'bc_ack' field
      of the protocol message.
      
      This means that the peer endpoint will receive a protocol packet with an
      invalid broadcast acknowledge value of 0. Under unlucky circumstances
      this may lead to the original, already received acknowledge value being
      overwritten, so that the whole broadcast link goes stale after a while.
      
      We fix this by delaying the setting of the link field 'bc_peer_is_up'
      until we know that the peer really has received our own broadcast
      initialization message. The latter is always sent out as the first
      unicast message on a link, and always with seqeunce number 1. Because
      of this, we only need to look for a non-zero unicast acknowledge value
      in the arriving STATE messages, and once that is confirmed we know we
      are safe and can set the mentioned field. Before this moment, we must
      ignore all broadcast acknowledges from the peer.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d18ac4b
  8. 01 Jul, 2016 1 commit
    • Richard Alpe's avatar
      tipc: fix nl compat regression for link statistics · 55e77a3e
      Richard Alpe authored
      Fix incorrect use of nla_strlcpy() where the first NLA_HDRLEN bytes
      of the link name where left out.
      
      Making the output of tipc-config -ls look something like:
      Link statistics:
      dcast-link
      1:data0-1.1.2:data0
      1:data0-1.1.3:data0
      
      Also, for the record, the patch that introduce this regression
      claims "Sending the whole object out can cause a leak". Which isn't
      very likely as this is a compat layer, where the data we are parsing
      is generated by us and we know the string to be NULL terminated. But
      you can of course never be to secure.
      
      Fixes: 5d2be142 (tipc: fix an infoleak in tipc_nl_compat_link_dump)
      Signed-off-by: default avatarRichard Alpe <richard.alpe@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      55e77a3e
  9. 29 Jun, 2016 2 commits
  10. 27 Jun, 2016 1 commit
  11. 22 Jun, 2016 1 commit
    • Jon Paul Maloy's avatar
      tipc: unclone unbundled buffers before forwarding · 27777daa
      Jon Paul Maloy authored
      When extracting an individual message from a received "bundle" buffer,
      we just create a clone of the base buffer, and adjust it to point into
      the right position of the linearized data area of the latter. This works
      well for regular message reception, but during periods of extremely high
      load it may happen that an extracted buffer, e.g, a connection probe, is
      reversed and forwarded through an external interface while the preceding
      extracted message is still unhandled. When this happens, the header or
      data area of the preceding message will be partially overwritten by a
      MAC header, leading to unpredicatable consequences, such as a link
      reset.
      
      We now fix this by ensuring that the msg_reverse() function never
      returns a cloned buffer, and that the returned buffer always contains
      sufficient valid head and tail room to be forwarded.
      Reported-by: default avatarErik Hugne <erik.hugne@gmail.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      27777daa
  12. 17 Jun, 2016 2 commits
    • Jon Paul Maloy's avatar
      tipc: fix socket timer deadlock · f1d048f2
      Jon Paul Maloy authored
      We sometimes observe a 'deadly embrace' type deadlock occurring
      between mutually connected sockets on the same node. This happens
      when the one-hour peer supervision timers happen to expire
      simultaneously in both sockets.
      
      The scenario is as follows:
      
      CPU 1:                          CPU 2:
      --------                        --------
      tipc_sk_timeout(sk1)            tipc_sk_timeout(sk2)
        lock(sk1.slock)                 lock(sk2.slock)
        msg_create(probe)               msg_create(probe)
        unlock(sk1.slock)               unlock(sk2.slock)
        tipc_node_xmit_skb()            tipc_node_xmit_skb()
          tipc_node_xmit()                tipc_node_xmit()
            tipc_sk_rcv(sk2)                tipc_sk_rcv(sk1)
              lock(sk2.slock)                 lock((sk1.slock)
              filter_rcv()                    filter_rcv()
                tipc_sk_proto_rcv()             tipc_sk_proto_rcv()
                  msg_create(probe_rsp)           msg_create(probe_rsp)
                  tipc_sk_respond()               tipc_sk_respond()
                    tipc_node_xmit_skb()            tipc_node_xmit_skb()
                      tipc_node_xmit()                tipc_node_xmit()
                        tipc_sk_rcv(sk1)                tipc_sk_rcv(sk2)
                          lock((sk1.slock)                lock((sk2.slock)
                          ===> DEADLOCK                   ===> DEADLOCK
      
      Further analysis reveals that there are three different locations in the
      socket code where tipc_sk_respond() is called within the context of the
      socket lock, with ensuing risk of similar deadlocks.
      
      We now solve this by passing a buffer queue along with all upcalls where
      sk_lock.slock may potentially be held. Response or rejected message
      buffers are accumulated into this queue instead of being sent out
      directly, and only sent once we know we are safely outside the slock
      context.
      Reported-by: default avatarGUNA <gbalasun@gmail.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f1d048f2
    • Dan Carpenter's avatar
      tipc: potential shift wrapping bug in map_set() · 0350cb48
      Dan Carpenter authored
      "up_map" is a u64 type but we're not using the high 32 bits.
      
      Fixes: 35c55c98 ('tipc: add neighbor monitoring framework')
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0350cb48
  13. 15 Jun, 2016 3 commits
    • Ying Xue's avatar
      tipc: eliminate uninitialized variable warning · c91522f8
      Ying Xue authored
      net/tipc/link.c: In function ‘tipc_link_timeout’:
      net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]
      
      Fixes: 42b18f60 ("tipc: refactor function tipc_link_timeout()")
      Acked-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c91522f8
    • Ying Xue's avatar
      tipc: fix suspicious RCU usage · 66d95b67
      Ying Xue authored
      When run tipcTS&tipcTC test suite, the following complaint appears:
      
      [   56.926168] ===============================
      [   56.926169] [ INFO: suspicious RCU usage. ]
      [   56.926171] 4.7.0-rc1+ #160 Not tainted
      [   56.926173] -------------------------------
      [   56.926174] net/tipc/bearer.c:408 suspicious rcu_dereference_protected() usage!
      [   56.926175]
      [   56.926175] other info that might help us debug this:
      [   56.926175]
      [   56.926177]
      [   56.926177] rcu_scheduler_active = 1, debug_locks = 1
      [   56.926179] 3 locks held by swapper/4/0:
      [   56.926180]  #0:  (((&req->timer))){+.-...}, at: [<ffffffff810e79b5>] call_timer_fn+0x5/0x340
      [   56.926203]  #1:  (&(&req->lock)->rlock){+.-...}, at: [<ffffffffa000c29b>] disc_timeout+0x1b/0xd0 [tipc]
      [   56.926212]  #2:  (rcu_read_lock){......}, at: [<ffffffffa00055e0>] tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
      [   56.926218]
      [   56.926218] stack backtrace:
      [   56.926221] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.7.0-rc1+ #160
      [   56.926222] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      [   56.926224]  0000000000000000 ffff880016803d28 ffffffff813c4423 ffff8800154252c0
      [   56.926227]  0000000000000001 ffff880016803d58 ffffffff810b7512 ffff8800124d8120
      [   56.926230]  ffff880013f8a160 ffff8800132b5ccc ffff8800124d8120 ffff880016803d88
      [   56.926234] Call Trace:
      [   56.926235]  <IRQ>  [<ffffffff813c4423>] dump_stack+0x67/0x94
      [   56.926250]  [<ffffffff810b7512>] lockdep_rcu_suspicious+0xe2/0x120
      [   56.926256]  [<ffffffffa00051f1>] tipc_l2_send_msg+0x131/0x1c0 [tipc]
      [   56.926261]  [<ffffffffa000567c>] tipc_bearer_xmit_skb+0x14c/0x2e0 [tipc]
      [   56.926266]  [<ffffffffa00055e0>] ? tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
      [   56.926273]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
      [   56.926278]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
      [   56.926283]  [<ffffffffa000c2d6>] disc_timeout+0x56/0xd0 [tipc]
      [   56.926288]  [<ffffffff810e7a68>] call_timer_fn+0xb8/0x340
      [   56.926291]  [<ffffffff810e79b5>] ? call_timer_fn+0x5/0x340
      [   56.926296]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
      [   56.926300]  [<ffffffff810e8f4a>] run_timer_softirq+0x23a/0x390
      [   56.926306]  [<ffffffff810f89ff>] ? clockevents_program_event+0x7f/0x130
      [   56.926316]  [<ffffffff819727c3>] __do_softirq+0xc3/0x4a2
      [   56.926323]  [<ffffffff8106ba5a>] irq_exit+0x8a/0xb0
      [   56.926327]  [<ffffffff81972456>] smp_apic_timer_interrupt+0x46/0x60
      [   56.926331]  [<ffffffff81970a49>] apic_timer_interrupt+0x89/0x90
      [   56.926333]  <EOI>  [<ffffffff81027fda>] ? default_idle+0x2a/0x1a0
      [   56.926340]  [<ffffffff81027fd8>] ? default_idle+0x28/0x1a0
      [   56.926342]  [<ffffffff810289cf>] arch_cpu_idle+0xf/0x20
      [   56.926345]  [<ffffffff810adf0f>] default_idle_call+0x2f/0x50
      [   56.926347]  [<ffffffff810ae145>] cpu_startup_entry+0x215/0x3e0
      [   56.926353]  [<ffffffff81040ad9>] start_secondary+0xf9/0x100
      
      The warning appears as rtnl_dereference() is wrongly used in
      tipc_l2_send_msg() under RCU read lock protection. Instead the proper
      usage should be that rcu_dereference_rtnl() is called here.
      
      Fixes: 5b7066c3 ("tipc: stricter filtering of packets in bearer layer")
      Acked-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66d95b67
    • Jon Paul Maloy's avatar
      tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy authored
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the member of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying  unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35c55c98
  14. 08 Jun, 2016 2 commits
    • Jon Paul Maloy's avatar
      tipc: change node timer unit from jiffies to ms · 5ca509fc
      Jon Paul Maloy authored
      The node keepalive interval is recalculated at each timer expiration
      to catch any changes in the link tolerance, and stored in a field in
      struct tipc_node. We use jiffies as unit for the stored value.
      
      This is suboptimal, because it makes the calculation unnecessary
      complex, including two unit conversions. The conversions also lead to
      a rounding error that causes the link "abort limit" to be 3 in the
      normal case, instead of 4, as intended. This again leads to unnecessary
      link resets when the network is pushed close to its limit, e.g., in an
      environment with hundreds of nodes or namesapces.
      
      In this commit, we do instead let the keepalive value be calculated and
      stored in milliseconds, so that there is only one conversion and the
      rounding error is eliminated.
      
      We also remove a redundant "keepalive" field in struct tipc_link. This
      is remnant from the previous implementation.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5ca509fc
    • Jon Paul Maloy's avatar
      tipc: correct error in node fsm · c4282ca7
      Jon Paul Maloy authored
      commit 88e8ac70 ("tipc: reduce transmission rate of reset messages
      when link is down") revealed a flaw in the node FSM, as defined in
      the log of commit 66996b6c ("tipc: extend node FSM").
      
      We see the following scenario:
      1: Node B receives a RESET message from node A before its link endpoint
         is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
         event will not change the node FSM state, but the (distinct) link FSM
         will move to state RESETTING.
      2: As an effect of the previous event, the local endpoint on B will
         declare node A lost, and post the event SELF_DOWN to the its node
         FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
         that no messages will be accepted from A until it receives another
         RESET message that confirms that A's endpoint has been reset. This
         is  wasteful, since we know this as a fact already from the first
         received RESET, but worse is that the link instance's FSM has not
         wasted this information, but instead moved on to state ESTABLISHING,
         meaning that it repeatedly sends out ACTIVATE messages to the reset
         peer A.
      3: Node A will receive one of the ACTIVATE messages, move its link FSM
         to state ESTABLISHED, and start repeatedly sending out STATE messages
         to node B.
      4: Node B will consistently drop these messages, since it can only accept
         accept a RESET according to its node FSM.
      5: After four lost STATE messages node A will reset its link and start
         repeatedly sending out RESET messages to B.
      6: Because of the reduced send rate for RESET messages, it is very
         likely that A will receive an ACTIVATE (which is sent out at a much
         higher frequency) before it gets the chance to send a RESET, and A
         may hence quickly move back to state ESTABLISHED and continue sending
         out STATE messages, which will again be dropped by B.
      7: GOTO 5.
      8: After having repeated the cycle 5-7 a number of times, node A will
         by chance get in between with sending a RESET, and the situation is
         resolved.
      
      Unfortunately, we have seen that it may take a substantial amount of
      time before this vicious loop is broken, sometimes in the order of
      minutes.
      
      We correct this by making a small correction to the node FSM: When a
      node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
      moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
      SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
      need to wait for RESET confirmation from of an endpoint that we alread
      know has been reset. It also means that node B in the scenario above
      will not be dropping incoming STATE messages, and the link can come up
      immediately.
      
      Finally, a symmetry comparison reveals that the  FSM has a similar
      error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
      Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
      to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
      of this logical error, we choose fix this one, too.
      
      The node FSM looks as follows after those changes:
      
                                 +----------------------------------------+
                                 |                           PEER_DOWN_EVT|
                                 |                                        |
        +------------------------+----------------+                       |
        |SELF_DOWN_EVT           |                |                       |
        |                        |                |                       |
        |              +-----------+          +-----------+               |
        |              |NODE_      |          |NODE_      |               |
        |   +----------|FAILINGOVER|<---------|SYNCHING   |-----------+   |
        |   |SELF_     +-----------+ FAILOVER_+-----------+   PEER_   |   |
        |   |DOWN_EVT   |          A BEGIN_EVT  A         |   DOWN_EVT|   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |FAILOVER_ |FAILOVER_   |SYNCH_   |SYNCH_     |   |
        |   |           |END_EVT   |BEGIN_EVT   |BEGIN_EVT|END_EVT    |   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |         +--------------+        |           |   |
        |   |           +-------->|   SELF_UP_   |<-------+           |   |
        |   |   +-----------------|   PEER_UP    |----------------+   |   |
        |   |   |SELF_DOWN_EVT    +--------------+   PEER_DOWN_EVT|   |   |
        |   |   |                    A        A                   |   |   |
        |   |   |                    |        |                   |   |   |
        |   |   |         PEER_UP_EVT|        |SELF_UP_EVT        |   |   |
        |   |   |                    |        |                   |   |   |
        V   V   V                    |        |                   V   V   V
      +------------+       +-----------+    +-----------+       +------------+
      |SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
      |PEER_LEAVING|       |PEER_COMING|    |SELF_COMING|       |SELF_LEAVING|
      +------------+       +-----------+    +-----------+       +------------+
             |               |       A        A       |                |
             |               |       |        |       |                |
             |       SELF_   |       |SELF_   |PEER_  |PEER_           |
             |       DOWN_EVT|       |UP_EVT  |UP_EVT |DOWN_EVT        |
             |               |       |        |       |                |
             |               |       |        |       |                |
             |               |    +--------------+    |                |
             |PEER_DOWN_EVT  +--->|  SELF_DOWN_  |<---+   SELF_DOWN_EVT|
             +------------------->|  PEER_DOWN   |<--------------------+
                                  +--------------+
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c4282ca7
  15. 02 Jun, 2016 1 commit
  16. 25 May, 2016 1 commit
  17. 19 May, 2016 1 commit
  18. 17 May, 2016 1 commit
  19. 16 May, 2016 1 commit
  20. 12 May, 2016 1 commit
    • Jon Paul Maloy's avatar
      tipc: eliminate risk of double link_up events · e7142c34
      Jon Paul Maloy authored
      When an ACTIVATE or data packet is received in a link in state
      ESTABLISHING, the link does not immediately change state to
      ESTABLISHED, but does instead return a LINK_UP event to the caller,
      which will execute the state change in a different lock context.
      
      This non-atomic approach incurs a low risk that we may have two
      LINK_UP events pending simultaneously for the same link, resulting
      in the final part of the setup procedure being executed twice. The
      only potential harm caused by this it that we may see two LINK_UP
      events issued to subsribers of the topology server, something that
      may cause confusion.
      
      This commit eliminates this risk by checking if the link is already
      up before proceeding with the second half of the setup.
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7142c34
  21. 03 May, 2016 3 commits
    • Jon Paul Maloy's avatar
      tipc: redesign connection-level flow control · 10724cc7
      Jon Paul Maloy authored
      There are two flow control mechanisms in TIPC; one at link level that
      handles network congestion, burst control, and retransmission, and one
      at connection level which' only remaining task is to prevent overflow
      in the receiving socket buffer. In TIPC, the latter task has to be
      solved end-to-end because messages can not be thrown away once they
      have been accepted and delivered upwards from the link layer, i.e, we
      can never permit the receive buffer to overflow.
      
      Currently, this algorithm is message based. A counter in the receiving
      socket keeps track of number of consumed messages, and sends a dedicated
      acknowledge message back to the sender for each 256 consumed message.
      A counter at the sending end keeps track of the sent, not yet
      acknowledged messages, and blocks the sender if this number ever reaches
      512 unacknowledged messages. When the missing acknowledge arrives, the
      socket is then woken up for renewed transmission. This works well for
      keeping the message flow running, as it almost never happens that a
      sender socket is blocked this way.
      
      A problem with the current mechanism is that it potentially is very
      memory consuming. Since we don't distinguish between small and large
      messages, we have to dimension the socket receive buffer according
      to a worst-case of both. I.e., the window size must be chosen large
      enough to sustain a reasonable throughput even for the smallest
      messages, while we must still consider a scenario where all messages
      are of maximum size. Hence, the current fix window size of 512 messages
      and a maximum message size of 66k results in a receive buffer of 66 MB
      when truesize(66k) = 131k is taken into account. It is possible to do
      much better.
      
      This commit introduces an algorithm where we instead use 1024-byte
      blocks as base unit. This unit, always rounded upwards from the
      actual message size, is used when we advertise windows as well as when
      we count and acknowledge transmitted data. The advertised window is
      based on the configured receive buffer size in such a way that even
      the worst-case truesize/msgsize ratio always is covered. Since the
      smallest possible message size (from a flow control viewpoint) now is
      1024 bytes, we can safely assume this ratio to be less than four, which
      is the value we are now using.
      
      This way, we have been able to reduce the default receive buffer size
      from 66 MB to 2 MB with maintained performance.
      
      In order to keep this solution backwards compatible, we introduce a
      new capability bit in the discovery protocol, and use this throughout
      the message sending/reception path to always select the right unit.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      10724cc7
    • Jon Paul Maloy's avatar
      tipc: propagate peer node capabilities to socket layer · 60020e18
      Jon Paul Maloy authored
      During neighbor discovery, nodes advertise their capabilities as a bit
      map in a dedicated 16-bit field in the discovery message header. This
      bit map has so far only be stored in the node structure on the peer
      nodes, but we now see the need to keep a copy even in the socket
      structure.
      
      This commit adds this functionality.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      60020e18
    • Jon Paul Maloy's avatar
      tipc: re-enable compensation for socket receive buffer double counting · 7c8bcfb1
      Jon Paul Maloy authored
      In the refactoring commit d570d864 ("tipc: enqueue arrived buffers
      in socket in separate function") we did by accident replace the test
      
      if (sk->sk_backlog.len == 0)
           atomic_set(&tsk->dupl_rcvcnt, 0);
      
      with
      
      if (sk->sk_backlog.len)
           atomic_set(&tsk->dupl_rcvcnt, 0);
      
      This effectively disables the compensation we have for the double
      receive buffer accounting that occurs temporarily when buffers are
      moved from the backlog to the socket receive queue. Until now, this
      has gone unnoticed because of the large receive buffer limits we are
      applying, but becomes indispensable when we reduce this buffer limit
      later in this series.
      
      We now fix this by inverting the mentioned condition.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7c8bcfb1
  22. 01 May, 2016 2 commits
    • Hamish Martin's avatar
      tipc: only process unicast on intended node · efe79050
      Hamish Martin authored
      We have observed complete lock up of broadcast-link transmission due to
      unacknowledged packets never being removed from the 'transmq' queue. This
      is traced to nodes having their ack field set beyond the sequence number
      of packets that have actually been transmitted to them.
      Consider an example where node 1 has sent 10 packets to node 2 on a
      link and node 3 has sent 20 packets to node 2 on another link. We
      see examples of an ack from node 2 destined for node 3 being treated as
      an ack from node 2 at node 1. This leads to the ack on the node 1 to node
      2 link being increased to 20 even though we have only sent 10 packets.
      When node 1 does get around to sending further packets, none of the
      packets with sequence numbers less than 21 are actually removed from the
      transmq.
      To resolve this we reinstate some code lost in commit d999297c ("tipc:
      reduce locking scope during packet reception") which ensures that only
      messages destined for the receiving node are processed by that node. This
      prevents the sequence numbers from getting out of sync and resolves the
      packet leakage, thereby resolving the broadcast-link transmission
      lock-ups we observed.
      
      While we are aware that this change only patches over a root problem that
      we still haven't identified, this is a sanity test that it is always
      legitimate to do. It will remain in the code even after we identify and
      fix the real problem.
      Reviewed-by: default avatarChris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: default avatarJohn Thompson <john.thompson@alliedtelesis.co.nz>
      Signed-off-by: default avatarHamish Martin <hamish.martin@alliedtelesis.co.nz>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      efe79050
    • Jon Paul Maloy's avatar
      tipc: set 'active' state correctly for first established link · def22c47
      Jon Paul Maloy authored
      When we are displaying statistics for the first link established between
      two peers, it will always be presented as STANDBY although it in reality
      is ACTIVE.
      
      This happens because we forget to set the 'active' flag in the link
      instance at the moment it is established. Although this is a bug, it only
      has impact on the presentation view of the link, not on its actual
      functionality.
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      def22c47
  23. 28 Apr, 2016 1 commit
  24. 24 Apr, 2016 1 commit
  25. 18 Apr, 2016 1 commit
  26. 15 Apr, 2016 1 commit
    • Jon Paul Maloy's avatar
      tipc: let first message on link be a state message · 34b9cd64
      Jon Paul Maloy authored
      According to the link FSM, a received traffic packet can take a link
      from state ESTABLISHING to ESTABLISHED, but the link can still not be
      fully set up in one atomic operation. This means that even if the the
      very first packet on the link is a traffic packet with sequence number
      1 (one), it has to be dropped and retransmitted.
      
      This can be avoided if we let the mentioned packet be preceded by a
      LINK_PROTOCOL/STATE message, which takes up the endpoint before the
      arrival of the traffic.
      
      We add this small feature in this commit.
      
      This is a fully compatible change.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      34b9cd64