Skip to content
  • Daniel Borkmann's avatar
    net: sched: fix call_rcu() race on classifier module unloads · c78e1746
    Daniel Borkmann authored
    Vijay reported that a loop as simple as ...
    
      while true; do
        tc qdisc add dev foo root handle 1: prio
        tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
        tc qdisc del dev foo root
        rmmod cls_u32
      done
    
    ... will panic the kernel. Moreover, he bisected the change
    apparently introducing it to 78fd1d0a ("netlink: Re-add
    locking to netlink_lookup() and seq walker").
    
    The removal of synchronize_net() from the netlink socket
    triggering the qdisc to be removed, seems to have uncovered
    an RCU resp. module reference count race from the tc API.
    Given that RCU conversion was done after e341694e ("netlink:
    Convert netlink_lookup() to use RCU protected hash table")
    which added the synchronize_net() originally, occasion of
    hitting the bug was less likely (not impossible though):
    
    When qdiscs that i) support attaching classifiers and,
    ii) have at least one of them attached, get deleted, they
    invoke tcf_destroy_chain(), and thus call into ->destroy()
    handler from a classifier module.
    
    After RCU conversion, all classifier that have an internal
    prio list, unlink them and initiate freeing via call_rcu()
    deferral.
    
    Meanhile, tcf_destroy() releases already reference to the
    tp->ops->owner module before the queued RCU callback handler
    has been invoked.
    
    Subsequent rmmod on the classifier module is then not prevented
    since all module references are already dropped.
    
    By the time, the kernel invokes the RCU callback handler from
    the module, that function address is then invalid.
    
    One way to fix it would be to add an rcu_barrier() to
    unregister_tcf_proto_ops() to wait for all pending call_rcu()s
    to complete.
    
    synchronize_rcu() is not appropriate as under heavy RCU
    callback load, registered call_rcu()s could be deferred
    longer than a grace period. In case we don't have any pending
    call_rcu()s, the barrier is allowed to return immediately.
    
    Since we came here via unregister_tcf_proto_ops(), there
    are no users of a given classifier anymore. Further nested
    call_rcu()s pointing into the module space are not being
    done anywhere.
    
    Only cls_bpf_delete_prog() may schedule a work item, to
    unlock pages eventually, but that is not in the range/context
    of cls_bpf anymore.
    
    Fixes: 25d8c0d5 ("net: rcu-ify tcf_proto")
    Fixes: 9888faef
    
     ("net: sched: cls_basic use RCU")
    Reported-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Cc: John Fastabend <john.r.fastabend@intel.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Thomas Graf <tgraf@suug.ch>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Alexei Starovoitov <ast@plumgrid.com>
    Tested-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
    Acked-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
    Acked-by: default avatarEric Dumazet <edumazet@google.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    c78e1746