Skip to content
  • Daniel Borkmann's avatar
    netlink, mmap: fix edge-case leakages in nf queue zero-copy · 6bb0fef4
    Daniel Borkmann authored
    When netlink mmap on receive side is the consumer of nf queue data,
    it can happen that in some edge cases, we write skb shared info into
    the user space mmap buffer:
    
    Assume a possible rx ring frame size of only 4096, and the network skb,
    which is being zero-copied into the netlink skb, contains page frags
    with an overall skb->len larger than the linear part of the netlink
    skb.
    
    skb_zerocopy(), which is generic and thus not aware of the fact that
    shared info cannot be accessed for such skbs then tries to write and
    fill frags, thus leaking kernel data/pointers and in some corner cases
    possibly writing out of bounds of the mmap area (when filling the
    last slot in the ring buffer this way).
    
    I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
    an advertised length larger than 4096, where the linear part is visible
    at the slot beginning, and the leaked sizeof(struct skb_shared_info)
    has been written to the beginning of the next slot (also corrupting
    the struct nl_mmap_hdr slot header incl. status etc), since skb->end
    points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.
    
    The fix adds and lets __netlink_alloc_skb() take the actual needed
    linear room for the network skb + meta data into account. It's completely
    irrelevant for non-mmaped netlink sockets, but in case mmap sockets
    are used, it can be decided whether the available skb_tailroom() is
    really large enough for the buffer, or whether it needs to internally
    fallback to a normal alloc_skb().
    
    >From nf queue side, the information whether the destination port is
    an mmap RX ring is not really available without extra port-to-socket
    lookup, thus it can only be determined in lower layers i.e. when
    __netlink_alloc_skb() is called that checks internally for this. I
    chose to add the extra ldiff parameter as mmap will then still work:
    We have data_len and hlen in nfqnl_build_packet_message(), data_len
    is the full length (capped at queue->copy_range) for skb_zerocopy()
    and hlen some possible part of data_len that needs to be copied; the
    rem_len variable indicates the needed remaining linear mmap space.
    
    The only other workaround in nf queue internally would be after
    allocation time by f.e. cap'ing the data_len to the skb_tailroom()
    iff we deal with an mmap skb, but that would 1) expose the fact that
    we use a mmap skb to upper layers, and 2) trim the skb where we
    otherwise could just have moved the full skb into the normal receive
    queue.
    
    After the patch, in my test case the ring slot doesn't fit and therefore
    shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
    thus needs to be picked up via recv().
    
    Fixes: 3ab1f683
    
     ("nfnetlink: add support for memory mapped netlink")
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    6bb0fef4