  • x86/alternatives: Make JMPs more robust · 48c7a250
    Borislav Petkov authored
    
    
    Up until now we had to pay attention to how the relative offset of
    relative JMPs in alternatives gets computed so that the jump target
    is still correct. Or, as is the case for near CALLs (opcode e8), we
    still have to go and readjust the offset at patching time.
    
    What is more, the static_cpu_has_safe() facility had to forcefully
    generate 5-byte JMPs since we couldn't rely on the compiler to generate
    properly sized ones so we had to force the longest ones. Worse than
    that, sometimes it would generate a replacement JMP which is longer than
    the original one, thus overwriting the beginning of the next instruction
    at patching time.
    
    So, in order to alleviate all that and make using JMPs more
    straightforward, we go and pad the original instruction in an
    alternative block with NOPs at build time, should the replacement(s) be
    longer. This way, users of alternatives don't have to make sure that
    the original and replacement instruction sizes fit; the assembler
    simply adds padding where needed and does nothing otherwise.
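
    The padding rule above can be sketched in C. This is only an
    illustration: the real padding is emitted by the assembler at build
    time, and pad_original() and NOP1 are hypothetical names which are
    not part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP1 0x90 /* single-byte x86 NOP */

/*
 * Hypothetical helper: `buf` holds an original instruction of orig_len
 * bytes and has room for at least repl_len bytes. If the replacement is
 * longer, fill the difference with single-byte NOPs. Returns the
 * resulting length of the original instruction slot.
 */
size_t pad_original(uint8_t *buf, size_t orig_len, size_t repl_len)
{
    assert(buf != NULL);

    if (repl_len <= orig_len)
        return orig_len;        /* replacement fits, nothing to pad */

    memset(buf + orig_len, NOP1, repl_len - orig_len);
    return repl_len;
}
```

    With a 2-byte original and a 5-byte replacement, this leaves three
    NOPs after the original instruction, so patching can no longer
    overwrite the beginning of the next instruction.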
    
    As a second aspect, we go and recompute JMPs at patching time so that we
    can try to make 5-byte JMPs into two-byte ones if possible. If not, we
    still have to recompute the offsets as the replacement JMP gets put far
    away in the .altinstr_replacement section leading to a wrong offset if
    copied verbatim.
    
    For example, on a locally generated kernel image
    
      old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
      __switch_to:
       ffffffff810014bd:      eb 21                   jmp ffffffff810014e0
      repl insn: size: 5
       ffffffff81d0b23c:      e9 b1 62 2f ff          jmpq ffffffff810014f2
    
    gets corrected to a 2-byte JMP:
    
      apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
      alt_insn: e9 b1 62 2f ff
      recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
      converted to: eb 33 90 90 90
    
    and a 5-byte JMP:
    
      old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
      __switch_to:
       ffffffff81001516:      eb 30                   jmp ffffffff81001548
      repl insn: size: 5
       ffffffff81d0b241:      e9 10 63 2f ff          jmpq ffffffff81001556
    
    gets shortened into a two-byte one:
    
      apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
      alt_insn: e9 10 63 2f ff
      recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
      converted to: eb 3e 90 90 90
    
    ... and so on.
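
    The offset arithmetic in the dumps above can be reproduced with a
    small C sketch. It is not the kernel's recompute_jumps(), just a
    minimal model of it under the assumption that the replacement is a
    5-byte "jmp rel32" (opcode e9); recompute_jmp() and its parameters
    are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: re-encode the 5-byte JMP found at repl_addr so
 * that it reaches the same target when executed at orig_addr. Writes
 * slot_len bytes (at least 5) to `out`, padded with single-byte NOPs,
 * and returns the encoded length of the JMP itself (2 or 5).
 */
size_t recompute_jmp(const uint8_t repl[5], uint64_t repl_addr,
                     uint64_t orig_addr, uint8_t *out, size_t slot_len)
{
    assert(repl[0] == 0xe9 && slot_len >= 5);

    /* Decode the little-endian rel32 displacement. */
    int32_t rel32 = (int32_t)((uint32_t)repl[1] |
                              (uint32_t)repl[2] << 8 |
                              (uint32_t)repl[3] << 16 |
                              (uint32_t)repl[4] << 24);

    /* Displacements are relative to the end of the instruction. */
    uint64_t target = repl_addr + 5 + (uint64_t)(int64_t)rel32;

    memset(out, 0x90, slot_len);

    /* Try the 2-byte form (eb rel8) first. */
    int64_t d8 = (int64_t)(target - (orig_addr + 2));
    if (d8 >= -128 && d8 <= 127) {
        out[0] = 0xeb;
        out[1] = (uint8_t)d8;
        return 2;
    }

    /* Otherwise keep the 5-byte form with a recomputed rel32. */
    uint32_t d32 = (uint32_t)(target - (orig_addr + 5));
    out[0] = 0xe9;
    out[1] = (uint8_t)d32;
    out[2] = (uint8_t)(d32 >> 8);
    out[3] = (uint8_t)(d32 >> 16);
    out[4] = (uint8_t)(d32 >> 24);
    return 5;
}
```

    For the first dump, the target is ffffffff81d0b241 + ff2f62b1 (sign
    extended) = ffffffff810014f2, and the new displacement relative to
    the patched site is ffffffff810014f2 - (ffffffff810014bd + 2) = 0x33,
    which matches "converted to: eb 33 90 90 90".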
    
    This leads to a net win of around
    
    40ish replacements * 3 bytes savings =~ 120 bytes of I$
    
    on an AMD guest, which means some savings of precious instruction cache
    bandwidth. The padding after the shorter 2-byte JMPs consists of
    single-byte NOPs, which smart microarchitectures discard at decode time,
    thus freeing up execution bandwidth.
    
    Signed-off-by: Borislav Petkov <bp@suse.de>