    qjson: to_json() case QTYPE_QSTRING is buggy, rewrite · e2ec3f97
    Markus Armbruster authored
    
    
    Known bugs in to_json():
    
    * A start byte for a three-byte sequence followed by fewer than two
      continuation bytes is split into one-byte sequences.
    
    * Start bytes for sequences longer than three bytes get misinterpreted
      as start bytes for three-byte sequences.  Continuation bytes beyond
      byte three become one-byte sequences.
    
      This means all characters outside the BMP are decoded incorrectly.
    
    * One-byte sequences with the MSB set are put into the JSON string
      verbatim when char is unsigned, producing invalid UTF-8.  When char
      is signed, they're replaced by "\\uFFFF" instead.
    
      This includes \xFE, \xFF, and stray continuation bytes.
    
    * Overlong sequences are happily accepted, unless screwed up by the
      bugs above.
    
    * Likewise, sequences encoding surrogate code points or noncharacters.
    
    * Unlike other control characters, ASCII DEL is not escaped.  Except
      in overlong encodings.
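
    Why the signedness of char matters for a byte with the MSB set can be
    shown with a minimal sketch.  The helper names below are hypothetical
    and this is not the code from to_json(); it only illustrates, assuming
    two's complement, that the same byte \xFE can pass or fail a range
    test depending on the type it is read through.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: "is this byte an ASCII control character?" */
static bool is_control_signed(signed char c)
{
    return c < 0x20;    /* a negative value (MSB set) also passes */
}

static bool is_control_unsigned(unsigned char c)
{
    return c < 0x20;    /* only 0x00..0x1F pass */
}

/* Byte 0xFE is -2 when read as a two's complement signed char. */
static void demo_signedness(void)
{
    assert(is_control_signed(-2));        /* misclassified as control */
    assert(!is_control_unsigned(0xFE));   /* correctly rejected */
    assert(is_control_signed('\n') && is_control_unsigned('\n'));
}
```

    Reading string bytes through unsigned char, as the second predicate
    does, is one way to make such tests independent of the platform.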
    
    My rewrite fixes them as follows:
    
    * Malformed UTF-8 sequences are replaced.
    
      The one exception is the overlong encoding \xC0\x80 of U+0000,
      which is still accepted.  It permits embedding NUL characters in
      C strings, a trick known as "Modified UTF-8".
    
    * Sequences encoding code points beyond Unicode range are replaced.
    
    * Sequences encoding code points beyond the BMP produce a surrogate
      pair.
    
    * Sequences encoding surrogate code points are replaced.
    
    * Sequences encoding noncharacters are replaced.
    
    * ASCII DEL is now always escaped.
    
    The replacement character is U+FFFD.
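
    The rules listed above can be sketched as follows.  This is an
    illustration under stated assumptions, not the code in this commit:
    decode_one() and escape_cp() are hypothetical names, and how many
    bytes a malformed sequence consumes is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define REPLACEMENT 0xFFFDu

/*
 * Hypothetical decoder: read one sequence from s[0..n), store its code
 * point in *cp (U+FFFD if rejected), return the bytes consumed (>= 1).
 */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a len-byte sequence */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    assert(n > 0);
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2, c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3, c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4, c = s[0] & 0x07;
    } else {
        *cp = REPLACEMENT;      /* stray continuation byte, \xF8..\xFF */
        return 1;
    }

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            *cp = REPLACEMENT;  /* too few continuation bytes */
            return i;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_cp[len] && !(len == 2 && c == 0)) {
        c = REPLACEMENT;        /* overlong, except \xC0\x80 for U+0000 */
    } else if (c > 0x10FFFF) {
        c = REPLACEMENT;        /* beyond Unicode range */
    } else if (c >= 0xD800 && c <= 0xDFFF) {
        c = REPLACEMENT;        /* surrogate code point */
    } else if ((c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE) {
        c = REPLACEMENT;        /* noncharacter */
    }
    *cp = c;
    return len;
}

/* Hypothetical emitter: write cp as a JSON \u escape, using a surrogate
 * pair beyond the BMP.  Returns what snprintf() returns. */
static int escape_cp(uint32_t cp, char *out, size_t size)
{
    if (cp > 0xFFFF) {
        uint32_t v = cp - 0x10000;
        return snprintf(out, size, "\\u%04X\\u%04X",
                        (unsigned)(0xD800 + (v >> 10)),
                        (unsigned)(0xDC00 + (v & 0x3FF)));
    }
    return snprintf(out, size, "\\u%04X", (unsigned)cp);
}
```

    For instance, the four bytes \xF0\x9F\x98\x80 decode to U+1F600,
    which escape_cp() writes as the pair \uD83D\uDE00, and ASCII DEL
    comes out as \u007F.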
    
    Signed-off-by: Markus Armbruster <armbru@redhat.com>
    Reviewed-by: Laszlo Ersek <lersek@redhat.com>
    Signed-off-by: Blue Swirl <blauwirbel@gmail.com>