Feature COMBINE_UNICODE_SURROGATES_IN_UTF8 doesn't work when custom CharacterEscapes are used #1398

@stackunderflow111

Version: 2.18.0+

Hi!

I believe I have found a bug in the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature introduced in version 2.18: it has no effect when custom CharacterEscapes are used.

An example:

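    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonFactoryBuilder;
    import com.fasterxml.jackson.core.JsonGenerator;
    import com.fasterxml.jackson.core.JsonpCharacterEscapes;
    import com.fasterxml.jackson.core.json.JsonWriteFeature;

    // Wrapper class added so the snippet compiles on its own; the name is arbitrary.
    public class SurrogateEscapesRepro {
        public static void main(String[] args) throws IOException {
            // Default factory: feature disabled
            JsonFactory surrogatePairFactory = JsonFactory.builder()
                    .build();
            // Feature enabled, no custom escapes
            JsonFactory utf8Factory = JsonFactory.builder()
                    .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                    .build();
            // Feature enabled together with custom escapes (the failing case)
            JsonFactory utf8FactoryWithCharacterEscapes = new JsonFactoryBuilder()
                    .characterEscapes(JsonpCharacterEscapes.instance())
                    .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                    .build();
            System.out.println(writeEmoji(surrogatePairFactory));
            System.out.println(writeEmoji(utf8Factory));
            System.out.println(writeEmoji(utf8FactoryWithCharacterEscapes));
        }

        private static String writeEmoji(JsonFactory f) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (JsonGenerator gen = f.createGenerator(out)) {
                gen.writeStartObject();
                // 0x1F60A - emoji
                gen.writeStringField("test_emoji", new String(Character.toChars(0x1F60A)));
                gen.writeEndObject();
            }
            return out.toString(StandardCharsets.UTF_8);
        }
    }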

The output:

[Image: console output of the three println calls]

The third line (printed by utf8FactoryWithCharacterEscapes) is expected to match the second line (printed by utf8Factory), but the two differ.
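For readers unfamiliar with the two representations involved, here is a small JDK-only sketch (no Jackson calls) of how the code point U+1F60A from the example looks as a UTF-16 surrogate pair and as raw UTF-8. The class and method names (SurrogateBasics, utf8Hex, utf16Escapes) are made up for this illustration, and the uppercase hex formatting is a choice of the sketch, not necessarily what the generator prints.

```java
import java.nio.charset.StandardCharsets;

class SurrogateBasics {
    // Formats each UTF-8 byte of s as two uppercase hex digits, space-separated.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Formats each UTF-16 code unit of s as a JSON-style backslash-u escape.
    static String utf16Escapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60A));
        System.out.println(emoji.length());      // number of UTF-16 code units
        System.out.println(utf16Escapes(emoji)); // the pair written as two escapes
        System.out.println(utf8Hex(emoji));      // the same code point as UTF-8 bytes
    }
}
```

The escaped form spends twelve ASCII characters on the pair, while the combined form is a single 4-byte UTF-8 sequence; that is the difference the second and third output lines are expected not to have.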

The reason seems to be that when custom CharacterEscapes are used, the code calls the two _writeCustomStringSegment2() methods, which do not check the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature. The char[] variant is shown below:

    private final void _writeCustomStringSegment2(final char[] cbuf, int offset, final int end) throws IOException
    {
        // Ok: caller guarantees buffer can have room; but that may require flushing:
        if ((_outputTail + 6 * (end - offset)) > _outputEnd) {
            _flushBuffer();
        }
        int outputPtr = _outputTail;
        final byte[] outputBuffer = _outputBuffer;
        final int[] escCodes = _outputEscapes;
        // may or may not have this limit
        final int maxUnescaped = (_maximumNonEscapedChar <= 0) ? 0xFFFF : _maximumNonEscapedChar;
        final CharacterEscapes customEscapes = _characterEscapes; // non-null
        while (offset < end) {
            int ch = cbuf[offset++];
            if (ch <= 0x7F) {
                if (escCodes[ch] == 0) {
                    outputBuffer[outputPtr++] = (byte) ch;
                    continue;
                }
                int escape = escCodes[ch];
                if (escape > 0) { // 2-char escape, fine
                    outputBuffer[outputPtr++] = BYTE_BACKSLASH;
                    outputBuffer[outputPtr++] = (byte) escape;
                } else if (escape == CharacterEscapes.ESCAPE_CUSTOM) {
                    SerializableString esc = customEscapes.getEscapeSequence(ch);
                    if (esc == null) {
                        _reportError("Invalid custom escape definitions; custom escape not found for character code 0x"
                                +Integer.toHexString(ch)+", although was supposed to have one");
                    }
                    outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                } else {
                    // ctrl-char, 6-byte escape...
                    outputPtr = _writeGenericEscape(ch, outputPtr);
                }
                continue;
            }
            if (ch > maxUnescaped) { // [JACKSON-102] Allow forced escaping if non-ASCII (etc) chars:
                outputPtr = _writeGenericEscape(ch, outputPtr);
                continue;
            }
            SerializableString esc = customEscapes.getEscapeSequence(ch);
            if (esc != null) {
                outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                continue;
            }
            if (ch <= 0x7FF) { // fine, just needs 2 byte output
                outputBuffer[outputPtr++] = (byte) (0xc0 | (ch >> 6));
                outputBuffer[outputPtr++] = (byte) (0x80 | (ch & 0x3f));
            } else {
                outputPtr = _outputMultiByteChar(ch, outputPtr);
            }
        }
        _outputTail = outputPtr;
    }

I believe the fix is easy: we can port the changes made in #1335 and #1360 to the two _writeCustomStringSegment2() methods. I am working on a pull request.
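To illustrate the kind of check that appears to be missing, here is a rough standalone sketch of a write loop with and without surrogate combining. It is not the code from #1335 or #1360 and not Jackson's implementation: the class SurrogateCombineSketch, the methods encode and writeEscape, and the boolean combineSurrogates flag are hypothetical stand-ins, and falling back to a backslash-u escape for uncombined or unpaired surrogates is an assumption made for the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SurrogateCombineSketch {
    // Hypothetical stand-in for the generator's write loop; not Jackson code.
    static byte[] encode(char[] cbuf, boolean combineSurrogates) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int offset = 0;
        final int end = cbuf.length;
        while (offset < end) {
            int ch = cbuf[offset++];
            if (ch <= 0x7F) {
                out.write(ch); // plain ASCII, 1 byte
            } else if (ch <= 0x7FF) {
                out.write(0xC0 | (ch >> 6)); // 2-byte sequence
                out.write(0x80 | (ch & 0x3F));
            } else if (combineSurrogates
                    && Character.isHighSurrogate((char) ch)
                    && offset < end
                    && Character.isLowSurrogate(cbuf[offset])) {
                // The feature check: merge the pair into one code point
                // and emit it as a single 4-byte UTF-8 sequence.
                int cp = Character.toCodePoint((char) ch, cbuf[offset++]);
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else if (Character.isSurrogate((char) ch)) {
                // Assumed fallback: feature off, or surrogate without a partner.
                writeEscape(out, ch);
            } else {
                out.write(0xE0 | (ch >> 12)); // 3-byte sequence
                out.write(0x80 | ((ch >> 6) & 0x3F));
                out.write(0x80 | (ch & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Writes ch as a 6-character backslash-u escape with uppercase hex digits.
    static void writeEscape(ByteArrayOutputStream out, int ch) {
        byte[] esc = String.format("\\u%04X", ch).getBytes(StandardCharsets.US_ASCII);
        out.write(esc, 0, esc.length);
    }

    public static void main(String[] args) {
        char[] emoji = Character.toChars(0x1F60A);
        System.out.println(new String(encode(emoji, false), StandardCharsets.UTF_8));
        System.out.println(new String(encode(emoji, true), StandardCharsets.UTF_8));
    }
}
```

In the real methods the equivalent branch would presumably sit before the call to _outputMultiByteChar() and consult the generator's feature flags rather than a boolean parameter.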

Labels: 2.18 (Issues planned at earliest for 2.18)
