Description
Version: 2.18.0+
Hi!
I believe I have found a bug in the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature introduced in version 2.18: it has no effect when a custom characterEscapes is configured.
An example:
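import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonFactoryBuilder;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonpCharacterEscapes;
import com.fasterxml.jackson.core.json.JsonWriteFeature;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Wrapper class added so the snippet compiles; its name is arbitrary.
public class SurrogateEscapesRepro {
    public static void main(String[] args) throws IOException {
        // Default: surrogate pairs are not combined
        JsonFactory surrogatePairFactory = JsonFactory.builder()
                .build();
        // Feature enabled, no custom escapes
        JsonFactory utf8Factory = JsonFactory.builder()
                .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                .build();
        // Feature enabled together with custom character escapes
        JsonFactory utf8FactoryWithCharacterEscapes = new JsonFactoryBuilder()
                .characterEscapes(JsonpCharacterEscapes.instance())
                .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                .build();
        System.out.println(writeEmoji(surrogatePairFactory));
        System.out.println(writeEmoji(utf8Factory));
        System.out.println(writeEmoji(utf8FactoryWithCharacterEscapes));
    }

    private static String writeEmoji(JsonFactory f) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (JsonGenerator gen = f.createGenerator(out)) {
            gen.writeStartObject();
            // 0x1F60A - emoji
            gen.writeStringField("test_emoji", new String(Character.toChars(0x1F60A)));
            gen.writeEndObject();
        }
        return out.toString(StandardCharsets.UTF_8);
    }
}
The output: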
The third line (printed by utf8FactoryWithCharacterEscapes) is expected to match the second line (printed by utf8Factory), but the two differ.
The cause appears to be that when a custom characterEscapes is set, the generator goes through the two _writeCustomStringSegment2() methods (see the excerpt below), neither of which checks the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature.
jackson-core/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java
Lines 1684 to 1739 in 9fcf1e7
    private final void _writeCustomStringSegment2(final char[] cbuf, int offset, final int end) throws IOException
    {
        // Ok: caller guarantees buffer can have room; but that may require flushing:
        if ((_outputTail + 6 * (end - offset)) > _outputEnd) {
            _flushBuffer();
        }
        int outputPtr = _outputTail;
        final byte[] outputBuffer = _outputBuffer;
        final int[] escCodes = _outputEscapes;
        // may or may not have this limit
        final int maxUnescaped = (_maximumNonEscapedChar <= 0) ? 0xFFFF : _maximumNonEscapedChar;
        final CharacterEscapes customEscapes = _characterEscapes; // non-null
        while (offset < end) {
            int ch = cbuf[offset++];
            if (ch <= 0x7F) {
                if (escCodes[ch] == 0) {
                    outputBuffer[outputPtr++] = (byte) ch;
                    continue;
                }
                int escape = escCodes[ch];
                if (escape > 0) { // 2-char escape, fine
                    outputBuffer[outputPtr++] = BYTE_BACKSLASH;
                    outputBuffer[outputPtr++] = (byte) escape;
                } else if (escape == CharacterEscapes.ESCAPE_CUSTOM) {
                    SerializableString esc = customEscapes.getEscapeSequence(ch);
                    if (esc == null) {
                        _reportError("Invalid custom escape definitions; custom escape not found for character code 0x"
                                +Integer.toHexString(ch)+", although was supposed to have one");
                    }
                    outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                } else {
                    // ctrl-char, 6-byte escape...
                    outputPtr = _writeGenericEscape(ch, outputPtr);
                }
                continue;
            }
            if (ch > maxUnescaped) { // [JACKSON-102] Allow forced escaping if non-ASCII (etc) chars:
                outputPtr = _writeGenericEscape(ch, outputPtr);
                continue;
            }
            SerializableString esc = customEscapes.getEscapeSequence(ch);
            if (esc != null) {
                outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                continue;
            }
            if (ch <= 0x7FF) { // fine, just needs 2 byte output
                outputBuffer[outputPtr++] = (byte) (0xc0 | (ch >> 6));
                outputBuffer[outputPtr++] = (byte) (0x80 | (ch & 0x3f));
            } else {
                outputPtr = _outputMultiByteChar(ch, outputPtr);
            }
        }
        _outputTail = outputPtr;
    }
I believe the fix is straightforward: port the changes made in #1335 and #1360 to the two _writeCustomStringSegment2() methods. I am working on a pull request.
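For readers unfamiliar with what "combining" means here, the following is a minimal standalone sketch of the idea: when a high surrogate is followed by a low surrogate, merge the two chars into one code point and write it as a single 4-byte UTF-8 sequence. It is not the code from #1335 or #1360 and does not use any Jackson internals. The class name SurrogateUtf8Sketch and the method encodeCombiningSurrogates are hypothetical, and throwing on an unpaired surrogate is a choice made for this sketch only, not a statement about Jackson's behaviour.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of jackson-core.
final class SurrogateUtf8Sketch {

    /** Encodes text as UTF-8, merging each valid surrogate pair into one 4-byte sequence. */
    static byte[] encodeCombiningSurrogates(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i++);
            if (ch <= 0x7F) {
                // ASCII: single byte
                out.write(ch);
            } else if (ch <= 0x7FF) {
                // two-byte sequence
                out.write(0xC0 | (ch >> 6));
                out.write(0x80 | (ch & 0x3F));
            } else if (Character.isSurrogate(ch)) {
                // Combine only a high surrogate that is immediately followed by a low one
                if (Character.isHighSurrogate(ch)
                        && i < text.length()
                        && Character.isLowSurrogate(text.charAt(i))) {
                    int cp = Character.toCodePoint(ch, text.charAt(i++));
                    out.write(0xF0 | (cp >> 18));
                    out.write(0x80 | ((cp >> 12) & 0x3F));
                    out.write(0x80 | ((cp >> 6) & 0x3F));
                    out.write(0x80 | (cp & 0x3F));
                } else {
                    // Sketch-only assumption: reject unpaired surrogates outright
                    throw new IllegalArgumentException(
                            "Unpaired surrogate 0x" + Integer.toHexString(ch));
                }
            } else {
                // remaining BMP characters: three-byte sequence
                out.write(0xE0 | (ch >> 12));
                out.write(0x80 | ((ch >> 6) & 0x3F));
                out.write(0x80 | (ch & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

For the emoji used in the example above (0x1F60A), this produces the four bytes F0 9F 98 8A, the same bytes that String.getBytes(StandardCharsets.UTF_8) returns for it. In the generator, the equivalent check would sit in the non-ASCII branch of the loop, before the char is handed to the multi-byte output path.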