43 changes: 32 additions & 11 deletions serializer/src/main/java/org/apache/xml/serializer/ToStream.java
@@ -175,6 +175,10 @@ abstract public class ToStream extends SerializerBase
*/
private boolean m_expandDTDEntities = true;

/**
* Track multibyte character in order to serialize when the whole byte sequence is available.
*/
private char m_highUTF16Surrogate = 0;

/**
* Default constructor
@@ -1595,23 +1599,40 @@ else if (m_encodingInfo.isInEncoding(ch)) {
// not in the normal ASCII range, we also
// just leave it get added on to the clean characters
}
else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && Encodings.isLowUTF16Surrogate(chars[i+1])) {
// So, this is a (valid) surrogate pair
if (! m_encodingInfo.isInEncoding(ch, chars[i+1])) {
int codepoint = Encodings.toCodePoint(ch, chars[i+1]);
writeOutCleanChars(chars, i, lastDirtyCharProcessed);
writer.write("&#");
writer.write(Integer.toString(codepoint));
writer.write(';');
lastDirtyCharProcessed = i+1;
}
i++; // skip the low surrogate, too
else if (Encodings.isHighUTF16Surrogate(ch)) {
// Store for later processing. We may be at the end of a buffer,
// and must wait till low surrogate arrives
// before we can do anything with this.
writeOutCleanChars(chars, i, lastDirtyCharProcessed);
m_highUTF16Surrogate = ch;
lastDirtyCharProcessed = i;
}
else if (m_highUTF16Surrogate != 0 && Encodings.isLowUTF16Surrogate(ch)) {
// The complete utf16 byte sequence is now available and may be serialized.
if (! m_encodingInfo.isInEncoding(m_highUTF16Surrogate, ch)) {
Author
Line 1602: if ch is a high surrogate, flush the clean chars and retain ch. This works both within a single buffer and across buffer boundaries.
Line 1610: if a high surrogate has been retained and ch is a valid low surrogate, there are two cases:
1. The encoding does not support the multibyte char: escape it as a numeric character reference.
2. The encoding does support the multibyte char: output the two chars as is.
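
The two cases above can be illustrated with a small standalone sketch. It is not the patch and does not use Xalan's classes: `SplitSurrogateSketch`, `pendingHigh`, `characters` and `result` are hypothetical names, and the JDK's `CharsetEncoder.canEncode` stands in for `m_encodingInfo.isInEncoding`. Throwing on an unpaired surrogate is a placeholder policy assumed for the sketch; the patch writes a numeric reference in that case.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical stand-in for the serializer's character loop; not Xalan code. */
public class SplitSurrogateSketch {
    private final CharsetEncoder encoder;                  // plays the role of m_encodingInfo
    private final StringBuilder out = new StringBuilder(); // plays the role of the writer
    private char pendingHigh = 0;                          // plays the role of m_highUTF16Surrogate

    public SplitSurrogateSketch(Charset charset) {
        this.encoder = charset.newEncoder();
    }

    /** Processes one buffer; may be called repeatedly with consecutive chunks. */
    public void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char ch = chars[i];
            if (pendingHigh != 0) {
                // A high surrogate was retained, possibly from a previous buffer.
                if (!Character.isLowSurrogate(ch)) {
                    // Placeholder policy (an assumption of this sketch).
                    throw new IllegalStateException("unpaired high surrogate");
                }
                writePair(pendingHigh, ch);
                pendingHigh = 0;
            } else if (Character.isHighSurrogate(ch)) {
                // Retain it: the low surrogate may only arrive in the next call.
                pendingHigh = ch;
            } else if (Character.isLowSurrogate(ch)) {
                throw new IllegalStateException("unpaired low surrogate");
            } else if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append((int) ch).append(';');
            }
        }
    }

    private void writePair(char high, char low) {
        String pair = String.valueOf(new char[] {high, low});
        if (encoder.canEncode(pair)) {
            // Case 2: the encoding supports the character, write the chars as is.
            out.append(pair);
        } else {
            // Case 1: escape the full code point as a numeric character reference.
            out.append("&#").append(Character.toCodePoint(high, low)).append(';');
        }
    }

    public String result() {
        return out.toString();
    }
}
```

Feeding U+1F600 split across two calls (`'\uD83D'` at the end of one buffer, `'\uDE00'` at the start of the next) yields `&#128512;` with US-ASCII and the unchanged pair with UTF-8.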

int codepoint = Encodings.toCodePoint(m_highUTF16Surrogate, ch);
writer.write("&#");
writer.write(Integer.toString(codepoint));
writer.write(';');
} else {
writer.write(m_highUTF16Surrogate);
writer.write(ch);
}
lastDirtyCharProcessed = i;
m_highUTF16Surrogate = 0;
}
else {
// This is a fallback plan, we get here if the
// encoding doesn't contain ch and it's not part
// of a surrogate pair
// The right thing is to write out an entity
if(m_highUTF16Surrogate != 0) {
Author
This is probably not a valid scenario and should throw an error. I am not certain what the correct way is to handle a high surrogate that is not followed by a low one.

Author
Btw, I did encounter this scenario, which is why I coded this. In theory, though, it should not happen.
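
For context on this open question: the XML 1.0 `Char` production excludes the surrogate range U+D800 to U+DFFF, and a character reference must refer to a character matching `Char`, so a reference such as `&#55357;` is not well-formed XML. The sketch below shows that check plus one possible alternative, replacing an unpaired surrogate with U+FFFD. It is not what the patch does, and `UnpairedSurrogateSketch`, `isXmlChar` and `sanitize` are hypothetical names; throwing an error, as suggested above, would be another option.

```java
/** Hypothetical helper illustrating the unpaired-surrogate case; not Xalan code. */
public class UnpairedSurrogateSketch {

    /** XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Assumed policy: replace every unpaired surrogate with U+FFFD. */
    public static String sanitize(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt returns the surrogate's own value when it is unpaired.
            int cp = Character.codePointAt(text, i);
            boolean unpaired = cp >= 0xD800 && cp <= 0xDFFF;
            sb.appendCodePoint(unpaired ? 0xFFFD : cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```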

writer.write("&#");
writer.write(Integer.toString(m_highUTF16Surrogate));
writer.write(';');
m_highUTF16Surrogate = 0;
}

writeOutCleanChars(chars, i, lastDirtyCharProcessed);
writer.write("&#");
writer.write(Integer.toString(ch));