UTF16 buffer boundary XALANJ-2725 #166
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -175,6 +175,10 @@ abstract public class ToStream extends SerializerBase | |
| */ | ||
| private boolean m_expandDTDEntities = true; | ||
|
|
||
| + /** | ||
| + * Track multibyte character in order to serialize when the whole byte sequence is available. | ||
| + */ | ||
| + private char m_highUTF16Surrogate = 0; | ||
|
|
||
| /** | ||
| * Default constructor | ||
|
|
@@ -1595,23 +1599,40 @@ else if (m_encodingInfo.isInEncoding(ch)) { | |
| // not in the normal ASCII range, we also | ||
| // just leave it get added on to the clean characters | ||
| } | ||
| - else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && Encodings.isLowUTF16Surrogate(chars[i+1])) { | ||
| - // So, this is a (valid) surrogate pair | ||
| - if (! m_encodingInfo.isInEncoding(ch, chars[i+1])) { | ||
| - int codepoint = Encodings.toCodePoint(ch, chars[i+1]); | ||
| - writeOutCleanChars(chars, i, lastDirtyCharProcessed); | ||
| - writer.write("&#"); | ||
| - writer.write(Integer.toString(codepoint)); | ||
| - writer.write(';'); | ||
| - lastDirtyCharProcessed = i+1; | ||
| - } | ||
| - i++; // skip the low surrogate, too | ||
| + else if (Encodings.isHighUTF16Surrogate(ch)) { | ||
| + // Store for later processing. We may be at the end of a buffer, | ||
| + // and must wait till low surrogate arrives | ||
| + // before we can do anything with this. | ||
| + writeOutCleanChars(chars, i, lastDirtyCharProcessed); | ||
| + m_highUTF16Surrogate = ch; | ||
| + lastDirtyCharProcessed = i; | ||
| } | ||
| + else if (m_highUTF16Surrogate != 0 && Encodings.isLowUTF16Surrogate(ch)) { | ||
| + // The complete utf16 byte sequence is now available and may be serialized. | ||
| + if (! m_encodingInfo.isInEncoding(m_highUTF16Surrogate, ch)) { | ||
| + int codepoint = Encodings.toCodePoint(m_highUTF16Surrogate, ch); | ||
| + writer.write("&#"); | ||
| + writer.write(Integer.toString(codepoint)); | ||
| + writer.write(';'); | ||
| + } else { | ||
| + writer.write(m_highUTF16Surrogate); | ||
| + writer.write(ch); | ||
| + } | ||
| + lastDirtyCharProcessed = i; | ||
| + m_highUTF16Surrogate = 0; | ||
| + } | ||
| else { | ||
| // This is a fallback plan, we get here if the | ||
| // encoding doesn't contain ch and it's not part | ||
| // of a surrogate pair | ||
| // The right thing is to write out an entity | ||
| + if(m_highUTF16Surrogate != 0) { | ||
|
Author
This is probably not a valid scenario and should throw an error. I am not sure what the correct way is to handle a high surrogate that is not followed by a low one.
Author
That said, I did encounter this scenario, which is why I coded this branch. In theory it should not happen.
||
| + writer.write("&#"); | ||
| + writer.write(Integer.toString(m_highUTF16Surrogate)); | ||
| + writer.write(';'); | ||
| + m_highUTF16Surrogate = 0; | ||
| + } | ||
|
|
||
| writeOutCleanChars(chars, i, lastDirtyCharProcessed); | ||
| writer.write("&#"); | ||
| writer.write(Integer.toString(ch)); | ||
|
|
||
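On the author's open question about a high surrogate that is not followed by a low one: the XML 1.0 `Char` production excludes the surrogate range U+D800 to U+DFFF, so a reference such as `&#55357;` is not a legal character reference, which supports the instinct to report an error. Below is a minimal sketch of how such input could be detected up front. The class and method names (`SurrogateCheck`, `firstUnpairedSurrogate`, `requireWellFormed`) are hypothetical and not part of Xalan, and throwing is only one possible policy.

```java
/** Hypothetical helper for illustration; not part of the Xalan serializer. */
class SurrogateCheck {

    /** Returns the index of the first unpaired surrogate code unit, or -1 if none. */
    static int firstUnpairedSurrogate(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isHighSurrogate(ch)) {
                boolean hasLow = i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1));
                if (!hasLow) {
                    return i; // high surrogate with no low half after it
                }
                i++; // valid pair: skip the low half
            } else if (Character.isLowSurrogate(ch)) {
                return i; // low surrogate with no high half before it
            }
        }
        return -1;
    }

    /** Strict policy (an assumption, not the patch's behavior): reject malformed input. */
    static void requireWellFormed(CharSequence text) {
        int bad = firstUnpairedSurrogate(text);
        if (bad >= 0) {
            throw new IllegalArgumentException(
                    "Unpaired surrogate 0x" + Integer.toHexString(text.charAt(bad))
                            + " at index " + bad);
        }
    }
}
```

This check only works on text that is fully available. In a streaming serializer the same decision has to be made when the next buffer arrives or when the document ends.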
1602: If ch is a high surrogate, write out the pending clean chars and retain ch. This works both within a single buffer and across buffer boundaries.
1610: If a high surrogate has been retained and ch is a valid low surrogate, there are two cases:
1. The encoding does not support the character: write it as a numeric character reference.
2. The encoding does support the character: write both chars as is.
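The two steps described above can be sketched in isolation as follows. This is a simplified stand-in, not the actual `ToStream` code: `ChunkedEscaper`, `pendingHigh`, and `result()` are hypothetical names, a `StringBuilder` replaces the writer, `CharsetEncoder.canEncode` replaces `m_encodingInfo.isInEncoding`, and ordinary XML escaping of `<` and `&` is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the serializer loop; not the actual ToStream code. */
class ChunkedEscaper {
    private final StringBuilder out = new StringBuilder();
    private final CharsetEncoder encoder;
    // Retained high surrogate, 0 when none (plays the role of m_highUTF16Surrogate).
    private char pendingHigh = 0;

    ChunkedEscaper(Charset charset) {
        this.encoder = charset.newEncoder();
    }

    /** Processes one buffer; a pair may start in one call and finish in the next. */
    void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char ch = chars[i];
            if (pendingHigh != 0) {
                char high = pendingHigh;
                pendingHigh = 0;
                if (Character.isLowSurrogate(ch)) {
                    writePair(high, ch);
                    continue;
                }
                // Unpaired high surrogate: escaped as a single code unit,
                // as the patch's fallback branch does.
                writeReference(high);
            }
            if (Character.isHighSurrogate(ch)) {
                pendingHigh = ch; // wait for the low half, possibly in the next buffer
            } else if (encoder.canEncode(ch)) {
                out.append(ch); // canEncode(char) is false for a lone surrogate
            } else {
                writeReference(ch);
            }
        }
    }

    private void writePair(char high, char low) {
        String pair = new String(new char[] {high, low});
        if (encoder.canEncode(pair)) {
            out.append(pair); // case 2: the encoding supports the character
        } else {
            writeReference(Character.toCodePoint(high, low)); // case 1: escape it
        }
    }

    private void writeReference(int codePoint) {
        out.append("&#").append(codePoint).append(';');
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        ChunkedEscaper escaper = new ChunkedEscaper(StandardCharsets.US_ASCII);
        escaper.characters(new char[] {'a', '\uD83D'}, 0, 2); // buffer ends on the high half
        escaper.characters(new char[] {'\uDE00', 'b'}, 0, 2); // low half arrives later
        System.out.println(escaper.result()); // prints a&#128512;b
    }
}
```

Unlike the patch, this sketch checks for a retained high surrogate before looking at the current char, so the retained value is resolved no later than the next char. A high surrogate that is the very last char of the input stays pending in this sketch, and a real serializer would need to decide what to do with it at the end of the document.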