XML, Java, Unicode, and the See-No-Evil Monkey

May 21, 2015

The CoreMedia CMS stores a lot of data in XML: rich text, configuration options, page attributes. XML is a mature format, and its support for the full range of Unicode characters comes in handy when managing sites throughout the world. Because the backend is developed in Java, we rely heavily on the XML processing facilities built into Java.

Enter the see-no-evil monkey, or rather its Unicode incarnation. It is joined by its fellow Unicode characters that did not fit into the base plane of 65,536 code points, such as a number of Chinese characters: the so-called supplementary characters. The problem is that the Xerces XML parser built into Java has a bug when handling supplementary characters.

Identified as JDK-8058175, the bug causes random characters to be inserted when a supplementary character is encountered in an attribute value. This is more than an annoyance, such as a comment padded with junk characters just because the user chose to include an emoji. It can actually be a security problem, because the inserted characters stem from an uncleared buffer, which might contain secret information or data for a cross-site scripting (XSS) attack.

The bug will be fixed in JDK 9, but that is not available yet, and it will take a long time before we can discontinue support for older JDKs on all platforms. The bug has long been fixed in current Xerces versions, but replacing the Xerces built into the JDK with a newer version is notoriously tricky, especially when running in application servers, which tend to have their own opinion about the class loading order. You may want to have a look at this nice Stack Overflow question for the problem and a general idea of why we do not want to tweak the Xerces version for every installation.

So we had to develop a workaround. Because the bug is hidden deep inside Xerces, we can only preprocess the XML file to avoid the erroneous behavior. At its core, the workaround is deceptively simple: replace each supplementary character with the equivalent numeric character reference, which Xerces happens to process without problems.

if (escape && Character.isSupplementaryCodePoint(currentCodePoint)) {
  output.append("&#").append(currentCodePoint).append(";");
} else {
  output.appendCodePoint(currentCodePoint);
}
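To see the replacement in isolation, here is a minimal sketch that applies it to every supplementary code point of a string, without any context check. The class and method names are made up for this illustration and are not taken from the actual workaround.

```java
final class SupplementaryEscaper {

    /**
     * Replaces every supplementary code point (above U+FFFF) with a decimal
     * character reference such as "&#128584;". Hypothetical helper: it does
     * not check whether escaping is appropriate at the given position.
     */
    static String escapeSupplementary(String input) {
        StringBuilder output = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // codePointAt joins a surrogate pair into a single code point.
            int currentCodePoint = input.codePointAt(i);
            if (Character.isSupplementaryCodePoint(currentCodePoint)) {
                output.append("&#").append(currentCodePoint).append(';');
            } else {
                output.appendCodePoint(currentCodePoint);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(currentCodePoint);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        // U+1F648 (see-no-evil monkey) is the surrogate pair D83D DE48.
        System.out.println(escapeSupplementary("monkey: \uD83D\uDE48"));
        // prints "monkey: &#128584;"
    }
}
```

Applied blindly to a whole document, this would also rewrite comments, CDATA sections and tag names, which is why the position has to be taken into account.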

The difficulty is, of course, to determine whether supplementary characters need escaping at a given position in an input stream. Escaping would be unnecessary in a comment and incorrect in a tag name. That means that we have to parse an XML file at least to the level where it is possible to determine whether the character currently being processed belongs to an attribute value. The XML specification is restrictive enough to make just that distinction by keeping track of the current type of grammatical object (comment, CDATA section, tag, …) and looking for a small number of delimiting character sequences. A hand-written parser with a finite lookahead will do.
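The idea of tracking the grammatical context can be sketched as a small state machine. This is a simplified illustration, not the parser from the actual workaround: it works on a complete string instead of a stream, all names are hypothetical, and it assumes well-formed input without a DOCTYPE declaration (whose quoted literals and internal subset would need states of their own).

```java
final class AttributeValueEscaper {

    private enum State { TEXT, COMMENT, CDATA, PI, TAG, ATTR_VALUE }

    /** Escapes supplementary code points only inside attribute values. */
    static String escape(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        State state = State.TEXT;
        char quote = 0;                 // quote character of the current attribute value
        int i = 0;
        while (i < xml.length()) {
            String delimiter = null;    // delimiter found at position i, if any
            State next = state;
            switch (state) {
                case TEXT:
                    // Longest delimiters first: "<!--" and "<?" also start with "<".
                    if (xml.startsWith("<!--", i)) { delimiter = "<!--"; next = State.COMMENT; }
                    else if (xml.startsWith("<![CDATA[", i)) { delimiter = "<![CDATA["; next = State.CDATA; }
                    else if (xml.startsWith("<?", i)) { delimiter = "<?"; next = State.PI; }
                    else if (xml.startsWith("<", i)) { delimiter = "<"; next = State.TAG; }
                    break;
                case COMMENT:
                    if (xml.startsWith("-->", i)) { delimiter = "-->"; next = State.TEXT; }
                    break;
                case CDATA:
                    if (xml.startsWith("]]>", i)) { delimiter = "]]>"; next = State.TEXT; }
                    break;
                case PI:
                    if (xml.startsWith("?>", i)) { delimiter = "?>"; next = State.TEXT; }
                    break;
                case TAG: {
                    char c = xml.charAt(i);
                    if (c == '>') { delimiter = ">"; next = State.TEXT; }
                    else if (c == '"' || c == '\'') { quote = c; delimiter = String.valueOf(c); next = State.ATTR_VALUE; }
                    break;
                }
                case ATTR_VALUE:
                    if (xml.charAt(i) == quote) { delimiter = String.valueOf(quote); next = State.TAG; }
                    break;
            }
            if (delimiter != null) {
                // Copy the delimiter as a unit so its characters are not rescanned.
                out.append(delimiter);
                i += delimiter.length();
                state = next;
                continue;
            }
            int cp = xml.codePointAt(i);
            if (state == State.ATTR_VALUE && Character.isSupplementaryCodePoint(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The longest delimiter checked is "<![CDATA[", so a streaming variant would need a lookahead of at most nine characters.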

Now the changed XML file has to be presented to Xerces in a convenient way. This is done by a modified SAX InputSource, which hides the original stream and always returns a corrected character stream to the XML parser. The XmlStreamReader from the Apache Commons IO package came in handy to infer the encoding of byte streams, which is normally also done by Xerces, but which has to be moved into the InputSource to be able to detect supplementary characters in arbitrary encodings.
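The following sketch shows the general shape of such an InputSource. It is not the code of the actual workaround: the class name and the preprocessor parameter are hypothetical, the whole document is buffered in memory for brevity, and the input is assumed to be an already decoded character stream, so the encoding detection described above is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class PreprocessingInputSource extends InputSource {

    private final Reader original;
    private final UnaryOperator<String> preprocessor; // hypothetical hook for the escaping step
    private Reader corrected;

    PreprocessingInputSource(Reader original, UnaryOperator<String> preprocessor) {
        this.original = original;
        this.preprocessor = preprocessor;
    }

    /** The parser only ever sees the preprocessed text. */
    @Override
    public Reader getCharacterStream() {
        if (corrected == null) {
            corrected = new StringReader(preprocessor.apply(readAll(original)));
        }
        return corrected;
    }

    /** Hide any byte stream so the parser cannot bypass the preprocessing. */
    @Override
    public InputStream getByteStream() {
        return null;
    }

    static String readAll(Reader reader) {
        try {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a b=\"x\uD83D\uDE48y\"/>";
        // Stand-in preprocessor for this demo: replaces one known character.
        InputSource source = new PreprocessingInputSource(
                new StringReader(xml), s -> s.replace("\uD83D\uDE48", "&#128584;"));
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        // The parser resolves the character reference back to the original character.
        System.out.println(doc.getDocumentElement().getAttribute("b").equals("x\uD83D\uDE48y"));
    }
}
```

Because the class extends InputSource, it can be passed to any API that accepts one, which is what makes a drop-in replacement possible.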

The final result is the FullUnicodeInputSource, which is a drop-in replacement for the original SAX InputSource. It is available in source form in a GitHub repository at https://github.com/okummer/FullUnicodeInputSource for your convenience. Though it is set up as a Maven project, we do not provide a pre-built release at this early point.

On a more general level, it is worth remembering that a char in Java is not a character. It used to be when Java was invented, but today it just isn’t. It is a single code unit of the UTF-16 representation of a character string. Still, Java has had a lot of support for handling all of modern Unicode ever since JSR-204 took care of the problem. It is worth a closer look.
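A few lines are enough to show the difference between chars and characters. The class and helper names in this small demo are made up; the methods called on String and Character are the standard code point API.

```java
final class CodePointDemo {

    /** Counts characters (code points) rather than UTF-16 code units. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String monkey = "\uD83D\uDE48"; // U+1F648, the see-no-evil monkey

        System.out.println(monkey.length());                            // 2 chars
        System.out.println(countCodePoints(monkey));                    // 1 character
        System.out.println(Character.isHighSurrogate(monkey.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(monkey.charAt(1)));  // true
        System.out.println(Integer.toHexString(monkey.codePointAt(0)));  // 1f648
    }
}
```

Code that iterates with charAt and length sees two surrogates here, while code that iterates with codePointAt and Character.charCount sees one character.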

So all is well? The much nicer solution would be to get the fix for the original bug included in the maintenance releases of previous Java versions. That fix would be quite literally one thousandth of the size of the workaround. But until that time, we cannot play see-no-evil monkey and pretend the problem is not there. Nor can we refuse to listen and hush things up, like the hear-no-evil monkey and the speak-no-evil monkey that might suddenly pop up in XML attributes when their sibling is being processed.
