XML Tips and Tricks: Line Breaks in XML Documents

XML was designed as a universal format for storing and exchanging data. The idea was that any program could read and process an XML file, enabling data to be freely shared rather than kept hidden in proprietary ‘data silos’. A reasonable goal, but the lack of a single standard for displaying the contents of XML files has led to several different ways of interpreting the stored data. In this article we will explore how to insert a line break into XML text nodes. It is also possible to add line breaks to attribute values, but the XHTML specification recommends against this.

Line Breaks Across Operating Systems

As XML is designed as a universal format, it needs to work across all operating systems. Special characters are used to signify the end of a line, dating all the way back to the line feed (LF, ‘\n’, 0x0A) and carriage return (CR, ‘\r’, 0x0D) control characters defined in the ASCII standard. Early computer systems used the two together as CR+LF, primarily due to direct interaction with teletype terminals. This convention has carried over to modern Windows-based systems. However, Unix-like systems such as Linux and Mac OS X use just the line feed character, while version 9 and earlier of the Mac operating system used the carriage return as their newline character.
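To see which convention a given file actually uses, the raw bytes can be counted directly. The following is a minimal sketch rather than a complete tool; the function name count_line_endings is made up for this illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count each newline convention in raw bytes (hypothetical helper)."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,                      # Windows style
        "LF": data.count(b"\n") - crlf,    # Unix style: LF not preceded by CR
        "CR": data.count(b"\r") - crlf,    # classic Mac OS style: CR not followed by LF
    }


sample = b"one\r\ntwo\nthree\rfour"
print(count_line_endings(sample))  # {'CRLF': 1, 'LF': 1, 'CR': 1}
```

Reading the file in binary mode matters here, because text mode in many languages silently translates line endings before your code sees them.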

Already we can see that handling the wide variety of newline characters may be an issue. Even worse, due to the nature of XML, how these line break characters are interpreted could vary wildly between individual programs. Luckily this issue was foreseen by the designers of XML, and so by default an XML parser normalizes every literal line break in a document (CR+LF or a lone CR) to ‘\n’, the line feed character. However, if you want to use a different character to indicate a line break in your XML document, it is possible to insert it using a numeric character reference. For instance, ‘\r’, the carriage return character, can be inserted using ‘&#13;’ (or ‘&#xD;’ in hexadecimal).
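The normalization rule is easy to check with any conforming parser. Here is a small sketch using Python's standard xml.etree.ElementTree module; the element name note and the helper text_of are invented for the example.

```python
import xml.etree.ElementTree as ET

# Literal line endings typed into the document, one per convention.
crlf_doc = b"<note>first\r\nsecond</note>"
cr_doc = b"<note>first\rsecond</note>"
# A carriage return written as a numeric character reference instead.
ref_doc = b"<note>first&#13;second</note>"


def text_of(doc: bytes) -> str:
    """Parse the document and return the root element's text (hypothetical helper)."""
    return ET.fromstring(doc).text


print(repr(text_of(crlf_doc)))  # 'first\nsecond'
print(repr(text_of(cr_doc)))    # 'first\nsecond'
print(repr(text_of(ref_doc)))   # 'first\rsecond'
```

The first two documents come out identical after parsing, while the character reference survives as a real carriage return because references are resolved after line ending normalization has taken place.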

Line Break Handling in Specific Programs

For the vast majority of programs, then, the normalized ‘\n’ line break in XML files should be correctly interpreted. This is certainly the case in C# and Java. As a quick aside, if you are programmatically generating XML files in C# I would highly recommend reading Microsoft’s article on indenting XML files using XmlDocument or XSL Transforms.

There are, of course, a number of programs that aren’t quite so nice; Flash and ActionScript are two notorious culprits for messing up line breaks. One way to get around Flash’s distaste for whitespace characters is to make use of ‘ignoreWhite’ when loading an XML document:

var objectName = new XML();
// Discard text nodes that contain only whitespace while parsing.
objectName.ignoreWhite = true;
objectName.load("fileToLoad.xml");

This will often fix weird formatting issues, such as double-spaced line breaks. If you find that Flash isn’t rendering line breaks in your XML file at all, then CDATA sections may be the solution. These allow HTML tags to be embedded as literal text within an XML node, and can be used specifically to embed the HTML line break tag, for example: <![CDATA[First line<br/>Second line]]>
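The effect of a CDATA section can be checked outside Flash as well. The sketch below uses Python's standard xml.etree.ElementTree module and a made-up caption element to compare the same content with and without CDATA; it illustrates the parsing behaviour only and is not taken from any Flash documentation.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "<br/>" is plain text and reaches the application untouched.
with_cdata = ET.fromstring(
    "<caption><![CDATA[First line<br/>Second line]]></caption>"
)
# Without CDATA, "<br/>" is parsed as a child element of <caption>.
without_cdata = ET.fromstring("<caption>First line<br/>Second line</caption>")

print(with_cdata.text)     # First line<br/>Second line
print(len(with_cdata))     # 0
print(without_cdata.text)  # First line
print(len(without_cdata))  # 1
```

In the CDATA version the markup arrives as a single string, which is what an HTML-aware text field needs in order to render the break itself.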

When dealing with Flash, using numeric character references is also a valid tactic. References like ‘&#13;’ get ignored by the initial Flash parser, and hence are displayed properly when parsed by the browser.

Moving away from Flash, PHP is another language that sometimes has issues properly displaying newline characters. When outputting data from an XML document into HTML you might find it useful to make use of the ‘nl2br’ function, which inserts HTML line breaks before every newline character.
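For readers not working in PHP, the same idea is only a few lines in other languages. Below is a rough Python approximation of what ‘nl2br’ does; the names nl2br_like and text_node_to_html are invented for this sketch, and the escaping step is an extra assumption about how the text will be used rather than part of nl2br itself.

```python
import html
import re

# The newline sequences handled here: CR+LF, LF+CR, lone LF, lone CR.
_NEWLINE = re.compile(r"\r\n|\n\r|\n|\r")


def nl2br_like(text: str) -> str:
    """Insert '<br />' before every newline sequence, keeping the newline."""
    return _NEWLINE.sub(lambda match: "<br />" + match.group(0), text)


def text_node_to_html(text: str) -> str:
    """Escape XML text for HTML output, then add visible line breaks."""
    return nl2br_like(html.escape(text))


print(repr(nl2br_like("one\ntwo\r\nthree")))  # 'one<br />\ntwo<br />\r\nthree'
print(repr(text_node_to_html("a < b\nc")))    # 'a &lt; b<br />\nc'
```

Escaping before inserting the breaks keeps any ‘<’ or ‘&’ in the original text from being read as markup, while the added tags remain real HTML.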

Conclusion

Hopefully this article has given a good overview of how to handle line breaks in XML files. If you have any questions, please feel free to leave a note in the comments section and I’ll try to answer them.