
PHP SimpleXML: Load XML File, Preserve CDATA, Remove Whitespace Between Nodes and Return JSON

I needed an indented, easy-to-read XML file that a novice website owner could edit by hand. That meant indenting the nodes so the user could find them easily. The file also required CDATA nodes, which I wanted on their own lines for the same reason. Finally, I needed to send the loaded XML data to the client side in JSON format.

I chose SimpleXML's simplexml_load_file function to handle the XML file and data. The problem with simplexml_load_file is that, by default, the content of CDATA nodes goes missing when the loaded object is converted, for example with json_encode. This is not immediately clear from the documentation, although you do find references to the CDATA issue in the user comments.

The documentation also provides a hint if you explore the options argument of simplexml_load_file. There you find the LIBXML_NOCDATA constant with this explanation: “Merge CDATA as text nodes.” With this option, the CDATA section is merged into the text of the element that contains it. Thus

<info>
	<![CDATA[<b>Help</b>]]>
</info>

becomes

<info>
	<b>Help</b>
</info>

This is fine since the data returning to the client is in JSON format. Here is how the solution would look:

$xml = simplexml_load_file($xml_filename, 'SimpleXMLElement', LIBXML_NOCDATA);
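To compare the two behaviors, here is a minimal sketch. The inline XML string and the variable names are illustrative assumptions, not part of the original script. It loads the same fragment with and without LIBXML_NOCDATA and prints the JSON for each. As far as I can tell, casting the element to a string returns the CDATA content either way, and only the JSON conversion differs.

```php
<?php
// Illustrative inline XML; a real script would load this from a file.
$source = '<root><info><![CDATA[<b>Help</b>]]></info></root>';

// Default load: the CDATA section stays a CDATA node.
$default = simplexml_load_string($source);

// LIBXML_NOCDATA: the CDATA section is merged into a text node.
$merged = simplexml_load_string($source, 'SimpleXMLElement', LIBXML_NOCDATA);

// Casting the element to a string returns the content in both cases.
$default_text = (string) $default->info;
$merged_text = (string) $merged->info;

// Print the JSON for both loads to compare them.
echo json_encode($default), "\n";
echo json_encode($merged), "\n";
```

If the string cast is all you need, the option is not required. It matters here because the whole object is passed to json_encode.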

This solution creates another problem. The tabs and newline characters between the containing node's tags and the CDATA section become part of the new text node, so they show up when converting to JSON.

\n\t\t\t\t<b>help</b>

Clearly, whitespace that only indents nodes is NOT data. Since my goal was a nicely formatted XML file that a user could update manually, the added whitespace had to be stripped.
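The effect can be reproduced with a small sketch. The inline XML below is an illustrative assumption, written with explicit \n and \t escapes so the indentation is visible in the source.

```php
<?php
// Illustrative indented XML, one tab per nesting level.
$source = "<root>\n\t<info>\n\t\t<![CDATA[<b>Help</b>]]>\n\t</info>\n</root>";

$xml = simplexml_load_string($source, 'SimpleXMLElement', LIBXML_NOCDATA);

// The text of <info> now includes the indentation around the CDATA section.
$text = (string) $xml->info;

// json_encode escapes the whitespace, which makes it visible.
echo json_encode($text), "\n"; // prints "\n\t\t<b>Help<\/b>\n\t"
```

Calling trim on a single value would clean it up, but the whole object is encoded in one step here, so the whitespace has to be dealt with before or during the conversion.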

The approach I selected was to use a regular expression with preg_replace to remove the whitespace characters before and after the < > tag delimiters, using the following pattern and replacement:

'~\s*(<([^>]*)>[^<]*</\2>|<[^>]*>)\s*~','$1'

One caveat: if you include HTML in a CDATA node, the regular expression also removes the whitespace before and after those HTML tags.
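A short sketch shows both the stripping and the caveat. The function name strip_whitespace_around_tags and the inline XML are hypothetical, introduced only for this illustration. The pattern is the one given above.

```php
<?php
// Hypothetical wrapper around the pattern from the article.
function strip_whitespace_around_tags($xml)
{
    return preg_replace('~\s*(<([^>]*)>[^<]*</\2>|<[^>]*>)\s*~', '$1', $xml);
}

// Indentation around the CDATA section, plus a space after </b> inside it.
$source = "<root>\n\t<info>\n\t\t<![CDATA[<b>Help</b> me]]>\n\t</info>\n</root>";

$stripped = strip_whitespace_around_tags($source);

// prints <root><info><![CDATA[<b>Help</b>me]]></info></root>
echo $stripped, "\n";
```

The indentation is gone, but so is the space between </b> and "me", because the pattern cannot tell an HTML tag inside CDATA from an XML tag.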

Also, this needs to be applied before converting to SimpleXML. So I used file_get_contents to load the XML data file, applied the regular expression, and then converted the result to a SimpleXML object.

The completed PHP script is as follows:

<?php
// The XML data file, formatted with whitespace such as tabs and newlines.
$xml_file = "user_formatted_xml_data.xml";
// Load the raw XML text.
$xml = file_get_contents($xml_file);
// Strip whitespace before and after XML tags.
$xml = preg_replace('~\s*(<([^>]*)>[^<]*</\2>|<[^>]*>)\s*~', '$1', $xml);
// Parse the XML, merging CDATA sections into text nodes.
$xml = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
// Return JSON.
echo json_encode($xml);
?>

Here is a sample data file to try.

<?xml version="1.0" encoding="UTF-8"?>
<news_items>
	<news_item>
		<title>
			<![CDATA[News item title #1 with <br/>html added.]]>
		</title>
		<content>
			<![CDATA[This is the <b>CDATA content</b> for news item<strong> #1 </strong> with some "html" included.]]>
		</content>
	</news_item>
	<news_item>
		<title>
			<![CDATA[News item title #2 with <br/>html added.]]>
		</title>
		<content>
			<![CDATA[This is the <b>CDATA content</b> for news item<strong> #2 </strong> with some "html" included.]]>
		</content>
	</news_item>
</news_items>
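
If the whitespace caveat for HTML inside CDATA matters, one alternative that the script above does not use is to skip the regular expression and trim each value after parsing. The following is only a sketch under that assumption. The helper xml_to_trimmed_array is hypothetical, the inline XML is a shortened stand-in for the sample file, and the array layout it produces (every child name maps to a list) differs from what json_encode($xml) returns.

```php
<?php
// Hypothetical helper (not part of the original script): walk the element
// tree and build nested arrays, trimming the text of each leaf element.
function xml_to_trimmed_array(SimpleXMLElement $node)
{
    $result = array();
    foreach ($node->children() as $name => $child) {
        // Group children by element name; repeated names become list entries.
        $result[$name][] = xml_to_trimmed_array($child);
    }
    // No child elements: this is a leaf, so return its trimmed text.
    // The string cast includes CDATA content, so LIBXML_NOCDATA is not used.
    if (count($result) === 0) {
        return trim((string) $node);
    }
    return $result;
}

// Shortened stand-in for the sample data file.
$source = "<news_items>\n\t<news_item>\n\t\t<title>\n\t\t\t"
    . "<![CDATA[Title with <br/> html added.]]>\n\t\t</title>\n\t"
    . "</news_item>\n</news_items>";

$data = xml_to_trimmed_array(simplexml_load_string($source));
echo json_encode($data), "\n";
```

Because only the leading and trailing whitespace of each value is trimmed, the space after <br/> inside the CDATA section survives. The trade-off is that the client has to handle the list-based layout.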