{"id":1179,"date":"2011-01-07T16:16:05","date_gmt":"2011-01-07T21:16:05","guid":{"rendered":"http:\/\/www.lonhosford.com\/lonblog\/?p=1179"},"modified":"2015-07-31T18:25:51","modified_gmt":"2015-07-31T23:25:51","slug":"php-simplexml-load-xml-file-preserve-cdata-remove-whitespace-between-nodes-and-return-json","status":"publish","type":"post","link":"https:\/\/www.lonhosford.com\/lonblog\/2011\/01\/07\/php-simplexml-load-xml-file-preserve-cdata-remove-whitespace-between-nodes-and-return-json\/","title":{"rendered":"PHP SimpleXML: Load XML File, Preserve CDATA, Remove Whitespace Between Nodes and Return JSON"},"content":{"rendered":"<p>I needed to create an indented, easy-to-read XML file that a novice website owner could edit. This meant indenting the nodes so the user could find them easily. The XML file required CDATA nodes. <img loading=\"lazy\" decoding=\"async\" alt=\"\" src=\"http:\/\/lh5.ggpht.com\/_e5pwU0LJbN8\/TSeCQ4dqa0I\/AAAAAAAAFw8\/gBzNFjP-GME\/s800\/cdata_strip_white_space_published.png\" class=\"alignleft\" width=\"400\" height=\"242\" \/>I also wanted each CDATA node on its own line, again so the user could easily find it. Finally, I needed to send the loaded XML data to the client side in JSON format. <\/p>\n<p> I chose the <a href=\"http:\/\/www.php.net\/manual\/en\/function.simplexml-load-file.php\" target = \"_blank\">SimpleXML simplexml_load_file function<\/a> to handle the XML file and data. The problem with simplexml_load_file is that, with its default options, the content of CDATA nodes is missing once the result is converted to JSON. This is not immediately clear when reading the documentation, although the user comments do mention the CDATA issue. <\/p>\n<p>The documentation also provides a hint should you decide to explore the <a href=\"http:\/\/www.php.net\/manual\/en\/libxml.constants.php\"  target = \"_blank\">simplexml_load_file options argument<\/a>. 
Here you see the LIBXML_NOCDATA value, which includes this explanation: &#8220;Merge CDATA as text nodes.&#8221; With this option, the CDATA section is merged into the text content of the node that contains it. Thus <\/p>\n<pre class=\"brush: xml; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n&lt;info&gt;\r\n\t&lt;![CDATA[&lt;b&gt;Help&lt;\/b&gt;]]&gt;\r\n&lt;\/info&gt;\r\n<\/pre>\n<p>becomes<\/p>\n<pre class=\"brush: xml; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n&lt;info&gt;\r\n\t&lt;b&gt;Help&lt;\/b&gt;\r\n&lt;\/info&gt;\r\n<\/pre>\n<p>This is fine since the data returning to the client is in JSON format. Here is how the solution would look:<\/p>\n<pre class=\"brush: php; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n$xml = simplexml_load_file($xml_filename, 'SimpleXMLElement', LIBXML_NOCDATA);\r\n<\/pre>\n<p>This solution creates another problem. The tabs and newline characters between the containing node's tags and the CDATA section become part of the new text node and show up in the JSON. <\/p>\n<pre class=\"brush: xml; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n\\n\\t\\t\\t\\t&lt;b&gt;help&lt;\/b&gt;\r\n<\/pre>\n<p>Clearly, the whitespace used to indent nodes is not data. Since my goal was a nicely formatted XML file that a user could update manually, I needed to strip this added whitespace. 
<\/p>\n<p>The approach I selected was to use a regular expression to remove the whitespace characters before and after the &lt; and &gt; tag delimiters as follows:<\/p>\n<pre class=\"brush: php; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n&#039;~\\s*(&lt;([^&gt;]*)&gt;[^&lt;]*&lt;\/\\2&gt;|&lt;[^&gt;]*&gt;)\\s*~&#039;,&#039;$1&#039;\r\n<\/pre>\n<p>One caveat is that if you decide to include HTML in the CDATA node, the regular expression also removes the whitespace before and after those tags, so words next to an inline tag run together.<\/p>\n<p>Also, this needs to be applied before converting to SimpleXML. So I used <a href=\"http:\/\/php.net\/manual\/en\/function.file-get-contents.php\" target = \"_blank\">file_get_contents<\/a> to load the XML data file, applied the regular expression and then converted the result to a SimpleXML object. <\/p>\n<p>The completed PHP script is as follows:<\/p>\n<pre class=\"brush: php; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n&lt;?php\r\n\/\/ The XML data file with whitespace such as tabs\r\n$xml_file = &quot;user_formatted_xml_data.xml&quot;;\r\n\/\/ Load xml data.\r\n$xml = file_get_contents($xml_file);\r\n\/\/ Strip whitespace between xml tags\r\n$xml = preg_replace(&#039;~\\s*(&lt;([^&gt;]*)&gt;[^&lt;]*&lt;\/\\2&gt;|&lt;[^&gt;]*&gt;)\\s*~&#039;,&#039;$1&#039;,$xml);\r\n\/\/ Parse the XML, merging CDATA sections into text nodes.\r\n$xml = simplexml_load_string($xml,&#039;SimpleXMLElement&#039;, LIBXML_NOCDATA);\r\n\/\/ Return JSON.\r\necho json_encode($xml);\r\n?&gt;\r\n<\/pre>\n<p>Here is a sample data file to try.<\/p>\n<pre class=\"brush: xml; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;\r\n&lt;news_items&gt;\r\n\t&lt;news_item&gt;\r\n\t\t&lt;title&gt;\r\n\t\t\t&lt;![CDATA[News item title #1 with &lt;br\/&gt;html 
added.]]&gt;\r\n\t\t&lt;\/title&gt;\r\n\t\t&lt;content&gt;\r\n\t\t\t&lt;![CDATA[This is the &lt;b&gt;CDATA content&lt;\/b&gt; for news item&lt;strong&gt; #1 &lt;\/strong&gt; with some &quot;html&quot; included.]]&gt;\r\n\t\t&lt;\/content&gt;\r\n\t&lt;\/news_item&gt;\r\n\t&lt;news_item&gt;\r\n\t\t&lt;title&gt;\r\n\t\t\t&lt;![CDATA[News item title #2 with &lt;br\/&gt;html added.]]&gt;\r\n\t\t&lt;\/title&gt;\r\n\t\t&lt;content&gt;\r\n\t\t\t&lt;![CDATA[This is the &lt;b&gt;CDATA content&lt;\/b&gt; for news item&lt;strong&gt; #2 &lt;\/strong&gt; with some &quot;html&quot; included.]]&gt;\r\n\t\t&lt;\/content&gt;\r\n\t&lt;\/news_item&gt;\r\n&lt;\/news_items&gt;\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I needed to create an indented easy to read and edit XML file that a novice web owner could edit. This entailed having the nodes nicely indented so the user could find them easily. The XML file required CDATA nodes. 
The CDATA nodes I wanted on a separate line again so the user could easily [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[61,57,62],"class_list":["post-1179","post","type-post","status-publish","format-standard","hentry","category-general","tag-json","tag-php-2","tag-xml"],"_links":{"self":[{"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/posts\/1179","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/comments?post=1179"}],"version-history":[{"count":26,"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/posts\/1179\/revisions"}],"predecessor-version":[{"id":3693,"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/posts\/1179\/revisions\/3693"}],"wp:attachment":[{"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/media?parent=1179"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/categories?post=1179"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lonhosford.com\/lonblog\/wp-json\/wp\/v2\/tags?post=1179"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}