Saturday, March 12, 2011

SimpleXMLElement and EntityRef XML parser

I am using the PHP class SimpleXMLElement to take care of parsing some XML data that I am sourcing from 3rd parties. It had been working well for a while, but I just discovered an error that was popping up frequently. This error was "XML parser error : EntityRef: expecting ';'".

This error comes about as a result of XML input data being improperly encoded. Two data sources I was using had encoded characters like "&", "<" and ">" as entities but left off the trailing semicolon. In other words, the ampersand had been encoded as "&amp" instead of "&amp;". SimpleXMLElement doesn't like this and throws a warning fest.
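If you want to confirm that this is the error you are hitting without the warnings flooding your output, something like the sketch below may help. It is not what I run in production: the sample XML string is made up, and it uses simplexml_load_string() with libxml's internal error collection rather than the SimpleXMLElement constructor, so that a failed parse returns false instead of throwing.

```php
<?php
// Hypothetical sample input: "&amp" is missing its trailing semicolon.
$xmldata = '<item><name>Fish &amp Chips</name></item>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

// simplexml_load_string() returns false when the input cannot be parsed.
$obj_xml = simplexml_load_string($xmldata);

// Gather the parser messages (libxml appends a newline to each one).
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}

// Reset the error buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($obj_xml === false) {
    echo "Parse failed:\n" . implode("\n", $messages) . "\n";
}
```

Collecting the messages this way also makes it easy to log which feed produced the bad data before deciding whether to repair it.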

To fix the problem, I added a line before calling SimpleXMLElement:
$xmldata = preg_replace('/&(amp|lt|gt)(?!;)/', '&$1;', $xmldata);
$obj_xml = new SimpleXMLElement($xmldata);

The preg_replace fixes the encoding problem by adding the missing semicolon for you. Just a note: this will only fix the encoding for the three characters I specified above ("&", "<" and ">"). If there are others that are causing problems, you'll have to add their entity names to the pattern in the first argument of preg_replace().
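If you do need to cover more entities, one option is to wrap the replacement in a small helper that takes the list of names. This is only a sketch: fix_bare_entities() and its $names parameter are names I made up for illustration, and it uses a negative lookahead, (?!;), so that a bare entity at the very end of the string, or two bare entities in a row, are also repaired. Like any pattern of this kind, it will also rewrite ordinary text that happens to start with an entity name (for example "&amplitude"), so check it against your own data first.

```php
<?php
/**
 * Append the missing semicolon to bare entity references.
 *
 * Hypothetical helper: $names lists the entity names to repair.
 */
function fix_bare_entities(string $xml, array $names = ['amp', 'lt', 'gt']): string
{
    // Escape each name so it is safe to embed in the pattern.
    $alternatives = implode('|', array_map(
        function ($name) {
            return preg_quote($name, '/');
        },
        $names
    ));

    // Match "&name" only when it is not already followed by ";".
    return preg_replace('/&(' . $alternatives . ')(?!;)/', '&$1;', $xml);
}

// Example: also repair the predefined XML entity "quot".
echo fix_bare_entities('<a>Q&ampA &quot x</a>', ['amp', 'lt', 'gt', 'quot']);
```

The five entities predefined by XML are amp, lt, gt, quot and apos, so those are the names most likely to be worth adding.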

Here is another blog/article that helped me discover the underlying issue.
