XML Parsing using PHP {Intermediate}. Data Truncation



adamd
December 12th, 2004, 08:19 PM
I created a PHP XML parser based on the "XML Parsing using PHP {Intermediate}" tutorial written by Jubba. It works great except for one flaw that I can't seem to figure out.

If one of the tags contains one of the five predefined internal entities in XML (&amp; in the case I'm testing), the data gets truncated.

So if I have:

<headline>Headline part A &amp; Headline part B</headline>

The parser only returns:
Headline part B

It should return:
Headline part A & Headline part B

Has anyone else had this problem?

TNezvigin
December 12th, 2004, 08:36 PM
A bare & is an illegal character in XML.

You need to escape the & as &amp;.

adamd
December 12th, 2004, 08:42 PM
As you will note from my example, the entity is properly escaped. The XML being read by the parser is valid, so the issue lies in the PHP code.

I should also note that if I escape & using &#38; I get the same truncation results.

TNezvigin
December 13th, 2004, 12:11 AM
Can you post the complete code? I just took a look at the tutorial, and it doesn't have 2 parts to a story -- which would mean you added to the array?

adamd
December 13th, 2004, 12:24 AM
By "part" I meant to distinguish between the text that comes before the ampersand and the text that comes after, meaning: text before & text after.

The problem I'm having is that if any data inside an element contains an ampersand, data truncation occurs.

So if the data inside an XML element contains " text before & text after." the output string will be "text after." and not " text before & text after." as you would expect.

For some reason the PHP variable that has been assigned to store the data of the element in question only contains "text after." which leads me to believe that the xml parser is choking somehow.

TNezvigin
December 13th, 2004, 12:42 AM
Have you tried making sure that $story_array[$x]->headline is not itself an array?

I think that the XML parser intentionally splits the data into an array if it is an ampersand because it does not consider it to be a single string element (due to the ampersand).

Try changing this line:

$story_array[$counter]->headline = $data;

to:

$story_array[$counter]->headline .= $data;

adamd
December 13th, 2004, 12:55 AM
$story_array[$counter]->headline = $data;

to:

$story_array[$counter]->headline .= $data;

That didn't work.

When I do:

count($story_array[$x]->headline)

1 is returned for all headlines, even the one I placed an "&" in.

And now for the strange part.
If I place an "&" in the DESCRIPTION tag, it doesn't truncate the data, but produces this instead:


<h2>Bigfoot Spoted at M.I.T. Dining Area</h2>
<br />
<i>The beast was seen ordering a Snapple in the dining area on Tuesday </i>
<h2></h2>

<br />
<i>&</i>
<h2></h2>
<br />
<i> In a related story, Kirupa Chinnathambi, an MIT engineering student has been reported missing.</i>


--TNezvigin, thanks for your effort trying to solve this issue with me.

TNezvigin
December 13th, 2004, 01:04 AM
NP. It's the best way to learn.. help others.. that way when I run into it, I shouldn't have a prob ;)

While I'm searching, can you try replacing the Amp with another entity? Such as:
&lt; or &quot;

What results do you get when you do that?

adamd
December 13th, 2004, 01:13 AM
Same results as & for all XML entities. I think you're right in that PHP chokes on these entities and thinks they're not in the same element.

I guess I should mention that I'm running PHP 4.3.4. I have looked all over the net and have not found a solution.

TNezvigin
December 13th, 2004, 01:15 AM
I'm off to bed.. last thing I can suggest is trying:

$data = str_replace("&amp;", "&", $data);

If that doesn't work, just dumb it down and move up variable by variable by doing:

echo "<script>alert('" . addslashes($variable) . "')</script>";

until you get to the top. Eventually, you should notice one variable giving the right info and another giving the wrong info.. in between those is where the problem is.

If the problem ends up being in the parser, that could only mean:
a) XML was done improperly
b) It's a bug (highly unlikely)
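
A minimal sketch of that variable-by-variable check, applied at the lowest level: log every chunk of text the parser hands to the character-data handler. The wrapping <news> element and the $chunks variable are made up for illustration, and the closure syntax assumes PHP 5.3 or later (on PHP 4 a named function with a global would play the same role). How many chunks you see can depend on the PHP build, so the point is only whether the text arrives in more than one piece.

```php
<?php
// Hypothetical sample document; only the <headline> element is taken from the thread.
$xml = '<news><headline>Headline part A &amp; Headline part B</headline></news>';

// Every piece of character data the parser delivers, in order.
$chunks = array();

$parser = xml_parser_create();

// Record each chunk instead of assigning it straight to a story field.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$chunks) {
    $chunks[] = $data;
});

xml_parse($parser, $xml, true);

// Show each chunk in brackets so leading and trailing spaces stay visible.
foreach ($chunks as $i => $chunk) {
    echo $i . ': [' . $chunk . "]\n";
}

// Joining the chunks should give back the full headline text.
echo 'joined: ' . implode('', $chunks) . "\n";
```

If more than one chunk is listed for a single headline, a handler that does a plain assignment will keep only the last one.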

Yeldarb
December 13th, 2004, 04:40 PM
Try using a PHP escape maybe
\&amp

adamd
December 14th, 2004, 09:18 PM
Yeldarb -- thanks for the suggestion, but it didn't work.

adamd
December 14th, 2004, 09:21 PM
If the problem ends up being in the parser, that could only mean:
a) XML was done improperly
b) It's a bug (highly unlikely)
The XML is done properly. If it weren't, the parser would throw an error. It also passes the "Mozilla test," meaning Mozilla handles the XML doc fine, with no errors.

iloveitaly
December 15th, 2004, 07:59 AM
I came across this problem last week. When the XML parser hits an entity sequence, it 'restarts' the parsing process, so the text arrives in pieces and the dataHandler() method is called again for each piece. There's no really easy way to fix this, but over a few jobs of XML work I've come up with a solid XML parsing class. It's attached to this post. I don't have time right now to explain exactly what you need to do to get this working, but here's an example of what I did to get around the problem. It's a class I made to parse a mailbox stored as XML; hopefully you will be able to figure out what I did.

class mailParser extends xmlParser { // just a simple XML parser class to display the mail messages
    var $dates;
    var $messages;
    var $senders;

    // dummy functions
    function doneParse(){}
    function endElementHandler($name){}

    // store the date and sender attributes of each message
    function startElementHandler($name, $attributes){
        if(isset($attributes['DATE'])) $this->dates[] = $attributes['DATE'];
        if(isset($attributes['FROM'])) $this->senders[] = $attributes['FROM'];
    }

    // store the data for all the messages
    function dataHandler($data){ // entities split the data into several calls, so this works around it
        if($this->curr_tag == "MESSAGE" && !$this->repeat) {
            // first chunk of a new message: start a new array element
            $this->messages[] = $data;
        } else if($this->curr_tag == "MESSAGE" && $this->repeat) {
            // still inside the same element: append to the last element in the array
            $this->messages[count($this->messages)-1] .= $data;
        }
    }
}

HTH!
Forgot to attach the parser class I created.
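
For anyone without the attached class, here is a rough sketch of the same idea using only the built-in xml_* functions: collect text in a buffer while an element is open and store it once the closing tag arrives. The <news>/<story> wrapper, the variable names and the array layout are assumptions made up for this example, not the tutorial's code or the attached class, and the closures assume PHP 5.3 or later.

```php
<?php
// Hypothetical sample document built around the headline from the first post.
$xml = '<news><story>'
     . '<headline>Headline part A &amp; Headline part B</headline>'
     . '<description>text before &amp; text after.</description>'
     . '</story></news>';

$stories = array();  // finished stories
$current = array();  // story currently being built
$buffer  = '';       // text collected for the element that is open right now

$parser = xml_parser_create();
// Keep tag names as written instead of folding them to upper case.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start tag: begin a fresh buffer, and a fresh story on <story>.
    function ($parser, $name, $attrs) use (&$buffer, &$current) {
        $buffer = '';
        if ($name === 'story') {
            $current = array();
        }
    },
    // End tag: only now is the element's text complete, so store it here.
    function ($parser, $name) use (&$buffer, &$current, &$stories) {
        if ($name === 'headline' || $name === 'description') {
            $current[$name] = $buffer;
        } elseif ($name === 'story') {
            $stories[] = $current;
        }
        $buffer = '';
    }
);

// Character data may arrive in several calls, so always append.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$buffer) {
    $buffer .= $data;
});

xml_parse($parser, $xml, true);

echo $stories[0]['headline'] . "\n";
echo $stories[0]['description'] . "\n";
```

Because nothing is stored until the end tag, it no longer matters how many times the data handler fires for one element.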

adamd
December 15th, 2004, 09:18 PM
Thanks for the code. I'll take a look at it. Could you post an example XML document that this class has been designed to parse? It will help me understand what you are doing in the class. Thanks.

iloveitaly
December 16th, 2004, 08:04 AM
Here ya go:

<?xml version="1.0" encoding="iso-8859-1"?>
<inbox>
<message date="12/10/04" from="Some dude">Some Message in the inbox</message>
</inbox>