XML Parsing using PHP {Intermediate}
         by Jubba

Introduction
This tutorial continues my previous XML tutorial. Since that one already covers the background information (what little there is), I won't repeat it here, to save space and time. The other tutorial can be found at this link.

Formatting XML
Since the previous tutorial covered the basics of formatting XML data and of parsing XML with PHP, I'm going to jump right into the XML and PHP without much explanation. For this project I decided to create a mock news headline parser. Basically, we have an XML file that holds news headlines, each with a brief description of the story. Many of the news tickers that you see on websites work from a similar XML feed (RSS is a common format for this). Now, on to our XML file.

Creating our XML
Just as with the last tutorial, this XML file is quite simple. We have our highest-level "news" tags, which encase everything. The next level down is our "story" tags, which separate the individual news items, and contained within each of those are the "headline" tag and the "description" tag. See? Simple...

<?xml version="1.0"?>
<news>
<story>
<headline> Godzilla Attacks LA! </headline>
<description>Equipped with a Japanese Mind-control device, the giant monster has attacked important harbours along the California coast. President to take action. </description>
</story>
<story>
<headline> Bigfoot Spotted at M.I.T. Dining Area </headline>
<description>The beast was seen ordering a Snapple in the dining area on Tuesday. In a related story, Kirupa Chinnathambi, an MIT engineering student has been reported missing. </description>
</story>
<story>
<headline> London Angel Saves England </headline>
<description>The "London Angel" known only as "Kit" has saved the U.K. yet again. Reports have stated that she destroyed every single Churchill bobble-head dog in the country. A great heartfilled thank you goes out to her. </description>
</story>
<story>
<headline> Six-eyed Man to be Wed to an Eight-armed Woman </headline>
<description>Uhhhmmm... No comment really... just a little creepy to see them together... </description>
</story>
<story>
<headline> Ahmed's Birthday Extravaganza! </headline>
<description>The gifted youngster's birthday party should be a blast. He is turning thirteen and has requested a large cake, ice cream, and a petting zoo complete with pony rides. </description>
</story>
</news>

 

Creating our PHP
To make this easy on us, I will post the code I used and then explain what each line does after.

<?php

$xml_file = "xml_intermediate.xml";

$xml_headline_key = "*NEWS*STORY*HEADLINE";
$xml_description_key = "*NEWS*STORY*DESCRIPTION";

$story_array = array();

$counter = 0;

class xml_story{
    var $headline, $description;
}

function startTag($parser, $data){
    global $current_tag;
    $current_tag .= "*$data";
}

function endTag($parser, $data){
    global $current_tag;
    $tag_key = strrpos($current_tag, '*');
    $current_tag = substr($current_tag, 0, $tag_key);
}

function contents($parser, $data){
    global $current_tag, $xml_headline_key, $xml_description_key, $counter, $story_array;
    switch($current_tag){
        case $xml_headline_key:
            $story_array[$counter] = new xml_story();
            $story_array[$counter]->headline = $data;
            break;
        case $xml_description_key:
            $story_array[$counter]->description = $data;
            $counter++;
            break;
    }
}

$xml_parser = xml_parser_create();

xml_set_element_handler($xml_parser, "startTag", "endTag");

xml_set_character_data_handler($xml_parser, "contents");

$fp = fopen($xml_file, "r") or die("Could not open file");

$data = fread($fp, filesize($xml_file)) or die("Could not read file");

if(!(xml_parse($xml_parser, $data, feof($fp)))){
    die("Error on line " . xml_get_current_line_number($xml_parser));
}

xml_parser_free($xml_parser);

fclose($fp);

?>

<html>
<head>
<title>CNT HEADLINE NEWS</title>
</head>
<body bgcolor="#FFFFFF">
<?php
for($x=0;$x<count($story_array);$x++){
    echo "\t<h2>" . $story_array[$x]->headline . "</h2>\n";
    echo "\t\t<br />\n";
    echo "\t<i>" . $story_array[$x]->description . "</i>\n";
}
?>

</body>
</html>


 

This project needs a bit more setup than before. In addition to our 3 main functions:

-A function to handle the start tags
-A function to handle the data between the tags
-A function to handle the end tags

We need a few more things:

-Our XML tag keys
-An array to store information
-A counter
-A class

That's what the following explanations cover:

$xml_headline_key = "*NEWS*STORY*HEADLINE";
$xml_description_key = "*NEWS*STORY*DESCRIPTION";

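These are our tag keys. Each key spells out the full path of nested tags that leads to a piece of data in our XML file, with an asterisk in front of every tag name. The names are in upper case because PHP's XML parser converts element names to upper case by default (an option called case folding). Because the "news" and "story" tags don't hold any text of their own, we don't need separate keys for them; our main focus is on the "headline" and the "description" tags. We will use these keys later on in the script.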
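To see where a key such as "*NEWS*STORY*HEADLINE" comes from, here is a minimal sketch that records the path string each time an element opens. The names traceStart, traceEnd and $paths are made up for this illustration and are not part of the tutorial's script; the one-line XML string is a cut-down stand-in for the real file.

```php
<?php
// Sketch only: traceStart, traceEnd and $paths are illustrative names.
$current_tag = "";
$paths = array();

function traceStart($parser, $name, $attrs) {
    global $current_tag, $paths;
    // $name arrives in upper case because case folding is on by default.
    $current_tag .= "*$name";
    $paths[] = $current_tag;
}

function traceEnd($parser, $name) {
    // Nothing to do here; this sketch only watches tags open.
}

$parser = xml_parser_create();
xml_set_element_handler($parser, "traceStart", "traceEnd");
xml_parse($parser, "<news><story><headline>Hi</headline></story></news>", true);
xml_parser_free($parser);

// Prints the three paths: *NEWS, *NEWS*STORY, *NEWS*STORY*HEADLINE
print_r($paths);
```

The last entry is exactly the string stored in $xml_headline_key, which is why the keys are written in upper case with asterisks between the tag names.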

$story_array = array();
$counter = 0;

Here we are simply initializing our array and our counter for later use.

Now we come to our 3 major functions for parsing and formatting our data. As before, they are the "startTag", "endTag", and "contents" functions, and they still perform their designated actions when the parser calls them. The only change in this file is that the actions are a bit more complex. In this tutorial we'll go through each function fully before moving on to the next. We'll start with "startTag":

function startTag($parser, $data){
    global $current_tag;
    $current_tag .= "*$data";
}

When the script hits a start tag it will add the tag that it is currently reading to the string $current_tag. There is a "global" in front of our variable because we will be using the variable in all three of our functions and in order to use it the way we want, we need to declare it as a global variable instead of a local variable.

function endTag($parser, $data){
    global $current_tag;
    $tag_key = strrpos($current_tag, '*');
    $current_tag = substr($current_tag, 0, $tag_key);
}

Again we declare $current_tag as global so we can use it just as in "startTag". The variable $tag_key marks the position of the last asterisk (*) in the string $current_tag. The string is then cut off at that position, which removes the last XML tag from it. The purpose of this function is to take a step back, just as the purpose of "startTag" is to take a step forward.
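The "step back" can be tried on its own, without a parser. The sketch below wraps the two lines from "endTag" in a helper called stepBack, which is a hypothetical name used only for this illustration, and applies it three times to a full path.

```php
<?php
// stepBack is a hypothetical helper; it repeats the two lines from endTag.
function stepBack($path) {
    $tag_key = strrpos($path, '*');      // position of the last asterisk
    return substr($path, 0, $tag_key);   // keep everything before it
}

$path = "*NEWS*STORY*HEADLINE";

$afterHeadline = stepBack($path);          // "*NEWS*STORY"
$afterStory    = stepBack($afterHeadline); // "*NEWS"
$afterNews     = stepBack($afterStory);    // "" (back outside the document)

echo $afterHeadline . "\n";
echo $afterStory . "\n";
```

Each call removes exactly one tag name, so by the time the closing "news" tag has been handled the string is empty again.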

function contents($parser, $data){
    global $current_tag, $xml_headline_key, $xml_description_key, $counter, $story_array;
    switch($current_tag){
        case $xml_headline_key:
            $story_array[$counter] = new xml_story();
            $story_array[$counter]->headline = $data;
            break;
        case $xml_description_key:
            $story_array[$counter]->description = $data;
            $counter++;
            break;
    }
}

The first line of this function declares all of our global variables: $current_tag, $xml_headline_key, $xml_description_key, $counter, and $story_array. The next line begins our switch statement. switch() is basically a compact alternative to a chain of if() statements. For more on switch() in PHP visit php.net. This switch statement compares the variable $current_tag to the variables $xml_headline_key and $xml_description_key, and if it finds a match it performs the actions listed under that case.
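If switch() is new to you, the following sketch may help. It writes the same decision twice, once with switch() and once with if/elseif, so the two can be compared side by side. The function names describeWithSwitch and describeWithIf are invented for this example.

```php
<?php
// Illustrative only: both functions make the same decision about a tag path.
function describeWithSwitch($current_tag) {
    switch ($current_tag) {
        case "*NEWS*STORY*HEADLINE":
            return "headline";
        case "*NEWS*STORY*DESCRIPTION":
            return "description";
        default:
            return "ignored";
    }
}

function describeWithIf($current_tag) {
    if ($current_tag == "*NEWS*STORY*HEADLINE") {
        return "headline";
    } elseif ($current_tag == "*NEWS*STORY*DESCRIPTION") {
        return "description";
    } else {
        return "ignored";
    }
}

echo describeWithSwitch("*NEWS*STORY*HEADLINE") . "\n";
echo describeWithIf("*NEWS*STORY") . "\n";
```

The tutorial's "contents" function has no default case, so any text that sits outside a headline or description (such as the line breaks between tags) is simply skipped.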

 

If $current_tag matches $xml_headline_key, the script defines $story_array[$counter] as a new xml_story() object. Then we assign our data to the new object's "headline" property with this line:

$story_array[$counter]->headline = $data;

Then it breaks out of the switch statement. If $current_tag matches $xml_description_key, it assigns the data to the object's "description" property and adds 1 to our $counter variable, then breaks out of the switch statement.
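One thing to be aware of: the parser can call the data handler more than once for a single stretch of text, for instance around an entity such as &amp;. The sketch below is one possible way to guard against that, not the method used in this tutorial. It creates the story when the "story" tag opens and appends to the text instead of assigning it. It uses plain arrays rather than the xml_story class, and the names openTag, closeTag, appendText and $stories are assumptions made for the example.

```php
<?php
// Sketch of a more defensive variant; all names here are illustrative.
$current_tag = "";
$stories = array();

function openTag($parser, $name, $attrs) {
    global $current_tag, $stories;
    $current_tag .= "*$name";
    if ($current_tag === "*NEWS*STORY") {
        // Start an empty record as soon as a story begins.
        $stories[] = array("headline" => "", "description" => "");
    }
}

function closeTag($parser, $name) {
    global $current_tag;
    $current_tag = substr($current_tag, 0, strrpos($current_tag, '*'));
}

function appendText($parser, $data) {
    global $current_tag, $stories;
    $last = count($stories) - 1;
    if ($current_tag === "*NEWS*STORY*HEADLINE") {
        $stories[$last]["headline"] .= $data;      // append, do not overwrite
    } elseif ($current_tag === "*NEWS*STORY*DESCRIPTION") {
        $stories[$last]["description"] .= $data;
    }
}

$xml = "<news><story><headline>Tom &amp; Jerry</headline>"
     . "<description>Cat chases mouse.</description></story></news>";

$parser = xml_parser_create();
xml_set_element_handler($parser, "openTag", "closeTag");
xml_set_character_data_handler($parser, "appendText");
xml_parse($parser, $xml, true);
xml_parser_free($parser);

echo $stories[0]["headline"] . "\n";
```

Because the text is appended, the headline comes out whole no matter how many pieces the parser delivers it in.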

Just to clarify, I prefer to use objects because they make it a little easier to keep track of my data. This is just what works for me. Other people may have other methods that work better for them. It's all about your own personal preferences. Next up are the XML functions, which are exactly the same as in the {Easy} tutorial.

XML functions
For the XML functions, we need to:

-Create the parser
-Set the start and end tag handlers
-Set the data handler
-Open the XML file
-Read the XML file
-Parse the XML data
-Destroy the parser
-Close the XML file

Creating the parser is easy:

$xml_parser = xml_parser_create();

Setting the start tag, end tag, and data handlers is pretty easy as well:

xml_set_element_handler($xml_parser, "startTag", "endTag");

xml_set_character_data_handler($xml_parser, "contents");


The first argument for both of these functions is always the parser we created in the previous step. The remaining arguments are the names of the handler functions we wrote a little earlier. Next up is opening and reading the XML file:

$fp = fopen($xml_file, "r") or die("Could not open file");

$data = fread($fp, filesize($xml_file)) or die("Could not read file");


These are basic file handling functions that you should be familiar with by now. If you need to learn more or just refresh your memory you can check out the great tutorials on php.net.

 

The following if statement does two things: 1) it parses through the XML data from the XML file, and 2) if the parse fails it outputs an error message complete with line number.

if(!(xml_parse($xml_parser, $data, feof($fp)))){
    die("Error on line " . xml_get_current_line_number($xml_parser));
}


Again the first argument of the function is our parser. The second argument is the data to be parsed, in this case the variable $data. The third argument is a flag that tells the parser whether this is the last piece of data it will receive; here we pass feof($fp), which reports whether the end of the file has been reached.
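The role of the third argument is easier to see when the data arrives in pieces. The sketch below is a small illustration using in-memory strings rather than a file: it feeds a document to the parser in two parts, then parses a deliberately broken document and builds a fuller error message with xml_error_string() and xml_get_error_code(). The variable names are made up for this example.

```php
<?php
// Part 1: one document handed over in two pieces.
$parser = xml_parser_create();
$first  = xml_parse($parser, "<news><story>", false);   // more data will follow
$second = xml_parse($parser, "</story></news>", true);  // this is the last piece
xml_parser_free($parser);

// Part 2: a broken document (the story tag is never closed).
$bad = xml_parser_create();
$result = xml_parse($bad, "<news><story></news>", true);
$message = "";
if (!$result) {
    $message = xml_error_string(xml_get_error_code($bad))
             . " on line " . xml_get_current_line_number($bad);
}
xml_parser_free($bad);

echo "Pieces parsed: $first, $second\n";
echo "Error: $message\n";
```

xml_parse() returns 1 when a piece was accepted and 0 when parsing failed, which is why the tutorial's script can test it directly in an if statement.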

The next two lines just wrap up the script. The first one frees the memory used by the parser and the second closes the XML file. PHP releases both when the script ends, but cleaning up explicitly is a good habit, so do not forget to include these lines in your script.

xml_parser_free($xml_parser);

fclose($fp);
 

Wrapping it up
Well, that's all fine and good. Now we have our XML parsed and the data stored in our objects. All we have left to do is format it. We can do pretty much whatever we want with it now. I prefer to keep the example simple, so the code is just a for() loop that outputs each headline as a header with its description beneath it.

<html>
<head>
<title>CNT HEADLINE NEWS</title>
</head>
<body bgcolor="#FFFFFF">
<?php
// A simple for loop that outputs our final data.
for($x=0;$x<count($story_array);$x++){
    echo "\t<h2>" . $story_array[$x]->headline . "</h2>\n";
    echo "\t\t<br />\n";
    echo "\t<i>" . $story_array[$x]->description . "</i>\n";
}
?>

</body>
</html>


Most of this is simple HTML; I'm sure you can figure that out on your own. The for loop is fairly easy as well, outputting our data in the format that I have specified. The "\t" and "\n" are escape sequences in PHP's double-quoted strings. They don't show up on the rendered page; they produce a tab and a line break in the generated HTML source, which keeps it readable. If you aren't exactly sure how to use a for loop, more information can be found at php.net.
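The escape sequences only work inside double quotes. As a quick illustration (the variable names are just for this example), compare the same two characters written in double and in single quotes:

```php
<?php
// Inside double quotes, \t and \n become a real tab and a real line break.
$tab     = "\t";
$newline = "\n";

// Inside single quotes they stay as a backslash followed by a letter.
$literal = '\t';

echo strlen($tab) . "\n";      // one character: the tab
echo strlen($literal) . "\n";  // two characters: backslash and t

// The same kind of line the tutorial's loop prints:
echo "\t<h2>Example headline</h2>\n";
```

So if the tabs and line breaks ever appear as literal \t and \n in your HTML source, check that the strings are wrapped in double quotes.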

Conclusion
That is pretty much it. There are many ways to accomplish this and there are many uses for this project. Oh, yeah, here is what our output looks like:

<html>
<head>
<title>CNT HEADLINE NEWS</title>
</head>
<body bgcolor="#FFFFFF">
    <h2>Godzilla Attacks LA!</h2>
        <br />
    <i>Equipped with a Japanese Mind-control device, the giant monster has attacked important harbours along the California coast. President to take action. </i>
    <h2>Bigfoot Spotted at M.I.T. Dining Area</h2>
        <br />
    <i>The beast was seen ordering a Snapple in the dining area on Tuesday. In a related story, Kirupa Chinnathambi, an MIT engineering student has been reported missing.</i>
    <h2>London Angel Saves England</h2>
        <br />
    <i>The "London Angel" known only as "Kit" has saved the U.K. yet again. Reports have stated that she destroyed every single Churchill bobble-head dog in the country. A great heartfilled thank you goes out to her.</i>
    <h2>Six-eyed Man to be Wed to an Eight-armed Woman</h2>
        <br />
    <i>Uhhhmmm... No comment really... just a little creepy to see them together...</i>
    <h2>Ahmed's Birthday Extravaganza!</h2>
        <br />
    <i>The gifted youngster's birthday party should be a blast. He is turning thirteen and has requested a large cake, ice cream, and a petting zoo complete with pony rides.</i>
</body>
</html>

There are a few things to remember when working with XML:

1. Always free the parser memory
2. Always close the file
3. Always escape illegal XML characters
a. <
b. >
c. &
d. '
e. "
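If the XML file is generated by a PHP script, the five characters above can be escaped with the built-in htmlspecialchars() function. The sketch below is one way to do it, assuming PHP 5.4 or later for the ENT_XML1 flag; the sample text and variable names are invented for the example.

```php
<?php
// Sample text containing all five special characters.
$raw = "Tom & \"Jerry\" <cat> 'mouse'";

// ENT_QUOTES escapes both kinds of quotes; ENT_XML1 selects XML entity names.
$escaped = htmlspecialchars($raw, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo "<headline>" . $escaped . "</headline>\n";
```

The parser converts the entities back into the original characters when it reads the file, so the handler functions receive the plain text.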

You can download my source files for this tutorial to look at the commented code here, and if you have any questions the best place to ask would be on the forums in the Server-side Scripting Forum.

Jubba

 



