SAX (Perl and XML) - e-Reading Library

SAX has been a huge success. Its simplicity makes it easy to learn and work with. Early development with XML was mostly in the realm of Java, so SAX was codified as an interface construct. An interface construct is a special kind of class that declares an object's methods without implementing them, leaving the implementation up to the developer.

5.1. SAX Event Handlers

To use a typical SAX module in a program, you must pass it an object whose methods implement handlers for SAX events. Table 5-1 describes the methods in a typical handler object. A SAX parser passes a hash to each handler containing properties relevant to the event. For example, in this hash, an element handler would receive the element's name and a list of attributes.

Table 5-1. PerlSAX handlers

Method name	Event	Properties
`start_document`	The document processing has started (this is the first event)	(none defined)
`end_document`	The document processing is complete (this is the last event)	(none defined)
`start_element`	An element start tag or empty element tag was found	Name, Attributes
`end_element`	An element end tag or empty element tag was found	Name
`characters`	A string of nonmarkup characters (character data) was found	Data
`processing_instruction`	A parser encountered a processing instruction	Target, Data
`comment`	A parser encountered a comment	Data
`start_cdata`	The beginning of a CDATA section was encountered (the character data that follows may contain reserved markup characters)	(none defined)
`end_cdata`	The end of a CDATA section was encountered	(none defined)
`entity_reference`	An internal entity reference was found (as opposed to an external entity reference, which would indicate that a file needs to be loaded)	Name, Value
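
As a minimal sketch of how such a handler object is shaped, the following package implements two of the methods from Table 5-1 and is driven by hand with the kind of hashes a PerlSAX parser would pass. The package name EventLogger and the sample element are hypothetical; with a real parser you would hand the object over through the Handler parameter instead of calling its methods yourself.

```perl
use strict;
use warnings;

# Hypothetical handler: records one line per event it receives.
{
    package EventLogger;

    sub new {
        my $class = shift;
        return bless { log => [] }, $class;
    }

    # The properties hash carries the element name and its attributes.
    sub start_element {
        my( $self, $properties ) = @_;
        my $attributes = $properties->{'Attributes'} || {};
        my $line = "start " . $properties->{'Name'};
        foreach my $name ( sort keys %{$attributes} ) {
            $line .= " $name=" . $attributes->{$name};
        }
        push( @{ $self->{log} }, $line );
    }

    # For an end tag, only the element name is available.
    sub end_element {
        my( $self, $properties ) = @_;
        push( @{ $self->{log} }, "end " . $properties->{'Name'} );
    }
}

# Simulate the two calls a parser would make for <book id="mybook"/>.
my $handler = EventLogger->new( );
$handler->start_element( { Name => 'book', Attributes => { id => 'mybook' } } );
$handler->end_element( { Name => 'book' } );

print "$_\n" foreach @{ $handler->{log} };
```

Because an empty element triggers both handlers, the log holds a start line followed by an end line.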

A few notes about handler methods:

For an empty element, both the start_element( ) and end_element( ) handlers are called, in that order. No handler exists specifically for empty elements.
The characters( ) handler may be called more than once for a string of contiguous character data, parceling it into pieces. For example, a parser might break text around an entity reference, which is often more efficient for the parser.
The characters( ) handler will be called for any whitespace between elements, even if it doesn't seem like significant data. In XML, all characters are considered part of the data, and it is simply more efficient for the parser not to decide which whitespace is significant.
Handling of processing instructions, comments, and CDATA sections is optional. In the absence of handlers, the data from processing instructions and comments is discarded. For CDATA sections, calls are still made to the characters( ) handler as before so the data will not be lost.
The start_cdata( ) and end_cdata( ) handlers do not receive data. Instead, they merely act as signals to tell you whether reserved markup characters can be expected in future calls to the characters( ) handler.
In the absence of an entity_reference( ) handler, all internal entity references will be resolved automatically by the parser, and the resulting text or markup will be handled normally. If you do define an entity_reference( ) handler, the entity references will not be expanded and you can do what you want with them.
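
Since character data can arrive in pieces, a handler that needs the complete text of a run has to collect it itself. One way to sketch this, assuming a hypothetical TextCollector package that is driven by hand here in place of a real parser:

```perl
use strict;
use warnings;

# Hypothetical handler: joins character data that arrives in pieces.
{
    package TextCollector;

    sub new {
        my $class = shift;
        return bless { buffer => '', chunks => [] }, $class;
    }

    # Each call may carry only part of a run of text, so just append.
    sub characters {
        my( $self, $properties ) = @_;
        $self->{buffer} .= $properties->{'Data'};
    }

    # A tag ends the current run of text, so store it and start over.
    sub flush {
        my $self = shift;
        push( @{ $self->{chunks} }, $self->{buffer} )
            if( length $self->{buffer} );
        $self->{buffer} = '';
    }

    sub start_element { $_[0]->flush( ) }
    sub end_element   { $_[0]->flush( ) }
}

# Simulate a parser that splits "fish & chips" into three events.
my $collector = TextCollector->new( );
$collector->start_element( { Name => 'para', Attributes => {} } );
$collector->characters( { Data => 'fish ' } );
$collector->characters( { Data => '&' } );
$collector->characters( { Data => ' chips' } );
$collector->end_element( { Name => 'para' } );

print $collector->{chunks}[0], "\n";    # prints "fish & chips"
```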

Let's show an example now. We'll write a program called a filter, a special processor that outputs a replica of the original document with a few modifications. Specifically, it makes these changes to a document:

Turns every XML comment into a <comment> element
Deletes processing instructions
Removes tags, but leaves the content, for <literal> elements that occur within <programlisting> elements at any level
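
The third rule depends on knowing which elements are currently open. One way to sketch that test, assuming the open element names are kept in a list with the innermost element last (the helper name is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: true when the innermost open element is <literal>
# and a <programlisting> is open somewhere among its ancestors.
sub is_literal_in_listing {
    my @stack = @_;
    return 0 unless @stack and $stack[-1] eq 'literal';
    return ( grep { $_ eq 'programlisting' } @stack ) ? 1 : 0;
}

# The last name in each list is the innermost open element.
print is_literal_in_listing( qw( book chapter para literal ) ), "\n";                            # prints 0
print is_literal_in_listing( qw( book chapter programlisting lineannotation literal ) ), "\n";   # prints 1
print is_literal_in_listing( qw( book chapter programlisting lineannotation ) ), "\n";           # prints 0
```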

The code for this program is listed in Example 5-1. Like the last program, we initialize the parser with a set of handlers, except this time they are bundled together in a convenient package: an object called MyHandler. Notice that we've implemented a few more handlers, since we want to be able to deal with comments, processing instructions, and the document prolog.

Example 5-1. Filter program

# initialize the parser
#
use XML::Parser::PerlSAX;
my $parser = XML::Parser::PerlSAX->new( Handler => MyHandler->new( ) );

if( my $file = shift @ARGV ) {
    $parser->parse( Source => {SystemId => $file} );
} else {
    my $input = "";
    while( <STDIN> ) { $input .= $_; }
    $parser->parse( Source => {String => $input} );
}
exit;

#
# global variables
#
my @element_stack;                # remembers element names
my $in_intset;                    # flag: are we in the internal subset?

###
### Document Handler Package
###
package MyHandler;

#
# initialize the handler package
#
sub new {
    my $type = shift;
    return bless {}, $type;
}

#
# handle a start-of-element event: output start tag and attributes
#
sub start_element {
    my( $self, $properties ) = @_;
    # note: the hash %{$properties} will lose attribute order

    # close internal subset if still open
    output( "]>\n" ) if( $in_intset );
    $in_intset = 0;

    # remember the name by pushing onto the stack
    push( @element_stack, $properties->{'Name'} );

    # output the tag and attributes UNLESS it's a <literal>
    # inside a <programlisting>
    unless( stack_top( 'literal' ) and
            stack_contains( 'programlisting' )) {
        output( "<" . $properties->{'Name'} );
        my %attributes = %{$properties->{'Attributes'}};
        foreach( keys( %attributes )) {
            output( " $_=\"" . $attributes{$_} . "\"" );
        }
        output( ">" );
    }
} 

#
# handle an end-of-element event: output end tag UNLESS it's from a
# <literal> inside a <programlisting>
#
sub end_element {
    my( $self, $properties ) = @_;
    output( "</" . $properties->{'Name'} . ">" )
         unless( stack_top( 'literal' ) and
                stack_contains( 'programlisting' ));
    pop( @element_stack );
}

#
# handle a character data event
#
sub characters {
    my( $self, $properties ) = @_;
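    # the parser unfortunately resolves some character entities for us,
    # so we need to replace them with entity references again
    # (ampersand first, so the ampersands added by the later
    # substitutions are not escaped a second time)
    my $data = $properties->{'Data'};
    $data =~ s/&/&amp;/g;
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;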
    output( $data );
}

#
# handle a comment event: turn into a <comment> element
#
sub comment {
    my( $self, $properties ) = @_;
    output( "<comment>" . $properties->{'Data'} . "</comment>" );
}

#
# handle a PI event: delete it
#
sub processing_instruction {
  # do nothing!
}

#
# handle internal entity reference (we don't want them resolved)
#
sub entity_reference {
    my( $self, $properties ) = @_;
    output( "&" . $properties->{'Name'} . ";" );
}

sub stack_top {
    my $guess = shift;
    return $element_stack[ $#element_stack ] eq $guess;
}

sub stack_contains {
    my $guess = shift;
    foreach( @element_stack ) {
        return 1 if( $_ eq $guess );
    }
    return 0;
}

sub output {
    my $string = shift;
    print $string;
}

Looking closely at the handlers, we see that one argument is passed, in addition to the obligatory object reference $self. This argument is a reference to a hash of properties about the event. This technique has one disadvantage: in the element start handler, the attributes are stored in a hash, which has no memory of the original attribute order. Semantically, this is not a big deal, since XML is supposed to be ignorant of attribute order. However, there may be cases when you want to replicate that order.[25]

[25]In the case of our filter, we might want to compare the versions from before and after processing using a utility such as the Unix program diff. Such a comparison would yield many false differences where the order of attributes changed. Instead of using diff, you should consider using the module XML::SemanticDiff by Kip Hampton. This module would ignore syntactic differences and compare only the semantics of two documents.
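
If all you need is output that is the same from run to run, rather than the original order, one option is to emit the attributes in sorted order. A minimal sketch, with a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: build a start tag with attributes in sorted order.
# This does not restore the original order; it only makes output repeatable.
sub start_tag {
    my( $name, $attributes ) = @_;
    my $tag = "<$name";
    foreach my $key ( sort keys %{$attributes} ) {
        $tag .= " $key=\"" . $attributes->{$key} . "\"";
    }
    return $tag . ">";
}

print start_tag( 'chapter', { role => 'intro', id => 'ch1' } ), "\n";
# prints <chapter id="ch1" role="intro">
```

Two runs over the same input then produce identical tags, although a diff against the source document can still report attributes that were written in a different order there.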

As a filter, this program should preserve everything about the original document except for the few details that have to be changed. That means it has to deal with the document prolog, processing instructions, and comments, not just elements and text. Even entity references should be preserved as they are instead of being resolved (as the parser may want to do). Therefore, the program has a few more handlers than the last example, in which we were interested only in extracting very specific information.
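
The characters( ) handler in Example 5-1 has to turn reserved characters back into entity references before printing them. A minimal sketch of that step as a standalone function (the name escape_text is hypothetical); the ampersand is replaced first so that the ampersands introduced by the later substitutions are not escaped a second time:

```perl
use strict;
use warnings;

# Hypothetical helper: re-escape the characters that are reserved in
# XML character data. The /g flag replaces every occurrence, not just
# the first one.
sub escape_text {
    my $data = shift;
    $data =~ s/&/&amp;/g;
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;
    return $data;
}

print escape_text( 'if a < b && b > c' ), "\n";
# prints if a &lt; b &amp;&amp; b &gt; c
```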

Let's test this program now. Our input datafile is listed in Example 5-2.

Example 5-2. Data for the filter

<?xml version="1.0"?>
<!DOCTYPE book
  SYSTEM "/usr/local/prod/sgml/db.dtd"
[
  <!ENTITY thingy "hoo hah blah blah">
]>

<book id="mybook">
<?print newpage?>
  <title>GRXL in a Nutshell</title>
  <chapter id="intro">
    <title>What is GRXL?</title>
<!-- need a better title -->
    <para>
Yet another acronym.  That was our attitude at first, but then we saw 
the amazing uses of this new technology called
<literal>GRXL</literal>.  Consider the following program:
    </para>
<?print newpage?>
    <programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
  print!  <lineannotation><literal>wow</literal></lineannotation>
or not!</programlisting>
<!-- what font should we use? -->
    <para>
What does it do?  Who cares?  It's just lovely to look at.  In fact,
I'd have to say, "&thingy;".
    </para>
<?print newpage?>
  </chapter>
</book>

The result, after running the program on the data, is shown in Example 5-3.

Example 5-3. Output from the filter

<book id="mybook">
  <title>GRXL in a Nutshell</title>
  <chapter id="intro">
    <title>What is GRXL?</title>
<comment> need a better title </comment>
    <para>
Yet another acronym.  That was our attitude at first, but then we saw 
the amazing uses of this new technology called
<literal>GRXL</literal>.  Consider the following program:
    </para>

    <programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
  print!  <lineannotation>wow</lineannotation>
or not!</programlisting>
<comment> what font should we use? </comment>
    <para>
What does it do?  Who cares?  It's just lovely to look at.  In fact,
I'd have to say, "&thingy;".
    </para>

  </chapter>
</book>

Here's what the filter did right. It turned each XML comment into a <comment> element and deleted the processing instructions. The <literal> element in the <programlisting> was removed, with its contents left intact, while other <literal> elements were preserved. Entity references were left unresolved, as we wanted. So far, so good. But something's missing. The XML declaration, document type declaration, and internal subset are gone. Without the declaration for the entity thingy, the reference &thingy; in the last paragraph is undefined, so this document is not valid. It looks like the handlers we had available to us were not sufficient.
