ibneko | Finally! Stupid XML parsing.

Using XML::LibXML::Reader:

    use strict;
    use warnings;
    use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* constants by default

    # $content is the XML document as a string, declared and loaded earlier.
    my $reader = XML::LibXML::Reader->new(string => $content)
            or die "cannot parse XML string\n";
    while ($reader->read()) {
        processNode($reader);
    }

    # Print one line per node, followed by its text value and its attributes.
    # nodeTypeAsString() is a helper of my own, not part of XML::LibXML::Reader.
    sub processNode {
        my $reader = shift;
        # skip whitespace-only nodes and comments
        return if $reader->nodeType == XML_READER_TYPE_SIGNIFICANT_WHITESPACE
               || $reader->nodeType == XML_READER_TYPE_COMMENT;

        print " " x $reader->depth
                ."[".nodeTypeAsString($reader->nodeType)."]"
                .$reader->name
                .($reader->isEmptyElement ? "[isEmpty]" : "")."\r\n";

        if ($reader->nodeType == XML_READER_TYPE_TEXT) {
            print " " x $reader->depth, trimmed($reader->value), "\r\n";
        }

        if (my $total = $reader->attributeCount()) {    # if more than 0
            for my $i (0 .. $total - 1) {               # attribute indexes start at 0
                $reader->moveToAttributeNo($i);
                # display attribute name and value
                print " " x $reader->depth
                        ."[".nodeTypeAsString($reader->nodeType)."]"
                        .$reader->name."=".trimmed($reader->value)."\r\n";
            }
            $reader->moveToElement();    # step back onto the owning element
        }
    }

    # Strip leading/trailing whitespace and cap the text at 100 characters.
    sub trimmed {
        my $text = shift;
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;
        return substr($text, 0, 100);
    }
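
The listing calls nodeTypeAsString(), which is not something XML::LibXML::Reader provides. Here is a minimal sketch of such a helper, assuming nodeType returns libxml2's numeric xmlReaderTypes values (0 through 17). The array name and the UNKNOWN fallback are made up for illustration and may not match the helper that produced the output below.

```perl
use strict;
use warnings;

# Names indexed by libxml2's xmlReaderTypes values
# (0 = NONE ... 17 = XML_DECLARATION).
my @NODE_TYPE_NAMES = qw(
    NONE ELEMENT ATTRIBUTE TEXT CDATA ENTITY_REFERENCE ENTITY
    PROCESSING_INSTRUCTION COMMENT DOCUMENT DOCUMENT_TYPE DOCUMENT_FRAGMENT
    NOTATION WHITESPACE SIGNIFICANT_WHITESPACE END_ELEMENT END_ENTITY
    XML_DECLARATION
);

# Map a numeric node type to a printable name.
# Anything outside the known range gets a hypothetical UNKNOWN(...) label.
sub nodeTypeAsString {
    my $type = shift;
    return "UNKNOWN($type)"
        if $type !~ /^\d+$/ || $type > $#NODE_TYPE_NAMES;
    return $NODE_TYPE_NAMES[$type];
}

print nodeTypeAsString(1), "\n";
```

Indexing by the raw number keeps the helper independent of the module, at the cost of hard-coding the enum order.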

=====OUTPUT======

[PROCESSING_INSTRUCTION]xml-stylesheet
[ELEMENT]feed
 [ATTRIBUTE]xmlns=http://purl.org/atom/ns#
 [ATTRIBUTE]version=0.3
 [ATTRIBUTE]xml:lang=en
 [ELEMENT]title
  [ATTRIBUTE]mode=escaped
  [TEXT]#text
  Lucene in Action
 [END_ELEMENT]title
 [ELEMENT]link[isEmpty]
  [ATTRIBUTE]rel=alternate
  [ATTRIBUTE]type=text/html
  [ATTRIBUTE]href=http://www.lucenebook.com/blog/
 [ELEMENT]link[isEmpty]
  [ATTRIBUTE]href=http://www.lucenebook.com/atomapi/default/
  [ATTRIBUTE]rel=service.post
  [ATTRIBUTE]title=Lucene in Action
  [ATTRIBUTE]type=application/x.atom+xml
 [ELEMENT]modified
  [TEXT]#text
  2005-11-07T23:51:26Z
 [END_ELEMENT]modified
 [ELEMENT]info
  [ATTRIBUTE]type=application/xhtml+xml
  [ATTRIBUTE]mode=xml
  [ELEMENT]div
   [ATTRIBUTE]xmlns=http://www.w3.org/1999/xhtml
   [TEXT]#text
   This is an Atom syndication feed. It is intended to be viewed in a news aggregator or syndicated to

   [ELEMENT]a
    [ATTRIBUTE]href=http://intertwingly.net/wiki/pie/
    [TEXT]#text
    Atom Project
   [END_ELEMENT]a
   [TEXT]#text
   for
        more information.
  [END_ELEMENT]div
 [END_ELEMENT]info
 [ELEMENT]author
  [ELEMENT]name
   [TEXT]#text
   Otis and Erik
  [END_ELEMENT]name
  [ELEMENT]url
   [TEXT]#text
   http://www.lucenebook.com/blog/
  [END_ELEMENT]url
  [ELEMENT]email
   [TEXT]#text
   authors@lucenebook.com
  [END_ELEMENT]email
 [END_ELEMENT]author
 [ELEMENT]tagline
  [TEXT]#text
  Lucene in Action
 [END_ELEMENT]tagline
 [ELEMENT]generator
  [ATTRIBUTE]url=http://blojsom.sf.net
  [ATTRIBUTE]version=blojsom v2.23
  [TEXT]#text
  blojsom
 [END_ELEMENT]generator
 [ELEMENT]copyright
  [ATTRIBUTE]mode=escaped
  [TEXT]#text
  Copyright © 2004 Otis and Erik
 [END_ELEMENT]copyright
 [ELEMENT]entry
  [ELEMENT]title
   [TEXT]#text
   Lucene in Action, Korean translation
  [END_ELEMENT]title
  [ELEMENT]link[isEmpty]
   [ATTRIBUTE]rel=alternate
   [ATTRIBUTE]type=text/html
   [ATTRIBUTE]href=http://www.lucenebook.com/blog/announcements/?permalink=Lucene_in_Action_Korean_translation.html
  [ELEMENT]link[isEmpty]
   [ATTRIBUTE]href=http://www.lucenebook.com/atomapi/default/announcements/?permalink=Lucene_in_Action_Korean_translati
   [ATTRIBUTE]rel=service.edit
   [ATTRIBUTE]title=Edit Lucene in Action, Korean translation
   [ATTRIBUTE]type=application/x.atom+xml
  [ELEMENT]modified
   [TEXT]#text
   2005-11-07T23:51:26-05:00
  [END_ELEMENT]modified
  [ELEMENT]issued
   [TEXT]#text
   2005-11-07T23:51:26-05:00
  [END_ELEMENT]issued
  [ELEMENT]id
   [TEXT]#text
   tag:authors@lucenebook.com,2005-11-07:/announcements/?permalink=Lucene_in_Action_Korean_translation.
  [END_ELEMENT]id
  [ELEMENT]created
   [TEXT]#text
   2005-11-07T23:51:26-05:00
  [END_ELEMENT]created
  [ELEMENT]content
   [ATTRIBUTE]type=text/html
   [ATTRIBUTE]mode=escaped
   [ATTRIBUTE]xml:lang=en
   [ATTRIBUTE]xml:base=http://www.lucenebook.com
   [TEXT]#text
   
Lucene in Action has recently been translated to Korean by Cheolgoo Kang, Seongjin Ju, and Moonh
  [END_ELEMENT]content
 [END_ELEMENT]entry

While, yes, this doesn't actually do what I need it to do yet, it gives me everything I need to get at the contents of the XML file without doing any manual parsing of my own. Finally. There's a stupid lack of documentation out there, although I get the impression now (well, as of 30 minutes ago) that it's because the function names follow a set standard that you're presumed to already know...

But still, HAH! It WORKS!

Now for the more tedious but easier part of uh, figuring out which parts of what goes where in the database.

Oh, yeah, the code's here mostly to help anyone who might be googling for stuff.
Problems I ran into:
- Why doesn't the XML::LibXML::Reader work? (creating a new Reader with a string throws an error)
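Answer: because I was running libxml2 2.5 on Mac OS X, which is too old for the Reader interface (as far as I can tell from the module's documentation, it needs at least libxml2 2.6.21).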
- How do I access the attributes / How do I get the attribute names / How do I figure out which attributes there are?
Answer: Tricky bastard requires a call to moveToFirstAttribute() first. After that, name() and value() return the attribute's name and value, and moveToAttributeNo(n) moves you to the n-th attribute (counting from 0). The attributes do not appear to come in any particular order, so don't assume which one will be first, which second, etc.
- Which module should I use?
Answer: I don't know. See http://perl-xml.sourceforge.net/faq/#dont_parse. I hope I made the right choice with XML::LibXML::Reader............ ::crosses fingers and toes::
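
For the attribute problem above, here is a minimal sketch of walking an element's attributes without counting them, using moveToFirstAttribute(), moveToNextAttribute() and moveToElement(). The attributePairs() name and the sample XML string are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Return the current element's attributes as [name, value] pairs,
# in whatever order the reader reports them.
sub attributePairs {
    my $reader = shift;
    my @pairs;
    if ($reader->moveToFirstAttribute() == 1) {
        do {
            push @pairs, [ $reader->name, $reader->value ];
        } while ($reader->moveToNextAttribute() == 1);
        $reader->moveToElement();    # back onto the element itself
    }
    return @pairs;
}

# Made-up sample document, only for illustration.
my $xml = '<link rel="alternate" type="text/html"/>';
my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot parse XML string\n";
while ($reader->read()) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    print $reader->name, ": ",
        join(", ", map { "$_->[0]=$_->[1]" } attributePairs($reader)), "\n";
}
```

If you already know which attribute you want, $reader->getAttribute('rel') should return its value directly, with no moving around.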
