Finally! Stupid XML parsing.
Jan. 31st, 2007 11:16 pm
Using XML::LibXML::Reader:
use strict;
use warnings;
use XML::LibXML::Reader;   # exports the XML_READER_TYPE_* constants

# Assumption: the XML arrives on STDIN. The post doesn't show where $content comes from.
my $content = do { local $/; <STDIN> };

my $reader = XML::LibXML::Reader->new(string => $content)
    or die "cannot create a reader from the XML string\n";
while ($reader->read()) {
    processNode($reader);
}

# Strip leading/trailing whitespace and cap the text at 100 characters.
sub snippet {
    my $text = shift;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return substr($text, 0, 100);
}

sub processNode {
    my $reader = shift;   # "my" keeps this from clobbering the file-level $reader
    my $type   = $reader->nodeType;

    # skip whitespace-only nodes and comments
    return if $type == XML_READER_TYPE_SIGNIFICANT_WHITESPACE
           || $type == XML_READER_TYPE_COMMENT;

    # node type and name, indented by depth
    # (nodeTypeAsString() is my own helper, not shown here: numeric node type in, name out)
    print " " x $reader->depth
        . "[" . nodeTypeAsString($type) . "]"
        . $reader->name
        . ($reader->isEmptyElement ? "[isEmpty]" : "") . "\r\n";

    # text nodes: print the (trimmed, truncated) content on its own line
    if ($type == XML_READER_TYPE_TEXT) {
        print " " x $reader->depth . snippet($reader->value()) . "\r\n";
    }

    # read() never stops on attributes, so step through them by index
    my $total = $reader->attributeCount();
    for my $i (0 .. $total - 1) {
        $reader->moveToAttributeNo($i);
        print " " x $reader->depth
            . "[" . nodeTypeAsString($reader->nodeType) . "]"
            . $reader->name . "=" . snippet($reader->value()) . "\r\n";
    }
    $reader->moveToElement() if $total > 0;   # step back onto the owning element
}
=====OUTPUT======
[PROCESSING_INSTRUCTION]xml-stylesheet
[ELEMENT]feed
[ATTRIBUTE]xmlns=http://purl.org/atom/ns#
[ATTRIBUTE]version=0.3
[ATTRIBUTE]xml:lang=en
[ELEMENT]title
[ATTRIBUTE]mode=escaped
[TEXT]#text
Lucene in Action
[END_ELEMENT]title
[ELEMENT]link[isEmpty]
[ATTRIBUTE]rel=alternate
[ATTRIBUTE]type=text/html
[ATTRIBUTE]href=http://www.lucenebook.com/blog/
[ELEMENT]link[isEmpty]
[ATTRIBUTE]href=http://www.lucenebook.com/atomapi/default/
[ATTRIBUTE]rel=service.post
[ATTRIBUTE]title=Lucene in Action
[ATTRIBUTE]type=application/x.atom+xml
[ELEMENT]modified
[TEXT]#text
2005-11-07T23:51:26Z
[END_ELEMENT]modified
[ELEMENT]info
[ATTRIBUTE]type=application/xhtml+xml
[ATTRIBUTE]mode=xml
[ELEMENT]div
[ATTRIBUTE]xmlns=http://www.w3.org/1999/xhtml
[TEXT]#text
This is an Atom syndication feed. It is intended to be viewed in a news aggregator or syndicated to
[ELEMENT]a
[ATTRIBUTE]href=http://intertwingly.net/wiki/pie/
[TEXT]#text
Atom Project
[END_ELEMENT]a
[TEXT]#text
for
more information.
[END_ELEMENT]div
[END_ELEMENT]info
[ELEMENT]author
[ELEMENT]name
[TEXT]#text
Otis and Erik
[END_ELEMENT]name
[ELEMENT]url
[TEXT]#text
http://www.lucenebook.com/blog/
[END_ELEMENT]url
[ELEMENT]email
[TEXT]#text
authors@lucenebook.com
[END_ELEMENT]email
[END_ELEMENT]author
[ELEMENT]tagline
[TEXT]#text
Lucene in Action
[END_ELEMENT]tagline
[ELEMENT]generator
[ATTRIBUTE]url=http://blojsom.sf.net
[ATTRIBUTE]version=blojsom v2.23
[TEXT]#text
blojsom
[END_ELEMENT]generator
[ELEMENT]copyright
[ATTRIBUTE]mode=escaped
[TEXT]#text
Copyright © 2004 Otis and Erik
[END_ELEMENT]copyright
[ELEMENT]entry
[ELEMENT]title
[TEXT]#text
Lucene in Action, Korean translation
[END_ELEMENT]title
[ELEMENT]link[isEmpty]
[ATTRIBUTE]rel=alternate
[ATTRIBUTE]type=text/html
[ATTRIBUTE]href=http://www.lucenebook.com/blog/announcements/?permalink=Lucene_in_Action_Korean_translation.html
[ELEMENT]link[isEmpty]
[ATTRIBUTE]href=http://www.lucenebook.com/atomapi/default/announcements/?permalink=Lucene_in_Action_Korean_translati
[ATTRIBUTE]rel=service.edit
[ATTRIBUTE]title=Edit Lucene in Action, Korean translation
[ATTRIBUTE]type=application/x.atom+xml
[ELEMENT]modified
[TEXT]#text
2005-11-07T23:51:26-05:00
[END_ELEMENT]modified
[ELEMENT]issued
[TEXT]#text
2005-11-07T23:51:26-05:00
[END_ELEMENT]issued
[ELEMENT]id
[TEXT]#text
tag:authors@lucenebook.com,2005-11-07:/announcements/?permalink=Lucene_in_Action_Korean_translation.
[END_ELEMENT]id
[ELEMENT]created
[TEXT]#text
2005-11-07T23:51:26-05:00
[END_ELEMENT]created
[ELEMENT]content
[ATTRIBUTE]type=text/html
[ATTRIBUTE]mode=escaped
[ATTRIBUTE]xml:lang=en
[ATTRIBUTE]xml:base=http://www.lucenebook.com
[TEXT]#text
Lucene in Action has recently been translated to Korean by Cheolgoo Kang, Seongjin Ju, and Moonh
[END_ELEMENT]content
[END_ELEMENT]entry
While, yes, this doesn't actually do what I need it to do, it gives me the ways I need to access everything in the XML file without doing any manual parsing on my own. Finally. There's a stupid lack of documentation out there, although I get the impression now (well, 30 minutes ago) that it's because the function names are supposed to be a set standard that is presumed to be already known...
But still, HAH! It WORKS!
Now for the more tedious but easier part of uh, figuring out which parts of what goes where in the database.
Oh, yeah, the code's here mostly to help anyone who might be googling for stuff.
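For anyone copying the listing: it calls nodeTypeAsString() without defining it, so it won't run as pasted. The original helper isn't shown, so what follows is only a stand-in sketch, not the code that produced the output above. It assumes the numeric node types follow libxml2's xmlReaderTypes numbering (0 = NONE through 17 = XML_DECLARATION), which is what the XML_READER_TYPE_* constants map to. The array name and the 'UNKNOWN' fallback are made up for illustration.

```perl
use strict;
use warnings;

# Names indexed by libxml2's xmlReaderTypes values (0 .. 17).
# Hypothetical stand-in for the helper the post calls but never shows.
my @NODE_TYPE_NAMES = qw(
    NONE ELEMENT ATTRIBUTE TEXT CDATA ENTITY_REFERENCE ENTITY
    PROCESSING_INSTRUCTION COMMENT DOCUMENT DOCUMENT_TYPE
    DOCUMENT_FRAGMENT NOTATION WHITESPACE SIGNIFICANT_WHITESPACE
    END_ELEMENT END_ENTITY XML_DECLARATION
);

sub nodeTypeAsString {
    my ($type) = @_;
    # Anything that is not a known non-negative integer gets a placeholder name.
    return 'UNKNOWN'
        unless defined $type
            && $type =~ /^\d+$/
            && $type <= $#NODE_TYPE_NAMES;
    return $NODE_TYPE_NAMES[$type];
}

# Usage: the names line up with the labels in the output above.
print nodeTypeAsString(1), "\n";    # ELEMENT
print nodeTypeAsString(15), "\n";   # END_ELEMENT
```

A plain array lookup keeps the helper free of any module dependency. Keying a hash on the XML_READER_TYPE_* constants would work just as well.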
Problems I ran into:
- Why doesn't the XML::LibXML::Reader work? (creating a new Reader with a string throws an error)
Answer: because I was running LibXML 2.5 on Mac OS X, which was apparently too old for the Reader interface.
- How do I access the attributes / How do I get the attribute names / How do I figure out which attributes there are?
Answer: Tricky bastard requires use of moveToFirstAttribute() first. Then name() and value() will snag the proper attributes. moveToAttributeNo( number ) will move you to the number-th attribute (counting from zero). They do not appear to be in any particular order, so assumptions should not be made about which one would be first, which would be second, etc.
- Which module should I use?
Answer: I don't know. See http://perl-xml.sourceforge.net/faq/#dont_parse. I hope I made the right choice with XML::LibXML::Reader............ ::crosses fingers and toes::
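Since attribute order can't be relied on, one option for the database step is to collect each element's attributes into a hash and look them up by name. This is a minimal sketch under that assumption, not something from the original script. attributes_of() is a made-up name, the sample XML is one of the link elements from the output above, and it walks the attributes with moveToNextAttribute() rather than moveToAttributeNo().

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: returns a hashref of name => value for the
# attributes of the element the reader is currently positioned on.
sub attributes_of {
    my ($reader) = @_;
    my %attr;
    # moveToFirstAttribute() returns 1 only if there is an attribute to move to
    if ($reader->moveToFirstAttribute() == 1) {
        do {
            $attr{ $reader->name } = $reader->value;
        } while ($reader->moveToNextAttribute() == 1);
        $reader->moveToElement();   # step back onto the owning element
    }
    return \%attr;
}

my $xml = '<link rel="alternate" type="text/html" href="http://www.lucenebook.com/blog/"/>';
my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create a reader from the XML string\n";
$reader->read() or die "no nodes to read\n";   # now positioned on <link>

my $attr = attributes_of($reader);
print "$_=$attr->{$_}\n" for sort keys %$attr;
```

Looking values up as $attr->{href} or $attr->{rel} sidesteps the ordering question entirely. Note that namespace declarations such as xmlns show up as attributes too, as they do in the output above.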