Problem parsing XML - PullParser

  • Thread starter Sebastian (syepes)

Sebastian (syepes)

Hi all,

I need to parse an XML file "line by line" because of an application
limitation, so I am trying to build a stream/pull XML parser with the
REXML library, but I can't get it to work.

- Does anyone know what could be causing this error? -> Missing end tag for ''
- The error happens even with XML as simple as this:
psudo_xml = <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<SChange>
</SChange>
EOF



Error:
----------
DBG: event_type: text
TXT Normal
DBG: event_type: end_element
END Mode
/opt/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:330:in `pull':
Missing end tag for '' (got "SChange") (REXML::ParseException)
Line:
Position:
Last 80 unconsumed characters:
from /opt/local/lib/ruby/1.8/rexml/parsers/pullparser.rb:68:in `pull'
from text2.rb:13:in `parse'
from text2.rb:32:in `line_process'
from text2.rb:47



Ruby code
----------
require "stringio"
require 'rexml/parsers/pullparser'

class BaseParser
  def initialize
    @parser = nil
  end

  def parse(raw_xml)
    @parser = REXML::Parsers::PullParser.new(raw_xml)

    while @parser.has_next?
      pull_event = @parser.pull
      puts "DBG: event_type: #{pull_event.event_type}"

      if pull_event.error?
        puts "\tERR #{pull_event[0]} - #{pull_event[0]}"
      elsif pull_event.start_element?
        puts "\tSTART #{pull_event[0]}"
      elsif pull_event.end_element?
        puts "\tEND #{pull_event[0]}"
      elsif pull_event.text?
        puts "\tTXT #{pull_event[0]}"
      end
    end
  end
end

def line_process(ios, myparser)
  while (line = ios.gets)
    line.chomp!
    myparser.parse(line)
  end
end


psudo_xml = <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>
EOF

psudo_xml_io = StringIO.new(psudo_xml)
line_process(psudo_xml_io, BaseParser.new)
 
Luc Heinrich

@parser = REXML::Parsers::PullParser.new(raw_xml)

You instantiate a *new* pull parser for *each* line, so the state is
obviously lost after each line, and when you feed the last parser
</SChange> it naturally complains because it doesn't know what you're
talking about :)

--
Luc Heinrich - (e-mail address removed)
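
A minimal sketch of the point above, assuming the whole document can be reached through one IO-like object (a StringIO here). A single REXML::Parsers::PullParser is created once, so its element stack survives from one line to the next. The helper name collect_events is made up for illustration, and whitespace-only text events are skipped.

```ruby
require "stringio"
require "rexml/parsers/pullparser"

# Hypothetical helper: pulls every event from ONE parser instance and
# returns the interesting ones as [kind, value] pairs.
def collect_events(io)
  parser = REXML::Parsers::PullParser.new(io)
  events = []
  while parser.has_next?
    event = parser.pull
    if event.start_element?
      events << [:start, event[0]]
    elsif event.end_element?
      events << [:end, event[0]]
    elsif event.text? && !event[0].strip.empty?
      events << [:text, event[0].strip]
    end
  end
  events
end

xml = <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>
EOF

events = collect_events(StringIO.new(xml))
events.each { |kind, value| puts "#{kind}\t#{value}" }
```

Because the parser is handed the IO object rather than individual lines, the closing </SChange> is matched against the <SChange> that the same parser saw earlier.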
 
Sebastian (syepes)

Luc said:
You instantiate a *new* pull parser for *each* line, so the state is
obviously lost after each line and when you feed the last parser with
</SChange> it naturally complains because it doesn't know what you're
talking about :)

Mmm, so is there a way to "parse" each line of the XML independently,
and is it possible with the PullParser library?
 
Luc Heinrich

Mmm, so is there a way to "parse" each line of the XML independently

Why do you want to do that exactly? If you don't have the whole XML
file at once and only have an IO-like object, you can pass this object
directly to the pull parser, which should simply block until enough
data is available to produce each event.

--
Luc Heinrich - (e-mail address removed)
 
Bob Hutchison

Hi,

Mmm, so is there a way to "parse" each line of the XML independently,
and is it possible with the PullParser library?


Having written a pull parser, I'd have to say: No.

The parser is going to be looking for 'events', and it is going to
want to deal with well-formedness issues if it is an actual xml parser.

What are you trying to do? Maybe that's a better place to start.

Cheers,
Bob
 
Sebastian (syepes)

Bob said:
Hi,




Having written a pull parser, I'd have to say: No.

The parser is going to be looking for 'events', and it is going to
want to deal with well-formedness issues if it is an actual xml parser.

What are you trying to do? Maybe that's a better place to start.

Cheers,
Bob

----
Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://www.recursive.ca/hutch


OK, this is the problem I am trying to solve:
I need to parse XML that comes from the stdout of a Unix program. The
program sends an xml* stream when it detects a change, and the IO.popen
pipe stays open until the next change.

The real problem is that the function ex_listener processes the XML
"line by line", because I can't detect an EOF from the IO.popen pipe: it
is always open, waiting for the next change ("xml stream").


I have tried using "lines = ios.readlines", but it does not work because
there is no EOF. Is there some other way of doing this?

I would appreciate any suggestions on how to solve this problem.


*xml: Sent when a change is detected
---
<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>

Ruby
---------
UNIX_PROG = "/bin/xml_stream"

def ex_connect
  ios = IO.popen(UNIX_PROG, "w+")
  ios.sync = true

  line = ios.gets
  if line =~ /xml/
    puts "INF: Connected OK (XML)"
    ios.puts "<Events>"
    return ios
  else
    puts "ERR: Cannot connect"
    exit 1
  end
end

def ex_listener(ios)
  while (line = ios.gets)
    line.chomp!
    if line =~ /<\/Events>/
      puts "INF: END of program"
      exit 0
    end

    puts "INF: #{line} - #{line.size}"
    parse_line_of_xml(line) # placeholder: the parsing step being asked about
  end
end

ios = ex_connect
ex_listener(ios) # Processes the XML stream
 
Luc Heinrich

The real problem is that the function: ex_listener, processes the XML
"line by line" because i can't detect a EOF from the IO.popen and it
will always be waiting (open) for the next change "xml stream".

Right, but since you control the parser state you know exactly when
and where the document starts and when and where it ends, so you
should be able to close the connection by yourself.

--
Luc Heinrich - (e-mail address removed)
 
Sebastian (syepes)

Luc said:
Right, but since you control the parser state you know exactly when
and where the document starts and when and where it ends, so you
should be able to close the connection by yourself.

OK, I get the point, but I don't see how to detect the EOF (without using
some ugly code) and pass the whole *xml to the parser.
Any examples, please?


Thanks for the help.
 
Luc Heinrich

OK, I get the point, but I don't see how to detect the EOF (without using
some ugly code) and pass the whole *xml to the parser.

I'm still not exactly sure of your exact context, but you don't have
to detect the EOF: just parse, and when you reach the end of the
document, close the pipe yourself on your end.

--
Luc Heinrich - (e-mail address removed)
 
Bob Hutchison

Hi,

I'm still not exactly sure of your exact context, but you don't have
to detect the EOF, just parse and when you reach the end of the
document close the pipe yourself on your end.

Just for fun, I tried hacking something together using the pull parser
that I wrote. This pointed out one possible issue that is confusing;
I'll get to that in a second.

How to avoid waiting for an EOF? Count events. Crudely, if you
increment the count on a start element event and decrement it on an end
element event, then when the count goes to zero you've got what you are
looking for. This means you let the pull parser read the input; you
don't read it for the parser.

The issue I mentioned... In my pull parser I'm assuming a file or
string input, not an IO stream. I take advantage of that by looking
ahead a bit. This isn't a problem unless you are using a stream. In my
parser's case, it looks ahead to at least the end of the next
line (a huge performance win with files). The confusing effect shows up
with the stream input:

<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>
<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>

The close of the first SChange element isn't reported until the next
line is read, which happens to include the start of the next element.
This is a delayed effect that is maybe not the best for a stream
input. If you add a blank line between the events the problem goes
away (but it'll read the blank line before reporting, which shouldn't
be a problem).

It is possible that this is affecting your testing.

Cheers,
Bob
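
A rough sketch of the counting idea with REXML's pull parser rather than Bob's own parser (an assumption on my part; the thread does not show his code). The helper name read_one_change is hypothetical. It pulls events, tracks the nesting depth, and returns as soon as the outermost element closes, so it never asks the parser whether the input has ended. How much input the parser reads ahead before reporting the closing tag is up to the parser and is not shown here.

```ruby
require "stringio"
require "rexml/parsers/pullparser"

# Hypothetical helper: reads one top-level element from io and returns
# the text of its direct children as a Hash. It stops when the depth
# counter drops back to zero instead of waiting for EOF.
def read_one_change(io)
  parser = REXML::Parsers::PullParser.new(io)
  depth = 0
  current = nil   # name of the child element being read, if any
  fields = {}

  loop do
    event = parser.pull
    if event.start_element?
      depth += 1
      current = event[0] if depth == 2      # direct child of the root
    elsif event.text?
      fields[current] = event[0].strip if current
    elsif event.end_element?
      depth -= 1
      current = nil
      return fields if depth.zero?          # root element closed: done
    elsif event.event_type == :end_document
      return nil                            # input ended with no complete element
    end
  end
end

stream = StringIO.new(<<EOF)
<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>
EOF

change = read_one_change(stream)
puts change.inspect
```

The caller decides what to do once the Hash comes back, for example closing the pipe as Luc suggested.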
 
Mark Thomas

If I understand correctly, you want to keep an IO stream open, and
react to certain elements as they appear? That's a textbook SAX case,
not pull-parsing. Register a SAX handler for your SChange events, and
point your IO stream at it.

I'd use libxml-ruby, but REXML has a stream parser that is SAX-like.
You'd use it something like this (untested):

require "rexml/document"
require "rexml/streamlistener"
include REXML

class Handler
  include StreamListener
  def tag_start name, attrs
    if name=="SChange"
      #do something
      puts attrs
    end
  end
end

Document.parse_stream(your_io_stream, Handler.new)

-- Mark.
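
A slightly fuller sketch of the same listener approach, assuming the data of interest is the text of the child elements rather than attributes (the sample SChange has no attributes). The class name ChangeHandler and its changes accessor are made up for illustration, and a StringIO stands in for the real IO stream. It also assumes each child's text arrives in a single text callback.

```ruby
require "stringio"
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: builds one Hash per <SChange> element.
class ChangeHandler
  include REXML::StreamListener
  attr_reader :changes

  def initialize
    @changes = []
    @current = nil  # Hash for the SChange being read, if any
    @field = nil    # name of the child element being read, if any
  end

  def tag_start(name, attrs)
    if name == "SChange"
      @current = {}
    elsif @current
      @field = name
    end
  end

  def text(data)
    # Whitespace between elements arrives while @field is nil and is ignored.
    @current[@field] = data.strip if @current && @field
  end

  def tag_end(name)
    if name == "SChange"
      @changes << @current   # a complete change: react to it here
      @current = nil
    else
      @field = nil
    end
  end
end

xml = <<EOF
<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>
EOF

handler = ChangeHandler.new
REXML::Document.parse_stream(StringIO.new(xml), handler)
puts handler.changes.inspect
```

Reacting inside tag_end means each change is handled as soon as its closing tag has been parsed, without collecting the whole stream first.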
 
