[REXML] Raw Elements

trans.

I'm having trouble getting elements to be raw. I use, for example:
d = Document.new( str, { :raw => [ 'O', 'T', 'V' ] } )

Then when I traverse the document and query Element#raw it does say
'true' for these tags, but it still appears that they have been parsed
and I can't get the raw text.
e.get_text.value

Returns the same thing as

Is there another way one is supposed to use to get at the raw text?
Thanks,
T.
 
trans.

Sigh, I just realized I misunderstood what raw meant -- it just
relates to entity parsing. Why am I getting the feeling that there is
no way to prevent parsing of the body of an element? I pray this is not
the case, b/c it means back to the drawing board for something like the
13th time :-(. But if it is the case, can anyone recommend another XML
parser that can do this?

Thanks,
T.
 
Aredridel

trans. said:
Sigh, I just realized I misunderstood what raw meant -- it just
relates to entity parsing. Why am I getting the feeling that there is
no way to prevent parsing of the body of an element? I pray this is not
the case, b/c it means back to the drawing board for something like the
13th time :-(. But if it is the case, can anyone recommend another XML
parser that can do this?

None that I know of. The problem is this: where do you continue from,
and how do you know if not by parsing?

Ari
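
One partial workaround, sketched here under assumptions that nobody in this thread confirms: let REXML parse the element as usual, then re-serialize its children to recover the inner markup. The helper name inner_markup is hypothetical. The result is REXML's rendering of the parsed nodes, not the original bytes (attribute quoting, for instance, may come out differently), so it only approximates "verbatim".

```ruby
require 'rexml/document'

# Hypothetical helper: re-serializes the children of an element.
# This is REXML's rendering of the parsed nodes, not the source text.
def inner_markup(element)
  element.children.map(&:to_s).join
end

doc = REXML::Document.new('<r><O>a <b>bold</b> c</O><T>plain</T></r>')
puts inner_markup(doc.root.elements['O'])
```

This still parses the body, which is Ari's point: the parser has to read the content to find where the element ends.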
 
trans.

Well, I just want to specify a tag and anything in that tag would be
left verbatim. That's all really. I'm trying to find info on libxml
bindings (rather difficult to find, it seems), though I have a feeling
that won't work either.

If worst comes to worst, I'll whip out old trusty Tagiter and see if that
will do. Otherwise I'll have to roll my own. Just what I need: more
work!

Thanks Ari.
 
Zach Dennis

trans. said:
Well, I just want to specify a tag and anything in that tag would be
left verbatim. That's all really. I'm trying to find info on libxml
bindings (rather difficult to find, it seems), though I have a feeling
that won't work either.

If worst comes to worst, I'll whip out old trusty Tagiter and see if that
will do. Otherwise I'll have to roll my own. Just what I need: more
work!

REXML should be pretty easy to manipulate or add functions to. Why roll
your own when you can just add a new behavior?

Zach
 
trans.

Thanks Zach, that's a fair idea. I did a little REXML hacking a few
years back, so maybe so....

Sean, do you still frequent this list? Is it reasonably feasible?

T.
 
William James

trans. said:
Well, I just want to specify a tag and anything in that tag would be
left verbatim. That's all really. I'm trying to find info on libxml
bindings (rather difficult to find, it seems), though I have a feeling
that won't work either.

If worst comes to worst, I'll whip out old trusty Tagiter and see if that
will do. Otherwise I'll have to roll my own. Just what I need: more
work!

Thanks Ari.


Here's a micro XML parser (originally posted via Google, which
stripped the indentation):


# Produces array of nonmatching and matching
# substrings. The size of the array will
# always be an odd number. The first and the
# last item will always be nonmatching.
def shatter( s, re )
  # "\1" is the control character \x01, used as a delimiter around
  # each match ('\&'). The -1 limit keeps a trailing empty string,
  # so the odd-size guarantee holds even when s ends with a match.
  s.gsub( re, "\1"+'\&'+"\1" ).split( "\1", -1 )
end

# Collects name="value" pairs into a hash.
def get_attr( s )
  h = Hash.new
  while s =~ /(\w+)="([^"]*)"/
    h[$1] = $2
    s = $'
  end
  h
end

# Returns the tag name, or nil if s is not a tag.
def tag_name( s )
  if ( s =~ /^<(\S+)(\s|>)/ )
    $1
  else
    nil
  end
end

s = ''
$<.each_line {|x| s=s+x}
all = shatter( s, /<[^>]*>/ )
all.each {|x|
  x.chomp!
  if x.size > 0
    print x
    tname = tag_name(x)
    print " | " + tname if tname
    print "\n"
    attr = get_attr( x )
    if attr.size > 0
      attr.each_pair {|key,val| puts ".....#{key}-->#{val}" }
    end
  end
}


With this input:

<?xml version="1.0" encoding="UTF-8"?>
<tv><programme start="20041218204000 +1000"
stop="20041218225000+1000" channel="Network TEN Brisbane">
<title>The Frighteners</title>
<sub-title/><desc>A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.</desc>
<rating system="ABA"><value>M</value></rating><length
units="minutes">130</length><category>Horror</category></programme>

the output is:

<?xml version="1.0" encoding="UTF-8"?> | ?xml
.....encoding-->UTF-8
.....version-->1.0
<tv> | tv
<programme start="20041218204000 +1000"
stop="20041218225000+1000" channel="Network TEN Brisbane"> | programme
.....stop-->20041218225000+1000
.....start-->20041218204000 +1000
.....channel-->Network TEN Brisbane
<title> | title
The Frighteners
</title> | /title
<sub-title/> | sub-title/
<desc> | desc
A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.
</desc> | /desc
<rating system="ABA"> | rating
.....system-->ABA
<value> | value
M
</value> | /value
</rating> | /rating
<length
units="minutes"> | length
.....units-->minutes
130
</length> | /length
<category> | category
Horror
</category> | /category
</programme> | /programme
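
To make the behaviour of shatter concrete, here is a minimal sketch on a tiny input. One detail is an assumption added for this illustration: split is called with a limit of -1, because by default String#split drops trailing empty strings, which would break the "always odd, last item nonmatching" property whenever the input ends with a tag. The sketch also assumes the input never contains the \x01 delimiter character.

```ruby
# Same idea as shatter above; the -1 limit keeps trailing empty strings.
def shatter(s, re)
  s.gsub(re, "\1" + '\&' + "\1").split("\1", -1)
end

parts = shatter("<a>hi</a>", /<[^>]*>/)
p parts        # nonmatching and matching pieces alternate
p parts.size   # odd, because the first and last pieces are nonmatching

# String#split with a capture group keeps the delimiters too,
# so it produces the same array without the \x01 trick.
p "<a>hi</a>".split(/(<[^>]*>)/, -1)
```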
 
trans.

Hey, thanks! Not sure if I'll end up using it, since I just spent last night
writing a general-purpose stack-based parser. But I'll keep it for
reference.

Love the method name #shatter, BTW.

T.

P.S. FYI, I figured out that you can just use a "margin" character in
order to preserve indentation. For example, I'm using Google Groups now
too:

: class A
:   def shatter
:     # ...
:   end
: end

As to which character you like best, that's your call ;-).

Also, I know there is a way to set the google group to a fixed-font
mode (I manage a group and there is that option), but I don't know who
manages this group and thus would be able to set it.
 
Robert Klemme

trans. said:
Hey, thanks! Not sure if I'll end up using it, since I just spent last night
writing a general-purpose stack-based parser. But I'll keep it for
reference.

Love the method name #shatter, BTW.

T.

P.S. FYI, I figured out that you can just use a "margin" character in
order to preserve indentation. For example, I'm using Google Groups now
too:

: class A
:   def shatter
:     # ...
:   end
: end

As to which character you like best, that's your call ;-).

I'd like best a space.

Oh, I'm sorry, just got my silly five minutes. :)

robert
 
trans.

I'd like best a space.

Me too, but what you gonna do?

Also, btw, I should have mentioned that the small size of your parser is
impressive -- micro indeed!

T.
 
William James

trans. said:
P.S. FYI, I figured out that you can just use a "margin" character in
order to preserve indentation.


Thanks; I didn't think of that.

.. def shatter( s, re )
..   s.gsub( re, "\1"+'\&'+"\1" ).split( "\1", -1 )
.. end
 
William James

Here's an improved version:

.. class String
..   # Produces array of nonmatching and matching
..   # substrings. The size of the array will
..   # always be an odd number. The first and the
..   # last item will always be nonmatching.
..   def shatter( re )
..     # The -1 limit keeps a trailing empty string when self ends with a match.
..     self.gsub( re, "\1"+'\&'+"\1" ).split( "\1", -1 )
..   end
..   def xml_parse
..     self.shatter( /<[^>]*>/ )
..   end
..   # Yields each attribute name and value found in the string.
..   def get_attr
..     s = self
..     while s =~ /(\w+)="([^"]*)"/m
..       yield( $1, $2 )
..       s = $'
..     end
..   end
..   def tag_name
..     if ( self =~ /^<(\S+)(\s|>)/ )
..       $1
..     else
..       nil
..     end
..   end
..   # Yields the opening tag and the verbatim body of each tagname element.
..   def span( tagname )
..     s = self
..     while (s =~ Regexp.new(
..       '(<'+tagname+'.*?>)(.*?)</'+tagname+'>',
..       Regexp::MULTILINE))
..       yield( $1, $2 )
..       s = $'
..     end
..   end
.. end
..
.. s = ''
.. $<.each_line {|x| s=s+x}
.. s.span('programme') { |tag,string|
..   string.span('title') {|junk,title|
..     puts 'Title: ' + title.chomp.gsub(/\n/,' ')
..   }
..   string.span('length') {|tag,len|
..     tag.get_attr {|key,val| @units=val }
..     puts "Length: #{len.chomp} #{@units}"
..   }
.. }

With the input

<?xml version="1.0" encoding="UTF-8"?>
<tv><programme start="20041218204000 +1000" stop="20041218225000
+1000" channel="Network TEN Brisbane"><title>The
Frighteners</title><sub-title/><desc>A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.</desc><rating
system="ABA"><value>M</value></rating><length
units="minutes">130</length><category>Horror</category></programme><programme
start="20041218080000 +1000" stop="20041218083000 +1000"
channel="Network TEN Brisbane"><title>Worst Best
Friends</title><sub-title>Better Than Glen</sub-title><desc>Life's
like that for Roger Thesaurus - two of his best friends are also his
worst enemies!</desc><rating
system="ABA"><value>C</value></rating><length
units="minutes">30</length><category>Children</category></programme>

the output is

Title: The Frighteners
Length: 130 minutes
Title: Worst Best Friends
Length: 30 minutes
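
The span method above is close to what the original question asked for: the body of a chosen tag, left verbatim. A minimal standalone sketch of that idea follows. The name verbatim_bodies is hypothetical, and the sketch assumes the tag is never nested inside itself and that no `>` appears inside attribute values. Unlike the pattern in span, it escapes the tag name and requires whitespace or `>` right after it, so asking for 'T' does not also match `<Title>`.

```ruby
# Hypothetical helper: returns the untouched text between <tagname ...>
# and </tagname> for every occurrence. Assumes no self-nesting.
def verbatim_bodies(xml, tagname)
  name = Regexp.escape(tagname)
  re = /<#{name}(?:\s[^>]*)?>(.*?)<\/#{name}>/m
  xml.scan(re).map(&:first)
end

xml = '<r><O>a <b>x</b> &amp; y</O><T>z</T><O>2</O></r>'
verbatim_bodies(xml, 'O').each { |body| puts body }
```

Because nothing inside the matched span is interpreted, markup and entity references in the body come back exactly as written.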
 
