Hpricot innerTEXT?

Bontina Chen

Hi


I'm using hpricot to parse the following file.

<item
rdf:about="http://del.icio.us/url/50666d1a3fe2b942b20819ec2919d2b7#morwyn">
<title>[from morwyn] * HTML for the Conceptually Challenged</title>
<link>http://del.icio.us/url/50666d1a3fe2b942b20819ec2919d2b7#morwyn</link>
<description>HTML for the Conceptually Challenged. Very basic tutorial,
plainly worded for people who hate to read instructions.</description>
<dc:creator>morwyn</dc:creator>
<dc:date>2006-10-10T07:28:28Z</dc:date>
<dc:subject>html imported webpagedesign</dc:subject>
<taxo:topics>
<rdf:Bag>
<rdf:li resource="http://del.icio.us/tag/imported" />
<rdf:li resource="http://del.icio.us/tag/html" />
<rdf:li resource="http://del.icio.us/tag/webpagedesign" />
</rdf:Bag>
</taxo:topics>
</item>

I'm trying to get the content from <dc:subject> like this

doc = Hpricot.parse(File.read("965.xhtml"))

(doc/"item").each do |t|

puts (t/"dc:subject").innerTEXT

end

but I got

<dc:subject>html internet tutorial web</dc:subject>

while I only need "html internet tutorial web"

Does anyone know the right function to call?

Thanks
 
chickenkiller

Replace innerTEXT with inner_html:

(doc/"item").each do |t|
puts (t/"dc:subject").inner_html
end

regards
Lionel
 
Bontina Chen

Lionel said:
replace innerTEXT by inner_html:

(doc/"item").each do |t|
puts (t/"dc:subject").inner_html
end

regards
Lionel

Thanks for your response, but I still get
<dc:subject>html internet tutorial web</dc:subject>
 
chickenkiller

In fact, inner_text works as well. But you should have a look at the
warnings from Ruby! The inner_text or inner_html method is applied
to the return value of 'puts (t/"dc:subject")', which is nil.
So a warning appears:
rdf.rb:6: undefined method `inner_html' for nil:NilClass (NoMethodError)

but 'puts (t/"dc:subject")' is executed, and so '<dc:subject>html
internet tutorial web</dc:subject>' is displayed anyway. Therefore I
recommend using a few parentheses there:

puts((t/"dc:subject").inner_text)

and it should work well this time.

Next time, look at the warnings!!! ;)

regards
Lionel
 
Brian Candler

Lionel said:
In fact, inner_text works as well. But you should have a look at the
warnings from ruby! The inner_text or inner_html function is applied
to 'puts (t/"dc:subject")' return object, which is nil.
So a warning appears:
rdf.rb:6: undefined method `inner_html' for nil:NilClass (NoMethodError)

That's not a warning, that's an exception, and the program will terminate at
that point. The OP didn't mention any errors.

Lionel said:
but 'puts (t/"dc:subject")' is executed, and so '<dc:subject>html
internet tutorial web</dc:subject>' is displayed anyway. Therefore I
recommend using a few parentheses there:

puts((t/"dc:subject").inner_text)

and it should work well this time.

Next time, look at the warnings!!! ;)

Good point, but it was OK the way he wrote it, with a space after puts.

irb(main):003:0> p (1+3).to_s
"4"
=> nil
irb(main):004:0> p(1+3).to_s
4
=> ""

In the first case, this is p( (1+3).to_s )

In the second case, this is ( p(1+3) ).to_s # i.e. nil.to_s
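
The same distinction can be seen without relying on what p returns (in Ruby 1.8 it returned nil; from 1.9 on it returns its argument). A minimal sketch, assuming a hypothetical helper named show that is not part of Ruby or Hpricot, and assuming Ruby 1.9 or later:

```ruby
# Hypothetical helper standing in for p/puts. It returns a description
# of the argument it received, so the two parses can be told apart.
def show(arg)
  "show got #{arg.inspect}"
end

# With a space, the parenthesised expression plus .to_s is the argument:
with_space = show (1 + 3).to_s     # parsed as show((1 + 3).to_s)

# Without a space, the parentheses are the argument list of show:
without_space = show(1 + 3).to_s   # parsed as (show(1 + 3)).to_s

puts with_space      # show got "4"
puts without_space   # show got 4
```

In the first call show receives the String "4"; in the second it receives the Integer 4, and to_s is then applied to the String that show returned.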
 
chickenkiller

Brian Candler said:
That's not a warning, that's an exception, and the program will terminate at
that point. The OP didn't mention any errors.

Indeed, I used the term 'warning' very loosely; I apologize for this.
It is an exception and nothing else.

Hmm, interesting... It seems that the problem arises only inside a
block:

# output text in comments...
require 'hpricot'

doc = Hpricot(File.open("rdf.xhtml"))

puts (doc/"item"/"dc:subject").inner_text
# html imported webpagedesign

(doc/"item").each do |t|
puts((t/"dc:subject").inner_text)
end
# html imported webpagedesign

(doc/"item").each do |t|
puts (t/"dc:subject").inner_text
end
# <dc:subject>html imported webpagedesign</dc:subject>
# rdf.rb:12: warning: don't put space before argument parentheses
# rdf.rb:12: undefined method `inner_text' for nil:NilClass (NoMethodError)
# from rdf.rb:11:in `each'
# from rdf.rb:11

I am wondering where the difference is between the two last blocks.
Any ideas?

Lionel
 
Brian Candler

Hmm, looks like this should be something that can be replicated without
hpricot.

$ cat x.rb
x = 3
puts (x-5).abs

1.times do
puts (x-5).abs
end
$ ruby -v
ruby 1.8.4 (2005-12-24) [i486-linux]
$ ruby x.rb
x.rb:5: warning: don't put space before argument parentheses
2
-2
x.rb:5: undefined method `abs' for nil:NilClass (NoMethodError)
from x.rb:4
$

Congratulations, I think you've found a bug in the parser :) I'll post this
example to ruby-core.

Regards,

Brian.
 
chickenkiller

Thanks for your help. I have the same output with this version:

ruby 1.8.6 (2007-03-13 patchlevel 0) [i386-mswin32]

regards,
Lionel
 
John Joyce

Inside the do-end or {} block, use this:
puts((x - 5).abs)
It is more explicit, but it is correct and it works.

So the original loop will work as:
(doc/"item").each do |t|
puts((t/"dc:subject").inner_html)
end
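
The fully parenthesised form can be checked without Hpricot. This is only a sketch: describe is a hypothetical helper (not part of Ruby or Hpricot) that reports the argument it received, so the result can be compared at top level, in a do-end block, and in a {} block.

```ruby
# Hypothetical helper: returns a description of the argument it was given.
def describe(arg)
  "received #{arg.inspect}"
end

x = 3

# Parentheses around the whole argument leave no room for a second parse.
top_level = describe((x - 5).abs)

in_do_block = nil
1.times do
  in_do_block = describe((x - 5).abs)
end

in_brace_block = nil
1.times { in_brace_block = describe((x - 5).abs) }

puts top_level       # received 2
puts in_do_block     # received 2
puts in_brace_block  # received 2
```

Because the argument list is delimited explicitly, the call reads the same way in all three places, which is the point of the advice above.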
 
Florian Gilcher

I prefer this version for the initial problem:

irb(main):045:0> elements = doc.search('dc:subject/text()')
=> #<Hpricot::Elements["html imported webpagedesign"]>

irb(main):048:0> elements.first.to_s
=> "html imported webpagedesign"
irb(main):049:0> elements.first.parent
=> {elem <dc:subject> "html imported webpagedesign" </dc:subject>}
 
