hpricot selective text modification

  • Thread starter Siddharth Karandikar

Siddharth Karandikar

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/207625
is an answer to most of my requirements, except one.

How can I do a selective traverse_text so that I can skip text of
specific tags?

One option was to use parent.name while traversing over text.
Here is the code that I tried,

require 'hpricot'
class Hpricot::Text
  def set(string)
    @content = string
    self.raw_string = string
  end
end

s = <<HTML
<html>
<body>
<h4>Abcd</h4>
<java>this is in java1</java>
<ul>
<li>aabbcc</li>
<li>mmnnoo</li>
<li><java>this is in java2</java></li>
</ul>
<java>this is in java3</java>
</body>
</html>
HTML

index = Hpricot.parse(s)
index.traverse_text { |text|
  t = text.to_s.strip
  if text.parent and text.parent.name and text.parent.name != 'java' and
     not t.empty?
    t = "=#{t}="
    text.set(t)
    puts "Modified text to:#{t}"
  end
}
puts index


I get the following error:

Modified text to:=Abcd=
Modified text to:=aabbcc=
Modified text to:=mmnnoo=
hpricot-test1.rb:30: undefined method `name' for #<Hpricot::Doc:0x2e49c18> (NoMethodError)
  from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:377:in `traverse_text_internal'
  from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:366:in `traverse_text_internal'
  from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:146:in `each'
  from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:146:in `each_child'
  from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:366:in `traverse_text_internal'
  from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:358:in `traverse_text'
  from hpricot-test1.rb:28


Am I making any mistake?

I am new to the world of Ruby and Hpricot ... so please bear with me.

- Siddharth
 

Siddharth Karandikar

Here is the scenario,

I am trying to have my blog in two languages: English and my native
language, Marathi. The blog posts will be written in plain text. Using
bluecloth, I am generating the required HTML markup.
I have hacked bluecloth to emit <english>...</english> in the required
places,

e.g.
### <E title E>

will generate
<h3><english>title</english></h3>

Now when I get this kind of HTML, I would like to skip all the text
under the 'english' tag and convert all the remaining text to my language,
Marathi (UTF-8 codes). I am using Hpricot for this.

After that I am thinking of removing all the 'english' tags but keeping
the markup surrounding them.
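
That last step can be sketched with plain string substitution. This is only a rough sketch, assuming the 'english' tags carry no attributes; strip_english_tags is a hypothetical helper name, not part of bluecloth or Hpricot.

```ruby
# Remove the <english> and </english> markers, keeping the text inside them
# and the markup around them. Assumes the tags have no attributes.
def strip_english_tags(html)
  html.gsub(%r{</?english>}, '')
end

puts strip_english_tags('<h3><english>title</english></h3>')
# prints <h3>title</h3>
```

A parser-based version would be more robust if the tags ever gain attributes or odd spacing.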

- Siddharth
 

Peter Szinek

> Am I making any mistake?

Sure :).

Some W3C DOM theory:

A document consists of different Nodes, in practice subclasses of Node:
Element, Document, Attribute, Comment, Text, ProcessingInstruction etc.
(just off the top of my head; there are some more, like
DocumentFragment, CData, ... but it is unlikely you will need them
here). Not every Node has a name, or children, or a parent, and so on. You
have to make sure that the subclass of Node you are talking to
actually responds to the method you are trying to send it.

A Hpricot DOM is not exactly a W3C DOM, but it is mostly similar.

Not every node has a name: Hpricot::Elem does, but Hpricot::Doc (the
class in your error message) does not, and neither does Hpricot::Text or
Hpricot::Comment. Also, a Hpricot::Doc does not have a parent, I think.

Your problem is that you are traversing up and reach the document node
(Hpricot::Doc), which does not have a method called name.

So you have to change the start of this line of your code:

if text.parent and text.parent.name and text.parent.name != 'java' and

to

parent = text.parent
if parent.instance_of?(Hpricot::Elem) # or use respond_to?(:name), or check parent.parent.nil?
  # do the stuff (the parent is an element, so parent.name is safe to call)
else
  # the parent is the top node, the document; nothing to do
end

(The else branch is not needed, I just added it for illustration)
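
A minimal sketch of the respond_to variant, written against stand-in structs rather than Hpricot itself so that it runs without the gem. DocStub, ElemStub, TextStub and wrap_text? are hypothetical names for illustration only; with Hpricot the same check would be made on text.parent inside the traverse_text block.

```ruby
# Stand-ins for the three node kinds involved (hypothetical, not Hpricot's classes).
DocStub  = Struct.new(:parent)            # like the document node: no name method
ElemStub = Struct.new(:name, :parent)     # like an element: has a name
TextStub = Struct.new(:content, :parent)  # like a text node

# True when the text should be modified: it is non-blank and its parent
# is a named node other than <java>.
def wrap_text?(text)
  parent = text.parent
  return false unless parent.respond_to?(:name) # skips the document node (and nil)
  parent.name != 'java' && !text.content.strip.empty?
end

doc  = DocStub.new(nil)
h4   = ElemStub.new('h4', doc)
java = ElemStub.new('java', doc)

puts wrap_text?(TextStub.new('Abcd', h4))       # text inside <h4>
puts wrap_text?(TextStub.new('in java', java))  # text inside <java>
puts wrap_text?(TextStub.new("\n", doc))        # whitespace directly under the document
```

Checking respond_to?(:name) instead of the exact class keeps the guard working for any node type that happens to have a name.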

HTH,
Peter

__
http://www.rubyrailways.com
 

Siddharth Karandikar

Thanks, Peter.
I need to improve my knowledge about the DOM in general.

I have modified the code to do "if p.instance_of? Hpricot::Elem and
....."
Right now, it's working fine for me. I still need to think about all the
possible cases.

Thanks,
Siddharth
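
One case that may be worth checking is text nested deeper inside a skipped tag, for example <java><b>text</b></java>, where the direct parent is b rather than java. A rough sketch of an ancestor walk, again using stand-in structs so it runs without the gem (RootStub, NodeStub and inside_tag? are hypothetical names, not Hpricot's API):

```ruby
# Stand-ins: the root has no name method, every other node has one.
RootStub = Struct.new(:parent)
NodeStub = Struct.new(:name, :parent)

# Walk up the parent chain and report whether any ancestor has the given tag name.
def inside_tag?(node, tag)
  ancestor = node.parent
  while ancestor
    # Guard with respond_to? because the root node has no name method.
    return true if ancestor.respond_to?(:name) && ancestor.name == tag
    ancestor = ancestor.parent
  end
  false
end

root = RootStub.new(nil)
java = NodeStub.new('java', root)
bold = NodeStub.new('b', java)
text = NodeStub.new(nil, bold)

puts inside_tag?(text, 'java')     # parent is <b>, grandparent is <java>
puts inside_tag?(text, 'english')  # no <english> ancestor
```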
 
