Best way to parse/update HTML file?

Bucco

Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

Thanks:)

SA
 
daz

Bucco said:
Sorry for the newbie question.

This has been answered once or twice before by this group ;-)

Bucco said:
I am trying to find the best method for parsing an HTML file
and changing one tag/item. Unfortunately, REXML chokes on
the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific
tag in an HTML file and change its text and attribute settings?

Thanks:)

SA

The best way tends to involve using a package although
you /could/ work your way through it using regular expressions.
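
For a one-off change, the regular-expression route might look something like the sketch below. It is only an illustration: `rewrite_hrefs` is a made-up helper name, and it assumes double-quoted href attributes and no '>' inside attribute values, neither of which real-world HTML guarantees.

```ruby
# Hypothetical helper: in every <a ...> start tag, replace `from` with `to`
# inside a double-quoted href value. Text outside <a> tags is left alone.
def rewrite_hrefs(html, from, to)
  html.gsub(/<a\b[^>]*>/i) do |tag|
    tag.sub(/href\s*=\s*"([^"]*)"/i) do
      url = Regexp.last_match(1)   # capture the URL before the inner sub runs
      %(href="#{url.sub(from, to)}")
    end
  end
end

page = '<p>xxxx</p><a href="http://xxxx.net/">Create</a>'
puts rewrite_hrefs(page, 'xxxx', 'mysite')
```

Note that the "xxxx" in the paragraph text is untouched, because only the start tags of links are rewritten. Anything less regular than this (single quotes, unquoted values, attributes split across lines) is where a real parser starts to pay for itself.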

If you're likely to be doing this kind of thing in the
future, you'll be glad you spent a bit of time installing
one; then it's always available.

As I recall, a different package is often
recommended but I don't know which is best.

This is what some of us use:
http://ruby-htmltools.rubyforge.org/ (Ned Konz +)

Examples are included but here's another ...

#-----------------------------------------------------------------
EXAMPLE = <<EOX
<html lang="en">
<head>
<title>Page title</title>
</head>
<body>
<div id="Header">
<h1><a href="http://xxxx.net/"><em>q</em>URL.net</a></h1>
<p>For When You Want a Quick URL</p>
</div>
<hr>
<div id="Content">
<form action="http://xxxx.net/" method="post">
<fieldset>
<legend>Enter a <abbr title="Uniform Resource Locator">URL</abbr> to make into a xxxx:</legend>
<input id="InputURL" type="text" size="40" maxlength="65535" name="url" value="">
<input type="submit" name="action" value="Create xxxx">
</fieldset>
</form>
</div>
<hr>
<a href="http://xxxx.net/pages/contact">Contact</a> -
<a href="http://xxxx.net/downloads/">Downloads</a> -
<a href="http://xxxx.net/">Create</a> -
<a href="http://xxxx.net/pages/terms">Terms of Use</a> -
<a href="http://xxxx.net/pages/list">List</a> -
<a href="http://xxxx.net/pages/prefs">Preferences</a>
</body>
</html>
EOX


require 'html/tree' # http://ruby-htmltools.rubyforge.org/

verbose = true

exa = HTMLTree::Parser.new(verbose, true)
#exa.parse_file_named('xxxx_net.html')
exa.feed(EXAMPLE) # replaces '.parse_file_named'

# Collect the <a> elements and print each href.
item_a = exa.html.select {|ea| ea.tag == 'a'}
item_a.each {|ea| p [:ahref, ea['href']]}
puts '+'*100

# For each element yielded: print its tag and href, print its text
# nodes, and change 'xxxx' to 'mysite' in the href of its <a> children.
exa.html.each do |ea|
  p [ea.tag, ea['href']]
  ea.each do |item|
    if item.data?
      p [:data, item.to_s]
    elsif item.tag == 'a'
      item['href'].sub!(/xxxx/, 'mysite')
    end
  end
  puts '='*100
  ea.dump
end

### exa.html.dump

#-----------------------------------------------------------------


Output from the script above is too long to post here,
so I've uploaded it to:
http://www.d10.karoo.net/ruby/example_html_parse.txt


Hope this is of some use,


daz
 
Brad Wilson

If you're comfortable "cleaning it up", why not tidy it to XHTML then
use the XML parser? This is the approach I took recently when I needed
it.
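
A minimal sketch of that approach, assuming the page has already been converted to well-formed XHTML (for example with `tidy -asxhtml`). The sample markup, the "home" id, and the helper name `retarget_link` are made up for illustration. It uses REXML, which ships with Ruby, to find one tag and change its text and an attribute.

```ruby
require 'rexml/document'

# Hypothetical sample: XHTML as it might look after a tidy pass.
SAMPLE_XHTML = <<EOX
<html>
<body>
<a id="home" href="http://xxxx.net/">Create</a>
<hr />
</body>
</html>
EOX

# Find the <a> with the given id, then change its href and its text.
# Returns the modified document as a string.
def retarget_link(xhtml, id, new_href, new_text)
  doc = REXML::Document.new(xhtml)
  link = REXML::XPath.first(doc, "//a[@id='#{id}']")
  raise ArgumentError, "no <a> with id #{id}" unless link
  link.attributes['href'] = new_href
  link.text = new_text
  doc.to_s
end

puts retarget_link(SAMPLE_XHTML, 'home', 'http://mysite.net/', 'Home')
```

Real tidy output usually declares the XHTML namespace on the html element, so the XPath expression may need adjusting for that; the sample above leaves the namespace out to keep the sketch short.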
 
Bucco

I have a couple more questions then:

1. I tried the example for the htmltokenizer and got an error around
assert. Where/what is the assert method?

2. What do you mean by "slurp" in the rest of the text?

3. Any better examples how to use htmltokenizer?

Thanks:)
SA
 
why the lucky stiff

Bucco said:
Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

Hi. I know I'm a bit late to the discussion, so 'sokay if you have an
answer already.

A really fantastic HTML parser library is HTree by Tanaka Akira.

<http://cvs.m17n.org/~akr/htree/>

It's completely forgiving of bad HTML and you can import the document
into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows. Other than that,
it's a great package.

_why
 
daz

_why said:
[snip]

A really fantastic HTML parser library is HTree by Tanaka Akira.

I'm glad you brought that in because I tried it last year and saw
that it was a serious "heavy horse" and perhaps a little bit
_more_ than I was looking for.
It was adding XHTML namespace prefixes to all tags, so a
horizontal rule, for example, became:

_why said:
It's completely forgiving of bad HTML and you can import the
document into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

It may have been in its early stages of development, but my
assumption that HTree would be too strict is under review :)
Applying your example, I get the result I was expecting
without all that namespace stuff.

_why said:
The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows.

Instead of hunting around for that, I'd made a dummy Ruby version
(no functionality for those who don't need any):

In my old version there are two files which require 'iconv' -
("text.rb" and "encoder.rb") which I changed to:

begin
  require 'iconv'
rescue LoadError
  require 'htree/iconv_dummy'
end

Then, add this dummy file as:
lib\ruby\site_ruby\1.8\htree\iconv_dummy.rb

#-------------------------------------------------------------
class Iconv

  ## For testing : Not part of the HTree package ##
  warn "Using dummy iconv lib: #{__FILE__}"
  IC_DUMMY = true

  # Class-level entry points: no conversion is done, strings pass through.
  def Iconv.open(to, from)
    inst = Iconv.new(to, from)
    block_given? ? yield(inst) : inst
  end
  def Iconv.iconv(to, from, *strs)
    strs.join
  end
  def Iconv.conv(to, from, str)
    str
  end
  def Iconv.list
    raise 'No Iconv.list'
  end
  def initialize(to, from)
  end
  def close
    ''
  end
  # Return the requested slice unchanged; a negative length means "to the end".
  def iconv(str, strt = 0, len = -1)
    (len and !( len < 0 )) or len = str.size - strt
    str[strt, len]
  end

  module Failure
    def initialize(*args) # 3
    end
    def success
    end
    def failed
    end
    def inspect
    end
  end

  # class InvalidEncoding < ArgumentError; end
  # class IllegalSequence < ArgumentError; end
  # class InvalidCharacter < ArgumentError; end
  # class OutOfRange < RuntimeError; end

  def Iconv.charset_map
    raise 'No Iconv.charset_map'
  end
end
#-------------------------------------------------------------



daz
 
Bill Guindon

why the lucky stiff wrote:
Hi. I know I'm a bit late to the discussion, so 'sokay if you have an
answer already.

A really fantastic HTML parser library is HTree by Tanaka Akira.

<http://cvs.m17n.org/~akr/htree/>

It's completely forgiving of bad HTML and you can import the document
into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows. Other than that,
it's a great package.

There's a page on the Rails site that covers the iconv installation on Windows:
http://wiki.rubyonrails.com/rails/show/iconv

Once I had the iconv.so in a library path, and iconv.dll in
windows\system32, I ran the test-all.rb. Got an error due to a lack
of /dev/null, but that was fixed by creating a dev directory, and
adding an empty 'null' file to it.

Should swap that out to have it point to a temp dir, but with that
setup, all of the htree tests passed.

--
Bill Guindon (aka aGorilla)
 
