Best way to parse/update HTML file?

Bucco · Jun 25, 2005

Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tag
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

Thanks

SA

daz · Jun 26, 2005

Bucco said:
Sorry for the newbie question.

This has been answered once or twice before by this group ;-)

I am trying to find the best method for parsing an HTML file
and changing one tag/item. Unfortunately, REXML chokes on
the file because of the incomplete tags. Completing the tag
is not an option either. What is the best way to find a specific
tag in an HTML file and change its text and attribute settings?

Thanks

SA

The best way tends to involve using a package although
you /could/ work your way through it using regular expressions.
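
As a rough illustration of the regular-expression route mentioned above, here is a minimal sketch that rewrites the href of each <a> tag. The method name rewrite_hrefs and the sample string are made up for this example. It assumes quoted attribute values and no '>' inside an attribute, which real-world HTML does not guarantee, so treat it as a stopgap rather than a parser.

```ruby
# Rewrite the href of every <a> tag, replacing the first match of
# `pattern` inside the URL with `replacement`.
# Assumption: attribute values are quoted and contain no '>'.
def rewrite_hrefs(html, pattern, replacement)
  html.gsub(/<a\b[^>]*>/i) do |tag|
    tag.sub(/(\bhref\s*=\s*)(["'])(.*?)\2/i) do
      # Copy the captures before calling sub again, which resets $1..$3.
      prefix, quote, url = $1, $2, $3
      "#{prefix}#{quote}#{url.sub(pattern, replacement)}#{quote}"
    end
  end
end

sample = '<p><a href="http://xxxx.net/pages/list">List</a><hr><a name="top">Top</a>'
puts rewrite_hrefs(sample, /xxxx/, 'mysite')
```

Tags without an href, and tags such as <abbr> that merely start with "a", are left untouched.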

If you're likely to be doing this kind of thing in the
future, you'll be glad you spent a bit of time installing one;
after that it's always available.

As I recall, a different package is often
recommended but I don't know which is best.

This is what some of us use:
http://ruby-htmltools.rubyforge.org/ (Ned Konz +)

Examples are included but here's another ...

#-----------------------------------------------------------------
EXAMPLE = <<EOX
<html lang="en">
<head>
<title>Page title</title>
</head>
<body>
<div id="Header">
<h1><a href="http://xxxx.net/"><em>q</em>URL.net</a></h1>
<p>For When You Want a Quick URL</p>
</div>
<hr>
<div id="Content">
<form action="http://xxxx.net/" method="post">
<fieldset>
<legend>Enter a <abbr title="Uniform Resource Locator">URL</abbr> to make into a xxxx:</legend>
<input id="InputURL" type="text" size="40" maxlength="65535" name="url" value="">
<input type="submit" name="action" value="Create xxxx">
</fieldset>
</form>
</div>
<hr>
<a href="http://xxxx.net/pages/contact">Contact</a> -
<a href="http://xxxx.net/downloads/">Downloads</a> -
<a href="http://xxxx.net/">Create</a> -
<a href="http://xxxx.net/pages/terms">Terms of Use</a> -
<a href="http://xxxx.net/pages/list">List</a> -
<a href="http://xxxx.net/pages/prefs">Preferences</a>
</body>
</html>
EOX

require 'html/tree' # http://ruby-htmltools.rubyforge.org/

verbose = true

exa = HTMLTree::Parser.new(verbose, !false)
#exa.parse_file_named('xxxx_net.html')
exa.feed(EXAMPLE) # replaces '.parse_file_named'

item_a = exa.html.select {|ea| ea.tag == 'a'}
item_a.each {|ea| p [:ahref, ea['href']]}
puts '+'*100

exa.html.each do |ea|
  p [ea.tag, ea['href']]
  ea.each do |item|
    if item.data?
      p [:data, item.to_s]
    elsif item.tag == 'a'
      item['href'].sub!(/xxxx/, 'mysite')
    end
  end
  puts '='*100
  ea.dump
end

### exa.html.dump

#-----------------------------------------------------------------

Output from the script above is too long to post here,
so I've uploaded it to:
http://www.d10.karoo.net/ruby/example_html_parse.txt

Hope this is of some use,

daz

Brad Wilson · Jun 27, 2005

If you're comfortable "cleaning it up", why not tidy it to XHTML then
use the XML parser? This is the approach I took recently when I needed
it.
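
A minimal sketch of the second half of that approach, assuming the file has already been tidied into well-formed XHTML. The method name update_link, the id "home" and the sample markup are invented for illustration; the sample deliberately carries no xmlns declaration, whereas real tidy output usually does and may need namespace-aware XPath.

```ruby
require 'rexml/document'

# Find the <a> element with the given id, change its href attribute and
# its text, and return the updated document as a string.
# Returns nil when no such element exists.
def update_link(xhtml, link_id, new_href, new_text)
  doc  = REXML::Document.new(xhtml)
  link = REXML::XPath.first(doc, "//a[@id='#{link_id}']")
  return nil unless link

  link.attributes['href'] = new_href
  link.text = new_text
  doc.to_s
end

TIDIED = '<html><body><p><a id="home" href="http://xxxx.net/">Create</a></p><hr /></body></html>'
puts update_link(TIDIED, 'home', 'http://mysite.net/', 'Home')
```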

Bucco · Jun 28, 2005

I have a couple more questions then:

1. I tried the example for the htmltokenizer and got an error around
assert. Where/what is the assert method?

2. What do you mean by "slurp" in the rest of the text?

3. Any better examples how to use htmltokenizer?

Thanks

SA

why the lucky stiff · Jun 28, 2005

Bucco said:
Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tag
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

Hi. I know I'm a bit late to the discussion, so 'sokay if you have an
answer already.

A really fantastic HTML parser library is HTree by Tanaka Akira.

<http://cvs.m17n.org/~akr/htree/>

It's completely forgiving of bad HTML and you can import the document
into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows. Other than that,
it's a great package.

_why

daz · Jun 29, 2005

_why said:
[snip]

A really fantastic HTML parser library is HTree by Tanaka Akira.

I'm glad you brought that in because I tried it last year and saw
that it was a serious "heavy horse" and perhaps a little bit
_more_ than I was looking for.
It was adding XHTML namespace prefixes to all tags, so a
horizontal rule, for example, became:

It's completely forgiving of bad HTML and you can import the
document into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

It may have been in its early stages of development, but my
assumption that HTree would be too strict is under review

Applying your example, I get the result I was expecting
without all that namespace stuff.

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows.

Instead of hunting around for that, I'd made a dummy Ruby version
(no functionality for those who don't need any):

In my old version there are two files which require 'iconv' -
("text.rb" and "encoder.rb") which I changed to:

begin
require 'iconv'
rescue LoadError
require 'htree/iconv_dummy'
end

Then, add this dummy file as:
lib\ruby\site_ruby\1.8\htree\iconv_dummy.rb

#-------------------------------------------------------------
class Iconv

  ## For testing : Not part of the HTree package ##
  warn "Using dummy iconv lib: #{__FILE__}"
  IC_DUMMY = true

  def Iconv.open(to, from)
    inst = Iconv.new(to, from)  # initialize expects (to, from)
    block_given? ? yield(inst) : inst
  end
  def Iconv.iconv(to, from, *strs)
    strs.join
  end
  def Iconv.conv(to, from, str)
    str
  end
  def Iconv.list
    raise 'No Iconv.list'
  end
  def initialize(to, from)
  end
  def close
    ''
  end
  # No conversion: return the requested slice of str unchanged.
  def iconv(str, strt = 0, len = -1)
    (len and !( len < 0 )) or len = str.size - strt
    str[strt, len]
  end

  module Failure
    def initialize(*args) # 3
    end
    def success
    end
    def failed
    end
    def inspect
    end
  end

  # class InvalidEncoding < ArgumentError; end
  # class IllegalSequence < ArgumentError; end
  # class InvalidCharacter < ArgumentError; end
  # class OutOfRange < RuntimeError; end

  def Iconv.charset_map
    raise 'No Iconv.charset_map'
  end
end
#-------------------------------------------------------------
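
The begin/rescue LoadError idiom used above works for any optional library. A minimal sketch of the same idea as a reusable helper; the name require_first_available and the library name 'no_such_iconv_library' are invented for illustration, and 'set' merely stands in for a fallback that is known to exist.

```ruby
# Try each library name in order and return the first one that loads.
# Raises LoadError if none of them can be required.
def require_first_available(*names)
  names.each do |name|
    begin
      require name
      return name
    rescue LoadError
      next  # fall through to the next candidate
    end
  end
  raise LoadError, "none of #{names.join(', ')} could be loaded"
end

# Hypothetical usage: the first name does not exist, so 'set' is loaded.
loaded = require_first_available('no_such_iconv_library', 'set')
puts "loaded: #{loaded}"
```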

daz

Bill Guindon · Jul 9, 2005

why the lucky stiff wrote:
Hi. I know I'm a bit late to the discussion, so 'sokay if you have an
answer already.

A really fantastic HTML parser library is HTree by Tanaka Akira.

<http://cvs.m17n.org/~akr/htree/>

It's completely forgiving of bad HTML and you can import the document
into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows. Other than that,
it's a great package.

There's a page on the Rails site that covers the iconv installation on Windows:
http://wiki.rubyonrails.com/rails/show/iconv

Once I had the iconv.so in a library path, and iconv.dll in
windows\system32, I ran the test-all.rb. Got an error due to a lack
of /dev/null, but that was fixed by creating a dev directory, and
adding an empty 'null' file to it.

Should swap that out to have it point to a temp dir, but with that
setup, all of the htree tests passed.
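
One way such a swap might look, as a hedged sketch rather than anything taken from the htree test suite: ask Ruby for the platform's null device where the File::NULL constant is defined, and otherwise fall back to a scratch file in the temp directory. The method name null_sink_path is invented for this example.

```ruby
require 'tmpdir'

# Return a path that output can be written to and then ignored.
# Uses the platform's null device when Ruby defines File::NULL,
# otherwise a scratch file named 'null' in the temp directory.
def null_sink_path
  return File::NULL if defined?(File::NULL)
  File.join(Dir.tmpdir, 'null')
end

File.open(null_sink_path, 'w') { |io| io.write('discarded test output') }
puts "writing throwaway output to #{null_sink_path}"
```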

_why

-- 
Bill Guindon (aka aGorilla)
