Hpricot elem index/position

  • Thread starter henryturnerlists
  • Start date
H

henryturnerlists

Hey,

Trying to find the String index of an Hpricot::Elem within its doc.
For example..

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 ( the first '<' of the second a tag.)

and eventually the following would be good..

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners
 
M

Mark Thomas

Hey,

Trying to find the String index of an Hpricot::Elem within its doc.
For example..

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 ( the first '<' of the second a tag.)

and eventually the following would be good..

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners

My first thought is: Why do you want that information? Character
position is meaningless in an XML and HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

-- Mark.
 
H

henryturnerlists

Hi Mark,

I'm writing a broken link reporting type tool. When I find a dodgy tag
I'd like to be able to relay the character position and or line number
to the user. Useful for debugging.

Thanks -h

Trying to find the String index of an Hpricot::Elem within its doc.
For example..
doc =3D Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem =3D doc.search("a")[1]
elem.start #=3D> 10 ( the first '<' of the second a tag.)
and eventually the following would be good..
elem.length #=3D> 12
elem.end #=3D> 21
Any thoughts appreciated!
Henners

My first thought is: Why do you want that information? Character
position is meaningless in an XML and HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

-- Mark.
 
M

Mark Thomas

Hi Mark,

I'm writing a broken link reporting type tool. When I find a dodgy tag
I'd like to be able to relay the character position and or line number
to the user. Useful for debugging.

So, are you really interested in broken *links* (as in a GET does not
return a 200 result code) or broken HTML? I have done the former via
AJAX (jQuery sends links to a backend rails action, and if it is
broken changes the class of the link to display a red background). The
latter may be able to be done with libxml, which reports the character
position of broken input.

-- Mark.
 
H

henryturnerlists

Well, I suppose there are incorrectly formatted links too... I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you're
talking about its syntax checking bit.

Since the entire document is accessible from the Hpricot::Elem it
seems plausible to count the characters up to and after the element. A
15min look at the source didn't reveal anything obvious.. Have a nasty
feeling that this type of thing would have to be done in the compiled
C section of it..
 
M

Mark Thomas

Well, I suppose there are incorrectly formatted links too... I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you're
talking about its syntax checking bit.

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Here's some starter code:

#----------------------------------------------

require 'rubygems'
require 'xml'

XML::parser.default_line_numbers = true

html = <<END_HTML
<html>
<head><title>test</title></head>
<body>
Here is a <a href="http://brok.en">broken link.</a>
</body>
</html>
END_HTML

parser = XML::parser.string html
doc = parser.parse

def broken?(link)
true
end

doc.find("//a[@href]").each do |link|
if broken?(link)
puts "Broken link to #{link['href']} on line #{link.line_num}"
end
end
 
M

Mark Thomas

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Oops, my example mistakenly used the XML parser, so replace that with
XML::HTMLparser since you are parsing HTML.

-- Mark.
 
H

henryturnerlists

Thanks for the hint towards to libxml-ruby! I didn't even know it
existed. Can't see anything for character position but very happy
indeed. Will have a go at implementing it myself when poss..

cheers
-h
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,020
Latest member
GenesisGai

Latest Threads

Top