Hpricot elem index/position

henryturnerlists · Oct 6, 2008

Hey,

Trying to find the String index of an Hpricot::Elem within its doc.
For example..

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 ( the first '<' of the second a tag.)

and eventually the following would be good..

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners

Mark Thomas · Oct 6, 2008

My first thought is: Why do you want that information? Character
position is meaningless in an XML and HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

-- Mark.

henryturnerlists · Oct 7, 2008

Hi Mark,

I'm writing a broken-link reporting tool. When I find a dodgy tag,
I'd like to be able to relay the character position and/or line number
to the user. Useful for debugging.

Thanks -h

Mark Thomas · Oct 7, 2008

So, are you really interested in broken *links* (as in a GET does not
return a 200 result code) or broken HTML? I have done the former via
AJAX (jQuery sends each link to a backend Rails action, and if the link
is broken, changes its class to display a red background). The
latter may be possible with libxml, which reports the character
position of broken input.

-- Mark.

henryturnerlists · Oct 7, 2008

Well, I suppose there are incorrectly formatted links too... I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you're
talking about its syntax checking bit.

Since the entire document is accessible from the Hpricot::Elem it
seems plausible to count the characters up to and after the element. A
15min look at the source didn't reveal anything obvious.. Have a nasty
feeling that this type of thing would have to be done in the compiled
C section of it..
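
One way to approximate this without touching Hpricot's internals is to search the raw source string rather than the parsed document. The following is only a minimal sketch under that assumption: `tag_positions` is a hypothetical helper (not part of Hpricot), it relies on a regular expression instead of a real parser, and it assumes the tag is never nested inside itself.

```ruby
# Hypothetical helper, not part of Hpricot: find each <tag>...</tag> span in
# the raw source text and report where it sits in the string.
def tag_positions(source, tag_name)
  name = Regexp.escape(tag_name)
  # Non-greedy match from an opening tag to the next closing tag.
  # Assumes the tag is not nested inside itself (true for <a> in valid HTML).
  pattern = /<#{name}\b[^>]*>.*?<\/#{name}\s*>/mi
  positions = []
  source.scan(pattern) do
    match = Regexp.last_match
    start = match.begin(0)
    length = match[0].length
    positions << {
      start: start,                               # index of the opening '<'
      length: length,                             # characters in the whole element
      end: start + length - 1,                    # index of the closing '>'
      line: source[0...start].count("\n") + 1     # 1-based line number
    }
  end
  positions
end

source = "<a>bob</a><a>james</a><a>dan</a>"
second = tag_positions(source, "a")[1]
puts "start=#{second[:start]} length=#{second[:length]} end=#{second[:end]} line=#{second[:line]}"
# prints: start=10 length=12 end=21 line=1
```

Because it works on the text as written, the offsets match what a user sees in an editor, but the sketch can be fooled by markup inside comments or scripts.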

Mark Thomas · Oct 7, 2008

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Here's some starter code:

#----------------------------------------------

require 'rubygems'
require 'xml'

XML::Parser.default_line_numbers = true

html = <<END_HTML
<html>
<head><title>test</title></head>
<body>
Here is a <a href="http://brok.en">broken link.</a>
</body>
</html>
END_HTML

parser = XML::Parser.string(html)
doc = parser.parse

# Stub: replace with a real check of the link target.
def broken?(link)
  true
end

doc.find("//a[@href]").each do |link|
  if broken?(link)
    puts "Broken link to #{link['href']} on line #{link.line_num}"
  end
end
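
The `broken?` stub in the starter code always returns true. A possible way to fill it in, using only Ruby's standard library, is sketched below. It is an assumption-laden illustration rather than a definitive check: it takes a URL string (so it would be called with `link['href']`), treats a status of 400 or above as broken, sends a HEAD request with 5-second timeouts chosen arbitrarily, and skips anything that is not an absolute http or https URL.

```ruby
require 'net/http'
require 'uri'

# A status of 400 or above counts as broken (assumption: redirects are fine).
def broken_status?(code)
  code.to_i >= 400
end

# Hypothetical replacement for the stub; takes the href as a string.
def broken?(url)
  uri = URI.parse(url)
  # Relative links, mailto: and similar cannot be checked here, so skip them.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri)
  end
  broken_status?(response.code)
rescue StandardError
  # Unparseable URLs, DNS failures, refused connections and timeouts
  # are all reported as broken in this sketch.
  true
end
```

Some servers reject HEAD requests, so falling back to GET when HEAD fails may give fewer false positives.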

Mark Thomas · Oct 8, 2008

Oops, my example mistakenly used the XML parser, so replace that with
XML::HTMLParser since you are parsing HTML.

-- Mark.

henryturnerlists · Oct 8, 2008

Thanks for the pointer to libxml-ruby! I didn't even know it
existed. Can't see anything for character position, but very happy
indeed. Will have a go at implementing it myself when poss..

cheers
-h
