Hpricot - Trying to do a few things...are they possible?

HH · Oct 2, 2006

I've been messing with Hpricot and I'm trying to do a few things that
apparently aren't documented or available as part of Hpricot. Can
someone verify the following...

1) Is there a simple way to determine the element's current path /
location? For example, if I find a text node, is there a simple way to
determine the path of that text node so I can find it again later using
that path / location as a parameter to the search method? I assume I
can use the parent method to find the parent and recurse through until
I get to the root node...is there an easier way?

2) Is there a simple way to find all elements with non-empty text
nodes? It appears that Hpricot is focused on providing methods for
finding something if you know the element tag / attributes / classes /
etc. I've been using traverse_text which requires going through every
text node and filtering out the ones that are empty / whitespace. Is
there an easier way to find all elements with non-empty text nodes?

This is in reference to parsing HTML pages which may or may not be
well-formed.

All in all - I really like Hpricot. I was using REXML and tidy before,
but this is a lot simpler and faster!

Thanks to _why the lucky stiff for a great little HTML parser...

Gregory Seidman · Oct 2, 2006

On Mon, Oct 02, 2006 at 11:38:05PM +0900, HH wrote:
} I've been messing with Hpricot and I'm trying to do a few things that
} apparently aren't documented or available as part of Hpricot. Can
} someone verify the following...
}
} 1) Is there a simple way to determine the element's current path /
} location? For example, if I find a text node, is there a simple way to
} determine the path of that text node so I can find it again later using
} that path / location as a parameter to the search method? I assume I
} can use the parent method to find the parent and recurse through until
} I get to the root node...is there an easier way?

I have been using the recursive (well, iterative, actually) way. I suspect
that that is the way to do it since the tree structure is intentionally
simple and is designed to allow you to move nodes around arbitrarily.
Maintaining a node's path independent of its structural location is
inefficient at best and impossible at worst.
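
A minimal sketch of that iterative walk, for illustration only. The Node
struct and simple_path below are hypothetical stand-ins, not Hpricot classes
or methods; the only thing assumed about a real node is that it answers
parent (nil at the root) and has a name.

```ruby
# Hypothetical stand-in for a parsed node; NOT Hpricot's own class.
Node = Struct.new(:name, :parent, :children) do
  # Create a child with the given name, attach it, and return it.
  def add(name)
    child = Node.new(name, self, [])
    children << child
    child
  end
end

# Walk up through parent until there is none, collecting names root-first.
def simple_path(node)
  names = []
  while node
    names.unshift(node.name)
    node = node.parent
  end
  "/" + names.join("/")
end

html = Node.new("html", nil, [])
td = html.add("body").add("table").add("tr").add("td")
puts simple_path(td)   # prints /html/body/table/tr/td
```

The path is computed on demand rather than stored, so it stays correct after
nodes are moved around.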

} 2) Is there a simple way to find all elements with non-empty text
} nodes? It appears that Hpricot is focused on providing methods for
} finding something if you know the element tag / attributes / classes /
} etc. I've been using traverse_text which requires going through every
} text node and filtering out the ones that are empty / whitespace. Is
} there an easier way to find all elements with non-empty text nodes?

nodes = []
# Keep the parent of each text node that contains non-whitespace text.
doc.traverse_text { |t| nodes << t.parent if t.content.to_s =~ /\S/ }

} This is in reference to parsing HTML pages which may or may not be
} well-formed.

I've found Hpricot to be remarkably resilient in parsing questionable HTML.

} All in all - I really like Hpricot. I was using REXML and tidy before,
} but this is a lot simpler and faster!
}
} Thanks to _why the lucky stiff for a great little HTML parser...

I'll second that.
--Greg

HH · Oct 3, 2006

Gregory -- I appreciate your reply. I found it to be very helpful.

One more question that pertains to finding the path recursively...if
done this way, the path itself is not necessarily unique to the text
node. It's quite possible to have 2 different text nodes with the same
path (e.g. they are in the same table, but in different rows). The
only way to distinguish between text nodes would be to include some
sort of index to account for multiple children under the same parent or
other instances that would create a situation with the same path.

For example, you could have two text nodes that are in the first cell
of a table but in different rows with the path:

/html/body/table/tr/td

To truly select a particular text node, you would have to know the
index of the row in order to get to that text node:

/html/body/table/tr[1]/td
/html/body/table/tr[2]/td

I'm assuming there is no easy way to determine the index as well as the
path of a particular text node....
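
To make the idea concrete, here is a rough sketch of how such an index could
be worked out while walking up the tree. TreeNode, path_step and indexed_path
are hypothetical names invented for this illustration, not part of Hpricot;
the sketch only assumes each node knows its name, its parent and its parent's
children. An index is added only when a parent has more than one child with
the same name, and it is 1-based as in the paths above.

```ruby
# Hypothetical stand-in for a parsed element; NOT Hpricot's own class.
TreeNode = Struct.new(:name, :parent, :children) do
  # Create a child with the given name, attach it, and return it.
  def add(name)
    child = TreeNode.new(name, self, [])
    children << child
    child
  end
end

# One path step: the name, plus a 1-based index among same-named siblings
# when there is more than one of them.
def path_step(node)
  return node.name unless node.parent
  same = node.parent.children.select { |c| c.name == node.name }
  return node.name if same.size == 1
  position = same.index { |c| c.equal?(node) } + 1
  "#{node.name}[#{position}]"
end

# Walk up through parent, building the indexed path root-first.
def indexed_path(node)
  steps = []
  while node
    steps.unshift(path_step(node))
    node = node.parent
  end
  "/" + steps.join("/")
end

table = TreeNode.new("html", nil, []).add("body").add("table")
first_cell  = table.add("tr").add("td")
second_cell = table.add("tr").add("td")
puts indexed_path(first_cell)    # prints /html/body/table/tr[1]/td
puts indexed_path(second_cell)   # prints /html/body/table/tr[2]/td
```

Siblings are compared with equal? (object identity) so that two rows with
identical contents still get different indexes.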

Any ideas would be greatly appreciated.

Thanks again!
