Hpricot html parsing

Dhanasekaran Vivekanandhan · Dec 13, 2006

Hi all,
I have the following HTML fragment. I want the inner HTML of the <p>
tag that contains an <img> tag, not of the <p> tag without one.
In the example below I want the result "this is fun"; I don't want
the result to include "NO FUN".
How can I do this with Hpricot?

Example HTML fragment:
----------------------

<p>
this is fun
<img src="" class="dhans"/>
</p>
<p>
NO FUN
</p>

Thanks in advance,
Dhanasekaran

Peter Szinek · Dec 13, 2006

I did not quite get you. Do you want the text of the first <p> because it
has an image?
Or what is the exact criterion for accepting or rejecting <p>s?

Peter

__
http://www.rubyrailways.com

Dhanasekaran Vivekanandhan · Dec 13, 2006

Yes, I want the text of the first <p> because it
has an image, and I want to reject a <p> if it has no image.
Thanks,
Dhanasekaran

lrlebron · Dec 13, 2006

You can try something like this:

# p is assumed to be the Hpricot element for a single <p> tag
if p.search("img").length > 0
  puts p.inner_html
end

Peter Szinek · Dec 13, 2006

I see. Try this:
===============================================
require 'rubygems'
require 'hpricot'

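doc = Hpricot %q{
<p>
this is fun
<img src="" class="dhans"/>
</p>
<p>
NO FUN
</p>
<p>
fun again!
<img src=""/>
</p>
<p>
NO FUN AT ALL!
</p>
}

# all <p> elements in the document
paragraphs = doc/'p'

# keep only the paragraphs that contain at least one <img>
good_elems = paragraphs.reject { |elem| (elem/"img").empty? }
good_elems.each { |elem| puts elem.inner_text.strip }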
===============================================

output:

************
this is fun
fun again!
************

You will need Hpricot 0.4.84 because of inner_text. If you don't want
to install it (I did not experience any difficulties, so I can recommend
it), you will have to roll your own inner_text, but I guess that is not a
big problem.

Cheers,
Peter

David Vallner · Dec 13, 2006

Peter said:
paragraphs = doc/'p'

good_elems = paragraphs.map.reject {|elem| ((elem/"img").empty?) }

Which once again makes me wish paragraphs = doc/'//p/text()'
worked. This could be doable if you asked Hpricot to provide you with
the REXML document (it's probably out of scope for the intendedly simple
XPath engine Hpricot uses natively), but unfortunately I can't for the
heck of it figure out how to make REXML accept the final /text(), even
though the parser claims to support XPath 1.0 except a few exceptions,
that one not being noted.

David Vallner
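
For comparison, here is a minimal sketch of the //p[img]/text() idea using
REXML on its own rather than Hpricot. It is only an illustration under some
assumptions: the fragment must be well-formed XML (REXML will not tolerate
sloppy HTML the way Hpricot does), and the <div> wrapper and the variable
names are made up for the example, not taken from the thread.

```ruby
require 'rexml/document'

# Illustrative, well-formed fragment; the <div> wrapper is an assumption
# added so the XML parser sees a single root element.
xml = <<~XML
  <div>
    <p>this is fun <img src="" class="dhans"/></p>
    <p>NO FUN</p>
    <p>fun again! <img src=""/></p>
    <p>NO FUN AT ALL!</p>
  </div>
XML

doc = REXML::Document.new(xml)

# //p[img] keeps only paragraphs that have an <img> child;
# /text() then selects the direct text nodes of those paragraphs.
texts = REXML::XPath.match(doc, '//p[img]/text()')
                    .map { |node| node.value.strip }
                    .reject(&:empty?)

puts texts
```

The predicate does the accept/reject step that the reject block does in
Peter's version, so no separate filtering pass is needed.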


ruby talk · Dec 15, 2006

Ask:

http://code.whytheluckystiff.net/hpricot/ticket/32

text in xpath should return a text node if present. For example:
(doc/"/html/body/div[1]/*/table[0]/tr[0]/*/b[9]/text")

Currently I am using the search and next_node:

doc.search("/html/body/div[1]/*/table[0]/tr[0]/td/b") { |x|
  @movie_plot = x.next_node.to_s.strip if x.inner_html == "Plot Outline:"
}

And receive

Author:
why
Message:

* lib/hpricot/elements.rb: added support for selecting text
nodes with text(): //p/text(), //p[a]//text(), etc.
* lib/hpricot/traverse.rb: ditto.
* lib/hpricot/tag.rb: the pathname method reports the path
fragment needed to get to this node.
* lib/hpricot/parse.rb: handle possible empty processing instruction.
http://code.whytheluckystiff.net/hpricot/changeset/87


David Vallner · Dec 16, 2006

ruby said:
Ask:

http://code.whytheluckystiff.net/hpricot/ticket/32

text in xpath should return a text node if present. For example:
(doc/"/html/body/div[1]/*/table[0]/tr[0]/*/b[9]/text")

Well, it's 'text()' not 'text'. Luckily _why noticed.

* lib/hpricot/elements.rb: added support for selecting text
nodes with text(): //p/text(), //p[a]//text(), etc.

W00t ;P

Thanks for pointing this out.

David Vallner

PS: Your email address name confuses the heck out of me. Please use
something that doesn't cause a mental namespace clash?


ruby talk · Dec 16, 2006

Sorry, I have been archiving ruby talk at (e-mail address removed) since 10/14/04.

Stephen Becker IV
