Hpricot - strip out images and show first x words

Max Williams

Hey all

I have a bunch of HTML in p tags and I want to show a shortened summary -
like the first 25 words followed by '...'. I also want to strip out any
images. I'm using Hpricot for this and am finding myself writing a
convoluted, messy method. Can anyone show me a clean and simple way? Is
Hpricot even the right tool for the job?

Here's the hacky and not-really-to-spec mess I have so far.

def lede(word_count = 25)
  doc = self.hpricot_body
  # wipe images
  doc.search("img").remove
  paras = doc.search("//p")
  text = ""
  # append whole paragraphs until we have at least word_count words
  while paras.size > 0 && text.split(" ").size < word_count
    text += paras.shift.to_html
  end
  if (arr = text.split(" ")).size > word_count
    return arr[0, word_count].join(" ") + " ..."
  else
    return arr.join(" ")
  end
end

thanks
max
 
7stud --

Max said:
Hey all

I have a bunch of HTML in p tags and I want to show a shortened summary -
like the first 25 words followed by '...'. I also want to strip out any
images. I'm using Hpricot for this and am finding myself writing a
convoluted, messy method. Can anyone show me a clean and simple way? Is
Hpricot even the right tool for the job?


See if this helps:


require "rubygems"
require "hpricot"

html =<<ENDOFHTML
<html>
<body>
<div>hello: <img href="blah" /></div>

<p><b>first paragraph</b> <img href="blah" />of longer text</p>
<p><img href="blah" />second paragraph</p>

<div>bye: <img href="blah" /></div>
</body>
</html>
ENDOFHTML

doc = Hpricot(html)
results = []

paras = doc.search("//p")

paras.each do |para|
  para.search("img").remove
  text = para.inner_html

  if text.length <= 25
    results << text
  else
    results << "#{text[0, 25]}..."
  end
end

p results

["<b>first paragraph</b> of...", "second paragraph"]
 
7stud --

Whoops! You wanted 25 *words*:

require "rubygems"
require "hpricot"


html =<<ENDOFHTML
<html>
<body>
<div>hello: <img href="blah" /></div>

<p><b>first paragraph</b> <img href="blah" />of longer text
apple apple apple apple apple apple apple apple apple apple
pear pear pear pear pear pear pear pear pear pear pear pear
ball ball ball ball ball ball ball ball ball ball ball ball
</p>

<p><img href="blah" />second paragraph</p>

<div>bye: <img href="blah" /></div>
</body>
</html>
ENDOFHTML

doc = Hpricot(html)
results = []
max_words = 25

paras = doc.search("//p")

paras.each do |para|
  para.search("img").remove

  text = para.inner_html
  words = text.split

  if words.length <= max_words
    results << words.join(" ")
  else
    results << "#{words[0, max_words].join(" ")}..."
  end
end

p results
["<b>first paragraph</b> of longer text apple apple apple apple apple
apple apple apple apple apple pear pear pear pear pear pear pear pear
pear pear...", "second paragraph"]
 
Max Williams

Hi 7stud, thanks.

This seems to return the first 25 words of every paragraph? Maybe I'm
not returning the right thing from the method though... here's how I
wrapped up your code (in the lede2 method) and mine (in the lede method):

def lede(word_count = 25)
  doc = self.hpricot_body
  # wipe images
  doc.search("img").remove
  paras = doc.search("//p")
  text = ""
  # append whole paragraphs until we have at least word_count words
  while paras.size > 0 && text.split(" ").size < word_count
    text += paras.shift.to_html
  end
  if (arr = text.split(" ")).size > word_count
    return arr[0, word_count].join(" ") + " ..."
  else
    return arr.join(" ")
  end
end

def lede2(word_count = 25)
  doc = self.hpricot_body
  results = []
  paras = doc.search("//p")
  paras.each do |para|
    para.search("img").remove
    text = para.inner_html
    words = text.split
    if words.length <= word_count
      results << words.join(" ")
    else
      results << "#{words[0, word_count].join(" ")}..."
    end
  end
  results.collect { |text| "<p>#{text}</p>" }.join
end

And here are the results of calling each method on the same post. First
I show the HTML content (body_rendered) that the lede methods work on,
then the output of lede, then the output of lede2. Each value starts
with "=>".
=> "<p><i>From the <a
href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
class="caps">UKTI</span> blog</a>:</i></p>\n<p>Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting with the Foreign Secretary, <a
href="https://blogs.fco.gov.uk/roller/miliband/"><strong>David Milliband
MP</strong></a>, to discuss how government can support this important
sector. Over breakfast, the delegates, Mr Milliband, Foreign Office
Minister <strong>Gillian Merron</strong> and <strong>Sir Andrew
Cahn</strong> explored this key area for opportunities in wealth
creation, and to understand how the public sector can find new ways of
working using new media tools.</p>\n<p><span></span>The biggest
challenge raised by the companies was the apparent dearth of funding
opportunities for new start-ups in this economic climate and beyond.
There are few investors in the UK outside London, and most companies
seek funding from US sources.</p>\n<p>Admittedly, the digital sector has
a stronger ability to weather the storm than other market sectors; after
the bubble of 2000 burst, start-ups regrouped and re-emerged from the
ashes as <a
href="http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html">\342\200\230Web
2.0\342\200\262</a> in 2005, aiming to generate businesses with sturdy
and market-resilient plans that can face inclement weather. However,
delegates called for funds to bridge the gap between
\302\24310-\302\243250k, and to support training programmes for new
talent into the digital and creative industries.</p>\n<p>The Foreign
Secretary, who has a track record as a blogger and new media enthusiast,
also wanted to explore with his visitors the contribution new
technologies can make to diplomacy and international
problem-solving.</p>\n<p>British new media developers excel at building
social entrepreneurial applications, ensuring that their social
networking, social software (e.g., blogs) and other social systems
(e.g., search, data visualisation) work to ensure participation in the
community.</p>\n<p>New media leverages communities based on
commonalities, rather than proximity, encouraging participation on an
equal playing field. It has been crucial in breaking down international
and social barriers, exposing participants first-hand to news and news
sources, encouraging them to engage with people of different religions,
cultures and creeds, of different abilities and languages. It is
transforming the way our children learn, the way our teachers teach and
the way we do business. In short, it has put the person back into the
technology, lowering the barriers for knowledge and sharing.</p>\n<p>Yet
the challenge remains in opening up policy debates in such a way as they
mobilise the many-to-many networks which new media supports. Delegates
suggested using gaming technologies to break down boundaries of
participation and communication, and to open up public assets to
communities who may be able to offer better solutions than those that
have come before.</p>\n<p>The Business Breakfast was organised by the <a
href="http://twitter.com/londonukti"><span class="caps">ICT</span>
Sector Team</a> and is part of a series of meetings aimed at
facilitating abetter understanding between business and government on
key issues affecting businesses. Many thanks to all
involved!</p>\n<p><span style="text-decoration:underline;">The
Attendees</span><br />The attendees had been hand-picked to represent
the spectrum of digital services developed in the UK, from innovators in
social entrepreneurship and education to broadcasters and videogame
developers:</p>\n<p><a href="http://www.4ip.org.uk/">4iP</a><br /><a
href="http://www.4learning.co.uk/">Channel 4 Education</a><br /><a
href="http://www.chinwag.com/">Chinwag</a><br /><a
href="http://www.dopplr.com/"> Dopplr</a><br /><a
href="http://www.mindcandy.com/">Mind Candy</a><a
href="http://www.mysociety.org/"><br />MySociety</a><a
href="http://schoolofeverything.com/"><br />School of Everything</a><br
=> "<p><i>From the <a
href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
class="caps">UKTI</span> blog</a>:</i></p><p>Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting ..."

=> "<p><i>From the <a
href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
class="caps">UKTI</span> blog</a>:</i></p><p>Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting with the Foreign Secretary, <a...</p><p><span></span>The
biggest challenge raised by the companies was the apparent dearth of
funding opportunities for new start-ups in this economic climate and
beyond. There are...</p><p>Admittedly, the digital sector has a stronger
ability to weather the storm than other market sectors; after the bubble
of 2000 burst, start-ups regrouped and...</p><p>The Foreign Secretary,
who has a track record as a blogger and new media enthusiast, also
wanted to explore with his visitors the contribution
new...</p><p>British new media developers excel at building social
entrepreneurial applications, ensuring that their social networking,
social software (e.g., blogs) and other social systems (e.g.,
search,...</p><p>New media leverages communities based on commonalities,
rather than proximity, encouraging participation on an equal playing
field. It has been crucial in breaking down international...</p><p>Yet
the challenge remains in opening up policy debates in such a way as they
mobilise the many-to-many networks which new media supports. Delegates
suggested...</p><p>The Business Breakfast was organised by the <a
href="http://twitter.com/londonukti"><span class="caps">ICT</span>
Sector Team</a> and is part of a series of meetings aimed at
facilitating abetter understanding...</p><p><span
style="text-decoration:underline;">The Attendees</span><br />The
attendees had been hand-picked to represent the spectrum of digital
services developed in the UK, from innovators in social entrepreneurship
and...</p><p><a href="http://www.4ip.org.uk/">4iP</a><br /><a
href="http://www.4learning.co.uk/">Channel 4 Education</a><br /><a
href="http://www.chinwag.com/">Chinwag</a><br /><a
href="http://www.dopplr.com/"> Dopplr</a><br /><a
href="http://www.mindcandy.com/">Mind Candy</a><a
href="http://www.mysociety.org/"><br />MySociety</a><a
href="http://schoolofeverything.com/"><br />School of Everything</a><br
/><a href="http://www.ttgames.com/">TTGames</a><br /><a
href="http://unltdworld.com/">UnLtdWorld</a></p>"
 
7stud --

Max said:
Hi 7stud, thanks.

This seems to return the first 25 words of every paragraph?

Well, what does this say:
I have a bunch of HTML in p tags and I want to show a shortened
summary - like the first 25 words followed by '...'.

If you want relevant answers, then you have to post precise questions.

So now you just want **the first 25 words of your html**?

html =<<ENDOFHTML
<p><i>From the <a
href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
class="caps">UKTI</span> blog</a>:</i></p>\n<p>Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting with the Foreign Secretary, <a
href="https://blogs.fco.gov.uk/roller/miliband/"><strong>David Milliband
MP</strong></a>, to discuss how government can support this important
sector.
ENDOFHTML


max_words = 25

# strip <img> tags with a regex instead of parsing the document
no_images = html.gsub(/<\s*img.*?>/, "")
words = no_images.split

if words.length <= max_words
  puts words.join(" ")
else
  puts "#{words[0, max_words].join(" ")}..."
end

--output:--
<p><i>From the <a
href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
class="caps">UKTI</span> blog</a>:</i></p> <p>Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a...
 
Max Williams

7stud said:
Well, what does this say:
I guess that was a bit ambiguous, sorry. Anyway, I'm interested to see
that you don't think it's worth bothering with Hpricot in this case, and
would just use a regex - I was wondering if Hpricot was overkill myself.

One problem that just occurred to me is that treating the whole HTML
like text and taking the first 25 words will turn it into invalid HTML,
because we end up with start tags that have no matching end tags. So,
ideally I would preserve the start and end tags and strip the content
down to 25 words. That's why I used Hpricot initially.

Anyway, thanks for your help.
max
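
A minimal sketch of one way around the unbalanced-tag problem, if a plain-text summary is acceptable: throw the markup away first and only then count words. The method name plain_lede and the sample string are made up for illustration, and the tag-stripping regex is a rough assumption that ignores comments, scripts and entities.

```ruby
# Hypothetical helper: plain-text summary, so there are no tags left to unbalance.
def plain_lede(html, word_count = 25)
  # Replace every tag (including <img>) with a space so adjacent words stay apart.
  text = html.gsub(/<[^>]*>/, " ")
  words = text.split
  return words.join(" ") if words.size <= word_count
  words.first(word_count).join(" ") + " ..."
end

# Made-up sample markup, similar to the test HTML used earlier in the thread.
sample = '<p><b>first paragraph</b> <img src="blah" />of longer text</p>'
puts plain_lede(sample, 3)
```

The trade-off is that links and emphasis are lost, which may or may not matter for a teaser.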
 
7stud --

Max said:
I guess that was a bit ambiguous, sorry. Anyway, I'm interested to see
that you don't think it's worth bothering with Hpricot in this case, and
would just use a regex - I was wondering if Hpricot was overkill myself.

One problem that just occurred to me is that treating the whole HTML
like text and taking the first 25 words will turn it into invalid HTML,
because we end up with start tags that have no matching end tags.

I stopped using Hpricot because you said this was your desired output:
=> "<p><i>From the <a
href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
class="caps">UKTI</span> blog</a>:</i></p><p>Early on 20 January 2009,
nine British new media innovators and I arrived at the Foreign Office
for a meeting ..."
 
Max Williams

I didn't say it was the desired output, I just said that was what my
crappy version was currently doing :) Anyway, I've troubled you enough
- thanks for all your help.
 
7stud --

So, ideally I would preserve the start and end tags and
strip the content down to 25 words.

It's easy enough to slap a "</p>" on the end. But something else you
might not have considered is: what if the 25th word is inside a tag, for
instance:

<a href="blah"
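
To make that failure concrete, here is a small sketch. The sample string is hypothetical, not taken from the post above. It shows that splitting on whitespace counts tag fragments as words, so the cut can land between two attributes.

```ruby
# Hypothetical sample markup with a multi-attribute link.
html = '<p>Read the <a href="http://example.com/page" title="more info">full story</a> here</p>'

# Whitespace splitting treats "<a" and each attribute as separate "words".
words = html.split
naive = words[0, 4].join(" ")

puts naive
# The cut falls inside the <a ...> tag, so the fragment has more "<" than ">".
puts "unterminated tag" if naive.count("<") > naive.count(">")
```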
 
Max Williams

7stud said:
It's easy enough to slap a "</p>" on the end. But something else you
might not have considered is: what if the 25th word is inside a tag, for
instance:

<a href="blah"

Yeah, I know, it's a bit complicated, isn't it? Slapping a </p> on the
end isn't enough because there could be a load of unfinished tags - for
example, half a p tag with half an a tag inside it. Anything, really.
That's why ideally I would only count the inner content of tags and,
when it comes to tags inside tags, either remove them completely or
leave them as they are. This seems like such a common thing on the net
(a short section followed by 'read more', for example) that I
thought there would be an easy way to do it.
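
For what it's worth, the idea described above (count only the text, leave the tags alone, then close whatever is still open) can be sketched in plain Ruby without Hpricot. Everything here is an assumption for illustration: the name truncate_html, the VOID_TAGS list, the sample string, and above all the regex tokenizer, which is a rough stand-in for a real parser and will mishandle comments, scripts and a stray "<" in text.

```ruby
# Elements that never take a closing tag, so they are not tracked as "open".
VOID_TAGS = %w[br hr img input meta link].freeze

# Hypothetical helper (not part of Hpricot): keep the first `word_count`
# words of text, pass tags through unchanged, drop <img> tags, and close
# every tag that is still open at the cut point.
def truncate_html(html, word_count = 25, ellipsis = "...")
  out = +""
  open_tags = []
  remaining = word_count

  # Rough tokenizer (an assumption): a token is either one whole tag or a
  # run of text between tags, so a cut can never land inside a tag.
  html.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?("<")
      name = token[/\A<\/?\s*([a-zA-Z][a-zA-Z0-9]*)/, 1].to_s.downcase
      next if name == "img"                       # strip images
      if token.start_with?("</")
        idx = open_tags.rindex(name)
        open_tags.delete_at(idx) if idx           # matching close tag
      elsif !name.empty? && !VOID_TAGS.include?(name) && !token.end_with?("/>")
        open_tags.push(name)                      # remember it for closing
      end
      out << token
    else
      words = token.split
      if words.size <= remaining
        out << token
        remaining -= words.size
      else
        # Out of budget: keep what still fits, add the ellipsis, and stop.
        lead = token[/\A\s+/].to_s
        out << (lead + words.first(remaining).join(" ")).rstrip << ellipsis
        break
      end
    end
  end

  # Close whatever is still open, innermost first.
  out << open_tags.reverse.map { |name| "</#{name}>" }.join
end

# Made-up sample markup.
sample = '<p><b>first paragraph</b> <img src="blah" />of longer ' \
         '<a href="#x">text apple pear</a> ball</p><p>second paragraph</p>'
puts truncate_html(sample, 5)
# prints: <p><b>first paragraph</b> of longer <a href="#x">text...</a></p>
```

Note that words are counted per run of text, so punctuation that sits alone between two tags counts as a word of its own. A parser-based version would be more robust on messy markup.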
 
