Hpricot - strip out images and show first x words

Discussion in 'Ruby' started by Max Williams, Jul 19, 2009.

  1. Max Williams

    Max Williams Guest

    Hey all

    I have a bunch of HTML in p tags and I want to show a shortened summary -
    like the first 25 words followed by '...'. I also want to strip out any
    images. I'm using Hpricot for this and am finding myself writing a
    convoluted, messy method. Can anyone show me a clean and simple way? Is
    Hpricot even the right tool for the job?

    Here's the hacky and not-really-to-spec mess I have so far.

    def lede(word_count = 25)
      doc = self.hpricot_body
      # wipe images
      doc.search("img").remove
      paras = doc.search("//p")
      text = ""
      # keep appending whole paragraphs until there are enough words
      while paras.size > 0 && text.split(" ").size < word_count
        text += paras.shift.to_html
      end
      # compare against word_count rather than a hard-coded 25
      if (arr = text.split(" ")).size > word_count
        return arr[0...word_count].join(" ") + " ..."
      else
        return arr.join(" ")
      end
    end

    thanks
    max
    --
    Posted via http://www.ruby-forum.com/.
    Max Williams, Jul 19, 2009
    #1

  2. 7stud --

    7stud -- Guest

    Max Williams wrote:
    > Hey all
    >
    > I have a bunch of HTML in p tags and I want to show a shortened summary -
    > like the first 25 words followed by '...'. I also want to strip out any
    > images. I'm using Hpricot for this and am finding myself writing a
    > convoluted, messy method. Can anyone show me a clean and simple way? Is
    > Hpricot even the right tool for the job?
    >



    See if this helps:


    require "rubygems"
    require "hpricot"

    html =<<ENDOFHTML
    <html>
    <body>
    <div>hello: <img href="blah" /></div>

    <p><b>first paragraph</b> <img href="blah" />of longer text</p>
    <p><img href="blah" />second paragraph</p>

    <div>bye: <img href="blah" /></div>
    </body>
    </html>
    ENDOFHTML

    doc = Hpricot(html)
    results = []

    paras = doc.search("//p")

    paras.each do |para|
      para.search("img").remove
      text = para.inner_html

      if text.length <= 25
        results << text
      else
        results << "#{text[0, 25]}..."
      end
    end

    p results

    --output:--
    ["<b>first paragraph</b> of...", "second paragraph"]






    7stud --, Jul 19, 2009
    #2

  3. 7stud --

    7stud -- Guest

    Whoops! You wanted 25 *words*:

    require "rubygems"
    require "hpricot"


    html =<<ENDOFHTML
    <html>
    <body>
    <div>hello: <img href="blah" /></div>

    <p><b>first paragraph</b> <img href="blah" />of longer text
    apple apple apple apple apple apple apple apple apple apple
    pear pear pear pear pear pear pear pear pear pear pear pear
    ball ball ball ball ball ball ball ball ball ball ball ball
    </p>

    <p><img href="blah" />second paragraph</p>

    <div>bye: <img href="blah" /></div>
    </body>
    </html>
    ENDOFHTML

    doc = Hpricot(html)
    results = []
    max_words = 25

    paras = doc.search("//p")

    paras.each do |para|
      para.search("img").remove

      text = para.inner_html
      words = text.split

      if words.length <= max_words
        results << words.join(" ")
      else
        # use max_words rather than a hard-coded 25
        results << "#{words[0, max_words].join(" ")}..."
      end
    end

    p results

    --output:--
    ["<b>first paragraph</b> of longer text apple apple apple apple apple
    apple apple apple apple apple pear pear pear pear pear pear pear pear
    pear pear...", "second paragraph"]











    7stud --, Jul 19, 2009
    #3
  4. Max Williams

    Max Williams Guest

    Hi 7stud, thanks.

    This seems to return the first 25 words of every paragraph? Maybe I'm
    not returning the right thing from the method, though. Here's how I
    wrapped up your code (in the lede2 method) and mine (in the lede method):

    def lede(word_count = 25)
      doc = self.hpricot_body
      # wipe images
      doc.search("img").remove
      paras = doc.search("//p")
      text = ""
      while paras.size > 0 && text.split(" ").size < word_count
        text += paras.shift.to_html
      end
      # compare against word_count rather than a hard-coded 25
      if (arr = text.split(" ")).size > word_count
        return arr[0...word_count].join(" ") + " ..."
      else
        return arr.join(" ")
      end
    end

    def lede2(word_count = 25)
      doc = self.hpricot_body
      results = []
      paras = doc.search("//p")
      # truncates each paragraph separately, so every <p> survives
      paras.each do |para|
        para.search("img").remove
        text = para.inner_html
        words = text.split
        if words.length <= word_count
          results << words.join(" ")
        else
          results << "#{words[0, word_count].join(" ")}..."
        end
      end
      results.collect { |text| "<p>#{text}</p>" }.join
    end

    And here are the results of calling each method on the same post. First
    I show the HTML content (body_rendered) that we use in the lede methods.

    >> post.body_rendered

    => "<p><i>From the <a
    href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
    class="caps">UKTI</span> blog</a>:</i></p>\n<p>Early on 20 January 2009,
    nine British new media innovators and I arrived at the Foreign Office
    for a meeting with the Foreign Secretary, <a
    href="https://blogs.fco.gov.uk/roller/miliband/"><strong>David Milliband
    MP</strong></a>, to discuss how government can support this important
    sector. Over breakfast, the delegates, Mr Milliband, Foreign Office
    Minister <strong>Gillian Merron</strong> and <strong>Sir Andrew
    Cahn</strong> explored this key area for opportunities in wealth
    creation, and to understand how the public sector can find new ways of
    working using new media tools.</p>\n<p><span></span>The biggest
    challenge raised by the companies was the apparent dearth of funding
    opportunities for new start-ups in this economic climate and beyond.
    There are few investors in the UK outside London, and most companies
    seek funding from US sources.</p>\n<p>Admittedly, the digital sector has
    a stronger ability to weather the storm than other market sectors; after
    the bubble of 2000 burst, start-ups regrouped and re-emerged from the
    ashes as <a
    href="http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html">\342\200\230Web
    2.0\342\200\262</a> in 2005, aiming to generate businesses with sturdy
    and market-resilient plans that can face inclement weather. However,
    delegates called for funds to bridge the gap between
    \302\24310-\302\243250k, and to support training programmes for new
    talent into the digital and creative industries.</p>\n<p>The Foreign
    Secretary, who has a track record as a blogger and new media enthusiast,
    also wanted to explore with his visitors the contribution new
    technologies can make to diplomacy and international
    problem-solving.</p>\n<p>British new media developers excel at building
    social entrepreneurial applications, ensuring that their social
    networking, social software (e.g., blogs) and other social systems
    (e.g., search, data visualisation) work to ensure participation in the
    community.</p>\n<p>New media leverages communities based on
    commonalities, rather than proximity, encouraging participation on an
    equal playing field. It has been crucial in breaking down international
    and social barriers, exposing participants first-hand to news and news
    sources, encouraging them to engage with people of different religions,
    cultures and creeds, of different abilities and languages. It is
    transforming the way our children learn, the way our teachers teach and
    the way we do business. In short, it has put the person back into the
    technology, lowering the barriers for knowledge and sharing.</p>\n<p>Yet
    the challenge remains in opening up policy debates in such a way as they
    mobilise the many-to-many networks which new media supports. Delegates
    suggested using gaming technologies to break down boundaries of
    participation and communication, and to open up public assets to
    communities who may be able to offer better solutions than those that
    have come before.</p>\n<p>The Business Breakfast was organised by the <a
    href="http://twitter.com/londonukti"><span class="caps">ICT</span>
    Sector Team</a> and is part of a series of meetings aimed at
    facilitating abetter understanding between business and government on
    key issues affecting businesses. Many thanks to all
    involved!</p>\n<p><span style="text-decoration:underline;">The
    Attendees</span><br />The attendees had been hand-picked to represent
    the spectrum of digital services developed in the UK, from innovators in
    social entrepreneurship and education to broadcasters and videogame
    developers:</p>\n<p><a href="http://www.4ip.org.uk/">4iP</a><br /><a
    href="http://www.4learning.co.uk/">Channel 4 Education</a><br /><a
    href="http://www.chinwag.com/">Chinwag</a><br /><a
    href="http://www.dopplr.com/"> Dopplr</a><br /><a
    href="http://www.mindcandy.com/">Mind Candy</a><a
    href="http://www.mysociety.org/"><br />MySociety</a><a
    href="http://schoolofeverything.com/"><br />School of Everything</a><br
    /><a href="http://www.ttgames.com/">TTGames</a><br /><a
    href="http://unltdworld.com/">UnLtdWorld</a></p>"
    >> post.lede

    => "<p><i>From the <a
    href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
    class="caps">UKTI</span> blog</a>:</i></p><p>Early on 20 January 2009,
    nine British new media innovators and I arrived at the Foreign Office
    for a meeting ..."
    >> post.lede2

    => "<p><i>From the <a
    href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
    class="caps">UKTI</span> blog</a>:</i></p><p>Early on 20 January 2009,
    nine British new media innovators and I arrived at the Foreign Office
    for a meeting with the Foreign Secretary, <a...</p><p><span></span>The
    biggest challenge raised by the companies was the apparent dearth of
    funding opportunities for new start-ups in this economic climate and
    beyond. There are...</p><p>Admittedly, the digital sector has a stronger
    ability to weather the storm than other market sectors; after the bubble
    of 2000 burst, start-ups regrouped and...</p><p>The Foreign Secretary,
    who has a track record as a blogger and new media enthusiast, also
    wanted to explore with his visitors the contribution
    new...</p><p>British new media developers excel at building social
    entrepreneurial applications, ensuring that their social networking,
    social software (e.g., blogs) and other social systems (e.g.,
    search,...</p><p>New media leverages communities based on commonalities,
    rather than proximity, encouraging participation on an equal playing
    field. It has been crucial in breaking down international...</p><p>Yet
    the challenge remains in opening up policy debates in such a way as they
    mobilise the many-to-many networks which new media supports. Delegates
    suggested...</p><p>The Business Breakfast was organised by the <a
    href="http://twitter.com/londonukti"><span class="caps">ICT</span>
    Sector Team</a> and is part of a series of meetings aimed at
    facilitating abetter understanding...</p><p><span
    style="text-decoration:underline;">The Attendees</span><br />The
    attendees had been hand-picked to represent the spectrum of digital
    services developed in the UK, from innovators in social entrepreneurship
    and...</p><p><a href="http://www.4ip.org.uk/">4iP</a><br /><a
    href="http://www.4learning.co.uk/">Channel 4 Education</a><br /><a
    href="http://www.chinwag.com/">Chinwag</a><br /><a
    href="http://www.dopplr.com/"> Dopplr</a><br /><a
    href="http://www.mindcandy.com/">Mind Candy</a><a
    href="http://www.mysociety.org/"><br />MySociety</a><a
    href="http://schoolofeverything.com/"><br />School of Everything</a><br
    /><a href="http://www.ttgames.com/">TTGames</a><br /><a
    href="http://unltdworld.com/">UnLtdWorld</a></p>"
    Max Williams, Jul 19, 2009
    #4
  5. 7stud --

    7stud -- Guest

    Max Williams wrote:
    > Hi 7stud, thanks.
    >
    > This seems to return the first 25 words of every paragraph?


    Well, what does this say:

    > I have a bunch of HTML in p tags and I want to show a shortened
    > summary - like the first 25 words followed by '...'.


    If you want relevant answers, then you have to post precise questions.

    So now you just want **the first 25 words of your html**?

    html =<<ENDOFHTML
    <p><i>From the <a
    href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
    class="caps">UKTI</span> blog</a>:</i></p>\n<p>Early on 20 January 2009,
    nine British new media innovators and I arrived at the Foreign Office
    for a meeting with the Foreign Secretary, <a
    href="https://blogs.fco.gov.uk/roller/miliband/"><strong>David Milliband
    MP</strong></a>, to discuss how government can support this important
    sector.
    ENDOFHTML


    max_words = 25

    # strip <img> tags, then split the raw markup on whitespace
    no_images = html.gsub(/<\s*img.*?>/, "")
    words = no_images.split

    if words.length <= max_words
      puts words.join(" ")
    else
      puts "#{words[0, max_words].join(" ")}..."
    end

    --output:--
    <p><i>From the <a
    href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
    class="caps">UKTI</span> blog</a>:</i></p> <p>Early on 20 January 2009,
    nine British new media innovators and I arrived at the Foreign Office
    for a...




    7stud --, Jul 20, 2009
    #5
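
A side note on the image-stripping step, sketched under assumptions rather than taken from the thread: the pattern /<\s*img.*?>/ is case-sensitive, and because "." does not match a newline by default in Ruby, it would miss an <IMG> tag or an img tag that wraps onto a second line (as several <a tags in the sample markup do). The helper name strip_images is hypothetical, and the sketch still assumes that no attribute value contains a literal ">".

```ruby
# Hypothetical helper, not from the thread.
# [^>]* also matches newlines, and the i flag makes the match
# case-insensitive; \b stops the pattern matching tags such as <imgfoo>.
def strip_images(html)
  html.gsub(/<\s*img\b[^>]*>/i, "")
end

sample = "<p>Hello <IMG\nsrc=\"a.png\" /> world <img src=\"b.png\">!</p>"
puts strip_images(sample)
```

Whitespace around a removed tag is left alone, so a later split on whitespace behaves the same as in the post above.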
  6. Max Williams

    Max Williams Guest

    7stud -- wrote:

    > Well, what does this say:
    >
    >> I have a bunch of HTML in p tags and I want to show a shortened
    >> summary - like the first 25 words followed by '...'.

    I guess that was a bit ambiguous, sorry. Anyway, I'm interested to see
    that you don't think it's worth bothering with Hpricot in this case and
    would just use a regex - I was wondering if Hpricot was overkill myself.

    One problem that just occurred to me is that treating the whole HTML
    like text and taking the first 25 words will produce invalid HTML,
    because we end up with start tags that have no matching end tags. So,
    ideally I would preserve the start and end tags and strip the content
    down to 25 words. That's why I used Hpricot initially.

    Anyway, thanks for your help.
    max

    Max Williams, Jul 20, 2009
    #6
  7. 7stud --

    7stud -- Guest

    Max Williams wrote:
    > 7stud -- wrote:
    >
    >> Well, what does this say:
    >>
    >>> I have a bunch of HTML in p tags and I want to show a shortened
    >>> summary - like the first 25 words followed by '...'.

    > I guess that was a bit ambiguous, sorry. Anyway, I'm interested to see
    > that you don't think it's worth bothering with Hpricot in this case and
    > would just use a regex - I was wondering if Hpricot was overkill myself.
    >
    > One problem that just occurred to me is that treating the whole HTML
    > like text and taking the first 25 words will produce invalid HTML,
    > because we end up with start tags that have no matching end tags.
    >


    I stopped using Hpricot because you said this was your desired output:

    >> post.lede

    => "<p><i>From the <a
    href="http://blog.ukti.gov.uk/2009/01/26/business-breakfast-20-or-when-new-media-met-the-foreign-secretar/"><span
    class="caps">UKTI</span> blog</a>:</i></p><p>Early on 20 January 2009,
    nine British new media innovators and I arrived at the Foreign Office
    for a meeting ..."

    7stud --, Jul 20, 2009
    #7
  8. Max Williams

    Max Williams Guest

    I didn't say it was the desired output, I just said that was what my
    crappy version was currently doing :) Anyway, I've troubled you enough
    - thanks for all your help.
    Max Williams, Jul 20, 2009
    #8
  9. 7stud --

    7stud -- Guest

    > So, ideally I would preserve the start and end tags and
    > strip the content down to 25 words.


    It's easy enough to slap a "</p>" on the end. But something else you
    might not have considered is: what if the 25th word is inside a tag, for
    instance:

    <a href="blah"

    7stud --, Jul 20, 2009
    #9
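
One possible way around the mid-tag problem raised here, offered as a sketch and not as anything proposed in the thread: split the markup into whole tags and runs of text, count words only in the text, and close whatever elements are still open at the cut. This uses a plain regex tokenizer rather than Hpricot. The name truncate_html, its parameters, and the void-tag list are all hypothetical, and the sketch assumes reasonably well-formed markup with no ">" inside attribute values or comments.

```ruby
# Hypothetical helper, not from the thread.
def truncate_html(html, max_words = 25, ellipsis = "...")
  void_tags = %w[br hr input meta link]  # elements that never get a close tag
  out = []
  open_tags = []
  remaining = max_words

  # Each token is one complete tag or one run of text, so a tag is never
  # cut in half.
  html.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?("<")
      name = token[/\A<\s*\/?\s*([a-zA-Z0-9]+)/, 1].to_s.downcase
      next if name == "img"                       # drop images entirely
      if token.start_with?("</")
        idx = open_tags.rindex(name)
        open_tags.delete_at(idx) if idx           # matching element is closed
      elsif !name.empty? && !token.end_with?("/>") && !void_tags.include?(name)
        open_tags.push(name)                      # element is now open
      end
      out << token
    else
      words = token.split
      if words.size <= remaining
        out << token
        remaining -= words.size
      else
        # Keep the leading whitespace, the words that still fit, then stop.
        out << token[/\A\s*/] + words.first(remaining).join(" ") + ellipsis
        break
      end
    end
  end

  # Close anything still open, innermost element first.
  out.join + open_tags.reverse.map { |t| "</#{t}>" }.join
end

puts truncate_html('<p>One <b>two</b> three <img src="x.png" />four <a href="#">five six</a> seven</p>', 5)
# prints: <p>One <b>two</b> three four <a href="#">five...</a></p>
```

A word that is split by an inline tag is counted once per text run, so "blog</a>:" counts as two words here, whereas a plain whitespace split counts it as one.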
  10. Max Williams

    Max Williams Guest

    7stud -- wrote:
    >> So, ideally I would preserve the start and end tags and
    >> strip the content down to 25 words.

    >
    > It's easy enough to slap a "</p>" on the end. But something else you
    > might not have considered is: what if the 25th word is inside a tag, for
    > instance:
    >
    > <a href="blah"


    Yeah, I know, it's a bit complicated, isn't it? Slapping a </p> on the
    end isn't enough because there could be a load of unfinished tags - for
    example, half a p tag with half an a tag inside it. Anything, really.
    That's why ideally I would just consider the inner content of tags and,
    when it comes to tags inside tags, either remove them completely or
    leave them as they are. This seems like such a common thing on the net
    (a short section followed by 'read more', for example) that I
    thought there would be an easy way to do it.
    Max Williams, Jul 20, 2009
    #10
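
For the "remove them completely" option mentioned above, a minimal sketch (not from the thread) is to discard all markup and summarise the remaining text, which sidesteps unbalanced tags altogether. The name plain_text_lede is hypothetical. The sketch assumes no ">" appears inside attribute values, and it leaves HTML entities such as &amp; undecoded.

```ruby
# Hypothetical helper, not from the thread.
def plain_text_lede(html, max_words = 25)
  # Turn paragraph ends and line breaks into spaces so that words on
  # either side do not run together, then drop every remaining tag.
  text = html.gsub(/<\/p\s*>|<br\s*\/?>/i, " ").gsub(/<[^>]+>/, "")
  words = text.split
  return words.join(" ") if words.size <= max_words
  words.first(max_words).join(" ") + "..."
end

html = "<p><i>From the <a href=\"#\"><span class=\"caps\">UKTI</span> blog</a>:</i></p>\n" \
       "<p>Early on 20 January 2009, nine British new media innovators arrived.</p>"
puts plain_text_lede(html, 8)
# prints: From the UKTI blog: Early on 20 January...
```

The result is plain text, so it can be wrapped in a single <p> by the caller without any risk of unclosed elements.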
