Screen scraping HTML text contents into a file

Discussion in 'Ruby' started by basi, Dec 6, 2005.

  1. basi

    basi Guest

    Hello,

    I'm looking for a screen scraper that will extract text contents off
    html pages and save the text into files. I have looked at Mechanize and
    Rubyful_Soup, but they are a bit over my head to modify to save just
    the text contents to a file. (I'm a researcher trying to use Ruby for
    real world text analysis tasks, and trying to learn Ruby at the same
    time.) The levels of usage I'd love to have (choosy beggar):

    > program prompts me for url address to scrape and file name to save texts into,
    > or edit program to enter url address and file name


    Of course a program that, given a url, would walk down the links, open
    the pages, and save the text contents to a file would be ... that would
    be a commercial product. Is there one?

    Thanks!
    basi
     
    basi, Dec 6, 2005
    #1

  2. Lou Vanek

    Lou Vanek Guest

    You might get away with just using curl:

    curl www.apple.com > mytextfile

    or wget, which is capable of acting recursively on an entire site.
    http://www.delorie.com/gnu/docs/wget/wget_14.html




    basi wrote:

    > I'm looking for a screen scraper that will extract text contents off
    > html pages and save the text into files. [...] Is there one?
     
    Lou Vanek, Dec 6, 2005
    #2

  3. Gene Tani

    Gene Tani Guest

    basi wrote:

    > I'm looking for a screen scraper that will extract text contents off
    > html pages and save the text into files. [...] Is there one?


    I think open-uri and Rubyful_soup are pretty straightforward. I like
    this one; it shows open-uri vs. Net::HTTP:
    http://www.zenspider.com/dl/rubyconf2005/open-uri.pdf

    There are commercial website downloaders that will follow every link in
    every page, hit the server hundreds of times in a few seconds, and get
    your IP blacklisted pretty quickly (so run them from Starbucks
    wireless). Look in O'Reilly's Spidering Hacks for the right way to do
    it; the (Perl) examples are straightforward.
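
    For the first level of usage, a minimal open-uri sketch (the URL and
    filename are just placeholders, and note this saves the raw tagged
    HTML, not a rendered text version):

    require 'open-uri'

    # Fetch a page and write the raw response body to a file.
    url = "http://www.ruby-lang.org/"
    open(url) do |page|
      File.open("page.html", "w") { |f| f.write(page.read) }
    end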
     
    Gene Tani, Dec 6, 2005
    #3
  4. On 06/12/05, basi <> wrote:

    > I'm looking for a screen scraper that will extract text contents off
    > html pages and save the text into files. [...] Is there one?

    try this:

    $ w3m -dump www.ruby-lang.org

    cheers,

    Brian

    --
    http://ruby.brian-schroeder.de/

    Stringed instrument chords: http://chordlist.brian-schroeder.de/
     
    Brian Schröder, Dec 6, 2005
    #4
  5. basi wrote:

    > Of course a program that, given a url, would walk down the links, open
    > the pages, and save the text contents to a file would be ... that would
    > be a commercial product. Is there one?

    No need for a commercial product. wget does all that.

     
    Edward Faulkner, Dec 6, 2005
    #5
  6. basi

    basi Guest

    Hi,
    Thanks for the info on wget and curl. Both are powerful page
    downloaders. The downloaded pages are still "tagged". I need to find a
    way to "run" the pages and capture only the text display.
    Thanks again.
    basi
     
    basi, Dec 6, 2005
    #6
  7. Joe Van Dyk

    Joe Van Dyk Guest

    On 12/6/05, basi <> wrote:
    > Hi,
    > Thanks for the info on wget and curl. Both are powerful page
    > downloaders. The downloaded pages are still "tagged". I need to find a
    > way to "run" the pages and capture only the text display.


    $ lynx -dump www.rubystuff.com

    Ruby Stuff. The Ruby Store for Ruby programmers.

    Got the Right Stuff?

    The Ruby Stuff store is one-stop shopping for assorted Ruby hacker
    goodness.

    T-shirts, hats, coffee mugs, clocks, mouse pads, and more.

    Shirts

    Ruby Stuff offers a nifty variety of stunning T-shirts and jerseys for
    men and women. You'll feel naked without one.

    [1]More ...

    Coffee Mugs

    Hackers + caffeine = max coding pleasure. Drink your beverage of
    choice from one of these mugs.

    [2]More ...

    RubyStuff now has Stamps!

    These first-class U. S. postage stamps won't make the mail go any
    faster, but they are sure to raise an eyebrow.

    [3]More ...

    Stuff

    There's yet more Ruby stuff: Knock-out clocks; mighty mouse pads,
    handsome hats.

    [4]Mouse pads, [5]bags, [6]undies, [7]hats, [8]buttons, [9]more

    [10]About RubyStuff.com ...

    References

    1. http://www.rubystuff.com/shirts.html
    2. http://www.rubystuff.com/mugs.html
    3. http://www.rubystuff.com/stamps.html
    4. http://www.rubystuff.com/mousepads.html
    5. http://www.rubystuff.com/bags.html
    6. http://www.rubystuff.com/undies.html
    7. http://www.rubystuff.com/hats.html
    8. http://www.rubystuff.com/buttons_and_magnets.html
    9. http://www.rubystuff.com/assorted.html
    10. http://www.rubystuff.com/about.html
     
    Joe Van Dyk, Dec 6, 2005
    #7
  8. basi

    basi Guest

    Can't find w3m binaries for Windows XP. I'll continue to look.
    Thanks,
    basi
     
    basi, Dec 6, 2005
    #8
  9. basi

    basi Guest

    Hello,
    Can't find Windows XP binaries of w3m or snarf; also tried cURL and
    wget, but lynx does look like it renders the page close to what I'm
    looking for.
    Thanks to all who responded!
    basi
     
    basi, Dec 7, 2005
    #9
  10. I will throw something like this together in Ruby over
    the next few days when I get some time and post it on
    RubyForge. I have already done this sort of stuff in
    Java and the concepts just really need a port. All we
    are looking at for basi's initial level of requirements is
    to send an HTTP GET and pipe the response to a file.

    Link following is a little more tricky, since you need
    to parse the HTML, issue a GET, pipe the file, rinse
    and repeat, but again not exactly rocket science.
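
    For what it's worth, a rough sketch of that (assuming open-uri; the
    regex link extraction is only illustrative, a real spider would parse
    the HTML properly):

    require 'open-uri'

    # Fetch a page, pipe the response body to a file, and return any
    # absolute links found in it -- rinse and repeat on those.
    def fetch_and_save(url, filename)
      html = open(url) { |page| page.read }
      File.open(filename, "w") { |f| f.write(html) }
      html.scan(/href="(http[^"]+)"/).flatten
    end

    links = fetch_and_save("http://www.ruby-lang.org/", "page1.html")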

    rgds

    Steve

    --- basi <> wrote:

    > Can't find Windows XP binaries of w3m, snarf, also tried cUrl, wget,
    > but lynx does look like it renders the page close to what I'm looking
    > for.
     
    Steve Callaway, Dec 7, 2005
    #10
  11. Steve Callaway <> wrote:
    > All we are looking at for basi's initial level of requirements is
    > to send an HTTP GET and pipe the response to a file.


    Nope, according to the OP's requirements, you also need to render the
    html and spit out the rendered version as text, which makes lynx --dump
    the right tool for the job. It'd be quite a big task to duplicate this
    in ruby, I think.

    martin
     
    Martin DeMello, Dec 7, 2005
    #11
  12. --- Martin DeMello <> wrote:

    > Nope, according to the OP's requirements, you also need to render the
    > html and spit out the rendered version as text, which makes lynx --dump
    > the right tool for the job. It'd be quite a big task to duplicate this
    > in ruby, I think.

    By rendering the html, my interpretation of this was
    that it is merely a question of stripping tags etc,
    which can quickly be accomplished with gsub. Or am I
    missing something?
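
    For illustration, something like this naive one-liner is what I mean
    (a sketch only; it just deletes anything that looks like a tag):

    require 'open-uri'

    html = open("http://www.ruby-lang.org/") { |page| page.read }
    # Strip everything that looks like a tag and keep the rest.
    text = html.gsub(/<[^>]*>/, '')
    File.open("page.txt", "w") { |f| f.write(text) }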

    rgds

    Steve


     
    Steve Callaway, Dec 7, 2005
    #12
  13. On 07/12/05, Steve Callaway <> wrote:

    > By rendering the html, my interpretation of this was
    > that it is merely a question of stripping tags etc,
    > which can quickly be accomplished with gsub. Or am I
    > missing something?

    E.g. tables and frames. So better to use links2 or w3m for the task.

    cheers,

    Brian



    --
    http://ruby.brian-schroeder.de/

    Stringed instrument chords: http://chordlist.brian-schroeder.de/
     
    Brian Schröder, Dec 7, 2005
    #13
  14. Ah, yeah, forgot all about those nasty little things.
    Not insuperable but would certainly add an overhead to
    handle them effectively.

    Steve

    --- Brian Schröder <> wrote:

    > E.g. tables and frames. So better to use links2 or w3m for the task.
     
    Steve Callaway, Dec 7, 2005
    #14
  15. Steve Callaway <> wrote:

    > By rendering the html, my interpretation of this was
    > that it is merely a question of stripping tags etc,
    > which can quickly be accomplished with gsub. Or am I
    > missing something?


    Even without things like tables, the significance of various whitespace
    elements (space, tab, newline) in html is very different from its
    significance in the rendered page. For example, the following can't be
    done by just stripping tags:

    <ul><li>one
    two</li><li>three<li>four</ul>
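
    To make that concrete, a tag-stripping gsub gives (irb sketch):

    irb> "<ul><li>one\ntwo</li><li>three<li>four</ul>".gsub(/<[^>]*>/, '')
    => "one\ntwothreefour"

    whereas a renderer like lynx produces something closer to:

    * one two
    * three
    * four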

    martin
     
    Martin DeMello, Dec 7, 2005
    #15
  16. WWW::Mechanize can do most of what is needed, except for the dumping
    of the HTML as text. As others have said, what we really need is some
    kind of HTML-to-text renderer. There has got to be gobs of C or C++
    code out there that does this; how hard would it be to make a Ruby C
    extension for it? Has anyone ever thought about making a nice Ruby
    extension for Gecko, or even the HTML renderers in lynx or w3m?
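
    In the meantime, shelling out to one of those renderers covers basi's
    first level of usage. A sketch, assuming the lynx binary is on the
    PATH:

    # Prompt for a URL and an output file, then let lynx -dump do the
    # HTML-to-text rendering.
    print "URL to scrape: "
    url = gets.chomp
    print "File to save text into: "
    file = gets.chomp

    text = `lynx -dump #{url}`
    File.open(file, "w") { |f| f.write(text) }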

    Ryan
     
    Ryan Leavengood, Dec 7, 2005
    #16
