Efficient way to rip html

Discussion in 'HTML' started by Arthur Rhodes, Oct 3, 2006.

  1. I'm building a web store and I have to create a large number of
    product descriptions. The distributors do not provide spec sheets
    or marketing materials to me in html format. Instead, they advise
    me to simply copy the descriptions from their web sites.

    The problem is that the descriptions I need to copy are embedded
    in complex pages, with nested tables, etc. Simply copying the
    page source doesn't seem to be that useful. I end up having to
    cut out lots of table code, etc., and usually make mistakes that
    are time consuming to figure out and fix.

    The other alternative is to copy the text and then recreate the HTML
    formatting from scratch.

    Is there an easier way?

    Right now, I'm just writing HTML by hand in a text editor. Would
    this be any easier if I used a web editor like Dreamweaver?
    Arthur Rhodes, Oct 3, 2006
    #1

  2. Arthur Rhodes

    Ben C Guest

    On 2006-10-03, Arthur Rhodes <> wrote:
    > I'm building a web store and I have to create a large number of
    > product descriptions. The distributors do not provide spec sheets
    > or marketing materials to me in html format. Instead, they advise
    > me to simply copy the descriptions from their web sites.
    >
    > The problem is that the descriptions I need to copy are embedded
    > in complex pages, with nested tables, etc. Simply copying the
    > page source doesn't seem to be that useful. I end up having to
    > cut out lots of table code, etc., and usually make mistakes that
    > are time consuming to figure out and fix.
    >
    > The other alternative is to copy the text and then recreating the html
    > formatting from scratch.
    >
    > Is there an easier way?


    Python, and Beautiful Soup.

    http://www.crummy.com/software/BeautifulSoup/
    Ben C, Oct 3, 2006
    #2

  3. In article <>,
    Ben C <> wrote:

    > On 2006-10-03, Arthur Rhodes <> wrote:
    > > I'm building a web store and I have to create a large number of
    > > product descriptions. The distributors do not provide spec sheets
    > > or marketing materials to me in html format. Instead, they advise
    > > me to simply copy the descriptions from their web sites.
    > >
    > > The problem is that the descriptions I need to copy are embedded
    > > in complex pages, with nested tables, etc. Simply copying the
    > > page source doesn't seem to be that useful. I end up having to
    > > cut out lots of table code, etc., and usually make mistakes that
    > > are time consuming to figure out and fix.
    > >
    > > The other alternative is to copy the text and then recreating the html
    > > formatting from scratch.
    > >
    > > Is there an easier way?

    >
    > Python, and Beautiful Soup.
    >
    > http://www.crummy.com/software/BeautifulSoup/


    Seconded. If you're willing to go the Python programming route, Connelly
    Barnes' htmldata might also prove helpful:
    http://oregonstate.edu/~barnesc/htmldata/

    Last but not least you could use command-line Spyce (HTML templates with
    the dynamic bits written in Python) to build your Web pages:
    http://spyce.sourceforge.net/

    Good luck

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
    Nikita the Spider, Oct 3, 2006
    #3
  4. Arthur Rhodes

    dorayme Guest

    In article <>,
    Arthur Rhodes <> wrote:

    > I'm building a web store and I have to create a large number of
    > product descriptions. The distributors do not provide spec sheets
    > or marketing materials to me in html format. Instead, they advise
    > me to simply copy the descriptions from their web sites.
    >
    > The problem is that the descriptions I need to copy are embedded
    > in complex pages, with nested tables, etc. Simply copying the
    > page source doesn't seem to be that useful. I end up having to
    > cut out lots of table code, etc., and usually make mistakes that
    > are time consuming to figure out and fix.
    >
    > The other alternative is to copy the text and then recreating the html
    > formatting from scratch.
    >
    > Is there an easier way?
    >
    > Right now, I'm just writing HTML by hand in a text editor. Would
    > this be any easier if I used a web editor like Dreamweaver?


    It depends on how well you know Dreamweaver (or any other
    software). I have a friend who would go this way and well. I
    would grab the product descriptions and work hard and use a text
    editor because it would take me less time. You are in the middle
    of a job. Can you risk finding out? If you know what you are
    doing with the text grabs, just do it and get it done and charge
    the client. As you get going, you will find it going quicker and
    quicker because you will be building patterns in your hand work.
    Products are products, and if they are all in tables to show off
    proper tabular specs, you will simply copy and paste a few table
    types you have constructed; most data will fit into one or other
    of them with small modifications.

    --
    dorayme
    dorayme, Oct 3, 2006
    #4
  5. Arthur Rhodes

    Guest

    On 3-Oct-2006, dorayme <> wrote:

    > Instead, they advise
    > me to simply copy the descriptions from their web sites.


    With images on websites you can usually right click, then
    copy, and then paste the gif or jpg into the folder (or,
    for some applications, into the application of your choice).

    Likewise with text: with a bit of practice you can right click
    then wipe to highlight, release, right click again on the
    highlighted text, copy, then paste (selecting the paste option
    Unformatted Text) or paste into Notepad, which reduces
    everything to unformatted text.
    Paste options depend on application, sometimes you have
    to start from Edit menu to find the Unformatted Text option.
    With Dreamweaver I think that there is an unformatted text
    option to paste long runs of text in the code window.
    But then I mostly edit/build in Wordpad because it opens
    and saves HTML without asking what format you want to
    save in.
    DW8 can produce non-validating code without warning you,
    though it has some merit in the early stages of design. With
    Wordpad I can save and immediately see the effect by
    refreshing the browser.
    Sometimes you can select tables or highlight cells, copy, and
    paste into Excel, which gives you further options for
    manipulating/parsing the data.
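
    If Excel is not to hand, the same table-to-spreadsheet step can be
    roughed out in Python. This is only a sketch using the standard
    library's html.parser and csv modules. The TableRows class, the
    table_to_csv helper and the sample markup are made up for
    illustration, and it assumes a single flat table with closed cells
    (nested tables would need more bookkeeping).

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table cell, row by row (flat tables only)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html):
    """Return the table's rows as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()


# Made-up sample markup standing in for a distributor's spec table.
sample = """
<table>
  <tr><th>Weight</th><td><b>2</b> kg</td></tr>
  <tr><th>Colour</th><td>Blue, red</td></tr>
</table>
"""
print(table_to_csv(sample))
```

    The CSV text can be saved to a file and opened in Excel or any
    other spreadsheet.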
    , Oct 4, 2006
    #5
  6. Arthur Rhodes

    mbstevens Guest

    On Tue, 03 Oct 2006 11:22:24 -0600, Arthur Rhodes wrote:

    > The problem is that the descriptions I need to copy are embedded in
    > complex pages, with nested tables, etc. Simply copying the page source
    > doesn't seem to be that useful. I end up having to cut out lots of table
    > code, etc., and usually make mistakes that are time consuming to figure
    > out and fix.



    Perl's HTML::Parser module will divide an HTML document into its various
    parts (including text) with just a few lines of code. In the more
    structured Python world, sgmllib, htmllib, or HTMLParser are the modules
    to look into.
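
    A minimal sketch of the Python route, assuming a current Python 3,
    where the HTMLParser class lives in the standard html.parser module
    (sgmllib and htmllib were removed in Python 3). The TextExtractor
    class, the extract_text helper and the sample markup are invented
    for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a document, skipping script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text runs of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample markup: the text is buried in table and font tags.
sample = """
<table><tr><td><font size="2"><b>Widget 3000</b></font></td></tr>
<tr><td>Weighs 2 kg.<script>track();</script></td></tr></table>
"""
print(extract_text(sample))
```

    This throws away all the markup, so it suits the "copy the text and
    rebuild the formatting" approach rather than keeping the original
    HTML.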
    --
    mbstevens
    http://www.mbstevens.com/
    mbstevens, Oct 4, 2006
    #6
  7. On Tue, 03 Oct 2006 13:25:02 -0500, Ben C wrote:

    >> Is there an easier way?

    >
    > Python, and Beautiful Soup.
    >
    > http://www.crummy.com/software/BeautifulSoup/


    Looks good. You don't know of any ready made gui for it,
    do you? I'm thinking it would be nice to have a tree
    pane representing the structure of the document, and when
    you click on a node a text pane shows the corresponding part
    of the document.
    Arthur Rhodes, Oct 4, 2006
    #7
  8. Arthur Rhodes

    Andy Dingley Guest

    dorayme wrote:

    > It depends on how well you know Dreamweaver (or any other
    > software). I have a friend who would go this way and well. I
    > would grab the product descriptions and work hard and use a text
    > editor


    Twice a day, for two thousand products ?
    Andy Dingley, Oct 4, 2006
    #8
  9. Arthur Rhodes

    Ben C Guest

    On 2006-10-04, Arthur Rhodes <> wrote:
    > On Tue, 03 Oct 2006 13:25:02 -0500, Ben C wrote:
    >
    >>> Is there an easier way?

    >>
    >> Python, and Beautiful Soup.
    >>
    >> http://www.crummy.com/software/BeautifulSoup/

    >
    > Looks good. You don't know of any ready made gui for it,
    > do you? I'm thinking it would be nice to have a tree
    > pane representing the structure of the document, and when
    > you click on a node a text pane shows the corresponding part
    > of the document.


    I don't know of one, but it wouldn't be hard to do. Someone may have
    done one.

    But Firefox can do exactly what you're describing, if you install the
    "DOM Inspector" extension. You can click on something in the tree
    representation in the DOM Inspector window and it flashes red on the
    page, or you can point to part of the page, click, and the corresponding
    part of the tree representation gets highlighted.

    Having found your way around the document with this DOM Inspector, you
    can then write the python/BeautifulSoup script to pull out the bits
    you're interested in.
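
    As a rough illustration of that last step, here is a sketch that
    pulls out the markup inside one element once the DOM Inspector has
    shown you its id. To stay runnable without installing anything it
    uses the standard library's html.parser rather than Beautiful Soup.
    The FragmentGrabber class, the grab helper, the "description" id and
    the sample page are all assumptions made up for the example. It
    also assumes the tags inside the target are properly nested and
    closed, which real pages often are not; that is where a more
    forgiving parser such as Beautiful Soup earns its keep.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FragmentGrabber(HTMLParser):
    """Collect the markup inside the first element whose id matches."""

    def __init__(self, target_id):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.done = False   # True once the target has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID:
                self.depth += 1
        elif tag not in VOID and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True
        else:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#%s;" % name)


def grab(html, target_id):
    """Return the inner markup of the element with the given id."""
    parser = FragmentGrabber(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up page: the wanted description sits inside layout tables.
page = """
<table><tr><td>
  <div id="nav">Home | Products</div>
  <div id="description"><p>The <b>Widget 3000</b> weighs 2&nbsp;kg.<br>Ships boxed.</p></div>
</td></tr></table>
"""
print(grab(page, "description"))
```

    Run over a list of saved product pages, something like this would
    turn the cut-and-paste job into a loop.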
    Ben C, Oct 4, 2006
    #9
  10. Arthur Rhodes

    dorayme Guest

    In article
    <>,
    "Andy Dingley" <> wrote:

    > dorayme wrote:
    >
    > > It depends on how well you know Dreamweaver (or any other
    > > software). I have a friend who would go this way and well. I
    > > would grab the product descriptions and work hard and use a text
    > > editor

    >
    > Twice a day, for two thousand products ?


    No, well, if it were on this scale, I would fire up Dreamweaver
    or even the 98 version of Word and export to HTML and see how it
    renders a table of product specs. I would then see what I could
    do to clean up crap via Search and Replace, using extra GREP if
    need be, and shape it all how I wanted. But my point was this: be
    sure the scale of the job is big enough to embark on anything
    more than simple hard work with a text editor, entering, cutting
    and pasting where possible etc.
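
    The search-and-replace cleanup described above might look roughly
    like this in Python's re module. It is a sketch, not a general HTML
    cleaner: the list of attributes to strip, the choice to drop font
    tags, and the clean helper itself are assumptions for illustration,
    and regular expressions will trip over unusual markup (for example
    a ">" inside an attribute value).

```python
import re

# Opening and closing font tags are dropped; their contents are kept.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)

# Any opening tag, so attributes are only touched inside tags, not in text.
OPEN_TAG = re.compile(r"<[a-zA-Z][^>]*>")

# Presentational attributes to strip (an illustrative list, not exhaustive).
PRESENTATIONAL_ATTR = re.compile(
    r"""\s+(?:bgcolor|align|valign|width|height|border|cellpadding|cellspacing)
        \s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE | re.VERBOSE,
)


def clean(html):
    """Remove font tags and presentational attributes from an HTML string."""
    html = FONT_TAG.sub("", html)
    return OPEN_TAG.sub(
        lambda match: PRESENTATIONAL_ATTR.sub("", match.group(0)), html
    )


# Made-up sample of the kind of markup an export tends to produce.
sample = ("<td bgcolor=\"#FFFFFF\" align=left width='50'>"
          "<font face=\"Arial\" size=\"2\">2 kg</font></td>")
print(clean(sample))
```

    Text outside tags is left alone, so a product description that
    happens to mention "width=50" survives the cleanup.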

    Where do you get these figures from?

    Truth is this, I have found many earthlings think hard rote work
    beneath their human dignity. I happen to think humans have no
    real dignity, it is all a pretence and they should get a better
    perspective of their place in evolution. They are machines and
    should stop trying to distance themselves from lower and more
    mechanical forms.


    [btw. Alan Flavell has a philosophy behind the idea of hard rote
    work, that it offends against human dignity... It is a point of
    view. I am not saying it is unintelligent. But imo, much evil has
    come from ideas like this. I don't suppose anyone wants to know
    more? :) ]

    --
    dorayme
    dorayme, Oct 5, 2006
    #10
  11. Arthur Rhodes

    wayne Guest

    Arthur Rhodes wrote:
    > I'm building a web store and I have to create a large number of
    > product descriptions. The distributors do not provide spec sheets
    > or marketing materials to me in html format. Instead, they advise
    > me to simply copy the descriptions from their web sites.
    >
    > The problem is that the descriptions I need to copy are embedded
    > in complex pages, with nested tables, etc. Simply copying the
    > page source doesn't seem to be that useful. I end up having to
    > cut out lots of table code, etc., and usually make mistakes that
    > are time consuming to figure out and fix.
    >
    > The other alternative is to copy the text and then recreating the html
    > formatting from scratch.
    >
    > Is there an easier way?
    >
    > Right now, I'm just writing HTML by hand in a text editor. Would
    > this be any easier if I used a web editor like Dreamweaver?
    >
    >

    Perhaps you want to use server side includes to include text files in
    the proper locations?

    --
    Wayne
    http://www.glenmeadows.us
    With or without religion, you would have good people doing good things
    and evil people doing evil things. But for good people to do evil
    things, that takes religion.
    —Steven Weinberg
    wayne, Oct 5, 2006
    #11
  12. On Thu, 5 Oct 2006, dorayme wrote:

    > [btw. Alan Flavell has a philosophy behind the idea of hard rote
    > work, that it offends against human dignity...


    [ warning, off-topic ]

    I don't care whether it's hard or easy - *rote* work that can be done
    with the computer is inappropriate to be done manually.

    I guess you're referring to my postings which rebuke posters for
    asking help on web pages which don't even pass validation. I stand by
    the principle that it's demeaning to ask others for help when the
    validation could and should have been done before asking for help.

    When I'm tasked to do something new for which I don't know a good
    solution, I'll tend to do a lot more manual work the first time
    around. But while I'm doing it I'll be thinking of ways to automate
    what I'm doing, on the principle that if I produce a successful result
    first time, I'm very likely to be asked to do the same kind of thing
    again.

    Quite some years ago I was suddenly asked (about 2 weeks after the
    final deadline!) to produce a webified version of the student handbook
    of our department, which was available only in an MS Word format.
    Back then the results from any MS product which purported to produce
    HTML were significantly worse even than the mess that today's MS
    products generate. But I found a package called rtftohtml, from
    Sunpack software, which was highly configurable and produced pretty
    much the results I wanted. Then I was asked to make some changes, so
    I said OK, give me the updated Word file and I'll do it (they seemed
    to think that the solution would be to apply updates separately to the
    Word file and to the HTML file, but that's a mug's game). Then I
    tossed the new Word file into the conversion procedure that I had set
    up, and hey presto.

    Needless to say, a year later I was asked to webify the new edition of
    the handbook. I simply tossed the new edition into the processing
    chain and the result came out nearly as good as the last time. The
    only thing wrong was that the Word file incorporated some Mac-coded
    scientific content from one of the academics (which already displayed
    as garbage in Win MS Word), so I needed an extra stanza in the
    converter to deal with that.

    This is all some years ago now - I haven't done this task for a few
    years now, and Sunpack rtftohtml has transmogrified into something
    else. But I think this case is quite a good illustration of the
    benefits of using the computer. If one had done that only with point
    and shove every time, just imagine the wasted effort.
    Alan J. Flavell, Oct 6, 2006
    #12
  13. Arthur Rhodes

    dorayme Guest

    In article
    <>,
    "Alan J. Flavell" <> wrote:

    > On Thu, 5 Oct 2006, dorayme wrote:
    >
    > > [btw. Alan Flavell has a philosophy behind the idea of hard rote
    > > work, that it offends against human dignity...

    >
    > [ warning, off-topic ]
    >
    > I don't care whether it's hard or easy - *rote* work that can be done
    > with the computer is inappropriate to be done manually.
    >


    I don't really disagree with anything you go on to say. It does
    not really matter whether it is an ethical stance. It is sure
    sensible to get the machine to do things auto as much as
    possible. I am a big fan of automation, I know it sounds absurd,
    but, first time, I had to force myself not to watch (in
    fascination) my computer do a big batching process in Photoshop.
    There it was, the machine at its best, opening and altering and
    saving files all by itself! Yes, it literally opened things on
    screen. Now, that is what a computer is for, I thought.

    But, if I may just make this point again, not all jobs are worth
    the effort of "tooling up" to do things automatically.

    As you describe, it is often useful to get a big percentage of
    the job done with auto processes. But in many jobs, the push for
    turn-key operational success brings in diminishing returns. Not a
    bad maxim is:

    Automate what it is easy to automate and get ready to roll up the
    sleeves for the rest.

    --
    dorayme
    dorayme, Oct 7, 2006
    #13
  14. "dorayme" <> wrote in message
    news:...
    > In article
    > <>,
    > "Alan J. Flavell" <> wrote:
    >
    > > On Thu, 5 Oct 2006, dorayme wrote:
    > >
    > > > [btw. Alan Flavell has a philosophy behind the idea of hard rote
    > > > work, that it offends against human dignity...

    > >
    > > [ warning, off-topic ]
    > >
    > > I don't care whether it's hard or easy - *rote* work that can be done
    > > with the computer is inappropriate to be done manually.
    > >

    >
    > I don't really disagree with anything you go on to say. It does
    > not really matter whether it is an ethical stance. It is sure
    > sensible to get the machine to do things auto as much as
    > possible. I am a big fan of automation, I know it sounds absurd,
    > but, first time, I had to force myself not to watch (in
    > fascination) my computer do a big batching process in Photoshop.
    > There it was, the machine at its best, opening and altering and
    > saving files all by itself! Yes, it literally opened things on
    > screen. Now, that is what a computer is for, I thought.
    >
    > But, if I may just make this point again, not all jobs are worth
    > the effort of "tooling up" to do things automatically.
    >
    > As you describe, it is often useful to get a big percentage of
    > the job done with auto processes. But in many jobs, the push for
    > turn-key operational success brings in diminishing returns. Not a
    > bad maxim is:
    >
    > Automate what it is easy to automate and get ready to roll up the
    > sleeves for the rest.



    Strange as it may sound to you, I share your opinion.
    Automation makes sense when there is already a big amount of work to do, not
    for every little thing.

    --
    Luigi Donatello Asero
    https://www.scaiecat-spa-gigi.com/it/svezia.html
    谢谢你, спасибо, tack så mycket!
    Luigi Donatello Asero, Oct 7, 2006
    #14