Efficient way to rip html

Arthur Rhodes

I'm building a web store and I have to create a large number of
product descriptions. The distributors do not provide spec sheets
or marketing materials to me in html format. Instead, they advise
me to simply copy the descriptions from their web sites.

The problem is that the descriptions I need to copy are embedded
in complex pages, with nested tables, etc. Simply copying the
page source doesn't seem to be that useful. I end up having to
cut out lots of table code, etc., and usually make mistakes that
are time consuming to figure out and fix.

The other alternative is to copy the text and then recreate the HTML
formatting from scratch.

Is there an easier way?

Right now, I'm just writing HTML by hand in a text editor. Would
this be any easier if I used a web editor like Dreamweaver?
 
Ben C

Arthur Rhodes said:
I'm building a web store and I have to create a large number of
product descriptions. The distributors do not provide spec sheets
or marketing materials to me in html format. Instead, they advise
me to simply copy the descriptions from their web sites.

The problem is that the descriptions I need to copy are embedded
in complex pages, with nested tables, etc. Simply copying the
page source doesn't seem to be that useful. I end up having to
cut out lots of table code, etc., and usually make mistakes that
are time consuming to figure out and fix.

The other alternative is to copy the text and then recreate the HTML
formatting from scratch.

Is there an easier way?

Python, and Beautiful Soup.

http://www.crummy.com/software/BeautifulSoup/
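
A minimal sketch of what that could look like, assuming the `beautifulsoup4` package is installed and that the description sits in a table cell with a class of `desc`. The sample markup, the class name, and the `extract_description` helper are all made up for illustration; a real distributor page would need its own selector.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the description is buried inside nested layout tables.
SAMPLE_PAGE = """
<html><body>
<table><tr><td>
  <table><tr>
    <td class="nav">Home | Products</td>
    <td class="desc"><h2>Widget 3000</h2><p>A <b>sturdy</b> widget.</p></td>
  </tr></table>
</td></tr></table>
</body></html>
"""

def extract_description(page_html, css_class="desc"):
    """Return the inner HTML of the first <td> with the given class, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    cell = soup.find("td", class_=css_class)
    if cell is None:
        return None
    # decode_contents() returns the markup inside the cell, without the <td> itself.
    return cell.decode_contents().strip()

print(extract_description(SAMPLE_PAGE))
```

The surrounding table scaffolding is left behind and only the heading and paragraph come out, which is the part that is tedious to cut out by hand.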
 
dorayme

Arthur Rhodes said:
I'm building a web store and I have to create a large number of
product descriptions. The distributors do not provide spec sheets
or marketing materials to me in html format. Instead, they advise
me to simply copy the descriptions from their web sites.

The problem is that the descriptions I need to copy are embedded
in complex pages, with nested tables, etc. Simply copying the
page source doesn't seem to be that useful. I end up having to
cut out lots of table code, etc., and usually make mistakes that
are time consuming to figure out and fix.

The other alternative is to copy the text and then recreate the HTML
formatting from scratch.

Is there an easier way?

Right now, I'm just writing HTML by hand in a text editor. Would
this be any easier if I used a web editor like Dreamweaver?

It depends on how well you know Dreamweaver (or any other
software). I have a friend who would go this way and well. I
would grab the product descriptions and work hard and use a text
editor because it would take me less time. You are in the middle
of a job. Can you risk finding out? If you know what you are
doing with the text grabs, just do it and get it done and charge
the client. As you get going, you will find it going quicker and
quicker because you will be building patterns in your hand work.
Products are products, and if they are all in tables to show off
proper tabular specs, you will simply copy and paste a few table
types you have constructed, most data will fit in one or other of
them with little mods.
 
ato_zee

Arthur Rhodes said:
Instead, they advise
me to simply copy the descriptions from their web sites.

With images on websites you can usually right click, copy,
and then paste the GIF or JPG into a folder (or, for some
applications, straight into the application of your choice).

Likewise with text: with a bit of practice you can drag to
highlight it, release, right click on the highlighted text,
copy, and then paste (selecting the paste option Unformatted
Text), or paste into Notepad, which reduces everything to
unformatted text.
Paste options depend on the application; sometimes you have
to start from the Edit menu to find the Unformatted Text option.
With Dreamweaver I think there is an unformatted text option
for pasting long runs of text into the code window.
But then I mostly edit and build in Wordpad, because it opens
and saves HTML without asking what format you want to
save in.
DW8 can produce non-validating code without warning you,
which has some merit in the early stages of design. With Wordpad
I can save and immediately see the effect by refreshing
the browser.
Sometimes you can select tables or highlight cells, copy, and
paste into Excel, which gives you further options for
manipulating/parsing the data.
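
If the aim is to get a spec table into a spreadsheet, the copy-and-paste step can also be scripted. A rough sketch using only Python's standard library follows; the sample table and the `table_rows` and `rows_to_csv` helpers are invented for illustration, and it assumes simple, non-nested tables with no rowspan or colspan.

```python
import csv
import io
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the fragments and collapse any internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def table_rows(page_html):
    parser = TableRows()
    parser.feed(page_html)
    parser.close()
    return parser.rows

def rows_to_csv(rows):
    """Render rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

sample = """
<table>
  <tr><th>Spec</th><th>Value</th></tr>
  <tr><td>Weight</td><td>2 kg</td></tr>
  <tr><td>Colour</td><td><b>Red</b>, blue</td></tr>
</table>
"""
print(rows_to_csv(table_rows(sample)))
```

Inline tags such as `<b>` inside a cell are dropped and only their text is kept, which is usually what you want before reformatting the data yourself.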
 
mbstevens

Arthur Rhodes said:
The problem is that the descriptions I need to copy are embedded in
complex pages, with nested tables, etc. Simply copying the page source
doesn't seem to be that useful. I end up having to cut out lots of table
code, etc., and usually make mistakes that are time consuming to figure
out and fix.


Perl's HTML::Parser module will divide an HTML document into its various
parts (including text) with just a few lines of code. In the more
structured Python world, sgmllib, htmllib, or HTMLParser are the modules
to look into.
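
For anyone reading this with a current Python: sgmllib and htmllib were removed in Python 3, and HTMLParser now lives in the standard library as `html.parser`. Here is a small sketch of the "divide the document into its parts" idea, keeping only the visible text. The `TextExtractor` class, the `page_text` helper, and the sample markup are illustrative names, not anything from a real distributor page.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

def page_text(page_html):
    """Return the non-empty text fragments of the page, in document order."""
    parser = TextExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.chunks

sample = (
    "<table><tr><td><b>Widget</b></td><td>Weighs 2 kg</td></tr></table>"
    "<script>var x = 1;</script>"
)
print(page_text(sample))  # ['Widget', 'Weighs 2 kg']
```

The parser calls one handler per tag or text run, so the table markup never has to be cut out by hand; you simply ignore the events you do not care about.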
 
Andy Dingley

dorayme said:
It depends on how well you know Dreamweaver (or any other
software). I have a friend who would go this way and well. I
would grab the product descriptions and work hard and use a text
editor

Twice a day, for two thousand products?
 
Ben C

Looks good. You don't know of any ready made gui for it,
do you? I'm thinking it would be nice to have a tree
pane representing the structure of the document, and when
you click on a node a text pane shows the corresponding part
of the document.

I don't know of one, but it wouldn't be hard to do. Someone may have
done one.

But Firefox can do exactly what you're describing, if you install the
"DOM Inspector" extension. You can click on something in the tree
representation in the DOM Inspector window and it flashes red on the
page, or you can point to part of the page, click, and the corresponding
part of the tree representation gets highlighted.

Having found your way around the document with this DOM Inspector, you
can then write the python/BeautifulSoup script to pull out the bits
you're interested in.
 
dorayme

Andy Dingley said:
Twice a day, for two thousand products?

No, well, if it were on this scale, I would fire up Dreamweaver
or even the 98 version of Word and export to HTML and see how it
renders a table of product specs. I would then see what I could
do to clean up crap via Search and Replace, using extra GREP if
need be, and shape it all how I wanted. But my point was this: be
sure the scale of the job is big enough to embark on anything
more than simple hard work with a text editor, entering, cutting
and pasting where possible etc.
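
As a rough illustration of that search-and-replace approach, here is a sketch in Python using regular expressions. The list of tags and attributes is only a guess at the kind of clutter a word processor export leaves behind, and the `clean_export` helper and sample markup are hypothetical. Regular expressions are a blunt tool for HTML, so the output still wants a look-over by eye.

```python
import re

# (pattern, replacement) pairs, applied in order. The tag and attribute
# names are assumptions about typical export clutter, not a complete list.
CLEANUP_RULES = [
    # Drop HTML comments, including multi-line ones.
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),
    # Drop presentational tags but keep the text between them.
    (re.compile(r"</?(?:font|span|o:p)\b[^>]*>", re.IGNORECASE), ""),
    # Strip presentational attributes, whether quoted or unquoted.
    (re.compile(
        r"""\s+(?:class|style|lang|align|width|bgcolor)\s*=\s*"""
        r"""(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        re.IGNORECASE), ""),
    # Collapse runs of spaces and tabs into a single space.
    (re.compile(r"[ \t]+"), " "),
]

def clean_export(markup):
    """Apply every cleanup rule in turn and return the tidied markup."""
    for pattern, replacement in CLEANUP_RULES:
        markup = pattern.sub(replacement, markup)
    return markup.strip()

sample = (
    '<td class=MsoNormal width="120" style=\'color:red\'>'
    '<font face="Arial"><span lang=EN-US>2 kg</span></font>'
    '<!-- exported by a word processor --></td>'
)
print(clean_export(sample))  # <td>2 kg</td>
```

Keeping the rules in one ordered list makes it easy to add a pattern each time a new kind of clutter turns up in the export.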

Where do you get these figures from?

Truth is this, I have found many earthlings think hard rote work
beneath their human dignity. I happen to think humans have no
real dignity, it is all a pretence and they should get a better
perspective of their place in evolution. They are machines and
should stop trying to distance themselves from lower and more
mechanical forms.


[btw. Alan Flavell has a philosophy behind the idea of hard rote
work, that it offends against human dignity... It is a point of
view. I am not saying it is unintelligent. But imo, much evil has
come from ideas like this. I don't suppose anyone wants to know
more? :) ]
 
wayne

Arthur said:
I'm building a web store and I have to create a large number of
product descriptions. The distributors do not provide spec sheets
or marketing materials to me in html format. Instead, they advise
me to simply copy the descriptions from their web sites.

The problem is that the descriptions I need to copy are embedded
in complex pages, with nested tables, etc. Simply copying the
page source doesn't seem to be that useful. I end up having to
cut out lots of table code, etc., and usually make mistakes that
are time consuming to figure out and fix.

The other alternative is to copy the text and then recreate the HTML
formatting from scratch.

Is there an easier way?

Right now, I'm just writing HTML by hand in a text editor. Would
this be any easier if I used a web editor like Dreamweaver?

Perhaps you want to use server-side includes to include text files in
the proper locations?

--
Wayne
http://www.glenmeadows.us
With or without religion, you would have good people doing good things
and evil people doing evil things. But for good people to do evil
things, that takes religion.
—Steven Weinberg
 
Alan J. Flavell

[btw. Alan Flavell has a philosophy behind the idea of hard rote
work, that it offends against human dignity...

[ warning, off-topic ]

I don't care whether it's hard or easy - *rote* work that can be done
with the computer is inappropriate to be done manually.

I guess you're referring to my postings which rebuke posters for
asking help on web pages which don't even pass validation. I stand by
the principle that it's demeaning to ask others for help when the
validation could and should have been done before asking for help.

When I'm tasked to do something new for which I don't know a good
solution, I'll tend to do a lot more manual work the first time
around. But while I'm doing it I'll be thinking of ways to automate
what I'm doing, on the principle that if I produce a successful result
first time, I'm very likely to be asked to do the same kind of thing
again.

Quite some years ago I was suddenly asked (about 2 weeks after the
final deadline!) to produce a webified version of the student handbook
of our department, which was available only in an MS Word format.
Back then the results from any MS product which purported to produce
HTML were significantly worse even than the mess that today's MS
products generate. But I found a package called rtftohtml, from
Sunpack software, which was highly configurable and produced pretty
much the results I wanted. Then I was asked to make some changes, so
I said OK, give me the updated Word file and I'll do it (they seemed
to think that the solution would be to apply updates separately to the
Word file and to the HTML file, but that's a mug's game). Then I
tossed the new Word file into the conversion procedure that I had set
up, and hey presto.

Needless to say, a year later I was asked to webify the new edition of
the handbook. I simply tossed the new edition into the processing
chain and the result came out nearly as good as the last time. The
only thing wrong was that the Word file incorporated some Mac-coded
scientific content from one of the academics (which already displayed
as garbage in Win MS Word), so I needed an extra stanza in the
converter to deal with that.

This is all some years ago now - I haven't done this task for a few
years now, and Sunpack rtftohtml has transmogrified into something
else. But I think this case is quite a good illustration of the
benefits of using the computer. If one had done that only with point
and shove every time, just imagine the wasted effort.
 
dorayme

Alan J. Flavell said:
[btw. Alan Flavell has a philosophy behind the idea of hard rote
work, that it offends against human dignity...

[ warning, off-topic ]

I don't care whether it's hard or easy - *rote* work that can be done
with the computer is inappropriate to be done manually.

I don't really disagree with anything you go on to say. It does
not really matter whether it is an ethical stance. It is sure
sensible to get the machine to do things auto as much as
possible. I am a big fan of automation, I know it sounds absurd,
but, first time, I had to force myself not to watch (in
fascination) my computer do a big batching process in Photoshop.
There it was, the machine at its best, opening and altering and
saving files all by itself! Yes, it literally opened things on
screen. Now, that is what a computer is for, I thought.

But, if I may just make this point again, not all jobs are worth
the effort of "tooling up" to do things automatically.

As you describe, it is often useful to get a big percentage of
the job done with auto processes. But in many jobs, the push for
turn-key operational success brings in diminishing returns. Not a
bad maxim is:

Automate what it is easy to automate and get ready to roll up the
sleeves for the rest.
 
Luigi Donatello Asero

dorayme said:
Alan J. Flavell said:
[btw. Alan Flavell has a philosophy behind the idea of hard rote
work, that it offends against human dignity...

[ warning, off-topic ]

I don't care whether it's hard or easy - *rote* work that can be done
with the computer is inappropriate to be done manually.

I don't really disagree with anything you go on to say. It does
not really matter whether it is an ethical stance. It is sure
sensible to get the machine to do things auto as much as
possible. I am a big fan of automation, I know it sounds absurd,
but, first time, I had to force myself not to watch (in
fascination) my computer do a big batching process in Photoshop.
There it was, the machine at its best, opening and altering and
saving files all by itself! Yes, it literally opened things on
screen. Now, that is what a computer is for, I thought.

But, if I may just make this point again, not all jobs are worth
the effort of "tooling up" to do things automatically.

As you describe, it is often useful to get a big percentage of
the job done with auto processes. But in many jobs, the push for
turn-key operational success brings in diminishing returns. Not a
bad maxim is:

Automate what it is easy to automate and get ready to roll up the
sleeves for the rest.


Strange as it may sound to you, I share your opinion.
Automation makes sense when there is already a large amount of work to do, not
for every little thing.
 
