Stripping HTML

Medros

I understand that you can strip the HTML out of a text file so that only
the visible information is left (i.e. everything enclosed in < > is
gone). My question is this: I have a table of information that I need to
feed into a program. More precisely, I need the program to read the
table just as you would read it on paper and to use that information as
if it had been entered directly. I am unsure how to strip so much away,
leaving just the information I want, and then use it the way I want.
Any help?
 
Morris Dovey

Start with a simple program that reads and saves one character at a
time looking for a '<' character. When it finds a '<', it should throw
it (and following characters) away until it finds a '>'. When the
program reaches end-of-file, hopefully it's saved what you want to
keep.

You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;), but those can wait
until the initial version is working.
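
For what it's worth, the first version described above might look something like the sketch below. It works on a string already in memory rather than on a file, and the function name strip_tags is made up for illustration. It assumes every '<' opens a tag and the next '>' closes it, which real pages do not always honour.

```c
/* Copy src to dst, dropping everything from a '<' through the next '>'.
   dst must have room for at least strlen(src) + 1 bytes.
   strip_tags is a hypothetical name, not a library function. */
void strip_tags(const char *src, char *dst)
{
    int in_tag = 0;                 /* nonzero while between '<' and '>' */

    for (; *src != '\0'; src++) {
        if (in_tag) {
            if (*src == '>')        /* end of tag: resume copying */
                in_tag = 0;
        } else if (*src == '<') {   /* start of tag: stop copying */
            in_tag = 1;
        } else {
            *dst++ = *src;          /* ordinary text: keep it */
        }
    }
    *dst = '\0';
}
```

The same two-state loop works one character at a time on a file: read with getc(), keep the in_tag flag, and write kept characters with putc().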
 
Richard Heathfield

If the HTML is well-produced, mostly you can simply read characters one by
one. If you hit a '<' character, discard it, and keep discarding everything
until you hit a '>', which again you can discard.

If you hit an '&' character, though, you have some work to do. You'll need
to save up characters until you hit a semicolon.

The characters between the & and the ; form a keyword, e.g. &amp; for
ampersand, &lt; for '<', &gt; for '>', &copy; for the copyright symbol, and
so on. You will need to have some kind of lookup in your program for
matching these keywords with their replacements.
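
A small lookup of that kind could be sketched as follows. The table is deliberately tiny and ASCII-only (so &copy; is left out), mapping &nbsp; to a plain space is an assumption made for simplicity, and the names entities, lookup_entity and decode_entities are invented for this example. Anything not in the table is copied through unchanged.

```c
#include <string.h>

/* Hypothetical, deliberately incomplete entity table (ASCII results only). */
struct entity { const char *name; char ch; };

static const struct entity entities[] = {
    { "amp", '&' }, { "lt", '<' }, { "gt", '>' },
    { "quot", '"' }, { "nbsp", ' ' }   /* nbsp -> plain space: an assumption */
};

/* Return the replacement for the len-character keyword at name, or 0. */
static char lookup_entity(const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
        if (strlen(entities[i].name) == len &&
            strncmp(entities[i].name, name, len) == 0)
            return entities[i].ch;
    }
    return 0;
}

/* Copy src to dst, replacing known "&keyword;" sequences.
   dst must have room for at least strlen(src) + 1 bytes. */
void decode_entities(const char *src, char *dst)
{
    while (*src != '\0') {
        if (*src == '&') {
            const char *semi = strchr(src + 1, ';');
            char ch = semi ? lookup_entity(src + 1, (size_t)(semi - src - 1)) : 0;

            if (ch != 0) {          /* known keyword: emit its replacement */
                *dst++ = ch;
                src = semi + 1;
                continue;
            }
        }
        *dst++ = *src++;            /* anything else is copied as-is */
    }
    *dst = '\0';
}
```

Run this after the tags have been removed, not before, so that a decoded '<' is not mistaken for the start of a tag.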

If you hit a space character, preserve it, but then discard all remaining
whitespace until the next non-whitespace character.
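
As a rough sketch of the whitespace rule, assuming it is acceptable to turn any run of spaces, tabs and newlines into a single space (a slight variation on preserving the first character as-is), with collapse_whitespace being a name invented here:

```c
#include <ctype.h>

/* Copy src to dst, replacing each run of whitespace with one space.
   dst must have room for at least strlen(src) + 1 bytes. */
void collapse_whitespace(const char *src, char *dst)
{
    int prev_space = 0;             /* nonzero if the last char was whitespace */

    for (; *src != '\0'; src++) {
        if (isspace((unsigned char)*src)) {
            if (!prev_space)
                *dst++ = ' ';       /* keep only the first of the run */
            prev_space = 1;
        } else {
            *dst++ = *src;
            prev_space = 0;
        }
    }
    *dst = '\0';
}
```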

These simple rules will give you a basic translation into plain text, but
you have to be a bit cleverer if you want to split text into paragraphs and
so on, by interpreting tags such as <BR>, <P>, <TD> etc -- at which point
you won't be too far away from having your own text-only but otherwise
full-blown HTML renderer.

If the HTML is /not/ well-produced, the above may not be sufficient.
 
Bill Latvin

Morris Dovey said:

| Start with a simple program that reads and saves one character at a
| time looking for a '<' character. When it finds a '<', it should throw
| it (and following characters) away until it finds a '>'. When the
| program reaches end-of-file, hopefully it's saved what you want to
| keep.

I remember starting with a simple program like that, and finding to my
dismay that between the "script" and "/script" tags the '<' and '>'
characters are used not as tag delimiters but as "less than" and
"greater than" comparison operators. I had to check for those particular
tags and discard everything between them, and not let the presence of
a lone unbalanced '<' in the script cause my logic to miss finding the
"/script" tag.

Bill
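
One way to handle the script problem, sketched under the assumption that a script body ends at the first "</script" regardless of case, is to stop treating '<' as a tag opener while inside the script and search only for the closing tag. The names starts_with_ci and strip_tags_and_scripts are made up for this example, and the prefix test is loose enough that a tag such as "<scripted>" would also match.

```c
#include <ctype.h>

/* Case-insensitive test: does s begin with prefix (given in lower case)? */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != *prefix)
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Copy src to dst without tags and without the contents of script elements.
   dst must have room for at least strlen(src) + 1 bytes. */
void strip_tags_and_scripts(const char *src, char *dst)
{
    while (*src != '\0') {
        if (starts_with_ci(src, "<script")) {
            /* Inside a script a lone '<' or '>' is just an operator,
               so look only for the closing tag. */
            src += 7;
            while (*src != '\0' && !starts_with_ci(src, "</script"))
                src++;
            /* src now points at "</script" (or end of input); the
               ordinary tag handling below discards the closing tag. */
        }
        if (*src == '<') {
            while (*src != '\0' && *src != '>')
                src++;
            if (*src == '>')
                src++;
        } else if (*src != '\0') {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```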
 
Morris Dovey

Bill Latvin (in (e-mail address removed)) said:

| On Sun, 11 Jun 2006 21:46:03 -0500, "Morris Dovey"
|
|| Start with a simple program that reads and saves one character at a
|| time looking for a '<' character. When it finds a '<', it should
|| throw it (and following characters) away until it finds a '>'.
|| When the program reaches end-of-file, hopefully it's saved what
|| you want to keep.
||
| I remember starting with a simple program like that, and finding to
| my dismay that between the "script" and "/script" tags the '<' and
| '>' characters are used not as tag delimiters but as "less than"
| and "greater than" comparison operators. I had to check for those
| particular tags and discard everything between them, and not let
| the presence of a lone unbalanced '<' in the script cause my logic
| to miss finding the "/script" tag.

Welcome to the club. It's because of things like that that I added my
second paragraph:

"You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;), but those can wait
until the initial version is working."

The refinements will depend on whether the OP wants a general solution
or just enough to extract data from one particular page. On
re-reading, I'd guess that the <table>, <tr>, and <td> tags may be his
first refinement - but the question indicated that he'll probably need
to start at the most basic level.
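
As one possible shape for that table refinement, the sketch below turns each closing cell tag into a tab and each closing row tag into a newline, so the program ends up with one line per row and tab-separated fields. It assumes a simple, well-formed table with no nested tables and no tabs inside cells, and the names tag_named and table_to_tsv are invented for this example.

```c
#include <ctype.h>

/* Does the tag text at s (just past '<') name `name` (given in lower case)?
   The name must be followed by '>', '/', whitespace, or end of string. */
static int tag_named(const char *s, const char *name)
{
    while (*name != '\0') {
        if (tolower((unsigned char)*s) != *name)
            return 0;
        s++;
        name++;
    }
    return *s == '>' || *s == '/' || *s == '\0' || isspace((unsigned char)*s);
}

/* Copy the text of an HTML table to dst: cells separated by tabs, one
   row per line. dst must have room for at least strlen(src) + 1 bytes. */
void table_to_tsv(const char *src, char *dst)
{
    char *start = dst;

    while (*src != '\0') {
        if (*src != '<') {
            /* Drop raw line breaks so rows come only from </tr>. */
            if (*src != '\n' && *src != '\r')
                *dst++ = *src;
            src++;
            continue;
        }
        src++;                                  /* step past '<' */
        if (tag_named(src, "/td") || tag_named(src, "/th")) {
            *dst++ = '\t';                      /* end of cell */
        } else if (tag_named(src, "/tr")) {
            if (dst > start && dst[-1] == '\t')
                dst--;                          /* no tab after last cell */
            *dst++ = '\n';                      /* end of row */
        }
        while (*src != '\0' && *src != '>')     /* discard rest of the tag */
            src++;
        if (*src == '>')
            src++;
    }
    *dst = '\0';
}
```

Each output line can then be split on tabs and converted with strtol() or sscanf(), which is about as close to "reading it on paper" as a C program gets. Indentation between the tags in the source would still show up in the output, so the whitespace handling discussed elsewhere in the thread still applies.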
 
Ian Collins

Richard Heathfield said:

| If the HTML is /not/ well-produced, the above may not be sufficient.

HTMLtidy (http://tidy.sourceforge.net/) is your friend in these cases.
This little program has prevented much pain and suffering!
 
Asbjørn Sæbø

lynx -dump ?

Asbjørn
 
