Stripping html

Medros · Jun 12, 2006

I understand that you can strip HTML out of a text file so that all that
is left is the visible information that is needed (e.g. everything that
has < > around it is gone). My question is this: I have a table of
information that I need to feed into a program. I need the program to
read it just as you would on paper and be able to use that information
as if it had been entered by hand. I am unsure how to strip so much away
to leave just the information I want, and then use it the way I want.
Any help?

Morris Dovey · Jun 12, 2006

Medros said:

| I understand that you can strip html out of a txt file so that all
| the information is left is the visable information that is needed
| (e.g. everything that has < > around is gone). My question is that
| I have a table of information that I need to be fed into a program
| as such. Well kind of I need the program to read it just as you
| would on paper and be able to use that information like it was
| entered. I am unsure how strip so much away just to leave me with
| the information I want and then use it like I want. Any help?

Start with a simple program that reads and saves one character at a
time looking for a '<' character. When it finds a '<', it should throw
it (and following characters) away until it finds a '>'. When the
program reaches end-of-file, hopefully it's saved what you want to
keep.

You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;) - but those can wait on
getting the initial version working.
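
A minimal sketch of that first version, in C. To keep it easy to test it works on a string already in memory rather than reading a file one character at a time; the function name strip_tags and the caller-supplied output buffer are assumptions made for illustration, not anything from the original post.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `in` to `out`, dropping everything from '<' through the next '>'.
 * `out` must have room for strlen(in) + 1 bytes (the output never grows).
 * Returns the number of characters written, excluding the terminator. */
size_t strip_tags(const char *in, char *out)
{
    size_t n = 0;
    int in_tag = 0;              /* nonzero while between '<' and '>' */

    assert(in != NULL && out != NULL);

    for (; *in != '\0'; in++) {
        if (in_tag) {
            if (*in == '>')
                in_tag = 0;      /* tag ends; the '>' is discarded too */
        } else if (*in == '<') {
            in_tag = 1;          /* tag starts; the '<' is discarded */
        } else {
            out[n++] = *in;      /* ordinary text: keep it */
        }
    }
    out[n] = '\0';
    return n;
}
```

The same two-state loop carries over to a file if you replace the string walk with getc() and putc(). It knows nothing about entities, scripts or comments.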

Richard Heathfield · Jun 12, 2006

Medros said:

I understand that you can strip html out of a txt file so that all the
information is left is the visable information that is needed (e.g.
everything that has < > around is gone). My question is that I have a
table of information that I need to be fed into a program as such. Well
kind of I need the program to read it just as you would on paper and be
able to use that information like it was entered. I am unsure how strip
so much away just to leave me with the information I want and then use
it like I want. Any help?

If the HTML is well-produced, mostly you can simply read characters one by
one. If you hit a '<' character, discard it, and keep discarding everything
until you hit a '>', which again you can discard.

If you hit a & character, though, you have some work to do. You'll need to
save up characters until you hit a semicolon.

The characters between the & and the ; form a keyword, e.g. &amp; for
ampersand, &lt; for '<', &gt; for '>', &copy; for the copyright symbol, and
so on. You will need to have some kind of lookup in your program for
matching these keywords with their replacements.
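
One way such a lookup might look in C, as a sketch only. The table holds just a handful of entities, the names decode_entities and lookup_entity are made up for this example, "(c)" is an ASCII stand-in for the copyright sign, and numeric references such as &#169; are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: entity name -> plain-text replacement. */
struct entity {
    const char *name;
    const char *text;
};

static const struct entity entities[] = {
    { "amp",  "&"   },
    { "lt",   "<"   },
    { "gt",   ">"   },
    { "quot", "\""  },
    { "nbsp", " "   },
    { "copy", "(c)" }   /* ASCII stand-in for the copyright sign */
};

/* Return the replacement for the name of length `len`, or NULL if unknown. */
static const char *lookup_entity(const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
        if (strlen(entities[i].name) == len &&
            strncmp(entities[i].name, name, len) == 0)
            return entities[i].text;
    }
    return NULL;
}

/* Copy `in` to `out`, replacing known "&name;" sequences.
 * Every replacement in the table is shorter than its entity, so
 * strlen(in) + 1 bytes of output space is enough. */
void decode_entities(const char *in, char *out)
{
    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        const char *semi = (*in == '&') ? strchr(in, ';') : NULL;
        const char *text = NULL;

        if (semi != NULL)
            text = lookup_entity(in + 1, (size_t)(semi - in - 1));

        if (text != NULL) {
            size_t n = strlen(text);
            memcpy(out, text, n);
            out += n;
            in = semi + 1;       /* continue after the ';' */
        } else {
            *out++ = *in++;      /* not a known entity: copy as-is */
        }
    }
    *out = '\0';
}
```

Unknown keywords, and a bare & with no matching keyword before the next semicolon, are passed through unchanged rather than dropped, which seemed the safer default for a sketch.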

If you hit a space character, preserve it, but then discard all remaining
whitespace until the next non-whitespace character.
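
A sketch of the whitespace rule in C. It assumes that any run of whitespace (spaces, tabs, newlines) should come out as a single space, which is slightly broader than the rule as stated; the name collapse_whitespace is invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copy `in` to `out`, replacing each run of whitespace with one space.
 * `out` must have room for strlen(in) + 1 bytes. */
void collapse_whitespace(const char *in, char *out)
{
    int in_space = 0;            /* nonzero while inside a whitespace run */

    assert(in != NULL && out != NULL);

    for (; *in != '\0'; in++) {
        if (isspace((unsigned char)*in)) {
            if (!in_space)
                *out++ = ' ';    /* keep the first one, as a plain space */
            in_space = 1;
        } else {
            *out++ = *in;
            in_space = 0;
        }
    }
    *out = '\0';
}
```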

These simple rules will give you a basic translation into English, but you
have to be a bit cleverer if you want to split text into paragraphs and so
on, by interpreting tags such as <BR>, <P>, <TD> etc -- at which point you
won't be too far away from having your own text-only but otherwise
full-blown HTML renderer.

If the HTML is /not/ well-produced, the above may not be sufficient.

Bill Latvin · Jun 12, 2006

Medros said:

| I understand that you can strip html out of a txt file so that all
| the information is left is the visable information that is needed
| (e.g. everything that has < > around is gone). My question is that
| I have a table of information that I need to be fed into a program
| as such. Well kind of I need the program to read it just as you
| would on paper and be able to use that information like it was
| entered. I am unsure how strip so much away just to leave me with
| the information I want and then use it like I want. Any help?

Start with a simple program that reads and saves one character at a
time looking for a '<' character. When it finds a '<', it should throw
it (and following characters) away until it finds a '>'. When the
program reaches end-of-file, hopefully it's saved what you want to
keep.

I remember starting with a simple program like that, and finding to my
dismay that between the "script" and "/script" tags the '<' and '>'
characters are used not as tag delimiters but as "greater than" and
"less than" comparison operators. I had to check for those particular
tags and discard everything between them, and not let the presence of
a lone unbalanced '<' in the script cause my logic to miss finding the
"/script" tag.

Bill
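
A sketch in C of the kind of check described above, assuming the page is already in a string. The function names are invented for the example. Once an opening script tag is seen, only the text "</script" ends the skipped region, so a lone '<' or '>' inside the script cannot confuse the tag logic.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return nonzero if `s` begins with `prefix`, ignoring letter case. */
static int starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Copy `in` to `out` without tags and without script bodies.
 * `out` must have room for strlen(in) + 1 bytes. */
void strip_tags_and_scripts(const char *in, char *out)
{
    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (starts_with_nocase(in, "<script")) {
            /* Skip the script body: only "</script" ends it, so any
             * '<' or '>' used as a comparison operator is ignored. */
            in += 7;
            while (*in != '\0' && !starts_with_nocase(in, "</script"))
                in++;
        }
        if (*in == '<') {
            /* Ordinary tag, or the closing script tag: skip to '>'. */
            while (*in != '\0' && *in != '>')
                in++;
            if (*in == '>')
                in++;
        } else if (*in != '\0') {
            *out++ = *in++;      /* ordinary text: keep it */
        }
    }
    *out = '\0';
}
```

This is deliberately crude: it would also treat a made-up tag such as "<scripted>" as a script, and it does nothing about style blocks or comments, which can hold stray angle brackets too.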

Morris Dovey · Jun 12, 2006

Bill Latvin said:

| On Sun, 11 Jun 2006 21:46:03 -0500, "Morris Dovey"
|
|| Medros said:
||
||| I understand that you can strip html out of a txt file so that all
||| the information is left is the visable information that is needed
||| (e.g. everything that has < > around is gone). My question is that
||| I have a table of information that I need to be fed into a program
||| as such. Well kind of I need the program to read it just as you
||| would on paper and be able to use that information like it was
||| entered. I am unsure how strip so much away just to leave me with
||| the information I want and then use it like I want. Any help?
||
|| Start with a simple program that reads and saves one character at a
|| time looking for a '<' character. When it finds a '<', it should
|| throw it (and following characters) away until it finds a '>'.
|| When the program reaches end-of-file, hopefully it's saved what
|| you want to keep.
||
| I remember starting with a simple program like that, and finding to
| my dismay that between the "script" and "/script" tags the '<' and
| '>' characters are used not as tag delimiters but as "greater than"
| and "less than" comparison operators. I had to check for those
| particular tags and discard everything between them, and not let
| the presence of a lone unbalanced '<' in the script cause my logic
| to miss finding the "/script" tag.

Welcome to the club. It's because of things like that that I added my
second paragraph:

"You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;) - but those can wait on
getting the initial version working."

The refinements will depend on whether the OP wants a general solution
or just enough to extract data from one particular page. On
re-reading, I'd guess that <table>, <tr>, and <td> tags may be his
first refinement - but the question indicated that he'll probably need
to start at the most basic level.
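
For what it's worth, here is a rough C sketch of what that table refinement could look like. It is an illustration under assumptions, not a general solution: the input is a string, cells are separated by tabs and rows by newlines, nested tables are not considered, and the names is_tag and table_to_text are made up.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return nonzero if `s` starts with the tag text `tag` (e.g. "<td"),
 * ignoring case, and the tag name ends there (so "<th" does not
 * match "<thead"). */
static int is_tag(const char *s, const char *tag)
{
    while (*tag != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*tag))
            return 0;
        s++;
        tag++;
    }
    return !isalnum((unsigned char)*s);
}

/* Copy `in` to `out`, dropping tags but writing a tab between the
 * cells of a row and a newline at the end of each row.
 * `out` must have room for strlen(in) + 1 bytes. */
void table_to_text(const char *in, char *out)
{
    int cells = 0;               /* cells seen so far in this row */

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (*in != '<') {
            *out++ = *in++;      /* ordinary text: keep it */
            continue;
        }
        if (is_tag(in, "<td") || is_tag(in, "<th")) {
            if (cells++ > 0)
                *out++ = '\t';   /* separator before 2nd, 3rd... cell */
        } else if (is_tag(in, "</tr")) {
            *out++ = '\n';       /* row finished */
            cells = 0;
        }
        while (*in != '\0' && *in != '>')
            in++;                /* discard the rest of the tag */
        if (*in == '>')
            in++;
    }
    *out = '\0';
}
```

The resulting lines can then be split on tabs (for example with strtok or sscanf) to get at the individual values, which is roughly "reading it as you would on paper".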

Ian Collins · Jun 12, 2006

Richard Heathfield said:

If the HTML is well-produced, mostly you can simply read characters one by
one. If you hit a '<' character, discard it, and keep discarding everything
until you hit a '>', which again you can discard.

If you hit a & character, though, you have some work to do. You'll need to
save up characters until you hit a semicolon.

The characters between the & and the ; form a keyword, e.g. &amp; for
ampersand, &lt; for '<', &gt; for '>', &copy; for the copyright symbol, and
so on. You will need to have some kind of lookup in your program for
matching these keywords with their replacements.

If you hit a space character, preserve it, but then discard all remaining
whitespace until the next non-whitespace character.

These simple rules will give you a basic translation into English, but you
have to be a bit cleverer if you want to split text into paragraphs and so
on, by interpreting tags such as <BR>, <P>, <TD> etc -- at which point you
won't be too far away from having your own text-only but otherwise
full-blown HTML renderer.

If the HTML is /not/ well-produced, the above may not be sufficient.

HTMLtidy (http://tidy.sourceforge.net/) is your friend in these cases.
This little program has prevented much pain and suffering!

Asbjørn Sæbø · Jun 12, 2006

Medros said:
I understand that you can strip html out of a txt file so that all the
information is left is the visable information that is needed (e.g.
everything that has < > around is gone). My question is that I have a
table of information that I need to be fed into a program as such. Well
kind of I need the program to read it just as you would on paper and be
able to use that information like it was entered. I am unsure how strip
so much away just to leave me with the information I want and then use
it like I want. Any help?

lynx -dump ?

Asbjørn
