cutting out the tags

Raven

Hi.

For a program I'm writing, I need a function that will take an HTML file and
cut the tags out of it, leaving only the text data.
The obvious way is (or seems to be) searching for <, then for >, and cutting
everything between them.

Is the use of the < and > characters in these files restricted in any way? I
know that they should normally be replaced with &lt; and &gt; (AFAIR) in the
plain text data, but I guess the HTML documents you see on the net are not
ideal, and many HTML authors write documents that violate the rules yet are
still displayed normally by all the well-known browsers.

Is <!-- <<<<< --> a valid comment, or <img src="aaa.jpg" alt="<<evil alt><">
a valid image tag, for example? (by valid, I mean usable without errors in
this case ;) )

Are "\<" and "\>" treated as the plain-text "<" and ">" characters themselves
in HTML files?

And the last one: are there any other special cases that I have to think
about if I am using this simple method (cutting out everything between < and >)?
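
To make the simple method concrete, here is a minimal Python sketch of it, run on a few inputs where it goes wrong. The function name `naive_strip` and the sample strings are made up for illustration; this is only one way to write "cut everything between < and >", not a recommended implementation.

```python
import re


def naive_strip(markup):
    """Remove everything from a '<' up to the next '>' (the simple method)."""
    return re.sub(r"<[^>]*>", "", markup)


# Works on tidy markup.
print(naive_strip("<p>Hello <b>world</b></p>"))              # Hello world

# A '>' inside a quoted attribute value ends the match too early.
print(naive_strip('<img src="aaa.jpg" alt="a > b">after'))   # ' b">after'

# The same happens with a '>' inside a comment.
print(naive_strip("<!-- a > b -->x"))                         # ' b -->x'

# Entities are left untouched, so the result is not yet plain text.
print(naive_strip("1 &lt; 2"))                                # 1 &lt; 2
```

So the things to watch for with this method seem to be at least: `>` inside quoted attribute values, `>` inside comments, and entities that still need decoding afterwards.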
 
David Dorward

Raven said:
For a program I'm writing, I need a function that will take an HTML file
and cut the tags out of it, leaving only the text data.

lynx --dump http://www.url.com/

Is <!-- <<<<< --> a valid comment

Yes

, or <img src="aaa.jpg" alt="<<evil alt><"> a valid image tag, for example?
No

Are "\<" and "\>" treated as the plain-text "<" and ">" characters themselves
in HTML files?

\ has no escaping function; that's what entities are for.
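
A small illustration of that point, assuming Python and its standard `html` module (not something mentioned in this thread): a backslash is passed through as an ordinary character, while `&lt;` and `&gt;` are what decode to literal `<` and `>`.

```python
import html

# A backslash is an ordinary character in HTML; it escapes nothing.
backslash_text = html.unescape(r"\<")
print(backslash_text)            # \<

# Character references (entities) are what stand for literal < and >.
decoded = html.unescape("&lt;b&gt;")
print(decoded)                   # <b>

# Going the other way: make plain text safe to embed in markup.
encoded = html.escape("1 < 2 > 0")
print(encoded)                   # 1 &lt; 2 &gt; 0
```

After cutting the tags out, the remaining text would still need a decoding step like `html.unescape` to turn entities back into plain characters.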
 
Robert Frost-Bridges

Hi.

For a program I'm writing, I need a function that will take an HTML file and
cut the tags out of it, leaving only the text data.
The obvious way is (or seems to be) searching for <, then for >, and cutting
everything between them.

There is already a PHP function along these lines that may be of some
help: http://uk.php.net/manual/en/function.strip-tags.php

although note the disclaimer, '<i>tries</i> to return a string'.
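
For anyone not using PHP, a rough equivalent can be sketched in Python with the standard library's `html.parser`. This is not a port of `strip_tags` and makes no claim to behave the same way; the names `TextExtractor` and `strip_tags` below are hypothetical, and skipping the contents of `<script>` and `<style>` is an extra assumption about what "text only" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes &lt; and &gt; into < and > in text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments and tags never reach this method; only text does.
        if not self.skip_depth:
            self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup with tags and comments removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(strip_tags("<p>1 &lt; 2</p><!-- <<<<< -->"))                # 1 < 2
print(strip_tags('<img src="aaa.jpg" alt="<<evil alt><">text'))   # text
```

Letting a parser track quoting and comments avoids having to guess where a tag really ends, which is the part the search-for-< and > method gets wrong.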
 
