cutting out the tags

Raven · Jul 26, 2003

Hi.

For a program I'm writing, I need a function that takes an HTML file and
cuts the tags out of it, leaving text-only data.
The obvious way seems to be searching for <, then for >, and cutting
everything between them.
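
A minimal sketch of that naive method, in Python purely for illustration (the thread does not say which language the program uses, and the function name strip_tags_naive is made up here). The second call uses a hypothetical tag with a > inside a quoted attribute, to show where cutting at the first > goes wrong:

```python
import re

def strip_tags_naive(html_text):
    """Remove everything from each '<' up to the next '>'."""
    return re.sub(r"<[^>]*>", "", html_text)

# Simple markup: works as expected.
print(strip_tags_naive("<p>Hello, <b>world</b>!</p>"))  # prints: Hello, world!

# A '>' inside a quoted attribute ends the match too early,
# so part of the tag leaks into the output.
print(strip_tags_naive('<img alt="2 > 1">caption'))  # prints:  1">caption
```

The leak happens because the pattern knows nothing about quotes; it stops at the first > it sees.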

Is the use of < and > characters inside tags limited in any way? I know that
in plain-text data they should normally be replaced with &lt; and &gt;
(afair), but I guess HTML documents you see on the net are not ideal, and
many HTML authors write documents with rule violations that still display
normally in all the well-known browsers.

Is  a valid comment, or <img src="aaa.jpg" alt="<<evil alt><">
a valid image tag, for example? (by valid, I mean usable without errors in
this case)

Are "\<" and "\>" treated as the plain-text "<" and ">" characters themselves
in HTML files?

And the last one: are there any other special things that I have to think
about if I am using this simple method (cutting out everything between < and
>)?

David Dorward · Jul 26, 2003

Raven said:
For a program I'm writing, I need a function that takes an HTML file and
cuts the tags out of it, leaving text-only data.

lynx --dump http://www.url.com/

Is  a valid comment
Yes

, or <img src="aaa.jpg" alt="<<evil alt><"> a valid image tag, for example?
No

Are "\<" and "\>" treated as the plain-text "<" and ">" characters themselves
in HTML files?

\ has no escaping function; that's what entities are for.
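
To illustrate the point about entities, here is a small sketch using Python's standard html module (the choice of Python is an assumption made for illustration, not something from the thread). It shows entities being decoded to the literal characters, the reverse direction, and a backslash being left alone:

```python
import html

# Entities decode to the literal characters.
print(html.unescape("1 &lt; 2 &amp;&amp; 3 &gt; 2"))  # prints: 1 < 2 && 3 > 2

# Going the other way: literal characters become entities.
print(html.escape("<b>"))  # prints: &lt;b&gt;

# A backslash is not an escape in HTML, so it passes through unchanged.
print(html.unescape(r"\<"))  # prints: \<
```

So a tag stripper that wants plain text back probably needs to decode entities after removing the tags, rather than look for backslashes.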

Robert Frost-Bridges · Jul 26, 2003

Hi.

For a program I'm writing, I need a function that takes an HTML file and
cuts the tags out of it, leaving text-only data.
The obvious way seems to be searching for <, then for >, and cutting
everything between them.

There is already a PHP function along these lines that may be of some
help: http://uk.php.net/manual/en/function.strip-tags.php

although note the disclaimer: it 'tries to return a string'.
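
For comparison, a rough sketch of the parser-based alternative, using Python's standard html.parser module. This is not how PHP's strip_tags works internally; the class TextExtractor and the function extract_text are names invented for this example, and skipping script and style content is an assumption about what "text only" should mean:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities in text
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def extract_text(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

# The image tag from the question is consumed as a single tag.
print(extract_text('A<img src="aaa.jpg" alt="<<evil alt><">B'))  # prints: AB
print(extract_text("<p>1 &lt; 2</p>"))  # prints: 1 < 2
```

Because the parser tracks quoted attribute values, angle brackets inside them do not end the tag early. How it copes with badly broken markup is a separate question and would need testing against real pages.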
