Raven
Hi.
For a program I am writing, I need a function that takes an HTML file and
strips the tags out of it, leaving only the text.
The obvious way seems to be searching for <, then for >, and cutting
everything between them.
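
To make the idea concrete, here is a minimal sketch of that approach in Python. The function name strip_tags_naive is only for illustration, and dropping the rest of the input after an unterminated < is an assumption of the sketch, not a rule from any specification.

```python
def strip_tags_naive(html: str) -> str:
    """Remove everything from each '<' up to and including the next '>'."""
    out = []
    pos = 0
    while True:
        lt = html.find("<", pos)
        if lt == -1:
            # No more tags: keep the remaining text as-is.
            out.append(html[pos:])
            break
        # Keep the text that precedes the tag.
        out.append(html[pos:lt])
        gt = html.find(">", lt + 1)
        if gt == -1:
            # Unterminated '<' (assumption: drop the rest of the input).
            break
        pos = gt + 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_tags_naive("<p>Hello, <b>world</b>!</p>"))
```

Note that this treats the first > after a < as the end of the tag, with no awareness of quoted attribute values or comments, which is what the questions below are about.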
Are there any restrictions on using the < and > characters in HTML files?
I know that in plain text data they should normally be replaced with &lt;
and &gt; (as far as I remember), but I guess the HTML documents you see on
the net are not ideal, and many HTML authors write documents that violate
the rules yet are still displayed normally by all the well-known browsers.
Is <!-- <<<<< --> a valid comment, or <img src="aaa.jpg" alt="<<evil alt><">
a valid image tag, for example? (By valid, I mean usable without errors in
this case.)
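
One way to check these two cases empirically would be to feed them to a tolerant parser and see what it reports. The sketch below uses Python's standard html.parser module. The class name TextCollector and the sample string are made up for illustration, and how one parser treats the input does not by itself say whether the markup is valid or how browsers behave.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes and record the tags and comments the parser sees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("tag", tag, attrs))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        self.text.append(data)


def extract_text(html: str) -> str:
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.text)


if __name__ == "__main__":
    sample = 'x<!-- <<<<< -->y<img src="aaa.jpg" alt="<<evil alt><">z'
    p = TextCollector()
    p.feed(sample)
    p.close()
    print(p.events)          # what the parser recognized as markup
    print("".join(p.text))   # what it kept as text
```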
Are "\<" and "\>" treated as the plain-text "<" and ">" characters
themselves in HTML files?
And the last one: are there any other special cases that I have to think
about if I use this simple method (cutting out everything between < and >)?