cutting out the tags

Discussion in 'HTML' started by Raven, Jul 26, 2003.

  1. Raven

    Raven Guest

    Hi.

    For a program I'm writing, I need a function that takes an HTML file and
    cuts the tags out of it, leaving only the text.
    The obvious way seems to be searching for <, then for >, and cutting
    everything between them.

    Is the use of the < and > characters inside tags restricted in any way? I
    know that in plain-text data they should normally be replaced with &lt; and
    &gt; (AFAIR), but I guess the HTML documents you see on the net are not
    ideal, and many HTML authors write documents that violate the rules yet are
    still displayed normally by all the well-known browsers.

    Is <!-- <<<<< --> a valid comment, or <img src="aaa.jpg" alt="<<evil alt><">
    a valid image tag, for example? (by valid, I mean usable without errors in
    this case ;) )

    Are "\<" and "\>" treated as the plain-text "<" and ">" characters
    themselves in HTML files?

    And one last question: are there any other special cases I have to think
    about if I use this simple method (cutting out everything between < and
    >)?
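
    As a rough illustration of the simple method described above, here is a
    minimal sketch in Python (the language and the function name
    strip_between_brackets are assumptions made for this example only). It
    drops everything from each < to the next >, and it also shows where the
    method can go wrong: a > inside a comment or a quoted attribute value ends
    the cut too early.

```python
def strip_between_brackets(html: str) -> str:
    """Naive method: drop everything from each '<' to the next '>'."""
    out = []
    inside = False
    for ch in html:
        if ch == "<":
            inside = True          # start (or keep) cutting
        elif ch == ">" and inside:
            inside = False         # the first '>' ends the cut
        elif not inside:
            out.append(ch)         # ordinary text is kept
    return "".join(out)


print(strip_between_brackets("<p>Hello <b>world</b></p>"))   # Hello world
# A '>' inside a comment ends the cut early and leaks markup:
print(strip_between_brackets("<!-- a > b -->kept"))          #  b -->kept
# The same happens with a '>' inside a quoted attribute value:
print(strip_between_brackets('<img src="aaa.jpg" alt="a > b">text'))
```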
    Raven, Jul 26, 2003
    #1

  2. Raven wrote:

    > For a program I'm writing, I need a function that takes an HTML file
    > and cuts the tags out of it, leaving only the text.


    lynx --dump http://www.url.com/

    > Is <!-- <<<<< --> a valid comment


    Yes

    >, or <img src="aaa.jpg" alt="<<evil alt><"> a valid image tag, for example?


    No

    > Are "\<" and "\>" treated as the plain-text "<" and ">" characters
    > themselves in HTML files?


    \ has no escaping function; that's what entities are for.
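
    As a small illustration of the entity point, assuming Python (the thread
    itself names no language): the standard library's html.unescape turns
    &lt; and &gt; back into the literal characters and leaves a backslash
    alone, while html.escape goes the other way. The variable names are made
    up for this sketch.

```python
from html import escape, unescape

# Entities are decoded into the literal characters.
decoded = unescape("1 &lt; 2 and 3 &gt; 2")
print(decoded)      # 1 < 2 and 3 > 2

# A backslash has no special meaning, so it passes through untouched.
unchanged = unescape(r"\< and \>")
print(unchanged)    # \< and \>

# Going the other way: make literal text safe to embed in HTML.
encoded = escape("a < b")
print(encoded)      # a &lt; b
```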


    --
    David Dorward http://david.us-lot.org/
    David Dorward, Jul 26, 2003
    #2

  3. On Sun, 27 Jul 2003 01:19:31 +0400, "Raven" wrote:

    >Hi.
    >
    >For a program I'm writing, I need a function that takes an HTML file and
    >cuts the tags out of it, leaving only the text.
    >The obvious way seems to be searching for <, then for >, and cutting
    >everything between them.


    There is already a PHP function along these lines that may be of some
    help: http://uk.php.net/manual/en/function.strip-tags.php

    Note the disclaimer, though: it '<i>tries</i> to return a string'.
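
    For anyone not working in PHP, a comparable idea can be sketched in Python
    with the standard library's html.parser module. This is only a sketch
    under assumptions: the class name TextExtractor and the function
    html_to_text are invented here, and skipping the contents of script and
    style elements is a design choice of this example, not something
    strip_tags is claimed to do. Letting a real parser find the tags means
    comments and quoted attribute values are handled by the parser rather
    than by a search for < and >.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, comments and declarations."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(html_to_text("<p>Hello <b>world</b></p>"))   # Hello world
print(html_to_text("<p>1 &lt; 2</p>"))             # 1 < 2
```

    Real-world pages with badly broken markup may still produce surprising
    output, so the result is best treated as a best effort, much like the
    PHP disclaimer suggests.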


    --
    frostie
    http://www.brightonfixedodds.com
    Robert Frost-Bridges, Jul 27, 2003
    #3
