Stripping html

Discussion in 'C Programming' started by Medros, Jun 12, 2006.

  1. Medros

    Medros Guest

    I understand that you can strip the HTML out of a text file so that all
    that is left is the visible information that is needed (i.e.
    everything that has < > around it is gone). My question is this: I have a
    table of information that I need to feed into a program. I need the
    program to read it just as you would read it on paper, and to be
    able to use that information as if it had been entered directly. I am
    unsure how to strip so much away, leaving just the information I want,
    and then use it the way I want. Any help?
     
    Medros, Jun 12, 2006
    #1

  2. Medros

    Morris Dovey Guest

    Medros said:

    | I understand that you can strip html out of a txt file so that all
    | the information is left is the visable information that is needed
    | (e.g. everything that has < > around is gone). My question is that
    | I have a table of information that I need to be fed into a program
    | as such. Well kind of I need the program to read it just as you
    | would on paper and be able to use that information like it was
    | entered. I am unsure how strip so much away just to leave me with
    | the information I want and then use it like I want. Any help?

    Start with a simple program that reads and saves one character at a
    time looking for a '<' character. When it finds a '<', it should throw
    it (and following characters) away until it finds a '>'. When the
    program reaches end-of-file, hopefully it's saved what you want to
    keep.
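
A minimal sketch of the approach just described, assuming the input is already in memory as a string rather than read from a file. The function name `strip_tags` is hypothetical; a file-based version would do the same thing with `getc` and `putc`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `in` to `out`, discarding everything from a '<' through the
   next '>'. `out` must have room for strlen(in) + 1 bytes.
   (Hypothetical helper, not from the thread.) */
void strip_tags(const char *in, char *out)
{
    int in_tag = 0;              /* nonzero while between '<' and '>' */

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (in_tag) {
            if (*in == '>')
                in_tag = 0;      /* tag ends; the '>' is discarded too */
        } else if (*in == '<') {
            in_tag = 1;          /* tag starts; the '<' is discarded */
        } else {
            *out++ = *in;        /* ordinary text: keep it */
        }
    }
    *out = '\0';
}
```

Called on `"<p>Hello, <b>world</b>!</p>"`, this leaves `"Hello, world!"` in the output buffer.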

    You'll probably discover that you want to add refinements (perhaps to
    deal with HTML encodings like &nbsp; and &lt;) - but those can wait
    until the initial version is working.

    --
    Morris Dovey
    DeSoto Solar
    DeSoto, Iowa USA
    http://www.iedu.com/DeSoto
     
    Morris Dovey, Jun 12, 2006
    #2

  3. Medros said:

    > I understand that you can strip html out of a txt file so that all the
    > information is left is the visable information that is needed (e.g.
    > everything that has < > around is gone). My question is that I have a
    > table of information that I need to be fed into a program as such. Well
    > kind of I need the program to read it just as you would on paper and be
    > able to use that information like it was entered. I am unsure how strip
    > so much away just to leave me with the information I want and then use
    > it like I want. Any help?


    If the HTML is well-produced, mostly you can simply read characters one by
    one. If you hit a '<' character, discard it, and keep discarding everything
    until you hit a '>', which again you can discard.

    If you hit a & character, though, you have some work to do. You'll need to
    save up characters until you hit a semicolon.

    The characters between the & and the ; form a keyword, e.g. &amp; for
    ampersand, &lt; for '<', &gt; for '>', &copy; for the copyright symbol, and
    so on. You will need to have some kind of lookup in your program for
    matching these keywords with their replacements.
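
One possible shape for that lookup, as a sketch only. The table below holds just a handful of names, `"(c)"` is an assumed ASCII stand-in for the copyright symbol, and the names `entities`, `lookup_entity` and `decode_entities` are hypothetical. Unknown or unterminated sequences are copied through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A deliberately tiny table; HTML defines many more entity names. */
static const struct {
    const char *name;   /* text between '&' and ';' */
    const char *text;   /* what to emit instead */
} entities[] = {
    { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
    { "nbsp", " " }, { "copy", "(c)" }   /* "(c)" is an assumed stand-in */
};

/* Return the replacement for a name of length `len`, or NULL if unknown. */
static const char *lookup_entity(const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
        if (strlen(entities[i].name) == len &&
            memcmp(entities[i].name, name, len) == 0)
            return entities[i].text;
    }
    return NULL;
}

/* Copy `in` to `out`, replacing known "&name;" sequences.
   Every replacement here is shorter than the sequence it replaces,
   so `out` needs at most strlen(in) + 1 bytes. */
void decode_entities(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == '&') {
            const char *semi = strchr(in, ';');
            const char *text = semi != NULL
                ? lookup_entity(in + 1, (size_t)(semi - in) - 1)
                : NULL;
            if (text != NULL) {
                size_t n = strlen(text);
                memcpy(out, text, n);
                out += n;
                in = semi + 1;   /* continue after the ';' */
                continue;
            }
        }
        *out++ = *in++;          /* not a known entity: copy as-is */
    }
    *out = '\0';
}
```

If this is combined with tag stripping, it is probably safest to decode entities after the tags are gone, so that a decoded `&lt;` is not mistaken for the start of a tag.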

    If you hit a space character, preserve it, but then discard all remaining
    whitespace until the next non-whitespace character.
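
A sketch of that whitespace rule, with one assumption beyond the text: any whitespace character (space, tab, newline) starts a run, and the whole run is written out as a single space. The name `collapse_whitespace` is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy `in` to `out`, writing each run of whitespace as one space.
   `out` must have room for strlen(in) + 1 bytes. */
void collapse_whitespace(const char *in, char *out)
{
    int prev_space = 0;          /* nonzero if the last char seen was whitespace */

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (isspace((unsigned char)*in)) {
            if (!prev_space)
                *out++ = ' ';    /* first whitespace of a run: keep one space */
            prev_space = 1;      /* the rest of the run is discarded */
        } else {
            *out++ = *in;
            prev_space = 0;
        }
    }
    *out = '\0';
}
```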

    These simple rules will give you a basic translation into English, but you
    have to be a bit cleverer if you want to split text into paragraphs and so
    on, by interpreting tags such as <BR>, <P>, <TD> etc -- at which point you
    won't be too far away from having your own text-only but otherwise
    full-blown HTML renderer.

    If the HTML is /not/ well-produced, the above may not be sufficient.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
     
    Richard Heathfield, Jun 12, 2006
    #3
  4. Medros

    Bill Latvin Guest

    On Sun, 11 Jun 2006 21:46:03 -0500, "Morris Dovey" <>
    wrote:

    >Medros (in )
    >said:
    >
    >| I understand that you can strip html out of a txt file so that all
    >| the information is left is the visable information that is needed
    >| (e.g. everything that has < > around is gone). My question is that
    >| I have a table of information that I need to be fed into a program
    >| as such. Well kind of I need the program to read it just as you
    >| would on paper and be able to use that information like it was
    >| entered. I am unsure how strip so much away just to leave me with
    >| the information I want and then use it like I want. Any help?
    >
    >Start with a simple program that reads and saves one character at a
    >time looking for a '<' character. When it finds a '<', it should throw
    >it (and following characters) away until it finds a '>'. When the
    >program reaches end-of-file, hopefully it's saved what you want to
    >keep.
    >

    I remember starting with a simple program like that, and finding to my
    dismay that between the "script" and "/script" tags the '<' and '>'
    characters are used not as tag delimiters but as "less than" and
    "greater than" comparison operators. I had to check for those particular
    tags and discard everything between them, and not let the presence of
    a lone unbalanced '<' in the script cause my logic to miss finding the
    "/script" tag.
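
A sketch of one way to handle this, assuming in-memory strings and ASCII tag names. Once an opening script tag is seen, the code searches only for the closing tag, so a lone '<' or '>' inside the script cannot start or end a "tag". The names `starts_with_ci` and `strip_tags_and_scripts` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if `s` begins with `prefix`, ignoring ASCII case. */
static int starts_with_ci(const char *s, const char *prefix)
{
    for (; *prefix != '\0'; s++, prefix++) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;            /* also stops safely at the end of `s` */
    }
    return 1;
}

/* Like a plain tag stripper, but also discards everything between
   <script ...> and </script>. `out` needs strlen(in) + 1 bytes. */
void strip_tags_and_scripts(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (starts_with_ci(in, "<script") &&
            (in[7] == '>' || isspace((unsigned char)in[7]))) {
            /* Skip the script body, looking only for the closing tag. */
            in += 7;
            while (*in != '\0' && !starts_with_ci(in, "</script"))
                in++;
            /* `in` is now at "</script..." (or at the end of input);
               the code below discards it like any other tag. */
        }
        if (*in == '<') {
            while (*in != '\0' && *in != '>')
                in++;            /* discard the tag's contents */
            if (*in == '>')
                in++;            /* and its closing '>' */
        } else if (*in != '\0') {
            *out++ = *in++;      /* ordinary text: keep it */
        }
    }
    *out = '\0';
}
```

The same treatment would presumably be wanted for `<style>` blocks and comments; they are left out here to keep the sketch short.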

    Bill
     
    Bill Latvin, Jun 12, 2006
    #4
  5. Medros

    Morris Dovey Guest

    Bill Latvin said:

    | On Sun, 11 Jun 2006 21:46:03 -0500, "Morris Dovey"
    | <> wrote:
    |
    || Medros (in )
    || said:
    ||
    ||| I understand that you can strip html out of a txt file so that all
    ||| the information is left is the visable information that is needed
    ||| (e.g. everything that has < > around is gone). My question is that
    ||| I have a table of information that I need to be fed into a program
    ||| as such. Well kind of I need the program to read it just as you
    ||| would on paper and be able to use that information like it was
    ||| entered. I am unsure how strip so much away just to leave me with
    ||| the information I want and then use it like I want. Any help?
    ||
    || Start with a simple program that reads and saves one character at a
    || time looking for a '<' character. When it finds a '<', it should
    || throw it (and following characters) away until it finds a '>'.
    || When the program reaches end-of-file, hopefully it's saved what
    || you want to keep.
    ||
    | I remember starting with a simple program like that, and finding to
    | my dismay that between the "script" and "/script" tags the '<' and
    | '>' characters are used not as tag delimiters but as "less than"
    | and "greater than" comparison operators. I had to check for those
    | particular tags and discard everything between them, and not let
    | the presence of a lone unbalanced '<' in the script cause my logic
    | to miss finding the "/script" tag.

    Welcome to the club. It's because of things like that that I added my
    second paragraph:

    "You'll probably discover that you want to add refinements (perhaps to
    deal with HTML encodings like &nbsp; and &lt;) - but those can wait on
    getting the initial version working."

    The refinements will depend on whether the OP wants a general solution
    or just enough to extract data from one particular page. On
    re-reading, I'd guess that handling <table>, <tr>, and <td> tags may be
    his first refinement - but the question indicated that he'll probably
    need to start at the most basic level.
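
As a rough sketch of what that table refinement could look like: the hypothetical function `table_to_tsv` below drops every tag, but writes a tab when a cell closes and a newline when a row closes, so each table row becomes one line of tab-separated fields. It assumes the closing tags appear exactly as `</td>`, `</th>` and `</tr>` (in any letter case) and that the input is already in memory.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if `s` begins with `prefix`, ignoring ASCII case. */
static int starts_with_ci(const char *s, const char *prefix)
{
    for (; *prefix != '\0'; s++, prefix++) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
    }
    return 1;
}

/* Copy `in` to `out` without tags; a closing cell tag becomes '\t' and
   a closing row tag becomes '\n'. `out` needs strlen(in) + 1 bytes. */
void table_to_tsv(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == '<') {
            if (starts_with_ci(in, "</td>") || starts_with_ci(in, "</th>"))
                *out++ = '\t';   /* end of cell */
            else if (starts_with_ci(in, "</tr>"))
                *out++ = '\n';   /* end of row */
            while (*in != '\0' && *in != '>')
                in++;            /* discard the tag itself */
            if (*in == '>')
                in++;
        } else {
            *out++ = *in++;      /* cell text: keep it */
        }
    }
    *out = '\0';
}
```

Every cell, including the last one in a row, is followed by a tab, so a two-column row comes out as `1\t2\t\n`. Whitespace between tags in the source is copied through, so real pages would likely need the whitespace handling discussed earlier in the thread as well.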

    --
    Morris Dovey
    DeSoto Solar
    DeSoto, Iowa USA
    http://www.iedu.com/DeSoto
     
    Morris Dovey, Jun 12, 2006
    #5
  6. Medros

    Ian Collins Guest

    Richard Heathfield wrote:
    > Medros said:
    >
    >
    >>I understand that you can strip html out of a txt file so that all the
    >>information is left is the visable information that is needed (e.g.
    >>everything that has < > around is gone). My question is that I have a
    >>table of information that I need to be fed into a program as such. Well
    >>kind of I need the program to read it just as you would on paper and be
    >>able to use that information like it was entered. I am unsure how strip
    >>so much away just to leave me with the information I want and then use
    >>it like I want. Any help?

    >
    >
    > If the HTML is well-produced, mostly you can simply read characters one by
    > one. If you hit a '<' character, discard it, and keep discarding everything
    > until you hit a '>', which again you can discard.
    >
    > If you hit a & character, though, you have some work to do. You'll need to
    > save up characters until you hit a semicolon.
    >
    > The characters between the & and the ; form a keyword, e.g. &amp; for
    > ampersand, &lt; for '<', &gt; for '>', &copy; for the copyright symbol, and
    > so on. You will need to have some kind of lookup in your program for
    > matching these keywords with their replacements.
    >
    > If you hit a space character, preserve it, but then discard all remaining
    > whitespace until the next non-whitespace character.
    >
    > These simple rules will give you a basic translation into English, but you
    > have to be a bit cleverer if you want to split text into paragraphs and so
    > on, by interpreting tags such as <BR>, <P>, <TD> etc -- at which point you
    > won't be too far away from having your own text-only but otherwise
    > full-blown HTML renderer.
    >
    > If the HTML is /not/ well-produced, the above may not be sufficient.
    >

    HTMLtidy (http://tidy.sourceforge.net/) is your friend in these cases.
    This little program has prevented much pain and suffering!

    --
    Ian Collins.
     
    Ian Collins, Jun 12, 2006
    #6
  7. "Medros" <> writes:

    > I understand that you can strip html out of a txt file so that all the
    > information is left is the visable information that is needed (e.g.
    > everything that has < > around is gone). My question is that I have a
    > table of information that I need to be fed into a program as such. Well
    > kind of I need the program to read it just as you would on paper and be
    > able to use that information like it was entered. I am unsure how strip
    > so much away just to leave me with the information I want and then use
    > it like I want. Any help?


    lynx -dump ?

    Asbjørn
     
    Asbjørn Sæbø, Jun 12, 2006
    #7