how to get the list of words from html files

Discussion in 'Perl Misc' started by BeHealthy@gmail.com, Oct 9, 2005.

  1. Guest

    I would like to get the list of words from an HTML file, so I need to
    remove the HTML tags and the punctuation before I split the string.

    perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
    remove tags, but it doesn't work. Why is that? I use s/<[^>]*>//
    instead. Is this regular expression right?

    I'm also using s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g to remove punctuation
    and digits. Any other simple way to do that? Thanks.
     
    , Oct 9, 2005
    #1

  2. wrote:
    > perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
    > remove tags, but it doesn't work, why is that?


    Because you have an HTML error on line 89.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Oct 9, 2005
    #2

  3. Dr.Ruud Guest

    wrote:

    > [...]


    See perlfaq9: use a parser.
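A minimal sketch of the parser route, assuming the CPAN module HTML::Parser is installed (it is not part of core Perl). The helper name html_to_text, the sample markup, and the ASCII-letters word rule are illustrative assumptions, not something from the thread.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, not core Perl

# Hypothetical helper: collect the decoded text of an HTML string,
# skipping the contents of <script> and <style> elements.
sub html_to_text {
    my ($html) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes text with entities such as &amp; already decoded
        text_h      => [ sub { push @chunks, shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    return join ' ', @chunks;
}

# The '>' inside the attribute value is handled by the parser.
my $text  = html_to_text('<p title="a > b">Hello, <b>world</b>!</p>');
my @words = $text =~ /[A-Za-z]+/g;   # assumption: words are runs of ASCII letters
print "@words\n";
```

The parser decides where tags begin and end, so the word-splitting step only ever sees text content.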

    --
    Affijn, Ruud <http://www.pandora.com/?sc=sh770781&cmd=tunermini>

    "Gewoon is een tijger."
     
    Dr.Ruud, Oct 9, 2005
    #3
  4. "" <> writes:

    > perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
    > remove tags


    No it doesn't. It gives that as a "simple minded" approach that will work
    for many files, but will fail for many others.

    perldoc -q "remove html"

    > but it doesn't work, why is that?


    The above FAQ, in the paragraph immediately before the above regex, explains
    several cases where it will fail. In the paragraph before *that*, it suggests
    better alternatives.

    Actually *reading* the FAQ works better than blindly copying examples from it
    and hoping for the best.

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
     
    Sherm Pendley, Oct 9, 2005
    #4
  5. <> wrote:
    > I need to
    > remove the html tags and the punctuation



    > perldoc suggests



    No it doesn't. You should read the surrounding text more carefully.


    > using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
                         ^
                         ^
    > remove tags, but it doesn't work, why is that?



    Did you copy/paste that code? That "vertical bar" (the alternation
    operator) is a broken bar (¦), not the ASCII vertical bar (|) that
    the regex needs.
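A small sketch of why the wrong bar matters, assuming the pasted character is U+00A6 (broken bar). The sample HTML string is made up for illustration. With the broken bar the group has no alternation, so the pattern can only ever match a literal "<>".

```perl
use strict;
use warnings;
use utf8;   # this source file contains a literal U+00A6

my $html = q{<a href="x.html" title="1 > 0">link</a> text};

# Broken bar (U+00A6): an ordinary literal character, so nothing matches here.
(my $broken = $html) =~ s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs;

# ASCII vertical bar (U+007C): the alternation the FAQ regex intends.
(my $fixed = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

printf "broken bar is U+%04X, vertical bar is U+%04X\n", ord('¦'), ord('|');
print "with broken bar:   $broken\n";   # input comes back unchanged
print "with vertical bar: $fixed\n";    # tags removed
```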


    > I use s/<[^>]*>//
    > instead, is this regular expression right?



    There does not exist a regular expression that is "right" for
    reliably removing HTML markup.

    You might be able to find a regex that is "good enough", knowing
    that it will occasionally fail. Only you can decide how robust
    it must be for your application.
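To make "good enough, but will occasionally fail" concrete, here is a hedged sketch of the simple pattern from the question with /g added (without /g only the first tag is removed). The function name and the sample inputs are illustrative assumptions.

```perl
use strict;
use warnings;

# Naive tag stripper: delete everything from '<' up to the next '>'.
sub strip_tags_naive {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;   # /g removes every match, not just the first
    return $html;
}

# Plain markup: works as hoped.
my $easy    = strip_tags_naive('<p>Hello <b>world</b></p>');

# A '>' inside an attribute value ends the "tag" too early.
my $tricky  = strip_tags_naive('<img src="x.png" alt="a > b">after');

# A '>' inside a comment does the same.
my $comment = strip_tags_naive('<!-- if (a > b) -->visible');

print "[$easy]\n[$tricky]\n[$comment]\n";
```

The second and third results still contain leftover markup, which is the kind of failure you have to decide whether you can live with.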


    > I'm also using s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g to remove punctuation
    > and digits.



    (there is no need to backslash the double quote character there.)


    To remove "some punctuation" you mean. There are lots of punctuation
    characters that you do not remove.

    You might want to turn it around to say what characters you want
    to keep, rather than what characters you want to discard...
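A sketch of the keep-list idea, assuming a "word" is a run of ASCII letters with optional internal apostrophes. That definition and the sample sentence are assumptions; adjust the character class to whatever you actually want to keep.

```perl
use strict;
use warnings;

my $text = "It's 2005: stop (really!) parsing HTML with regexes... OK?";

# Say what to keep: letters, plus apostrophes between letters.
# Everything else (digits, punctuation, whitespace) is a separator.
my @words = $text =~ /[A-Za-z]+(?:'[A-Za-z]+)*/g;

print join(' ', @words), "\n";
```

Matching in list context with /g returns the words directly, so no separate punctuation-removal pass or split is needed.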


    > Any other simple way to do that?



    You don't even need (or want) regular expressions for
    replacing "characters" (rather than "strings").

    For replacing characters you probably should use:

    perldoc -f tr


    Here's one that removes the same characters as your s///g does:

    tr/.?!,":()0-9/ /s;
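A quick side-by-side check of the two forms on a made-up sample line (the sample text is an assumption). The replacement list of tr is shorter than the search list, so its last character, the space, is reused for every search character, and /s squashes each run of translated characters into one space.

```perl
use strict;
use warnings;

my $line = 'Call 911, now!! (Or "wait"?) 3.14: fine.';

# The substitution from the question (double quote left unescaped).
(my $via_s  = $line) =~ s/(\.|\?|!|,|"|:|\(|\)|\d)+/ /g;

# The transliteration: same character set, no regex engine involved.
(my $via_tr = $line) =~ tr/.?!,":()0-9/ /s;

print "identical results\n" if $via_s eq $via_tr;

# split ' ' skips leading whitespace and splits on runs of whitespace.
my @words = split ' ', $via_tr;
print join('|', @words), "\n";
```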


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Oct 9, 2005
    #5
