How to get the list of words from HTML files

BeHealthy

I would like to get the list of words from an HTML file, so I need to
remove the html tags and the punctuation before I split the string.

perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
remove tags, but it doesn't work, why is that? I use s/<[^>]*>//
instead, is this regular expression right?

I'm also using s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g to remove punctuation
and digits. Any other simple way to do that? Thanks.
 
Gunnar Hjalmarsson

perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
remove tags, but it doesn't work, why is that?

Because you have an HTML error on line 89.
 
Sherm Pendley

perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
remove tags

No it doesn't. It gives that as a "simple minded" approach that will work
for many files, but will fail for many others.

perldoc -q "remove html"

but it doesn't work, why is that?

The above FAQ, in the paragraph immediately before the above regex, explains
several cases where it will fail. In the paragraph before *that*, it suggests
better alternatives.

Actually *reading* the FAQ works better than blindly copying examples from it
and hoping for the best.

sherm--
 
Tad McClellan

I need to
remove the html tags and the punctuation

perldoc suggests


No it doesn't. You should read the surrounding text more carefully.

using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
                   ^
remove tags, but it doesn't work, why is that?


Did you copy/paste that code? That "vertical bar" (or) isn't
the right vertical bar character: the pattern contains "¦" (broken
bar) where the alternation operator "|" belongs.
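
To make the effect of the wrong bar character concrete, here is a minimal sketch. The sub names and the sample string are made up for illustration; the first pattern is the one quoted above with an ASCII "|", the second has the broken bar written as \x{A6} so the source stays plain ASCII.

```perl
use strict;
use warnings;

# Quoted pattern with a real ASCII "|" as the alternation operator.
sub strip_tags_faq {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Same pattern with the pasted broken bar (U+00A6). It is an ordinary
# literal character, so the alternation is gone and real tags never match.
sub strip_tags_broken_bar {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*\x{A6}(['"]).*?\1)*>//gs;
    return $html;
}

my $sample = q{<a title="x > y">link</a> text};
print strip_tags_faq($sample), "\n";         # tags removed
print strip_tags_broken_bar($sample), "\n";  # string comes back unchanged
```

Even with the right bar this is still the "simple minded" approach, so the caveats about files it fails on continue to apply.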

I use s/<[^>]*>//
instead, is this regular expression right?


There does not exist a regular expression that is "right" for
reliably removing HTML markup.

You might be able to find a regex that is "good enough", knowing
that it will occasionally fail. Only you can decide how robust
it must be for your application.
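
A small sketch of where the short regex stops being "good enough". The sub names and input strings are hypothetical test cases, not taken from the original file.

```perl
use strict;
use warnings;

# The substitution exactly as posted: no /g, so only the first tag goes.
sub strip_first_tag {
    my ($s) = @_;
    $s =~ s/<[^>]*>//;
    return $s;
}

# Same pattern with /g, so every match is removed.
sub strip_tags_naive {
    my ($s) = @_;
    $s =~ s/<[^>]*>//g;
    return $s;
}

print strip_first_tag('<p>Hello <b>world</b></p>'), "\n";   # later tags remain
print strip_tags_naive('<p>Hello <b>world</b></p>'), "\n";  # fine for simple markup

# A ">" inside a quoted attribute ends the match too early,
# leaving part of the tag behind in the text.
print strip_tags_naive('<img alt="a > b" src="x.png">caption'), "\n";

# Entities are not tags, so they pass through untouched.
print strip_tags_naive('AT&amp;T'), "\n";
```

Whether leftovers like these matter depends on the input you expect, which is the robustness decision mentioned above.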

I'm also using s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g to remove punctuation
and digits.


(there is no need to backslash the double quote character there.)


To remove "some punctuation" you mean. There are lots of punctuation
characters that you do not remove.

You might want to turn it around to say what characters you want
to keep, rather than what characters you want to discard...
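
A minimal sketch of the "say what to keep" idea, assuming that only ASCII letters count as word characters (an assumption for illustration; the sub name is made up). The /c modifier complements the list, so everything that is not a letter becomes a space, and /s squeezes each run of replaced characters into one.

```perl
use strict;
use warnings;

# Keep ASCII letters; turn every other character into a space,
# then split on whitespace to get the word list.
sub words_keeping_letters {
    my ($text) = @_;
    (my $clean = $text) =~ tr/a-zA-Z/ /cs;
    return split ' ', $clean;
}

my @words = words_keeping_letters('Hello, world! 42 times.');
print join("\n", @words), "\n";
```

Note that an apostrophe is discarded too, so "It's" comes out as two words; adding ' to the kept list changes that.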

Any other simple way to do that?


You don't even need (or want) regular expressions for
replacing "characters" (rather than "strings").

For replacing characters you probably should use:

perldoc -f tr


Here's one that removes the same characters as your s///g does:

tr/.?!,":()0-9/ /s;
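
As a quick check, the two forms can be run side by side on the same string. This is only a sketch; the sub names and the sample text are hypothetical.

```perl
use strict;
use warnings;

# The original substitution from the question.
sub clean_with_subst {
    my ($s) = @_;
    $s =~ s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g;
    return $s;
}

# The tr version: same characters, each run squeezed to one space by /s.
sub clean_with_tr {
    my ($s) = @_;
    $s =~ tr/.?!,":()0-9/ /s;
    return $s;
}

my $text = 'Wait... what?! (see page 12): "done"';
print "same result\n" if clean_with_subst($text) eq clean_with_tr($text);

# Splitting on whitespace then gives the word list.
my @words = split ' ', clean_with_tr($text);
print join(',', @words), "\n";
```

Characters outside the list, such as ; or -, are left alone by both versions.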
 
