how to get the list of words from html files

BeHealthy · Oct 9, 2005

I would like to get the list of words from an HTML file, so I need to
remove the html tags and the punctuation before I split the string.

perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
remove tags, but it doesn't work, why is that? I use s/<[^>]*>//
instead, is this regular expression right?

I'm also using s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g to remove punctuation
and digits. Any other simple way to do that? Thanks.

Gunnar Hjalmarsson · Oct 9, 2005

perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
remove tags, but it doesn't work, why is that?

Because you have an HTML error on line 89.

Dr.Ruud · Oct 9, 2005

(e-mail address removed) wrote:

[...]

See perlfaq9: use a parser.
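
For illustration, here is a minimal sketch of the parser route using HTML::Parser from CPAN (not a core module). The helper name html_to_text and the "a word is a run of ASCII letters" rule are assumptions made for this example, not something the FAQ prescribes.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical helper: return the visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands over text with entities such as &amp; already decoded.
        # A space is appended so words in adjacent elements do not run together.
        text_h => [ sub { $text .= $_[0] . ' ' }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));    # skip non-visible content
    $parser->parse($html);
    $parser->eof;
    return $text;
}

my $html  = '<p title="a > b">Hello, <b>world</b>!</p><script>var x = 1;</script>';
my @words = html_to_text($html) =~ /[A-Za-z]+/g;    # assumed definition of a word
print join(',', @words), "\n";
```

One trade-off of appending a space after every text chunk: a word split by inline markup, such as w<b>or</b>d, comes out as three pieces.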

Sherm Pendley · Oct 9, 2005

(e-mail address removed) said:
perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
remove tags

No it doesn't. It gives that as a "simple minded" approach that will work
for many files, but will fail for many others.

perldoc -q "remove html"

but it doesn't work, why is that?

The above FAQ, in the paragraph immediately before the above regex, explains
several cases where it will fail. In the paragraph before *that*, it suggests
better alternatives.
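
To make the failure modes concrete, here is a small sketch that runs the one-line s/<[^>]*>// from the question (with /g added) over inputs of the kind that trip up simple patterns. The helper name strip_tags_naive and the sample strings are made up for this example; they are not quoted from the FAQ.

```perl
use strict;
use warnings;

# The one-line tag stripper from the question, applied globally.
sub strip_tags_naive {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# A plain tag is removed as expected.
print strip_tags_naive('<b>bold</b> text'), "\n";

# A ">" inside a quoted attribute ends the match too early,
# so part of the attribute leaks into the output.
print strip_tags_naive('<img alt="a > b" src="x.png">caption'), "\n";

# A ">" inside a comment has the same effect.
print strip_tags_naive('<!-- if x > y -->visible'), "\n";
```

The second and third calls leave fragments such as ` b" src="x.png">` in the text, which would then show up in the word list.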

Actually *reading* the FAQ works better than blindly copying examples from it
and hoping for the best.

sherm--

Tad McClellan · Oct 9, 2005

I need to
remove the html tags and the punctuation

perldoc suggests

No it doesn't. You should read the surrounding text more carefully.

using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
                   ^
                   ^
remove tags, but it doesn't work, why is that?

Did you copy/paste that code? That "vertical bar" (the "or" operator)
isn't an ASCII vertical bar (|). It is a broken bar (¦), which a regex
treats as a literal character.
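
A short sketch of the difference, assuming the pasted character is U+00A6 (BROKEN BAR). It is written as \x{A6} here so the example does not depend on the source file's encoding; the sample string is made up.

```perl
use strict;
use warnings;

my $html = '<b>bold</b>';

# As pasted: U+00A6 BROKEN BAR is an ordinary literal character in a regex,
# so the pattern only matches tags that actually contain that character.
(my $pasted = $html) =~ s/<(?:[^>'"]*\x{A6}(['"]).*?\1)*>//gs;

# As intended: "|" is the alternation operator.
(my $fixed = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "pasted: $pasted\n";    # tags still present
print "fixed:  $fixed\n";     # tags removed
```

Even with the correct character, this is still the "simple minded" pattern and shares its limitations.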

I use s/<[^>]*>//
instead, is this regular expression right?

There does not exist a regular expression that is "right" for
reliably removing HTML markup.

You might be able to find a regex that is "good enough", knowing
that it will occasionally fail. Only you can decide how robust
it must be for your application.

I'm also using s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g to remove punctuation
and digits.

(There is no need to backslash the double quote character there.)

To remove "some punctuation" you mean. There are lots of punctuation
characters that you do not remove.

You might want to turn it around to say what characters you want
to keep, rather than what characters you want to discard...
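
As a sketch of that "say what to keep" idea, assuming a word is simply a run of ASCII letters (widen the class if apostrophes, hyphens or accented letters should count); the sample text is made up:

```perl
use strict;
use warnings;

my $text = 'Hello, world! (It is 2005: "words" only, please?)';

# Option 1: match the characters to keep and collect every run of them.
my @words = $text =~ /[A-Za-z]+/g;

# Option 2: tr with /c (complement) and /s (squeeze) turns every run of
# non-letters into a single space; split ' ' then yields the words.
(my $clean = $text) =~ tr/A-Za-z/ /cs;
my @words2 = split ' ', $clean;

print "@words\n";
print "@words2\n";
```

Both lines print the same list: Hello world It is words only please.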

Any other simple way to do that?

You don't even need (or want) regular expressions for
replacing "characters" (rather than "strings").

For replacing characters you probably should use:

perldoc -f tr

Here's one that removes the same characters as your s///g does:

tr/.?!,":()0-9/ /s;
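
A quick way to check the two side by side, as a sketch: the sample text is made up, and the character list is the one from the s///g in the question (. ? ! , " : ( ) and digits).

```perl
use strict;
use warnings;

my $text = 'Wait... what?! (Call 911, "now"): done';

# The substitution from the question (double quote left unescaped).
(my $by_regex = $text) =~ s/(\.|\?|!|,|"|:|\(|\)|\d)+/ /g;

# The same character set as a transliteration; /s squeezes each run of
# replaced characters into a single space.
(my $by_tr = $text) =~ tr/.?!,":()0-9/ /s;

print "same result\n" if $by_regex eq $by_tr;

# split ' ' ignores leading whitespace and splits on runs of whitespace.
my @words = split ' ', $by_tr;
print "@words\n";
```

Neither version touches characters outside that list, so other punctuation (semicolons, hyphens and so on) would still end up attached to the words.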
