BeHealthy
I would like to get the list of words from an HTML file, so I need to
remove the HTML tags and the punctuation before I split the string.
perldoc suggests using s/<(?:[^>'"]*¦(['"]).*?\1)*>//gs to
remove tags, but it doesn't work. Why is that? I use s/<[^>]*>//
instead. Is this regular expression right?
I'm also using s/(\.|\?|!|,|\"|:|\(|\)|\d)+/ /g to remove punctuation
and digits. Is there a simpler way to do that? Thanks.
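
For reference, here is a minimal sketch of the whole pipeline I am aiming for. It assumes the alternation in the perldoc pattern is the ordinary | character, that a POSIX character class is an acceptable stand-in for my list of punctuation marks, and that the HTML is already in a string. The sub name words_from_html and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip tags, then punctuation and digits, then split.
sub words_from_html {
    my ($text) = @_;

    # Tag removal, assuming the perldoc pattern with a plain '|' alternation.
    # The quoted-string branch lets attribute values contain '>'.
    # Tags become a space so adjacent words are not glued together.
    $text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>/ /gs;

    # Assumption: [[:punct:]] covers the punctuation I listed by hand
    # (and more); \d removes digits.
    $text =~ s/[[:punct:]\d]+/ /g;

    # split ' ' ignores leading whitespace and splits on whitespace runs.
    return split ' ', $text;
}

# Made-up sample input; a real script would read the file contents instead.
my $html  = '<p class="a>b">Hello, world! It is 2 PM.</p>';
my @words = words_from_html($html);
print join("|", @words), "\n";   # prints Hello|world|It|is|PM
```

Note that this only handles simple markup. Comments, script blocks, and entities such as &amp; are not dealt with here, so a real HTML parser may be the safer route.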