Tricky regex (exclude some multiple characters)

Uncle_Fester

I want to test for "things that look more or less like real English
words" from parsed hypertext.

I know that

while ($text =~ /([A-Za-z0-9_\'\-]+)/g )

will catch most of what I want most of the time.

The tricky bit is this :

How might I allow 'oo' and 'ee' and not 'ff' or '--' ?
How might I exclude patterns like '_________' or '010101010101' ?

Any thoughts?
 
David Squire

Uncle_Fester said:
I want to test for "things that look more or less like real English
words" from parsed hypertext.

I know that

while ($text =~ /([A-Za-z0-9_\'\-]+)/g )

Why the capturing parens? Why + as well as g?
will catch most of what I want most of the time.

The tricky bit is this :

How might I allow 'oo' and 'ee' and not 'ff' or '--' ?
How might I exclude patterns like '_________' or '010101010101' ?

Any thoughts?

I don't know that you can do this in one regex. One approach would be to
use a dictionary (and an interface such as Lingua::Ispell).
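As a rough illustration of the dictionary idea, the sketch below checks each token against a small in-memory word set. The word list and the helper name `looks_like_word` are made up for this example; a real setup would load a proper word list or call a spell-checker interface instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real dictionary: a tiny in-memory word set.
# In practice this would come from a word list file or a spell checker.
my %dictionary = map { $_ => 1 } qw(food feed the quick brown fox don't);

# looks_like_word() is an illustrative name, not part of any module.
sub looks_like_word {
    my ($candidate) = @_;
    return exists $dictionary{ lc $candidate } ? 1 : 0;
}

my $text = "The quick fxxd 0101 fox don't feed ____";
while ( $text =~ /([A-Za-z0-9_'-]+)/g ) {
    my $token = $1;
    print "$token\n" if looks_like_word($token);
}
```

The lookup is case-insensitive because the token is lower-cased before the hash test; whether that is desirable depends on the text being parsed.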

A quick and dirty approach would be to do a second test that rejects
anything that contains too many non-alphabetic characters (say 2?), e.g.
(non-tested):

my $num_non_alpha = () = $text =~ m/[0-9_'-]/g; # neither ' nor a trailing - is special inside a character class
next if $num_non_alpha > 2;


DS
 
David Squire

David said:
Uncle_Fester said:
I want to test for "things that look more or less like real English
words" from parsed hypertext.

I know that

while ($text =~ /([A-Za-z0-9_\'\-]+)/g )

Why the capturing parens? Why + as well as g?
will catch most of what I want most of the time.

The tricky bit is this :

How might I allow 'oo' and 'ee' and not 'ff' or '--' ?
How might I exclude patterns like '_________' or '010101010101' ?

Any thoughts?

I don't know that you can do this in one regex. One approach would be to
use a dictionary (and an interface such as Lingua::Ispell).

A quick and dirty approach would be to do a second test that rejects
anything that contains too many non-alphabetic characters (say 2?), e.g.
(non-tested):

my $num_non_alpha = () = $text =~ m/[0-9_'-]/g; # neither ' nor a trailing - is special inside a character class
next if $num_non_alpha > 2;

.... and of course you can do a similar thing to exclude things such as
invalid repeated chars...
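A minimal sketch of such a repeated-character test might look like the following. The allow-list (only doubled e and o are tolerated), the chunk sizes, and the name `has_bad_repeat` are assumptions drawn from the examples in the question, not a tested rule set.

```perl
use strict;
use warnings;

# Doubled characters that are tolerated; any other doubled character is rejected.
# This allow-list is an assumption based on the 'oo' / 'ee' examples.
my %allowed_double = map { $_ => 1 } qw(e o);

# has_bad_repeat() is a hypothetical helper name.
sub has_bad_repeat {
    my ($token) = @_;

    # Any single character immediately repeated, unless it is allow-listed.
    while ( $token =~ /(.)\1/g ) {
        return 1 unless $allowed_double{ lc $1 };
    }

    # A chunk of 2 or 3 characters occurring three or more times in a row,
    # which catches strings like '010101010101'.
    return 1 if $token =~ /(.{2,3})\1{2,}/;

    return 0;
}

for my $token (qw(food feed off f--d _________ 010101010101 banana)) {
    printf "%-14s %s\n", $token, has_bad_repeat($token) ? 'reject' : 'keep';
}
```

Requiring three copies of a chunk, rather than two, is a deliberate choice here so that ordinary words such as 'banana' survive; the cost is that a short token like '0101' is kept.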
 
usenet

Uncle_Fester said:
How might I allow 'oo' and 'ee' and not 'ff' or '--' ?

That sounds somewhat arbitrary. Why would you want to exclude 'ff'?
Is it your intention to allow double vowels but not double consonants?
What specific ruleset determines what to keep and what to exclude? You
need to know the exact ruleset to write the regexp.

FWIW, you may wish to see the kind and helpful responses to a question
I posted earlier which has some things in common with your query:

http://tinyurl.com/of73s
 
Xicheng Jia

Uncle_Fester said:
I want to test for "things that look more or less like real English
words" from parsed hypertext.

I know that

while ($text =~ /([A-Za-z0-9_\'\-]+)/g )

will catch most of what I want most of the time.

The tricky bit is this :

How might I allow 'oo' and 'ee' and not 'ff' or '--' ?
How might I exclude patterns like '_________' or '010101010101' ?

Any thoughts?

I assume your words are separated by whitespace:

while ($text =~ /
    (?:^|(?<=\s))                    # left boundary
    (?![\w'-]*([^eo\s]|01)\1{1,})    # no repeated '01', no doubled character other than e or o
    ([\w'-]+)                        # the matched word, second capture group
    (?=$|\s)                         # right boundary
    /gx)
{
    print "$2\n";
}

This filters out '0101', 'f__d', 'fxxd', 'f--d', and so on, and accepts 'food',
'feed', '01', 'f_d'.
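To check those claims, the pattern can be wrapped in a small function and run over the listed tokens. The wrapper name `filter_words` is hypothetical. The pattern follows the one above; `\s` is kept out of the negated class so that a run of whitespace after a word is not mistaken for a repeated character.

```perl
use strict;
use warnings;

# filter_words() is a hypothetical wrapper name used only for this sketch.
sub filter_words {
    my ($text) = @_;
    my @kept;
    while ( $text =~ /
        (?:^|(?<=\s))                   # left boundary
        (?![\w'-]*([^eo\s]|01)\1{1,})   # no doubled character except e or o, no repeated 01
        ([\w'-]+)                       # candidate word, second capture group
        (?=$|\s)                        # right boundary
        /gx )
    {
        push @kept, $2;
    }
    return @kept;
}

my @kept = filter_words("0101 f__d fxxd f--d food feed 01 f_d");
print join( ' ', @kept ), "\n";    # prints: food feed 01 f_d
```

Note that the class `[^eo]` is case-sensitive, so a token containing 'EE' or 'OO' would be rejected unless the class is widened.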

Xicheng
 
Uncle_Fester

Gunnar said:
What do you have against such stuff?

Haa! Good attitude about my fluff!

Still -- it's an interesting exercise.

I suspect the best approach is to simply ignore anything with a run of
three or more non-alphas or the same alpha repeated three times. It's
bound to work _most_ of the time :)
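One reading of that rule of thumb, sketched with two simple tests. The name `is_plausible` and the exact thresholds are assumptions for illustration only.

```perl
use strict;
use warnings;

# is_plausible() is an illustrative name; the thresholds are one reading of
# the rule of thumb above, not a tested specification.
sub is_plausible {
    my ($token) = @_;
    return 0 if $token =~ /[^A-Za-z]{3,}/;      # three or more non-alphabetic characters in a row
    return 0 if $token =~ /([A-Za-z])\1\1/i;    # the same letter three times in a row
    return 1;
}

my $text = "food don't _________ 010101010101 zzz f--d";
for my $token ( $text =~ /([A-Za-z0-9_'-]+)/g ) {
    printf "%-14s %s\n", $token, is_plausible($token) ? 'keep' : 'skip';
}
```

This is deliberately loose: doubled characters such as 'ff' or '--' still pass, which matches the "most of the time" spirit of the rule.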
 
Amelia

Uncle_Fester said:
I want to test for "things that look more or less like real English
words" from parsed hypertext.
[...]

How might I allow 'oo' and 'ee' and not 'ff' or '--' ?
How might I exclude patterns like '_________' or '010101010101' ?

Any thoughts?

That reminds me of a science-fiction story by Arthur C. Clarke,
The Nine Billion Names of God. What the monks in the story wanted to
compute is very similar to what you need. Better be careful: read
it first and think about whether you really want to know the complete answer ;-}.
Tongue in cheek, but the story really can give you some clues as to how
to solve your problem (the boundary conditions are very similar). In case
of a supernatural emergency, blame me, Lucy A. G. Faire ;-}.
 
