Regular expression to match word given stem

duffy.paul

Hi,

I need a regular expression that will match a word stem to that stem
PLUS all common suffixes.

'make' should match makes, maker, makings, etc.

A slight twist is that I am using the Porter stemmer so some of the
stems are not real words.

An example is "say" which has the stem of 'sai'.

Here's what I'm using right now:

Pattern p =
Pattern.compile("\\b("+item+"{1,2}|(("+item+"{0})[eiy]?|\\1))(es|er|e?d|en|y|ing|ness|ional)?s?\\b",
Pattern.CASE_INSENSITIVE);

"item" is the stem which I am trying to match in some unstemmed text.

The intent is to match the item with the last letter possibly doubled
or
the item minus its last letter, optionally ending in i, y or e
(but in that case only match the item{0} in \1)
and
with any of the optional endings: es, er, ed, etc.
and
possibly ending in s (for makings, makers and the like).

The trouble is that my regex matches things I don't want it to match.
The stem 'sai' (say) matches 'sad', for example, because the
("+item+"{0})[eiy]?|\1 part strips the i off the end and then finds
that sa(e?d) is a match.
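
The false match described above can be reproduced with a small sketch. This is a simplified form of the pattern in the post, not the exact original: the back-reference alternative and the empty alternative in the suffix list are left out, and the class and method names (StemFalseMatch, build) are made up for illustration.

```java
import java.util.regex.Pattern;

public class StemFalseMatch {
    // Simplified version of the posted pattern (assumption: the \1
    // alternative is dropped because it plays no part in the false match).
    static Pattern build(String item) {
        return Pattern.compile(
            "\\b(" + item + "{1,2}|(" + item + "{0})[eiy]?)"
                + "(es|er|e?d|en|y|ing|ness|ional)?s?\\b",
            Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = build("sai");
        // "sai{0}" reduces the stem to "sa", so "sa" + "d" (from e?d) matches "sad".
        for (String w : new String[] {"said", "says", "sad"}) {
            System.out.println(w + " -> " + p.matcher(w).matches());
        }
    }
}
```

All three words match, including the unwanted 'sad', because once the last letter of the stem is made optional the suffix list is free to supply almost any ending.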

Thanks
P.
 
hiwa

If you have a hammer, you might see everything as a nail.
But I'm afraid the stem/suffix parsing problem is too complex to be a
nail.
The principal weakness of regular expressions is that they can't handle
conditionals.
 
Chris Uppal

I need a regular expression that will match a word stem to that stem
PLUS all common suffixes.

Then you are out of luck...

Stemming is a complex, and highly heuristic, algorithm and is not a suitable
application for regexps. (Indeed, very little /is/ a suitable application for
regexps -- I wish they had never been added to the standard library).

-- chris
 
Paul D

hiwa said:
If you have a hammer, you might see everything as a nail.
But I'm afraid the stem/suffix parsing problem is too complex to be a
nail.
The principal weakness of regular expressions is that they can't handle
conditionals.


Thanks, I added conditionals for a few of the more common cases and it
works MUCH better. I still have a few unintentional matches: 'moth'
matches 'mother'. Breaking up the big regex into smaller pieces also
made it faster.
 
Paul D

Actually, just testing that the search term and the candidate target
have the same Porter stem gives me what I need. The stems of mother and
moth are not the same.

Then the long regex can be replaced by a much fuzzier one:
"\\b"+item+"{0,1}[a-z]{0,10}\\b"
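
A rough sketch of that two-step approach: use the fuzzy regex to pick candidate words, then keep only those whose stem equals the search stem. The stemmer is passed in as a function so any Porter implementation can be plugged in. The lookup table used in main is only a toy stand-in for a real Porter stemmer, and the names StemFilter and findByStem are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StemFilter {
    // Returns the words in text whose stem, according to the supplied
    // stemmer, equals item.
    static List<String> findByStem(String item, String text,
                                   UnaryOperator<String> stemmer) {
        // Fuzzy candidate pattern from the post: the stem with its last
        // letter optional, followed by up to ten more letters.
        Pattern candidates = Pattern.compile(
            "\\b" + item + "{0,1}[a-z]{0,10}\\b", Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        Matcher m = candidates.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            // The stem comparison does the real filtering.
            if (item.equals(stemmer.apply(word))) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy stand-in for a Porter stemmer (assumption: illustration only).
        Map<String, String> toy =
            Map.of("moth", "moth", "moths", "moth", "mother", "mother");
        UnaryOperator<String> stemmer = w -> toy.getOrDefault(w, w);
        System.out.println(
            findByStem("moth", "A moth, two moths and a mother.", stemmer));
    }
}
```

With the toy table, 'mother' is picked up by the regex as a candidate but rejected by the stem check, which is the behaviour described above.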
 
