duffy.paul
Hi,
I need a regular expression that will match a word stem to that stem
PLUS all common suffixes.
'make' should match makes, maker, makings, etc.
A slight twist is that I am using the Porter stemmer so some of the
stems are not real words.
An example is "say" which has the stem of 'sai'.
Here's what I'm using right now:
Pattern p =
Pattern.compile("\\b("+item+"{1,2}|(("+item+"{0})[eiy]?|\\1))(es|er|e?d|en|y|ing|ness|ional)?s?\\b",
Pattern.CASE_INSENSITIVE);
"item" is the stem which I am trying to match in some unstemmed text.
The pattern is meant to:
match the item with the last letter possibly doubled
or
the item minus its last letter, optionally ending in i, y or e
but in that case only match the item{0} in \1
and
with any of the optional endings: es, er, ed, etc.
and
possibly ending in s (for makings, makers and the like).
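To make the first two steps concrete, here is a minimal sketch of how a quantifier appended to the stem binds only to its last character. The class name QuantifierDemo and the helper matches are made up for illustration and are not part of my real code.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true if the whole word matches the regex, ignoring case.
    static boolean matches(String regex, String word) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(word).matches();
    }

    public static void main(String[] args) {
        String item = "sai";
        // "sai{1,2}" is "sa" followed by one or two i's: last letter possibly doubled.
        System.out.println(matches(item + "{1,2}", "saii"));
        // "sai{0}" is "sa" followed by zero i's: the stem minus its last letter.
        System.out.println(matches(item + "{0}", "sa"));
        // With {0} the full stem itself is no longer matched.
        System.out.println(matches(item + "{0}", "sai"));
    }
}
```

So item{0} is just a compact way of writing the stem with its last letter removed.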
The trouble is that my regex matches things I don't want it to match.
The stem 'sai' (say) matches 'sad', for example, because the
("+item+"{0})[eiy]?|\1 part strips the i off the end and then finds
that sa(e?d) is a match.
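One direction I have been considering, offered only as a rough sketch: choose the ending of the stem from its last letter instead of always allowing that letter to vanish. The three rules below (a final i may also be y, a final e may be dropped, any other final letter may be doubled) are my own assumption about what the Porter stemmer tends to produce, not a complete inverse of it. The class name StemPattern and the method forStem are hypothetical, and the stem is assumed to be non-empty.

```java
import java.util.regex.Pattern;

class StemPattern {
    // Suffix list taken from the pattern above, without the empty alternative.
    static final String SUFFIXES = "(?:es|er|e?d|en|y|ing|ness|ional)?s?";

    // Hypothetical builder; assumes item is a non-empty stem.
    static Pattern forStem(String item) {
        String head = Pattern.quote(item.substring(0, item.length() - 1));
        char last = Character.toLowerCase(item.charAt(item.length() - 1));
        String tail;
        if (last == 'i') {
            tail = "[iy]";   // assumed rule: final i stands for i or y (sai: say, said)
        } else if (last == 'e') {
            tail = "e?";     // assumed rule: final e may be dropped (make: making)
        } else {
            // assumed rule: any other final letter may be doubled (run: running)
            tail = "(?:" + Pattern.quote(String.valueOf(last)) + "){1,2}";
        }
        return Pattern.compile("\\b" + head + tail + SUFFIXES + "\\b",
                Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = forStem("sai");
        for (String w : new String[] {"say", "said", "says", "saying", "sad"}) {
            System.out.println(w + " -> " + p.matcher(w).find());
        }
    }
}
```

Because the last letter of 'sai' must now be present as i or y, 'sad' is rejected while 'said' and 'saying' still match. I don't know yet whether these three rules cover enough stems.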
Thanks
P.