Regular Expression and string Matching/Replace

S

sanjay010

I have a list of key words. It has around 1000 key word now but can
grow to 5000 keywords.

My web application displays lot of texts which are stored in the
database. My requirement is to scan each text for the occurance of any
of the above keywords. If any keyword is present I have to replace that
with some custom values, before showing it to the user.

I was thinking of using using regular expression for replacing the
keyword in the text using matcher.replaceAll method as follows:
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll(replacementStr);

But My pattern string will have around 5000 keywords with the 'OR'
Logical Operator like- keyword1| keyword2 I keyword3 | ..........

Will such a big pattern string adversly affect the performance? What
can I do to speed up the performance?(Since my keyword list is not
static i would prefer to do the replacement just before showing the
text to the user)
Any suggestions are most welcome.
 
E

Eric Sosman

I have a list of key words. It has around 1000 key word now but can
grow to 5000 keywords.

My web application displays lot of texts which are stored in the
database. My requirement is to scan each text for the occurance of any
of the above keywords. If any keyword is present I have to replace that
with some custom values, before showing it to the user.

I was thinking of using using regular expression for replacing the
keyword in the text using matcher.replaceAll method as follows:
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll(replacementStr);

But My pattern string will have around 5000 keywords with the 'OR'
Logical Operator like- keyword1| keyword2 I keyword3 | ..........

Will such a big pattern string adversly affect the performance? What
can I do to speed up the performance?(Since my keyword list is not
static i would prefer to do the replacement just before showing the
text to the user)
Any suggestions are most welcome.

Can you separate the inputStr into a sequence of
"words," some of which are keywords and some not? If
so, you could prepare a HashMap containing the keywords
and their replacements, and look up each "word" in the
Map: if it's there, you output the replacement instead
of the word, but if it's not you output the word itself.
(If all the keywords have the same replacement text, you
might use a HashSet instead of a HashMap.)
 
C

Chris Smith

I have a list of key words. It has around 1000 key word now but can
grow to 5000 keywords.

My first thought is that if you have a variable number of up to 5000
keywords, you don't want to try to statically maintain a regular
expression that matches all of them. Instead, you probably want to keep
them in a configuration file of some type.

From that point, you have several options:

1. If the people editing this file can be trusted to understand regular
expressions, then you could have them put regular expressions into the
file. You are probably better off walking through the list and calling
replaceAll up to 5000 times... but if you insist, then you could
concatenate the expressions, remembering to use non-grouping parentheses
to contain each subexpression and ensure that it doesn't fall prey to
unexpected operator precedence.

2. If it would be better to store plain text keywords in the
configuration file, then you need a plain text search-and-replace. It
doesn't take too long to write one, but there are some performance
pitfalls. Instead, Google for "skeetutil" and use Jon's utility package
for the job. Jon is very knowledgable, and has thought through a lot of
the performace quirks that you might run into my creating this yourself.
It's a shame that the standard API doesn't solve this trivial problem,
but unfortunately everyone at Sun was blinded by regexp-mania when they
added this stuff to the java.lang.String API.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
S

sanjay010

Hi Chris,
I can concatenate the 5000 keywords to create a regular expression as
you have suggested. I can store this in a table and update this when
the list of keywords change.
But my question is will this long regular expression affect the
performance of matching and replacement of string.
 
R

Roedy Green

Will such a big pattern string adversly affect the performance?

I think that will be impractical on many grounds.

What you do instead in split your text into words perhaps using a
regex split.
See http://mindprod.com/jgloss/regex.html

Then you look up the word in a HashMap. If it is in there, replace it
with the corresponding value, building up your new String in a
StringBuilder.
 
R

Roedy Green

But my question is will this long regular expression affect the
performance of matching and replacement of string.

Regular Expressions are for finding patterns, not for doing
translation. An analogy would be using a garage door opener for a
hammer.
 
J

John C. Bollinger

Hi Chris,
I can concatenate the 5000 keywords to create a regular expression as
you have suggested. I can store this in a table and update this when
the list of keywords change.
But my question is will this long regular expression affect the
performance of matching and replacement of string.

Relative to what? Every line of executable code affects performance, so
yes your approach will do. I'm guessing that what you really want to
know is "will it be fast enough?" There's no way to be sure other than
to test it and see. I'd guess, however, that it will be slow. Quite
possibly slower than using 5000 single-keyword regular expressions to do
the same job. That, and ugly, too.

On the other hand, if it fits your requirements then Eric's suggestion
is brilliant. Easy to write, easy to maintain, and as fast as you're
likely to get. Tokenizing the input document might be a bottleneck
there, but the keyword matching would be lightning fast, and the
tokenization is still much lighter than your proposed huge-regex approach.

I don't suppose it would work to do the substitution once for each
document, in advance, and remember the result? That is the fastest
possible solution at service time.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,904
Latest member
HealthyVisionsCBDPrice

Latest Threads

Top