Non-cpu intensive indexing...

J

jagonzal

Hi,

I have a bunch of String of fixed length, and a function F such that
F(String) returns one of a small set of possible values.

Right now the work is being done with regular expressions - it does the
job (identify patterns that result in a given return value) but it is
quite heavy in CPU terms, and this operation must be done at a peak of
100 hits per second.

Any ideas on an alternative to the regexps?

(the possible fixed length strings are about 100000)

Thanks in advance,

JGN
 
C

Chris

Hi,

I have a bunch of String of fixed length, and a function F such that
F(String) returns one of a small set of possible values.

Right now the work is being done with regular expressions - it does the
job (identify patterns that result in a given return value) but it is
quite heavy in CPU terms, and this operation must be done at a peak of
100 hits per second.

Any ideas on an alternative to the regexps?

(the possible fixed length strings are about 100000)

This depends entirely on your app's requirements. What, exactly, are
your regexes doing? Can it be done more simply?

Some things to think about:

-- Jakarta ORO instead of the JDK 1.4 regex library. I have no idea if
it's any faster, but you could try it.

-- You may be better off using arrays of char, or one giant array of
char containing all strings, than the String class. String creates a lot
of garbage.

-- Make sure you compile your regexes in advance. Don't recompile before
each comparison.

-- Depending on your app, you might be able to use a lexer instead. See
http://www.jflex.de/. There's a learning curve, but it's *really* fast.
 
J

jagonzal

Chris said:
This depends entirely on your app's requirements. What, exactly, are
your regexes doing? Can it be done more simply?

They are evaluating a given string for a match, so that finally a
function can return which classification this string is in.

Something along the lines of:

if(incoming string matches any of {X1} set of regexpes)
return "Value1"
if(incoming string matches any of {x2} set of regexps)
return "Value2"
....

there are about three or four possible values, but the regexps must
cover about 100000 different cases of "incoming string".
-- You may be better off using arrays of char, or one giant array of
char containing all strings, than the String class. String creates a lot
of garbage.

The strings are a restriction that I cannot remove :)

I'm thinking of making an index with the 100000 Strings.
-- Make sure you compile your regexes in advance. Don't recompile before
each comparison.

It's already done this way.
-- Depending on your app, you might be able to use a lexer instead. See
http://www.jflex.de/. There's a learning curve, but it's *really* fast.

Will check it out, thank you!

JGN
 
R

Robert Klemme

Chris said:
This depends entirely on your app's requirements. What, exactly, are
your regexes doing? Can it be done more simply?

Some things to think about:

-- Jakarta ORO instead of the JDK 1.4 regex library. I have no idea if
it's any faster, but you could try it.

-- You may be better off using arrays of char, or one giant array of
char containing all strings, than the String class. String creates a lot
of garbage.

-- Make sure you compile your regexes in advance. Don't recompile before
each comparison.

-- Depending on your app, you might be able to use a lexer instead. See
http://www.jflex.de/. There's a learning curve, but it's *really* fast.

-- Optimize the RX.

-- Preprocess and cache results (do those strings change? if yes and
there's repetition some form of LRU cache might help)

We certainly have too few information to narrow it down.

robert
 
B

Boris Stumm

Hi,

I have a bunch of String of fixed length, and a function F such that
F(String) returns one of a small set of possible values.

Right now the work is being done with regular expressions - it does the
job (identify patterns that result in a given return value) but it is
quite heavy in CPU terms, and this operation must be done at a peak of
100 hits per second.

Any ideas on an alternative to the regexps?

(the possible fixed length strings are about 100000)

If you know the 100000 strings in advance, a simple
Map<String, String> with the strings as key and the function
value as value will probably fastest, and you only have to compute
the function values once.

If you are afraid of memory consumption, use a cache instead of a
simple map.
 
O

Oliver Wong

They are evaluating a given string for a match, so that finally a
function can return which classification this string is in.

Something along the lines of:

if(incoming string matches any of {X1} set of regexpes)
return "Value1"
if(incoming string matches any of {x2} set of regexps)
return "Value2"
...

there are about three or four possible values, but the regexps must
cover about 100000 different cases of "incoming string".

Please give examples of:

(*) The incoming strings.
(*) The regular expressions.
(*) The values returned.

Otherwise, I can only offer the vague advice "look at decision trees".

- Oliver
 
C

Chris Uppal

Right now the work is being done with regular expressions - it does the
job (identify patterns that result in a given return value) but it is
quite heavy in CPU terms, and this operation must be done at a peak of
100 hits per second.

Just as a side-note, if you are comparing 100 strings a second using regexps,
and are seeing significant CPU load as a result, then (unless the strings are
very long) there must be something wrong with either the regexp implementation,
or the way you are using it (or perhaps both if the implementation is such that
it's forcing you to use it in the "wrong" way).

It should be possible to transform a task like this into a single DFA
(deterministic finite state machine) which will run in time proportional to the
length of the input string (not the complexity of the regexps). The constants
of proportionality depend on the DFA implementation, but should not be /huge/
even for a naive implementation. I would /hope/ that the Java regexp
implementation builds and uses such a DFA, but if not then the thing to do is
to do so yourself. Probably the easiest way to do that is to use a
pre-existing scanner-generator, such as the jflex generator which has already
been mentioned.

-- chris
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top