String comparision efficency

D

Daniel

HI!
I have an appliation that will need to parse a text of unkown length.
I get the data as an array of strings (and thus I know then how much
data it is)
I want to do several things on each "row" (that is each element in the
array) I want to
1. look for a specific word (start) and count how many of them there
are in total.
2. Highlight the word alarm and append a short description to the end
of the line

my first thought was to use

for(int i=0;i<data.length++){
if(data.endsWith("start"){
// do the work..
}else if(data.startsWith("Alarm")){
//do different work
}
}

this seems to be rather inefficent as I get two (and possibly more, my
employer did not seem think these few features were enough..)
Is there a better way to do this.
As the alarm line also contains an alarm code, I do not know what it
looks like. Any suggestions?

regards
/daniel
 
A

Antti S. Brax

my first thought was to use

for(int i=0;i<data.length++){
if(data.endsWith("start"){
// do the work..
}else if(data.startsWith("Alarm")){
//do different work
}
}

this seems to be rather inefficent as I get two (and possibly more, my
employer did not seem think these few features were enough..)


You get two what? The worst case scenario in the above example is
that you will do two 4 character string matches (string ends with
"tart" and starts with "Alar". Do you have any idea about how
much processing power that takes?

"Premature optimization is the root of all evil".

First run your code and see if it really is inefficient.
As the alarm line also contains an alarm code, I do not know what it
looks like. Any suggestions?

If you don't know it, how the hell are we supposed to?
 
D

Daniel

First run your code and see if it really is inefficient.
I have.. it takes along with getting the data (I will run a test on
only the processing soon) 5 seconds on the test machine, which
contains almost a third of the ammount of data a "real" machine
contains. I estimate that between 3 and 4 seconds of that is time to
transfear the data to my app. (I did a testrun now, and yes,
approximately 1 second is for processing the text)
While this does not seem much, the users did think it took very long
when I did show the the application. I guess I will end up with
writing some code that shows progress rather than optemize this,
maybe..
If you don't know it, how the hell are we supposed to?
You mean you are not an all-powerful all-knowing divine being? ;)

I realize after reading your reply I did express myself a bit
unclear. What I mean is that the line looks like this "-103 ALARM=E03"
and that the "-103" is dependent on when it occured, and thus I do not
know which number it will be. and the "E03" is the alarm number, and
of course is unknown until runtime.

I saw that I thus can not use the endsWith or startsWith but am using
lastIndexOf()

My question for suggestions was to the problem of matching this. if
there were some relativiley simple way of doing this that I was
missing.
I will look into the links posted in the other reply.

regards
Daniel
 
S

Sebastian Millies

Am Mon, 14 Feb 2005 11:01:52 +0000 schrieb bugbear:

I hate to sully an object oriented langugage with anything
so banal and old fashoinedas an algorithm, but...

http://www.cs.rpi.edu/~musser/gp/algorithms.html

BugBear

which brings up the question: Are there any Java libraries
for sorting and searching thst incorporate these
algorithms (Introsort, or Knuth-Morris-Pratt, Boyer-Moore
and their hybrids, resp.)? The Java Collections Framework
for one, does not document the underlying algorithms in
detail, and neither does Jakarta ORO.

-- Sebastian
 
A

Antti S. Brax

REPLACE-OBVIOUSsDOTmilliesATidsMINUSscheerDOTde wrote in comp.lang.java.program$
The Java Collections Framework
for one, does not document the underlying algorithms in
detail, and neither does Jakarta ORO.

In what detail do you need the documentation?

From java.util.Arrays: The sorting algorithm is a tuned quicksort,
adapted from Jon L. Bentley and M. Douglas McIlroy's "Engineering
a Sort Function", Software-Practice and Experience, Vol. 23(11) P.
1249-1265 (November 1993). This algorithm offers n*log(n)
performance on many data sets that cause other quicksorts to
degrade to quadratic performance.

From java.util.Collections: The sorting algorithm is a modified
mergesort (in which the merge is omitted if the highest element
in the low sublist is less than the lowest element in the high
sublist). This algorithm offers guaranteed n log(n) performance.

I'd say that is quite well defined.


As a sidenote: does anyone know why Arrays.sort(Object[]) uses
a different algorithm than Arrays.sort(byte[])?
 
P

Patricia Shanahan

Antti said:
REPLACE-OBVIOUSsDOTmilliesATidsMINUSscheerDOTde wrote in comp.lang.java.program$ ....
As a sidenote: does anyone know why Arrays.sort(Object[]) uses
a different algorithm than Arrays.sort(byte[])?

The issue may be stability.

In a byte array, elements that compare equal are totally and
indistinguishably identical, so there is no need to preserve
their original order.

In an Object array two elements can compare equal but be
distinguishable, so keeping them in order may matter. The
Javadocs claim that the Object[] sort is stable.

Patricia
 
J

John C. Bollinger

Antti said:
As a sidenote: does anyone know why Arrays.sort(Object[]) uses
a different algorithm than Arrays.sort(byte[])?

The algorithm for sort(Object[]) has to be stable (specified in the
class-level docs for Arrays), which makes the order resulting from
Arrays.sort(Object[]) the same as that from Collections.sort(List).
Using a stable sort makes a reasonably good sense anyway. The
distinction between a stable and an unstable sort is lost when you sort
primitives, however. For those cases an algorithm was chosen that
exhibits better average performance (but the same asymptotic complexity).
 
J

Joona I Palaste

I hate to sully an object oriented langugage with anything
so banal and old fashoinedas an algorithm, but...

Surely you meant that "banal and old fashioned" jokingly? Even in object
oriented languages, methods must eventually be implemented with
algorithms. It's all ones and zeroes in the processor, after all. There
must be *some* layer that translates the high-flying ideas into
something the processor understands.
 
J

John C. Bollinger

Daniel said:
I have an appliation that will need to parse
[...]

an array of strings
[...]

I want to do several things on each
[...]

element in the array
[...]

1. look for a specific word (start) and count how many of them there
are in total.

Count how many lines contain this word? How many total occurrences of
the word there are?
2. Highlight the word alarm and append a short description to the end
of the line

What does it mean to "highlight" the word?
my first thought was to use

for(int i=0;i<data.length++){
if(data.endsWith("start"){
// do the work..
}else if(data.startsWith("Alarm")){
//do different work
}
}


This is not "doing several things" on each element of the array. In
particular, if you see "start" and the end of one element, you do not
check for "Alarm". Also, you are doing case-sensitive tests. That may
be fine for your application, but you should be sure. Furthermore, have
you considered what happens if any of the strings contains "restart" but
not "start"? Do you need to recognize multiple occurrences of "start"
in any one string?
this seems to be rather inefficent as I get two (and possibly more, my
employer did not seem think these few features were enough..)

Two *what*? If you need to check multiple unrelated conditions on a
string then you need to perform a separate check for each. If you are
permitted to use relationships among the conditions (e.g. a line
containing "start" will not be an "Alarm" line) then you can structure
the code to reduce the total number of tests, as indeed your example
appears to do.

Do you have specific requirements that you are supposed to meet? If so,
then we could be more helpful if we knew what those were.
Is there a better way to do this.
As the alarm line also contains an alarm code, I do not know what it
looks like. Any suggestions?

You might consider the Java regex machinery. It is unlikely to be as
fast as a simple String.startsWith(), but it sounds like your real
matching criteria are rather more complicated anyway. If you go this
route then DO compile the Pattern(s) once and reuse it, rather than
compiling it anew for each string.
 
J

John C. Bollinger

Daniel said:
I have.. it takes along with getting the data (I will run a test on
only the processing soon) 5 seconds on the test machine, which
contains almost a third of the ammount of data a "real" machine
contains. I estimate that between 3 and 4 seconds of that is time to
transfear the data to my app. (I did a testrun now, and yes,
approximately 1 second is for processing the text)
While this does not seem much, the users did think it took very long
when I did show the the application. I guess I will end up with
writing some code that shows progress rather than optemize this,
maybe..

You must have a large number of strings to work on. How are they
getting into your app? Do you have control of the representation? If
your app is creating the String array (as opposed to receiving it whole
over RMI, or something like that) then constructing a String[] to hold
them is inefficient.

Unless you actually need to hold everything in memory at the same time
(unlikely), it is much better to process the Strings one-by-one as you
receive them. Also, if you have not done so already then ensure that
whatever Reader or InputStream the data are coming in on is buffered.
If that doesn't improve performance enough then consider putting the
reading and processing into different threads. Since one will be doing
I/O, there is good potential to get some additional performance
improvement with this.

It is pointless to try to improve the parsing without touching the I/O,
as your tests show that about a 20% speedup is the absolute maximum you
can expect that way. If users complain that the application is too
slow, then the smaller improvement that you can reasonably hope for
(without reworking the I/O) is not going to satisfy them.
I realize after reading your reply I did express myself a bit
unclear. What I mean is that the line looks like this "-103 ALARM=E03"
and that the "-103" is dependent on when it occured, and thus I do not
know which number it will be. and the "E03" is the alarm number, and
of course is unknown until runtime.

I saw that I thus can not use the endsWith or startsWith but am using
lastIndexOf()

My question for suggestions was to the problem of matching this. if
there were some relatively simple way of doing this that I was
missing.

Back to the case-sensitivity question, then: you have now presented
three different possible capitalizations of the "ALARM" keyword you need
to recognize. Which is right? Do you need to parse out the other parts
of this line as well? In that case you may have crossed the threshold
where a suitable regex will be useful to you. Something like:

Pattern.compile("(-\\d+)\\s+ALARM=(.+)");

would get you a pattern that can match the example string you provided.
From the Matcher object you could then also separately extract the
"-103" and the "E03" without any additional parsing. The matching would
be fairly efficient because it can start at the beginning of the string
and reject many non-matches early.

A well-written pattern for the other test may also solve the problem of
distinguishing "start" from "restart", "starting", and "Astarte".
 
D

Daniel

You must have a large number of strings to work on. How are they
getting into your app? Do you have control of the representation? If
your app is creating the String array (as opposed to receiving it whole
over RMI, or something like that) then constructing a String[] to hold
them is inefficient.
I have a range from 0 to 1000 strings (or rows).
They get into my app via a bufferInputStream(Connected to a serialport
via the javax.comm package), so I get them as a single byte array.
As I want to display all the data in a JTextPane I assumed I had to
convert it to string. Perhaps that was a too quick conclusion?
I also did the split on "\r\n" to make it easier to distignuish
individual lines, as some lines require special treatment (as
discussed in this and previous postings)
Unless you actually need to hold everything in memory at the same time
(unlikely), it is much better to process the Strings one-by-one as you
receive them. Also, if you have not done so already then ensure that
whatever Reader or InputStream the data are coming in on is buffered.
If that doesn't improve performance enough then consider putting the
reading and processing into different threads. Since one will be doing
I/O, there is good potential to get some additional performance
improvement with this.

the thing is that I get them all in one go. As soon as I request the
data I get it all as one big chunk. I have already put it in it's own
thread so I wouldn't freeze the GUI in the process.
It is pointless to try to improve the parsing without touching the I/O,
as your tests show that about a 20% speedup is the absolute maximum you
can expect that way. If users complain that the application is too
slow, then the smaller improvement that you can reasonably hope for
(without reworking the I/O) is not going to satisfy them.

true, my thought now is that I would make the GUI more "alive" so
instead of a "Getting data from machine" perhaps something along the
lines "Getting data from machine (XX% done)" or similar, thus
informing the user of the progress. That of course means I would have
to measure how long I have gone, but I guess that I can fake some of
this, updating the procentage even though I don't really know how
much is really done, and then put real values there as soon as I have
them...

Back to the case-sensitivity question, then: you have now presented
three different possible capitalizations of the "ALARM" keyword you need
to recognize. Which is right? Do you need to parse out the other parts

I will check with the guys who wrote the software I am communicating
with, but from what I've seen so far "-103 ALARM=E04" is the correct
one.
of this line as well? In that case you may have crossed the threshold
not of this line, this line I must find which alarm number I have,
look it up and display the whole thing as "-103 ALARM=E04 (Temperature
sensor in boiler defect)"
So a regular expresion seems like the way to go.

Thank you for your ideas and suggestions!

regards
Daniel
 
D

Daniel

1. look for a specific word (start) and count how many of them there
Count how many lines contain this word? How many total occurrences of
the word there are?
yes :) in this case it would yield the same result, as a line may not
contain the word more than once. To answer your question more
properly, I need to count the number of total occrences of the word.
What does it mean to "highlight" the word?

in this case I mark it with bold font.
like this:

SimpleAttributeSet attr = new SimpleAttributeSet();
if (bold) {
StyleConstants.setBold(attr, true);
}
try {
doc.insertString(doc.getLength(), s + "\n", attr);
} catch (BadLocationException e) {
System.out.println("Error trying to add string: " + s);
}
}
This is not "doing several things" on each element of the array. In
particular, if you see "start" and the end of one element, you do not
check for "Alarm". Also, you are doing case-sensitive tests. That may
be fine for your application, but you should be sure. Furthermore, have
you considered what happens if any of the strings contains "restart" but
not "start"? Do you need to recognize multiple occurrences of "start"
in any one string?

I guess again I expressed myself a bit clumsy. since most of the
strings will not contain any of the words I'm looking for I would have
to do all the comparisions on them, which seems like a bit uneccesery
work, but I guess there is no real way to get out of that.
Not more than one occurancesy of "start" in any one string.
containing "start" will not be an "Alarm" line) then you can structure
the code to reduce the total number of tests, as indeed your example
appears to do.

yes, a line containing "start" will not contain alarm and vice versa.
So I can structure my code as I tried to already, meaning when I've
found one of the speical cases I'm looking for I do not have to check
the others.
Do you have specific requirements that you are supposed to meet? If so,
then we could be more helpful if we knew what those were.
no, The requrement so far is that I
1. count the number of occrances of the word "START"
2. Append the meaning of an alarm number at the end of the string
containg "ALARM"

You might consider the Java regex machinery. It is unlikely to be as
fast as a simple String.startsWith(), but it sounds like your real
matching criteria are rather more complicated anyway. If you go this
route then DO compile the Pattern(s) once and reuse it, rather than
compiling it anew for each string.

yeah, I will look into it. I have limited experience with regex and
none with the java version of it. but I guess I should be able to
figure something out.

A thought that struck me just now.
since each alarm line looks like this
"-XXX ALARM=EYY"
would it make sense (provided that no other line looks like this)
to do
if(data.getCharAt(10)=='='){
String s=data+ map.get(data.split("=")[1]); // provided I store the
definitions
//in a map.. but you get the idea.
textpane.addRow(s);
}

or is this a rather clumsy and inefficent way of achiving my goal?

/daniel
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,566
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top