Does Java support 'String' methods properly?


Red Orchid

I think that if the methods of 'String' are not
efficient, they have a very bad influence on
the performance of an application.

Java does not support any String methods with ignoreCase
except 'String.equalsIgnoreCase' and 'String.regionMatches'.

Therefore, I think that Java leads a programmer to
write code that is inefficient.

As evidence for my thinking, I will show examples below.

<example_#1>
String src = ..;
String key = ..;

... = src.toLowerCase().indexOf(key.toLowerCase());


#1 is very inefficient, because #1 converts the whole
of 'src' to lower case though there is no need to do so.

That is,
if src = "abcdefghij" and key = "bc", there is a need
to convert only "abc" of src to lower case.

Code like the above is found very commonly in Java
application sources.
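For comparison, here is a minimal sketch of the kind of helper #1 seems to call for: characters of 'src' are lower-cased only as they are inspected, so nothing beyond the match point is ever converted (the class and method names are my own):

```java
public class CaseInsensitiveSearch {
    // Returns the index of the first case-insensitive match of key in src,
    // or -1. Characters of src are lower-cased lazily, one comparison at a
    // time, so a match near the start never touches the rest of the string.
    static int indexOfIgnoreCase(String src, String key) {
        if (key.isEmpty()) return 0;
        int max = src.length() - key.length();
        outer:
        for (int i = 0; i <= max; i++) {
            for (int j = 0; j < key.length(); j++) {
                char a = Character.toLowerCase(src.charAt(i + j));
                char b = Character.toLowerCase(key.charAt(j));
                if (a != b) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfIgnoreCase("abcdefghij", "BC")); // prints 1
    }
}
```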


<example_#2>
'String.regionMatches' does not provide valuable aid for
solving the inefficiency of #1.

Because ..
the following code is a part of 'String.regionMatches',
assuming 'src.regionMatches(true, i, key, j, ...)'.


if (ignoreCase) {
    ...
    char u1 = Character.toUpperCase(c1);
    char u2 = Character.toUpperCase(c2);
    if (u1 == u2) {
        continue;
    }
    ..

Whenever 'regionMatches' is called, the characters of
'key' are converted to upper case again. It is inefficient.
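Despite that per-call conversion cost, regionMatches does support an allocation-free case-insensitive search; a minimal sketch (method name is my own):

```java
public class RegionMatchesSearch {
    // Case-insensitive indexOf built on String.regionMatches(true, ...).
    // No temporary lower-cased copies are allocated; the trade-off is that
    // the compared characters are case-converted again on every call.
    static int indexOfIgnoreCase(String src, String key) {
        int max = src.length() - key.length();
        for (int i = 0; i <= max; i++) {
            if (src.regionMatches(true, i, key, 0, key.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfIgnoreCase("abcDEFghij", "def")); // prints 3
    }
}
```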


<example_#3>
A programmer can write his own method like the following.

String k = key.toLowerCase();
for (int i = 0; i + k.length() <= src.length(); i++) {
    for (int j = 0; j < k.length(); j++) {
        char c = Character.toLowerCase(src.charAt(i + j));
        if (c != k.charAt(j)) {

            ....

#3 calls 'String.charAt'. That is, #3 has the overhead
of a function call (method call). If his method existed
inside 'String', he could access the field 'value[]' of
'String' and then his method would not have that overhead.

</example_*>
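A self-contained version of the #3 idea, using char arrays so the inner loop does plain array access instead of charAt calls. Whether this wins in practice depends on the JIT, which usually inlines String.charAt anyway:

```java
public class CharArraySearch {
    // Variant of the hand-written search that copies both strings to char
    // arrays once, trading one allocation per call for plain array access
    // in the inner loop.
    static int indexOfIgnoreCase(String src, String key) {
        char[] s = src.toCharArray();
        char[] k = key.toLowerCase().toCharArray();
        int max = s.length - k.length;
        outer:
        for (int i = 0; i <= max; i++) {
            for (int j = 0; j < k.length; j++) {
                if (Character.toLowerCase(s[i + j]) != k[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfIgnoreCase("abcdefghij", "GHI")); // prints 6
    }
}
```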


I think that Java does not support 'String' methods sufficiently
or properly.

What is your opinion ?

Thanks.
 

Ingo R. Homann

Hi,

Red said:
[String-operations are slow in Java]

Well, in "normal" applications, String-operations are *not* the bottleneck.

But *if* String-operations are a bottleneck, then even your idea of
linear searching a String is a very bad idea.

Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
String searching algorithms!

Ciao,
Ingo
 

Roedy Green

Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
String searching algorithms!

See http://mindprod.com/jgloss/products1.html#BOYER

Boyer-Moore does not make that much difference. I think they would pay
off if you were simultaneously searching for many different strings.
My implementation does NOT handle that.

A string search can be done very efficiently in a native method which
is what Jet does.
 

Thomas Hawtin

Red said:
I think that if the methods of 'String' are not
efficient, they have a very bad influence on
the performance of an application.

Java does not support any String methods with ignoreCase
except 'String.equalsIgnoreCase' and 'String.regionMatches'.

Therefore, I think that Java leads a programmer to
write code that is inefficient.

As evidence for my thinking, I will show examples below.

<example_#1>
String src = ..;
String key = ..;

.. = src.toLowerCase().indexOf(key.toLowerCase());


#1 is very inefficient, because #1 converts the whole
of 'src' to lower case though there is no need to do so.

Inefficient, possibly. Very inefficient, no.

Remember that (short lived) memory allocation is very cheap. Also
String.toLowerCase will not create a new String if the source String is
already lower case.
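That optimisation is easy to observe; note the identity check below relies on Sun's implementation, not on anything the spec guarantees:

```java
public class ToLowerCaseDemo {
    public static void main(String[] args) {
        String lower = "already lower";
        String mixed = "Mixed Case";
        // In Sun's implementation, toLowerCase returns the receiver itself
        // when no character needs changing - no copy is made.
        System.out.println(lower.toLowerCase() == lower); // true on Sun's JDK
        System.out.println(mixed.toLowerCase() == mixed); // false: new String
    }
}
```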

In general if you split code into passes then each pass becomes compact
and fast. If you try to do everything at once, then it rapidly becomes a
mess. It also leads to code duplication.

I think that Java does not support 'String' methods sufficiently
or properly.

It's a complicated area. And I'll probably make some mistakes...

There are plenty of inconsistencies. equalsIgnoreCase works on a
character basis. String.toUpperCase may create a String with a different
number of chars.

Take the German eszet (<http://en.wikipedia.org/wiki/Eszet>).
"Stra\u00dfe".toUpperCase() should give "STRASSE". Both literal strings
should be equal ignoring case, but equalsIgnoreCase is broken.
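A small demonstration of the eszet mismatch (Locale.GERMAN is used for clarity; the one-to-many ß mapping applies regardless of locale in Java):

```java
import java.util.Locale;

public class EszetDemo {
    public static void main(String[] args) {
        String strasse = "Stra\u00dfe";                    // 6 chars
        String upper = strasse.toUpperCase(Locale.GERMAN); // "STRASSE", 7 chars
        System.out.println(upper);
        // equalsIgnoreCase compares char by char (and length first), so
        // strings whose case mapping changes the length never compare equal:
        System.out.println(strasse.equalsIgnoreCase("STRASSE")); // false
        System.out.println(upper.equals("STRASSE"));             // true
    }
}
```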

Another thing that is usually done incorrectly (including within Java)
is that toUpperCase and toLowerCase almost always should have a
specified Locale. For instance "i".toUpperCase() does not always give
"I" (Turkish will give "\u0130"). It seems that lots of programs exhibit
bugs when run in Turkish.
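A demonstration of the Turkish-locale trap Tom describes:

```java
import java.util.Locale;

public class TurkishDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        // In Turkish, the upper case of 'i' is dotted capital I (U+0130),
        // and the lower case of 'I' is dotless small i (U+0131).
        System.out.println("i".toUpperCase(turkish)); // \u0130, not "I"
        System.out.println("I".toLowerCase(turkish)); // \u0131, not "i"
        // Locale-insensitive case mapping should pass Locale.ROOT explicitly
        // rather than relying on the default locale:
        System.out.println("TITLE".toLowerCase(Locale.ROOT)); // title
    }
}
```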

Tom Hawtin
 

Googmeister

Ingo said:
Hi,

Red said:
[String-operations are slow in Java]

Well, in "normal" applications, String-operations are *not* the bottleneck.

Yes, I agree here. Strings are very convenient to use in Java (as
compared to, say, C where they're associated with buffer overflow
errors). The tradeoff is reduced efficiency and increased memory
usage. If your bottleneck really is dealing with strings, you can
always go back to using an array of bytes or chars, and you're even
free to use the convention of terminating them with '\0'.
But *if* String-operations are a bottleneck, then even your idea of
linear searching a String is a very bad idea.

This is how Java implements indexOf() as well. It is efficient (linear)
for "typical" inputs, but has a quadratic worst case.
Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
String searching algorithms!

Anyone know why Java's indexOf doesn't use Boyer-Moore
(sublinear on typical inputs) for long strings? I suspect it's a
combination of requiring extra space proportional to the alphabet
size and difficulties in dealing with the subtleties of Unicode
(e.g., code points). But it would be nice to hear from the experts.
 

Ingo R. Homann

Hi,
This is how Java implements indexOf() as well. It is efficient (linear)
for "typical" inputs, but has a quadratic worst case.

Depends on what is 'typical' for you. :)

(BTW, linear search is not really quadratic, but M*N.)
Anyone know why Java's indexOf doesn't use Boyer-Moore
(sublinear on typical inputs) for long strings? I suspect it's a
combination of requiring extra space proportional to the alphabet
size and difficulties in dealing with the subtleties of Unicode
(e.g., code points). But it would be nice to hear from the experts.

IIRC, all algorithms (linear, BMH, KMP) have different advantages,
depending on the sizes of the text and the pattern, the size of the
alphabet, the distribution of the characters and if you are searching a
pattern in different texts (do not forget that you need to generate a
"suffix-list" for the pattern)!

If both the text and the pattern are very short (perhaps only 10 to
20 chars?), it is obvious that linear search is the best. If the
alphabet is very small (e.g. only A and B), the pattern might be AAAAA, so
that KMP (*) has no advantage over linear search either.

Fact is: You simply cannot decide which algorithm is 'best' if you do
not know *anything* about the application (which of course sun can't).

And if you want to implement a large text-database or a gene-sequencer,
it is obvious that you have to think *a lot* about searching-algorithms,
so that *no* implementation (that has not been optimized especially for
the exact application (**)) is perfect.

Ciao,
Ingo

(*) It has been nearly 10 years since I implemented these algorithms,
and I remember there were some slight differences between BM and BMH
which I do not remember right now. So, forgive me if I confuse the
algorithms...

(**) In these cases, it might often be useful not to have the whole
text in RAM, but e.g. to have a linked list of "chunks" of text, with
one chunk perhaps some MB in size. The rest remains on disk.
No default implementation can deal with that!
 

Googmeister

Ingo said:
Hi,


Depends on what is 'typical' for you. :)

English text.
(BTW, linear search is not really quadratic, but M*N.)

Yes, in the worst case, this is quadratic in the input size.
The worst case occurs when M = N/2.
IIRC, all algorithms (linear, BMH, KMP) have different advantages,
depending on the sizes of the text and the pattern, the size of the
alphabet, the distribution of the characters and if you are searching a
pattern in different texts (do not forget that you need to generate a
"suffix-list" for the pattern)!

That's why I specifically said long strings. Arrays.sort uses insertion
sort for small inputs and mergesort/quicksort for large inputs. It
wouldn't be inconsistent to do something smart (e.g., Boyer-Moore) for
long strings since it can lead to dramatic performance improvements.
Fact is: You simply cannot decide which algorithm is 'best' if you do
not know *anything* about the application (which of course sun can't).

Sun provides algorithms like mergesort and red-black trees. We
certainly expect reasonably efficient algorithms here, even though
they're not always the best for any particular application. My question
is why not use Boyer-Moore for long strings, where it is superior to
brute force, and in most common cases vastly superior?
 

Ingo R. Homann

Hi,
That's why I specifically said long strings. Arrays.sort uses insertion
sort for small inputs and mergesort/quicksort for large inputs. It
wouldn't be inconsistent to do something smart (e.g., Boyer-Moore) for
long strings since it can lead to dramatic performance improvements.
...
Sun provides algorithms like mergesort and red-black trees. We
certainly expect reasonably efficient algorithms here, even though
they're not always the best for any particular application. My question
is why not use Boyer-Moore for long strings, where it is superior to
brute force, and in most common cases vastly superior?

That would indeed be possible. But - as I said in my last posting:

(1) I think that if you want to deal with really large texts, you will
have to implement your own (chunked) String class, highly optimized for
your specific application, so that you cannot use String.indexOf at all.

(2) String.indexOf is not a bottleneck in most typical applications.

So, it may not be worth implementing.

Or, perhaps sun hasn't heard of BMH at all. ;-)

Ciao,
Ingo
 

Chris Uppal

Googmeister said:
That's why I specifically said long strings. Arrays.sort uses insertion
sort for small inputs and mergesort/quicksort for large inputs. It
wouldn't be inconsistent to do something smart (e.g., Boyer-Moore) for
long strings since it can lead to dramatic performance improvements.

I think it would be at least as useful if it were provided for binary data too
(or even instead).

I'm not sure that your point about sorting carries over all that well. Sorting
is something that many apps do a lot of, and often the collection to be sorted
is big enough that insert-sort is not suitable. For a comparable situation to
apply to string searching (text or binary) searches of long strings would have
to be reasonably common -- and I'm not convinced that they are[*]. Especially
when, in practice, BM and naive searching both tend to operate in linear
time -- albeit with different constants. And then when you consider that the
long strings you are searching must have come /from/ somewhere, with attendant
I/O costs, it is probably quite hard to find an application for which string
matching is a genuine bottleneck[**].

(
[*] Although that might be as much a result of the lack a fast string search as
a cause of it.

[**] I'm not saying there aren't any, just that I wouldn't expect there to be
enough to push Sun into giving string matching high priority.
)

-- chris
 

Robert M. Gary

If you profile your application, I would be surprised if you found
String manipulation your bottleneck. If you are doing any I/O at all, I
would look at making that faster.

BTW: If you are creating a string to pass to other methods, consider
using StringBuffer instead. It's a first class object and therefore
passed by reference. The String class is passed by value and results in
a complete copy every time you call a method.

-Robert
 

Daniel Dyer

BTW: If you are creating a string to pass to other methods, consider
using StringBuffer instead. It's a first class object and therefore
passed by reference. The String class is passed by value and results in
a complete copy every time you call a method.

This is simply not true. There is no difference between the way Strings
and StringBuffers are passed and no concept of "first class object" versus
some other class of object. Everything in Java is passed by value,
whether it's a primitive or an object reference (note that it's the
reference that is passed, not the object itself). Strings are not copied
as they are passed around and the fact that they are immutable makes them
suitable for sharing.
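A minimal sketch of the distinction Dan draws: the reference itself is copied, so reassigning the parameter has no effect on the caller, while mutating through it does (class and method names are my own):

```java
public class PassByValueDemo {
    // Reassigning the parameter only changes the method's local copy of
    // the reference; the caller's variable is untouched.
    static void reassign(StringBuffer b) {
        b = new StringBuffer("other");
    }

    // Mutating through the reference changes the single shared object,
    // which the caller observes.
    static void mutate(StringBuffer b) {
        b.append("!");
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("hi");
        reassign(sb);
        System.out.println(sb); // hi   - caller's reference untouched
        mutate(sb);
        System.out.println(sb); // hi!  - shared object state changed
    }
}
```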

Dan.
 

Googmeister

Chris said:
Googmeister said:
That's why I specifically said long strings. Arrays.sort uses insertion
sort for small inputs and mergesort/quicksort for large inputs. It
wouldn't be inconsistent to do something smart (e.g., Boyer-Moore) for
long strings since it can lead to dramatic performance improvements.

I think it would be at least as useful if it were provided for binary data too
(or even instead).

I'm not sure that your point about sorting carries over all that well. Sorting
is something that many apps do a lot of, and often the collection to be sorted
is big enough that insert-sort is not suitable. For a comparable situation to
apply to string searching (text or binary) searches of long strings would have
to be reasonably common -- and I'm not convinced that they are[*].

Yes, I do agree that fast sorting is typically more useful than
fast string search.
Especially
when, in practice, BM and naive searching both tend to operate in linear
time -- albeit with different constants.

No, BM runs in *sublinear* time on typical inputs -- that is O(N / M),
where N is the length of the text and M is the length of the pattern.
This is what makes it so useful (assuming you have the text already
in memory). It also comes with a linear time worst case guarantee
that protects against pathological inputs.
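For concreteness, a sketch of the Horspool variant of the family being discussed. It is simplified: the skip table is indexed by raw char values, and unlike full Boyer-Moore it carries no linear worst-case guarantee:

```java
import java.util.Arrays;

public class Horspool {
    // Boyer-Moore-Horspool: on a mismatch, shift the pattern by the distance
    // from the text character under the pattern's last position to that
    // character's rightmost occurrence in the pattern (excluding its last
    // char). On random text this skips ~M characters per alignment, giving
    // roughly N/M comparisons.
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        int[] shift = new int[65536];   // one entry per char value
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern.charAt(i)] = m - 1 - i;
        }
        for (int i = 0; i + m <= n; i += shift[text.charAt(i + m - 1)]) {
            int j = m - 1;
            while (j >= 0 && text.charAt(i + j) == pattern.charAt(j)) j--;
            if (j < 0) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("here is a simple example", "example")); // 17
    }
}
```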
 

Chris Uppal

Googmeister said:
No, BM runs in *sublinear* time on typical inputs -- that is O(N / M),

Oh I quite agree with the theoretical analysis. The position I was coming from
was twofold -- still is actually.

One point is that -- depending on circumstances -- "jumping" over bits of the
sequence to match may not buy you genuine sub-linear execution. For instance,
if you are streaming stuff off-disk, then unless the jumps are bigger than a
disk page, you will still have to read the data into memory in the OS's buffers
and/or the application's. Which is O(n). Or if the data is in main RAM, a
similar argument applies about fetching data into caches. If you end up
limited by disk I/O capacity, or by bandwidth to RAM, then reducing the number
of actual character (or byte) comparisons isn't cutting out the bottleneck
unless it /also/ reduces the amount of data that has to be moved over each
threshold.

The other point is that I would expect the strings you were searching /for/ to
be fixed for many applications (not always, of course). In such circumstances,
the expected execution time is O(n), simply because the O(1/m) bit is (for that
application) a constant. The app can't take advantage of the effect of a
bigger M because it has no longer strings that it wants to look for :-(

Hence my comment. I should have been more precise, I admit.

This is what makes it so useful (assuming you have the text already
in memory). It also comes with a linear time worst case guarantee
that protects against pathological inputs.

Agreed, it's a very nice family of algorithms.

-- chris
 

Robert M. Gary

Actually, that's not true. Strings are kinda like real objects and
kinda like primitives. For instance you can say
String foo = "hello" just like you can say int i = 5. You cannot say
StringBuffer foo = "hello" (because you must create the object first).

Here is the test though...

public class stringtest
{
    stringtest()
    {
        String foo = "hello";
        StringBuffer bar = new StringBuffer("hello");
        testString(foo);
        testBuffer(bar);
        System.out.println("String is " + foo);
        System.out.println("Buffer is " + bar);
    }

    private void testBuffer(StringBuffer bar)
    {
        bar.append("bye");
    }

    private void testString(String foo)
    {
        foo.toUpperCase();
    }

    public static void main(String[] args)
    {
        new stringtest();
    }
}

The output looks like

String is hello
Buffer is hellobye

Notice the string is left untouched (because it was passed by value)
but the StringBuffer was changed (because it was passed by reference).

-Robert
 

Thomas Hawtin

Robert said:
Actually, that's not true. Strings are kinda like real objects and
kinda like primitives. For instance you can say
String foo = "hello" just like you can say int i = 5. You cannot say
StringBuffer foo = "hello" (because you must create the object first).

You have created the "hello" String. It happens to be written as a
literal and interned. A bit of syntax for literals doesn't make Strings
in any way primitive.
Notice the string is left untouched (because it was passed by value)
but the StringBuffer was changed (because it was passed by reference).

String was left unchanged because you didn't mutate it. StringBuffer
changed because you mutated it.

If you were to call length on the StringBuffer instead of append, then
it wouldn't have changed. Similarly, if you had used reflection on the
String, then it may have changed. Synchronisation adds an interesting case.

Tom Hawtin
 

Roedy Green

One point is that -- depending on circumstances -- "jumping" over bits of the
sequence to match may not buy you genuine sub-linear execution. For instance,
if you are streaming stuff off-disk, then unless the jumps are bigger than a
disk page, you will still have to read the data into memory in the OS's buffers
and/or the application's. Which is O(n). Or if the data is in main RAM, a
similar argument applies about fetching data into caches. If you end up
limited by disk I/O capacity, or by bandwidth to RAM, then reducing the number
of actual character (or byte) comparisons isn't cutting out the bottleneck
unless it /also/ reduces the amount of data that has to be moved over each
threshold.

The other thing to understand is that mathematicians simplify. To them
all that counts in a sort is the number of compares or the number of
swaps. They completely ignore the other overhead. Ditto for Boyer
Moore. They are all excited about reducing the number of compares,
but that is just a fraction of what goes on. Hopping takes extra
time. So even though you save in the number of comparisons, you lose
it in the setup for the next comparison.

For the third time, I invite you to experiment with an actual
implementation and see for yourself that the speed up in ordinary use
is not that dramatic. See http://mindprod.com/products1.html#BOYER
 

Roedy Green

The String class is passed by value and results in
a complete copy every time you call a method.

A String is passed by value of the REFERENCE, not the entire String.
Only 32 bits (one reference, on a 32-bit JVM) get pushed onto the stack.
Strings are immutable. References share the same copy.
 

Roedy Green

Strings are kinda like real objects and
kinda like primitives.

Strings may have some native methods and they are final, but they are
ordinary objects.

They are special only in that you can create a String object with a
literal in the language and the + operator will work on them.

The + operator is just sugar to save you writing out the equivalent
StringBuffer code.

The literal feature is deeper. The byte code has a way of specially
recording String literals in a pool in the byte code.
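A sketch of that sugar: the two strings below are built equivalently, though the exact code javac emits may differ between compiler versions:

```java
public class ConcatDemo {
    public static void main(String[] args) {
        String name = "world";

        // Source-level concatenation with the + operator...
        String a = "hello, " + name + "!";

        // ...is roughly what you would otherwise write by hand with
        // StringBuffer:
        String b = new StringBuffer()
                .append("hello, ")
                .append(name)
                .append("!")
                .toString();

        System.out.println(a.equals(b)); // true
        // String literals live in the constant pool and are interned, so
        // identical literals share one object:
        System.out.println("hello" == "hello"); // true
    }
}
```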
 

Roedy Green

if you had used reflection on the
String, then it may have changed.

Strings are immutable. That is ironclad. The caller's String will
not change, and neither will the caller's reference to the String.

The only way I know of to change a String is to use JNI and once
inside cheat, taking advantage of C's total lack of restraints.
 
