Do java support 'String' methods properly ?

Discussion in 'Java' started by Red Orchid, Nov 8, 2005.

  1. Red Orchid

    Red Orchid Guest

    I think that if the methods of 'String' are not
    efficient, these exert a very bad influence on
    the performance of an application.

    Java do not support any String methods with ignoreCase
    except 'String.equalsIgnoreCase' and 'String.regionMatches'.

    Threrefore, I think that java leads a programmer to
    write the codes that are inefficient.

    In proof of my thinking, I will show examples bellow.

    <example_#1>
    String src = ..;
    String key = ..;

    ... = src.toLowerCase().indexOf(key.toLowerCase());


    #1 is very inefficient. because #1 converts the entire
    of 'src' to lowerCase though there is no need to do so.

    That is,
    if src = "abcdefghij" and key = "bc", there is need
    to convert only "abc" of src to lowerCase.

    The codes like the above are found very common in java
    application sources.


    <example_#2>
    'String.regionMatches' do not provide valuable aid for
    solving the inefficient of #1.

    Because ..
    the folowing code is a part of 'String.regionMatches'
    Assuming 'src.regionMatches(true, i, key, j, ...)'.


    if (ignoreCase) {
    ...
    char u1 = Character.toUpperCase(c1);
    char u2 = Character.toUpperCase(c2);
    if (u1 == u2) {
    continue;
    }
    ..

    Whenver 'regionMatches' is called, 'key' is converted
    to upperCase. it is inefficient.


    <example_#3>
    A programmer can write his method like the follows.

    String k = key.toLowerCase();
    for (int i = 0; i < src.length(); i++) {
    for (int j = 0; j < k.length(); j++) {
    char c = Character.toLowerCase(src.charAt(i + j));
    if (c != k.charAt(j)) {

    ....

    #3 calls 'String.charAt'. That is, #3 has the overhead
    of function call (method call). if his method exists
    inside 'String', he can access the field 'value[]' of
    'String' and then his method will not have the overhead.

    </example_*>


    I think that java do not support 'String' methods enough
    or properly.

    What is your opinion ?

    Thanks.
     
    Red Orchid, Nov 8, 2005
    #1
    1. Advertising

  2. Hi,

    Red Orchid wrote:
    > [String-operations are slow in Java]


    Well, in "normal" applications, String-operations are *not* the bottleneck.

    But *if* String-operations are a bottleneck, then even your idea of
    linear searching a String is a very bad idea.

    Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
    String searching algorithms!

    Ciao,
    Ingo
     
    Ingo R. Homann, Nov 8, 2005
    #2
    1. Advertising

  3. Red Orchid

    Roedy Green Guest

    On Tue, 08 Nov 2005 10:28:12 +0100, "Ingo R. Homann"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
    >String searching algorithms!


    See http://mindprod.com/jgloss/products1.html#BOYER

    Boyer-Moore does not make that much difference. I think they would pay
    off if you were simultaneously searching for many different strings.
    My implementation does NOT handle that.

    A string search can be done very efficiently in a native method which
    is what Jet does.

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 8, 2005
    #3
  4. Red Orchid wrote:
    > I think that if the methods of 'String' are not
    > efficient, these exert a very bad influence on
    > the performance of an application.
    >
    > Java do not support any String methods with ignoreCase
    > except 'String.equalsIgnoreCase' and 'String.regionMatches'.
    >
    > Threrefore, I think that java leads a programmer to
    > write the codes that are inefficient.
    >
    > In proof of my thinking, I will show examples bellow.
    >
    > <example_#1>
    > String src = ..;
    > String key = ..;
    >
    > .. = src.toLowerCase().indexOf(key.toLowerCase());
    >
    >
    > #1 is very inefficient. because #1 converts the entire
    > of 'src' to lowerCase though there is no need to do so.


    Inefficient, possibly. Very inefficient, no.

    Remember that (short lived) memory allocation is very cheap. Also
    String.toLowerCase will not create a new String if the source String is
    already lower case.

    In general if you split code into passes then each pass becomes compact
    and fast. If you try to do everything at once, then it rapidly becomes a
    mess. It also leads to code duplication.

    > I think that java do not support 'String' methods enough
    > or properly.


    It's a complicated area. And I'll probably make some mistakes...

    There are plenty of inconsistencies. equalsIgnoreCase works on a
    character basis. String.toUpperCase may create a String with a different
    number of chars.

    Take the German eszet (<http://en.wikipedia.org/wiki/Eszet>).
    "Stra\u00dfe".toUpperCase() should give "STRASSE". Both literal strings
    should be equal ignoring case, but equalsIgnoreCase is broken.

    Another thing that is usually done incorrectly (including within Java)
    is that toUpperCase and toLowerCase almost always should have a
    specified Locale. For instance "i".toUpperCase() does not always give
    "I" (Turkish will give "\u0130"). It seems that lots of programs exhibit
    bugs when run in Turkish.

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Nov 8, 2005
    #4
  5. Red Orchid

    Googmeister Guest

    Ingo R. Homann wrote:
    > Hi,
    >
    > Red Orchid wrote:
    > > [String-operations are slow in Java]



    > Well, in "normal" applications, String-operations are *not* the bottleneck.


    Yes, I agree here. Strings are very convenient to use in Java (as
    compared to, say, C where they're associated with buffer overflow
    errors). The tradeoff is reduced efficiency and increased memory
    usage. If your bottleneck really is dealing with strings, you can
    always go back to using an array of bytes or chars, and you're even
    free to use the convention of terminating them with '\0'.

    > But *if* String-operations are a bottleneck, then even your idea of
    > linear searching a String is a very bad idea.


    This is how Java implements indexOf() as well. It is efficient (linear)
    for "typical" inputs, but has a quadratic worst case.

    > Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
    > String searching algorithms!


    Anyone know why Java's indexOf doesn't use Boyer-Moore
    (sublinear on typical inputs) for long strings? I suspect it's a
    combination of requiring extra space proportional to the alphabet
    size and difficulties in dealing with the subtleties of Unicode
    (e.g., code points). But it would be nice to hear from the experts.
     
    Googmeister, Nov 8, 2005
    #5
  6. Hi,

    Googmeister wrote:
    > This is how Java implements indexOf() as well. It is efficient (linear)
    > for "typical" inputs, but has a quadratic worst case.


    Depends on what is 'typical' for you. :)

    (BTW, linear search is not really quadratic, but M*N.)

    >>Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
    >>String searching algorithms!

    >
    > Anyone know why Java's indexOf doesn't use Boyer-Moore
    > (sublinear on typical inputs) for long strings? I suspect it's a
    > combination of requiring extra space proportional to the alphabet
    > size and difficulties in dealing with the subtleties of Unicode
    > (e.g., code points). But it would be nice to hear from the experts.


    IIRC, all algorithms (linear, BMH, KMP) have different advantages,
    depending on the sizes of the text and the pattern, the size of the
    alphabet, the distribution of the characters and if you are searching a
    pattern in different texts (do not forget that you need to generate a
    "suffix-list" for the pattern)!

    If the both the text and the pattern are very short (perhaps only 10 to
    20 chars?), it is obvious that linear search ist the best. If the
    alphabet is very short (eg only A and B), the pattern might be AAAAA, so
    that KMP (*) has no advantage over linear search as well.

    Fact is: You simply cannot decide which algorithm is 'best' if you do
    not know *anything* about the application (which of course sun can't).

    And if you want to implement a large text-database or a gene-sequencer,
    it is obvious that you have to think *a lot* about searching-algorithms,
    so that *no* implementation (that has not been optimized especially for
    the exact application (**)) is perfect.

    Ciao,
    Ingo

    (*) It has been nearly 10 years since I implemented these algorithms,
    and I remember there were some slight differences between BM and BMH
    wich I do not remember right now. So, forgive me, if I confuse the
    algorithms...

    (**) In these cases, it might often be useful if you do not have the
    whole text in RAM, but e.g. to have a linked list of "chunks" of texts,
    with one chunk perhaps of the size of some MB. The rest remains on disk.
    No default-implementation can deal with that!
     
    Ingo R. Homann, Nov 8, 2005
    #6
  7. Red Orchid

    Googmeister Guest

    Ingo R. Homann wrote:
    > Hi,
    >
    > Googmeister wrote:
    > > This is how Java implements indexOf() as well. It is efficient (linear)
    > > for "typical" inputs, but has a quadratic worst case.

    >
    > Depends on what is 'typical' for you. :)


    English text.

    > (BTW, linear search is not really quadratic, but M*N.)


    Yes, in the worst case, this is quadratic in the input size.
    The worst case occurs when M = N/2.

    > >>Google for Knuth-Morris-Pratt and Boyer-Moore-Horspool for *efficient*
    > >>String searching algorithms!

    > >
    > > Anyone know why Java's indexOf doesn't use Boyer-Moore
    > > (sublinear on typical inputs) for long strings? I suspect it's a
    > > combination of requiring extra space proportional to the alphabet
    > > size and difficulties in dealing with the subtleties of Unicode
    > > (e.g., code points). But it would be nice to hear from the experts.

    >
    > IIRC, all algorithms (linear, BMH, KMP) have different advantages,
    > depending on the sizes of the text and the pattern, the size of the
    > alphabet, the distribution of the characters and if you are searching a
    > pattern in different texts (do not forget that you need to generate a
    > "suffix-list" for the pattern)!


    That's why I specifically said long strings. Arrays.sort uses insertion
    sort for small inputs and mergesort/quicksort for large inputs. It
    wouldn't
    be inconsistent to do something smart (e.g., Boyer-Moore) for long
    strings since it can lead to dramatic performance improvements.

    > Fact is: You simply cannot decide which algorithm is 'best' if you do
    > not know *anything* about the application (which of course sun can't).


    Sun provides algorithms like mergesort and red-black trees. We
    certainly expect reasonable efficient algorithms here, even though
    they're not always the best for any particular application. My question
    is why not use Boyer-Moore for long strings, where it is superior to
    brute force, and in most common cases vastly superior?
     
    Googmeister, Nov 8, 2005
    #7
  8. Red Orchid

    Roedy Green Guest

    On 8 Nov 2005 05:49:11 -0800, "Googmeister" <>
    wrote, quoted or indirectly quoted someone who said :

    > why not use Boyer-Moore for long strings, where it is superior to
    >brute force, and in most common cases vastly superior?


    I wrote one and experimented. The catch is you usually find the
    target string pretty early on, so Boyer Moore never gets a chance to
    overcome its startup overhead.

    See http://mindprod.com/products1.html#BOYER
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 8, 2005
    #8
  9. Hi,

    Googmeister wrote:
    > That's why I specifically said long strings. Arrays.sort uses insertion
    > sort for small inputs and mergesort/quicksort for large inputs. It
    > wouldn't
    > be inconsistent to do something smart (e.g., Boyer-Moore) for long
    > strings since it can lead to dramatic performance improvements.
    > ...
    > Sun provides algorithms like mergesort and red-black trees. We
    > certainly expect reasonable efficient algorithms here, even though
    > they're not always the best for any particular application. My question
    > is why not use Boyer-Moore for long strings, where it is superior to
    > brute force, and in most common cases vastly superior?


    That would indeed be possible. But - as I said in my last posting:

    (1) I think that if you want to deal with really large texts, you will
    have to implement an own (chunked) String-class, highly optimized to
    your specific application, so that you cannot use String.indexOf at all.

    (2) String.indexOf is not a bottleneck in most typical applications.

    So, it may not be worth implementing.

    Or, perhaps sun hasn't heard of BMH at all. ;-)

    Ciao,
    Ingo
     
    Ingo R. Homann, Nov 8, 2005
    #9
  10. Red Orchid

    Chris Uppal Guest

    Googmeister wrote:

    > That's why I specifically said long strings. Arrays.sort uses insertion
    > sort for small inputs and mergesort/quicksort for large inputs. It
    > wouldn't
    > be inconsistent to do something smart (e.g., Boyer-Moore) for long
    > strings since it can lead to dramatic performance improvements.


    I think it would be at least as useful if it were provided for binary data too
    (or even instead).

    I'm not sure that your point about sorting carries over all that well. Sorting
    is something that many apps do a lot of, and often the collection to be sorted
    is big enough that insert-sort is not suitable. For a comparable situation to
    apply to string searching (text or binary) searches of long strings would have
    to be reasonably common -- and I'm not convinced that they are[*]. Especially
    when, in practise, BM and naive searching both tend to operate in linear
    time -- albeit with different constants. And then when you consider that the
    long strings you are searching must have come /from/ somewhere, with attendent
    I/O costs, it is probably quite hard to find an application for which string
    matching is a genuine bottleneck[**].

    (
    [*] Although that might be as much a result of the lack a fast string search as
    a cause of it.

    [**] I'm not saying there aren't any, just that I wouldn't expect there to be
    enough to push Sun into giving string matching high priority.
    )

    -- chris
     
    Chris Uppal, Nov 8, 2005
    #10
  11. If you profile your application, I would be surprised if you found
    String manipulation your bottleneck. If you are doing any I/O at all, I
    would look at making that faster.

    BTW: If you are creating a string to pass to other methods, consider
    using StringBuffer instead. Its a first class object and therefore
    passed by reference. The String class is passed by value and results in
    a complete copy everytime you call a method.

    -Robert
     
    Robert M. Gary, Nov 8, 2005
    #11
  12. Red Orchid

    Daniel Dyer Guest

    On Tue, 08 Nov 2005 17:01:27 -0000, Robert M. Gary <>
    wrote:
    >
    > BTW: If you are creating a string to pass to other methods, consider
    > using StringBuffer instead. Its a first class object and therefore
    > passed by reference. The String class is passed by value and results in
    > a complete copy everytime you call a method.


    This is simply not true. There is no difference between the way Strings
    and StringBuffers are passed and no concept of "first class object" versus
    some other class of object. Everything in Java is passed by value,
    whether it's a primitive or an object reference (note that it's the
    reference that is passed, not the object itself). Strings are not copied
    as they are passed around and the fact that they are immutable makes them
    suitable for sharing.

    Dan.

    --
    Daniel Dyer
    http://www.dandyer.co.uk
     
    Daniel Dyer, Nov 8, 2005
    #12
  13. Red Orchid

    Googmeister Guest

    Chris Uppal wrote:
    > Googmeister wrote:
    >
    > > That's why I specifically said long strings. Arrays.sort uses insertion
    > > sort for small inputs and mergesort/quicksort for large inputs. It
    > > wouldn't
    > > be inconsistent to do something smart (e.g., Boyer-Moore) for long
    > > strings since it can lead to dramatic performance improvements.

    >
    > I think it would be at least as useful if it were provided for binary data too
    > (or even instead).
    >
    > I'm not sure that your point about sorting carries over all that well. Sorting
    > is something that many apps do a lot of, and often the collection to be sorted
    > is big enough that insert-sort is not suitable. For a comparable situation to
    > apply to string searching (text or binary) searches of long strings would have
    > to be reasonably common -- and I'm not convinced that they are[*].


    Yes, I do agree that fast sorting is typically more useful than
    fast string search.

    > Especially
    > when, in practise, BM and naive searching both tend to operate in linear
    > time -- albeit with different constants.


    No, BM runs in *sublinear* time on typical inputs -- that is O(N / M),
    where N is the length of the text and M is the length of the pattern.
    This is what makes it so useful (assuming you have the text already
    in memory). It also comes with a linear time worst case guarantee
    that protects against pathological inputs.
     
    Googmeister, Nov 8, 2005
    #13
  14. Red Orchid

    Chris Uppal Guest

    Googmeister wrote:

    > > Especially
    > > when, in practise, BM and naive searching both tend to operate in linear
    > > time -- albeit with different constants.

    >
    > No, BM runs in *sublinear* time on typical inputs -- that is O(N / M),


    Oh I quite agree with the theoretical analysis. The position I was coming from
    was twofold -- still is actually.

    One point is that -- depending on circumstances -- "jumping" over bits of the
    sequence to match may not buy you genuine sub-linear execution. For instance,
    if you are streaming stuff off-disk, then unless the jumps are bigger than a
    disk page, you will still have to read the data into memory in the OS's buffers
    and/or the application's. Which is O(n). Or if the data is in main RAM, a
    similar argument applies about fetching data into caches. If you end up
    limited by disk I/O capacity, or by bandwidth to RAM, then reducing the number
    of actual character (or byte) comparisons isn't cutting out the bottleneck
    unless it /also/ reduces the amount of data that has to be moved over each
    threshold.

    The other point is that I would expect the strings you were searching /for/ to
    be fixed for many applications (not always, of course). In such circumstances,
    the expected execution time is O(n), simply because the O(1/m) bit is (for that
    application) a constant. The app can't take advantage of the effect of bigger
    M because there are no longer strings that it wants to look for :-(

    Hence my comment. I should have been more precise, I admit.


    > This is what makes it so useful (assuming you have the text already
    > in memory). It also comes with a linear time worst case guarantee
    > that protects against pathological inputs.


    Agreed, it's a very nice family of algorithms.

    -- chris
     
    Chris Uppal, Nov 8, 2005
    #14
  15. Actually, that's not true. Strings are kinda like real objects and
    kinda like primatives. For instance you can say
    String foo = "hello" just like you can say int i =5. You cannot say
    StringBuffer foo = "hello" (because you must create the object first).

    Here is the test though...
    public class stringtest
    {

    stringtest()
    {
    String foo = "hello";
    StringBuffer bar = new StringBuffer("hello");
    testString (foo);
    testBuffer(bar);
    System.out.println("String is "+foo);
    System.out.println("Buffer is "+bar);
    }

    private void testBuffer(StringBuffer bar)
    {
    bar.append("bye");
    }

    private void testString(String foo)
    {
    foo.toUpperCase();
    }

    public static void main(String[] args)
    {
    new stringtest();
    }
    }

    The out put looks like
    String is hello
    Buffer is hellobye

    Notice the string is left untouched (because it was passed by value)
    but the StringBuffer was changed (because it was passed by reference).

    -Robert
     
    Robert M. Gary, Nov 8, 2005
    #15
  16. Robert M. Gary wrote:
    > Actually, that's not true. Strings are kinda like real objects and
    > kinda like primatives. For instance you can say
    > String foo = "hello" just like you can say int i =5. You cannot say
    > StringBuffer foo = "hello" (because you must create the object first).


    You have created the "hello" String. It happens to be written as a
    literal and interned. A bit of syntax for literals doesn't make Strings
    in anyway primitive.

    > Notice the string is left untouched (because it was passed by value)
    > but the StringBuffer was changed (because it was passed by reference).


    String was left unchanged because you didn't mutate it. StringBuffer
    changed because you mutated it.

    If you were to call length on the StringBuffer instead of append, then
    it wouldn't have changed. Similarly, if you had used reflection of the
    String, then it may have changed. Synchronisation adds an interesting case.

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Nov 9, 2005
    #16
  17. Red Orchid

    Roedy Green Guest

    On Tue, 8 Nov 2005 19:55:57 -0000, "Chris Uppal"
    <-THIS.org> wrote, quoted or indirectly
    quoted someone who said :

    >One point is that -- depending on circumstances -- "jumping" over bits of the
    >sequence to match may not buy you genuine sub-linear execution. For instance,
    >if you are streaming stuff off-disk, then unless the jumps are bigger than a
    >disk page, you will still have to read the data into memory in the OS's buffers
    >and/or the application's. Which is O(n). Or if the data is in main RAM, a
    >similar argument applies about fetching data into caches. If you end up
    >limited by disk I/O capacity, or by bandwidth to RAM, then reducing the number
    >of actual character (or byte) comparisons isn't cutting out the bottleneck
    >unless it /also/ reduces the amount of data that has to be moved over each
    >threshold.


    The other thing to understand is that mathematicians simplify. To them
    all that counts in a sort are the number of compares or number of
    swaps. They completely ignore the other overhead. Ditto for Boyer
    Moore. They are all excited about reducing the number of compares,
    but that is just a fraction of what goes on. Hopping taking extra
    time. So even though you save in number of comparisons you lose it
    setup for the next comparison.

    For the third time, I invite you to experiment with an actual
    implementation and see for yourself that the speed up in ordinary use
    is not that dramatic. See http://mindprod.com/products1.html#BOYER
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 9, 2005
    #17
  18. Red Orchid

    Roedy Green Guest

    On 8 Nov 2005 09:01:27 -0800, "Robert M. Gary" <>
    wrote, quoted or indirectly quoted someone who said :

    >The String class is passed by value and results in
    >a complete copy everytime you call a method.


    a String is passed by value of the REFERENCE not the entire String.
    Only 32 bits get pushed to the stack. Strings are immutable.
    References share the same copy.

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 9, 2005
    #18
  19. Red Orchid

    Roedy Green Guest

    On 8 Nov 2005 15:38:57 -0800, "Robert M. Gary" <>
    wrote, quoted or indirectly quoted someone who said :

    > Strings are kinda like real objects and
    >kinda like primatives.


    Strings may have some native methods and they are final, but they are
    ordinary objects.

    They are special only in that you can create a String object with a
    literal in the language and the + operator will work on them.

    The + operator is just sugar to save you writing out the equivalent
    StringBuffer code.

    The literal feature is deeper. The byte code has a way of specially
    recording String literals in a pool in the byte code.

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 9, 2005
    #19
  20. Red Orchid

    Roedy Green Guest

    On Wed, 09 Nov 2005 01:07:40 +0000, Thomas Hawtin
    <> wrote, quoted or indirectly quoted someone
    who said :

    >if you had used reflection of the
    >String, then it may have changed.


    Strings are immutable. That is iron clad. The caller's String will
    not change neither will the caller's reference to the String.

    The only way I know of to change a String is to use JNI and once
    inside cheat, taking advantage of C's total lack of restraints.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 9, 2005
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    2
    Views:
    436
    Larry Barowski
    Jan 31, 2006
  2. cwdjrxyz
    Replies:
    2
    Views:
    822
    Beauregard T. Shagnasty
    Sep 24, 2010
  3. albert kao
    Replies:
    12
    Views:
    615
    Roedy Green
    Oct 7, 2011
  4. Kenneth McDonald
    Replies:
    5
    Views:
    345
    Kenneth McDonald
    Sep 26, 2008
  5. Aaron Gray
    Replies:
    5
    Views:
    118
    Thomas 'PointedEars' Lahn
    Jul 22, 2008
Loading...

Share This Page