String containing algorithm

Discussion in 'Java' started by JS, Nov 25, 2005.

  1. JS

    JS Guest

    Hi all,
    Can anyone help me with a small problem I'm having. I need to write a method
    which takes two Strings of any lengths and returns any letters which are
    common between the two. The Strings will be something like ABECA and they
    could be of any length.

    For example:
    ABECA and ECDRE would return EC, I dont mind about the order which they come
    in because I am processing them further anyway.

    I just cant get any algorithms to work on it, not even brute force.
    Any help is appreciated, thanks in advance.
    JS
     
    JS, Nov 25, 2005
    #1
    1. Advertising

  2. JS

    zero Guest

    "JS" <> wrote in
    news:TSIhf.4040$:

    > Hi all,
    > Can anyone help me with a small problem I'm having. I need to write a
    > method which takes two Strings of any lengths and returns any letters
    > which are common between the two. The Strings will be something like
    > ABECA and they could be of any length.
    >
    > For example:
    > ABECA and ECDRE would return EC, I dont mind about the order which
    > they come in because I am processing them further anyway.
    >
    > I just cant get any algorithms to work on it, not even brute force.
    > Any help is appreciated, thanks in advance.
    > JS
    >
    >
    >


    This sounds suspiciously like a homework assignment.

    Here's a possible algorithm (no code, I'm sure you can handle that
    yourself)

    prerequisites: Strings A and B of arbitrary length; empty string (or
    stringbuffer) C.

    1. take the first character in string A, and compare it to every
    character in B
    1a. alternative to 1: take the first character in string A, turn it into
    a CharSequence, and use String method contains()
    2. if you have a match, add it to a C
    3. take the next character in A, and we're back at 1.

    If you want a case-insensitive algorithm just add toLowerCase or
    toUpperCase.

    If you want to weed out duplicates, check if the current character is
    already contained in C.

    --
    Beware the False Authority Syndrome
     
    zero, Nov 25, 2005
    #2
    1. Advertising

  3. JS

    Roedy Green Guest

    On Fri, 25 Nov 2005 18:38:43 GMT, "JS" <>
    wrote, quoted or indirectly quoted someone who said :

    >For example:
    >ABECA and ECDRE would return EC, I dont mind about the order which they come
    >in because I am processing them further anyway.


    I would do it with two java.util.BitSets
    each 64K bits long (possibly shorter if you can guarantee a narrower
    char range.). Go through String a turning on bits corresponding to
    chars in bita. (index bit by char number). Then repeat with b and
    bitb.

    Then compute the logical AND of bita and bitb.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 25, 2005
    #3
  4. JS

    Roedy Green Guest

    On Fri, 25 Nov 2005 18:38:43 GMT, "JS" <>
    wrote, quoted or indirectly quoted someone who said :

    >For example:
    >ABECA and ECDRE would return EC, I dont mind about the order which they come
    >in because I am processing them further anyway.


    another approach is to create a HashSet of the chars in string a. Then
    create another HashSet for String b. Then enumerate HashSet b and
    remove els that don't exist in set a.

    Another approach to is use a nested loop

    pseudocode:

    for each char in string a
    for each char in string b

    if achar == bchar add to HashSet if not already there.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 25, 2005
    #4
  5. JS

    Alan Krueger Guest

    JS wrote:
    > Hi all,
    > Can anyone help me with a small problem I'm having. I need to write a method
    > which takes two Strings of any lengths and returns any letters which are
    > common between the two. The Strings will be something like ABECA and they
    > could be of any length.
    >
    > For example:
    > ABECA and ECDRE would return EC, I dont mind about the order which they come
    > in because I am processing them further anyway.


    Please clarify. Your example fits your text, but it also matches a
    "longest common substring" search. The answers you've received so far
    appear to address both of these.

    For instance, what happens if you use "ABCDEF" and "FEDCBA"? If it's
    all the letters in common in both strings, you'll get "ABCDEF" in some
    permutation. If it's a longest common substring, you'll get one of the
    letters.

    If you're truly just searching for all letters in common between the two
    strings, you might want to use a HashSet<Character> for each string, add
    each character in each string to its respective HashSet, and obtain
    the set of common characters by intersecting the sets.
     
    Alan Krueger, Nov 25, 2005
    #5
  6. JS

    Alan Krueger Guest

    Roedy Green wrote:
    > On Fri, 25 Nov 2005 18:38:43 GMT, "JS" <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >>For example:
    >>ABECA and ECDRE would return EC, I dont mind about the order which they come
    >>in because I am processing them further anyway.

    >
    > another approach is to create a HashSet of the chars in string a. Then
    > create another HashSet for String b. Then enumerate HashSet b and
    > remove els that don't exist in set a.


    Set.retainAll already does that last bit.

    > Another approach to is use a nested loop
    >
    > pseudocode:
    >
    > for each char in string a
    > for each char in string b
    > if achar == bchar add to HashSet if not already there.


    That's unnecessarily O(n^2) - not horrible for a quick-and-dirty, but to
    be avoided in production and homework assignments. Sorting them and
    then linearly merging them would be O(n log n), and the HashSet approach
    is O(n) when the hash table is large enough and the hash function is
    well-tuned to the data.
     
    Alan Krueger, Nov 25, 2005
    #6
  7. JS

    Alan Krueger Guest

    Roedy Green wrote:
    > On Fri, 25 Nov 2005 18:38:43 GMT, "JS" <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >>For example:
    >>ABECA and ECDRE would return EC, I dont mind about the order which they come
    >>in because I am processing them further anyway.

    >
    > I would do it with two java.util.BitSets
    > each 64K bits long (possibly shorter if you can guarantee a narrower
    > char range.). Go through String a turning on bits corresponding to
    > chars in bita. (index bit by char number). Then repeat with b and
    > bitb.


    A sparse set implementation would be better; if you have string lengths
    far below 64K, it's not going to use very much of that space.

    Plus, while Java represents characters internally in UTF-16, there are
    far more than 2^16 code points possible.
     
    Alan Krueger, Nov 25, 2005
    #7
  8. JS

    Roedy Green Guest

    On Fri, 25 Nov 2005 16:16:12 -0600, Alan Krueger
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Plus, while Java represents characters internally in UTF-16, there are
    >far more than 2^16 code points possible.


    char is 16 bits so there can't possibly be more than 2^16 = 64K
    combinations. With a java.util.BitSet you can track presence with 8K
    bytes worth of bits (stored as longs in a BitSet).

    most of the time you have an upper bound on your unicode chars
    considerably lower than 64K.

    The advantage of BitSet is the speed of direct addressing. Any sparce
    scheme will have a lot of lookup overhead.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 25, 2005
    #8
  9. JS

    Alan Krueger Guest

    Roedy Green wrote:
    > On Fri, 25 Nov 2005 16:16:12 -0600, Alan Krueger
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >>Plus, while Java represents characters internally in UTF-16, there are
    >>far more than 2^16 code points possible.

    >
    > char is 16 bits so there can't possibly be more than 2^16 = 64K
    > combinations.


    There cannot be more than 2^16 char values, but there are far more than
    2^16 Unicode code points, depending on your version of Unicode.

    http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

    "Supplementary characters are characters in the Unicode
    standard whose code points are above U+FFFF, and which
    therefore cannot be described as single 16-bit entities
    such as the char data type in the Java programming
    language. [...]

    "These are now interpreted as UTF-16 sequences, and the
    implementations of these APIs is changed to correctly
    handle supplementary characters. The enhancements are
    part of version 5.0 of the Java 2 Platform, Standard
    Edition (J2SE) [...]

    "The Unicode standard therefore has been extended to allow
    up to 1,112,064 characters."

    Because it's UTF-16, these characters are serialized into 16-bit
    character sequences, which means that not every character represents a
    single Unicode code point, sometimes multiple characters are used.

    > most of the time you have an upper bound on your unicode chars
    > considerably lower than 64K.
    >
    > The advantage of BitSet is the speed of direct addressing. Any sparce
    > scheme will have a lot of lookup overhead.


    No, a properly-sized HashSet will have CONSTANT TIME lookup, just like a
    lookup table.
     
    Alan Krueger, Nov 26, 2005
    #9
  10. JS

    JS Guest

    Thanks everyone for your help. The easiest one sounds like zeros idea which
    should work a treat. I'm not worried about the time of the algorithm because
    it isnt for submission or publication anywhere, and for those of you still
    left wondering, sorry about my explaination, i meant any letters in any
    order regardless of where they are.
    Thanks again
    JS
    "Alan Krueger" <> wrote in message
    news:nLLhf.764$...
    > JS wrote:
    > > Hi all,
    > > Can anyone help me with a small problem I'm having. I need to write a

    method
    > > which takes two Strings of any lengths and returns any letters which are
    > > common between the two. The Strings will be something like ABECA and

    they
    > > could be of any length.
    > >
    > > For example:
    > > ABECA and ECDRE would return EC, I dont mind about the order which they

    come
    > > in because I am processing them further anyway.

    >
    > Please clarify. Your example fits your text, but it also matches a
    > "longest common substring" search. The answers you've received so far
    > appear to address both of these.
    >
    > For instance, what happens if you use "ABCDEF" and "FEDCBA"? If it's
    > all the letters in common in both strings, you'll get "ABCDEF" in some
    > permutation. If it's a longest common substring, you'll get one of the
    > letters.
    >
    > If you're truly just searching for all letters in common between the two
    > strings, you might want to use a HashSet<Character> for each string, add
    > each character in each string to its respective HashSet, and obtain
    > the set of common characters by intersecting the sets.
     
    JS, Nov 26, 2005
    #10
  11. Alan Krueger wrote:
    > Roedy Green wrote:
    >>
    >> char is 16 bits so there can't possibly be more than 2^16 = 64K
    >> combinations.


    > http://java.sun.com/developer/technicalArticles/Intl/Supplementary/


    > "The Unicode standard therefore has been extended to allow
    > up to 1,112,064 characters."


    That is code points 0-0x10ffff less 0xd800-0xdfff (which are used as
    surrogate pairs to create the post-16-bit values).

    The quoted article doesn't show how to iterate over code points of a
    String, which goes something like:

    final int num = str.length();
    for (int off=0; off<num; ) {
    int codePoint = str.codePointAt(off);
    ...
    off += Character.charCount(codePoint);
    }

    What a mess.

    >> The advantage of BitSet is the speed of direct addressing. Any sparce
    >> scheme will have a lot of lookup overhead.

    >
    >
    > No, a properly-sized HashSet will have CONSTANT TIME lookup, just like a
    > lookup table.


    You call it CONSTANT TIME; I call it ORDER OF MAGNITUDE.

    For the typical case of US ASCII characters, the BitSet method will be
    contained within a mere 16 bytes or so of data. BitSet wins hands down
    there.

    In the worst case scenario, all characters appear. BitSet will expand to
    actively use around 136 K. A HashSet will be using, say 50 MB. One of
    these approaches will be more cache-friendly.

    In some ways it is similar to the ArrayList vs LinkedList trade-off.

    Where HashSet may win is for very short strings with the odd high code
    point, particularly if the BitSet is not kept.

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Nov 26, 2005
    #11
  12. JS

    Alan Krueger Guest

    Thomas Hawtin wrote:
    > Alan Krueger wrote:
    >
    >> Roedy Green wrote:
    >>
    >>>
    >>> char is 16 bits so there can't possibly be more than 2^16 = 64K
    >>> combinations.

    >
    >
    >> http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

    >
    >
    >> "The Unicode standard therefore has been extended to allow
    >> up to 1,112,064 characters."

    >
    >
    > That is code points 0-0x10ffff less 0xd800-0xdfff (which are used as
    > surrogate pairs to create the post-16-bit values).
    >
    > The quoted article doesn't show how to iterate over code points of a
    > String, which goes something like:
    >
    > final int num = str.length();
    > for (int off=0; off<num; ) {
    > int codePoint = str.codePointAt(off);
    > ...
    > off += Character.charCount(codePoint);
    > }
    >
    > What a mess.
    >
    >>> The advantage of BitSet is the speed of direct addressing. Any sparce
    >>> scheme will have a lot of lookup overhead.

    >>
    >>
    >>
    >> No, a properly-sized HashSet will have CONSTANT TIME lookup, just like
    >> a lookup table.

    >
    >
    > You call it CONSTANT TIME; I call it ORDER OF MAGNITUDE.


    It's still an amortised O(c) time algorithm. You're talking about space
    which is a different issue.

    > For the typical case of US ASCII characters, the BitSet method will be
    > contained within a mere 16 bytes or so of data. BitSet wins hands down
    > there.
    >
    > In the worst case scenario, all characters appear. BitSet will expand to
    > actively use around 136 K. A HashSet will be using, say 50 MB. One of
    > these approaches will be more cache-friendly.


    In the worst case of the OP's problem, remember that you also have two
    strings which combined fill at least 1.1 million code points of at least
    16 bits each. If the string has duplication, we're already talking
    about many megabytes of string to walk, which isn't cache-friendly either.

    > In some ways it is similar to the ArrayList vs LinkedList trade-off.
    >
    > Where HashSet may win is for very short strings with the odd high code
    > point, particularly if the BitSet is not kept.


    It also helps that BitSet expands as needed, but one little outlier will
    make the BitSet expand to its worst-case size, whereas HashSet will only
    expand when it runs out of room.
     
    Alan Krueger, Nov 26, 2005
    #12
  13. JS

    Roedy Green Guest

    On Fri, 25 Nov 2005 19:40:29 -0600, Alan Krueger
    <> wrote, quoted or indirectly quoted someone
    who said :

    >There cannot be more than 2^16 char values, but there are far more than
    >2^16 Unicode code points, depending on your version of Unicode.
    >
    >http://java.sun.com/developer/technicalArticles/Intl/Supplementary/


    Here is my new understanding of codepoints. I was under the delusion
    the 32-bit stuff had no effect on ordinary String and char[]

    See http://mindprod.com/jgloss/codepoint.html for the formatted
    version:

    Please speak now, and at any time in future you notice an error:

    Unicode started out as a 16-bit code with 64K possible characters. At
    first, this seemed more than enough to encode all the world's
    alphabets. Unicode was such a success, scholars also soon wanted to
    encode dead scripts such as cuneiform as well, and soon all the slots
    were full.

    Unicode was extended to 32 bits, with the corresponding UTF-16
    encoding also extended with a clumsy system of surrogate characters to
    encode the 32-bit characters above 0xffff.

    The term codepoint in Java tends to be used to mean a slot in the
    32-bit Unicode assignment, though I suspect the term is also valid to
    mean a spot in Unicode-16 or any other character set.

    Java now straddles the 16-bit and 32-bit worlds. You might think Java
    would now have a 32-bit analog to Character, perhaps called CodePoint,
    and a 32-bit analog to String, perhaps called CodePoints, but it does
    not. Instead, Strings and char[] are permitted to contain surrogate
    pairs which encode a single high-32-bit codepoint.

    StringBuilder.append( int codepoint ) will accept 32-bit codepoints to
    append.

    FontMetrics.charWidth( int codepoint ) will tell you the width in
    pixels to render a given codepoint.

    Character.isValidCodePoint( int codepoint ) will tell you if there is
    a glyph assigned to that codepoint. That is still no guarantee your
    Font will render it though. Character. codePointAt and codePointBefore
    let you deal with 32-bit codepoints encoded as surrogate pairs in char
    arrays. Most of the Character methods now have a version that accepts
    an int codepoint such as toLowerCase.




    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 26, 2005
    #13
  14. JS

    IchBin Guest

    Roedy Green wrote:
    > On Fri, 25 Nov 2005 19:40:29 -0600, Alan Krueger
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >> There cannot be more than 2^16 char values, but there are far more than
    >> 2^16 Unicode code points, depending on your version of Unicode.
    >>
    >> http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

    >
    > Here is my new understanding of codepoints. I was under the delusion
    > the 32-bit stuff had no effect on ordinary String and char[]
    >
    > See http://mindprod.com/jgloss/codepoint.html for the formatted
    > version:
    >
    > Please speak now, and at any time in future you notice an error:
    >
    > Unicode started out as a 16-bit code with 64K possible characters. At
    > first, this seemed more than enough to encode all the world's
    > alphabets. Unicode was such a success, scholars also soon wanted to
    > encode dead scripts such as cuneiform as well, and soon all the slots
    > were full.
    >
    > Unicode was extended to 32 bits, with the corresponding UTF-16
    > encoding also extended with a clumsy system of surrogate characters to
    > encode the 32-bit characters above 0xffff.
    >
    > The term codepoint in Java tends to be used to mean a slot in the
    > 32-bit Unicode assignment, though I suspect the term is also valid to
    > mean a spot in Unicode-16 or any other character set.
    >
    > Java now straddles the 16-bit and 32-bit worlds. You might think Java
    > would now have a 32-bit analog to Character, perhaps called CodePoint,
    > and a 32-bit analog to String, perhaps called CodePoints, but it does
    > not. Instead, Strings and char[] are permitted to contain surrogate
    > pairs which encode a single high-32-bit codepoint.
    >
    > StringBuilder.append( int codepoint ) will accept 32-bit codepoints to
    > append.
    >
    > FontMetrics.charWidth( int codepoint ) will tell you the width in
    > pixels to render a given codepoint.
    >
    > Character.isValidCodePoint( int codepoint ) will tell you if there is
    > a glyph assigned to that codepoint. That is still no guarantee your
    > Font will render it though. Character. codePointAt and codePointBefore
    > let you deal with 32-bit codepoints encoded as surrogate pairs in char
    > arrays. Most of the Character methods now have a version that accepts
    > an int codepoint such as toLowerCase.


    Sorry guys but not to go off into bit land but I threw this together.
    What is wrong with doing it this way..

    public String stringMatch (String in1, String in2, boolean mixedCase)
    {
    String key = in1;
    String target = in2;
    String foundMatch = "";

    if (in1.length()>in2.length()){
    key=in2;
    target=in1;
    }
    if (!mixedCase){
    key=key.toUpperCase();
    target=target.toUpperCase();
    }
    for(int i = 0; i < key.length(); i++){
    if ( target.indexOf( key.substring(i,i+1)) > -1)
    foundMatch = foundMatch.concat( key.substring(i,i+1));
    }
    return foundMatch;
    }

    --


    Thanks in Advance...
    IchBin, Pocono Lake, Pa, USA
    http://weconsultants.servebeer.com/JHackerAppManager
    __________________________________________________________________________

    'If there is one, Knowledge is the "Fountain of Youth"'
    -William E. Taylor, Regular Guy (1952-)
     
    IchBin, Nov 26, 2005
    #14
  15. IchBin wrote:
    >
    > Sorry guys but not to go off into bit land but I threw this together.
    > What is wrong with doing it this way..


    Well, it's much more fun to talk about all sorts of esoterica.

    > for(int i = 0; i < key.length(); i++){
    > if ( target.indexOf( key.substring(i,i+1)) > -1)
    > foundMatch = foundMatch.concat( key.substring(i,i+1));
    > }


    Firstly, that doesn't take into account code points we were talking
    about, uses String in places where char would be sufficient, the
    formatting is a bit odd and != -1 is probably more obvious than > -1, IMO.

    Secondly, that looks like an O(n^2) solution to an O(n) problem. Not
    necessarily a problem, but generally not advisable.

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Nov 26, 2005
    #15
  16. JS

    Roedy Green Guest

    On Sat, 26 Nov 2005 23:05:35 +0000, Thomas Hawtin
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Secondly, that looks like an O(n^2) solution to an O(n) problem. Not
    >necessarily a problem, but generally not advisable.


    if you could guarantee an upper bound on the length of the two
    strings, you might find an O(n^2) solution could win out over an O(n).
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 26, 2005
    #16
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Ahmed Moustafa
    Replies:
    0
    Views:
    780
    Ahmed Moustafa
    Nov 15, 2003
  2. Replies:
    15
    Views:
    879
    Charles Richmond
    Apr 10, 2005
  3. Bapaiah Katepalli
    Replies:
    1
    Views:
    1,498
    Mike Treseler
    Jun 23, 2006
  4. Tung Chau
    Replies:
    0
    Views:
    375
    Tung Chau
    Aug 6, 2004
  5. Bob Hatch
    Replies:
    3
    Views:
    194
    Brian Candler
    Feb 1, 2011
Loading...

Share This Page