Bug in Regex Split

Discussion in 'Java' started by Roedy Green, Oct 25, 2003.

  1. Roedy Green

    Roedy Green Guest

    I don't think this behaviour is defensible even if documented
    somewhere.

    Consider this:

    Pattern spaceSplitter = Pattern.compile( " " );

    now try it on strings with lead, embedded and trailing spaces.

    lead spaces turn into "" each, but trailing spaces are ignored.

    e.g.

    .split( "..a.b..c.." ); (where . represents a space)

    gives:

    ""
    ""
    "a"
    "b"
    ""
    "c"

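For anyone who wants to reproduce the result above, here is a minimal sketch. The class name SplitReproduction and the helper show() are made up for illustration; only Pattern.compile and Pattern.split come from java.util.regex. The input is the example string with the dots written as real spaces.

```java
import java.util.regex.Pattern;

class SplitReproduction {
    // Same pattern as in the post: a single literal space.
    static final Pattern SPACE_SPLITTER = Pattern.compile(" ");

    // Hypothetical helper: wraps each piece in quotes so empty strings are visible.
    static String show(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append('"').append(part).append('"').append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two leading spaces, one and two embedded spaces, two trailing spaces.
        String input = "  a b  c  ";
        String[] parts = SPACE_SPLITTER.split(input);
        System.out.print(show(parts));
        System.out.println(parts.length + " pieces");
    }
}
```

Run as written, this should list the six pieces shown above: the two leading empty strings are kept, and nothing appears for the two trailing spaces.
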
    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 25, 2003
    #1

  2. Roedy Green

    skeptic Guest

    Roedy Green <> wrote in message news:<>...
    > I don't think this behaviour is defensible even if documented
    > somewhere.
    >
    > Consider this:
    >
    > Pattern spaceSplitter = Pattern.compile( " " );
    >
    > now try it on strings with lead, embedded and trailing spaces.
    >
    > lead spaces turn into "" each, but trailing spaces are ignored.
    >
    > e.g.
    >
    > .split( "..a.b..c.." ); (where . represents space )
    >
    > gives:
    >
    > ""
    > ""
    > "a"
    > "b"
    > ""
    > "c"


    Strictly speaking it's not a bug, as there is no formal regex
    replacement/splitting specification in the Java documentation. I also
    doubt one ever existed at all, because those Perl regexes are more
    of an art than a science (real regular expressions aside).

    Nevertheless, I agree that it behaves wrongly.
    But that's not the real problem: I could just conform my app to the
    given behaviour.
    The real problem is that the behaviour may silently change in the next
    version of the JRE.
    A macabre picture comes to mind, where some search/replace-type app
    goes wild in batch mode after a JRE upgrade.

    Regards
     
    skeptic, Oct 25, 2003
    #2

  3. Roedy Green

    Dave Glasser Guest

    Roedy Green <> wrote on Sat, 25 Oct 2003 06:36:38
    GMT in comp.lang.java.programmer:

    >I don't think this behaviour is defensible even if documented
    >somewhere.
    >
    >Consider this:
    >
    > Pattern spaceSplitter = Pattern.compile( " " );
    >
    >now try it on strings with lead, embedded and trailing spaces.
    >
    >lead spaces turn into "" each, but trailing spaces are ignored.
    >
    >e.g.
    >
    >.split( "..a.b..c.." ); (where . represents space )
    >
    >gives:
    >
    >""
    >""
    >"a"
    >"b"
    >""
    >"c"


    It does seem odd that leading and trailing whitespace are treated
    differently. However, I just ran this Perl script, which does the same
    operation:

    $_ = "  a b  c  ";
    @outs = split(/ /);
    foreach $elem (@outs) {
    print "\n-$elem-";
    }



    and the output is:

    --
    --
    -a-
    -b-
    --
    -c-

    Which implies that perl follows the same rules. (Hopefully a perl
    expert will read this and explain why.)

    Do you want your output array to include a zero-length string wherever
    there's a boundary between two space characters? It seems to me you'd
    want to treat any amount of contiguous whitespace as a single
    delimiter, and therefore use "\\s+" (one or more whitespace
    characters) instead of " ".
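A minimal sketch of that suggestion, for comparison. The class and method names are invented for the example. It assumes the same test string as earlier in the thread, and it adds a trim() variant because a leading run of whitespace still produces one empty first element.

```java
import java.util.regex.Pattern;

class WhitespaceRunSplit {
    // One or more whitespace characters count as a single delimiter.
    static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String[] words(String input) {
        return WHITESPACE_RUN.split(input);
    }

    // Trimming first avoids the empty first element that a leading run produces.
    static String[] trimmedWords(String input) {
        return WHITESPACE_RUN.split(input.trim());
    }

    public static void main(String[] args) {
        String input = "  a b  c  ";
        System.out.println(String.join("|", words(input)));        // |a|b|c
        System.out.println(String.join("|", trimmedWords(input))); // a|b|c
    }
}
```

Note that this throws away the information about how many spaces there were, so it only fits cases where the original spacing does not matter.
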

    --
    Check out QueryForm, a free, open source, Java/Swing-based
    front end for relational databases.

    http://qform.sourceforge.net
     
    Dave Glasser, Oct 26, 2003
    #3
  4. Roedy Green

    Rene Guest

    Dave Glasser <> wrote:
    > Roedy Green <> wrote on Sat, 25 Oct 2003 06:36:38
    > GMT in comp.lang.java.programmer:
    >

    [snip]
    > >lead spaces turn into "" each, but trailing spaces are ignored.
    > >
    > >e.g.
    > >
    > >.split( "..a.b..c.." ); (where . represents space )
    > >
    > >gives:
    > >
    > >""
    > >""
    > >"a"
    > >"b"
    > >""
    > >"c"

    >
    > That does seem odd that it seems to treat leading and trailing
    > whitespace differently. I just ran this perl script that does the same
    > operation, however:
    >
    > $_ = "  a b  c  ";
    > @outs = split(/ /);
    > foreach $elem (@outs) {
    > print "\n-$elem-";
    > }


    It works if you add the flag "x". So try @outs = split(/ /x); and it will
    give: (changed - to + so as not to break quoting at --<space>)

    + +
    + +
    +a+
    + +
    +b+
    + +
    + +
    +c+
    + +
    + +

    What /x does is make Perl ignore whitespace and #-comments inside the
    pattern. You can do things like this in Perl:

    m{
    \w+: # match a word
    ( # begin group
    \s+ # match one or more whitespaces
    \w+ # match another word
    ) # end group
    \s* # match zero or more whitespaces
    \d+ # match some digits
    }x;

    In this case the regexp is far more readable, but contains a *lot* of
    whitespace and comment chars that are not part of the expression, so you
    need to flag them as such. It seems that trailing spaces are generally
    considered comments.

    I didn't expect it either and I'm not a regexp master (had to take the
    camel book off the shelf for that :) ). Skimming through the javadoc, I'd
    expect the Pattern.COMMENTS flag to yield the same behaviour.
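A small sketch to check that expectation in Java. The class and method names are made up for the example. Pattern.COMMENTS tells the compiler to ignore whitespace inside the pattern, so a pattern written as a single space ends up empty, and an empty pattern matches between every pair of characters. That is what produces the one-character pieces, rather than any special handling of trailing spaces.

```java
import java.util.regex.Pattern;

class CommentsFlagSplit {
    // With Pattern.COMMENTS the space in the pattern is ignored,
    // so this compiles to an empty pattern, not "match one space".
    static final Pattern IGNORED_SPACE = Pattern.compile(" ", Pattern.COMMENTS);

    static String[] pieces(String input) {
        return IGNORED_SPACE.split(input);
    }

    public static void main(String[] args) {
        // An empty pattern matches the empty string; a literal space would not.
        System.out.println(IGNORED_SPACE.matcher("").matches()); // true

        // Each piece is printed between + signs, as in the Perl run above.
        for (String part : pieces("  a b  c  ")) {
            System.out.println("+" + part + "+");
        }
    }
}
```
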

    > Which implies that perl follows the same rules. (Hopefully a perl
    > expert will read this and explain why.)


    Here's my 2 cents but I'm not an expert on that matter.

    CU

    Rene

     
    Rene, Oct 26, 2003
    #4
  5. Roedy Green

    Filip Larsen Guest

    Roedy Green wrote

    > I don't think this behaviour is defensible even if documented
    > somewhere.
    >
    > Consider this:
    >
    > Pattern spaceSplitter = Pattern.compile( " " );
    >
    > now try it on strings with lead, embedded and trailing spaces.
    >
    > lead spaces turn into "" each, but trailing spaces are ignored.
    >
    > e.g.
    >
    > .split( "..a.b..c.." ); (where . represents space )
    >
    > gives:
    >
    > ""
    > ""
    > "a"
    > "b"
    > ""
    > "c"


    The call to split(s) corresponds to a call to split(s,0), where zero is the
    limit parameter. For that call, zero means that trailing empty strings in
    the result are removed (as the documentation clearly says). If called with
    a negative limit, the result will contain trailing empty strings.

    So, in your case a call to .split("  a b  c  ",-1) gives the expected "",
    "", "a", "b", "", "c", "", "".

    Whether or not it is defensible for the .split(CharSequence) call to remove
    trailing empty strings is a different issue. If it somehow is the normal
    behaviour of split in other regex implementations, then I think it is a good
    choice to let split(CharSequence) be aligned with that.
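The effect of the limit parameter can be seen side by side in a short sketch. The class name is invented; the two Pattern.split overloads are the ones discussed above, and the input is the thread's example string with real spaces.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        String input = "  a b  c  ";

        // split(input) behaves like split(input, 0): trailing empty strings are dropped.
        System.out.println(Arrays.toString(SPACE.split(input)));
        System.out.println(Arrays.toString(SPACE.split(input, 0)));

        // A negative limit keeps every piece, including trailing empty strings.
        System.out.println(Arrays.toString(SPACE.split(input, -1)));

        // A positive limit caps the number of pieces; the last piece holds the rest.
        System.out.println(Arrays.toString(SPACE.split(input, 3)));
    }
}
```

The first two calls should return six pieces each, the negative limit eight, and the limit of 3 exactly three.
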


    Regards,
    --
    Filip Larsen
     
    Filip Larsen, Oct 26, 2003
    #5
  6. Roedy Green

    Phil... Guest

    I always use trim() so the problem does not exist.

    "Filip Larsen" <> wrote in message
    news:bngoib$2o49$...
    > Roedy Green wrote
    >
    > > I don't think this behaviour is defensible even if documented
    > > somewhere.
    > >
    > > Consider this:
    > >
    > > Pattern spaceSplitter = Pattern.compile( " " );
    > >
    > > now try it on strings with lead, embedded and trailing spaces.
    > >
    > > lead spaces turn into "" each, but trailing spaces are ignored.
    > >
    > > e.g.
    > >
    > > .split( "..a.b..c.." ); (where . represents space )
    > >
    > > gives:
    > >
    > > ""
    > > ""
    > > "a"
    > > "b"
    > > ""
    > > "c"

    >
    > The call to split(s) corresponds to a call to split(s,0), where zero is the
    > limit parameter. For that call, zero means that trailing empty strings in
    > the result are removed (as the documentation clearly says). If called with
    > a negative limit, the result will contain trailing empty strings.
    >
    > So, in your case a call to .split("  a b  c  ",-1) gives the expected "",
    > "", "a", "b", "", "c", "", "".
    >
    > Whether or not it is defensible for the .split(CharSequence) call to remove
    > trailing empty strings is a different issue. If it somehow is the normal
    > behaviour of split in other regex implementations, then I think it is a good
    > choice to let split(CharSequence) be aligned with that.
    >
    >
    > Regards,
    > --
    > Filip Larsen
    >
    >
     
    Phil..., Oct 26, 2003
    #6
  7. Roedy Green

    Roedy Green Guest

    On Sun, 26 Oct 2003 00:12:37 -0400, Dave Glasser <>
    wrote or quoted :

    >Do you want your output array to include a zero-length string wherever
    >there's a boundary between two space characters?


    In this particular case I am trying to break a phrase into words in
    such a way that I can reconstruct it precisely the way I found it. So
    lead, embedded and trailing blanks are significant. I store words
    with an implied trailing blank on all but the first.

    I could see split ignoring lead and trail blanks or treating strings
    of embedded blanks as one, but not the asymmetric thing it does.
    It should treat lead, trail and embedded blanks the same way.
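
For the reconstruct-it-exactly requirement, one possible approach is the negative limit mentioned earlier in the thread. This is only a sketch with invented class and method names. It assumes the separator is always exactly one space, so that joining the pieces with a single space restores the original phrase.

```java
import java.util.regex.Pattern;

class RoundTripSplit {
    static final Pattern SPACE = Pattern.compile(" ");

    // A negative limit keeps leading, embedded and trailing empty pieces.
    static String[] splitKeepingAll(String phrase) {
        return SPACE.split(phrase, -1);
    }

    // Putting one space back between consecutive pieces restores the phrase.
    static String rejoin(String[] pieces) {
        return String.join(" ", pieces);
    }

    public static void main(String[] args) {
        String phrase = "  a b  c  ";
        String rebuilt = rejoin(splitKeepingAll(phrase));
        System.out.println(phrase.equals(rebuilt)); // true
    }
}
```
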
    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 26, 2003
    #7
  8. Roedy Green

    Roedy Green Guest

    On Sun, 26 Oct 2003 17:47:40 GMT, "Phil..." <> wrote or
    quoted :

    >I always use trim() so the problem does not exist.


    In my case I was processing filenames. The lead and trail blanks are
    significant. I decided to simply treat these as errors and ask the
    user to rename his files.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 26, 2003
    #8
  9. Roedy Green

    Phil... Guest

    If it needs to be exact and split() isn't doin it for ya
    maybe try charAt() and make a loop that does
    exactly what you want, doesn't seem hard
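
Such a loop could look roughly like this. It is a sketch only: the class and method names are made up, and it assumes a single space is the only separator and that every piece, empty or not, should be kept.

```java
import java.util.ArrayList;
import java.util.List;

class ManualSpaceSplitter {
    // Splits on every single space and keeps all pieces, empty ones included.
    static List<String> splitOnSpaces(String input) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == ' ') {
                pieces.add(input.substring(start, i));
                start = i + 1;
            }
        }
        // Whatever follows the last space, possibly an empty string.
        pieces.add(input.substring(start));
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces("  a b  c  "));
    }
}
```

Leading, embedded and trailing spaces are all handled by the same branch here, so there is no asymmetry between them.
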


    "Roedy Green" <> wrote in message
    news:...
    > On Sun, 26 Oct 2003 00:12:37 -0400, Dave Glasser <>
    > wrote or quoted :
    >
    > >Do you want your output array to include a zero-length string wherever
    > >there's a boundary between two space characters?

    >
    > In this particular case I am trying to break a phrase into words in
    > such a way that I can reconstruct it precisely the way I found it. So
    > lead, embedded and trailing blanks are significant. I store words
    > with an implied trailing blank on all but the first.
    >
    > I could see split ignoring lead and trail blanks or treating strings
    > of embedded blanks as one, but not the asymmetric thing it does.
    > It should treat lead, trail and embedded blanks the same way.
    > --
    > Canadian Mind Products, Roedy Green.
    > Coaching, problem solving, economical contract programming.
    > See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Phil..., Oct 26, 2003
    #9
  10. Roedy Green

    Alan Moore Guest

    On 25 Oct 2003 07:35:36 -0700, (skeptic) wrote:

    >But that's not the real problem: I just could conform my app to the
    >given behaviour.
    >The real problem is that the behaviour may silently change in the next
    >version of jre.
    >A macabre picture gets on the mind where some search/replace-type app
    >goes wild in a batch mode after a JRE upgrade.


    I doubt that will happen. The only reason to change its behavior
    would be if someone showed that it was not consistent with Perl's
    split function--which has happened once or twice. But, AFAIK, the
    split() method is now fully Perl-compliant.
     
    Alan Moore, Oct 27, 2003
    #10
  11. Roedy Green

    Roedy Green Guest

    On Mon, 27 Oct 2003 07:05:42 GMT, Alan Moore <>
    wrote or quoted :

    >I doubt that will happen. The only reason to change its behavior
    >would be if someone showed that it was not consistent with Perl's
    >split function--which has happened once or twice. But, AFAIK, the
    >split() method is now fully Perl-compliant.


    And this strange behaviour is documented in the two-argument split,
    so they can't change it now.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 27, 2003
    #11
  12. Roedy Green

    Roedy Green Guest

    On Mon, 27 Oct 2003 07:05:42 GMT, Alan Moore <>
    wrote or quoted :

    >I doubt that will happen. The only reason to change its behavior
    >would be if someone showed that it was not consistent with Perl's
    >split function--which has happened once or twice. But, AFAIK, the
    >split() method is now fully Perl-compliant.


    It is not as strange as it first appears. In many languages lead and
    embedded separators are significant, but trailing ones are not.

    I think particularly back to OS JCL which was nuts on null positional
    parameters.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 27, 2003
    #12
  13. Roedy Green wrote:
    > On Sun, 26 Oct 2003 17:47:40 GMT, "Phil..." <> wrote or
    > quoted :
    >
    >
    >>I always use trim() so the problem does not exist.

    >
    >
    > In my case I was processing filenames. The lead and trail blanks are
    > significant. I decided to simply treat these as errors and ask the
    > user to rename his files.
    >


    Roedy,

    Why did you not just use Pattern.split(CharSequence,int) where you pass
    a negative limit? According to the documentation it sounds like it does
    what you want.

    Ray
     
    Raymond DeCampo, Nov 2, 2003
    #13
  14. Roedy Green

    Roedy Green Guest

    On Sun, 02 Nov 2003 21:34:20 GMT, Raymond DeCampo
    <> wrote or quoted :

    >Why did you not just use Pattern.split(CharSequence,int) where you pass
    >a negative limit? According to the documentation it sounds like it does
    >what you want.


    Because I did not know about it at the time I complained.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Nov 3, 2003
    #14