Regular expression with split goes wrong?

Discussion in 'Perl Misc' started by jh3an, Mar 10, 2008.

  1. jh3an

    jh3an Guest

    Here is mysterious code, please look:

    $x = '12aba34ba5';
    @num = split /(a|b)+/, $x;

    now, @num has ('12','a','34','a','5').

    I don't understand.
    I was expecting that @num would have '12','34','5'.
    However, it is not.

    Why..? Please help me.
    jh3an, Mar 10, 2008
    #1

  2. Joost Diepenmaat

    jh3an writes:

    > Here is mysterious code, please look:
    >
    > $x = '12aba34ba5';
    > @num = split /(a|b)+/, $x;
    >
    > now, @num has ('12','a','34','a','5').
    >
    > I don't understand.
    > I was expecting that @num would have '12','34','5'.
    > However, it is not.


    See perldoc -f split:

    If the PATTERN contains parentheses, additional list elements
    are created from each matching substring in the delimiter.

    split(/([,-])/, "1-10,20", 3);

    produces the list value

    (1, '-', 10, ',', 20)

    IOW, you can use some non-capturing syntax, like:

    @num = split /[ab]+/,$x;

    to discard the separators.
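
    As a minimal sketch of the difference (variable names other than $x are
    only illustrative), the following compares the capturing pattern from the
    question with two non-capturing alternatives. Note that a repeated
    capturing group keeps only its last repetition, which is why 'a' shows up
    rather than 'aba' or 'ba':

```perl
use strict;
use warnings;

my $x = '12aba34ba5';

# Capturing group: split also returns what (...) captured for each separator.
# With a repeated group, that is the last repetition only ('a' in both cases).
my @with_capture = split /(a|b)+/, $x;

# Non-capturing group: same matching, nothing captured, separators discarded.
my @non_capture = split /(?:a|b)+/, $x;

# Character class: equivalent for single characters, and simpler.
my @char_class = split /[ab]+/, $x;

print join(',', @with_capture), "\n";   # 12,a,34,a,5
print join(',', @non_capture),  "\n";   # 12,34,5
print join(',', @char_class),   "\n";   # 12,34,5
```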

    --
    Joost Diepenmaat | blog: http://joost.zeekat.nl/ | work: http://zeekat.nl/
    Joost Diepenmaat, Mar 10, 2008
    #2

  3. Riad KACED

    Riad KACED Guest

    I would propose the following for your case:
    @num = split /\D+/,$x;
    This splits on any run of non-digit characters, not only a-zA-Z.
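
    A small sketch of how this behaves, assuming the input may also start
    with a separator (the 'a12b34' string is made up for illustration). In
    that case split returns a leading empty field, and matching the digit
    runs directly is one way to avoid it:

```perl
use strict;
use warnings;

my $x = '12aba34ba5';

# \D+ matches one or more non-digit characters, so any separator works.
my @num = split /\D+/, $x;

# A separator at the start of the string yields a leading empty field.
my @with_empty = split /\D+/, 'a12b34';

# Matching the digit runs themselves avoids the empty field.
my @digits = 'a12b34' =~ /\d+/g;

print join('|', @num),        "\n";   # 12|34|5
print join('|', @with_empty), "\n";   # |12|34
print join('|', @digits),     "\n";   # 12|34
```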

    Riad.
    Riad KACED, Mar 11, 2008
    #3
  4. jh3an

    jh3an Guest

    Thank you everyone !
    jh3an, Mar 11, 2008
    #4
  5. jh3an

    Guest

    Joost Diepenmaat wrote:
    > jh3an writes:
    >
    > > Here is mysterious code, please look:
    > >
    > > $x = '12aba34ba5';
    > > @num = split /(a|b)+/, $x;
    > >
    > > now, @num has ('12','a','34','a','5').
    > >
    > > I don't understand.
    > > I was expecting that @num would have '12','34','5'.
    > > However, it is not.

    >
    > See perldoc -f split:
    >
    > If the PATTERN contains parentheses, additional list elements
    > are created from each matching substring in the delimiter.



    That really should say "If the PATTERN contains capturing parentheses,..."
                                                    ^^^^^^^^^
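
    To illustrate the distinction (a sketch only, reusing $x from the
    original question): grouping with (?:...) adds no list elements, while
    wrapping the whole separator in one capturing group returns each full
    separator.

```perl
use strict;
use warnings;

my $x = '12aba34ba5';

# Parentheses that do not capture: no extra elements in the result.
my @fields_only = split /(?:a|b)+/, $x;

# One capturing group around the whole separator: each full separator
# is returned between the fields.
my @fields_and_seps = split /((?:a|b)+)/, $x;

print scalar(@fields_only), ": ", join(',', @fields_only), "\n";         # 3: 12,34,5
print scalar(@fields_and_seps), ": ", join(',', @fields_and_seps), "\n"; # 5: 12,aba,34,ba,5
```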

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , Mar 11, 2008
    #5
  6. Ben Bullock

    Ben Bullock Guest

    On Tue, 11 Mar 2008 21:46:38 +0000, Abigail wrote:

    > If *both* the pattern *and* the subject (the string matched against) are
    > not in UTF-8, then, and only then, does \D equal [^0-9].
    >
    > However, if either of them is in UTF-8 format (which does not
    > necessarily mean they contain a non-ASCII character), then \D excludes a
    > lot more than just the digits 0 to 9.
    >
    > $ perl -wE 'chr =~ /[^0-9]/ or $c ++ for 0x00 .. 0xD7FF; say $c'
    > 10
    > $ perl -wE 'chr =~ /\D/ or $c ++ for 0x00 .. 0xD7FF; say $c'
    > 220


    You need to use (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0.. 0xFFFD) here,
    otherwise you miss 10 characters ("FULLWIDTH DIGIT X" in Unicode-speak).
    The following gives 230 rather than 220 for the count:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Unicode::UCD 'charinfo';

    # Count and list the code points whose character matches $re.
    sub count_match
    {
        my ($re) = @_;
        my $c = 0;    # start at 0 so the summary is defined with no matches
        # Ranges skip the code points Perl reported as illegal.
        for my $n (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0 .. 0xFFFD) {
            if (chr($n) =~ /$re/) {
                my $ci = charinfo($n);
                print sprintf('%02X', $n), " which is ", $ci->{name},
                    " matches\n";
                $c++;
            }
        }
        print "There are $c characters matching \"$re\".\n";
    }
    count_match('\d');

    However, I got the above list of valid Unicode numbers here by trial and
    error (running with 0x00..0xFFFF and seeing where Perl complained about
    "Unicode character xxx is illegal") so there might be something I've
    missed.
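
    For a narrower check than scanning the whole range, here is a sketch
    that tests three hand-picked characters against \d, [0-9], and \d with
    the /a modifier. This assumes Perl 5.14 or later, where /a restricts \d
    to ASCII digits; the sample characters are chosen only for illustration.

```perl
use strict;
use warnings;

my $ascii     = '7';          # DIGIT SEVEN
my $arabic    = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $fullwidth = "\x{FF11}";   # FULLWIDTH DIGIT ONE

# Print one row per character: 1 if the pattern matches, 0 otherwise.
for my $ch ($ascii, $arabic, $fullwidth) {
    printf "U+%04X  \\d:%d  [0-9]:%d  \\d/a:%d\n",
        ord($ch),
        ($ch =~ /\d/    ? 1 : 0),
        ($ch =~ /[0-9]/ ? 1 : 0),
        ($ch =~ /\d/a   ? 1 : 0);
}
```

    Only the ASCII digit matches all three patterns; the other two match
    plain \d but neither [0-9] nor \d under /a.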
    Ben Bullock, Mar 16, 2008
    #6