Regular expression with split goes wrong?

jh3an · Mar 10, 2008

Here is mysterious code, please look:

$x = '12aba34ba5';
@num = split /(a|b)+/, $x;

Now @num is ('12','a','34','a','5').

I don't understand.
I was expecting @num to be ('12','34','5'), but it is not.

Why..? Please help me.

Joost Diepenmaat · Mar 10, 2008

jh3an said:
Here is mysterious code, please look:

$x = '12aba34ba5';
@num = split /(a|b)+/, $x;

Now @num is ('12','a','34','a','5').

I don't understand.
I was expecting @num to be ('12','34','5'), but it is not.

See perldoc -f split:

If the PATTERN contains parentheses, additional list elements
are created from each matching substring in the delimiter.

split(/([,-])/, "1-10,20", 3);

produces the list value

(1, '-', 10, ',', 20)

In other words, use a pattern without capturing parentheses, such as:

@num = split /[ab]+/, $x;

to discard the separators.
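
To see both behaviours side by side, here is a minimal sketch using the string from the original post. The variable names @with_capture and @fields are only for illustration. Note that a repeated group such as (a|b)+ keeps only its last iteration, which is why 'a' shows up for both delimiters.

```perl
use strict;
use warnings;

my $x = '12aba34ba5';

# Capturing group: each delimiter also contributes its captured text.
# With (a|b)+ the group holds only its last iteration, hence 'a' both times
# (the delimiters are 'aba' and 'ba').
my @with_capture = split /(a|b)+/, $x;

# Character class: no group, so only the fields are returned.
my @fields = split /[ab]+/, $x;

print join(',', @with_capture), "\n";   # 12,a,34,a,5
print join(',', @fields), "\n";         # 12,34,5
```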

Riad KACED · Mar 11, 2008

I would propose the following for your case:
@num = split /\D+/, $x;
This splits on any run of non-digit characters, which covers a-zA-Z and anything else that is not a digit.

Riad.
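
As a quick check of the \D+ suggestion, here is a small sketch. The second input string, $y, is made up for illustration and is not from the original post. It shows that punctuation and spaces act as delimiters too.

```perl
use strict;
use warnings;

my $x = '12aba34ba5';
my $y = '12-x 34, 5';   # hypothetical input with punctuation and spaces

# \D+ matches one or more non-digit characters of any kind.
my @from_x = split /\D+/, $x;
my @from_y = split /\D+/, $y;

print join(',', @from_x), "\n";   # 12,34,5
print join(',', @from_y), "\n";   # 12,34,5
```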

jh3an · Mar 11, 2008

Thank you everyone !

xhoster · Mar 11, 2008

Joost Diepenmaat said:
See perldoc -f split:

If the PATTERN contains parentheses, additional list elements
are created from each matching substring in the delimiter.

That really should say "If the PATTERN contains *capturing* parentheses,..."

Xho
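
The distinction matters in practice: a sketch with the string from the original post, assuming you want to keep the alternation rather than switch to a character class. (?:...) groups without capturing, so split returns only the fields. Wrapping the whole delimiter in one capturing group returns each full separator instead.

```perl
use strict;
use warnings;

my $x = '12aba34ba5';

# Non-capturing group: grouping only, no extra list elements.
my @fields = split /(?:a|b)+/, $x;

# One capturing group around the whole delimiter: separators are kept intact.
my @with_seps = split /((?:a|b)+)/, $x;

print join(',', @fields), "\n";      # 12,34,5
print join(',', @with_seps), "\n";   # 12,aba,34,ba,5
```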

Ben Bullock · Mar 16, 2008

If *both* the pattern *and* the subject (the string matched against) are
not in UTF-8, then, and only then, does \D equal [^0-9].

However, if either of them is in UTF-8 format (which does not
necessarily mean they contain a non-ASCII character), then \D excludes a
lot more than just the digits 0 to 9.

$ perl -wE 'chr =~ /[^0-9]/ or $c ++ for 0x00 .. 0xD7FF; say $c'
10
$ perl -wE 'chr =~ /\D/ or $c ++ for 0x00 .. 0xD7FF; say $c'
220

You need to use (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0.. 0xFFFD) here,
otherwise you miss 10 characters ("FULLWIDTH DIGIT X" in Unicode-speak).
The following gives 230 rather than 220 for the count:

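#!/usr/bin/perl
use warnings;
use strict;
use Unicode::UCD 'charinfo';
# Print every character in the listed ranges that matches $re, then the total.
sub count_match
{
    my ($re) = @_;
    my $c = 0;   # start at 0 so the final print is defined even with no matches
    # Ranges skip the surrogates and the code points Perl complained about.
    for my $n (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0 .. 0xFFFD) {
        if (chr($n) =~ /$re/) {
            my $ci = charinfo($n);
            printf "%02X which is %s matches\n", $n, $$ci{name};
            $c++;
        }
    }
    print "There are $c characters matching \"$re\".\n";
}
count_match('\d');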

However, I got the above list of valid Unicode numbers here by trial and
error (running with 0x00..0xFFFF and seeing where Perl complained about
"Unicode character xxx is illegal") so there might be something I've
missed.
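
For the split question itself, the practical effect of the \D versus [^0-9] difference can be sketched as follows. The test string is made up for illustration: it puts ARABIC-INDIC DIGIT THREE (U+0663) between two ASCII numbers. The /a modifier, available from Perl 5.14, restricts \d and \D to ASCII.

```perl
use strict;
use warnings;

# "12", then U+0663 (ARABIC-INDIC DIGIT THREE), then "34"
my $s = "12\x{0663}34";

# Under Unicode rules U+0663 is a digit, so \D+ finds nothing to split on.
my @unicode = split /\D+/, $s;

# [^0-9] treats U+0663 as a non-digit, so it acts as a delimiter.
my @ascii = split /[^0-9]+/, $s;

# With /a, \D behaves like [^0-9].
my @a_flag = split /\D+/a, $s;

print scalar(@unicode), "\n";   # 1
print scalar(@ascii), "\n";     # 2
print scalar(@a_flag), "\n";    # 2
```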
