Regular expression with split goes wrong?

jh3an

Here is mysterious code, please look:

$x = '12aba34ba5';
@num = split /(a|b)+/, $x;

Now @num contains ('12','a','34','a','5').

I don't understand.
I was expecting @num to contain ('12','34','5'),
but it does not.

Why..? Please help me.
 
Joost Diepenmaat

jh3an said:
Here is mysterious code, please look:

$x = '12aba34ba5';
@num = split /(a|b)+/, $x;

Now @num contains ('12','a','34','a','5').

I don't understand.
I was expecting @num to contain ('12','34','5'),
but it does not.

See perldoc -f split:

If the PATTERN contains parentheses, additional list elements
are created from each matching substring in the delimiter.

split(/([,-])/, "1-10,20", 3);

produces the list value

(1, '-', 10, ',', 20)

IOW, you can use some non-capturing syntax, like:

@num = split /[ab]+/,$x;

to discard the separators.
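
To make the difference concrete, here is a minimal sketch comparing the original capturing pattern with two ways of discarding the separators. The variable names are only illustrative, and the `(?:...)` form is an alternative that keeps the alternation while switching off capturing.

```perl
use strict;
use warnings;

my $x = '12aba34ba5';

# Capturing group: split also returns the text last captured by the
# group for each delimiter match, which is why 'a' shows up.
my @captured = split /(a|b)+/, $x;

# Non-capturing group: same alternation, but nothing is captured,
# so the delimiters are discarded.
my @grouped = split /(?:a|b)+/, $x;

# Character class: no parentheses at all.
my @class = split /[ab]+/, $x;

print join(',', @captured), "\n";
print join(',', @grouped),  "\n";
print join(',', @class),    "\n";
```

The first line printed should reproduce the list from the question, while the other two contain only the numbers.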
 
Riad KACED

I would propose the following for your case:
@num = split /\D+/, $x;
This splits on any run of non-digit characters, which includes a-zA-Z.
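
A small sketch of how a `\D+` split behaves, assuming the same `$x` as in the question. The second input string is made up for illustration; it shows that a separator at the very start of the string leaves an empty leading field, which a global match on `\d+` avoids.

```perl
use strict;
use warnings;

my $x = '12aba34ba5';

# Split on runs of non-digit characters.
my @num = split /\D+/, $x;

# Hypothetical input that starts with a separator: split keeps an
# empty leading field in front of the first number.
my @mixed = split /\D+/, 'x12-34 ab5';

# Alternative: collect the digit runs directly instead of splitting.
my @found = 'x12-34 ab5' =~ /(\d+)/g;

print scalar(@num), " fields: @num\n";
print scalar(@mixed), " fields, first is '", $mixed[0], "'\n";
print scalar(@found), " fields: @found\n";
```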

Riad.
 
xhoster

Joost Diepenmaat said:
See perldoc -f split:

If the PATTERN contains parentheses, additional list elements
are created from each matching substring in the delimiter.


That really should say "If the PATTERN contains capturing parentheses,..."
                                                ^^^^^^^^^

Xho

 
Ben Bullock

If *both* the pattern *and* the subject (the string matched against) are
not in UTF-8, then, and only then, does \D equal [^0-9].

However, if either of them is in UTF-8 format (which does not
necessarily mean they contain a non-ASCII character), then \D excludes a
lot more than just the digits 0 to 9.

$ perl -wE 'chr =~ /[^0-9]/ or $c ++ for 0x00 .. 0xD7FF; say $c'
10
$ perl -wE 'chr =~ /\D/ or $c ++ for 0x00 .. 0xD7FF; say $c'
220

You need to use (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0 .. 0xFFFD) here;
otherwise you miss 10 characters ("FULLWIDTH DIGIT X" in Unicode-speak).
The following gives 230 rather than 220 for the count:

#!/usr/bin/perl
use warnings;
use strict;
use Unicode::UCD 'charinfo';

# Count the code points in the listed ranges that match the given
# pattern, printing the Unicode name of each match.
sub count_match
{
    my ($re) = @_;
    my $c = 0;    # start at 0 so the final print never sees undef
    for my $n (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0 .. 0xFFFD) {
        if (chr($n) =~ /$re/) {
            my $ci = charinfo($n);
            printf "%02X which is %s matches\n", $n, $ci->{name};
            $c++;
        }
    }
    print "There are $c characters matching \"$re\".\n";
}
count_match('\d');

However, I arrived at the above list of valid code point ranges by trial
and error (running with 0x00..0xFFFF and seeing where Perl complained about
"Unicode character xxx is illegal"), so there might be something I've
missed.
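
For a quick check on a single character rather than a full count, here is a minimal sketch using FULLWIDTH DIGIT ONE (U+FF11). The `/a` modifier is not mentioned anywhere in this thread; it is an assumption on my part that Perl 5.14 or later is available, where `/a` restricts `\d` to ASCII digits.

```perl
use 5.014;    # the /a modifier needs Perl 5.14 or later
use strict;
use warnings;

my $fullwidth_one = chr(0xFF11);    # FULLWIDTH DIGIT ONE

# \d follows Unicode rules for this string, so it matches.
my $d_matches = $fullwidth_one =~ /\d/ ? 1 : 0;

# An explicit ASCII range does not match the fullwidth digit.
my $class_matches = $fullwidth_one =~ /[0-9]/ ? 1 : 0;

# With /a, \d is limited to the ASCII digits 0-9.
my $ascii_d_matches = $fullwidth_one =~ /\d/a ? 1 : 0;

print "\\d: $d_matches, [0-9]: $class_matches, \\d with /a: $ascii_d_matches\n";
```

If only the ASCII digits should count as digits, either `[0-9]` or `\d` with `/a` seems the safer choice for the split pattern discussed above.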
 
