Unexpected RegEx results


QoS

Hello, I'm having some trouble solving this regular expression puzzle.
It is possible to work around the issue using some if statements, but I'm
curious why this is occurring.

The data involved looks similar to the following:

ALWAYSPRESENT:0008:0:OPTIONAL
OPTIONAL
OPTIONAL

The data will always start with a name.
This is followed by a colon, some numbers, a colon, some numbers, and a colon,
which will all be discarded.
Then there may or may not be some additional data after that.

Next there might be a newline followed by some optional data.
Finally there might be a newline followed by some optional data.

OK, here is my issue: when there is no third line, the regex I'm using
places the data that belongs in the third capture variable ($3) in $4
instead. So $4 contains data but $3 does not, when I expected
$3 to contain the data and $4 to be empty.

Example troublesome data:

ALWAYSPRESENT:0008:0:pRESENT
PRESENT
NOTPRESENT

This is the offending RegEx.

$msg =~ /(.*?):.*:(.*)\n*(^.*)\n*(^.*)/m;

Thanks for any assistance.
 
QoS

Jim Gibson said:
I can't follow your logic entirely, but I suspect that you simply have
too many unqualified '*' characters in your regex (I count 6) and it is
causing confusion. For example, '\n*' need not match any characters at
all. Perhaps you want '\n?' or '\n+' there instead.
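
To make the difference between the three quantifiers concrete, here is a minimal sketch. The helper name newline_lengths is made up for illustration and is not part of the original script; it only reports how many newlines each variant consumes at the start of a string.

```perl
use strict;
use warnings;

# Hypothetical helper: report how many characters each quantified
# newline pattern consumes at the start of $text (undef if no match).
sub newline_lengths {
    my ($text) = @_;
    my %len;
    $len{star}  = ( $text =~ /\A(\n*)/ ) ? length($1) : undef;  # zero or more
    $len{maybe} = ( $text =~ /\A(\n?)/ ) ? length($1) : undef;  # zero or one
    $len{plus}  = ( $text =~ /\A(\n+)/ ) ? length($1) : undef;  # one or more
    return \%len;
}

for my $text ( "abc", "\n\nabc" ) {
    my $len = newline_lengths($text);
    printf "star=%s maybe=%s plus=%s\n",
        map { defined $len->{$_} ? $len->{$_} : 'no match' } qw(star maybe plus);
}
```

On "abc" this prints star=0 maybe=0 plus=no match: '\n*' and '\n?' both succeed without consuming anything, while '\n+' insists on at least one newline.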

In any case, please post a complete, runnable program and somebody,
perhaps even me, will be able to help you.

--
Jim Gibson



Ok, here is an example that demonstrates the quirks.
Notice in the second printout that what was in $3 in the first
printout is now in $4 and $3 contains ''.

And thanks very much for giving this a go!

#!/usr/bin/perl
use strict;
use warnings;

# The first record has three lines, the second only two.
my $data;
$data = 'Some Text:0000:0:More Text'."\n".
        'Text text'."\n".
        'Text text text.'."\n";
reformat($data);
$data = 'Some Text:0000:0:More Text'."\n".
        'Text text'."\n";
reformat($data);

exit;

sub reformat
{
    my $msg = $_[0] || die "Invalid option in reformat\n";
    my $out;
    # The regex in question, unchanged.
    $msg =~ /(.*?):.*:(.*)\n*(^.*)\n*(^.*)/m;
    $out = "$1,".
           "1,".
           "00000000000,".
           "0000000,".
           "000,".
           "$2,".
           "$3,".
           "$4\n";
    print $out;
    print '=======================================================',"\n";
    return(1);
}
 
John W. Krahn

$ perl -le'
my @x = ( <<ONE, <<TWO );
ALWAYSPRESENT:0008:0:OPTIONAL
OPTIONAL
OPTIONAL
ONE
ALWAYSPRESENT:0008:0:pRESENT
PRESENT
TWO

for ( @x ) {
print "1=$1 2=$2 3=$3 4=$4" if /(.*?):.*:(.*)\n*(^.*)\n*(^.*)/m;
}
'
1=ALWAYSPRESENT 2=OPTIONAL 3=OPTIONAL 4=OPTIONAL
1=ALWAYSPRESENT 2=PRESENT 3= 4=PRESENT



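You are using the /m option together with the ^ anchor, so each (^.*) has
to begin at the start of a line, and under /m perl does not treat the
position after the final newline as a line start. With only two lines,
both anchors end up at the start of the second line: $3 backtracks to the
empty string and $4 captures the line.

Without /m and the anchors, the captures land where you expected: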

$ perl -le'
my @x = ( <<ONE, <<TWO );
ALWAYSPRESENT:0008:0:OPTIONAL
OPTIONAL
OPTIONAL
ONE
ALWAYSPRESENT:0008:0:pRESENT
PRESENT
TWO

for ( @x ) {
print "1=$1 2=$2 3=$3 4=$4" if /(.*?):.*:(.*)\n*(.*)\n*(.*)/;
}
'
1=ALWAYSPRESENT 2=OPTIONAL 3=OPTIONAL 4=OPTIONAL
1=ALWAYSPRESENT 2=PRESENT 3=PRESENT 4=
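
One way to check where the anchors land in the two-line case is to look at the capture offsets in @- and @+ after the original /m regex has matched. This is only a sketch; capture_offsets is a hypothetical helper name, not something from the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] offsets for captures 3 and 4
# of the original regex, or an empty list if the match fails.
sub capture_offsets {
    my ($msg) = @_;
    return unless $msg =~ /(.*?):.*:(.*)\n*(^.*)\n*(^.*)/m;
    return ( [ $-[3], $+[3] ], [ $-[4], $+[4] ] );
}

my $two_lines = "ALWAYSPRESENT:0008:0:pRESENT\nPRESENT\n";
my ( $third, $fourth ) = capture_offsets($two_lines);
printf "\$3 spans %d..%d, \$4 spans %d..%d\n", @$third, @$fourth;
```

If the explanation above holds, $3 is reported as a zero-length span and $4 starts at that same offset, the beginning of the second line.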




John
 
Mirco Wahab

my $data;
$data = 'Some Text:0000:0:More Text'."\n".
'Text text'."\n".
'Text text text.'."\n";

That's better. Real data ;-)

My first shot:


....
my $data='
Some Text:0000:0:More Text
Text text
Text text text
';

my $rg = qr/
^([^:\n]+) : \d+ : \d+ : ([^\n]+)?\n
(?: ^([^:\n]+?) \n)?
(?: ^([^:\n]+?) (?:\n|$) )?/mx;

if( $data =~ /$rg/ ) {
print join "\n", map defined $_?$_:'undef', ($1, $2, $3, $4);
}
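
The defined test in the print matters because an optional group that takes no part in the match leaves its variable undefined, whereas a group with optional content leaves an empty string. A small sketch of the difference follows; describe_capture is a hypothetical helper used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: describe what $1 holds after matching $re against $text.
sub describe_capture {
    my ( $text, $re ) = @_;
    return 'no match' unless $text =~ $re;
    return defined $1 ? "'$1'" : 'undef';
}

print describe_capture( 'a', qr/a(b)?/ ), "\n";   # the group itself is optional
print describe_capture( 'a', qr/a(b?)/ ), "\n";   # the group always takes part
```

The first line prints undef and the second prints '', which is why the (?: ... )? groups above report undef for missing lines.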


Regards

M.
 
QoS

[Snip]

Thank you everybody for helping solve this little mystery.

Your solutions and workarounds are quite clever!
I was unaware of that 'm' option side-effect.
 
Mirco Wahab

I was under the impression your data
would consist not of just /one/ record
but rather of a good sequence of them, so
the regex would need to walk through
(find) the records and spit out the
correct matches for each one.

# Example: four record thing with "offending" structure ==>

my $morestuff='
ALWAYSPRESENT:0008:0:pRESENT
PRESENT
MAYBEPRESENT
Some Text 1:0000:0:More Text 1
Some Text 2:0000:0:More Text 2
Text22 text22 text22
Some Text 3:0000:0:More Text 3
Text3 text3
Text33 text33 text33
';
# and so on ...

# Now, the regex should identify them
# and step along ==>

my $rg = qr/ \s*
^([^:]+) : \d+ : \d+ : ([^\n]+)?\n
(?: ^([^:\n]+?) \n)?
(?: ^([^:\n]+?) (?:\n|$) )?/mx;

# This was the "shortest" thing I could find so
# far (within your constraints), the record-
# walking would be within a while ==>

while( $morestuff =~ /$rg/g ) {

printf "%s %s\n\t%s\n\t%s\n",
$1||'undef', $2||'undef',
$3||'undef', $4||'undef';

}

# ... which would give the correct matches.


Maybe I misunderstood your problem somehow,
but I found the task quite nice and interesting
(maybe somebody will write down a really simple
regular expression for that; not me, it's sleeping
time now in this country ;-).
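
For a single record, a line-based parse may be the simplest route: split the record into lines first, then match only the header line. This is a minimal sketch under the assumption that a record never has more than three lines; parse_record is a hypothetical name, and missing parts come back as empty strings rather than undef.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one record of up to three lines.
# Returns (name, data, second line, third line); missing parts are ''.
sub parse_record {
    my ($msg) = @_;
    my ( $head, $second, $third ) = split /\n/, $msg;
    return unless defined $head
        && $head =~ /\A([^:]+):\d+:\d+:(.*)\z/;
    my ( $name, $data ) = ( $1, $2 );
    return ( $name, $data,
        defined $second ? $second : '',
        defined $third  ? $third  : '' );
}

print join( ',', parse_record("Some Text:0000:0:More Text\nText text\n") ), "\n";
```

This prints Some Text,More Text,Text text, with the last field empty, so the second line stays in the third slot whether or not a third line exists.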

Regards

Mirco
 
Broke

Very good job, Mr. Wahab!
I didn't know the secret
of the qr in your code
until now; I just learned it.
It's extremely useful.

Many thanks!
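
For anyone else meeting qr for the first time: it compiles a pattern once, modifiers included, into a value that can be stored and reused in later matches. A small sketch follows; $header_re and header_fields are hypothetical names, and the pattern is a simplified header matcher rather than the one from the posts above.

```perl
use strict;
use warnings;

# Compile the header pattern once; the /m modifier travels with it.
# $header_re is a hypothetical name for this illustration.
my $header_re = qr/^([^:\n]+):\d+:\d+:(.*)$/m;

# Hypothetical helper: collect [name, data] pairs for every header line.
sub header_fields {
    my ($text) = @_;
    my @found;
    while ( $text =~ /$header_re/g ) {
        push @found, [ $1, $2 ];
    }
    return @found;
}

my @found = header_fields("A:1:2:x\nnote\nB:3:4:y\n");
print "$_->[0] => $_->[1]\n" for @found;
```

This prints A => x and B => y; the line without colons is skipped because the pattern cannot match it.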
 
