How to do variable-width look-behind?

J

jl_post

Hi,

I have a Perl script that processes multi-line input. The problem
is, sometimes this input has newlines stuck in arbitrary places (such
as right in the middle of a valid token). This makes the input out-of-
spec, but I have no control over this, so I want to correct it if I
can. What's more is, sometimes this newline breaks a token in two,
where the first half still looks like a valid token while the other
does not, and vice-versa.

I'm trying to modify my Perl script so that it reviews every
newline and see if it should be discarded. The logic I want to use is
to throw out every newline UNLESS it is flanked (on both sides) by
valid tokens. I would like to be able to do something like this:

# Create a regular expression that matches tokens
# like "N50E40", "N50 E40", "N5000 E4000",
# "50N40E", "50N 40E", and "5000N4000E":
my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;

# Remove newlines that are not surrounded by valid tokens:
$input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g; # no token
before
$input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g; # no token
after
$input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g; # no tokens

The problem is is that the look-behind assertions (both positive
and negative) only work for fixed-width expressions, according to
"perldoc perlre". Unfortunately, it would be so useful for me to be
able to match a string with a variable look-behind, that I'm hoping
there's a logical work-around to this limitation.

Is there any way for me to work around this limitation?

Thanks.

-- Jean-Luc
 
C

C.DeRykus

Hi,

   I have a Perl script that processes multi-line input.  The problem
is, sometimes this input has newlines stuck in arbitrary places (such
as right in the middle of a valid token).  This makes the input out-of-
spec, but I have no control over this, so I want to correct it if I
can.  What's more is, sometimes this newline breaks a token in two,
where the first half still looks like a valid token while the other
does not, and vice-versa.

   I'm trying to modify my Perl script so that it reviews every
newline and see if it should be discarded.  The logic I want to use is
to throw out every newline UNLESS it is flanked (on both sides) by
valid tokens.  I would like to be able to do something like this:

   # Create a regular expression that matches tokens
   # like "N50E40", "N50 E40", "N5000 E4000",
   # "50N40E", "50N 40E", and "5000N4000E":
   my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;

   # Remove newlines that are not surrounded by valid tokens:
   $input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g;  # no token
before
   $input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g;  # no token
after
   $input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g;  # no tokens

   The problem is is that the look-behind assertions (both positive
and negative) only work for fixed-width expressions, according to
"perldoc perlre".  Unfortunately, it would be so useful for me to be
able to match a string with a variable look-behind, that I'm hoping
there's a logical work-around to this limitation.

   Is there any way for me to work around this limitation?

IIUC you could use 5.10's more efficient counterparts for $`,$',$& :

while ( $input =~ m/ \n /gpx ) { # note /p switch: perldoc perlre

my( $pre, $post ) = ( ${^PREMATCH}, ${^POSTMATCH} );

unless ( $pre =~ / $tokenRegExp $/x and
$post =~ / ^ $tokenRegExp /x )
{
substr($input, pos($input)-1, 1, "" );
}
}
 
J

jl_post

I would like to be able to do something like this:

   # Remove newlines that are not surrounded by valid tokens:
   $input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g;
   $input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g;
   $input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g;

   The problem is is that the look-behind assertions (both positive
and negative) only work for fixed-width expressions, according to
"perldoc perlre".


Ben Morrow replied:
you could use the usual solution for positive look-behind:

s/($tokenRegExp)\n(?=$tokenRegExp)/$1\0/g;
s/\n//g;
s/\0/\n/g;

Ah. Thanks for showing me. Yes, that would work for my purposes,
as null bytes (and a whole lot of other characters) are guaranteed to
not appear in my $text.

Before I had read your reply, I came up with a solution of my own.
It's not as simple as yours, but it did appear to work correctly:
First, I split the $text into an array of lines, and then looped
through every pair of lines. If the a line does not end in the token
or its next line does not begin with the token, then I chomp() that
line. Then I set $text to the lines.

Here's essentially what I did:

$text = do {
my @lines = split m/(?<=\n)/, $text;

foreach my $i (0 .. $#lines-1)
{
my ($current, $next) = @lines[$i, $i+1];

# Skip removing newline if surrounded by tokens:
next if $current =~ m/$tokenRegExp$/
and $next =~ m/^$tokenRegExp/;

# This is a linebreak we want to remove:
chomp($lines[$i]);
}

join '', @lines # "returns" the lines into $text
};

It's not quite as elegant or simple as your solution, but it did
appear to work well.


Charles DeRykus replied:
you could use 5.10's more efficient counterparts for $`,$',$& :

while ( $input =~ m/ \n /gpx )
{ # note /p switch: perldoc perlre
my( $pre, $post ) = ( ${^PREMATCH}, ${^POSTMATCH} );
unless ( $pre =~ / $tokenRegExp $/x and
$post =~ / ^ $tokenRegExp /x )
{
substr($input, pos($input)-1, 1, "" );
}

Oh, wow! I never knew about the /p switch! (That's especially odd
since I look up "perldoc perlre" fairly frequently.) Thanks for
telling me about it. Since the script I'm working on must run on
machines that aren't guaranteed to have Perl 5.10, I won't use it
right now, but I'll keep it in mind for scripts of my own use.

Since $input is being modified inside its own while($input =~ m//g)
loop, I might suggest considering saving off pos($input) before the
substr() and then restoring it right after. Otherwise, the while-
match will start back at the beginning of $input.

(That might not be a problem in this case, but saving and restoring
the pos() at least ensures that the while-match loop won't revisit
parts of $input that have already been processed.)

Anyway, thanks for your help and "$input", Ben and Charles. I
appreciate it!

-- Jean-Luc
 
S

sln

Hi,

I have a Perl script that processes multi-line input. The problem
is, sometimes this input has newlines stuck in arbitrary places (such
as right in the middle of a valid token).

If this is the case, that you have '\n' within arbitrary locations within
an otherwise valid token, then you have much bigger problems than worrying
about look behind.

So, I'm not sure what you mean by 'arbitrary locations'. I would have to
asume it could be anywhere.

So, your real problem is actually *framing* a valid token, not newlines.
Which is easier said than done. You would need some kind of analizer that
can distinguish a particular token pattern from other stream info, separate
them, clean them up, and put them back together.

Something like below might help. But, if your token stream is a little less
arbitrary, this could be reduced to something simpler.
I was just going to do/keep it for my own curiosity, but I thought I would post it
here anyway.

-sln
---------------

# tok_parse_nclean.pl
# -sln, 2/10
#
use strict;
use warnings;

my $input = qq(
55N
40E

N
5
1
E42
N53 E44
this is some other token text

N 5 and 38N11E more stuff 28N 99 E
N5000 E4000
59N
48E

56N 410 E
5000N4000 E
);

##
my @tokens = ();
my $new_input = '';
my $digs = '\s*\d[\d\s]*';
my $TOKenRegExp = qr/(?:\b(?:[NS]$digs[EW]$digs|$digs[NS]$digs[EW])\b)/;

##
print "Input:\n",'-'x8,"\n";
print $input,"\n";

##
print "New Input:\n",'-'x8,"\n";

while ($input =~ /( ($TOKenRegExp) | (?:(?!$TOKenRegExp)[^\n])+\n? )/xsg)
{
my $tok = $1;
my $istok = defined $2;
$tok =~ s/^\s*|\s*$//g;
next if !length( $tok );

if ($istok) {
$tok =~ s/\n//g;
$tok =~ s/(?<=\d)\s+(?=\d)//;
$tok =~ s/^[NS]\K\s+(?=\d)//;
$tok =~ s/\d\K\s+(?=[EW]$)//;
$tok =~ s/\s+/ /g;
push @tokens, $tok;
$tok .= "\n";
} else {
$tok .= ' ';
}
$new_input .= $tok;
}
print $new_input,"\n";

##
print "Found Tokens:\n",'-'x8,"\n";
for (@tokens) {
print "'$_'\n";
}
print '-'x8,"\n";

__END__

Input:
--------

55N
40E

N
5
1
E42
N53 E44
this is some other token text

N 5 and 38N11E more stuff 28N 99 E
N5000 E4000
59N
48E

56N 410 E
5000N4000 E

New Input:
--------
55N 40E
N51 E42
N53 E44
this is some other token text N 5 and 38N11E
more stuff 28N 99E
N5000 E4000
59N 48E
56N 410E
5000N4000E

Found Tokens:
--------
'55N 40E'
'N51 E42'
'N53 E44'
'38N11E'
'28N 99E'
'N5000 E4000'
'59N 48E'
'56N 410E'
'5000N4000E'
--------
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,535
Members
45,007
Latest member
obedient dusk

Latest Threads

Top