Variable length lookbehind not implemented

fmassion · Aug 21, 2013

Hi folks:

My text (sample):

saddle stitcher: repl. of 8 saddle stitcher 

Goal:
I want to put numbers in square brackets, but only if they do not occur within tags.

My code:

#!/usr/bin/perl -w
open(IN,'sample.txt') || die("Cannot open file!\n");
my $number = '(?<!<.*?)\d+(?!.*?>)';
while(<IN>) {
$_ =~ s/$number/\[$number\]/g;
print "$_\n";
}
close (IN);

Error message:

Variable length lookbehind not implemented in regex m/(?<!<.*?)\d+(?!.*?>)/ at D:\Perl\test.pl line 5, <IN> line 1.

I couldn't find an explanation for this error message. Has anyone an idea?

Charles DeRykus · Aug 21, 2013

See "negative look-behind" in perlre. The explanation is "works only for
fixed-width look-behind".
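
To see the restriction in isolation, here is a minimal sketch. The helper name compiles is made up for illustration and is not from perlre. It compiles each pattern at run time inside eval, so the error can be caught instead of aborting the script. Newer perls (5.30 and later, as far as I know) accept some bounded variable-length look-behinds experimentally, but an unbounded .*? is still rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the pattern string compiles as a regex.
sub compiles {
    my ($pattern) = @_;
    # qr// with an interpolated string compiles at run time;
    # a compile error makes eval return undef.
    return defined eval { qr/$pattern/ };
}

# Look-behind of exactly one character: fixed width.
print "fixed width: ", (compiles('(?<!<)\d+') ? "ok" : "rejected"), "\n";

# Look-behind containing .*?: no fixed (or bounded) width.
print "unbounded:   ", (compiles('(?<!<.*?)\d+') ? "ok" : "rejected"), "\n";
```

The first pattern compiles and the second one does not, which is the situation in the original script.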

A quick, probably fragile, alternative:

my $text;
{ undef $/; $text = <IN>;}

while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
    my($out, $in) = ($1,$2);
    $out =~ s/(\d+)/[$1]/ag;
    print $out, $in;
}

Charles DeRykus · Aug 21, 2013

....

my $text;
{ undef $/; $text = <IN>;}

Better written: { local $/; $text = <IN>}

Uri Guttman · Aug 22, 2013

CD> Better written: { local $/; $text = <IN>}

even better:

use File::Slurp ;
my $text = read_file( $file ) ;

uri

fmassion · Aug 22, 2013

Thanks to all of you for the explanations.

This code does the trick:

use File::Slurp ;
my $text = read_file( 'testfile.txt' ) ;
while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
my($out, $in) = ($1,$2);
$out =~ s/(\d+)/[$1]/ag;
print $out, $in;
}

It also works with these lines:
my $text;
{ undef $/; $text = <IN>;}

This is the result of the test:

saddle stitcher:| repl. of [2] saddle stitcher Mishandled paper:| repl. of mishandled paper Please add [8] staples .... (only numbers outside the tags have been processed.)
Francois

fmassion · Aug 22, 2013

Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

The last printed character is a ">"
We somehow need to find an expression which prints the remaining characters.

Rainer Weikusat · Aug 22, 2013

Charles DeRykus said:
Better written: { local $/; $text = <IN>}

Adding the reason for that: local $/ creates a new binding for $/
which is dynamically scoped to the enclosing block (it has dynamic
extent and indefinite scope[*]). This implies that $/ reverts to its
former value after the enclosing block has finished executing. Except
in very 'controlled and limited' circumstances, this is preferable to
overwriting whatever the current value happens to be at the moment and
'leaking' this 'local policy decision' to all the code executing
after the block.

[*] The Lisp-terminology[**] is somewhat lacking here because the
newly established binding is only visible to code which is reachable
via an execution path starting in the block and this will usually only
be a subset of all of the program code (in absence of travesties like
'execute a random function found via the symbol table of a random
package').

[**]

http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html
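
A minimal sketch of the effect described above. The variable names and the in-memory filehandle are only there to make the example self-contained and are not from the thread. Inside the block $/ is undef, so one read returns everything; after the block the previous value is back and reads are line-wise again.

```perl
use strict;
use warnings;

my $data = "line one\nline two\n";

my $text;
{
    local $/;    # $/ is undef only until the end of this block
    open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
    $text = <$fh>;    # slurps the whole content in one read
    close($fh);
}

# Outside the block $/ has its former value again ("\n" by default).
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $first = <$fh>;    # reads a single line
close($fh);

print "slurped ", length($text), " characters\n";
print "first line after the block: $first";
```

With undef $/ instead of local $/, the second read would also return the whole content, because the change would still be in effect.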

Rainer Weikusat · Aug 22, 2013

Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

The last printed character is a ">"

You could use a proper 'lexer' for HTML.

NB: This is something I just wrote down because I thought it couldn't
be that difficult. It is assumed that numbers which are part of a word
shouldn't be bracketed.

--------------
{
    local $/;
    $_ = <STDIN>;
}

my $in_tag;

{
    unless ($in_tag) {
        /\G</gc && do {
            ++$in_tag;
            print('<');
            redo;
        };

        /\G\b(\d+)\b/gc && do {
            print("[$1]");
            redo;
        };

        (/\G(\d+)/gc
         || /\G([^\d<]+)/gc) && do {
            print($1);
            redo;
        };
    } else {
        /\G>/gc && do {
            print('>');
            --$in_tag;
            redo;
        };

        /\G</gc && do {
            print('<');
            ++$in_tag;
            redo;
        };

        /\G([^<>]+)/gc && do {
            print($1);
            redo;
        };
    }
}

Charles DeRykus · Aug 22, 2013

Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

The last printed character is a ">"
We somehow need to find an expression which prints the remaining characters.

This might be a quick fix.. but again it's probably fragile
in many cases.

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
    my($out, $in) = ($1 // '', $2 // '');
    $out =~ s/(\d+)/[$1]/ag;
    print $out,$in;
}

If unfamiliar with any of the above replacement regex items:

See: perldoc perlre # (?: ) and/or \z
perldoc perlop # \G and/or //

also perlre for the /a modifier

Rainer Weikusat · Aug 22, 2013

Charles DeRykus said:
Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

[...]

This might be a quick fix.. but again it's probably fragile
in many cases.

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');
$out =~ s/(\d+)/[$1]/ag;
print $out,$in;
}

It will also replace numbers in words (which may or may not be
desired). Also, according to a quick test, using

while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {

works, too.

Rainer Weikusat · Aug 22, 2013

[...]

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');

Also according to a quick test I made, a () which matched an empty
string (this includes 'optional' ()s which didn't match anything)
causes an empty string to be put into the corresponding $n which
implies that the $1 // '' is not even useful as workaround for
less-than-useful perl runtime warnings.
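
For reference, a minimal sketch for checking this. The helper name captures is made up for illustration. As far as I know, perl distinguishes two cases: a group that takes part in the match but matches nothing yields a defined empty string, whereas a group that takes no part in the match at all leaves its $n undefined.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two capture groups for one match attempt.
sub captures {
    my ($string) = @_;
    $string =~ /\A ([^<]*) (<.*?>)? /sx or return;
    return ($1, $2);
}

# No tag present: the optional group 2 takes no part in the match.
my ($text_only, $no_tag) = captures("abc");
print "group 2 without a tag: ", (defined $no_tag ? "defined" : "undef"), "\n";

# Tag at the very start: group 1 takes part but matches the empty string.
my ($empty, $tag) = captures("<b>");
print "group 1 before a tag:  ",
    (defined $empty ? "defined, length " . length($empty) : "undef"), "\n";
```

In the second case nothing needs a default; in the first case $2 // '' still makes a difference when warnings are enabled.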

Charles DeRykus · Aug 22, 2013

[...]

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');

Also according to a quick test I made, a () which matched an empty
string (this includes 'optional' ()s which didn't match anything)
causes an empty string to be put into the corresponding $n which
implies that the $1 // '' is not even useful as workaround for
less-than-useful perl runtime warnings.

That's much better. (But that's why I was careful to use the weasel
words "quick" and "fragile" when responding.)

And since the html's pedigree is unknown, an un-entified "<" causes
problems for both:

just a single un-entified < and any no. 1,2,... to \z vanish

You could add /c and take care of even that I think but, at some point
if you want another great leap, a parser is the way to go.

Charles DeRykus · Aug 22, 2013

...

You could add /c and take care of even that I think...

Nope, /c doesn't help.

Charles DeRykus · Aug 23, 2013

if you want another great leap, a parser is the way to go.

I'm not sure this is the "great leap" but here's a possible parser approach:

use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file( $filename );

foreach my $tag ($root->look_down(sub{1) ) {
    while( my($index,$child) = each @{ $tag->content_array_ref } ) {
        unless ( ref($child) eq "HTML::Element" ) {
            $child =~ s/(\d+)/[$1]/ag; # replaces numbers inside words, too
            $tag->splice_content( $index,1,$child );
        }
    }
}
print $root->as_HTML();

Charles DeRykus · Aug 23, 2013

...
foreach my $tag ($root->look_down(sub{1) ) {

^^^^^^^

foreach my $tag ( $root->look_down(sub{1}) ) {

fmassion · Aug 23, 2013

Also, according to a quick test, using

while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {

works, too.

Yes it works, but unfortunately I get an error message about "uninitialized value $in"

My test strings (it's bullshit, just to test the expression). In practise I am using chunks of HTML/XML files, i.e. text which cannot be parsed because not all the required tags are in the text.

Test sentences:
2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box
Please 345 set Saddle stitcher 2-Side <S 65 R> slitting 1008 scrap box5
2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box

Result with "while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) { "

[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
[2]-side slitting [64] scrap box is full <S 64R>

Result with "while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {"

[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box

This is line 18: print $out, $in;

Thus all sentences have been processed as they should have, but the "uninitialized value $in" warning appears twice.

Rainer Weikusat · Aug 23, 2013

Also, according to a quick test, using

while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {

works, too.

Yes it works, but unfortunately I get an error message about
"uninitialized value $in"

The easiest way to deal with spurious warnings is "don't enable them"
:->. perl does automatic type conversions whenever necessary but some
people are STRONGLY (!!!!) convinced that programmer convenience is a
surefire way to achieve disaster (why these people dabble in perl
instead of 'languages designed to be obnoxious', ie, C++ or Java,
escapes me ...).

Apart from that, there are various more-or-less ugly workarounds.
The

my ($out, $in) = ($1 // '', $2 // '')

would be one.

Some others

------
while ( $text =~ /\G (?: ([^<]+) | (<.*?>) ) /sgx ) {
    if (defined $1) {    # a plain truth test would skip a free text of "0"
        my $out = $1;
        $out =~ s/(\d+)/[$1]/g;
        print $out;
    } else {
        print $2;
    }
}
------

This matches either a 'free text' sequence or a complete tag and
performs the substitution when the 'free text' match was successful.

------
while ( $text =~ /\G ([^<]+|<.*?>) /sgx ) {
    my $out = $1;
    $out =~ s/(\d+)/[$1]/g if $out !~ /^</;
    print $out;
}
-----

This is essentially the same except that the matched text always ends
up in $1 so the content of that needs to be examined in order to
determine which it was.

-----
for ($text) {
    /\G([^<]+)/gc && do {
        my $out = $1;
        $out =~ s/(\d+)/[$1]/g;
        print $out;
        redo;
    };

    /\G(<.*?>)/g && do {
        print $1;
        redo;
    };
}
----

This uses for to alias $text to $_. It then checks if either a 'free
text' sequence or a complete tag can be found at the current match
position and performs the correct action for each, followed by a
'redo' in order to restart the loop. If neither pattern matched, end
of the input has obviously been reached and the loop (sort of)
terminates.

NB: The first match needs an additional /c to avoid resetting the
match position if it fails. The second one doesn't because if it
fails, the loop will terminate, anyway.
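
A self-contained sketch of the same for/redo idea, collecting the output in a string instead of printing it. The function name bracket_numbers is made up for illustration. Two things differ from the loop above and are my assumptions about what is wanted: the tag match uses /s so that tags spanning several lines are matched, and whatever is left when neither pattern matches (for instance an unterminated '<') is copied through unchanged rather than dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: bracket digit runs that are outside <...> tags.
sub bracket_numbers {
    my ($text) = @_;
    my $result = '';

    for ($text) {
        # Free text up to the next '<': bracket every digit run.
        /\G([^<]+)/gc && do {
            (my $out = $1) =~ s/(\d+)/[$1]/g;
            $result .= $out;
            redo;
        };

        # A complete tag, possibly spanning lines: copy it unchanged.
        /\G(<.*?>)/gcs && do {
            $result .= $1;
            redo;
        };

        # Neither matched: end of input or an unterminated '<'.
        # Keep the remainder as it is.
        $result .= substr($_, pos($_) // 0);
    }

    return $result;
}

print bracket_numbers(
    "2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box"
), "\n";
```

Like the loops above, this also brackets digits that are part of a word.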

Rainer Weikusat · Aug 23, 2013

Charles DeRykus said:
[...]

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');

Also according to a quick test I made, a () which matched an empty
string (this includes 'optional' ()s which didn't match anything)
causes an empty string to be put into the corresponding $n which
implies that the $1 // '' is not even useful as workaround for
less-than-useful perl runtime warnings.

That's much better. (But that's why I was careful to use the weasel
words "quick" and "fragile" when responding.)

And since the html's pedigree is unknown, an un-entified "<" causes
problems for both:

just a single un-entified < and any no. 1,2,... to \z vanish

Filters are ill-suited for syntax checking because they will produce
garbage output in case of errors.

BTW: Why <.*?> and not <.*>?

Rainer Weikusat · Aug 23, 2013

[...]

-----
for ($text) {
/\G([^<]+)/gc && do {
my $out = $1;
$out =~ s/(\d+)/[$1]/g;
print $out;
redo;
};

/\G(<.*?>)/g && do {

This should be

/\G(<.*?>)/gs

so that tags formatted like this

<
hippocampus
>

are also matched.

fmassion · Aug 25, 2013

Thanks Rainer,

I'll just mention here what worked and what didn't work:

This didn't work as expected. The last bit of text has not been processed:
