Variable length lookbehind not implemented

Discussion in 'Perl Misc' started by fmassion@web.de, Aug 21, 2013.

  1. Guest

    Hi folks:

    My text (sample):

    saddle stitcher: <font color="#008080"><b>repl. of 8 saddle stitcher</b></font> <font color="#8000FF">

    Goal:
    I want to put numbers in square brakets, but only if they do not occur within tags.

    My code:

    #!/usr/bin/perl -w
    open(IN,'sample.txt') || die("Datei kann nicht geöffnet werden!\n");
    my $number = '(?<!<.*?)\d+(?!.*?>)';
    while(<IN>) {
    $_ =~ s/$number/\[$number\]/g;
    print "$_\n";
    }
    close (IN);

    Error message:

    Variable length lookbehind not implemented in regex m/(?<!<.*?)\d+(?!.*?>)/at D:\Perl\test.pl line 5, <IN> line 1.

    I couldn't find an explanation for this error message. Has anyone an idea?
     
    , Aug 21, 2013
    #1
    1. Advertising

  2. On 8/21/2013 10:14 AM, wrote:
    > Hi folks:
    >
    > My text (sample):
    >
    > saddle stitcher: <font color="#008080"><b>repl. of 8 saddle stitcher</b></font> <font color="#8000FF">
    >
    > Goal:
    > I want to put numbers in square brakets, but only if they do not occur within tags.
    >
    > My code:
    >
    > #!/usr/bin/perl -w
    > open(IN,'sample.txt') || die("Datei kann nicht geöffnet werden!\n");
    > my $number = '(?<!<.*?)\d+(?!.*?>)';
    > while(<IN>) {
    > $_ =~ s/$number/\[$number\]/g;
    > print "$_\n";
    > }
    > close (IN);
    >
    > Error message:
    >
    > Variable length lookbehind not implemented in regex m/(?<!<.*?)\d+(?!.*?>)/ at D:\Perl\test.pl line 5, <IN> line 1.
    >
    > I couldn't find an explanation for this error message. Has anyone an idea?
    >


    See "negative look-behind" in perlre. The explanation is "works only for
    fixed-width look-behind".

    A quick, probably fragile, alternative:

    my text;
    { undef $/; $text = <IN>;}

    while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
    my($out, $in) = ($1,$2);
    $out =~ s/(\d+)/[$1]/ag;
    print $out, $in;
    }



    --
    Charles DeRykus
     
    Charles DeRykus, Aug 21, 2013
    #2
    1. Advertising

  3. On 8/21/2013 2:11 PM, Charles DeRykus wrote:
    > ....
    >
    > my text;
    > { undef $/; $text = <IN>;}
    >


    Better written: { local $/; $text = <IN>}

    --
    Charles DeRykus
     
    Charles DeRykus, Aug 21, 2013
    #3
  4. Uri Guttman Guest

    >>>>> "CD" == Charles DeRykus <> writes:

    CD> On 8/21/2013 2:11 PM, Charles DeRykus wrote:
    >> ....
    >>
    >> my text;
    >> { undef $/; $text = <IN>;}
    >>


    CD> Better written: { local $/; $text = <IN>}

    even better:

    use File::Slurp ;
    my $text = read_file( $file ) ;

    uri
     
    Uri Guttman, Aug 22, 2013
    #4
  5. Guest

    Thanks to all of you for the explanations.

    This code does the trick:

    use File::Slurp ;
    my $text = read_file( 'testfile.txt' ) ;
    while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
    my($out, $in) = ($1,$2);
    $out =~ s/(\d+)/[$1]/ag;
    print $out, $in;
    }

    It also works with these lines:
    my text;
    { undef $/; $text = <IN>;}

    This is the result of the test:

    saddle stitcher:| </font><font color="#008080"><b>repl. of [2] saddle stitcher</b></font> <font color="#8000FF">Mishandled paper:| </font><font color="#008080"><b>repl. of mishandled paper</b></font><br>Please add [8] staples .... (only numbers outside the tags have been processed.)
    Francois
     
    , Aug 22, 2013
    #5
  6. Guest

    Sorry, I found a flaw in the expression:

    while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

    If the text doesn't end with a tag, the last $out is not printed in:
    print $out, $in;

    The last printed character is a ">"
    We need somehow to find an expression whicht prints the remaining characters.
     
    , Aug 22, 2013
    #6
  7. Charles DeRykus <> writes:
    > On 8/21/2013 2:11 PM, Charles DeRykus wrote:
    >> ....
    >>
    >> my text;
    >> { undef $/; $text = <IN>;}
    >>

    >
    > Better written: { local $/; $text = <IN>}


    Adding the reason for that: local $/ creates a new binding for $/
    which is dynamically scoped to the enclosing block (it has dynamic
    extent and indefinite scope[*]). This implies that $/ reverts to its
    former value after the enclosing block has finished executing. Except
    in very 'controlled and limited' circumstance, this is preferable to
    overwriting whatever the current value happens to be at the moment and
    'leaking' this 'local policy descision' to the all code executeing
    after the block.

    [*] The Lisp-terminology[**] is somewhat lacking here because the
    newly established binding is only visible to code which is reachable
    via an execution path starting in the block and this will usually only
    be a subset of all of the program code (in absence of travesties like
    'execute a random function found via the symbol table of a random
    package').

    [**]

    http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html
     
    Rainer Weikusat, Aug 22, 2013
    #7
  8. writes:
    > Sorry, I found a flaw in the expression:
    >
    > while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
    >
    > If the text doesn't end with a tag, the last $out is not printed in:
    > print $out, $in;
    >
    > The last printed character is a ">"


    You could use a proper 'lexer' for HTML.

    NB: This is something I just wrote down because I thought it couldn't
    be that difficult. It is assumed that numbers which are part of a word
    shouldn't be bracketed.

    --------------
    {
    local $/;
    $_ = <STDIN>;
    }

    my $in_tag;

    {
    unless ($in_tag) {
    /\G</gc && do {
    ++$in_tag;
    print('<');
    redo;
    };

    /\G\b(\d+)\b/gc && do {
    print("[$1]");
    redo;
    };

    (/\G(\d+)/gc
    || /\G([^\d<]+)/gc) && do {
    print($1);
    redo;
    };
    } else {
    /\G>/gc && do {
    print('>');
    --$in_tag;
    redo;
    };

    /\G</gc && do {
    print('<');
    ++$in_tag;
    redo;
    };

    /\G([^<>]+)/gc && do {
    print($1);
    redo;
    };
    }
    }
     
    Rainer Weikusat, Aug 22, 2013
    #8
  9. On 8/22/2013 6:05 AM, wrote:
    > Sorry, I found a flaw in the expression:
    >
    > while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
    >
    > If the text doesn't end with a tag, the last $out is not printed in:
    > print $out, $in;
    >
    > The last printed character is a ">"
    > We need somehow to find an expression whicht prints the remaining characters.




    This might be a quick fix.. but again it's probably fragile
    in many cases.

    while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
    my($out, $in) = ($1 // '', $2 // '');
    $out =~ s/(\d+)/[$1]/ag;
    print $out,$in;
    }

    If unfamiliar with any of the above replacement regex items:

    See: perldoc perlre # (?: ) and/or \z
    perldoc perlop # \G and/or //

    also perlre for the /a modifier

    --
    Charles DeRykus
     
    Charles DeRykus, Aug 22, 2013
    #9
  10. Charles DeRykus <> writes:
    > On 8/22/2013 6:05 AM, wrote:
    >> Sorry, I found a flaw in the expression:
    >>
    >> while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
    >>
    >> If the text doesn't end with a tag, the last $out is not printed in:
    >> print $out, $in;


    [...]

    > This might be a quick fix.. but again it's probably fragile
    > in many cases.
    >
    > while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
    > my($out, $in) = ($1 // '', $2 // '');
    > $out =~ s/(\d+)/[$1]/ag;
    > print $out,$in;
    > }


    It will also replace numbers in words (which may or may not be
    desired). Also, according to a quick test, using

    while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {

    works, too.
     
    Rainer Weikusat, Aug 22, 2013
    #10
  11. Charles DeRykus <> writes:

    [...]

    > while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
    > my($out, $in) = ($1 // '', $2 // '');


    Also according to a quick test I made, a () which matched an empty
    string (this includes 'optional' ()s which didn't match anything)
    causes an empty string to be put into the corresponding $n which
    implies that the $1 // '' is not even useful as workaround for
    less-than-useful perl runtime warnings.
     
    Rainer Weikusat, Aug 22, 2013
    #11
  12. On 8/22/2013 2:01 PM, Rainer Weikusat wrote:
    > Charles DeRykus <> writes:
    >
    > [...]
    >
    >> while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
    >> my($out, $in) = ($1 // '', $2 // '');

    >
    > Also according to a quick test I made, a () which matched an empty
    > string (this includes 'optional' ()s which didn't match anything)
    > causes an empty string to be put into the corresponding $n which
    > implies that the $1 // '' is not even useful as workaround for
    > less-than-useful perl runtime warnings.
    >


    That's much better. (But, that's why I was careful to use the weasel
    words "quick" and "fragile" when responding :)

    And since the html's pedigree is unknown, an un-entified "<" causes
    problems for both:

    just a single un-entified < and any no. 1,2,... to \z vanish


    You could add /c and take care of even that I think but, at some point
    if you want another great leap, a parser is the way to go.

    --
    Charles DeRykus
     
    Charles DeRykus, Aug 22, 2013
    #12
  13. On 8/22/2013 2:52 PM, Charles DeRykus wrote:
    > ...
    >
    > You could add /c and take care of even that I think...
    >


    Nope, /c doesn't help.

    --
    Charles DeRykus
     
    Charles DeRykus, Aug 22, 2013
    #13
  14. On 8/22/2013 2:52 PM, Charles DeRykus wrote:
    > On 8/22/2013 2:01 PM, Rainer Weikusat wrote:
    >> Charles DeRykus <> writes:
    >> ...

    > if you want another great leap, a parser is the way to go.
    >


    I'm not sure this is the "great leap" but here's a possible parser approach:

    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new_from_file( $filename );

    foreach my $tag ($root->look_down(sub{1) ) {
    while( my($index,$child) = each $tag->content_array_ref ) {
    unless ( ref($child) eq "HTML::Element" ) {
    $child =~ s/(\d+)/[$1]/ag; # 1replaces no's in words
    $tag->splice_content( $index,1,$child );
    }
    }
    }
    print $root->as_HTML();

    --
    Charles DeRykus
     
    Charles DeRykus, Aug 23, 2013
    #14
  15. On 8/22/2013 10:53 PM, Charles DeRykus wrote:
    > ...
    > foreach my $tag ($root->look_down(sub{1) ) {

    ^^^^^^^

    foreach my $tag ( $root->look_down(sub{1}) ) {

    --
    Charles DeRykus
     
    Charles DeRykus, Aug 23, 2013
    #15
  16. Guest

    > Also, according to a quick test, using
    >
    > while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {


    > works, too.


    Yes it works, but unfortunately I get an error message about "uninitialized value $in"

    My test strings (it's bullshit, just to test the expression). In practise I am using chunks of HTML/XML files, i.e. text which cannot be parsed because not all the required tags are in the text.

    Test sentences:
    2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box
    Please 345 set Saddle stitcher 2-Side <S 65 R> slitting 1008 scrap box5
    2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box

    Result with "while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) { "

    [2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
    Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
    [2]-side slitting [64] scrap box is full <S 64R>

    Result with while "( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {"

    [2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
    Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
    Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
    Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
    [2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box

    This is line 18: print $out, $in;

    Thus all sentences have been processed as they should have, but there are 2 times an uninitialized value "$in".
     
    , Aug 23, 2013
    #16
  17. writes:
    >> Also, according to a quick test, using
    >>
    >> while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {

    >
    >> works, too.

    >
    > Yes it works, but unfortunately I get an error message about
    > "uninitialized value $in"


    The easiest way to deal with spurious warnings is "don't enable them"
    :->. perl does automatic type conversions whenever necessary but some
    people are STRONGLY (!!!!) convinced that programmer convenience is a
    surefire way to achieve disaster (why these people dabble in perl
    instead of 'languages designed to be obnoxious', ie, C++ or Java,
    escapes me ...).

    Apart from that, there are various more-or-less ugly workarounds.
    The

    my ($out, $in) = ($1 // '', $2 // '')

    would be one.

    Some others

    ------
    while ( $text =~ /\G ([^<]+)|(<.*?>) /sgx ) {
    if ($1) {
    my $out = $1;
    $out =~ s/(\d+)/[$1]/g;
    print $out;
    } else {
    print $2;
    }
    }
    ------

    This matches either a 'free text' sequence or a complete tag and
    performs the substitution when the 'free text' match was successful.

    ------
    while ( $text =~ /\G ([^<]+|<.*?>) /sgx ) {
    my $out = $1;
    $out =~ s/(\d+)/[$1]/g if $out !~ /^</;
    print $out;
    }
    -----

    This is essentially the same except that the matched text always ends
    up in $1 so the content of that needs to be examined in order to
    determine which it was.

    -----
    for ($text) {
    /\G([^<]+)/gc && do {
    my $out = $1;
    $out =~ s/(\d+)/[$1]/g;
    print $out;
    redo;
    };

    /\G(<.*?>)/g && do {
    print $1;
    redo;
    };
    }
    ----

    This use for to alias text to $_. It then checks if either a 'free
    text' sequence or a complete tag can be found at the current match
    position and performs the correct action for each, followed by a
    'redo' in order to restart the loop. If neither pattern matched, end
    of the input has obviously been reached and the loop (sort of)
    terminates.

    NB: The first match needs an additional /c to avoid resetting the
    match position if it fails. The second one doesn't because if it
    fails, the loop will terminate, anyway.
     
    Rainer Weikusat, Aug 23, 2013
    #17
  18. Charles DeRykus <> writes:
    > On 8/22/2013 2:01 PM, Rainer Weikusat wrote:
    >> Charles DeRykus <> writes:
    >>
    >> [...]
    >>
    >>> while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
    >>> my($out, $in) = ($1 // '', $2 // '');

    >>
    >> Also according to a quick test I made, a () which matched an empty
    >> string (this includes 'optional' ()s which didn't match anything)
    >> causes an empty string to be put into the corresponding $n which
    >> implies that the $1 // '' is not even useful as workaround for
    >> less-than-useful perl runtime warnings.
    >>

    >
    > That's much better. (But, that's why I was careful to use the weasel
    > words "quick" and "fragile" when responding :)
    >
    > And since the html's pedigree is unknown, an un-entified "<" causes
    > problems for both:
    >
    > just a single un-entified < and any no. 1,2,... to \z vanish


    Filters are ill-suited for syntax checking because they will produce
    garbage output in case of errors.

    BTW: Why <.*?> and not <.*>?
     
    Rainer Weikusat, Aug 23, 2013
    #18
  19. Rainer Weikusat <> writes:

    [...]


    > -----
    > for ($text) {
    > /\G([^<]+)/gc && do {
    > my $out = $1;
    > $out =~ s/(\d+)/[$1]/g;
    > print $out;
    > redo;
    > };
    >
    > /\G(<.*?>)/g && do {


    This should be

    /\G(<.*?>)/gs

    so that tags formatted like this

    <
    hippocampus
    >


    are also matched.
     
    Rainer Weikusat, Aug 23, 2013
    #19
  20. Guest

    Thanks Rainer,

    I'll just mention here what worked and what didn't work:

    This didn't work as expected. The last bit of text has not been processed:
    > my ($out, $in) = ($1 // '', $2 // '')
    >

    This worked:
    > ------
    > while ( $text =~ /\G ([^<]+)|(<.*?>) /sgx ) {
    > if ($1) {
    > my $out = $1;
    > $out =~ s/(\d+)/[$1]/g;
    > print $out;
    > } else {
    > print $2;
    > }
    > }

    This worked:
    > ------
    > while ( $text =~ /\G ([^<]+|<.*?>) /sgx ) {
    > my $out = $1;
    > $out =~ s/(\d+)/[$1]/g if $out !~ /^</;
    > print $out;
    > }

    This worked:
    > -----
    > for ($text) {
    > /\G([^<]+)/gc && do {
    > my $out = $1;
    > $out =~ s/(\d+)/[$1]/g;
    > print $out;
    > redo;
    > };
    > /\G(<.*?>)/g && do {
    > print $1;
    > redo;
    > };
    > }
     
    , Aug 25, 2013
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Thomas F. O'Connell

    Negative Lookbehind and Wildcards

    Thomas F. O'Connell, Feb 27, 2004, in forum: Perl
    Replies:
    1
    Views:
    741
    Gunnar Hjalmarsson
    Feb 28, 2004
  2. mail
    Replies:
    1
    Views:
    543
    Will Stranathan
    Mar 2, 2004
  3. OKB (not okblacke)

    Variable-width lookbehind

    OKB (not okblacke), Nov 17, 2007, in forum: Python
    Replies:
    6
    Views:
    340
    OKB (not okblacke)
    Nov 20, 2007
  4. Gabriel Rossetti
    Replies:
    0
    Views:
    608
    Gabriel Rossetti
    Mar 31, 2009
  5. MRAB
    Replies:
    0
    Views:
    553
Loading...

Share This Page