Regex testing and UTF8 awareness or Regex and numeric pattern matching

Discussion in 'Perl Misc' started by sln@netherlands.com, Mar 10, 2009.

  1. Guest

    Reading through the pods for info on utf8 and possible integer matching,
    and setting up numerous tests, I inadvertently discovered what utf really is
    in its entirety.

    Unfortunately, only utf-8 is allowed (my 5.8.6 version) within Perl. All the
    gates and entry points are covered. Internally, it's as the documentation says,
    pure utf8. The BOM (byte order mark) is different for utf16/32.

    If you try to force an internal variant, say utf32, you get malformed or
    utf-16 surrogate errors. It almost seems impossible, then, that you could
    convert external utf32 (no BOM) or utf16 to internal utf8. But then, how
    could you test it internally when there are no conversion functions?
    It does no good internally because it is not an entry point. You are inside
    Perl, which doesn't understand anything other than utf8 or byte demotion.
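
    One avenue that may be worth testing for the conversion question above is
    the Encode module, which is bundled with the 5.8 series. The following is
    only a minimal sketch under that assumption: the sample code points are
    made up for illustration, and the input is assumed to be big-endian
    UTF-32 with no BOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative code points, packed as 32-bit big-endian integers (no BOM).
my @code_points = (0x41, 0x263A, 0x10400);
my $utf32_bytes = pack 'N*', @code_points;

# decode() converts the external UTF-32BE bytes into a Perl character string.
my $chars = decode('UTF-32BE', $utf32_bytes);

# Round trip: each character's ordinal should be the original integer.
my @round_trip = map { ord } split //, $chars;
print "@round_trip\n";   # prints "65 9786 66560"
```

    Naming the byte order explicitly (UTF-32BE or UTF-32LE) is what lets the
    decoder work without a BOM.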

    Not a very good strategy. This leaves holes if one wanted to do utf32 character
    processing inside a regular expression. Of course I don't want to do that; I
    want to process binary 32-bit integers with some of the niceties of the regex engine.

    If the regex engine is so nice as to process some intermittent ranges of 32-bit
    integers encoded as characters (utf8), perhaps it's almost there towards integer
    pattern matching. Although the constructs would need to change a little, it would
    be a powerful binary parser. Don't you agree?
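
    Stripped down, the idea of matching integers as characters looks roughly
    like this. It is a sketch only: the sample integers and the target
    sequence are hypothetical, and they are kept below 0xD800 so the
    surrogate range does not come into play.

```perl
use strict;
use warnings;

# Hypothetical sample data, kept below 0xD800 to stay clear of surrogates.
my @ints = (7, 300, 301, 302, 9, 305, 12);
my $str  = pack 'U*', @ints;

# Three consecutive integers, a lazy gap, then a terminator value.
my $pattern = sprintf '(\x{%x}\x{%x}\x{%x}).*?(\x{%x})', 300, 301, 302, 305;

my $report = 'no match';
if ($str =~ /$pattern/s) {
    # Save captures and offsets before running any other pattern operation.
    my ($head, $term, $head_at, $term_at) = ($1, $2, $-[1], $-[2]);
    my @head = map { ord } split //, $head;
    # Offsets are in characters, so here they are also indexes into @ints.
    $report = sprintf 'head=%s at index %d, terminator=%d at index %d',
        join(',', @head), $head_at, ord($term), $term_at;
}
print "$report\n";   # head=300,301,302 at index 1, terminator=305 at index 5
```

    Because each integer becomes exactly one character, the match offsets in
    @- map straight back to positions in the original list.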

    Below is some minutiae of trial-and-error spaghetti code I've tried.
    Within ranges, encoding 32-bit integers for basic pattern matching works well.
    Of course, it is very slow in character classes, as opposed to groups, but sometimes
    putting a few characters in the 0-256 range in a class won't cause it to crash, whereas
    in groups it will. Above that, some ranges give the surrogate or malformed utf8 errors.

    Outside of problem ranges (BOM) it works flawlessly in groups, and it's really fast.
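
    One way to keep the test data out of the problem ranges is to screen the
    integers before packing them. The helper below is hypothetical, and the
    ranges it rejects (surrogates, the noncharacters 0xFFFE and 0xFFFF,
    anything past 0x10FFFF) are an assumption taken from the Unicode
    standard; which values a given Perl version actually complains about
    may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $n can be packed as a character without
# landing in a range the Unicode standard reserves or excludes.
sub is_safe_code_point {
    my ($n) = @_;
    return 0 if $n < 0 || $n > 0x10FFFF;          # outside the Unicode range
    return 0 if $n >= 0xD800 && $n <= 0xDFFF;     # UTF-16 surrogates
    return 0 if $n == 0xFFFE || $n == 0xFFFF;     # noncharacters (0xFFFE is the swapped BOM)
    return 1;
}

my @candidates = (0x41, 0xD800, 0xDF20, 0xFFFE, 0x10FFFF, 0x110000);
my @safe       = grep { is_safe_code_point($_) } @candidates;
print "@safe\n";   # prints "65 1114111"
```

    Note that 0xdf20 through 0xdf22, the values pushed at the end of the test
    array below, fall inside the surrogate range and would be screened out.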

    So, my question is, why is Perl so short-sighted in this regard? Just some apparently simple
    adjustments and it could be a high-grade binary processor.

    The junk code is below. If you haven't tried it or can't explain it,
    don't bother replying. I've read all the unicode material there is in the pods and understand it
    completely.

    -sln

    ------------------------------------------------------------------------------
    ##
    use warnings;
    use strict;

    printf ">>>>>>> \n%d %d %d %d \n<<<<<<<<<<<\n", 0xdf20,0xdf21,0xdf22,0xdf23,;

    binmode STDOUT, ':utf8';

    #my @ar = (120000,21,22,23,24,25,26,27,28,ord('a'),30);

    #my @ar = (20000,20001,20002,0,20003,20004,20005,23336,20007,20008,20009,30000);

    #my @ar = (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);

    my @ar = ();
    #push @ar, $_ for (0 .. 280);
    #push @ar, $_ for (240 .. 280);
    push @ar, $_ for (0 .. 70000);

    push @ar, 0xdf20;
    push @ar, 0xdf21;
    push @ar, 0xdf21;
    push @ar, 0xdf22;

    my $str = pack 'U*', @ar;
    #print "\nstr = ",$str,"\nlength = ",length($str),"\n";


    foreach my $cur (@ar) # here $cur is the frame position
    {


    # my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x})(.{0,5})\\x{%x}", $cur,$cur+1,$cur+2,$cur+5;

    # my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x})(\$)", $cur,$cur+1,$cur+2; # GETS 3 at end of string

    # 3? >> my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x}.{0,8})([\\x{%x}-\\x{%x}])", $cur,$cur+1,$cur+2, $cur+4,$cur+10;

    # 2 >> my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x}).{0,8}([\\x{%x}-\\x{%x}])", $cur,$cur+1,$cur+2, $cur+7,$cur+10;

    # -- my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x}).*?(\\x{%x})", $cur,$cur+1,$cur+2, $cur+4;

    # my $pattern = sprintf "(%c%c%c).*?(%c)", $cur,$cur+1,$cur+2, $cur+5;
    # my $pattern = sprintf "(\\%c\\%c\\%c).*?(\\%c)", $cur,$cur+1,$cur+2, $cur+5;


    #my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x}).*?(\\x{%x})", $cur,$cur+1,$cur+2, $cur+5;
    my $pattern = sprintf "(%c%c%c).*?(%c)", $cur,$cur+1,$cur+2, $cur+5;
    if ($cur < 256)
    {
    $pattern = sprintf "([\\x{%x}][\\x{%x}][\\x{%x}]).*?([\\x{0%x}])", $cur,$cur+1,$cur+2,$cur+5;
    }


    # my $pattern = sprintf "(\\x{%x}\\x{%x})(.*?)(\\x{%x})", $cur,$cur+1,$cur+5;

    # my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x}).*?(\\x{%x})", $cur,$cur+1,$cur+2,$cur+5;

    # my $pattern = sprintf "(\\x{0%x}\\x{0%x}\\x{0%x})[^\\x{0%x}]*?(\\x{0%x})", $cur,$cur+1,$cur+2,$cur+5,$cur+5;

    # my $pattern = sprintf "([\\x{0%x}\\x{0%x}\\x{0%x}]{3}).*?([\\x{0%x}])", $cur,$cur+1,$cur+2,$cur+4;

    ### apparently \\x{%x} must exist in char class
    ### and here [\\%c] won't work because of unknown escaped chars like \J
    ##

    #my $pattern = sprintf "([\\x{%x}\\x{%x}\\x{%x}]).*?([\\x{0%x}])", $cur,$cur+1,$cur+2,$cur+5;

    # my $pattern = sprintf "[\\x{%x}][\\x{%x}][\\x{%x}].*?[\\x{0%x}]", $cur,$cur+1,$cur+2,$cur+5;


    #my $s1 = sprintf "%c%c%c",$cur,$cur+1,$cur+2;
    #my $s2 = sprintf "%c",$cur+5;
    #$s1 = quotemeta($s1);
    #$s2 = quotemeta($s2);
    #my $pattern = sprintf "(%s).*?(%s)", $s1,$s2;


    # my $pattern = sprintf "(\\%c\\%c\\%c).*?(\\%c)", $cur,$cur+1,$cur+2, $cur+5;


    # my $pattern = sprintf "([\\x{%x}][\\x{%x}][\\x{%x}]).*?([\\x{0%x}])", $cur,$cur+1,$cur+2,$cur+5;

    # my $pattern = sprintf "(\\x{0%x}\\x{0%x})(.*?)(\\x{0%x})", $cur,$cur+1,$cur+4;

    # --> my $pattern = sprintf "(\\%c\\%c\\%c)[^\\%c]*?(\\0%c)", $cur,$cur+1,$cur+2, $cur+5, $cur+5;

    # my $pattern = sprintf "(\\x{%x}\\x{%x}\\x{%x}).*?(\\x{%x})", $cur,$cur+1,$cur+2, $cur+4;

    # my $pattern = sprintf "(%s).{0,5}([\\x{%x}-\\x{%x}])", $test, $cur+7,$cur+10;

    # my $pattern = sprintf "(%c%c%c).{0,5}([%c-%c])", ($cur,$cur+1,$cur+2), $cur+7,$cur+10;

    #>> my $pattern = sprintf "([%c-%c]{3}).{0,5}([%c-%c])", ($cur,$cur+2), $cur+7,$cur+10;


    # print "\n----------------------------\ncur = $cur\n";
    # print "pattern = $pattern\n";

    #$str =~ /($pattern)/s;

    #my @p = unpack ('U*',$pattern);
    #my @p = map {ord $_} split '',$pattern;
    #print "pat = @p\n";

    if ( $str =~ /($pattern)/s) ### NEED '/s' BECAUSE '.*?' WON'T MATCH '\n' WITHOUT IT
    {
    #print "$cur\n";
    print "$cur\n" if ($cur % 1000 == 0);
    next;

    my @m1 = unpack ('U*',$1);
    my @m2 = unpack ('U*',$2);
    my @m3 = unpack ('U*',$3);

    print "matched:\n 1 = '@m1', length = ".length($1).
    "\n 2 = '@m2', length = ".length($2).
    "\n 3 = '@m3', length = ".length($3)."\n";

    printf "\$3 = %d\n",ord $3;
    }
    else
    {
    print STDERR "didn't match $cur\n";
    print "didn't match $cur\n";
    }

    }
    , Mar 10, 2009
    #1

  2. Guest

    On Tue, 10 Mar 2009 03:07:03 GMT, wrote:

    >Reading through the pods for info on utf8 and possible integer matching,
    >and setting up numerous tests, I inadvertently discovered what utf really is
    >in its entirety.
    >

    [snip]

    Oh, I made a mistake. This group is strictly for beginners.

    -sln
    , Mar 10, 2009
    #2

  3. Guest

    On Tue, 10 Mar 2009 03:48:17 GMT, wrote:

    >On Tue, 10 Mar 2009 03:07:03 GMT, wrote:
    >
    >>Reading through the pods for info on utf8 and possible integer matching,
    >>and setting up numerous tests, I inadvertently discovered what utf really is
    >>in its entirety.
    >>

    >[snip]
    >
    >Oh, I made a mistake. This group is strictly for beginners.
    >

    And for CPAN module awareness of input parameters.
    Pardon the mind-expanding observations.

    -sln
    , Mar 10, 2009
    #3