Index of first and last non-"\xff" in a long string

Discussion in 'Perl Misc' started by w.c.humann@arcor.de, Nov 12, 2007.

  1. Guest

    I'm going through several PGM images, overlaying (i.e. ANDing) them
    and would also like to determine the bounding box. For that I need to
    find the first and last non-white (i.e. non-"\xff") pixel in every
    line. Now I have one line in a string $data (one pixel per character),
    possibly several 1000 charcters long. I tried 3 alternatives so far.
    All 3 work, but there may be even faster ways to do this:

    # slow
    my $first = -1;
    1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
    my $last = $width;
    1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
    print STDERR "f: $first, l: $last, ";

    # the match for $first2 is fast, but the one for $last2 is really slow
    my $first2 = length( ($data =~ /^(\xff+)/)[0] );
    my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
    print STDERR "f2: $first2, l2: $last2, ";

    # best solution so far. "tr" is the slowest part of this.
    # Is there a way without the "tr"?
    $data =~ tr|\x00-\xfe|\x00|;
    my $first3 = index $data, "\x00";
    my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
    print STDERR "f3: $first3, l3: $last3, ";

    print STDERR "\n";

    Thanks,
    Wolfram
     
    , Nov 12, 2007
    #1
    1. Advertising

  2. Ben Morrow Guest

    Quoth :
    > I'm going through several PGM images, overlaying (i.e. ANDing) them
    > and would also like to determine the bounding box. For that I need to
    > find the first and last non-white (i.e. non-"\xff") pixel in every
    > line. Now I have one line in a string $data (one pixel per character),
    > possibly several 1000 charcters long. I tried 3 alternatives so far.
    > All 3 work, but there may be even faster ways to do this:


    On my machine, the benchmark below gives

    Rate subst sloop chop match reverse index C
    subst 403/s -- -90% -94% -96% -97% -97% -100%
    sloop 4078/s 912% -- -36% -58% -73% -73% -96%
    chop 6340/s 1474% 55% -- -35% -58% -59% -94%
    match 9754/s 2322% 139% 54% -- -36% -37% -91%
    reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
    index 15376/s 3718% 277% 143% 58% 1% -- -85%
    C 104065/s 25739% 2452% 1541% 967% 583% 577% --

    so index is probably the best you're going to get without using C.

    Ben

    #!/usr/bin/perl

    use Benchmark qw/cmpthese/;

    my $str = ("\xff" x 160) . ("f" x 10_000) . ("\xff" x 150);
    my $len = length $str;

    use Inline C => <<'EOC';

    IV
    unindex(SV *sv, const char *str)
    {
    const char *pv;
    IV len, i = 0;
    const char chr = str[0];

    pv = SvPV(sv, len);
    while (pv == chr) i++;

    return i;
    }

    IV
    unrindex(SV *sv, const char *str)
    {
    const char *pv;
    IV len, i;
    const char chr = str[0];

    pv = SvPV(sv, len);
    i = len - 1;
    while (pv == chr) i--;

    return i;
    }

    EOC

    cmpthese -3, {
    match => sub {
    local $_ = $str;
    # this is the fastest single match I can come up with
    /^((?>\xff*)).*[^\xff]((?>\xff*))$/;
    161 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
    10_160 == $-[2] or die "\$- failed: " . $-[2];
    },
    index => sub {
    local $_ = $str;
    tr,\x00-\xfe,\x00,;
    161 == 1 + index $_, "\x00"
    or die "index failed: " . 1 + index $_, "\x00";
    10_160 == 1 + rindex $_, "\x00"
    or die "rindex failed: " . 1 + rindex $_, "\x00";
    },
    subst => sub {
    local $_ = $str;
    /[^\xff]/g;
    161 == pos or die "pos failed: " . pos;
    s/\xff+$//;
    10_160 == length or die "subst failed: " . length;
    },
    sloop => sub {
    local $_ = $str;
    /[^\xff]/g;
    161 == pos or die "pos failed: " . pos;
    1 while s/\xff$//;
    10_160 == length or die "sloop failed: " . length;
    },
    chop => sub {
    local $_ = $str;
    /[^\xff]/g;
    161 == pos or die "pos failed: " . pos;
    1 while "\xff" eq chop;
    10_160 == 1 + length or die "chop failed: " . (1 + length);
    },
    reverse => sub {
    local $_ = $str;
    /[^\xff]/g;
    161 == pos or die "pos failed: " . pos;
    $_ = reverse;
    /[^\xff]/g;
    10_160 == (1 + $len - pos)
    or die "reverse failed: " . (1 + $len - pos);
    },
    C => sub {
    local $_ = $str;
    161 == 1 + unindex $_, "\xff"
    or die "unindex failed: " . (1 + unindex $_, "\xff");
    10_160 == 1 + unrindex $_, "\xff"
    or die "unrindex failed: " . (1 + unrindex $_, "\xff");
    },
    };
     
    Ben Morrow, Nov 12, 2007
    #2
    1. Advertising

  3. wrote:
    >
    > I'm going through several PGM images, overlaying (i.e. ANDing) them
    > and would also like to determine the bounding box. For that I need to
    > find the first and last non-white (i.e. non-"\xff") pixel in every
    > line. Now I have one line in a string $data (one pixel per character),
    > possibly several 1000 charcters long. I tried 3 alternatives so far.
    > All 3 work, but there may be even faster ways to do this:
    >
    > # slow
    > my $first = -1;
    > 1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
    > my $last = $width;
    > 1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
    > print STDERR "f: $first, l: $last, ";
    >
    > # the match for $first2 is fast, but the one for $last2 is really slow
    > my $first2 = length( ($data =~ /^(\xff+)/)[0] );
    > my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
    > print STDERR "f2: $first2, l2: $last2, ";
    >
    > # best solution so far. "tr" is the slowest part of this.
    > # Is there a way without the "tr"?
    > $data =~ tr|\x00-\xfe|\x00|;
    > my $first3 = index $data, "\x00";
    > my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
    > print STDERR "f3: $first3, l3: $last3, ";
    >
    > print STDERR "\n";


    Try this in your testing:

    $data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
    ] - 1 );



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Nov 12, 2007
    #3
  4. Guest

    On Nov 12, 8:57 pm, "John W. Krahn" <> wrote:
    >
    > Try this in your testing:
    >
    > $data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
    > ] - 1 );


    Thanks John,

    at least a hundred times faster than my attempt at pattern matching --
    but still several times slower than tr/index/rindex.

    Wolfram
     
    , Nov 12, 2007
    #4
  5. Guest

    On Nov 12, 8:56 pm, Ben Morrow <> wrote:
    >
    > On my machine, the benchmark below gives
    >
    > Rate subst sloop chop match reverse index C
    > subst 403/s -- -90% -94% -96% -97% -97% -100%
    > sloop 4078/s 912% -- -36% -58% -73% -73% -96%
    > chop 6340/s 1474% 55% -- -35% -58% -59% -94%
    > match 9754/s 2322% 139% 54% -- -36% -37% -91%
    > reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
    > index 15376/s 3718% 277% 143% 58% 1% -- -85%
    > C 104065/s 25739% 2452% 1541% 967% 583% 577% --
    >
    > so index is probably the best you're going to get without using C.


    Hey, some great ideas here, thanks Ben.

    Glad I had already found the fastest pure-perl solution :)
    (but 'reverse' is so close, the order might change per run...)

    'Inline C' is great but less portable, and I'm mainly using this on
    win32.

    Wolfram
     
    , Nov 12, 2007
    #5
  6. Ben Morrow Guest

    Quoth :
    > On Nov 12, 8:57 pm, "John W. Krahn" <> wrote:
    > >
    > > Try this in your testing:
    > >
    > > $data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
    > > ] - 1 );


    D'oh! I knew my match didn't need to do so much backtracking...

    > at least a hundred times faster than my attempt at pattern matching --
    > but still several times slower than tr/index/rindex.


    Interesting... which version of perl? With
    This is perl, v5.8.8 built for i386-freebsd-64int

    and adding this

    innerm => sub {
    local $_ = $str;
    /[^\xff].*[^\xff]/s;
    161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
    10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
    },

    to my previous benchmark, I get

    Rate match index innerm C
    match 9609/s -- -41% -53% -91%
    index 16398/s 71% -- -21% -85%
    innerm 20641/s 115% 26% -- -81%
    C 109225/s 1037% 566% 429% --

    though seriously increasing the number of trailing "\xff"s causes both
    'match' and 'innerm' to perform dramatically badly, so maybe this is an
    artefact of my test string.

    Ben
     
    Ben Morrow, Nov 12, 2007
    #6
  7. Guest

    On Nov 12, 10:08 pm, Ben Morrow <> wrote:
    > Interesting... which version of perl? With
    > This is perl, v5.8.8 built for i386-freebsd-64int


    Mine is:
    This is perl, v5.8.7 built for MSWin32-x86-multi-thread

    >
    > and adding this
    >
    > innerm => sub {
    > local $_ = $str;
    > /[^\xff].*[^\xff]/s;
    > 161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
    > 10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
    > },
    >
    > to my previous benchmark, I get
    >
    > Rate match index innerm C
    > match 9609/s -- -41% -53% -91%
    > index 16398/s 71% -- -21% -85%
    > innerm 20641/s 115% 26% -- -81%
    > C 109225/s 1037% 566% 429% --


    Well, with your test-string I get:

    Rate subst sloop chop match index reverse
    innerm
    subst 1306/s -- -91% -94% -94% -94% -95%
    -97%
    sloop 13917/s 965% -- -33% -41% -41% -52%
    -72%
    chop 20894/s 1499% 50% -- -11% -11% -28%
    -59%
    match 23417/s 1693% 68% 12% -- -1% -19%
    -54%
    index 23561/s 1704% 69% 13% 1% -- -18%
    -53%
    reverse 28902/s 2112% 108% 38% 23% 23% --
    -43%
    innerm 50510/s 3767% 263% 142% 116% 114% 75%
    --

    so indeed 'innerm' wins, but...

    > though seriously increasing the number of trailing "\xff"s causes both
    > 'match' and 'innerm' to perform dramatically badly, so maybe this is an
    > artefact of my test string.


    typical files have blank lines at top and bottom and if I modify your
    script like this:

    # lenght of left, middle and right part of string
    #my ($l,$m,$r) = (160, 10_000, 150);
    my ($l,$m,$r) = (5000, 2, 5000);

    my $str = ("\xff" x $l) . ("f" x $m) . ("\xff" x $r);
    my $len = length $str;

    cmpthese -3, {
    match => sub {
    local $_ = $str;
    # this is the fastest single match I can come up with
    /^((?>\xff*)).*[^\xff]((?>\xff*))$/;
    $l+1 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
    $l+$m == $-[2] or die "\$- failed: " . $-[2];
    },
    etc.

    the result is:

    Rate subst sloop chop match innerm
    reverse index
    subst 1.44/s -- -100% -100% -100% -100%
    -100% -100%
    sloop 462/s 32117% -- -37% -75% -82%
    -86% -98%
    chop 737/s 51284% 59% -- -60% -71%
    -78% -96%
    match 1844/s 128383% 299% 150% -- -28%
    -45% -91%
    innerm 2575/s 179348% 457% 249% 40% --
    -23% -87%
    reverse 3352/s 233498% 625% 355% 82% 30%
    -- -84%
    index 20452/s 1424967% 4323% 2673% 1009% 694%
    510% --

    Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

    An optimization not considered so far: Once I've found a left and
    right bound in a line, I only need to check from the edges up these
    bounds in all following lines, because my bounding-box can only grow
    (and never shrink) while checking further lines. As a faked test I've
    used this:

    index2 => sub {
    my $left = substr($str,0,$l+500);
    my $right = substr($str,$l+$m-500,$r+500);
    $left =~ tr,\x00-\xfe,\x00,;
    $right =~ tr,\x00-\xfe,\x00,;
    $l+1 == 1 + index $left, "\x00"
    or die "index failed: " . (1 + index $left, "\x00");
    500 == 1 + rindex $right, "\x00"
    or die "rindex failed: " . (1 + rindex $right, "\x00");
    },

    The potential savings are big but of course highly dependent on the
    actual image:
    Rate index reverse innerm index2
    index 23924/s -- -17% -53% -82%
    reverse 28749/s 20% -- -43% -79%
    innerm 50574/s 111% 76% -- -63%
    index2 136616/s 471% 375% 170% --


    Wolfram
     
    , Nov 13, 2007
    #7
  8. wrote:
    >
    > On Nov 12, 10:08 pm, Ben Morrow <> wrote:
    > > Interesting... which version of perl? With
    > > This is perl, v5.8.8 built for i386-freebsd-64int
    > >
    > > and adding this
    > >
    > > innerm => sub {
    > > local $_ = $str;
    > > /[^\xff].*[^\xff]/s;


    [ SNIP ]

    > Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?


    Not strange at all. The pattern has to match at least two [^\xff]
    characters.


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Nov 13, 2007
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Evan
    Replies:
    6
    Views:
    310
  2. Mathieu Dutour

    long long and long

    Mathieu Dutour, Jul 17, 2007, in forum: C Programming
    Replies:
    4
    Views:
    493
    santosh
    Jul 24, 2007
  3. Shawn W_
    Replies:
    5
    Views:
    298
    Aldric Giacomoni
    Sep 16, 2009
  4. Tomasz Chmielewski

    sorting index-15, index-9, index-110 "the human way"?

    Tomasz Chmielewski, Mar 4, 2008, in forum: Perl Misc
    Replies:
    4
    Views:
    321
    Tomasz Chmielewski
    Mar 4, 2008
  5. Jesse
    Replies:
    1
    Views:
    264
    smallpond
    Oct 3, 2009
Loading...

Share This Page