Efficiently convert characters to octal representation

Discussion in 'Perl Misc' started by Worky Workerson, Jul 28, 2006.

  1. I have a (possibly binary) string like "worky" where I'd like to
    convert each byte to its octal representation, resulting in a string
    "\167\157\162\153\171". I have two solutions, however I'm looking for
    any way that would be faster.

    Control:
    $content = 'worky';
    return $content;

    Solution 1 (in place w/regex):
    $content = 'worky';
    $content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
    return $content

    Solution 2 (index into string):
    $content = 'worky';
    do {
        use bytes;
        foreach my $idx (0..(length($content)-1)) {
            $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
        }
    };
    return $ret;

    Based on a quick cmpthese benchmark, the control is about 16 times
    faster than solution 1 and about 9 times faster than solution 2.

    Does anyone know of A) The fastest way to do this or B) some
    tips/tricks on how to speedup my methods?

    Thanks!
    -Worky
     
    Worky Workerson, Jul 28, 2006
    #1

  2. After tinkering for a while, my best solution is now:
    $content = 'worky';
    return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));

    Anyone got anything better?

    Thanks!
    -Worky
     
    Worky Workerson, Jul 28, 2006
    #2

  3. DJ Stunks Guest

    Worky Workerson wrote:
    > I have a (possibly binary) string like "worky" where I'd like to
    > convert each byte to its octal representation, resulting in a string
    > "\167\157\162\153\171". I have two solutions, however I'm looking for
    > any way that would be faster.
    >
    > Control:
    > $content = 'worky';
    > return $content;
    >
    > Solution 1 (in place w/regex):
    > $content = 'worky';
    > $content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
    > return $content
    >
    > Solution 2 (index into string):
    > $content = 'worky';
    > do {
    > use bytes;
    > foreach my $idx (0..(length($content)-1)) {
    > $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
    > }
    > };
    > return $ret;
    >
    > Based on a quick cmpthese benchmark, the control is about 16 times
    > faster than solution 1 and about 9 times faster than solution 2.
    >
    > Does anyone know of A) The fastest way to do this or B) some
    > tips/tricks on how to speedup my methods?


    if you have the benchmark set up, try this sub:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $string = 'worky';
    print convert_to_octal($string);

    sub convert_to_octal {
        my ($string) = @_;

        # join so the sub also returns the string in scalar context
        return join '', map { sprintf '\\%03o', ord $_ }
            split //, $string;
    }

    __END__

    -jp
     
    DJ Stunks, Jul 28, 2006
    #3
  4. Ben Morrow Guest

    Quoth "Worky Workerson" <>:

    [please quote some context when you reply. Have you read the Posting
    Guidelines?]

    [converting a string into octal escapes]

    > After tinkering for a while, my best solution is now:
    > $content = 'worky'
    > return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));
    >
    > Anyone got anything better?


    Here's a couple more, in the spirit of TMTOWTDI:

    #!/usr/bin/perl

    use warnings;
    use strict;
    use Math::BaseCnv;
    use Benchmark qw/cmpthese/;

    $\ = "\n";
    my $w = 'worky';

    # %03o zero-pads, so bytes below 0100 still get three octal digits
    my %subs = (
        regex  => sub { (my $x = $w) =~ s/(.)/sprintf '\\%03o', ord $1/egs; $x; },
        substr => sub {
            use bytes;
            my $x;
            for (0..(length $w) - 1) {
                $x .= sprintf '\\%03o', ord substr $w, $_, 1;
            }
            return $x;
        },
        unpack => sub { join '', map sprintf('\\%03o', $_), unpack 'C*', $w },
        split  => sub {
            use bytes;
            join '', map sprintf('\\%03o', ord), split //, $w
        },
        cnv    => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
    );

    for (keys %subs) {
        print "$_ => " . $subs{$_}->();
    }

    cmpthese -3, \%subs;

    __END__

    This gives (on my machine)

    cnv => \167\157\162\153\171
    regex => \167\157\162\153\171
    unpack => \167\157\162\153\171
    substr => \167\157\162\153\171
    split => \167\157\162\153\171
              Rate    cnv  regex  split unpack substr
    cnv     5618/s     --   -87%   -87%   -92%   -92%
    regex  42639/s   659%     --    -0%   -39%   -39%
    split  42712/s   660%     0%     --   -39%   -39%
    unpack 69589/s  1139%    63%    63%     --    -1%
    substr 70257/s  1151%    65%    64%     1%     --

    This is usually true: substr > unpack > split > regex. The reason is
    that Perl ops are so much slower than C.

    However, I am hard-pressed to think of a situation where it's worth
    writing anything other than 'regex' above, as clarity is almost always
    more important than speed.

    Also note that the 'regex' solution would need a 'use bytes' to be
    strictly compatible with the others. I'm not sure why you think you need
    it: if you've read your data from a binmode :raw filehandle it's binary
    anyway; otherwise you want to encode it with Encode into a suitable
    encoding.
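
A minimal sketch of that last point, assuming the string holds decoded characters rather than raw bytes. The helper name octal_escape is made up for illustration; encode comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: expects a string of bytes (every character <= 0xFF).
sub octal_escape {
    my ($bytes) = @_;
    return join '', map { sprintf '\\%03o', $_ } unpack 'C*', $bytes;
}

# A character string containing one wide character (U+263A).
my $chars = "\x{263A}";

# Encode to UTF-8 first, so that every element of the string is one byte.
my $bytes = encode('UTF-8', $chars);

print octal_escape($bytes), "\n";    # prints \342\230\272
```

Data read from a binmode :raw filehandle is already bytes, so the encode step would only be needed for strings that were decoded somewhere along the way.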

    Ben

    --
    Joy and Woe are woven fine,
    A Clothing for the Soul divine William Blake
    Under every grief and pine 'Auguries of Innocence'
    Runs a joy with silken twine.
     
    Ben Morrow, Jul 28, 2006
    #4
  5. > sub convert_to_octal {
    > my ($string) = @_;
    >
    > return map { sprintf '\\%03o', ord $_ }
    > split //, $string;
    >
    > }


    It's about 25% slower than my "best" solution listed previously, which
    was basically the same thing with unpack instead of split. Also, since
    the data might be binary, I'm worried about the split // ... isn't that
    a character split (vs a binary split)?
     
    Worky Workerson, Jul 28, 2006
    #5
  6. Ben Morrow wrote:
    > [converting a string into octal escapes]
    >
    > my %subs = (
    > regex => sub { (my $x = $w) =~ s/(.)/sprintf '\\%3o', ord $1/egs; $x; },
    > substr => sub {
    > use bytes;
    > my $x;
    > for (0..(length $w) - 1) {
    > $x .= sprintf '\\%3o', ord substr $w, $_, 1;
    > }
    > return $x;
    > },
    > unpack => sub { join '', map sprintf('\\%3o', $_), unpack 'C*', $w },
    > split => sub {
    > use bytes;
    > join '', map sprintf('\\%3o', ord), split //, $w
    > },
    > cnv => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
    > );


    > However, I am hard-pressed to think of a situation where it's worth
    > writing anything other than 'regex' above, as clarity is almost always
    > more important than speed.


    I'm doing database ETL and transforming 300GB of CSV into something the
    database likes to load. According to DProf, this was my biggest
    slacker by far, partly because it is called so often. Every little bit
    of speed helps :)

    > Also note that the 'regex' solution would need a 'use bytes' to be
    > strictly compatible with the others. I'm not sure why you think you need
    > it: if you've read your data from a binmode :raw filehandle it's binary
    > anyway; otherwise you want to encode it with Encode into a suitable
    > encoding.


    I guess I'm still a little fuzzy on the whole perl/binary thing. I'm
    reading in CSV where most of the columns are ASCII but I'm not sure
    what sort of data will be stored in one of the columns. I am declaring
    binmode on the filehandle ... do I still need the 'use bytes' on the
    substr approach?
     
    Worky Workerson, Jul 28, 2006
    #6
  7. Ben Morrow Guest

    Quoth "Worky Workerson" <>:
    > Ben Morrow wrote:
    > > [converting a string into octal escapes]

    >
    > > However, I am hard-pressed to think of a situation where it's worth
    > > writing anything other than 'regex' above, as clarity is almost always
    > > more important than speed.

    >
    > I'm doing database ETL and transforming 300GB of CSV into something the
    > database likes to load. According to DProf, this was my biggest
    > slacker by far, partly because it is called so often. Every little bit
    > of speed helps :)


    Fair enough :). A lot of people seem to come here saying 'I want to do
    <foo> really fast' without thinking whether that's really necessary.

    > > Also note that the 'regex' solution would need a 'use bytes' to be
    > > strictly compatible with the others. I'm not sure why you think you need
    > > it: if you've read your data from a binmode :raw filehandle it's binary
    > > anyway; otherwise you want to encode it with Encode into a suitable
    > > encoding.

    >
    > I guess I'm still a little fuzzy on the whole perl/binary thing.


    Yeah, it's kinda complicated. It's made harder by the fact that Perl has
    to be backwards-compatible, so a lot of the time just fudging things
    seems to work...

    > I'm reading in CSV where most of the columns are ASCII but I'm not
    > sure what sort of data will be stored in one of the columns. I am
    > declaring binmode on the filehandle ... do I still need the 'use
    > bytes' on the substr approach?


    If you are reading from a binary filehandle, then the data is all 8bit
    (as opposed to wider than that) anyway, so you don't. You may get a
    slight speed benefit by declaring 'use bytes' at the top of the script.

    --
    For far more marvellous is the truth than any artists of the past imagined!
    Why do the poets of the present not speak of it? What men are poets who can
    speak of Jupiter if he were like a man, but if he is an immense spinning
    sphere of methane and ammonia must be silent?~Feynman~
     
    Ben Morrow, Jul 28, 2006
    #7
  8. Guest

    "Worky Workerson" <> wrote:
    > I have a (possibly binary) string like "worky" where I'd like to
    > convert each byte to its octal representation, resulting in a string
    > "\167\157\162\153\171". I have two solutions, however I'm looking for
    > any way that would be faster.
    >

    ....
    > Solution 2 (index into string):
    > $content = 'worky';
    > do {
    > use bytes;
    > foreach my $idx (0..(length($content)-1)) {
    > $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
    > }
    > };
    > return $ret;

    ....
    > Does anyone know of A) The fastest way to do this or B) some
    > tips/tricks on how to speedup my methods?


    This seems like a pretty strange thing to need to optimize. How many
    times do you need to do this operation on a 5 character fixed string?
    If you don't need to do it on a 5 character fixed string, then your
    benchmark should incorporate realistic sizes and with more realistic
    methods for obtaining the non-fixed thing you want to operate on.
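
A minimal sketch of such a benchmark, assuming a 16 KB buffer of pseudo-random bytes is a fair stand-in for the real input. The sub names via_regex and via_unpack, the buffer size, and the iteration count are placeholders, not anything from the original post.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumption: 16 KB of pseudo-random bytes stands in for real input.
my $data = join '', map { chr(int(rand(256))) } 1 .. 16_384;

# Regex variant: /s lets . match newlines as well.
sub via_regex {
    my ($s) = @_;
    $s =~ s/(.)/sprintf '\\%03o', ord $1/egs;
    return $s;
}

# unpack variant: C* yields one number per byte.
sub via_unpack {
    my ($s) = @_;
    return join '', map { sprintf '\\%03o', $_ } unpack 'C*', $s;
}

# Sanity check before timing: both variants must agree on the same input.
die "variants disagree\n" unless via_regex($data) eq via_unpack($data);

# Fixed iteration count to keep the run short; a negative value would
# instead mean "run for at least that many CPU seconds".
cmpthese(50, {
    regex  => sub { via_regex($data) },
    unpack => sub { via_unpack($data) },
});
```

Passing the buffer as an argument, rather than hard-coding a literal inside each sub, keeps the per-call overhead closer to what real code would see.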

    Anyway, if you really need the speed, this Inline C code is about 3 times
    faster than sol2.

                  Rate     sol1     sol2     sol3 control2
    sol1       70274/s       --     -45%     -82%     -94%
    sol2      127219/s      81%       --     -67%     -89%
    sol3      385820/s     449%     203%       --     -66%
    control2 1122504/s    1497%     782%     191%       --


    Benchmark::cmpthese(-5, {
        'control' => sub {control($text)},
        'sol1' => sub {sol1($text)},
        'sol2' => sub {sol2($text)},
        'sol3' => sub {sol3($text)},
    });
    __END__
    __C__
    SV* sol3(SV* a) {
        STRLEN len;
        int i;
        unsigned char * s;
        SV* ret;
        s = SvPV(a,len);
        ret = newSV(4*len);
        for (i=0; i<len; i++,s++) {
            sv_catpvf(ret, "\\%03o", *s);
        };
        return ret;
    };


    Xho

     
    , Jul 28, 2006
    #8
  9. Sisyphus Guest

    <> wrote in message
    ..
    ..
    >
    > Anyway, if really need the speed, this Inline C code is about 3 times
    > faster than sol2.
    >


    A neat little Inline C routine .... so I saved the code and ran it:

    -----------------------------
    D:\pscrpt\inline\>cat char2octal.pl
    use warnings;
    use Inline C => Config =>
        BUILD_NOISY => 1;

    use Inline C => <<'EOC';

    SV* c2o(SV* a) {
        STRLEN len;
        int i;
        unsigned char * s;
        SV* ret;
        s = SvPV(a,len);
        ret = newSV(4*len);
        for (i=0; i<len; i++,s++) {
            sv_catpvf(ret, "\\%03o", *s);
        }
        return ret;
    }

    EOC

    print c2o('abcdABCD'), "\n"; #line 22

    D:\pscrpt\inline\>perl char2octal.pl
    Use of uninitialized value in subroutine entry at char2octal.pl line 22.
    \141\142\143\144\101\102\103\104
    -----------------------------

    I'm sure it's one of those questions that will make me go "Doh!", but I
    can't for the life of me see what is causing that "uninitialized" warning.
    Any hints ? (I'm running perl 5.8.8 on Win32.)

    Cheers,
    Rob
     
    Sisyphus, Jul 29, 2006
    #9
  10. Guest

    "Sisyphus" <> wrote:
    > <> wrote in message
    > .
    > .
    > >
    > > Anyway, if really need the speed, this Inline C code is about 3 times
    > > faster than sol2.
    > >

    >
    > A neat little Inline C routine .... so I saved the code and ran it:
    >
    > -----------------------------
    > D:\pscrpt\inline\>cat char2octal.pl
    > use warnings;
    > use Inline C => Config =>
    > BUILD_NOISY => 1;
    >
    > use Inline C => <<'EOC';
    >
    > SV* c2o(SV* a) {
    > STRLEN len;
    > int i;
    > unsigned char * s;
    > SV* ret;
    > s = SvPV(a,len);
    > ret = newSV(4*len);
    > for (i=0; i<len; i++,s++) {
    > sv_catpvf(ret, "\\%03o", *s);
    > }
    > return ret;
    > }
    >
    > EOC
    >
    > print c2o('abcdABCD'), "\n"; #line 22
    >
    > D:\pscrpt\inline\>perl char2octal.pl
    > Use of uninitialized value in subroutine entry at char2octal.pl line 22.
    > \141\142\143\144\101\102\103\104
    > -----------------------------
    >
    > I'm sure it's one of those questions that will make me go "Doh!", but I
    > can't for the life of me see what is causing that "uninitialized"
    > warning. Any hints ? (I'm running perl 5.8.8 on Win32.)


    Ah, I forgot to turn on warnings and so never saw it.

    Apparently sv_catpvf, unlike the .= operator, doesn't care for undefined
    values. So make that:

    ret = newSV(4*len);
    sv_setpv(ret, "");
    for (i=0; i<len; i++,s++) {

    I guess Inline warnings all get reported as being at subroutine entry?


    For what it's worth, I've made another uglier one that is about twice again
    as fast. This is going to wrap like crazy:

    Xho

    SV* sol32(SV* a) {
    static const char * cache[]={"\\000","\\001","\\002","\\003","\\004",
    "\\005","\\006","\\007","\\010","\\011","\\012","\\013","\\014","\\015",
    "\\016","\\017","\\020","\\021","\\022","\\023","\\024","\\025","\\026",
    "\\027","\\030","\\031","\\032","\\033","\\034","\\035","\\036","\\037",
    "\\040","\\041","\\042","\\043","\\044","\\045","\\046","\\047","\\050",
    "\\051","\\052","\\053","\\054","\\055","\\056","\\057","\\060","\\061",
    "\\062","\\063","\\064","\\065","\\066","\\067","\\070","\\071","\\072",
    "\\073","\\074","\\075","\\076","\\077","\\100","\\101","\\102","\\103",
    "\\104","\\105","\\106","\\107","\\110","\\111","\\112","\\113","\\114",
    "\\115","\\116","\\117","\\120","\\121","\\122","\\123","\\124","\\125",
    "\\126","\\127","\\130","\\131","\\132","\\133","\\134","\\135","\\136",
    "\\137","\\140","\\141","\\142","\\143","\\144","\\145","\\146","\\147",
    "\\150","\\151","\\152","\\153","\\154","\\155","\\156","\\157","\\160",
    "\\161","\\162","\\163","\\164","\\165","\\166","\\167","\\170","\\171",
    "\\172","\\173","\\174","\\175","\\176","\\177","\\200","\\201","\\202",
    "\\203","\\204","\\205","\\206","\\207","\\210","\\211","\\212","\\213",
    "\\214","\\215","\\216","\\217","\\220","\\221","\\222","\\223","\\224",
    "\\225","\\226","\\227","\\230","\\231","\\232","\\233","\\234","\\235",
    "\\236","\\237","\\240","\\241","\\242","\\243","\\244","\\245","\\246",
    "\\247","\\250","\\251","\\252","\\253","\\254","\\255","\\256","\\257",
    "\\260","\\261","\\262","\\263","\\264","\\265","\\266","\\267","\\270",
    "\\271","\\272","\\273","\\274","\\275","\\276","\\277","\\300","\\301",
    "\\302","\\303","\\304","\\305","\\306","\\307","\\310","\\311","\\312",
    "\\313","\\314","\\315","\\316","\\317","\\320","\\321","\\322","\\323",
    "\\324","\\325","\\326","\\327","\\330","\\331","\\332","\\333","\\334",
    "\\335","\\336","\\337","\\340","\\341","\\342","\\343","\\344","\\345",
    "\\346","\\347","\\350","\\351","\\352","\\353","\\354","\\355","\\356",
    "\\357","\\360","\\361","\\362","\\363","\\364","\\365","\\366","\\367",
    "\\370","\\371","\\372","\\373","\\374","\\375","\\376","\\377"};

        STRLEN len;
        int i;
        unsigned char * s;
        SV* ret;
        s = SvPV(a,len);
        ret = newSV(4*len);
        sv_setpv(ret, "");
        for (i=0; i<len; i++,s++) {
            sv_catpv(ret, cache[*s]);
        };
        return ret;
    };

     
    , Jul 29, 2006
    #10
  11. Worky Workerson wrote:
    > I have a (possibly binary) string like "worky" where I'd like to
    > convert each byte to its octal representation, resulting in a string
    > "\167\157\162\153\171". I have two solutions, however I'm looking for
    > any way that would be faster.
    >
    > Control:
    > $content = 'worky';
    > return $content;
    >
    > Solution 1 (in place w/regex):
    > $content = 'worky';
    > $content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
    > return $content
    >
    > Solution 2 (index into string):
    > $content = 'worky';
    > do {
    > use bytes;
    > foreach my $idx (0..(length($content)-1)) {
    > $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
    > }
    > };
    > return $ret;
    >
    > Based on a quick cmpthese benchmark, the control is about 16 times
    > faster than solution 1 and about 9 times faster than solution 2.
    >
    > Does anyone know of A) The fastest way to do this or B) some
    > tips/tricks on how to speedup my methods?


    Create the translation table first:

    my %table = map { chr, sprintf '\%03o', $_ } 0 .. 255;


    $content =~ s/(.)/$table{$1}/sg;


    foreach my $idx ( 0 .. length( $content ) - 1 ) {
    $ret .= $table{ substr $content, $idx, 1 };
    }
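
Put together as a runnable sketch, assuming the input holds only bytes (a character above 0xFF would have no table entry). The wrapper name octal_via_table is made up for illustration.

```perl
use strict;
use warnings;

# Build the 256-entry table once, outside the hot path.
# The parentheses make perl parse the braces as a block, not an anonymous hash.
my %table = map { ( chr($_) => sprintf( '\\%03o', $_ ) ) } 0 .. 255;

# Hypothetical wrapper around the regex variant: one hash lookup per byte
# replaces one sprintf call per byte.
sub octal_via_table {
    my ($content) = @_;
    $content =~ s/(.)/$table{$1}/sg;
    return $content;
}

print octal_via_table("worky\n"), "\n";    # prints \167\157\162\153\171\012
```

Whether the lookup beats sprintf by a useful margin depends on the data and the perl build, so it is worth timing against the other variants.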




    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Jul 29, 2006
    #11
  12. Ben Morrow Guest

    Quoth :
    > "Sisyphus" <> wrote:
    > >
    > > D:\pscrpt\inline\>perl char2octal.pl
    > > Use of uninitialized value in subroutine entry at char2octal.pl line 22.
    > > \141\142\143\144\101\102\103\104
    > > -----------------------------
    > >
    > > I'm sure it's one of those questions that will make me go "Doh!", but I
    > > can't for the life of me see what is causing that "uninitialized"
    > > warning. Any hints ? (I'm running perl 5.8.8 on Win32.)

    >
    > Ah, I forgot to turn on warnings and so never saw it.
    >
    > Apparently sv_catpvf, unlike .= operator, doesn't care for undefined
    > values. So make that:
    >
    > ret = newSV(4*len);
    > sv_setpv(ret, "");
    > for (i=0; i<len; i++,s++) {
    >
    > I guess Inline warnings all get reported as being at subroutine entry?


    Yes, as with all warnings thrown inside XS. The currently executing Perl
    op is the sub call, so that's what you get: the whole XS sub is run as
    part of the sub call op, which then returns rather than jumping to the
    start of the sub as it would with Perl.

    > For what it's worth, I've made another uglier one that is about twice again
    > as fast. This is going to wrap like crazy:
    >
    > SV* sol32(SV* a) {
    > static const char * cache[]={"\\000","\\001","\\002","\\003","\\004",


    I wondered about that (in Perl, not in C); to make it a little less ugly
    you could have

    static char cache[0x100][5]; /* c arrays confuse me :( */

    void populate_cache (void) {
        int i;
        for (i=0; i<0x100; i++) {
            /* "\ooo" plus the trailing NUL is 5 bytes; %03o zero-pads */
            Copy(form("\\%03o", i), cache[i], 5, char);
        }
    }

    Ben

    --
    Outside of a dog, a book is a man's best friend.
    Inside of a dog, it's too dark to read.
    Groucho Marx
     
    Ben Morrow, Jul 29, 2006
    #12