Efficiently convert characters to octal representation

Worky Workerson

I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.

Control:
$content = 'worky';
return $content;

Solution 1 (in place w/regex):
$content = 'worky';
$content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
return $content

Solution 2 (index into string):
$content = 'worky';
do {
use bytes;
foreach my $idx (0..(length($content)-1)) {
$ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
}
};
return $ret;

Based on a quick cmpthese benchmark, the control is about 16 times
faster than solution 1 and about 9 times faster than solution 2.

Does anyone know of A) the fastest way to do this or B) some
tips/tricks on how to speed up my methods?

Thanks!
-Worky
 
Worky Workerson

After tinkering for a while, my best solution is now:
$content = 'worky';
return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));

Anyone got anything better?

Thanks!
-Worky
 
DJ Stunks

Worky said:
I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.

Control:
$content = 'worky';
return $content;

Solution 1 (in place w/regex):
$content = 'worky';
$content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
return $content

Solution 2 (index into string):
$content = 'worky';
do {
use bytes;
foreach my $idx (0..(length($content)-1)) {
$ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
}
};
return $ret;

Based on a quick cmpthese benchmark, the control is about 16 times
faster than solution 1 and about 9 times faster than solution 2.

Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

If you have the benchmark set up, try this sub:

#!/usr/bin/perl

use strict;
use warnings;

my $string = 'worky';
print convert_to_octal($string);

sub convert_to_octal {
    my ($string) = @_;

    # join so the sub returns one string, even in scalar context
    return join '', map { sprintf '\\%03o', ord $_ }
        split //, $string;
}

__END__

-jp
 
Ben Morrow

Quoth "Worky Workerson" <[email protected]>:

[please quote some context when you reply. Have you read the Posting
Guidelines?]

[converting a string into octal escapes]
After tinkering for a while, my best solution is now:
$content = 'worky';
return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));

Anyone got anything better?

Here's a couple more, in the spirit of TMTOWTDI:

#!/usr/bin/perl

use warnings;
use strict;
use Math::BaseCnv;
use Benchmark qw/cmpthese/;

$\ = "\n";
my $w = 'worky';

my %subs = (
    # %03o (not %3o) so bytes below 0100 are zero-padded rather than space-padded
    regex  => sub { (my $x = $w) =~ s/(.)/sprintf '\\%03o', ord $1/egs; $x; },
    substr => sub {
        use bytes;
        my $x;
        for (0..(length $w) - 1) {
            $x .= sprintf '\\%03o', ord substr $w, $_, 1;
        }
        return $x;
    },
    unpack => sub { join '', map sprintf('\\%03o', $_), unpack 'C*', $w },
    split  => sub {
        use bytes;
        join '', map sprintf('\\%03o', ord), split //, $w
    },
    # cnv() output is not zero-padded, so this only matches the others for bytes >= 0100
    cnv    => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
);

for (keys %subs) {
print "$_ => " . $subs{$_}->();
}

cmpthese -3, \%subs;

__END__

This gives (on my machine)

cnv => \167\157\162\153\171
regex => \167\157\162\153\171
unpack => \167\157\162\153\171
substr => \167\157\162\153\171
split => \167\157\162\153\171
          Rate   cnv regex split unpack substr
cnv     5618/s    --  -87%  -87%   -92%   -92%
regex  42639/s  659%    --   -0%   -39%   -39%
split  42712/s  660%    0%    --   -39%   -39%
unpack 69589/s 1139%   63%   63%     --    -1%
substr 70257/s 1151%   65%   64%     1%     --

This is usually true: substr > unpack > split > regex. The reason is
that Perl ops are so much slower than C.

However, I am hard-pressed to think of a situation where it's worth
writing anything other than 'regex' above, as clarity is almost always
more important than speed.

Also note that the 'regex' solution would need a 'use bytes' to be
strictly compatible with the others. I'm not sure why you think you need
it: if you've read your data from a binmode :raw filehandle it's binary
anyway; otherwise you want to encode it with Encode into a suitable
encoding.
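To illustrate the Encode route mentioned above, here is a minimal sketch. The sub name octal_escape and the sample string are made up for the example; it assumes the data is a character string that should be written out as UTF-8 before being escaped.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Convert every byte of an already-encoded string to a \ooo escape.
# (octal_escape is a hypothetical name, not from the thread.)
sub octal_escape {
    my ($bytes) = @_;
    return join '', map { sprintf '\\%03o', $_ } unpack 'C*', $bytes;
}

# A character string containing U+263A, a code point above 255.
my $chars = "a\x{263A}";

# Encode explicitly so the converter only ever sees octets.
my $octets = encode('UTF-8', $chars);

print octal_escape($octets), "\n";   # \141\342\230\272
```

Without the encode step, ord would return 9786 for the second character and the escape would no longer be three octal digits.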

Ben
 
Worky Workerson

sub convert_to_octal {
my ($string) = @_;

return map { sprintf '\\%03o', ord $_ }
split //, $string;

}

It's about 25% slower than my "best" solution listed previously, which
was basically the same thing with unpack instead of split. Also, since
the data might be binary, I'm worried about the split // ... isn't that
a character split (vs a binary split)?
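One way to check the split // concern is to compare it against unpack on a string that holds every byte value. This is only a sketch with made-up variable names; it assumes the string contains no code points above 255, which is the case for data read from a binary filehandle.

```perl
use strict;
use warnings;

# A binary string holding every byte value exactly once.
my $binary = join '', map { chr($_) } 0 .. 255;

# Character-wise split versus byte-wise unpack.
my $via_split  = join '', map { sprintf '\\%03o', ord($_) } split //, $binary;
my $via_unpack = join '', map { sprintf '\\%03o', $_ } unpack 'C*', $binary;

print $via_split eq $via_unpack ? "same\n" : "different\n";
```

For such a string each character is one byte, so the two approaches produce the same escapes.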
 
Worky Workerson

Ben said:
[converting a string into octal escapes]

my %subs = (
    regex  => sub { (my $x = $w) =~ s/(.)/sprintf '\\%03o', ord $1/egs; $x; },
    substr => sub {
        use bytes;
        my $x;
        for (0..(length $w) - 1) {
            $x .= sprintf '\\%03o', ord substr $w, $_, 1;
        }
        return $x;
    },
    unpack => sub { join '', map sprintf('\\%03o', $_), unpack 'C*', $w },
    split  => sub {
        use bytes;
        join '', map sprintf('\\%03o', ord), split //, $w
    },
    cnv    => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
);

However, I am hard-pressed to think of a situation where it's worth
writing anything other than 'regex' above, as clarity is almost always
more important than speed.

I'm doing database ETL and transforming 300GB of CSV into something the
database likes to load. According to DProf, this was my biggest
slacker by far, partly because it is called so often. Every little bit
of speed helps :)

Also note that the 'regex' solution would need a 'use bytes' to be
strictly compatible with the others. I'm not sure why you think you need
it: if you've read your data from a binmode :raw filehandle it's binary
anyway; otherwise you want to encode it with Encode into a suitable
encoding.

I guess I'm still a little fuzzy on the whole perl/binary thing. I'm
reading in CSV where most of the columns are ASCII but I'm not sure
what sort of data will be stored in one of the columns. I am declaring
binmode on the filehandle ... do I still need the 'use bytes' on the
substr approach?
 
Ben Morrow

Quoth "Worky Workerson":
Ben said:
[converting a string into octal escapes]
However, I am hard-pressed to think of a situation where it's worth
writing anything other than 'regex' above, as clarity is almost always
more important than speed.

I'm doing database ETL and transforming 300GB of CSV into something the
database likes to load. According to DProf, this was my biggest
slacker by far, partly because it is called so often. Every little bit
of speed helps :)

Fair enough :). A lot of people seem to come here saying 'I want to do

I guess I'm still a little fuzzy on the whole perl/binary thing.

Yeah, it's kinda complicated. It's made harder by the fact that Perl has
to be backwards-compatible, so a lot of the time just fudging things
seems to work...

I'm reading in CSV where most of the columns are ASCII but I'm not
sure what sort of data will be stored in one of the columns. I am
declaring binmode on the filehandle ... do I still need the 'use
bytes' on the substr approach?

If you are reading from a binary filehandle, then the data is all 8-bit
(as opposed to wider than that) anyway, so you don't need it. You may get
a slight speed benefit by declaring 'use bytes' at the top of the script.
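A small sketch of that situation, using an in-memory filehandle as a stand-in for the real CSV file (the buffer contents and variable names are invented for the example):

```perl
use strict;
use warnings;

# Stand-in for a CSV file: a buffer with some non-ASCII octets.
my $buffer = "abc\xE9\xFF\n";

# The :raw layer is what binmode($fh) would give on a real file.
open my $fh, '<:raw', \$buffer or die "cannot open in-memory file: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# No 'use bytes' here: the data came from a raw handle, so each
# character is already a single octet.
my $escaped = '';
for my $idx (0 .. length($line) - 1) {
    $escaped .= sprintf '\\%03o', ord(substr($line, $idx, 1));
}

print "$escaped\n";   # \141\142\143\351\377
```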
 
xhoster

Worky Workerson said:
I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.
....
Solution 2 (index into string):
$content = 'worky';
do {
use bytes;
foreach my $idx (0..(length($content)-1)) {
$ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
}
};
return $ret; ....
Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

This seems like a pretty strange thing to need to optimize. How many
times do you need to do this operation on a 5 character fixed string?
If you don't need to do it on a 5 character fixed string, then your
benchmark should incorporate realistic sizes and with more realistic
methods for obtaining the non-fixed thing you want to operate on.
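As a rough sketch of what a more realistic benchmark might look like: the 4 KiB of pseudo-random bytes below is an assumption standing in for real field data, and the sub names are invented for the example. Real input sampled from the CSV would be a better test.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 4 KiB of pseudo-random bytes stands in for
# a real field value; substitute data sampled from the actual input.
srand(42);
my $text = join '', map { chr(int(rand(256))) } 1 .. 4096;

sub sol_regex {
    my ($s) = @_;
    $s =~ s/(.)/sprintf('\\%03o', ord($1))/egs;
    return $s;
}

sub sol_substr {
    my ($s) = @_;
    my $ret = '';
    for my $idx (0 .. length($s) - 1) {
        $ret .= sprintf '\\%03o', ord(substr($s, $idx, 1));
    }
    return $ret;
}

sub sol_unpack {
    my ($s) = @_;
    return join '', map { sprintf '\\%03o', $_ } unpack 'C*', $s;
}

# Run each candidate for at least one CPU second and print a comparison.
cmpthese(-1, {
    regex  => sub { sol_regex($text) },
    substr => sub { sol_substr($text) },
    unpack => sub { sol_unpack($text) },
});
```

The data is passed in as an argument rather than hard-coded inside each sub, so the cost of building the input stays out of the measurement.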

Anyway, if you really need the speed, this Inline C code is about 3 times
faster than sol2.

              Rate     sol1     sol2     sol3 control2
sol1       70274/s       --     -45%     -82%     -94%
sol2      127219/s      81%       --     -67%     -89%
sol3      385820/s     449%     203%       --     -66%
control2 1122504/s    1497%     782%     191%       --


Benchmark::cmpthese(-5, {
'control' => sub {control($text)},
'sol1' => sub {sol1($text)},
'sol2' => sub {sol2($text)},
'sol3' => sub {sol3($text)},
});
__END__
__C__
SV* sol3(SV* a) {
    STRLEN len;
    int i;
    unsigned char * s;
    SV* ret;
    s = SvPV(a,len);
    ret = newSV(4*len);
    for (i=0; i<len; i++,s++) {
        sv_catpvf(ret, "\\%03o", *s);
    };
    return ret;
};


Xho
 
Sisyphus

..
..
Anyway, if you really need the speed, this Inline C code is about 3 times
faster than sol2.

A neat little Inline C routine .... so I saved the code and ran it:

-----------------------------
D:\pscrpt\inline\>cat char2octal.pl
use warnings;
use Inline C => Config =>
BUILD_NOISY => 1;

use Inline C => <<'EOC';

SV* c2o(SV* a) {
STRLEN len;
int i;
unsigned char * s;
SV* ret;
s = SvPV(a,len);
ret = newSV(4*len);
for (i=0; i<len; i++,s++) {
sv_catpvf(ret, "\\%03o", *s);
}
return ret;
}

EOC

print c2o('abcdABCD'), "\n"; #line 22

D:\pscrpt\inline\>perl char2octal.pl
Use of uninitialized value in subroutine entry at char2octal.pl line 22.
\141\142\143\144\101\102\103\104
-----------------------------

I'm sure it's one of those questions that will make me go "Doh!", but I
can't for the life of me see what is causing that "uninitialized" warning.
Any hints ? (I'm running perl 5.8.8 on Win32.)

Cheers,
Rob
 
xhoster

Sisyphus said:
.
.

A neat little Inline C routine .... so I saved the code and ran it:

-----------------------------
D:\pscrpt\inline\>cat char2octal.pl
use warnings;
use Inline C => Config =>
BUILD_NOISY => 1;

use Inline C => <<'EOC';

SV* c2o(SV* a) {
STRLEN len;
int i;
unsigned char * s;
SV* ret;
s = SvPV(a,len);
ret = newSV(4*len);
for (i=0; i<len; i++,s++) {
sv_catpvf(ret, "\\%03o", *s);
}
return ret;
}

EOC

print c2o('abcdABCD'), "\n"; #line 22

D:\pscrpt\inline\>perl char2octal.pl
Use of uninitialized value in subroutine entry at char2octal.pl line 22.
\141\142\143\144\101\102\103\104
-----------------------------

I'm sure it's one of those questions that will make me go "Doh!", but I
can't for the life of me see what is causing that "uninitialized"
warning. Any hints ? (I'm running perl 5.8.8 on Win32.)

Ah, I forgot to turn on warnings and so never saw it.

Apparently sv_catpvf, unlike the .= operator, doesn't care for undefined
values. So make that:

ret = newSV(4*len);
sv_setpv(ret, "");
for (i=0; i<len; i++,s++) {

I guess Inline warnings all get reported as being at subroutine entry?


For what it's worth, I've made another uglier one that is about twice again
as fast. This is going to wrap like crazy:

Xho

SV* sol32(SV* a) {
static const char * cache[]={"\\000","\\001","\\002","\\003","\\004",
"\\005","\\006","\\007","\\010","\\011","\\012","\\013","\\014","\\015",
"\\016","\\017","\\020","\\021","\\022","\\023","\\024","\\025","\\026",
"\\027","\\030","\\031","\\032","\\033","\\034","\\035","\\036","\\037",
"\\040","\\041","\\042","\\043","\\044","\\045","\\046","\\047","\\050",
"\\051","\\052","\\053","\\054","\\055","\\056","\\057","\\060","\\061",
"\\062","\\063","\\064","\\065","\\066","\\067","\\070","\\071","\\072",
"\\073","\\074","\\075","\\076","\\077","\\100","\\101","\\102","\\103",
"\\104","\\105","\\106","\\107","\\110","\\111","\\112","\\113","\\114",
"\\115","\\116","\\117","\\120","\\121","\\122","\\123","\\124","\\125",
"\\126","\\127","\\130","\\131","\\132","\\133","\\134","\\135","\\136",
"\\137","\\140","\\141","\\142","\\143","\\144","\\145","\\146","\\147",
"\\150","\\151","\\152","\\153","\\154","\\155","\\156","\\157","\\160",
"\\161","\\162","\\163","\\164","\\165","\\166","\\167","\\170","\\171",
"\\172","\\173","\\174","\\175","\\176","\\177","\\200","\\201","\\202",
"\\203","\\204","\\205","\\206","\\207","\\210","\\211","\\212","\\213",
"\\214","\\215","\\216","\\217","\\220","\\221","\\222","\\223","\\224",
"\\225","\\226","\\227","\\230","\\231","\\232","\\233","\\234","\\235",
"\\236","\\237","\\240","\\241","\\242","\\243","\\244","\\245","\\246",
"\\247","\\250","\\251","\\252","\\253","\\254","\\255","\\256","\\257",
"\\260","\\261","\\262","\\263","\\264","\\265","\\266","\\267","\\270",
"\\271","\\272","\\273","\\274","\\275","\\276","\\277","\\300","\\301",
"\\302","\\303","\\304","\\305","\\306","\\307","\\310","\\311","\\312",
"\\313","\\314","\\315","\\316","\\317","\\320","\\321","\\322","\\323",
"\\324","\\325","\\326","\\327","\\330","\\331","\\332","\\333","\\334",
"\\335","\\336","\\337","\\340","\\341","\\342","\\343","\\344","\\345",
"\\346","\\347","\\350","\\351","\\352","\\353","\\354","\\355","\\356",
"\\357","\\360","\\361","\\362","\\363","\\364","\\365","\\366","\\367",
"\\370","\\371","\\372","\\373","\\374","\\375","\\376","\\377"};

    STRLEN len;
    int i;
    unsigned char * s;
    SV* ret;
    s = SvPV(a,len);
    ret = newSV(4*len);
    sv_setpv(ret, "");
    for (i=0; i<len; i++,s++) {
        sv_catpv(ret, cache[*s]);
    };
    return ret;
};
 
John W. Krahn

Worky said:
I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.

Control:
$content = 'worky';
return $content;

Solution 1 (in place w/regex):
$content = 'worky';
$content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
return $content

Solution 2 (index into string):
$content = 'worky';
do {
use bytes;
foreach my $idx (0..(length($content)-1)) {
$ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
}
};
return $ret;

Based on a quick cmpthese benchmark, the control is about 16 times
faster than solution 1 and about 9 times faster than solution 2.

Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

Create the translation table first:

my %table = map { chr, sprintf '\%03o', $_ } 0 .. 255;


$content =~ s/(.)/$table{$1}/sg;


foreach my $idx ( 0 .. length( $content ) - 1 ) {
    $ret .= $table{ substr $content, $idx, 1 };
}
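Put together as a runnable script, the table approach might look like the sketch below. The sub name escape_with_table is invented for the example, and the map block is written with explicit parentheses only to make the key/value pairs obvious.

```perl
use strict;
use warnings;

# Precompute the escape for each of the 256 possible byte values once.
my %table = map { ( chr($_) => sprintf('\\%03o', $_) ) } 0 .. 255;

# Regex variant: one hash lookup per character, no sprintf in the loop.
sub escape_with_table {
    my ($content) = @_;
    $content =~ s/(.)/$table{$1}/sg;
    return $content;
}

print escape_with_table('worky'), "\n";   # \167\157\162\153\171
```

The table only covers code points 0 to 255, so this assumes the input is a byte string.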




John
 
Ben Morrow

Quoth (e-mail address removed):
Ah, I forgot to turn on warnings and so never saw it.

Apparently sv_catpvf, unlike .= operator, doesn't care for undefined
values. So make that:

ret = newSV(4*len);
sv_setpv(ret, "");
for (i=0; i<len; i++,s++) {

I guess Inline warnings all get reported as being at subroutine entry?

Yes, as with all warnings thrown inside XS. The currently executing Perl
op is the sub call, so that's what you get: the whole XS sub is run as
part of the sub call op, which then returns rather than jumping to the
start of the sub as it would with Perl.

For what it's worth, I've made another uglier one that is about twice again
as fast. This is going to wrap like crazy:

SV* sol32(SV* a) {
static const char * cache[]={"\\000","\\001","\\002","\\003","\\004",

I wondered about that (in Perl, not in C); to make it a little less ugly
you could have

static char cache[0x100][5]; /* c arrays confuse me :( */

void populate_cache (void) {
    int i;
    for (i=0; i<0x100; i++) {
        /* copy "\ooo" plus the trailing NUL into row i */
        Copy(form("\\%03o", i), cache[i], 5, char);
    }
}
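For the "in Perl, not in C" version of the same cache, a minimal sketch could use an array indexed by byte value and an array slice. The names @cache and escape_with_cache are made up for the example, and no claim is made here about how it benchmarks against the other solutions.

```perl
use strict;
use warnings;

# Index N holds the escape for byte value N.
my @cache = map { sprintf '\\%03o', $_ } 0 .. 255;

sub escape_with_cache {
    my ($bytes) = @_;
    # unpack yields the byte values; the array slice maps them to escapes.
    return join '', @cache[ unpack 'C*', $bytes ];
}

print escape_with_cache("a\0\xFF"), "\n";   # \141\000\377
```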

Ben
 
