Efficiently convert characters to octal representation

Worky Workerson · Jul 28, 2006

I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.

Control:
$content = 'worky';
return $content;

Solution 1 (in place w/regex):
$content = 'worky';
$content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
return $content;

Solution 2 (index into string):
$content = 'worky';
do {
    use bytes;
    foreach my $idx (0..(length($content)-1)) {
        $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
    }
};
return $ret;

Based on a quick cmpthese benchmark, the control is about 16 times
faster than solution 1 and about 9 times faster than solution 2.

Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

Thanks!
-Worky

Worky Workerson · Jul 28, 2006

After tinkering for a while, my best solution is now:
$content = 'worky';
return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));

Anyone got anything better?

Thanks!
-Worky

DJ Stunks · Jul 28, 2006

Worky said:
I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.

Control:
$content = 'worky';
return $content;

Solution 1 (in place w/regex):
$content = 'worky';
$content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
return $content

Solution 2 (index into string):
$content = 'worky';
do {
use bytes;
foreach my $idx (0..(length($content)-1)) {
$ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
}
};
return $ret;

Based on a quick cmpthese benchmark, the control is about 16 times
faster than solution 1 and about 9 times faster than solution 2.

Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

if you have the benchmark set up, try this sub:

#!/usr/bin/perl

use strict;
use warnings;

my $string = 'worky';
print convert_to_octal($string);

sub convert_to_octal {
    my ($string) = @_;

    return map { sprintf '\\%03o', ord $_ }
           split //, $string;
}

__END__

-jp

Ben Morrow · Jul 28, 2006

Quoth "Worky Workerson":

[please quote some context when you reply. Have you read the Posting
Guidelines?]

[converting a string into octal escapes]

After tinkering for a while, my best solution is now:
$content = 'worky'
return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));

Anyone got anything better?

Here's a couple more, in the spirit of TMTOWTDI:

#!/usr/bin/perl

use warnings;
use strict;
use Math::BaseCnv;
use Benchmark qw/cmpthese/;

$\ = "\n";
my $w = 'worky';

# %03o zero-pads, so bytes below 0100 still produce three octal digits
my %subs = (
    regex  => sub { (my $x = $w) =~ s/(.)/sprintf '\\%03o', ord $1/egs; $x; },
    substr => sub {
        use bytes;
        my $x;
        for (0..(length $w) - 1) {
            $x .= sprintf '\\%03o', ord substr $w, $_, 1;
        }
        return $x;
    },
    unpack => sub { join '', map sprintf('\\%03o', $_), unpack 'C*', $w },
    split  => sub {
        use bytes;
        join '', map sprintf('\\%03o', ord), split //, $w
    },
    cnv    => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
);

for (keys %subs) {
    print "$_ => " . $subs{$_}->();
}

cmpthese -3, \%subs;

__END__

This gives (on my machine)

cnv => \167\157\162\153\171
regex => \167\157\162\153\171
unpack => \167\157\162\153\171
substr => \167\157\162\153\171
split => \167\157\162\153\171
          Rate    cnv  regex  split unpack substr
cnv     5618/s     --   -87%   -87%   -92%   -92%
regex  42639/s   659%     --    -0%   -39%   -39%
split  42712/s   660%     0%     --   -39%   -39%
unpack 69589/s  1139%    63%    63%     --    -1%
substr 70257/s  1151%    65%    64%     1%     --

This is usually true: substr > unpack > split > regex. The reason is
that Perl ops are so much slower than C.

However, I am hard-pressed to think of a situation where it's worth
writing anything other than 'regex' above, as clarity is almost always
more important than speed.

Also note that the 'regex' solution would need a 'use bytes' to be
strictly compatible with the others. I'm not sure why you think you need
it: if you've read your data from a binmode :raw filehandle it's binary
anyway; otherwise you want to encode it with Encode into a suitable
encoding.
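To make the byte/character distinction concrete, here is a minimal sketch that is not from the thread. The sample string and the helper name to_octal are made up for illustration, and it assumes the core Encode module. It shows a character string being encoded to UTF-8 bytes first, so that "one escape per byte" is well defined:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: one \ooo escape per byte of an already-binary string.
sub to_octal {
    my ($bytes) = @_;
    return join '', map { sprintf '\\%03o', $_ } unpack 'C*', $bytes;
}

# A character string containing U+00E9 and U+20AC (illustrative data only).
my $chars = "\x{e9}\x{20ac}";

# Encode to UTF-8 first, so every element really is a byte in 0..255.
my $bytes = encode('UTF-8', $chars);

print length($chars), " characters, ", length($bytes), " bytes\n";
print to_octal($bytes), "\n";    # prints \303\251\342\202\254
```

Data read from a binmode :raw filehandle is already in the "bytes" state, so the encode step would not be needed there.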

Ben

Worky Workerson · Jul 28, 2006

sub convert_to_octal {

my ($string) = @_;

return map { sprintf '\\%03o', ord $_ }
split //, $string;

}

It's about 25% slower than my "best" solution listed previously, which
was basically the same thing with unpack instead of split. Also, since
the data might be binary, I'm worried about the split // ... isn't that
a character split (vs a binary split)?

Worky Workerson · Jul 28, 2006

Ben said:
[converting a string into octal escapes]

my %subs = (
regex => sub { (my $x = $w) =~ s/(.)/sprintf '\\%3o', ord $1/egs; $x; },
substr => sub {
use bytes;
my $x;
for (0..(length $w) - 1) {
$x .= sprintf '\\%3o', ord substr $w, $_, 1;
}
return $x;
},
unpack => sub { join '', map sprintf('\\%3o', $_), unpack 'C*', $w },
split => sub {
use bytes;
join '', map sprintf('\\%3o', ord), split //, $w
},
cnv => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
);

However, I am hard-pressed to think of a situation where it's worth
writing anything other than 'regex' above, as clarity is almost always
more important than speed.

I'm doing database ETL and transforming 300GB of CSV into something the
database likes to load. According to DProf, this was my biggest
slacker by far, partly because it is called so often. Every little bit
of speed helps

Also note that the 'regex' solution would need a 'use bytes' to be
strictly compatible with the others. I'm not sure why you think you need
it: if you've read your data from a binmode :raw filehandle it's binary
anyway; otherwise you want to encode it with Encode into a suitable
encoding.

I guess I'm still a little fuzzy on the whole perl/binary thing. I'm
reading in CSV where most of the columns are ASCII but I'm not sure
what sort of data will be stored in one of the columns. I am declaring
binmode on the filehandle ... do I still need the 'use bytes' on the
substr approach?

Ben Morrow · Jul 28, 2006

Worky Workerson said:
Ben said:

[converting a string into octal escapes]

However, I am hard-pressed to think of a situation where it's worth
writing anything other than 'regex' above, as clarity is almost always
more important than speed.

I'm doing database ETL and transforming 300GB of CSV into something the
database likes to load. According to DProf, this was my biggest
slacker by far, partly because it is called so often. Every little bit
of speed helps

Fair enough. A lot of people seem to come here saying 'I want to do

I guess I'm still a little fuzzy on the whole perl/binary thing.

Yeah, it's kinda complicated. It's made harder by the fact that Perl has
to be backwards-compatible, so a lot of the time just fudging things
seems to work...

I'm reading in CSV where most of the columns are ASCII but I'm not
sure what sort of data will be stored in one of the columns. I am
declaring binmode on the filehandle ... do I still need the 'use
bytes' on the substr approach?

If you are reading from a binary filehandle, then the data is all 8bit
(as opposed to wider than that) anyway, so you don't. You may get a
slight speed benefit by declaring 'use bytes' at the top of the script.

xhoster · Jul 28, 2006

Worky Workerson said:
I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.
....
Solution 2 (index into string):
$content = 'worky';
do {
use bytes;
foreach my $idx (0..(length($content)-1)) {
$ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
}
};
return $ret; ....
Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

This seems like a pretty strange thing to need to optimize. How many
times do you need to do this operation on a 5 character fixed string?
If you don't need to do it on a 5 character fixed string, then your
benchmark should incorporate realistic sizes and with more realistic
methods for obtaining the non-fixed thing you want to operate on.
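As a rough illustration of that point (not code from the thread), a benchmark along these lines uses a larger binary input instead of the fixed 5-character string. The 4 KiB size, the random data, and the sub names via_regex and via_unpack are assumptions chosen for the sketch; real input would come from the actual CSV fields:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed workload: 4 KiB of pseudo-random bytes standing in for one field.
my $data = join '', map { chr(int(rand(256))) } 1 .. 4096;

# Regex substitution on a copy; /s lets . match newlines too.
sub via_regex {
    (my $x = $_[0]) =~ s/(.)/sprintf '\\%03o', ord $1/egs;
    return $x;
}

# unpack to a list of byte values, format each, then join.
sub via_unpack {
    return join '', map { sprintf '\\%03o', $_ } unpack 'C*', $_[0];
}

# Both candidates must agree before their speeds are worth comparing.
die "outputs differ" unless via_regex($data) eq via_unpack($data);

# Run each candidate for at least one CPU second and print the comparison.
cmpthese(-1, {
    regex  => sub { via_regex($data) },
    unpack => sub { via_unpack($data) },
});
```

The rates will differ from machine to machine, so no expected output is shown.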

Anyway, if you really need the speed, this Inline C code is about 3 times
faster than sol2.

              Rate     sol1     sol2     sol3 control2
sol1       70274/s       --     -45%     -82%     -94%
sol2      127219/s      81%       --     -67%     -89%
sol3      385820/s     449%     203%       --     -66%
control2 1122504/s    1497%     782%     191%       --

Benchmark::cmpthese(-5, {
    'control' => sub {control($text)},
    'sol1' => sub {sol1($text)},
    'sol2' => sub {sol2($text)},
    'sol3' => sub {sol3($text)},
});
__END__
__C__
SV* sol3(SV* a) {
    STRLEN len;
    int i;
    unsigned char * s;
    SV* ret;
    s = SvPV(a,len);
    ret = newSV(4*len);
    for (i=0; i<len; i++,s++) {
        sv_catpvf(ret, "\\%03o", *s);
    }
    return ret;
}

Xho

Sisyphus · Jul 28, 2006

..
..

Anyway, if you really need the speed, this Inline C code is about 3 times
faster than sol2.

A neat little Inline C routine .... so I saved the code and ran it:

-----------------------------
D:\pscrpt\inline\>cat char2octal.pl
use warnings;
use Inline C => Config =>
BUILD_NOISY => 1;

use Inline C => <<'EOC';

SV* c2o(SV* a) {
STRLEN len;
int i;
unsigned char * s;
SV* ret;
s = SvPV(a,len);
ret = newSV(4*len);
for (i=0; i<len; i++,s++) {
sv_catpvf(ret, "\\%03o", *s);
}
return ret;
}

EOC

print c2o('abcdABCD'), "\n"; #line 22

D:\pscrpt\inline\>perl char2octal.pl
Use of uninitialized value in subroutine entry at char2octal.pl line 22.
\141\142\143\144\101\102\103\104
-----------------------------

I'm sure it's one of those questions that will make me go "Doh!", but I
can't for the life of me see what is causing that "uninitialized" warning.
Any hints ? (I'm running perl 5.8.8 on Win32.)

Cheers,
Rob

xhoster · Jul 28, 2006

Sisyphus said:
.
.

A neat little Inline C routine .... so I saved the code and ran it:

-----------------------------
D:\pscrpt\inline\>cat char2octal.pl
use warnings;
use Inline C => Config =>
BUILD_NOISY => 1;

use Inline C => <<'EOC';

SV* c2o(SV* a) {
STRLEN len;
int i;
unsigned char * s;
SV* ret;
s = SvPV(a,len);
ret = newSV(4*len);
for (i=0; i<len; i++,s++) {
sv_catpvf(ret, "\\%03o", *s);
}
return ret;
}

EOC

print c2o('abcdABCD'), "\n"; #line 22

D:\pscrpt\inline\>perl char2octal.pl
Use of uninitialized value in subroutine entry at char2octal.pl line 22.
\141\142\143\144\101\102\103\104
-----------------------------

I'm sure it's one of those questions that will make me go "Doh!", but I
can't for the life of me see what is causing that "uninitialized"
warning. Any hints ? (I'm running perl 5.8.8 on Win32.)

Ah, I forgot to turn on warnings and so never saw it.

Apparently sv_catpvf, unlike .= operator, doesn't care for undefined
values. So make that:

ret = newSV(4*len);
sv_setpv(ret, "");
for (i=0; i<len; i++,s++) {

I guess Inline warnings all get reported as being at subroutine entry?

For what it's worth, I've made another uglier one that is about twice again
as fast. This is going to wrap like crazy:

Xho

SV* sol32(SV* a) {
static const char * cache[]={"\\000","\\001","\\002","\\003","\\004",
"\\005","\\006","\\007","\\010","\\011","\\012","\\013","\\014","\\015",
"\\016","\\017","\\020","\\021","\\022","\\023","\\024","\\025","\\026",
"\\027","\\030","\\031","\\032","\\033","\\034","\\035","\\036","\\037",
"\\040","\\041","\\042","\\043","\\044","\\045","\\046","\\047","\\050",
"\\051","\\052","\\053","\\054","\\055","\\056","\\057","\\060","\\061",
"\\062","\\063","\\064","\\065","\\066","\\067","\\070","\\071","\\072",
"\\073","\\074","\\075","\\076","\\077","\\100","\\101","\\102","\\103",
"\\104","\\105","\\106","\\107","\\110","\\111","\\112","\\113","\\114",
"\\115","\\116","\\117","\\120","\\121","\\122","\\123","\\124","\\125",
"\\126","\\127","\\130","\\131","\\132","\\133","\\134","\\135","\\136",
"\\137","\\140","\\141","\\142","\\143","\\144","\\145","\\146","\\147",
"\\150","\\151","\\152","\\153","\\154","\\155","\\156","\\157","\\160",
"\\161","\\162","\\163","\\164","\\165","\\166","\\167","\\170","\\171",
"\\172","\\173","\\174","\\175","\\176","\\177","\\200","\\201","\\202",
"\\203","\\204","\\205","\\206","\\207","\\210","\\211","\\212","\\213",
"\\214","\\215","\\216","\\217","\\220","\\221","\\222","\\223","\\224",
"\\225","\\226","\\227","\\230","\\231","\\232","\\233","\\234","\\235",
"\\236","\\237","\\240","\\241","\\242","\\243","\\244","\\245","\\246",
"\\247","\\250","\\251","\\252","\\253","\\254","\\255","\\256","\\257",
"\\260","\\261","\\262","\\263","\\264","\\265","\\266","\\267","\\270",
"\\271","\\272","\\273","\\274","\\275","\\276","\\277","\\300","\\301",
"\\302","\\303","\\304","\\305","\\306","\\307","\\310","\\311","\\312",
"\\313","\\314","\\315","\\316","\\317","\\320","\\321","\\322","\\323",
"\\324","\\325","\\326","\\327","\\330","\\331","\\332","\\333","\\334",
"\\335","\\336","\\337","\\340","\\341","\\342","\\343","\\344","\\345",
"\\346","\\347","\\350","\\351","\\352","\\353","\\354","\\355","\\356",
"\\357","\\360","\\361","\\362","\\363","\\364","\\365","\\366","\\367",
"\\370","\\371","\\372","\\373","\\374","\\375","\\376","\\377"};

    STRLEN len;
    int i;
    unsigned char * s;
    SV* ret;
    s = SvPV(a,len);
    ret = newSV(4*len);
    sv_setpv(ret, "");
    for (i=0; i<len; i++,s++) {
        sv_catpv(ret, cache[*s]);
    }
    return ret;
}

John W. Krahn · Jul 28, 2006

Worky said:
I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.

Control:
$content = 'worky';
return $content;

Solution 1 (in place w/regex):
$content = 'worky';
$content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
return $content

Solution 2 (index into string):
$content = 'worky';
do {
use bytes;
foreach my $idx (0..(length($content)-1)) {
$ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
}
};
return $ret;

Based on a quick cmpthese benchmark, the control is about 16 times
faster than solution 1 and about 9 times faster than solution 2.

Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

Create the translation table first:

my %table = map { chr, sprintf '\%03o', $_ } 0 .. 255;

$content =~ s/(.)/$table{$1}/sg;

foreach my $idx ( 0 .. length( $content ) - 1 ) {
$ret .= $table{ substr $content, $idx, 1 };
}
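A variation on the same idea, offered only as a sketch: because byte values are small integers, the table can be a plain array indexed by the output of unpack, which avoids both the regex and the per-character hash lookup. The names @octal and to_octal_table are made up here, and the input is assumed to hold bytes (values 0..255):

```perl
use strict;
use warnings;

# Precompute the escape for every possible byte value once, up front.
my @octal = map { sprintf '\\%03o', $_ } 0 .. 255;

# Hypothetical helper name; assumes $_[0] holds bytes (values 0..255).
sub to_octal_table {
    # unpack yields the byte values, which index straight into the table.
    return join '', @octal[ unpack 'C*', $_[0] ];
}

print to_octal_table('worky'), "\n";    # prints \167\157\162\153\171
```

Whether this beats the hash version would need to be measured with the same cmpthese setup used elsewhere in the thread.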

John

Ben Morrow · Jul 28, 2006

Quoth xhoster:

Ah, I forgot to turn on warnings and so never saw it.

Apparently sv_catpvf, unlike .= operator, doesn't care for undefined
values. So make that:

ret = newSV(4*len);
sv_setpv(ret, "");
for (i=0; i<len; i++,s++) {

I guess Inline warnings all get reported as being at subroutine entry?

Yes, as with all warnings thrown inside XS. The currently executing Perl
op is the sub call, so that's what you get: the whole XS sub is run as
part of the sub call op, which then returns rather than jumping to the
start of the sub as it would with Perl.

For what it's worth, I've made another uglier one that is about twice again
as fast. This is going to wrap like crazy:

SV* sol32(SV* a) {
static const char * cache[]={"\\000","\\001","\\002","\\003","\\004",

I wondered about that (in Perl, not in C); to make it a little less ugly
you could have

static char cache[0x100][5]; /* c arrays confuse me */

void populate_cache (void) {
    int i;
    for (i=0; i<0x100; i++) {
        /* "\ooo" plus the trailing NUL is 5 chars; fill row i, not the whole array */
        Copy(form("\\%03o", i), cache[i], 5, char);
    }
}

Ben
