Efficient field splitting? unpack or substr

ifiaz · Oct 10, 2003

I have data that looks like this, with each record on a single line:

"01 17060757 EG 6880232 N 0131020321 17 060712 l 8828 TR6322
00030070 01 20030317060807749544 060645 244 PA1"

for about 280,000 lines.

The fields are fixed-width. I can't extract them by splitting on a
delimiter because some of the fields may be blank.

I originally wrote an awk script that used substr to extract the fields
from $0, and it took 25.26 seconds to calculate the summary.

Field Splitting in awk, for your info
F1 =substr($0, 1, 2)
TiltTime =substr($0, 4, 8)
....
....

Using the awk-to-perl converter, the same thing in Perl took only 11.03
seconds. (The converted script used substr as well.)

Field Splitting in awk to perl, for your info
$F1 = substr($_, 1, 2);
$TiltTime = substr($_, 4, 8);
....
....

Next, I wrote a Perl script by hand, replacing only the field-splitting
part with unpack. That script takes 21.5 seconds.

Field Splitting in perl using unpack, for your info

($F1, $TiltTime, ...) =
unpack("a2xa8xa2xa3a5xa1xa10xa2xa6xa1xa4xa8xa6xa8xa2xa20xa6xa3xa3",
$_);
....
....

Why is unpack not efficient? Am I doing anything wrong?
Should I stick to substr to do such field splitting in the future?
Can I write it any other way to make it more efficient?

- Fiaz Idris

James Willmore · Oct 10, 2003

On 9 Oct 2003 21:14:40 -0700, ifiaz wrote:

Why is unpack not efficient? Am I doing anything wrong?
Should I stick to substr to do such field splitting in the future?
Can I write it any other way to make it more efficient?

It does not appear that you're doing anything wrong. 'unpack' will
look at the whole line and, well, unpack it. With 'substr', you're
telling the script _exactly_ where to look, so it's not looking at the
whole line.

The question you need to ask yourself is this - do I _need_ to examine
the whole line, or just extract the required data from the line? Use
substr for just pieces of the line, unpack for the whole line.
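
To make the distinction concrete, here is a minimal sketch using a hypothetical, shortened record (only the first three fields of the sample line). Note that Perl's substr counts offsets from 0 by default; the converted script in the original post uses 1-based offsets, which presumably relies on the converter having changed the array base.

```perl
use strict;
use warnings;

# Hypothetical shortened record: three fields separated by single spaces.
my $line = '01 17060757 EG';

# substr: ask only for the field you need (offsets are 0-based).
my $tilt_time = substr($line, 3, 8);

# unpack: one pass over the record, returning every field.
# "x" skips the one-byte separator; whitespace in the template is ignored.
my ($f1, $tilt, $code) = unpack('a2 x a8 x a2', $line);

print "$tilt_time\n";
print "$f1|$tilt|$code\n";
```

If only one or two fields out of many are needed, the substr form avoids building the rest of the list.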

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.


Steve Grazzini · Oct 10, 2003

ifiaz said:
Why is unpack not efficient?

Remember that unpack() has to parse the template every time
through the loop...

Should I stick to substr to do such field splitting in the
future?

That's up to you. I'll just mention that the unpack() version
can be much, much easier to read.

my @fields   = qw(one two three ...);
my $template = 'a4 a12 a3 ...';
my %data;

while (<>) {
    # Hash slice: assign every unpacked field to its name in one step.
    @data{ @fields } = unpack $template, $_;
}
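
One way to push the readability argument a little further is to keep each field name next to its width and derive the template from that table. The sketch below is only an illustration: the layout, the field name Code, and the shortened record are assumptions, and it assumes every field is followed by exactly one separator byte, which is not true of the full template in the original post.

```perl
use strict;
use warnings;

# Hypothetical layout: [ field name => width ], in record order.
my @layout = (
    [ F1       => 2 ],
    [ TiltTime => 8 ],
    [ Code     => 2 ],
);

my @fields = map { $_->[0] } @layout;

# Produces "a2 x a8 x a2": each field, with a one-byte skip between fields.
my $template = join ' x ', map { "a$_->[1]" } @layout;

my $line = '01 17060757 EG';    # hypothetical shortened record
my %data;
@data{@fields} = unpack $template, $line;

print "$_=$data{$_}\n" for @fields;
```

The template is built once, outside any read loop, so the per-line work is still a single unpack call.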

Christopher Hamel · Oct 10, 2003

I have found that unpack is significantly slower as well. I can't say
conclusively why, but my guess is that it's built to do much more than
just extract certain characters from a string the way you appear to be
using it.

Believe it or not, a regex is very fast at this sort of thing if
performance is a major concern.

my $line = 'one two three four';
my ($o,$tw,$th,$f) = $line =~ /^(...).(...).(.....).(....)/;
# or /^(.{3}).(.{3}).(.{5}).(.{4})/

Benchmark this against substr with your data, and I think you'll find
that this is much faster. In past cases where I've looked to do
something similar, the regex has won, except in cases where I've
needed only a small portion of the large string.
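
A comparison like this can be run with the core Benchmark module. The following is a rough sketch, not a definitive measurement: the record is a hypothetical shortened one, the %splitter and %tests names are made up for illustration, and each candidate goes through an extra sub call that adds the same overhead to all three. Relative results will depend on the Perl version, the number of fields, and the real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = '01 17060757 EG';    # hypothetical shortened record

# Three ways to pull the same three fields out of the record.
my %splitter = (
    substr => sub { (substr($_[0], 0, 2), substr($_[0], 3, 8), substr($_[0], 12, 2)) },
    unpack => sub { unpack 'a2 x a8 x a2', $_[0] },
    regex  => sub { $_[0] =~ /^(.{2}).(.{8}).(.{2})/ },
);

# Wrap each splitter so it is called in list context on the same record.
my %tests;
for my $name (keys %splitter) {
    my $code = $splitter{$name};
    $tests{$name} = sub { my @f = $code->($line) };
}

# A negative count means "run each test for at least that many CPU seconds".
cmpthese(-1, \%tests);
```

cmpthese prints a table of rates and relative percentages, which is easier to compare than wall-clock timings of whole scripts.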

Keith Keller · Oct 10, 2003

The fields are fixed-widths. You can't extract it using delimiters as
some of the fields may be blank.

If your delimiters are spaces, sure. If you are able to generate the
file using a different delimiter (tab is a common one) then maybe
splitting on the delimiter will be easier. Your only concern would be
to pick a character that you know for certain doesn't appear in any of
your data fields.
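
If the file could be regenerated with tabs, a sketch of the split approach might look like this. The record below is hypothetical. One detail matters when fields may be blank: by default split discards trailing empty fields, so a negative limit is used to keep them.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited record whose second and last fields are blank.
my $line = "01\t\tEG\t6880232\t";

# A negative limit keeps trailing empty fields; the default would drop them.
my @fields = split /\t/, $line, -1;

print scalar(@fields), " fields\n";
```

When reading from a file, chomp each line first so the newline does not end up in the last field.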

--keith


ifiaz · Oct 10, 2003

I did try the regex as you suggested, but in fact it is slower than
substr.

I forget the exact time, but it was about 21 seconds (certainly
more than 20 seconds). Since I am at home now for the
weekend, I can't check the exact figure.

Thanks to all of you. If you have any further input on this
you are most certainly welcome.

ifiaz · Oct 13, 2003

Sorry for posting my reply in a completely new thread with the subject
"Re: Efficient field splitting? unpack or substr or regex".
That was an error, so I am repeating the reply below.

I did try the regex as you suggested, but in fact it is slower than
substr. It took 23.49 seconds.

Field splitting in Perl using a regex, for your info:
($F1, $TiltTime, ....) = $_ =~ /(.{2}) (.{8}) (.{2}) ..../;

Thanks to all of you. If you have any further input on this,
you are most certainly welcome.

And for you, Keith: the delimiters are spaces, and that can't be
changed, at least for now.
