Efficient field splitting? unpack or substr

ifiaz

I have data that looks like this, all on a single line:

"01 17060757 EG 6880232 N 0131020321 17 060712 l 8828 TR6322
00030070 01 20030317060807749544 060645 244 PA1"

There are about 280,000 such lines.

The fields are fixed-width. They can't be extracted using delimiters,
as some of the fields may be blank.

I originally wrote an awk script that used substr to extract the fields
from $0, and it took 25.26 seconds to calculate the summary.

Field Splitting in awk, for your info
F1 =substr($0, 1, 2)
TiltTime =substr($0, 4, 8)
....
....

Using the awk-to-perl converter, the same thing in perl took only 11.03
seconds (the converted script used substr as well).

Field splitting in the converted perl script, for your info
$F1 = substr($_, 1, 2);
$TiltTime = substr($_, 4, 8);
....
....

I then wrote a perl script in which only the field-splitting part was
replaced with unpack. That script takes 21.5 seconds.

Field splitting in perl using unpack, for your info

($F1, $TiltTime, ...) =
    unpack("a2xa8xa2xa3a5xa1xa10xa2xa6xa1xa4xa8xa6xa8xa2xa20xa6xa3xa3", $_);
....
....

Why is unpack not efficient? Am I doing anything wrong?
Should I stick to substr to do such field splitting in the future?
Can I write it any other way to make it more efficient?


- Fiaz Idris
 
James Willmore

On 9 Oct 2003 21:14:40 -0700, ifiaz wrote:
> Why is unpack not efficient? Am I doing anything wrong?
> Should I stick to substr to do such field splitting in the future?
> Can I write it any other way to make it more efficient.

It does not appear that you're doing anything wrong. 'unpack' will
look at the whole line and, well, unpack it :) With 'substr', you're
telling the script _exactly_ where to look, so it's not looking at the
whole line.

The question you need to ask yourself is this - do I _need_ to examine
the whole line, or just extract the required data from the line? Use
substr for just pieces of the line, unpack for the whole line.
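
To make the distinction concrete, here is a minimal sketch on a
hypothetical record made of the first three fields of the sample line
(the variable names are illustrative, not from the original script).
Note that Perl's substr counts offsets from 0 by default.

```perl
use strict;
use warnings;

# Hypothetical record: the first three fields of the sample line.
my $line = "01 17060757 EG";

# One field only: substr with a 0-based offset and a length.
my $tilt_time = substr($line, 3, 8);

# All fields at once: "aN" takes N bytes, "x" skips one byte.
my ($f1, $time, $code) = unpack("a2 x a8 x a2", $line);

print "$tilt_time\n";          # 17060757
print "$f1|$time|$code\n";     # 01|17060757|EG
```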

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
Never hit a man with glasses. Hit him with a baseball bat.
 
Steve Grazzini

ifiaz said:
> Why is unpack not efficient?

Remember that unpack() has to parse the template every time
through the loop...

> Should I stick to substr to do such field splitting in the
> future?

That's up to you. I'll just mention that the unpack() version
can be much, much easier to read.

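my @fields   = qw(one two three ...);   # field names, in record order
my $template = qq(a4 a12 a3 ...);       # matching unpack widths
my %data;

while (<>) {
    # hash slice: assign every field by name in one step
    @data{ @fields } = unpack $template, $_;
}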
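
A runnable version of the same idea might look like the sketch below.
The field names, widths and the two sample records are assumptions made
up for illustration; only the hash-slice technique comes from the post
above.

```perl
use strict;
use warnings;

# Hypothetical names and widths for the first three fields.
my @fields   = qw(F1 TiltTime Code);
my $template = "a2 x a8 x a2";

my @records;
for my $line ("01 17060757 EG", "02 17060812   ") {
    my %data;
    # Hash slice: each unpacked value lands under its field name.
    @data{@fields} = unpack $template, $line;
    push @records, \%data;
}

print "$records[0]{TiltTime}\n";    # 17060757
print "[$records[1]{Code}]\n";      # blank field: prints two spaces in brackets
```

The "a" format keeps the spaces of a blank field; "A" would strip
trailing spaces instead, which may or may not be what the summary needs.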
 
Christopher Hamel

ifiaz wrote:
> I have a data that looks like this in a single line.
>
> "01 17060757 EG 6880232 N 0131020321 17 060712 l 8828 TR6322
> 00030070 01 20030317060807749544 060645 244 PA1"
> . . .
> Using awk to perl converter, the same thing in perl took only 11.03
> seconds.
> (awk to perl used substr as well)
>
> Field Splitting in awk to perl, for your info
> $F1 = substr($_, 1, 2);
> $TiltTime = substr($_, 4, 8);
> ....
> ....
>
> Now, I wrote a perl script, but only replaced the field splitting part
> with unpack. Now, the script takes 21.5 seconds.
>
> Field Splitting in perl using unpack, for your info
>
> ($F1, $TiltTime, ...) =
> unpack("a2xa8xa2xa3a5xa1xa10xa2xa6xa1xa4xa8xa6xa8xa2xa20xa6xa3xa3",
> $_);
> ....
> ....
>
> Why is unpack not efficient? Am I doing anything wrong?
> Should I stick to substr to do such field splitting in the future?
> Can I write it any other way to make it more efficient.
>
> - Fiaz Idris

I have found that unpack is significantly slower as well. I can't say
conclusively why, but my guess is that it's built to do much more than
just extract certain characters from a string the way you appear to be
using it.

Believe it or not, a regex is very fast at this sort of thing if
performance is a major concern.

my $string = 'one two three four';
my ($o,$tw,$th,$f) = $string =~ /^(...).(...).(.....).(....)/;
# or /^(.{3}).(.{3}).(.{5}).(.{4})/

Benchmark this against substr with your data, and I think you'll find
that this is much faster. In past cases where I've looked to do
something similar, the regex has won, except in cases where I've
needed only a small portion of the large string.
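
One way to run that comparison is with the core Benchmark module. The
sketch below is only an outline under stated assumptions: the record is
a hypothetical three-field line, not the real 19-field layout, and the
ranking it prints will depend on the perl version and on the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical record with three fixed-width fields (2, 8 and 2 bytes).
my $line = "01 17060757 EG";
my $re   = qr/^(.{2}) (.{8}) (.{2})/;

my %split = (
    substr => sub {
        my @f = (substr($line, 0, 2), substr($line, 3, 8), substr($line, 12, 2));
        return @f;
    },
    unpack => sub { my @f = unpack "a2 x a8 x a2", $line; return @f },
    regex  => sub { my @f = $line =~ $re;                 return @f },
);

# Sanity check: all three must agree before the timings mean anything.
my %got;
$got{$_} = join "|", $split{$_}->() for keys %split;
die "mismatch" unless $got{substr} eq $got{unpack} && $got{unpack} eq $got{regex};

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, \%split);
```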
 
Keith Keller

ifiaz wrote:
> The fields are fixed-widths. You can't extract it using delimiters as
> some of the fields may be blank.

If your delimiters are spaces, sure. If you are able to generate the
file using a different delimiter (tab is a common one) then maybe
splitting on the delimiter will be easier. Your only concern would be
to pick a character that you know for certain doesn't appear in any of
your data fields.
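
If the file could be regenerated with tabs, a split along these lines
would keep blank fields intact. The record below is hypothetical; the
one detail worth noting is the -1 limit, without which split drops
trailing empty fields.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited record; the third and the last fields are empty.
my $line = "01\t17060757\t\tEG\t";

# The -1 limit keeps trailing empty fields, which split would otherwise drop.
my @fields = split /\t/, $line, -1;

print scalar(@fields), "\n";    # 5
print "[$fields[2]]\n";         # []
```

When reading from a file, chomp the line first so the newline does not
end up in the last field.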

--keith

--
(e-mail address removed)-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom
 
ifiaz

Christopher Hamel wrote:
> I have found that unpack is significantly slower as well. I can't say
> conclusively why, but my guess is that it's built to do much more than
> just extract certain characters from a string the way you appear to be
> using it.
>
> Believe it or not, a regex is very fast at this sort of thing if
> performance is a major concern.
>
> my $string = 'one two three four';
> my ($o,$tw,$th,$f) = $string =~ /^(...).(...).(.....).(....)/;
> # or /^(.{3}).(.{3}).(.{5}).(.{4})/
>
> Benchmark this against substr with your data, and I think you'll find
> that this is much faster. In past cases where I've looked to do
> something similar, the regex has won, except in cases where I've
> needed only a small portion of the large string.

I did try the regex as you suggested, but in fact it is slower than
substr.

I forget exactly how long it took; it was about 21 seconds (certainly
more than 20). Since I am at home for the weekend, I can't check the
exact figure right now.

Thanks to all of you. Any further input on this is most certainly
welcome.
 
ifiaz

Christopher Hamel wrote:
> I have found that unpack is significantly slower as well. I can't say
> conclusively why, but my guess is that it's built to do much more than
> just extract certain characters from a string the way you appear to be
> using it.
>
> Believe it or not, a regex is very fast at this sort of thing if
> performance is a major concern.
>
> my $string = 'one two three four';
> my ($o,$tw,$th,$f) = $string =~ /^(...).(...).(.....).(....)/;
> # or /^(.{3}).(.{3}).(.{5}).(.{4})/
>
> Benchmark this against substr with your data, and I think you'll find
> that this is much faster. In past cases where I've looked to do
> something similar, the regex has won, except in cases where I've
> needed only a small portion of the large string.

Sorry for posting the reply in a completely new thread with the subject
"Re: Efficient field splitting? unpack or substr or regex".
That was an error, so I repeat the reply below.

I did try the regex as you suggested, but in fact it is slower than
substr. It took 23.49 seconds.

Field splitting in perl using regex, for your info
($F1, $TiltTime, ....) = $_ =~ /(.{2}) (.{8}) (.{2}) ..../;

Thanks to all of you. Any further input on this is most certainly
welcome.

And, for you Keith: the delimiters are spaces, and that can't be
changed, at least for now.
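
For anyone repeating the comparison, one option is to keep the field
widths in a single list and derive both the unpack template and a
precompiled regex from it, so the two variants cannot drift apart. This
is only a sketch: the three widths are hypothetical stand-ins for the
real layout, and it assumes every field is followed by exactly one
separator byte.

```perl
use strict;
use warnings;

# Hypothetical widths for the first three space-separated fields.
my @widths = (2, 8, 2);

# Builds "a2 x a8 x a2": take N bytes, then skip the one-byte separator.
my $template = join " x ", map { "a$_" } @widths;

# Builds /^(.{2}) (.{8}) (.{2})/, compiled once outside the read loop.
my $pattern = join " ", map { "(.{$_})" } @widths;
my $re      = qr/^$pattern/;

my $line      = "01 17060757 EG";
my @by_unpack = unpack $template, $line;
my @by_regex  = $line =~ $re;

print "@by_unpack\n";    # 01 17060757 EG
print "@by_regex\n";     # 01 17060757 EG
```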
 