Use of hashes and speed - suggestions ?

Smitty

I have a requirement to parse a very large log file, and extract a
variety of data.

One of the things I need to do is build a cross reference map from one
symbolic name to another, and for this I guess I use a hash. The data
for an element of this 'map' can be found within a single line of the
log file that might look something like this

....key....is known as value....

The reason for this hash is that I then need to process later lines in
the file for two different things

First I need to look for a line something like this:
.......Created object 'key' ........

Then later I need to look for a line something like this
.......Processed object 'value'......

Where the 'key' and 'value' are the same as what would be in the map
above, but sometimes the processed lines contain objects which were not
created 'locally' so I need to ignore them. Also, not all created
objects get processed.

The main requirement is to discover at what time I have processed XXX
of the 'created' key objects. So I was imagining I would need another
hash with the key being the value and the value being the key from
above, so I would also have a Xref in the above loop like this.

## process the MAP entries
my %map = ();
my %xref = ();
while(<>)
{
    $_ =~ /...(key).is a ..(value).../;
    ${map{$1}} = $2;
    ${xref{$2}} = $1;
}

## process the 'created' and 'processed' entries
my $counter = 0;
my %created_map = ();
while(<>)
{
    $_ =~ /...Create object (key).../;
    if($1)
    {
        ${created_map{$1}} = ${map{$1}} ;
    } else {

        $_ =~ /...Processed object (value).../;
        if($1)
        {
            ## get the key from the value
            my $key = ${xref{$1}};
            if( ${created_map{$key}} )
            {
                ## if we created it, count it
                $counter ++;
            }
            if( $counter >= XXXX )
            {
                ## do the work regarding the creation of the XXXth object
            }
        }
    }
}


Forgive me if the code above isn't compilable - consider it akin to
pseudocode; compilable code isn't really a requirement for the purpose
of this question.

Now, the question is:
Am I going to pay a performance penalty for all those hash lookups, and
can anyone suggest a better 'perlish' way which could help me achieve
the same results with better performance?
 
it_says_BALLS_on_your forehead

Smitty said:
Now, the question is.
Am I going to pay a performance penalty for all those hash lookups, and

hash lookups are constant time: O(1).
 
Smitty

it_says_BALLS_on_your forehead said:
hash lookups are constant time: O(1).

Hmm... I am not sure how that answers my question, could you explain
please.
 
xhoster

Smitty said:
I have a requirement to parse a very large log file, and extract a
variety of data.

Some people consider 10 Meg to be very large, and some people consider
20 Gig to still be medium.

The main requirement is to discover at what time I have processed XXX
of the 'created' key objects.

How do you know which of the created objects are obscene?

So I was imagining I would need another
hash with the key being the value and the value being the key from
above, so I would also have a Xref in the above loop like this.

## process the MAP entries
my %map = ();
my %xref = ();
while(<>)
{
$_ =~ /...(key).is a ..(value).../;
${map{$1}} = $2;
${xref{$2}} = $1;

Why the extra curlies? $map{$1}=$2 looks much nicer.
}

## process the 'created' and 'processed' entries
my $counter = 0;
my %created_map = ();
while(<>)

I hope you reset the <> iterator somewhere.
{
$_ =~ /...Create object (key).../;
if($1)

Um, no. An unsuccessful match does not undef $1, it leaves it at the
previous value. You need to test the success of the m// operator itself.
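
The point about stale captures can be seen in a minimal sketch like the one below. The sample lines and the pattern are assumptions made up for illustration, since the real log format is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the real log format is not shown in the thread.
my $hit  = "12:00 Created object 'alpha'";
my $miss = "12:01 nothing relevant here";

$hit =~ /Created object '(\w+)'/;       # succeeds and sets $1
my $after_hit = $1;

$miss =~ /Created object '(\w+)'/;      # fails and leaves $1 untouched
my $after_miss = $1;                    # still the value from the first match

# Testing the match operator itself avoids the stale capture.
my $safe = $miss =~ /Created object '(\w+)'/ ? $1 : undef;

print "after hit:  $after_hit\n";
print "after miss: $after_miss\n";
print "safe:       ", defined $safe ? $safe : '(no match)', "\n";
```

Here `$after_miss` still holds 'alpha' even though the second line did not match, which is exactly the case an `if($1)` test would get wrong.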

{
${created_map{$1}} = ${map{$1}} ;

Since you can look up $map{$some_key_from_created_map} at a later time, why
store that value here as well as there? It just wastes memory.
$created_map{$1}=();

} else {

$_ =~ /...Processed object (value).../;
if($1)
{
## get the key from the value
my $key = ${xref{$1}};
if( ${created_map{$key}} )

if( exists ${created_map{$key}} )
{
## if we created it, count it
$counter ++;
}
if( $counter >= XXXX )
{
## do the work regarding the creation of the XXXth
object
}
}
}
}

Nowhere here do you use %map in any meaningful way. So you could get
rid of it entirely.

Forgive me if the code above isn't compilable - consider it akin to
pseudocode, it's not really a requirement for the purpose of this
question

Now, the question is.
Am I going to pay a performance penalty for all those hash lookups,

Yes. Hashes, while quite nice, are not magically instantaneous. Things
will be especially bad if the hashes get so large that they cause swapping.

and
can anyone suggest a better 'perlish' way which could help me achieve
the same results with better performance?

It is hard to get more perlish than hashes.

Xho
 
jack

Some people consider 10 Meg to be very large, and some people consider
20 Gig to still be medium.

This script will be processing about 5 Gig of log files (broken down
into 100M chunks) per day, so I guess that's not insignificant, but
perhaps not very large either.

How do you know which of the created objects are obscene?

Funny. OK, 'some number' as represented by XXX.

Why the extra curlies? $map{$1}=$2 looks much nicer.

It seems to me I read somewhere that this was 'safer' for some reason;
I immediately adopted the syntax, while simultaneously forgetting the
reason why. Is it necessary or not?

I hope you reset the <> iterator somewhere.

Well, actually, the first bit, filling the 'map' hash, will have a
'last' in it somewhere, but I neglected to mention it, since it really
isn't that important to the main issue.

Um, no. An unsuccessful match does not undef $1, it leaves it at the
previous value. You need to test the success of the m// operator itself.

Oh crap. !!!
How many places in my other scripts have I done that !!!!!!!!!

So, the matching returns an array like:
my ($key) = ($_ =~ /...Create object (key).../);

so I test $key ???

or is there a preferred method.

Since you can look up $map{$some_key_from_created_map} at a later time, why
store that value here as well as there? It just wastes memory.
$created_map{$1}=();

I guess you mean store a null in the 'created_map' hash. Yes, good
idea, thanks.

if( exists ${created_map{$key}} )


Nowhere here do you use %map in any meaningful way. So you could get
rid of it entirely.

Not sure I understand what you are saying. I reference %map within the
same loop that the else is a part of.

Could you explain?

It is hard to get more perlish than hashes.

Well, I was wondering about retrieving the list of values from the
hash, rather than creating a separate hash. Does perl return a
reference to the existing values or a new list of values?
 
xhoster

This script will be processing about 5 Gig of log files (broken down
into 100M chunks) per day, so I guess that's not insignificant, but
perhaps not very large either.

So the hashes will only accumulate over the 100M chunks, and will not
span all 5 Gig at one time? In that case, a modern server should be OK.

It seems to me I read somewhere that this was 'safer' for some reason;
I immediately adopted the syntax, while simultaneously forgetting the
reason why. Is it necessary or not?

In this situation it is not necessary. I find it confusing, because I
initially read it as ${$xref{$2}} and was trying to figure out why you
were introducing a useless layer of scalar references. I can't think of
a situation where your usage is necessary, but there may be one.

Oh crap. !!!
How many places in my other scripts have I done that !!!!!!!!!

So, the matching returns an array like:
my ($key) = ($_ =~ /...Create object (key).../);

so I test $key ???

What if $key is '0' or the empty string? You would have to test the
definedness of $key, rather than its truth value.
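
As a rough illustration of that difference, here is a small sketch. The log line and the pattern are hypothetical, chosen so that the object name is the string '0'.

```perl
use strict;
use warnings;

# Hypothetical log line whose object name happens to be the string '0'.
my $line = "Created object '0' at 12:00";

# In list context a failed match returns an empty list, so $key is undef
# on failure and holds the capture on success.
my ($key) = $line =~ /Created object '([^']*)'/;

my $truth_test   = $key         ? 'hit' : 'miss';   # '0' is false in Perl
my $defined_test = defined $key ? 'hit' : 'miss';   # '0' is still defined

# A line that does not match leaves the variable undefined.
my ($none) = "unrelated line" =~ /Created object '([^']*)'/;

print "truth test:   $truth_test\n";
print "defined test: $defined_test\n";
print "no match:     ", defined $none ? 'defined' : 'undef', "\n";
```

The truth test reports a miss for a perfectly valid match, while the `defined` test separates "matched '0'" from "did not match".
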
or is there a preferred method.

My preferred method is

if (/...Create object (key).../) {
    # do something with $1
}

I guess you mean store a null in the 'created_map' hash. Yes, good
idea, thanks


Not sure I understand what you are saying. I reference %map within the
same loop that the else is a part of.

Could you explain ?

As far as I can see, the only place you access %map is to assign its keys'
values to %created_map. But then you never use the values in %created_map,
only the existence of the keys. In that case, there is no need to assign
values to %created_map. In which case, there is no need to have %map in
the first place. If you do use the values of %map (or %created_map) in
some part of the code that was elided for brevity, then you do of course
need %map.
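
To make that concrete, here is one way the whole loop might look with only %xref and an existence-only hash. This is a sketch under assumptions, not the poster's real code: the log lines, the timestamp prefix, the patterns, and the names `%created`, `$threshold` and `$reached_at` are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical log lines; the real format is not shown in the thread.
my @log = (
    "10:00 k1 is known as v1",
    "10:01 k2 is known as v2",
    "10:02 k3 is known as v3",
    "10:03 Created object 'k1'",
    "10:04 Created object 'k2'",
    "10:05 Processed object 'v3'",   # k3 was never created locally: ignored
    "10:06 Processed object 'v1'",
    "10:07 Processed object 'v2'",
);

my $threshold = 2;      # stands in for XXX
my (%xref, %created);   # %xref: value -> key; %created: set of created keys
my $counter = 0;
my $reached_at;

for my $line (@log) {
    if ($line =~ /(\S+) is known as (\S+)/) {
        $xref{$2} = $1;
    }
    elsif ($line =~ /Created object '([^']+)'/) {
        $created{$1} = undef;        # only the key's existence matters
    }
    elsif ($line =~ /^(\S+) .*Processed object '([^']+)'/) {
        my ($time, $value) = ($1, $2);
        my $key = $xref{$value};
        if (defined $key && exists $created{$key}) {
            $counter++;
            if ($counter == $threshold) {
                $reached_at = $time; # time the XXXth local object was processed
                last;
            }
        }
    }
}

print "threshold reached at $reached_at\n" if defined $reached_at;
```

Each branch tests the match itself, so no stale `$1` can leak from one line to the next, and each processed line costs two hash lookups at most.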

Well, I was wondering about retrieving the list of values from the
hash, rather than creating a separate hash. Does perl return a
reference to the existing values or a new list of values?

I'm sorry, I don't understand. By list of values, do you mean a hash
slice? Or a Hash of Arrayrefs? I don't immediately see how your code could
be improved by the use of either one of those. (Well, unless there is not
a one-to-one correspondence between "key" and "value", in which case your
current code is broken so it is not merely a performance issue.) Pretty
much everything in your current code deals in a scalar context, so when you
talk about lists, I assume that refers to some alternative code you have in
mind but haven't shown?

Xho
 
