Combining multiple hash references into one hash reference

Arvin Portlock

I know lots of ways to combine multiple hashes into a single
hash but I'm very concerned about memory and copy by value.
I'm processing some XML documents and have several thousand
elements that must be linked to relatively few hashes. These
hashes have unique keys among them so I don't have to worry
about one hash element overwriting another with the same
key. The following works as it should:

my $hash1 = {
'key1' => 'Value 1',
'key2' => 'Value 2',
'key3' => 'Value 3',
};

my $hash2 = {
'key4' => 'Value 4',
'key5' => 'Value 5',
'key6' => 'Value 6',
};

my %newhash = (%$hash1, %$hash2);

# The following do not work:
# my $newhash = { $hash1, $hash2 };
# my $newhash = [ $hash1, $hash2 ];
# my %newhash = ( $hash1, $hash2 );

foreach my $key (keys %newhash) {
print "$key: $newhash{$key}\n";
}

But I'm concerned I'm creating copies of each of
these elements for all of the thousands of instances
of %newhash I will be creating. Is there a faster and
memory efficient way to do this?

Thanks!

Arvin
 
Paul Lalli

Arvin said:
The following works as it should:

my $hash1 = {
'key1' => 'Value 1',
'key2' => 'Value 2',
'key3' => 'Value 3',
};

my $hash2 = {
'key4' => 'Value 4',
'key5' => 'Value 5',
'key6' => 'Value 6',
};

my %newhash = (%$hash1, %$hash2);

# The following do not work:
# my $newhash = { $hash1, $hash2 };
# my $newhash = [ $hash1, $hash2 ];
# my %newhash = ( $hash1, $hash2 );

I'm not entirely sure what you're going for, but I think this is the
syntax you're looking for:

my $newhash = { %$hash1, %$hash2 };
foreach my $key (keys %$newhash) {
print "$key: $newhash->{$key}\n";
}

But I'm concerned I'm creating copies of each of
these elements for all of the thousands of instances
of %newhash I will be creating. Is there a faster and
memory efficient way to do this?

If you're really just wanting to loop through all the "inner" hashes,
why not just build an array of the references you already have?

my @hashes = ($hash1, $hash2);
foreach my $hash (@hashes){
foreach my $key (keys %$hash){
print "$key: $hash->{$key}\n";
}
}
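
If values also need to be fetched by key name, the same array of references can be searched without copying anything. This is only a sketch: `find_value` is a hypothetical helper, not something from the thread, and it assumes keys are unique across the hashes, as Arvin stated.

```perl
use strict;
use warnings;

my $hash1 = { key1 => 'Value 1', key2 => 'Value 2' };
my $hash2 = { key4 => 'Value 4', key5 => 'Value 5' };

# Hypothetical helper: return the value for $key from the first hash
# in @$hashes that has it, or undef if none does.
sub find_value {
    my ($hashes, $key) = @_;
    for my $hash (@$hashes) {
        return $hash->{$key} if exists $hash->{$key};
    }
    return;
}

# Only the two references are stored; no keys or values are copied.
my @hashes = ($hash1, $hash2);

my $found   = find_value(\@hashes, 'key5');
my $missing = find_value(\@hashes, 'nope');

print "key5: $found\n";
print "nope: ", (defined $missing ? $missing : '(not found)'), "\n";
```

The lookup cost grows with the number of hashes in the list, which should be small here (a handful of ids per element).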

Paul Lalli
 
xhoster

Arvin Portlock said:
I know lots of ways to combine multiple hashes into a single
hash but I'm very concerned about memory and copy by value.
I'm processing some XML documents and have several thousand
elements that must be linked to relatively few hashes. These
hashes have unique keys among them so I don't have to worry
about one hash element overwriting another with the same
key. The following works as it should:

my $hash1 = {
'key1' => 'Value 1',
'key2' => 'Value 2',
'key3' => 'Value 3',
};

my $hash2 = {
'key4' => 'Value 4',
'key5' => 'Value 5',
'key6' => 'Value 6',
};

my %newhash = (%$hash1, %$hash2);

# The following do not work:
# my $newhash = { $hash1, $hash2 };
# my $newhash = [ $hash1, $hash2 ];
# my %newhash = ( $hash1, $hash2 );

All of those work. They do exactly what they should do, even if that is
not what you want them to do.
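
To make that concrete, here is a small sketch of what the three commented-out forms actually build. The variable names `$as_hashref`, `$as_arrayref` and `%as_hash` are made up for illustration.

```perl
use strict;
use warnings;

my $hash1 = { key1 => 'Value 1' };
my $hash2 = { key4 => 'Value 4' };

# Anonymous hash with ONE pair: $hash1 is stringified into a key
# such as "HASH(0x...)", and $hash2 is stored as its value.
my $as_hashref = { $hash1, $hash2 };

# Anonymous array holding the two references, untouched.
my $as_arrayref = [ $hash1, $hash2 ];

# Same single pair as the first case, but in a named hash.
my %as_hash = ( $hash1, $hash2 );

my ($only_key) = keys %$as_hashref;
print "hashref keys: ", scalar(keys %$as_hashref), " ($only_key)\n";
print "arrayref elements: ", scalar(@$as_arrayref), "\n";
print "named hash keys: ", scalar(keys %as_hash), "\n";
```

None of the three copies the inner keys and values, but only the array form keeps both references usable as references.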

Maybe this is more to your liking:
my $newhash = { %$hash1, %$hash2 };
foreach my $key (keys %$newhash) {
print "$key: $newhash->{$key}\n";
}

Presumably, that isn't all you are doing, because if it were you would
just use two loops, one for hash1 and one for hash2, and never make the
combined hash in the first place. And not making the combined hash in the
first place is, of course, the best solution if you can get away with it.
If you need something more general than just those two hashes, then use an
AoH (array of hashes) with a nested loop for printing.

But I'm concerned I'm creating copies of each of
these elements for all of the thousands of instances
of %newhash I will be creating.

Will $hash1 and $hash2 go out of scope or get redefined shortly after
%newhash (or $newhash) is created from them? If so, you most likely
needn't worry on the memory front. And will all these thousands of
instances of %newhash also be properly scoped?

Is there a faster and
memory efficient way to do this?

Is this micro-optimization week or something?

Would it be acceptable to add %$hash2 into %$hash1 rather than making
a brand new %newhash? If so,
@{$hash1}{keys %$hash2}=values %$hash2;
is somewhat more memory efficient.

If not, then:
my %newhash=%$hash1;
undef $hash1;
@newhash{keys %$hash2}=values %$hash2;
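
A runnable sketch of the in-place slice assignment, using cut-down versions of the example hashes. It relies on `keys` and `values` returning their lists in the same order, which holds as long as the hash is not modified between the two calls.

```perl
use strict;
use warnings;

my $hash1 = { key1 => 'Value 1', key2 => 'Value 2' };
my $hash2 = { key4 => 'Value 4', key5 => 'Value 5' };

# Hash slice assignment: add every pair of %$hash2 to %$hash1
# without building a third hash.
@{$hash1}{ keys %$hash2 } = values %$hash2;

print "$_: $hash1->{$_}\n" for sort keys %$hash1;
```

The values are still copied into %$hash1; what is saved is the extra combined hash.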


Xho
 
Arvin Portlock

Arvin said:
my %newhash = (%$hash1, %$hash2);

# The following do not work:
# my $newhash = { $hash1, $hash2 };
# my $newhash = [ $hash1, $hash2 ];
# my %newhash = ( $hash1, $hash2 );

Maybe this is more to your liking:
my $newhash = { %$hash1, %$hash2 };

That dereferencing % there makes me nervous. It looks
to me like a new hash is being created and then a reference
to it is being assigned to $newhash. So for each of the
thousands of instances of $newhash, each will have (a
reference to) its very own copy of %hash1 and %hash2.

$hash1 and $hash2 do not go out of scope. Also they are not
the only hashes. There are typically 20 or 30 of them
throughout the life of the program. Each XML element contains
some combination from among those 20 or 30. E.g.,

$newhash1 = { %$hash40, %$hash31, %$hash12 };
$newhash10012 = { %$hash1, %$hash21, %$hash26 };

In my XML document (METS for the curious), there are some
30 or 40 elements at the top of the document, then further
down there are some thousands that reference some of those
elements with attributes of type IDREFS.

<element id="id1"> ... </element>
<element id="id2"> ... </element>
....
<element id="id30"> ... </element>

<refelement ids="id1 id6 id21"/>
<refelement ids="id22 id11 id21"/>
.... etc. for thousands of <refelements>

Each of the thousands of elements is stored in an array.
At the end of the program I will loop through each of the
thousands of elements and extract certain values from each
one. I want to be able to extract those values by a key
name (which is why an array won't quite work as I can't
access the elements efficiently by key name).

BTW, the above was only an attempt to simplify the problem.
In reality of course I won't be naming my hashes %hash1,
%hash2, etc. Nor will I name the reference elements
$newhash12, $newhash643, etc. The <elements> will live
in a small hash keyed by the id. The <refelement>s will
live in a large array. And I want to be able to write
things like this:

foreach my $refelement (@bigarray) {
print $refelement->{size}, "\n";
print $refelement->{type}, "\n";
}

Where "size" and "type" are typical keys from
among the original 20 or 30 elements (assuming
<refelement ids="id1 id6 id21"/>, "size" may be
a key from the element referenced by "id1", "type"
may come from "id21", and so on).

I'm trying to simplify the problem without posting the
entire huge program, but this may be a bit closer to
what I want (except it doesn't quite work):

my $hashelements = {
'1' => {
'key1' => 'Value 1',
'key2' => 'Value 2',
'key3' => 'Value 3'
},

'6' => {
'key4' => 'Value 4',
'key5' => 'Value 5',
'key6' => 'Value 6'
},

'21' => {
'key7' => 'Value 7',
'key8' => 'Value 8',
'key9' => 'Value 9'
},
};

my $newhash = {};
foreach my $id (1, 6, 21) {
foreach my $key (keys %{$hashelements->{$id}}) {
$newhash->{$key} = \{$hashelements->{$id}->{$key}};
}
}

foreach my $key (keys %$newhash) {
print "$key: ", $newhash->{$key}, "\n";
}

That $newhash->{$key} = \{$hashelements->{$id}->{$key}} part
is an attempt to make sure I only create a reference to
the value rather than make a copy of the value itself.

Perhaps using "each" somehow is the answer. Can't quite get
that to work either though.

Arvin
 
Tad McClellan

[ Please do not top-post.

Please stop top-posting very very soon.
]


Arvin Portlock said:
That dereferencing % there makes me nervous.


Why?

What "danger" do you see that we can help you to avoid?

%newhash and %$newhash should both contain the same keys and values.

It looks
to me like a new hash is being created and then a reference
to it is being assigned to $newhash.


Good, since that _is_ what is happening.

I think maybe your question is more about the contents of this
created hash rather than about the hash itself...

The anon hash contains _copies_ of the keys and values returned
by the dereferencing operation. The named hash (%newhash) also
contains copies of the keys and values returned by the dereferencing
operation.
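
A quick way to see the copying, as a sketch with cut-down example data: change one side and check that the other does not follow.

```perl
use strict;
use warnings;

my $hash1 = { key1 => 'Value 1' };
my $hash2 = { key4 => 'Value 4' };

my $newhash = { %$hash1, %$hash2 };

# Changing the copy leaves the original untouched, and vice versa.
$newhash->{key1} = 'changed in copy';
$hash2->{key4}   = 'changed in original';

print "original key1: $hash1->{key1}\n";
print "copy key4:     $newhash->{key4}\n";
```

The copy is shallow: if a value were itself a reference, the reference would be copied and both hashes would point at the same underlying data.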

So for each of the
thousands of instances of $newhash,


There is only _one_ $newhash scalar.

Do you mean that it will take on thousands of _values_ (hashrefs)?

That shouldn't be a problem. Perl's reference counting will free up
the old one when $newhash no longer refers to the old one.
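
One way to watch the reference counting at work is a weak reference, sketched below with the core Scalar::Util module. `$watcher` is a made-up name for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

my $newhash = { key1 => 'Value 1' };

# A weak reference does not keep its target alive; it becomes undef
# once the last strong reference goes away.
my $watcher = $newhash;
weaken($watcher);

my $alive_before = defined $watcher ? 1 : 0;

# Pointing $newhash at a new anonymous hash drops the only strong
# reference to the old one, so Perl frees it immediately.
$newhash = { key4 => 'Value 4' };

my $alive_after = defined $watcher ? 1 : 0;

print "old hash alive before reassignment: $alive_before\n";
print "old hash alive after reassignment:  $alive_after\n";
```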

each will have (a
reference to) its very own copy of %hash1 and %hash2.


There *are no* such hashes in any of the code above.

It may have been a dual typo on your part, but it pretty much
stops us in our tracks with regard to figuring out what you
are asking...
 
xhoster

Arvin Portlock said:
Arvin said:
my %newhash = (%$hash1, %$hash2);

# The following do not work:
# my $newhash = { $hash1, $hash2 };
# my $newhash = [ $hash1, $hash2 ];
# my %newhash = ( $hash1, $hash2 );

Maybe this is more to your liking:
my $newhash = { %$hash1, %$hash2 };

That dereferencing % there makes me nervous. It looks
to me like a new hash is being created and then a reference
to it is being assigned to $newhash. So for each of the
thousands of instances of $newhash, each will have (a
reference to) its very own copy of %hash1 and %hash2.

Yes, that's right.

$hash1 and $hash2 do not go out of scope.

Why not? As far as I can tell (and I admit to being a bit lost here, with
all the non-trivial transitions from XML to hashes), once they are put into
$newhash, they are no longer needed.

In my XML document (METS for the curious), there are some
30 or 40 elements at the top of the document, then further
down there are some thousands that reference some of those
elements with attributes of type IDREFS.

<element id="id1"> ... </element>
<element id="id2"> ... </element>
...
<element id="id30"> ... </element>

<refelement ids="id1 id6 id21"/>
<refelement ids="id22 id11 id21"/>
... etc. for thousands of <refelements>
....

BTW, the above was only an attempt to simplify the problem.
In reality of course I won't be naming my hashes %hash1,
%hash2, etc.

I'm not sure that is the best choice for simplifying. $hash{1}{foo} and
$hash{2}{foo} are not much more complicated than $hash1{foo} and
$hash2{foo}, and they give valuable clues about the (simplified away)
structure of the program.

Nor will I name the reference elements
$newhash12, $newhash643, etc. The <elements> will live
in a small hash keyed by the id. The <refelement>s will
live in a large array. And I want to be able to write
things like this:

foreach my $refelement (@bigarray) {
print $refelement->{size}, "\n";
print $refelement->{type}, "\n";
}

Where "size" and "type" are typical keys from
among the original 20 or 30 elements (assuming
<refelement ids="id1 id6 id21"/>, "size" may be
a key from the element referenced by "id1",

The key *is* what a hash element is referenced by, so the key of
the element referenced by "id1" is "id1"! Do you mean that the literal
string "size" might be the *value* of the element whose key is "id1"? Or
do you mean that the size will be the value of the element referenced by
the first component of the space-separated refelement? (which in this case
happens to be id1, because id1 is the first component of "id1 id6 id21")

I'm trying to simplify the problem without posting the
entire huge program, but this may be a bit closer to
what I want (except it doesn't quite work):

my $hashelements = {
'1' => {
'key1' => 'Value 1',
'key2' => 'Value 2',
'key3' => 'Value 3'
},

'6' => {
'key4' => 'Value 4',
'key5' => 'Value 5',
'key6' => 'Value 6'
},

'21' => {
'key7' => 'Value 7',
'key8' => 'Value 8',
'key9' => 'Value 9'
},
};

my $newhash = {};
foreach my $id (1, 6, 21) {
foreach my $key (keys %{$hashelements->{$id}}) {
$newhash->{$key} = \{$hashelements->{$id}->{$key}};

You are creating a ref to an anonymous hash (by using curlies) then taking
a reference to that (using backslash). So you get a reference to a scalar
which holds a reference to a one-element hash. Try this:

$newhash->{$key} = \($hashelements->{$id}->{$key});

The parentheses are not actually necessary, but in this case they make it
easier to read correctly (at least for me).

}
}

foreach my $key (keys %$newhash) {
print "$key: ", $newhash->{$key}, "\n";

You need an extra dereference:

print "$key: ", ${$newhash->{$key}}, "\n";
}
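
Putting both corrections together, here is a self-contained sketch on a cut-down version of the example data. `$wrong` is a made-up name showing what the original `\{...}` form produces; the `no warnings 'misc'` line only silences the "Odd number of elements in anonymous hash" warning that form triggers.

```perl
use strict;
use warnings;

my $hashelements = {
    1 => { key1 => 'Value 1', key2 => 'Value 2' },
    6 => { key4 => 'Value 4', key5 => 'Value 5' },
};

my $wrong;
{
    no warnings 'misc';    # the curlies get a one-item list on purpose
    # Curlies build a new one-key anonymous hash; backslash then
    # takes a reference to that hash reference.
    $wrong = \{ $hashelements->{1}{key1} };
}

my $newhash = {};
foreach my $id (1, 6) {
    foreach my $key (keys %{ $hashelements->{$id} }) {
        # Reference to the existing value; the string is not copied.
        $newhash->{$key} = \( $hashelements->{$id}{$key} );
    }
}

print "wrong form is a ", ref($wrong), "\n";
print "right form is a ", ref($newhash->{key1}), "\n";

foreach my $key (sort keys %$newhash) {
    # Extra dereference to get from the scalar ref back to the string.
    print "$key: ", ${ $newhash->{$key} }, "\n";
}
```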

That $newhash->{$key} = \{$hashelements->{$id}->{$key}} part
is an attempt to make sure I only create a reference to
the value rather than make a copy of the value itself.

I'm not sure what you hope to gain by taking the reference. A reference
to the scalar holding the string "Value 9" is not much (if any) smaller
than the thing it is referencing in the first place. You would be better
off just copying it unless either a) you need a change made through
$newhash to be reflected in the original structure, or b) the actual string
is much, much bigger than its example of "Value 9" (which I suppose is not
unlikely).


But I still don't see why you want both $hashelements and $newhash to
exist simultaneously. Unless you are doing something else with
$hashelements which you aren't showing us or telling us about, there is
no need for it once $newhash is made. Which means you could dispense with
$hashelements altogether, and change whatever is making $hashelements so
that it just makes $newhash directly, instead. If you do, for some reason,
need $hashelements in addition to $newhash, then I think you now know how
to do what you want.


Perhaps using "each" somehow is the answer. Can't quite get
that to work either though.

Each doesn't address the root of what you are trying to do, but you could
use it for a slight memory efficiency improvement. For example, replace

foreach my $key (keys %$newhash) {

with

while (defined (my $key = each %$newhash)) {

The first way makes a list which holds a copy of all the keys in
%$newhash right up front. The second one copies the keys one at
a time, as it goes through the hash, so that the memory for each
key can be reused.

Or you could use the list context "each". You have to change the print
statement, too, so it isn't a drop-in replacement, but it does look better
than what it replaces in this case:

while (my ($key,$v) = each %$newhash) {
print "$key: $$v\n";
};


Xho
 
Arvin Portlock

Why not? As far as I can tell (and I admit to being a bit lost here, with
all the non-trivial transitions from XML to hashes), once they are put into
$newhash, they are no longer needed.

Let's see. On the face of it you are right. Once I've
collected my data I'll no longer refer directly to $hash1,
$hash2, etc. I want to keep it around merely to act as
the target of my references. The storehouse for my strings.
Sort of like:

$string = 'Long string I want to share throughout my program';

$stringref1 = \$string;
....
$stringref10000 = \$string;

I know even if $string goes out of scope the references
to it will remain, so I suppose it could go out of scope
at some point. I don't think it matters in the end, does
it? My whole point to this question is that I only want
one instance of this very long string with thousands of
references pointing to it. The problem is complicated
because in fact it's not a single, simple string, but strings
packed into the values of hash references. I only ever
want one instance of the actual values but with thousands
of references to them. PLUS I need to be able to access
them through their original hash keys.
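
As a sketch of the sharing described above: references taken to a hash value all point at the one scalar stored in the hash, and that scalar outlives its hash entry while any reference remains. `%element` and the `url` key are made up for illustration; `refaddr` comes from the core Scalar::Util module.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my %element = ( url => 'Long string I want to share throughout my program' );

# Taking a reference to a hash value points at the scalar stored in
# the hash; the string itself is not duplicated.
my @refs;
push @refs, \$element{url} for 1 .. 3;

my $same_target = ( refaddr($refs[0]) == refaddr($refs[2]) ) ? 1 : 0;
print "all references share one scalar: $same_target\n";

# The target stays alive as long as any reference to it exists, even
# after the key is removed from the hash.
delete $element{url};
print "still reachable: ", ${ $refs[1] }, "\n";
```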
I'm not sure that is the best choice for simplifying. $hash{1}{foo} and
$hash{2}{foo} are not much more complicated than $hash1{foo} and
$hash2{foo}, and they give valuable clues about the (simplified away)
structure of the program.

The usual reason for using hashes over named variables. Don't
know how many there will be, etc. Could be only 5, could be 20.
The key *is* what a hash element is referenced by, so the key of
the element referenced by "id1" is "id1"!

Yes, that's exactly right. The key will always be that id value.
foreach my $refelement (@bigarray) {
print $refelement->{size}, "\n";
print $refelement->{type}, "\n";
}

Where "size" and "type" are typical keys from
among the original 20 or 30 elements (assuming
<refelement ids="id1 id6 id21"/>, "size" may be
a key from the element referenced by "id1",
Do you mean that the literal
string "size" might be the *value* of the element whose key is "id1"? Or
do you mean that the size will be the value of the element referenced by
the first component of the space-separated refelement? (which in this case
happens to be id1, because id1 is the first component of "id1 id6 id21")

Heh, poor choice of example key names I guess. The elements
I am referencing are things like filenames, size of the file,
format of the file, the time it was created, the url where
it can be found... about 20 or 30 values in all. I'll rewrite
that example as:

foreach my $refelement (@bigarray) {
print $refelement->{key1}, "\n";
print $refelement->{key2}, "\n";
}

You are creating a ref to an anonymous hash (by using curlies) then
taking a reference to that (using backslash).

Yes, of course you are right. Thank you for pointing that out.
That didn't occur to me.

So you get a reference to a scalar
which holds a reference to a one-element hash. Try this:

$newhash->{$key} = \($hashelements->{$id}->{$key});

The parentheses are not actually necessary, but in this case they make it
easier to read correctly (at least for me).

With things like this I always get confused about when
parentheses are needed and when they are not. I think I'd
better retain the parentheses.

You need an extra dereference:
print "$key: ", ${$newhash->{$key}}, "\n";

I was hoping for a simpler syntax and thought perhaps
there was some way to avoid the extra dollar dereferencer.
That may not be possible I suppose.

You would be better
off just copying it unless either a) you need a change made through
$newhash to be reflected in the original structure, or b) the actual
string is much, much bigger than its example of "Value 9" (which I suppose
is not unlikely)

No, they'll never be changed. Yes, they're bigger than just
"Value 9". It won't take a huge amount of memory, I certainly
*could* get away with just making copies without a huge impact
on performance. Frankly, I just wanted to learn some new perl
tricks and hope to understand references better as a bonus.
Plus, coming from C++, all this deep copying seems distasteful
on general principle.

But I still don't see why you want both $hashelements and $newhash to
exist simultaneously. Unless you are doing something else with
$hashelements which you aren't showing us or telling us about

No nothing else. It could be $hashelements isn't necessary.
Just someplace to store the actual instances of the strings
which will be referenced elsewhere. But I'm open to all
suggestions.

there is
no need for it once $newhash is made. Which means you could dispense with
$hashelements altogether, and change whatever is making $hashelements so
that it just makes $newhash directly, instead.

But then each $newhash will have copies of the strings rather
than references.

You've been very helpful. Thanks for taking the time to
try and understand what I'm trying to do.

Arvin
 
