Combining multiple hash references into one hash reference

Discussion in 'Perl Misc' started by Arvin Portlock, Sep 1, 2005.

  1. I know lots of ways to combine multiple hashes into a single
    hash but I'm very concerned about memory and copy by value.
    I'm processing some XML documents and have several thousand
    elements that must be linked to relatively few hashes. These
    hashes have unique keys among them so I don't have to worry
    about one hash element overwriting another with the same
    key. The following works as it should:

    my $hash1 = {
        'key1' => 'Value 1',
        'key2' => 'Value 2',
        'key3' => 'Value 3',
    };

    my $hash2 = {
        'key4' => 'Value 4',
        'key5' => 'Value 5',
        'key6' => 'Value 6',
    };

    my %newhash = (%$hash1, %$hash2);

    # The following do not work:
    # my $newhash = { $hash1, $hash2 };
    # my $newhash = [ $hash1, $hash2 ];
    # my %newhash = ( $hash1, $hash2 );

    foreach my $key (keys %newhash) {
        print "$key: $newhash{$key}\n";
    }

    But I'm concerned I'm creating copies of each of
    these elements for all of the thousands of instances
    of %newhash I will be creating. Is there a faster and
    memory efficient way to do this?

    Thanks!

    Arvin
    Arvin Portlock, Sep 1, 2005
    #1

  2. Paul Lalli

    Arvin Portlock wrote:
    > The following works as it should:
    >
    > my $hash1 = {
    > 'key1' => 'Value 1',
    > 'key2' => 'Value 2',
    > 'key3' => 'Value 3',
    > };
    >
    > my $hash2 = {
    > 'key4' => 'Value 4',
    > 'key5' => 'Value 5',
    > 'key6' => 'Value 6',
    > };
    >
    > my %newhash = (%$hash1, %$hash2);
    >
    > # The following do not work:
    > # my $newhash = { $hash1, $hash2 };
    > # my $newhash = [ $hash1, $hash2 ];
    > # my %newhash = ( $hash1, $hash2 );


    I'm not entirely sure what you're going for, but I think this is the
    syntax you're looking for:

    my $newhash = { %$hash1, %$hash2 };

    >
    > foreach my $key (keys %newhash) {
    > print "$key: $newhash{$key}\n";
    > }
    >
    > But I'm concerned I'm creating copies of each of
    > these elements for all of the thousands of instances
    > of %newhash I will be creating. Is there a faster and
    > memory efficient way to do this?


    If you're really just wanting to loop through all the "inner" hashes,
    why not just build an array of the references you already have?

    my @hashes = ($hash1, $hash2);
    foreach my $hash (@hashes) {
        foreach my $key (keys %$hash) {
            print "$key: $hash->{$key}\n";
        }
    }

    Paul Lalli
    Paul Lalli, Sep 1, 2005
    #2

  3. Xho

    Arvin Portlock wrote:
    > I know lots of ways to combine multiple hashes into a single
    > hash but I'm very concerned about memory and copy by value.
    > I'm processing some XML documents and have several thousand
    > elements that must be linked to relatively few hashes. These
    > hashes have unique keys among them so I don't have to worry
    > about one hash element overwriting another with the same
    > key. The following works as it should:
    >
    > my $hash1 = {
    > 'key1' => 'Value 1',
    > 'key2' => 'Value 2',
    > 'key3' => 'Value 3',
    > };
    >
    > my $hash2 = {
    > 'key4' => 'Value 4',
    > 'key5' => 'Value 5',
    > 'key6' => 'Value 6',
    > };
    >
    > my %newhash = (%$hash1, %$hash2);
    >
    > # The following do not work:
    > # my $newhash = { $hash1, $hash2 };
    > # my $newhash = [ $hash1, $hash2 ];
    > # my %newhash = ( $hash1, $hash2 );


    All of those work. They do exactly what they should do, even if that is
    not what you want them to do.

    Maybe this is more to your liking:
    my $newhash = { %$hash1, %$hash2 };
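
    A minimal sketch of what those forms actually build, assuming the two
    small hashes from the question (the names $aref, %odd and $merged are
    made up for this illustration):

```perl
use strict;
use warnings;

my $hash1 = { key1 => 'Value 1', key2 => 'Value 2' };
my $hash2 = { key4 => 'Value 4', key5 => 'Value 5' };

# An anonymous array holding the two references; nothing is merged.
my $aref = [ $hash1, $hash2 ];

# A hash with a single pair: the key is the stringified $hash1
# (something like "HASH(0x...)") and the value is the $hash2 reference.
# The { $hash1, $hash2 } form builds the same thing as an anonymous hash.
my %odd = ( $hash1, $hash2 );

# Dereferencing first flattens both hashes into one key/value list,
# so this is the merged hash, holding copies of the values.
my $merged = { %$hash1, %$hash2 };

print scalar(@$aref), " elements in the array\n";
print scalar(keys %odd), " key in the odd hash\n";
print join(', ', sort keys %$merged), "\n";
```

    Only the last form flattens the hashes; the other two just store the
    references themselves.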

    >
    > foreach my $key (keys %newhash) {
    > print "$key: $newhash{$key}\n";
    > }


    Presumably, that isn't all you are doing, because if it were you would
    just use two loops, one for hash1 and one for hash2, and never make the
    combined hash in the first place. And not making the combined hash in the
    first place is, of course, the best solution if you can get away with it.
    If you need something more general than just those two hashes, then use an
    AoH (array of hashes) with a nested loop for printing.

    >
    > But I'm concerned I'm creating copies of each of
    > these elements for all of the thousands of instances
    > of %newhash I will be creating.


    Will $hash1 and $hash2 go out of scope or get redefined shortly after
    %newhash (or $newhash) is created from them? If so, you most likely
    needn't worry on the memory front. And will all these thousands of
    instances of %newhash also be properly scoped?

    > Is there a faster and
    > memory efficient way to do this?


    Is this micro-optimization week or something?

    Would it be acceptable to add %$hash2 into %$hash1 rather than making
    a brand new %newhash? If so,
    @{$hash1}{keys %$hash2}=values %$hash2;
    is somewhat more memory efficient.

    If not, then:
    my %newhash=%$hash1;
    undef $hash1;
    @newhash{keys %$hash2}=values %$hash2;
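
    A minimal runnable sketch of the in-place slice merge, assuming the
    sample data from the question. It relies on keys and values returning
    their lists in the same order for a hash that is not modified in
    between:

```perl
use strict;
use warnings;

my $hash1 = { key1 => 'Value 1', key2 => 'Value 2' };
my $hash2 = { key4 => 'Value 4', key5 => 'Value 5' };

# Hash slice assignment: every key of %$hash2 is added to %$hash1
# together with a copy of its value. No third hash is built.
@{$hash1}{ keys %$hash2 } = values %$hash2;

print "$_: $hash1->{$_}\n" for sort keys %$hash1;
```

    The values are still copied into %$hash1; what is saved is the
    intermediate flattened list and the separate combined hash.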


    Xho

    , Sep 1, 2005
    #3
  4. wrote:

    > Arvin Portlock wrote:
    >
    > >my %newhash = (%$hash1, %$hash2);
    > >
    > ># The following do not work:
    > ># my $newhash = { $hash1, $hash2 };
    > ># my $newhash = [ $hash1, $hash2 ];
    > ># my %newhash = ( $hash1, $hash2 );

    >
    > Maybe this is more to your liking:
    > my $newhash = { %$hash1, %$hash2 };


    That dereferencing % there makes me nervous. It looks
    to me like a new hash is being created and then a reference
    to it is being assigned to $newhash. So for each of the
    thousands of instances of $newhash, each will have (a
    reference to) its very own copy of %hash1 and %hash2.

    $hash1 and $hash2 do not go out of scope. Also they are not
    the only hashes. There are typically 20 or 30 of them
    throughout the life of the program. Each XML element contains
    some combination from among those 20 or 30. E.g.,

    $newhash1 = { %$hash40, %$hash31, %$hash12 };
    $newhash10012 = { %$hash1, %$hash21, %$hash26 };

    In my XML document (METS for the curious), there are some
    30 or 40 elements at the top of the document, then further
    down there are some thousands that reference some of those
    elements with attributes of type IDREFS.

    <element id="id1"> ... </element>
    <element id="id2"> ... </element>
    ....
    <element id="id30"> ... </element>

    <refelement ids="id1 id6 id21"/>
    <refelement ids="id22 id11 id21"/>
    .... etc. for thousands of <refelements>

    Each of the thousands of elements are stored in an array.
    At the end of the program I will loop through each of the
    thousands of elements and extract certain values from each
    one. I want to be able to extract those values by a key
    name (which is why an array won't quite work as I can't
    access the elements efficiently by key name).

    BTW, the above was only an attempt to simplify the problem.
    In reality of course I won't be naming my hashes %hash1,
    %hash2, etc. Nor will I name the reference elements
    $newhash12, $newhash643, etc. The <elements> will live
    in a small hash keyed by the id. The <refelement>s will
    live in a large array. And I want to be able to write
    things like this:

    foreach my $refelement (@bigarray) {
        print $refelement->{size}, "\n";
        print $refelement->{type}, "\n";
    }

    Where "size" and "type" are typical keys from
    among the original 20 or 30 elements (assuming
    <refelement ids="id1 id6 id21"/>, "size" may be
    a key from the element referenced by "id1", "type"
    may come from "id21", and so on).

    I'm trying to simplify the problem without posting the
    entire huge program, but this may be a bit closer to
    what I want (except it doesn't quite work):

    my $hashelements = {
        '1' => {
            'key1' => 'Value 1',
            'key2' => 'Value 2',
            'key3' => 'Value 3'
        },

        '6' => {
            'key4' => 'Value 4',
            'key5' => 'Value 5',
            'key6' => 'Value 6'
        },

        '21' => {
            'key7' => 'Value 7',
            'key8' => 'Value 8',
            'key9' => 'Value 9'
        },
    };

    my $newhash = {};
    foreach my $id (1, 6, 21) {
        foreach my $key (keys %{$hashelements->{$id}}) {
            $newhash->{$key} = \{$hashelements->{$id}->{$key}};
        }
    }

    foreach my $key (keys %$newhash) {
        print "$key: ", $newhash->{$key}, "\n";
    }

    That $newhash->{$key} = \{$hashelements->{$id}->{$key}} part
    is an attempt to make sure I only create a reference to
    the value rather than make a copy of the value itself.

    Perhaps using "each" somehow is the answer. Can't quite get
    that to work either though.

    Arvin

    Arvin Portlock, Sep 1, 2005
    #4
  5. [ Please do not top-post.

    Please stop top-posting very very soon.
    ]


    Arvin Portlock wrote:
    > wrote:
    >> Arvin Portlock wrote:
    >>
    >> >my %newhash = (%$hash1, %$hash2);


    >> Maybe this is more to your liking:
    >> my $newhash = { %$hash1, %$hash2 };

    >
    > That dereferencing % there makes me nervous.



    Why?

    What "danger" do you see that we can help you to avoid?

    %newhash and %$newhash should both contain the same keys and values.


    > It looks
    > to me like a new hash is being created and then a reference
    > to it is being assigned to $newhash.



    Good, since that _is_ what is happening.

    I think maybe your question is more about the contents of this
    created hash rather than about the hash itself...

    The anon hash contains _copies_ of the keys and values returned
    by the dereferencing operation. The named hash (%newhash) also
    contains copies of the keys and values returned by the dereferencing
    operation.
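
    A minimal sketch showing that the merged hash really holds independent
    copies, assuming one-key versions of the hashes from the question:

```perl
use strict;
use warnings;

my $hash1 = { key1 => 'Value 1' };
my $hash2 = { key4 => 'Value 4' };

# Builds a new anonymous hash from copies of the keys and values.
my $newhash = { %$hash1, %$hash2 };

# A change on either side is not seen on the other side.
$newhash->{key1} = 'changed in the copy';
$hash2->{key4}   = 'changed in the original';

print "original key1: $hash1->{key1}\n";
print "copy key4:     $newhash->{key4}\n";
```

    The copy is shallow: if a value is itself a reference, the reference
    is copied and both hashes then point at the same underlying data.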


    > So for each of the
    > thousands of instances of $newhash,



    There is only _one_ $newhash scalar.

    Do you mean that it will take on thousands of _values_ (hashrefs)?

    That shouldn't be a problem. Perl's reference counting will free up
    the old one when $newhash no longer refers to the old one.


    > each will have (a reference to)
    > its very own copy of %hash1 and %hash2.



    There *are no* such hashes in any of the code above.

    It may have been a dual typo on your part, but it pretty much
    stops us in our tracks with regard to figuring out what you
    are asking...



    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 2, 2005
    #5
  6. Arvin Portlock

    Guest

    Arvin Portlock wrote:
    > wrote:
    >
    > > Arvin Portlock wrote:
    > >
    > > >my %newhash = (%$hash1, %$hash2);
    > > >
    > > ># The following do not work:
    > > ># my $newhash = { $hash1, $hash2 };
    > > ># my $newhash = [ $hash1, $hash2 ];
    > > ># my %newhash = ( $hash1, $hash2 );

    > >
    > > Maybe this is more to your liking:
    > > my $newhash = { %$hash1, %$hash2 };

    >
    > That dereferencing % there makes me nervous. It looks
    > to me like a new hash is being created and then a reference
    > to it is being assigned to $newhash. So for each of the
    > thousands of instances of $newhash, each will have (a
    > reference to) its very own copy of %hash1 and %hash2.


    Yes, that's right.

    > $hash1 and $hash2 do not go out of scope.


    Why not? As far as I can tell (and I admit to being a bit lost here, with
    all the non-trivial transitions from XML to hashes), once they are put into
    $newhash, they are no longer needed.

    >
    > In my XML document (METS for the curious), there are some
    > 30 or 40 elements at the top of the document, then further
    > down there are some thousands that reference some of those
    > elements with attributes of type IDREFS.
    >
    > <element id="id1"> ... </element>
    > <element id="id2"> ... </element>
    > ...
    > <element id="id30"> ... </element>
    >
    > <refelement ids="id1 id6 id21"/>
    > <refelement ids="id22 id11 id21"/>
    > ... etc. for thousands of <refelements>
    >

    ....
    >
    > BTW, the above was only an attempt to simplify the problem.
    > In reality of course I won't be naming my hashes %hash1,
    > %hash2, etc.


    I'm not sure that is the best choice for simplifying. $hash{1}{foo} and
    $hash{2}{foo} are not much more complicated than $hash1{foo} and
    $hash2{foo}, and they give valuable clues about the (simplified away)
    structure of the program.


    > Nor will I name the reference elements
    > $newhash12, $newhash643, etc. The <elements> will live
    > in a small hash keyed by the id. The <refelement>s will
    > live in a large array. And I want to be able to write
    > things like this:
    >
    > foreach my $refelement (@bigarray) {
    > print $refelement->{size}, "\n";
    > print $refelement->{type}, "\n";
    > }
    >
    > Where "size" and "type" are typical keys from
    > among the original 20 or 30 elements (assuming
    > <refelement ids="id1 id6 id21"/>, "size" may be
    > a key from the element referenced by "id1",


    The key *is* what a hash element is referenced by, so the key of
    the element referenced by "id1" is "id1"! Do you mean that the literal
    string "size" might be the *value* of the element whose key is "id1"? Or
    do you mean that the size will be the value of the element referenced by
    the first component of the space-separated refelement? (which in this case
    happens to be id1, because id1 is the first component of "id1 id6 id21")

    >
    > I'm trying to simplify the problem without posting the
    > entire huge program, but this may be a bit closer to
    > what I want (except it doesn't quite work):
    >
    > my $hashelements = {
    > '1' => {
    > 'key1' => 'Value 1',
    > 'key2' => 'Value 2',
    > 'key3' => 'Value 3'
    > },
    >
    > '6' => {
    > 'key4' => 'Value 4',
    > 'key5' => 'Value 5',
    > 'key6' => 'Value 6'
    > },
    >
    > '21' => {
    > 'key7' => 'Value 7',
    > 'key8' => 'Value 8',
    > 'key9' => 'Value 9'
    > },
    > };
    >
    > my $newhash = {};
    > foreach my $id (1, 6, 21) {
    > foreach my $key (keys %{$hashelements->{$id}}) {
    > $newhash->{$key} = \{$hashelements->{$id}->{$key}};


    You are creating a ref to an anonymous hash (by using curlies), then taking
    a reference to that (using backslash). So you get a reference to a scalar
    which holds a reference to a one-element hash. Try this:

    $newhash->{$key} = \($hashelements->{$id}->{$key});

    The parentheses are not actually necessary, but in this case they make it
    easier to read correctly (at least for me).


    > }
    > }
    >
    > foreach my $key (keys %$newhash) {
    > print "$key: ", $newhash->{$key}, "\n";


    You need an extra dereference:

    print "$key: ", ${$newhash->{$key}}, "\n";

    > }
    >
    > That $newhash->{$key} = \{$hashelements->{$id}->{$key}} part
    > is an attempt to make sure I only create a reference to
    > the value rather than make a copy of the value itself.


    I'm not sure what you hope to gain by taking the reference. A reference
    to the scalar holding the string "Value 9" is not much (if any) smaller
    than the thing it is referencing in the first place. You would be better
    off just copying it unless either a) you need a change made through
    $newhash to be reflected in the original structure, or b) the actual string
    is much, much bigger than its example of "Value 9" (which I suppose is not
    unlikely).
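
    A minimal sketch of the reference-per-value approach, assuming a
    trimmed-down $hashelements with two ids (the variable $element is
    introduced here only for readability). It also demonstrates case a):
    a write through the reference changes the original structure.

```perl
use strict;
use warnings;

my $hashelements = {
    1 => { key1 => 'Value 1' },
    6 => { key4 => 'Value 4' },
};

my $newhash = {};
for my $id (1, 6) {
    my $element = $hashelements->{$id};
    for my $key (keys %$element) {
        # Store a reference to the existing value, not a copy of it.
        $newhash->{$key} = \$element->{$key};
    }
}

# Reading needs the extra dereference.
print "key4: ${ $newhash->{key4} }\n";

# A write through the reference shows up in the original structure.
${ $newhash->{key1} } = 'Updated 1';
print "key1 in the original: $hashelements->{1}{key1}\n";
```

    Each entry of %$newhash is still a scalar of its own (the reference),
    so the saving only matters when the strings are long.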


    But I still don't see why you want both $hashelements and $newhash to
    exist simultaneously. Unless you are doing something else with
    $hashelements which you aren't showing us or telling us about, there is
    no need for it once $newhash is made. Which means you could dispense with
    $hashelements altogether, and change whatever is making $hashelements so
    that it just makes $newhash directly, instead. If you do, for some reason,
    need $hashelements in addition to $newhash, then I think you now know how
    to do what you want.



    > Perhaps using "each" somehow is the answer. Can't quite get
    > that to work either though.


    Each doesn't address the root of what you are trying to do, but you could
    use it for a slight memory efficiency improvement. For example, replace

    foreach my $key (keys %$newhash) {

    with

    while (defined (my $key = each %$newhash)) {

    The first way makes a list which holds a copy of all the keys in
    %$newhash right up front. The second one copies the keys one at
    a time, as it goes through the hash, so that the memory for each
    key can be reused.

    Or you could use the list context "each". You have to change the print
    statement, too, so it isn't a drop-in replacement, but it does look better
    than what it replaces in this case:

    while (my ($key, $v) = each %$newhash) {
        print "$key: $$v\n";
    }


    Xho

    , Sep 2, 2005
    #6
  7. wrote:

    > >$hash1 and $hash2 do not go out of scope.

    >
    > Why not? As far as I can tell (and I admit to being a bit lost here, with
    > all the non-trivial transitions from XML to hashes), once they are put into
    > $newhash, they are no longer needed.


    Let's see. On the face of it you are right. Once I've
    collected my data I'll no longer refer directly to $hash1,
    $hash2, etc. I want to keep it around merely to act as
    the target of my references. The storehouse for my strings.
    Sort of like:

    $string = 'Long string I want to share throughout my program';

    $stringref1 = \$string;
    ....
    $stringref10000 = \$string;

    I know even if $string goes out of scope the references
    to it will remain, so I suppose it could go out of scope
    at some point. I don't think it matters in the end, does
    it? My whole point to this question is that I only want
    one instance of this very long string with thousands of
    references pointing to it. The problem is complicated
    because in fact it's not a single, simple string, but strings
    packed into the values of hash references. I only ever
    want one instance of the actual values but with thousands
    of references to them. PLUS I need to be able to access
    them through their original hash keys.
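
    One way to get that without copying any strings is to let each
    <refelement> hold only the references to the hashes it names, and to
    look keys up across them on demand. This is a sketch under stated
    assumptions, not something proposed in the thread: the sample data,
    the %elements store and the lookup() helper are all hypothetical.

```perl
use strict;
use warnings;

# The small store: one hash per <element>, keyed by id (made-up data).
my %elements = (
    id1  => { size => '1024', name => 'page1.tif' },
    id6  => { url  => 'http://example.org/page1.tif' },
    id21 => { type => 'image/tiff' },
);

# Each <refelement> keeps only references to the hashes it names.
# The hash slice returns the stored hashrefs; no strings are copied.
my @bigarray = (
    [ @elements{qw(id1 id6 id21)} ],
    [ @elements{qw(id21 id1)} ],
);

# Hypothetical helper: find a key in the few hashes a refelement points at.
sub lookup {
    my ($refelement, $key) = @_;
    for my $hash (@$refelement) {
        return $hash->{$key} if exists $hash->{$key};
    }
    return;    # key not present in any referenced hash
}

for my $refelement (@bigarray) {
    print lookup($refelement, 'size'), "\n";
    print lookup($refelement, 'type'), "\n";
}
```

    The trade-off is a short linear search over a handful of hashes per
    lookup instead of one direct hash access, and the $refelement->{size}
    syntax is replaced by a function call.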

    > >BTW, the above was only an attempt to simplify the problem.
    > >In reality of course I won't be naming my hashes %hash1,
    > >%hash2, etc.

    >
    > I'm not sure that is the best choice for simplifying. $hash{1}{foo} and
    > $hash{2}{foo} are not much more complicated than $hash1{foo} and
    > $hash2{foo}, and they give valuable clues about the (simplified away)
    > structure of the program.


    The usual reason for using hashes over named variables. Don't
    know how many there will be, etc. Could be only 5, could be 20.

    > The key *is* what a hash element is referenced by, so the key of
    > the element referenced by "id1" is "id1"!


    Yes, that's exactly right. The key will always be that id value.

    >foreach my $refelement (@bigarray) {
    > print $refelement->{size}, "\n";
    > print $refelement->{type}, "\n";
    > }
    >
    > Where "size" and "type" are typical keys from
    > among the original 20 or 30 elements (assuming
    > <refelement ids="id1 id6 id21"/>, "size" may be
    > a key from the element referenced by "id1",


    > Do you mean that the literal
    > string "size" might be the *value* of the element whose key is "id1"? Or
    > do you mean that the size will be the value of the element referenced by
    > the first component of the space-separated refelement? (which in this case
    > happens to be id1, because id1 is the first component of "id1 id6 id21")


    Heh, poor choice of example key names I guess. The elements
    I am referencing are things like filenames, size of the file,
    format of the file, the time it was created, the url where
    it can be found... about 20 or 30 values in all. I'll rewrite
    that example as:

    foreach my $refelement (@bigarray) {
        print $refelement->{key1}, "\n";
        print $refelement->{key2}, "\n";
    }


    >
    > >my $newhash = {};
    > >foreach my $id (1, 6, 21) {
    > > foreach my $key (keys %{$hashelements->{$id}}) {
    > > $newhash->{$key} = \{$hashelements->{$id}->{$key}};

    >
    >
    > You are creating an ref to an anonymous hash (by using curlies) then
    > taking
    > a reference to that (using backslash).


    Yes, of course you are right. Thank you for pointing that out.
    That didn't occur to me.

    > So you get a reference to a scalar
    > which holds a reference to a one-element hash. Try this:
    >
    > $newhash->{$key} = \($hashelements->{$id}->{$key});
    >
    > The parentheses are not actually necessary, but in this case they make it
    > easier to read correctly (at least for me).


    With things like this I always get confused about when
    parentheses are needed and when they are not. I think I'd
    better retain the parentheses.

    > >foreach my $key (keys %$newhash) {
    > > print "$key: ", $newhash->{$key}, "\n";

    >
    > You need an extra dereference:
    > print "$key: ", ${$newhash->{$key}}, "\n";


    I was hoping for a simpler syntax and thought perhaps
    there was some way to avoid the extra dollar dereferencer.
    That may not be possible I suppose.

    > You would be better
    > off just copying it unless either a) You need a change made through
    > $newhash to be reflected in the original structure, or b) the actual
    > string
    > is much, much bigger than its example of "Value 9" (which I suppose is not
    > unlikely)


    No, they'll never be changed. Yes, they're bigger than just
    "Value 9". It won't take a huge amount of memory, I certainly
    *could* get away with just making copies without a huge impact
    on performance. Frankly, I just wanted to learn some new perl
    tricks and hope to understand references better as a bonus.
    Plus, coming from C++, all this deep copying seems distasteful
    on general principle.

    > But I still don't see why you want both $hashelements and $newhash to
    > exist simultaneously. Unless you are doing something else with
    > $hashelements which you aren't showing us or telling us about


    No nothing else. It could be $hashelements isn't necessary.
    Just someplace to store the actual instances of the strings
    which will be referenced elsewhere. But I'm open to all
    suggestions.

    > there is
    > no need for it once $newhash is made. Which means you could dispense with
    > $hashelements altogether, and change whatever is making $hashelements so
    > that it just makes $newhash directly, instead.


    But then each $newhash will have copies of the strings rather
    than references.

    You've been very helpful. Thanks for taking the time to
    try and understand what I'm trying to do.

    Arvin

    Arvin Portlock, Sep 2, 2005
    #7
