Detecting duplicate keys in a hash I am "requiring"

Gupit

Hi,
I have a configuration file which is in a perl hash (of hashes and
arrays) format.

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => val6
            },
};

In my script I do
use vars qw($hash);
if (eval "require config_file") {
    print "Required config_file\n";
} else {
    print "Couldn't require config_file\n";
}

Now if you notice in the config_file, $hash has two duplicate keys
i.e. key1 has been specified twice. When using require, perl simply
keeps the last value it encountered for key1.

I would like to detect these duplicates. Is there a stricter mode in
perl that will print warnings when perl encounters duplicate keys in
hashes? I could turn it on for the eval/require bit.

Is there any other way to detect these duplicates? Unfortunately the
config_file is really huge and uses hashes of hashes of hashes and
arrays and so on, so writing a script to parse the file would be
painful.

Thanks in advance
G
 
Tassilo v. Parseval

Also sprach Gupit:
I have a configuration file which is in a perl hash (of hashes and
arrays) format.

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => val6
            },
};

In my script I do
use vars qw($hash);
if (eval "require config_file") {
    print "Required config_file\n";
} else {
    print "Couldn't require config_file\n";
}

Now if you notice in the config_file, $hash has two duplicate keys
i.e. key1 has been specified twice. When using require, perl simply
keeps the last value it encountered for key1.

I would like to detect these duplicates. Is there a stricter mode in
perl that will print warnings when perl encounters duplicate keys in
hashes? I could turn it on for the eval/require bit.

No, there's no such mode that I know of. You can implement one yourself,
but you'd need to change your logic a little. First of all, you need a
tied hash that will do the warning:

package Tie::StrictHash;
require Tie::Hash;
@Tie::StrictHash::ISA = qw(Tie::StdHash);
use Carp;

sub STORE {
    my ($hash, $key, $val) = @_;
    carp "Duplicate key '$key' detected" if exists $hash->{ $key };
    # store it nonetheless, overwriting the previous key/value pair
    $hash->SUPER::STORE($key, $val);
}

package main;
use vars qw(%hash);
tie %hash => 'Tie::StrictHash';
require 'config_file';

For the above, you have to turn the toplevel hash-ref into a real hash
to make it work.
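
Just to make that concrete: the config_file would then assign to %hash
directly instead of building a hash-ref, roughly like this (a sketch using
the keys from your example; the final "1;" is there because require()
wants a true value from the file):

# config_file -- the top level is now a real hash, so every top-level
# assignment goes through Tie::StrictHash::STORE
%hash = (
    key1 => { key2 => 'val2',
              key3 => 'val3',
            },
    key4 => [ 'vala', 'valb' ],
    key1 => { key5 => 'val5',    # this second key1 triggers the carp
              key6 => 'val6',
            },
);
1;

A list assignment to a tied hash happens pairwise through STORE, which is
why the duplicate top-level key gets caught.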
Is there any other way to detect these duplicates? Unfortunately the
config_file is really huge and uses hashes of hashes of hashes and
arrays and so on, so writing a script to parse the file would be
painful.

I'd say any attempt to parse it yourself is extremely likely to fail.
The above solution is more realistic; however, it's only a partial one.
Tie::StrictHash cannot detect duplicate keys in nested hashes, only in
the top-level one. I tried a recursive tie/STORE for each sub-hash but
then I realized that those attempts are futile since the subhash has
already been processed by perl before STORE gets to see it.

It's an interesting problem. But I don't see how it could be solved.

Tassilo
 
Anno Siegel

Gupit said:
Hi,
I have a configuration file which is in a perl hash (of hashes and
arrays) format.

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => val6
            },
};

In my script I do
use vars qw($hash);
if (eval "require config_file") {
    print "Required config_file\n";
} else {
    print "Couldn't require config_file\n";
}

Now if you notice in the config_file, $hash has two duplicate keys
i.e. key1 has been specified twice.

There are no duplicate keys in a hash; hash keys are unique by
construction. Every entry for the same key overwrites the previous
one.
When using require, perl simply
keeps the last value it encountered for key1.

Not only with require -- the hash doesn't ever contain the duplicate.
I would like to detect these duplicates. Is there a stricter mode in
perl that will print warnings when perl encounters duplicate keys in
hashes? I could turn it on for the eval/require bit.

Then specify your configuration structure as a list(ref), not a hash.
Replace the outermost pair of "{}" with "[]" (and change the name from
$hash to something descriptive). Then read "perldoc -q duplicate"
to see how to detect duplicates in the list you are importing.
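
For the top level, that check boils down to something like this (a sketch;
$config, %cfg and %seen are made-up names, with $config being the renamed
top-level listref):

use vars qw($config);              # the renamed top-level listref
require 'config_file';

# walk the pair list, warn on keys seen before, build the real hash
my (%seen, %cfg);
for (my $i = 0; $i < $#$config; $i += 2) {
    my ($key, $val) = @$config[ $i, $i + 1 ];
    warn "Duplicate key '$key' in config\n" if $seen{$key}++;
    $cfg{$key} = $val;
}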
Is there any other way to detect these duplicates? Unfortunately the
config_file is really huge and uses hashes of hashes of hashes and
arrays and so on, so writing a script to parse the file would be
painful.

Whatever the structure of the hash, the top level structure is one
of key/value pairs. That isn't so hard to parse.

Anno
 
Gupit

Unfortunately, duplicate keys may have been specified in the lower-level
hash structure as well. Replacing only the top-level hash structure won't
help there.

e.g. see key7 in the following example

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => {
                  key7 => val7,
                  key8 => val8,
                  key7 => val9,
              },
            },
};

(I don't call this structure $hash in real life; I have simplified the
actual data structure here for ease of understanding and to retain the
focus of my question :))

regards,
G


Gupit said:
Hi,
I have a configuration file which is in a perl hash (of hashes and
arrays) format.

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => val6
            },
};

In my script I do
use vars qw($hash);
if (eval "require config_file") {
    print "Required config_file\n";
} else {
    print "Couldn't require config_file\n";
}

Now if you notice in the config_file, $hash has two duplicate keys
i.e. key1 has been specified twice.

There are no duplicate keys in a hash; hash keys are unique by
construction. Every entry for the same key overwrites the previous
one.
When using require, perl simply
keeps the last value it encountered for key1.

Not only with require -- the hash doesn't ever contain the duplicate.
I would like to detect these duplicates. Is there a stricter mode in
perl that will print warnings when perl encounters duplicate keys in
hashes? I could turn it on for the eval/require bit.

Then specify your configuration structure as a list(ref), not a hash.
Replace the outermost pair of "{}" with "[]" (and change the name from
$hash to something descriptive). Then read "perldoc -q duplicate"
to see how to detect duplicates in the list you are importing.
Is there any other way to detect these duplicates? Unfortunately the
config_file is really huge and uses hashes of hashes of hashes and
arrays and so on, so writing a script to parse the file would be
painful.

Whatever the structure of the hash, the top level structure is one
of key/value pairs. That isn't so hard to parse.

Anno
 
Anno Siegel

Please don't top-post. To restore context, I have to recapitulate.

My suggestion, WRT the data structure below, was to use a list (of pairs)
instead of the top-level hash to allow duplicate keys. The list can
trivially be converted to the hash, but can also be checked for duplicates.
Unfortunately, duplicate keys may have been specified in the lower-level
hash structure as well. Replacing only the top-level hash structure won't
help there.

e.g. see key7 in the following example

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => {
                  key7 => val7,
                  key8 => val8,
                  key7 => val9,
              },
            },
};

(I don't call this structure $hash in real life; I have simplified the
actual data structure here for ease of understanding and to retain the
focus of my question :))

Well, if the problem re-appears at a lower level, re-apply the solution.
In other words, re-design the data structure so that every hash that may
have a key specified more than once is replaced with a list of pairs.

Mind you, I don't know if this is a good solution in your case -- I know
too little about what use you will make of the structure. It's just
the first thing that came to mind.
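
A rough sketch of the conversion, in case it helps (all names are made up,
and for simplicity it treats every array-ref as a list of pairs):

use strict;
use warnings;

# recursively turn a list of key/value pairs into a hash-ref,
# warning whenever a key shows up more than once at the same level
sub pairs_to_hash {
    my ($pairs, $path) = @_;
    $path = '' unless defined $path;
    my %hash;
    for (my $i = 0; $i < $#$pairs; $i += 2) {
        my ($key, $val) = @$pairs[ $i, $i + 1 ];
        warn "Duplicate key '$path$key'\n" if exists $hash{$key};
        $hash{$key} = ref $val eq 'ARRAY'
                    ? pairs_to_hash($val, "$path$key/")
                    : $val;
    }
    return \%hash;
}

my $pairs = [
    key1 => [ key2 => 'val2', key3 => 'val3' ],
    key1 => [ key5 => 'val5' ],              # duplicate -> warning
];
my $config = pairs_to_hash($pairs);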

[snip TOFU]

Anno
 
Tassilo v. Parseval

Also sprach Anno Siegel:
Unfortunately, duplicate keys may have been specified in the lower-level
hash structure as well. Replacing only the top-level hash structure won't
help there.

e.g. see key7 in the following example

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => {
                  key7 => val7,
                  key8 => val8,
                  key7 => val9,
              },
            },
};

(I don't call this structure $hash in real life; I have simplified the
actual data structure here for ease of understanding and to retain the
focus of my question :))

Well, if the problem re-appears at a lower level, re-apply the solution.
In other words, re-design the data structure so that every hash that may
have a key specified more than once is replaced with a list of pairs.

If we look at $hash->{key4}, regarding everything as a list likely has
its problems, too. The OP might be able to solve this with a
dispatch-table for particular entries. If he knows that there is always
a list associated with 'key4', then this is feasible. Here is a
spontaneous solution (might not be complete since it doesn't handle
structures within array-refs, only within hash-refs). And no, I do not
intend to win a beauty-award for it:

use Data::Dumper;
use strict;
use vars qw($list);
eval "require 'config_file'";

my %dispatch;
# these hash-keys all have another hash-ref as value
@dispatch{ qw/key1 key2 key3 key5 key6 key6/ } =
    (sub {
        my ($key, $val, $hash) = @_;
        die "Duplicate key '$key'\n" if exists $hash->{ $key };
        if (@$val > 1) {
            for (my $i = 0; $i <= $#$val; $i += 2) {
                die "Duplicate key '$val->[$i]'\n"
                    if exists $hash->{ $key }->{ $val->[$i] };
                if (! ref $val->[$i+1]) {
                    $hash->{ $key }->{ $val->[$i] } = $val->[$i+1];
                } else {
                    $dispatch{$val->[$i]}->($val->[$i], $val->[$i+1], $hash);
                }
            }
        } else {
            $hash->{ $key } = $val;
        }
    }) x 6;

# takes an array-ref as value
$dispatch{ key4 } = sub {
    my ($key, $val, $hash) = @_;
    if (exists $hash->{ $key }) {
        die "Duplicate key '$key'\n";
    }
    $hash->{ $key } = $val;
};

my %hash;
for (my $i = 0; $i <= $#$list; $i += 2) {
    $dispatch{$list->[$i]}->($list->[$i], $list->[$i+1], \%hash);
}
print Dumper \%hash;

For a config-file without duplicate values like this:

$list = [
    key1 => [ key2 => 'val2',
              key3 => 'val3',
              key4 => 'val1',
            ],
    key4 => [ qw/vala valb/ ],
    key5 => [ key5 => 'val5',
              key6 => 'val6'
            ],
];

it will correctly produce:

$VAR1 = {
    'key5' => {
        'key5' => 'val5',
        'key6' => 'val6'
    },
    'key4' => [
        'vala',
        'valb'
    ],
    'key1' => {
        'key2' => 'val2',
        'key4' => 'val1',
        'key3' => 'val3'
    }
};

and die when duplicate keys within a sub-structure are detected.

Tassilo
 
Anno Siegel

Tassilo v. Parseval said:
Also sprach Anno Siegel:
Unfortunately, duplicate keys may have been specified in the lower-level
hash structure as well. Replacing only the top-level hash structure won't
help there.

e.g. see key7 in the following example

$hash = {
    key1 => { key2 => val2,
              key3 => val3,
            },
    key4 => [ vala, valb, ],
    key1 => { key5 => val5,
              key6 => {
                  key7 => val7,
                  key8 => val8,
                  key7 => val9,
              },
            },
};

(I don't call this structure $hash in real life; I have simplified the
actual data structure here for ease of understanding and to retain the
focus of my question :))

Well, if the problem re-appears at a lower level, re-apply the solution.
In other words, re-design the data structure so that every hash that may
have a key specified more than once is replaced with a list of pairs.

If we look at $hash->{key4}, regarding everything as a list likely has
its problems, too.

Yes, the "real list" among the hashes had me worried too, but I
decided to gloss over it.
The OP might be able to solve this with a
dispatch-table for particular entries. If he knows that there is always
a list associated with 'key4', then this is feasible.

We need a way to tell an ordinary list structure from one that is meant
to represent a hash with multiple keys. A dispatch table, like you
suggest, can do that. Another way would be to point to the "real"
listrefs though another reference. When traversing the structure,
whenever we meet a scalar ref, we know it contains a "real" listref.
A plain listref represents a multi-hash. (Or the other way around.)

Similarly, one could bless one kind of listrefs into a class to make
them distinguishable.
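
Something along these lines, off the top of my head (the helper and the
class name are arbitrary):

# mark the lists that stand for multi-key hashes; leave real lists alone
sub multi { bless [ @_ ], 'Multi_key_hash' }

my $config = [
    key1 => multi( key2 => 'val2', key3 => 'val3' ),
    key4 => [ 'vala', 'valb' ],              # an ordinary list
];

# later, while traversing the pairs:
for (my $i = 0; $i < $#$config; $i += 2) {
    my ($key, $val) = @$config[ $i, $i + 1 ];
    if ( ref $val eq 'Multi_key_hash' ) {
        # a hash that may have repeated keys: check and convert
    }
    elsif ( ref $val eq 'ARRAY' ) {
        # a plain list: keep as-is
    }
}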
Here is a
spontaneous solution (might not be complete since it doesn't handle
structures within array-refs, only within hash-refs). And no, I do not
intend to win a beauty-award for it:

[code and results snipped]
...and die when duplicate keys within a sub-structure are detected.

I am beginning to have doubts about that. I get the impression the
OP *wants* the duplicate keys.

Anno
 
Tassilo v. Parseval

Also sprach Anno Siegel:
Yes, the "real list" among the hashes had me worried too, but I
decided to gloss over it.


We need a way to tell an ordinary list structure from one that is meant
to represent a hash with multiple keys. A dispatch table, like you
suggest, can do that. Another way would be to point to the "real"
listrefs through another reference. When traversing the structure,
whenever we meet a scalar ref, we know it contains a "real" listref.
A plain listref represents a multi-hash. (Or the other way around.)

Similarly, one could bless one kind of listrefs into a class to make
them distinguishable.

Gee. We are beginning to make up our own little meta-language now. :)
I think when things start to get that sophisticated, a complete new
approach is worth a look. Maybe the OP should use a real config-file
format. Either one already supported in the CPAN or his own one that is
easy to parse.
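
For instance, Config::General parses an Apache-style plain-text config.
A minimal, untested sketch (check its documentation for how repeated
options and blocks are handled):

use Config::General;

# parse a plain-text config instead of eval'ing Perl code
my $conf   = Config::General->new('config_file');
my %config = $conf->getall;

The config itself would then look roughly like

<key1>
    key2  val2
    key3  val3
</key1>
key4  vala
key4  valb

instead of a Perl data structure.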

But still, the problem presented here again makes me want a tie_deeply()
mechanism for perl. There are situations (like this one), where tying
the nested structures is impossible because it happens too late.
I am beginning to have doubts about that. I get the impression the
OP *wants* the duplicate keys.

Well, first step would be s/die/warn/g. :) I'm too lazy now to modify
my routines accordingly, but it shouldn't be too hard to turn the value
of an already existing key/value pair into an array-ref and push the
second value onto it.

Tassilo
 
Anno Siegel

Tassilo v. Parseval said:
Also sprach Anno Siegel:


Gee. We are beginning to make up our own little meta-language now. :)
I think when things start to get that sophisticated, a complete new
approach is worth a look. Maybe the OP should use a real config-file
format. Either one already supported in the CPAN or his own one that is
easy to parse.

Indeed. There are very useful config modules out there. However, I'm
afraid we lost the OP a few plies ago :)
But still, the problem presented here again makes me want a tie_deeply()
mechanism for perl. There are situations (like this one), where tying
the nested structures is impossible because it happens too late.

I'm not sure what deep-tying would be (something like autovivification
to another tied object?).
Well, first step would be s/die/warn/g. :) I'm too lazy now to modify
my routines accordingly, but it shouldn't be too hard to turn the value
of an already existing key/value pair into an array-ref and push the
second value onto it.

I wouldn't invest too much in it either, for the moment.

However, it occurred to me that this problem may have didactical uses
for the introduction of Perl objects:

Use the problem to show that it's desirable to make similar references
distinguishable, and introduce bless/ref for the purpose.

Show a solution that uses bless() to mark those lists that represent
multi-key hashes, but doesn't use method calls yet. There'll be
a few "if ( ref( ...) eq '...' ) {" in the code.

Then introduce method calls and show how they simplify the solution.
End up with an implementation of a class "Multi_key_hash" that solves
the original problem.

That way, objects can be introduced with good motivation for every
step.

Anno
 
Tassilo v. Parseval

Also sprach Anno Siegel:
Indeed. There are very useful config modules out there. However, I'm
afraid we lost the OP a few plies ago :)

In my direct follow-up to him I warned him that a solution is
non-trivial. ;-)
I'm not sure what deep-tying would be (something like autovivification
to another tied object?).

Yes, though I wouldn't call it autovivification. I haven't yet thought
about what the interface should look like, but this

%tied_hash = ( key1 => 'val1',
               key2 => { subkey1 => 1,
                         subkey2 => 2 },
             );

should not result in

(tied %tied_hash)->STORE('key1', 'val1');
                 ->STORE('key2', { subkey1 => 1, subkey2 => 2 });

but rather in

(tied %tied_hash)->STORE('key1', 'val1');
                 ->STORE('key2', 'subkey1', 1);
                 ->STORE('key2', 'subkey2', 2);

or so. And the above should happen recursively. It needs a proper
calling convention and all, then it would be very nice to have. It would
make a module such as MLDBM much more natural in use.
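
A hand-rolled approximation of the idea, just to illustrate (package name
made up; and of course it still cannot see duplicates that perl has already
folded away inside a nested hash literal, so it doesn't rescue the OP):

package Tie::DeepHash;
use strict;
use warnings;
use Tie::Hash;
our @ISA = ('Tie::StdHash');

sub STORE {
    my ($self, $key, $val) = @_;
    if (ref $val eq 'HASH' && !tied %$val) {
        # swap the plain hash-ref for a tied one; copying the keys
        # re-enters STORE, so deeper hash-refs get the same treatment
        tie my %deep, __PACKAGE__;
        %deep = %$val;
        $val  = \%deep;
    }
    $self->SUPER::STORE($key, $val);
}

1;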

The idea however is not new. It's just tough finding volunteers to
implement it. ;-)
I wouldn't invest too much in it either, for the moment.

However, it occurred to me that this problem may have didactical uses
for the introduction of Perl objects:

Use the problem to show that it's desirable to make similar references
distinguishable, and introduce bless/ref for the purpose.

Show a solution that uses bless() to mark those lists that represent
multi-key hashes, but doesn't use method calls yet. There'll be
a few "if ( ref( ...) eq '...' ) {" in the code.

Then introduce method calls and show how they simplify the solution.
End up with an implementation of a class "Multi_key_hash" that solves
the original problem.

That way, objects can be introduced with good motivation for every
step.

Yes, it would be a more than welcome variation of the schemes that seem
to be preferred in those OO-tutorials, such as deriving pigs from
animals, letting them oink etc. ;-)

Tassilo
 
