An odd sort requirement - data munging


C

cartercc

I have a series of date related values, such as 01/2007, 02/2007,
03/2008, 04/2006, etc. These values represent a column in various
files that range from several dozen rows deep to almost 1M rows deep.
My job is to create reports from a collection of these types of files.

I create a number of refs to hashes that have the general appearance
of this:
$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

I write to an outfile generally like this:
foreach $k1 (sort keys %h) {
foreach $k2 (sort keys %{$h{$k1}}) (
print OUTFILE " <k1>$k1</k1> <k2>$k2</k2> <val>$h{$k1}{$k2}</val>
\n";
}
}

When I run this, it works perfectly, sorting the date values into
perfect numerical order. When something's so perfect, you know it's
wrong!

Here's the problem: the ordering of the dates isn't numerical, the
proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
03/2006, 04/2006 ... 01/2007, 02/2007
03/2007, 04/2007 ... 01/2008, 02/2008

Here's another view of the problem - when I print data for a year, the
ordering is:
03/07, 04/07, 05/07 ... 01/07, 02/07

What I would like to do is overload the sort operator (call it
'sort_y') to sort in this non-numerical order. Can this be done? Can
it be done perhaps in C and compiled to run in Perl? Can it be done
algorithmetically by passing sort a function of some kind? I've tried
this, and the logic is very clumsy and full of stupid relational
operators and elsifs.

The former solution was to copy the data into Excel and manually cut
and paste the columns in the correct order. Some of these reports are
enormous (25 cols by 2500 rows) and I don't want to do this. This took
a lot of time and was very much error prone.

Thanks, CC.
 
Ad

Advertisements

C

Charlton Wilbur

cc> What I would like to do is overload the sort operator (call it
cc> 'sort_y') to sort in this non-numerical order. Can this be done?

perldoc -f sort, and notice that sort takes an optional code block or
function name.

Charlton
 
U

Uri Guttman

c> I have a series of date related values, such as 01/2007, 02/2007,
c> 03/2008, 04/2006, etc. These values represent a column in various
c> files that range from several dozen rows deep to almost 1M rows deep.
c> My job is to create reports from a collection of these types of files.

c> I create a number of refs to hashes that have the general appearance
c> of this:
c> $h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

you don't need to build up such a hash tree to sort this data. it will
slow you down. you can sort directly on the fields but we don't know the
original record format (a single long string?).

c> I write to an outfile generally like this:
c> foreach $k1 (sort keys %h) {
c> foreach $k2 (sort keys %{$h{$k1}}) (

those are string comparisons (the default for sort).

c> When I run this, it works perfectly, sorting the date values into
c> perfect numerical order. When something's so perfect, you know it's
c> wrong!

you say numerical order (but use string compares as i said above)

c> Here's the problem: the ordering of the dates isn't numerical, the
c> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
c> 03/2006, 04/2006 ... 01/2007, 02/2007
c> 03/2007, 04/2007 ... 01/2008, 02/2008


c> Here's another view of the problem - when I print data for a year, the
c> ordering is:
c> 03/07, 04/07, 05/07 ... 01/07, 02/07

can you show the code for this sort?

c> The former solution was to copy the data into Excel and manually cut
c> and paste the columns in the correct order. Some of these reports are
c> enormous (25 cols by 2500 rows) and I don't want to do this. This took
c> a lot of time and was very much error prone.

25 cols x 2500 rows is not enormous for perl.

try using Sort::Maker for this. you just define how you want each key
extracted (using a regex or substr or any code) from your record and how
to sort it (numeric/string, up/down). in your case i would say use a
string sort since you have padded numbers. then make sure you sort the
keys in the order you want - year first, and then month. you will get a
clean and fast sort without all of your extra code and not needing to
use external programs.

uri
 
C

cartercc

Thanks, Uri,

  c> I have a series of date related values, such as 01/2007, 02/2007,
  c> 03/2008, 04/2006, etc. These values represent a column in various
  c> files that range from several dozen rows deep to almost 1M rows deep.
  c> My job is to create reports from a collection of these types of files.

  c> I create a number of refs to hashes that have the general appearance
  c> of this:
  c> $h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

you don't need to build up such a hash tree to sort this data. it will
slow you down. you can sort directly on the fields but we don't know the
original record format (a single long string?).

Actually, this is a general solution to a number of different
problems. The keys could represent a number of different types, such
as states, cities, course names (e.g., ART 109, BIO 225, MAT 4556),
people names, and so on. The common factor is that every type of field
is both unique and sortable, so I can use the same data structure
regardless of the type of data.

With particular reference to this sort problem, it's the only one type
that can't be sorted normally, which is the out-of-order date types.
  c> I write to an outfile generally like this:
  c> foreach $k1 (sort keys %h) {
  c>   foreach $k2 (sort keys %{$h{$k1}}) (

those are string comparisons (the default for sort).

You are right. Actually, isn't it an ASCII sort? When I sort character
values, case makes a difference.
  c> When I run this, it works perfectly, sorting the date values into
  c> perfect numerical order. When something's so perfect, you know it's
  c> wrong!

you say numerical order (but use string compares as i said above)

True -- excuse my casualness.
  c> Here's the problem: the ordering of the dates isn't numerical, the
  c> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
  c> 03/2006, 04/2006 ... 01/2007, 02/2007
  c> 03/2007, 04/2007 ... 01/2008, 02/2008

  c> Here's another view of the problem - when I print data for a year,the
  c> ordering is:
  c> 03/07, 04/07, 05/07 ... 01/07, 02/07

can you show the code for this sort?

There is no code other than that shown above. I have the data in a
hash of hash refs. What I meant was -- the INSTITUTIONAL ordering is
different than the NUMERIC ordering. What it prints is, "1 2 3 4 5".
What I want it to print is "3 4 5 1 2". And, I want it to print this
only for this kind of string, not for other strings that might contain
numeric characters, such as zip codes, area codes, ID numbers, etc.
  c> The former solution was to copy the data into Excel and manually cut
  c> and paste the columns in the correct order. Some of these reports are
  c> enormous (25 cols by 2500 rows) and I don't want to do this. This took
  c> a lot of time and was very much error prone.

25 cols x 2500 rows is not enormous for perl.

Right. However, it is for manual processing. This is why I want a
scripted solution. My scripted solution is about 95% complete, and if
I could solve this problem it would be close to 100%.
try using Sort::Maker for this. you just define how you want each key
extracted (using a regex or substr or any code) from your record and how
to sort it (numeric/string, up/down). in your case i would say use a
string sort since you have padded numbers. then make sure you sort the
keys in the order you want - year first, and then month. you will get a
clean and fast sort without all of your extra code and not needing to
use external programs.

Thanks, I'll look at this.

And in replying to your message, an idea occurred to me.

The central component in all these strings is the slash: /. If I write
a function that takes a key as the parameter such as "2006/03" and
returns a string such as "2006/1/03", then I could control exactly how
the sort would work. When I was finished with the report, I could run
it through another function that would substitute "/1/" with just "/"
and that would solve the problem.

Thanks, CC.
 
J

Jürgen Exner

cartercc said:
I have a series of date related values, such as 01/2007, 02/2007,
03/2008, 04/2006, etc. These values represent a column in various
files that range from several dozen rows deep to almost 1M rows deep.
My job is to create reports from a collection of these types of files.

I create a number of refs to hashes that have the general appearance
of this:
$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

After scratching my head for some time I am guessing that probably $k1
contains the month and $k2 contains the year.
If you had provided a minimal, self-contained script as requested in the
posting guidelines it would have been much easier to identify your data
structure.
I write to an outfile generally like this:
foreach $k1 (sort keys %h) {
foreach $k2 (sort keys %{$h{$k1}}) (
print OUTFILE " <k1>$k1</k1> <k2>$k2</k2> <val>$h{$k1}{$k2}</val>
\n";
}
}
Here's the problem: the ordering of the dates isn't numerical, the
proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
03/2006, 04/2006 ... 01/2007, 02/2007
03/2007, 04/2007 ... 01/2008, 02/2008

Well, that is what you are asking for. Assuming $k1 and $k2 are month
and year respectively then you are sorting your data by month and within
each month by year.
You could just reverse those two, sorting by year first and then within
each year by month.

Another solution would be to write a custom compare function. You will
have to pass the pair of year and month for each of $a and $b as those
are actually the number you want to sort. And then once you get that
sorted list just loop through it and print the corresponding values from
the data set.
I'd be interested in coding it but I'm not good enough to do it without
any testing and since you didn't provide any self-contained program that
could be used a test bed that's not an option.

Yet another solution would be to change your data structure. Your HoHoA
has the granularity of the time spans reversed. Had you put year as the
top value, then your algorithm above would have worked naturally.
What I would like to do is overload the sort operator (call it
'sort_y') to sort in this non-numerical order. Can this be done?

Why would you want to do that? Why don't you simply write your own
custom compare function and use that instead of the default said:
Can
it be done perhaps in C and compiled to run in Perl? Can it be done
algorithmetically by passing sort a function of some kind?

Dah, did you even read the man page for sort()? That's what the first
argument of sort() is all about!

jue
 
C

cartercc

After scratching my head for some time I am guessing that probably $k1
contains the month and $k2 contains the year.
If you had provided a minimal, self-contained script as requested in the
posting guidelines it would have been much easier to identify your data
structure.

No. $k1, etc., contains ANYTHING that's sortable and unique. It can
contain names, like "Exner, J", "New York," or "Baltimore" or numbers
(telephone, area code, ID numbers) or other values. The contents of
the keys are not relevant to the code or to the question.
Well, that is what you are asking for. Assuming $k1 and $k2 are month
and year respectively then you are sorting your data by month and within
each month by year.
You could just reverse those two, sorting by year first and then within
each year by month.

Actually, no. Here is a sample of a data file:
"07/T1","A27","117"
"07/T1","D01","3"
"07/T1","EA27","30"
"07/T1","EF20","52"
....
"08/T5","V26","17"
"08/T5","W03","11"
"08/T5","W04","4"
"08/T5","W05","1"

Hee is a sample of another data file:
1222413 G07 07/T2 07/RFA 07/T2
1247990 FH1 08/T4 08/RSP 08/T4
1094529 EARMY 05/T4 05/T4 05/T5 07/T1 07/RFA 07/T2
1247991 V24 08/T4 08/RSP 08/T4

As you can see, the 'date' values are unary values and I don't have
any real need to split them.
Another solution would be to write a custom compare function. You will
have to pass the pair of year and month for each of $a and $b as those
are actually the number you want to sort. And then once you get that
sorted list just loop through it and print the corresponding values from
the data set.
I'd be interested in coding it but I'm not good enough to do it without
any testing and since you didn't provide any self-contained program that
could be used a test bed that's not an option.

I just posted a half-assed idea of a solution that would require
processing the file two more times to convert and unconvert this field
to something that would sort naturally. I would like to see a custom
compare function, and if you want, I can send you sample data files
(they contain no confidential or sensitive information) and a script
that I use to product an OUTFILE. I've spend a non-trivial amount of
time thinking about it, and I can't see a solution.
Yet another solution would be to change your data structure. Your HoHoA
has the granularity of the time spans reversed. Had you put year as the
top value, then your algorithm above would have worked naturally.

No, because of this:
Calendar year -
07/01, 07/02, 07/03, ...
Reporting year -
07/03, 07/04, ... 07/01, 07/02
Academic year -
07/01, 07/02 ... 08/04, 08/05

Please note that the Academic year crosses year boundries, i.e., from
2007 to 2008, while the Reporting year crosses month boundries, i.e.,
'03' starts the series and '01','02' ends the series.

Dah, did you even read the man page for sort()? That's what the first
argument of sort() is all about!

Actually, no, I didn't. I know that sort can take a function as an
argument, but I was focused on the algorithm, not the implementation.
But I'm headed that way now.

CC
 
Ad

Advertisements

U

Uri Guttman

c> Actually, this is a general solution to a number of different
c> problems. The keys could represent a number of different types, such
c> as states, cities, course names (e.g., ART 109, BIO 225, MAT 4556),
c> people names, and so on. The common factor is that every type of field
c> is both unique and sortable, so I can use the same data structure
c> regardless of the type of data.

but then you still have issues with key processing (getting the order
you want) and sort key ordering. this is why my module can help.

c> You are right. Actually, isn't it an ASCII sort? When I sort character
c> values, case makes a difference.

yes, sort use asciibetical sorting. you can fold case by extracting
string keys and upper/lower casing them all.

c> There is no code other than that shown above. I have the data in a
c> hash of hash refs. What I meant was -- the INSTITUTIONAL ordering is
c> different than the NUMERIC ordering. What it prints is, "1 2 3 4 5".
c> What I want it to print is "3 4 5 1 2". And, I want it to print this
c> only for this kind of string, not for other strings that might contain
c> numeric characters, such as zip codes, area codes, ID numbers, etc.

oh, i didn't get you wanted an internal wacky ordering. the easiest
thing to do is to extract your keys and map them through a hash which
converts them to an order value that you want. something like this
(highly untested):

# based on the order you show above.

my %sort_ordering = (
1 => 3,
2 => 4,
....

) ;

and a basic sort would be like this:

@sorted = sort { $sort_ordering{$a} cmp $sort_ordering{$b} } @input ;


c> Thanks, I'll look at this.

sort::maker will simplify key extraction and custom ordering. you would
build a ordering hash as above but put that conversion in the key
extract code. something like this:

my $sorter = make_sorter( 'GRT',
string => '$sort_ordering{get_the_key($_)}'
) ;

get_the_key is either direct code or a call to (ahem!) get the key from
$_.

c> The central component in all these strings is the slash: /. If I write
c> a function that takes a key as the parameter such as "2006/03" and
c> returns a string such as "2006/1/03", then I could control exactly how
c> the sort would work. When I was finished with the report, I could run
c> it through another function that would substitute "/1/" with just "/"
c> and that would solve the problem.

that is the conversion hash i spoke of. it can be a sub too if the
conversion needs more work or such. actually with sort::maker you would
sort on multiple keys each having its own sort ordering hash as
needed. then you don't need an extraction sub which is slower.

uri
 
Ad

Advertisements

U

Uri Guttman

GJ> An example of the Schwartzian Transform (look it up in wikipedia):

GJ> my @dates = qw(
GJ> 03/2005 04/2005 01/2006 02/2006 03/2006 04/2006
GJ> 01/2007 02/2007 03/2007 04/2007 01/2008 02/2008);
GJ> # or @dates = keys %somehash

GJ> my @sorted =
GJ> map { $_->[0] }
GJ> sort { $a->[1] <=> $b->[1] }
GJ> map { [$_, substr($_,3,4) . substr($_,0,2)] } @dates;


and sort::maker can generate an ST and spit out the code for you or you
can use the code ref it creates. it also removes redundancy (duplicate
key code and hides all the sort/map/sort syntax. and it can generate GRT
sorts which are much faster than the ST.

another trick is to pregenerate sorters for various key sets and print
them out off line. then paste in those sorters (name the subs) and call
the desired one based on the data set being sorted. then you don't need
to hand code all the sort variants of key sets.

uri
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top