An odd sort requirement - data munging

Discussion in 'Perl Misc' started by cartercc, Nov 4, 2008.

  1. cartercc

    cartercc Guest

    I have a series of date related values, such as 01/2007, 02/2007,
    03/2008, 04/2006, etc. These values represent a column in various
    files that range from several dozen rows deep to almost 1M rows deep.
    My job is to create reports from a collection of these types of files.

    I create a number of refs to hashes that have the general appearance
    of this:
    $h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

    I write to an outfile generally like this:
    foreach $k1 (sort keys %h) {
    foreach $k2 (sort keys %{$h{$k1}}) (
    print OUTFILE " <k1>$k1</k1> <k2>$k2</k2> <val>$h{$k1}{$k2}</val>
    \n";
    }
    }

    When I run this, it works perfectly, sorting the date values into
    perfect numerical order. When something's so perfect, you know it's
    wrong!

    Here's the problem: the ordering of the dates isn't numerical, the
    proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
    03/2006, 04/2006 ... 01/2007, 02/2007
    03/2007, 04/2007 ... 01/2008, 02/2008

    Here's another view of the problem - when I print data for a year, the
    ordering is:
    03/07, 04/07, 05/07 ... 01/07, 02/07

    What I would like to do is overload the sort operator (call it
    'sort_y') to sort in this non-numerical order. Can this be done? Can
    it be done perhaps in C and compiled to run in Perl? Can it be done
    algorithmetically by passing sort a function of some kind? I've tried
    this, and the logic is very clumsy and full of stupid relational
    operators and elsifs.

    The former solution was to copy the data into Excel and manually cut
    and paste the columns in the correct order. Some of these reports are
    enormous (25 cols by 2500 rows) and I don't want to do this. This took
    a lot of time and was very much error prone.

    Thanks, CC.
     
    cartercc, Nov 4, 2008
    #1
    1. Advertising

  2. >>>>> "cc" == cartercc <> writes:

    cc> What I would like to do is overload the sort operator (call it
    cc> 'sort_y') to sort in this non-numerical order. Can this be done?

    perldoc -f sort, and notice that sort takes an optional code block or
    function name.

    Charlton


    --
    Charlton Wilbur
     
    Charlton Wilbur, Nov 4, 2008
    #2
    1. Advertising

  3. cartercc

    Uri Guttman Guest

    >>>>> "c" == cartercc <> writes:

    c> I have a series of date related values, such as 01/2007, 02/2007,
    c> 03/2008, 04/2006, etc. These values represent a column in various
    c> files that range from several dozen rows deep to almost 1M rows deep.
    c> My job is to create reports from a collection of these types of files.

    c> I create a number of refs to hashes that have the general appearance
    c> of this:
    c> $h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

    you don't need to build up such a hash tree to sort this data. it will
    slow you down. you can sort directly on the fields but we don't know the
    original record format (a single long string?).

    c> I write to an outfile generally like this:
    c> foreach $k1 (sort keys %h) {
    c> foreach $k2 (sort keys %{$h{$k1}}) (

    those are string comparisons (the default for sort).

    c> When I run this, it works perfectly, sorting the date values into
    c> perfect numerical order. When something's so perfect, you know it's
    c> wrong!

    you say numerical order (but use string compares as i said above)

    c> Here's the problem: the ordering of the dates isn't numerical, the
    c> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
    c> 03/2006, 04/2006 ... 01/2007, 02/2007
    c> 03/2007, 04/2007 ... 01/2008, 02/2008


    c> Here's another view of the problem - when I print data for a year, the
    c> ordering is:
    c> 03/07, 04/07, 05/07 ... 01/07, 02/07

    can you show the code for this sort?

    c> The former solution was to copy the data into Excel and manually cut
    c> and paste the columns in the correct order. Some of these reports are
    c> enormous (25 cols by 2500 rows) and I don't want to do this. This took
    c> a lot of time and was very much error prone.

    25 cols x 2500 rows is not enormous for perl.

    try using Sort::Maker for this. you just define how you want each key
    extracted (using a regex or substr or any code) from your record and how
    to sort it (numeric/string, up/down). in your case i would say use a
    string sort since you have padded numbers. then make sure you sort the
    keys in the order you want - year first, and then month. you will get a
    clean and fast sort without all of your extra code and not needing to
    use external programs.

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Free Perl Training --- http://perlhunter.com/college.html ---------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, Nov 4, 2008
    #3
  4. cartercc

    cartercc Guest

    Thanks, Uri,

    On Nov 4, 11:34 am, Uri Guttman <> wrote:
    > >>>>> "c" == cartercc  <> writes:

    >
    >   c> I have a series of date related values, such as 01/2007, 02/2007,
    >   c> 03/2008, 04/2006, etc. These values represent a column in various
    >   c> files that range from several dozen rows deep to almost 1M rows deep.
    >   c> My job is to create reports from a collection of these types of files.
    >
    >   c> I create a number of refs to hashes that have the general appearance
    >   c> of this:
    >   c> $h{$k1}{$k2}{$k3} => data (generally but not always a simple count).
    >
    > you don't need to build up such a hash tree to sort this data. it will
    > slow you down. you can sort directly on the fields but we don't know the
    > original record format (a single long string?).


    Actually, this is a general solution to a number of different
    problems. The keys could represent a number of different types, such
    as states, cities, course names (e.g., ART 109, BIO 225, MAT 4556),
    people names, and so on. The common factor is that every type of field
    is both unique and sortable, so I can use the same data structure
    regardless of the type of data.

    With particular reference to this sort problem, it's the only one type
    that can't be sorted normally, which is the out-of-order date types.

    >   c> I write to an outfile generally like this:
    >   c> foreach $k1 (sort keys %h) {
    >   c>   foreach $k2 (sort keys %{$h{$k1}}) (
    >
    > those are string comparisons (the default for sort).


    You are right. Actually, isn't it an ASCII sort? When I sort character
    values, case makes a difference.

    >   c> When I run this, it works perfectly, sorting the date values into
    >   c> perfect numerical order. When something's so perfect, you know it's
    >   c> wrong!
    >
    > you say numerical order (but use string compares as i said above)


    True -- excuse my casualness.

    >   c> Here's the problem: the ordering of the dates isn't numerical, the
    >   c> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
    >   c> 03/2006, 04/2006 ... 01/2007, 02/2007
    >   c> 03/2007, 04/2007 ... 01/2008, 02/2008
    >
    >   c> Here's another view of the problem - when I print data for a year,the
    >   c> ordering is:
    >   c> 03/07, 04/07, 05/07 ... 01/07, 02/07
    >
    > can you show the code for this sort?


    There is no code other than that shown above. I have the data in a
    hash of hash refs. What I meant was -- the INSTITUTIONAL ordering is
    different than the NUMERIC ordering. What it prints is, "1 2 3 4 5".
    What I want it to print is "3 4 5 1 2". And, I want it to print this
    only for this kind of string, not for other strings that might contain
    numeric characters, such as zip codes, area codes, ID numbers, etc.

    >   c> The former solution was to copy the data into Excel and manually cut
    >   c> and paste the columns in the correct order. Some of these reports are
    >   c> enormous (25 cols by 2500 rows) and I don't want to do this. This took
    >   c> a lot of time and was very much error prone.
    >
    > 25 cols x 2500 rows is not enormous for perl.


    Right. However, it is for manual processing. This is why I want a
    scripted solution. My scripted solution is about 95% complete, and if
    I could solve this problem it would be close to 100%.

    > try using Sort::Maker for this. you just define how you want each key
    > extracted (using a regex or substr or any code) from your record and how
    > to sort it (numeric/string, up/down). in your case i would say use a
    > string sort since you have padded numbers. then make sure you sort the
    > keys in the order you want - year first, and then month. you will get a
    > clean and fast sort without all of your extra code and not needing to
    > use external programs.


    Thanks, I'll look at this.

    And in replying to your message, an idea occurred to me.

    The central component in all these strings is the slash: /. If I write
    a function that takes a key as the parameter such as "2006/03" and
    returns a string such as "2006/1/03", then I could control exactly how
    the sort would work. When I was finished with the report, I could run
    it through another function that would substitute "/1/" with just "/"
    and that would solve the problem.

    Thanks, CC.
     
    cartercc, Nov 4, 2008
    #4
  5. cartercc <> wrote:
    >I have a series of date related values, such as 01/2007, 02/2007,
    >03/2008, 04/2006, etc. These values represent a column in various
    >files that range from several dozen rows deep to almost 1M rows deep.
    >My job is to create reports from a collection of these types of files.
    >
    >I create a number of refs to hashes that have the general appearance
    >of this:
    >$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).


    After scratching my head for some time I am guessing that probably $k1
    contains the month and $k2 contains the year.
    If you had provided a minimal, self-contained script as requested in the
    posting guidelines it would have been much easier to identify your data
    structure.

    >I write to an outfile generally like this:
    >foreach $k1 (sort keys %h) {
    > foreach $k2 (sort keys %{$h{$k1}}) (
    > print OUTFILE " <k1>$k1</k1> <k2>$k2</k2> <val>$h{$k1}{$k2}</val>
    >\n";
    > }
    >}
    >Here's the problem: the ordering of the dates isn't numerical, the
    >proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
    >03/2006, 04/2006 ... 01/2007, 02/2007
    >03/2007, 04/2007 ... 01/2008, 02/2008


    Well, that is what you are asking for. Assuming $k1 and $k2 are month
    and year respectively then you are sorting your data by month and within
    each month by year.
    You could just reverse those two, sorting by year first and then within
    each year by month.

    Another solution would be to write a custom compare function. You will
    have to pass the pair of year and month for each of $a and $b as those
    are actually the number you want to sort. And then once you get that
    sorted list just loop through it and print the corresponding values from
    the data set.
    I'd be interested in coding it but I'm not good enough to do it without
    any testing and since you didn't provide any self-contained program that
    could be used a test bed that's not an option.

    Yet another solution would be to change your data structure. Your HoHoA
    has the granularity of the time spans reversed. Had you put year as the
    top value, then your algorithm above would have worked naturally.

    >What I would like to do is overload the sort operator (call it
    >'sort_y') to sort in this non-numerical order. Can this be done?


    Why would you want to do that? Why don't you simply write your own
    custom compare function and use that instead of the default <=>?

    >Can
    >it be done perhaps in C and compiled to run in Perl? Can it be done
    >algorithmetically by passing sort a function of some kind?


    Dah, did you even read the man page for sort()? That's what the first
    argument of sort() is all about!

    jue
     
    Jürgen Exner, Nov 4, 2008
    #5
  6. cartercc

    cartercc Guest

    On Nov 4, 12:23 pm, Jürgen Exner <> wrote:
    > cartercc <> wrote:
    > >I have a series of date related values, such as 01/2007, 02/2007,
    > >03/2008, 04/2006, etc. These values represent a column in various
    > >files that range from several dozen rows deep to almost 1M rows deep.
    > >My job is to create reports from a collection of these types of files.

    >
    > >I create a number of refs to hashes that have the general appearance
    > >of this:
    > >$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

    >
    > After scratching my head for some time I am guessing that probably $k1
    > contains the month and $k2 contains the year.
    > If you had provided a minimal, self-contained script as requested in the
    > posting guidelines it would have been much easier to identify your data
    > structure.


    No. $k1, etc., contains ANYTHING that's sortable and unique. It can
    contain names, like "Exner, J", "New York," or "Baltimore" or numbers
    (telephone, area code, ID numbers) or other values. The contents of
    the keys are not relevant to the code or to the question.

    > >Here's the problem: the ordering of the dates isn't numerical, the
    > >proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
    > >03/2006, 04/2006 ... 01/2007, 02/2007
    > >03/2007, 04/2007 ... 01/2008, 02/2008

    >
    > Well, that is what you are asking for. Assuming $k1 and $k2 are month
    > and year respectively then you are sorting your data by month and within
    > each month by year.
    > You could just reverse those two, sorting by year first and then within
    > each year by month.


    Actually, no. Here is a sample of a data file:
    "07/T1","A27","117"
    "07/T1","D01","3"
    "07/T1","EA27","30"
    "07/T1","EF20","52"
    ....
    "08/T5","V26","17"
    "08/T5","W03","11"
    "08/T5","W04","4"
    "08/T5","W05","1"

    Hee is a sample of another data file:
    1222413 G07 07/T2 07/RFA 07/T2
    1247990 FH1 08/T4 08/RSP 08/T4
    1094529 EARMY 05/T4 05/T4 05/T5 07/T1 07/RFA 07/T2
    1247991 V24 08/T4 08/RSP 08/T4

    As you can see, the 'date' values are unary values and I don't have
    any real need to split them.

    > Another solution would be to write a custom compare function. You will
    > have to pass the pair of year and month for each of $a and $b as those
    > are actually the number you want to sort. And then once you get that
    > sorted list just loop through it and print the corresponding values from
    > the data set.
    > I'd be interested in coding it but I'm not good enough to do it without
    > any testing and since you didn't provide any self-contained program that
    > could be used a test bed that's not an option.


    I just posted a half-assed idea of a solution that would require
    processing the file two more times to convert and unconvert this field
    to something that would sort naturally. I would like to see a custom
    compare function, and if you want, I can send you sample data files
    (they contain no confidential or sensitive information) and a script
    that I use to product an OUTFILE. I've spend a non-trivial amount of
    time thinking about it, and I can't see a solution.

    > Yet another solution would be to change your data structure. Your HoHoA
    > has the granularity of the time spans reversed. Had you put year as the
    > top value, then your algorithm above would have worked naturally.


    No, because of this:
    Calendar year -
    07/01, 07/02, 07/03, ...
    Reporting year -
    07/03, 07/04, ... 07/01, 07/02
    Academic year -
    07/01, 07/02 ... 08/04, 08/05

    Please note that the Academic year crosses year boundries, i.e., from
    2007 to 2008, while the Reporting year crosses month boundries, i.e.,
    '03' starts the series and '01','02' ends the series.


    > Dah, did you even read the man page for sort()? That's what the first
    > argument of sort() is all about!


    Actually, no, I didn't. I know that sort can take a function as an
    argument, but I was focused on the algorithm, not the implementation.
    But I'm headed that way now.

    CC
     
    cartercc, Nov 4, 2008
    #6
  7. cartercc

    Uri Guttman Guest

    >>>>> "c" == cartercc <> writes:

    >> you don't need to build up such a hash tree to sort this data. it will
    >> slow you down. you can sort directly on the fields but we don't know the
    >> original record format (a single long string?).


    c> Actually, this is a general solution to a number of different
    c> problems. The keys could represent a number of different types, such
    c> as states, cities, course names (e.g., ART 109, BIO 225, MAT 4556),
    c> people names, and so on. The common factor is that every type of field
    c> is both unique and sortable, so I can use the same data structure
    c> regardless of the type of data.

    but then you still have issues with key processing (getting the order
    you want) and sort key ordering. this is why my module can help.

    >> those are string comparisons (the default for sort).


    c> You are right. Actually, isn't it an ASCII sort? When I sort character
    c> values, case makes a difference.

    yes, sort use asciibetical sorting. you can fold case by extracting
    string keys and upper/lower casing them all.

    >> can you show the code for this sort?


    c> There is no code other than that shown above. I have the data in a
    c> hash of hash refs. What I meant was -- the INSTITUTIONAL ordering is
    c> different than the NUMERIC ordering. What it prints is, "1 2 3 4 5".
    c> What I want it to print is "3 4 5 1 2". And, I want it to print this
    c> only for this kind of string, not for other strings that might contain
    c> numeric characters, such as zip codes, area codes, ID numbers, etc.

    oh, i didn't get you wanted an internal wacky ordering. the easiest
    thing to do is to extract your keys and map them through a hash which
    converts them to an order value that you want. something like this
    (highly untested):

    # based on the order you show above.

    my %sort_ordering = (
    1 => 3,
    2 => 4,
    ....

    ) ;

    and a basic sort would be like this:

    @sorted = sort { $sort_ordering{$a} cmp $sort_ordering{$b} } @input ;

    switch cmp for <=> as desired.

    >> try using Sort::Maker for this. you just define how you want each key
    >> extracted (using a regex or substr or any code) from your record and how
    >> to sort it (numeric/string, up/down). in your case i would say use a
    >> string sort since you have padded numbers. then make sure you sort the
    >> keys in the order you want - year first, and then month. you will get a
    >> clean and fast sort without all of your extra code and not needing to
    >> use external programs.


    c> Thanks, I'll look at this.

    sort::maker will simplify key extraction and custom ordering. you would
    build a ordering hash as above but put that conversion in the key
    extract code. something like this:

    my $sorter = make_sorter( 'GRT',
    string => '$sort_ordering{get_the_key($_)}'
    ) ;

    get_the_key is either direct code or a call to (ahem!) get the key from
    $_.

    c> The central component in all these strings is the slash: /. If I write
    c> a function that takes a key as the parameter such as "2006/03" and
    c> returns a string such as "2006/1/03", then I could control exactly how
    c> the sort would work. When I was finished with the report, I could run
    c> it through another function that would substitute "/1/" with just "/"
    c> and that would solve the problem.

    that is the conversion hash i spoke of. it can be a sub too if the
    conversion needs more work or such. actually with sort::maker you would
    sort on multiple keys each having its own sort ordering hash as
    needed. then you don't need an extraction sub which is slower.

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Free Perl Training --- http://perlhunter.com/college.html ---------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, Nov 4, 2008
    #7
  8. cartercc

    Uri Guttman Guest

    >>>>> "GJ" == Glenn Jackman <> writes:

    GJ> At 2008-11-04 09:07AM, "cartercc" wrote:
    >> Here's the problem: the ordering of the dates isn't numerical, the
    >> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
    >> 03/2006, 04/2006 ... 01/2007, 02/2007
    >> 03/2007, 04/2007 ... 01/2008, 02/2008


    GJ> An example of the Schwartzian Transform (look it up in wikipedia):

    GJ> my @dates = qw(
    GJ> 03/2005 04/2005 01/2006 02/2006 03/2006 04/2006
    GJ> 01/2007 02/2007 03/2007 04/2007 01/2008 02/2008);
    GJ> # or @dates = keys %somehash

    GJ> my @sorted =
    GJ> map { $_->[0] }
    GJ> sort { $a->[1] <=> $b->[1] }
    GJ> map { [$_, substr($_,3,4) . substr($_,0,2)] } @dates;


    and sort::maker can generate an ST and spit out the code for you or you
    can use the code ref it creates. it also removes redundancy (duplicate
    key code and hides all the sort/map/sort syntax. and it can generate GRT
    sorts which are much faster than the ST.

    another trick is to pregenerate sorters for various key sets and print
    them out off line. then paste in those sorters (name the subs) and call
    the desired one based on the data set being sorted. then you don't need
    to hand code all the sort variants of key sets.

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Free Perl Training --- http://perlhunter.com/college.html ---------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, Nov 4, 2008
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Guest

    Cookie munging problem

    Guest, Aug 1, 2003, in forum: ASP .Net
    Replies:
    5
    Views:
    1,363
    Seaside
    Aug 2, 2003
  2. Replies:
    2
    Views:
    1,315
  3. Timothy Grant

    py.test munging strings in asserts?

    Timothy Grant, Apr 21, 2006, in forum: Python
    Replies:
    1
    Views:
    315
    Johnny deBris
    Apr 21, 2006
  4. Chris Allen
    Replies:
    3
    Views:
    324
    Arnau Sanchez
    Aug 29, 2007
  5. Replies:
    0
    Views:
    104
Loading...

Share This Page