Grouping like items together

Discussion in 'Perl Misc' started by AcCeSsDeNiEd, Nov 15, 2005.

  1. AcCeSsDeNiEd

    AcCeSsDeNiEd Guest

    I have several tens of thousands of files with no directories.

    I'm trying to group the 'similar' files together and place them in a directory.

    E.g of such files:

    Mike F. 2332445-withdrawal.pdf
    Mike F. 43565654-letter.pdf
    Mike F. 434324.sign.pdf
    Dawn M. Yang letter of acceptance.pdf
    Dawn M. Yang (01).pdf
    Dawn M. Yang 4355434 SOA.pdf


    I'm trying to group these files by their names.
    The names are not in a fixed format. E.g., not all names have a middle name.
    If these names were in a list, how would I match and group them together? How would I know the group
    name?

    Thx.

    To e-mail, remove the obvious
    AcCeSsDeNiEd, Nov 15, 2005
    #1

  2. AcCeSsDeNiEd

    Guest

    AcCeSsDeNiEd wrote:
    > E.g of such files:
    >
    > Mike F. 2332445-withdrawal.pdf
    > Mike F. 43565654-letter.pdf
    > Mike F. 434324.sign.pdf
    > Dawn M. Yang letter of acceptance.pdf
    > Dawn M. Yang (01).pdf
    > Dawn M. Yang 4355434 SOA.pdf
    >
    > I'm trying to group these files by their names.
    > The names are not in a fixed format....


    This is a dreadful question (meaning it is very hard to ascertain your
    intent). The best way to get a good answer is to ask a good question.
    You have asked a very bad question, so you will only get a very bad
    answer (as I believe PG has already provided).

    But, unlike PG, I am here to help you, not berate you. First of all,
    you should read the posting guidelines for this group. They can be
    found on-line at:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

    These guidelines exist for YOUR benefit (because they show you how to
    compose effective posts which are much more likely to get effective
    responses - without getting flamed).

    Next, (at a VERY MINIMUM) you need to tell us EXACTLY what you want to
    do. So I am gonna ask you a question. The question is not simply
    rhetorical; I want you to sit down at your keyboard and actually type
    out an answer and post it here. Here is the question:

    What would you want the directory structure to look like (ie, what
    would be the names of the subdirectories - give me a complete list of
    exactly what you want the subdirectory names to be) if your filenames looked
    like this (and pay attention to the filenames - some are identical for
    quite a number of characters):

    Mike 12345.pdf
    Mike G. 2332445-withdrawal.pdf
    Mike G. 12345.pdf
    Mike G. Johnson 12345.pdf
    Mike F. Smith 12345.pdf
    Mike F. Jones 12345.pdf
    Mike F. Jones 12345 (01).pdf
    Mike F. Jones 12345 (02).pdf
    Mike F. 2332445-withdrawal.pdf
    Mike F. 434324.sign.pdf
    Mike F. 434324.everywhere_a_sign.pdf
    Mike 12345.pdf
    Mike Carlson 12345.pdf
    Mike C. 12345.pdf

    If my question is not clear, let me put it another way: If you were
    manually creating directories to organize these filenames, what
    directories would you create? I would like you to actually post the
    answer to that question so we can better understand your intent.
    , Nov 15, 2005
    #2

  3. AcCeSsDeNiEd <> wrote:

    > I'm trying to group the 'similar' files together and place them in a directory.


    One step of the solution would be to get them sorted by "name"...


    > E.g of such files:
    >
    > Mike F. 2332445-withdrawal.pdf
    > Mike F. 43565654-letter.pdf
    > Mike F. 434324.sign.pdf
    > Dawn M. Yang letter of acceptance.pdf
    > Dawn M. Yang (01).pdf
    > Dawn M. Yang 4355434 SOA.pdf



    .... so your test data should not be already sorted.


    > I'm trying to group these files by their names.



    Another part of the solution then would be to identify where
    the "names" end.


    > The names are not in a fixed format.



    Then you will need to identify every case so that you can write
    code that will handle every case.


    > E.g, not all names may have a middle name.



    But you identify only one of the cases, and provide none of that one
    case in your test data.

    Do you also have:

    Mike F. Smith 1234.pdf

    where you need it to be grouped with "Mike F."?


    You make it too hard to help you...


    > if these names were in a list, how do I match and group them together?



    You need to separate the "name" from "the rest" to start with.

    I will assume that each component of a "name" starts with an
    upper case letter, and that the first part after the name
    does NOT start with an upper case letter.

    If you had lines like the above in a file, then this seems
    to do a credible job of identifying the "name" part:

    ----------------------------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    while ( <> ) {
        # capture the leading run of capitalised words, each followed by a space
        next unless /^(([A-Z]\S+ )+)/;
        chop(my $name = $1);    # drop the trailing space
        print "'$name'\n";
    }
    ----------------------------------------
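
Once the "name" part is isolated, grouping could be done with a hash of arrays keyed on that name. The following is only a sketch under the same upper-case assumption stated above; `leading_name` and `group_by_name` are hypothetical helper names, and a filename whose first word after the name is capitalised (e.g. "SOA") would be grouped wrongly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the leading run of capitalised words,
# or undef when the filename does not start with one.
sub leading_name {
    my ($file) = @_;
    return unless $file =~ /^((?:[A-Z]\S+ )+)/;
    (my $name = $1) =~ s/ \z//;    # drop the trailing space
    return $name;
}

# Build a hash of arrays: name => [ filenames ].
# Filenames without a recognisable name are skipped.
sub group_by_name {
    my %group;
    for my $file (@_) {
        my $name = leading_name($file);
        next unless defined $name;
        push @{ $group{$name} }, $file;
    }
    return \%group;
}

# Deliberately unsorted sample input.
my $groups = group_by_name(
    'Mike F. 2332445-withdrawal.pdf',
    'Dawn M. Yang (01).pdf',
    'Mike F. 434324.sign.pdf',
    'Dawn M. Yang 4355434 SOA.pdf',
);
for my $name (sort keys %$groups) {
    print "$name: ", scalar @{ $groups->{$name} }, " file(s)\n";
}
```

The hash keys are then the candidate directory names, and each value is the list of files that would go there.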



    > How would I know the group
    > name?



    See above.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Nov 15, 2005
    #3
  4. AcCeSsDeNiEd

    AcCeSsDeNiEd Guest

    On Tue, 15 Nov 2005 07:27:16 GMT, "Purl Gurl" <>
    wrote:

    >Impossible. Your files are in a directory. Only exceptions which come
    >to mind, would be you are running an ENIAC machine, circa 1950, or
    >an old IBM 600 series machine which stores data on punch cards.


    Dude, I didn't mean that in a literal sense.
    LOL. I just left out their tiny details so the post wouldn't be too
    long.

    What I meant was that the files are not grouped properly.
    These are basically client files that are kept in each staff's
    directory. But within the staff's directory, all the client files are
    just dumped there without even being sorted into a folder with regard
    to the client name. It seems that someone thought they would rather just
    add the client name to the file name. Costly mistake which I have to
    clean up for them now.

    1 staff can even have 6k files. So scrolling down the list is getting
    rather slow.

    It pisses me off that 4 yrs ago I told them not to do this and
    now I'm still the one that has to end up doing the cleaning up.
    Sorry......


    >Is it you want to create directories based on names but your mind
    >went up in stinky Chong smoke and you forgot to mention this?


    Yes. But these names can only be extracted from the file names.
    But how would I know which files are *like* and *what* is like about
    them? So that I can create the directory and push the files over to it.

    To e-mail, remove the obvious
    AcCeSsDeNiEd, Nov 15, 2005
    #4
  5. AcCeSsDeNiEd

    AcCeSsDeNiEd Guest

    On Tue, 15 Nov 2005 08:48:14 -0600, Tad McClellan <> wrote:

    Well, I've just given up doing this programmatically.
    I've taken a closer look at the naming conventions.
    One method I thought of was to split the name from the numbers.
    But I've come across files that do not have numbers just after the client's name.
    The client name always comes on the left of the filename, but the rest of the filename is just too
    'gibberish'.

    Not gonna happen. At least not until computers are capable of AI.

    Thx for the help anyways.

    My coy will just have to hire temp staff to clean up this mess.

    Btw, we have 400k files.
    So good luck on the manual process.

    To e-mail, remove the obvious
    AcCeSsDeNiEd, Nov 16, 2005
    #5
  6. AcCeSsDeNiEd <> wrote in
    news::

    > Well, I've just given up doing this programmatically.


    Please quote some context when you reply.

    > I've taken a closer look at the naming conventions.
    > One method I thought off was to split the name from the numbers.
    > But I've come across files that do not have numbers just after the
    > client's name. The client name always comes on the left of the
    > filename, but the rest of the filename is just too 'gibberish'.
    >
    > Not gonna happen. At least not until computers are capable of AI.


    As a first stab, grouping files on the basis of closeness of their
    names may reduce the amount of work needed.

    See if

    http://search.cpan.org/~jgoldberg/Text-LevenshteinXS-0.03/LevenshteinXS.pm

    helps. I could see myself using something like this to first
    distribute files into sub-directories. Then the manual work of checking
    for incorrectly identified files ought to be less.
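
A rough sketch of that idea follows. It uses a plain-Perl Levenshtein function as a stand-in so that it runs without installing anything; the module linked above could replace it. The greedy grouping strategy, the prefix length and the distance threshold are all assumptions chosen for illustration, not something taken from the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Plain-Perl Levenshtein distance (two-row dynamic programming).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Greedy grouping: a file joins the first group whose key (the lower-cased
# filename prefix of the group's first member) is within $max_dist edits;
# otherwise it starts a new group.
sub group_by_closeness {
    my ($prefix_len, $max_dist, @files) = @_;
    my @groups;    # each element: { key => prefix, files => [...] }
    FILE: for my $file (@files) {
        my $key = lc(substr($file, 0, $prefix_len));
        for my $g (@groups) {
            if (edit_distance($key, $g->{key}) <= $max_dist) {
                push @{ $g->{files} }, $file;
                next FILE;
            }
        }
        push @groups, { key => $key, files => [$file] };
    }
    return @groups;
}

my @groups = group_by_closeness(8, 2,
    'MIKE F. 2332445-withdrawal.pdf',
    'DAWN M. YANG (01).pdf',
    'MIKE F 434324.sign.pdf',
    'DAWN M YANG letter.pdf',
);
print "$_->{key}: @{[ scalar @{ $_->{files} } ]} file(s)\n" for @groups;
```

Comparing every file against every group gets slow with hundreds of thousands of files, so in practice this would probably be run per staff directory, with the results checked by hand afterwards.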

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Nov 16, 2005
    #6
  7. AcCeSsDeNiEd wrote:
    > I have several 10s of thousands files with no directories.
    >
    > I'm trying to group the 'similar' files together and place them in a directory.
    >
    > E.g of such files:
    >
    > Mike F. 2332445-withdrawal.pdf
    > Mike F. 43565654-letter.pdf
    > Mike F. 434324.sign.pdf
    > Dawn M. Yang letter of acceptance.pdf
    > Dawn M. Yang (01).pdf
    > Dawn M. Yang 4355434 SOA.pdf
    >
    >
    > I'm trying to group these files by their names.
    > The names are not in a fixed format. E.g, not all names may have a middle name.
    > if these names were in a list, how do I match and group them together? How would I know the group
    > name?

    -----------
    AcCeSsDeNiEd wrote:
    > On Tue, 15 Nov 2005 08:48:14 -0600, Tad McClellan <> wrote:
    >
    > Well, I've just given up doing this programmatically.
    > I've taken a closer look at the naming conventions.
    > One method I thought off was to split the name from the numbers.
    > But I've come across files that do not have numbers just after the client's name.
    > The client name always comes on the left of the filename, but the rest of the filename is just too
    > 'gibberish'.
    >
    > Not gonna happen. At least not until computers are capable of AI.
    >
    > Thx for the help anyways.
    >
    > My coy will just have to hire temp staff to clean up this mess.
    >
    > Btw, we have 400k files.
    > So good luck on the manual process.


    How about starting with a list of Users (storing Name, TargetDir, and
    a (growing) list of alias names given as RegExps). First move the 'clear
    cases' to the TargetDirs, then view the remaining files to improve the
    alias names.
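
A minimal sketch of such a table, assuming made-up users, directory names and alias patterns (none of these come from the real data). It only builds a plan of what would go where and lists the leftovers for review; nothing is moved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical user table: each entry has a name, a target directory and a
# list of alias patterns that grows as the leftover files are reviewed.
my @users = (
    { name => 'Mike F.',      dir => 'Mike_F',    aliases => [ qr/^Mike\s+F\b/i ] },
    { name => 'Dawn M. Yang', dir => 'Dawn_Yang', aliases => [ qr/^Dawn\s+(?:M\.?\s+)?Yang\b/i ] },
);

# Return the target directory for a filename, or undef if no alias matches.
sub target_dir {
    my ($file) = @_;
    for my $user (@users) {
        for my $re (@{ $user->{aliases} }) {
            return $user->{dir} if $file =~ $re;
        }
    }
    return;
}

my (%plan, @unmatched);
for my $file ('Mike F. 434324.sign.pdf', 'DAWN YANG (01).pdf', 'Mike 12345.pdf') {
    my $dir = target_dir($file);
    if (defined $dir) { push @{ $plan{$dir} }, $file }
    else              { push @unmatched, $file }
}
print "$_ <- @{ $plan{$_} }\n" for sort keys %plan;
print "review by hand: $_\n"   for @unmatched;
```

Each pass over the unmatched list suggests new alias patterns to add, so the amount left for manual review should shrink with every round.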
    ekkehard.horner, Nov 16, 2005
    #7
  8. AcCeSsDeNiEd

    Anno Siegel Guest

    AcCeSsDeNiEd <> wrote in comp.lang.perl.misc:
    > On Tue, 15 Nov 2005 08:48:14 -0600, Tad McClellan <> wrote:
    >
    > Well, I've just given up doing this programmatically.
    > I've taken a closer look at the naming conventions.
    > One method I thought off was to split the name from the numbers.
    > But I've come across files that do not have numbers just after the
    > client's name.
    > The client name always comes on the left of the filename, but the rest
    > of the filename is just too
    > 'gibberish'.
    >
    > Not gonna happen. At least not until computers are capable of AI.
    >
    > Thx for the help anyways.
    >
    > My coy will just have to hire temp staff to clean up this mess.
    >
    > Btw, we have 400k files.
    > So good luck on the manual process.


    A list of valid names (even a modest one) could help sorting out the
    clear cases. If everything in the formats "first middle last",
    "first last" and "first middle" with verified "first" and "last"
    (and a middle something like /[[:upper:]]\./) were accepted automatically,
    that could reduce the amount of manual processing considerably. I am
    appending a sketch of how this could work.

    Name lists are available from the US Census Bureau, typical file names
    are dist.all.last, dist.female.first, and dist.male.first.

    Anno

    #!/usr/bin/perl
    use strict; use warnings; $| = 1;

    # Load the census name lists into lookup hashes (looked up via uc below).
    my ( %first, %last);
    my $namedir = "$ENV{ HOME}/dict/us-census-names";
    my $in;
    open $in, $_ or die "Can't read $_: $!" for "$namedir/dist.female.first";
    @first{ map /(\S+)/, <$in> } = ();
    open $in, $_ or die "Can't read $_: $!" for "$namedir/dist.male.first";
    @first{ map /(\S+)/, <$in> } = ();
    open $in, $_ or die "Can't read $_: $!" for "$namedir/dist.all.last";
    @last{ map /(\S+)/, <$in> } = ();

    my ( @accepted, @rejected);
    while ( <DATA> ) {
        chomp;
        my ( $first, $middle, $last) = split;
        # reject anything whose first word is not a known first name
        unless ( exists $first{ uc $first} ) {
            push @rejected, "$first $middle $last";
            next;
        }
        if ( $middle =~ /[[:upper:]]\./ ) {
            # "first middle last" or just "first middle"
            if ( exists $last{ uc $last} ) {
                push @accepted, "$first $middle $last";
            }
            else {
                push @accepted, "$first $middle";
            }
        }
        else {
            # no middle initial: the second word must be a known last name
            $last = $middle;
            if ( exists $last{ uc $last} ) {
                push @accepted, "$first $last";
            }
            else {
                push @rejected, "$first $last";
            }
        }
    }
    print "accepted:\n";
    print "$_\n" for @accepted;
    print "\nrejected:\n";
    print "$_\n" for @rejected;

    __DATA__
    Mike 12345.pdf
    Mike G. 2332445-withdrawal.pdf
    Mike G. 12345.pdf
    Mike G. Johnson 12345.pdf
    Mike F. Smith 12345.pdf
    Mike F. Jones 12345.pdf
    Mike F. Jones 12345 (01).pdf
    Mike F. Jones 12345 (02).pdf
    Mike F. 2332445-withdrawal.pdf
    Mike F. 434324.sign.pdf
    Mike F. 434324.everywhere_a_sign.pdf
    Mike 12345.pdf
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
    Anno Siegel, Nov 16, 2005
    #8
  9. AcCeSsDeNiEd

    Dr.Ruud Guest

    AcCeSsDeNiEd:

    > I'm trying to group the 'similar' files together and place them in a
    > directory.
    >
    > E.g of such files:
    >
    > Mike F. 2332445-withdrawal.pdf
    > Mike F. 43565654-letter.pdf
    > Mike F. 434324.sign.pdf
    > Dawn M. Yang letter of acceptance.pdf
    > Dawn M. Yang (01).pdf
    > Dawn M. Yang 4355434 SOA.pdf
    >
    >
    > I'm trying to group these files by their names.
    > The names are not in a fixed format. E.g, not all names may have a
    > middle name.
    > if these names were in a list, how do I match and group them
    > together? How would I know the group name?


    #!/usr/bin/perl
    use strict; use warnings;

    { local ($,,$\) = ("\t", "\n");

      for (<>) {

        chomp;

        /^(                          # start a capturing group
            [[:upper:]]              # a Word should start with a capital
            [[:lower:][:punct:]]+    # followed by 1 or more specific chars
            (?:                      # start a non-capturing group
              \s+                    # 1 or more wsp chars
              [[:upper:]]            # followed by another Word
              [[:lower:][:punct:]]+
            )*                       # 0 or more trailing Words
        )/x                          # end of capturing group
            or next;                 # skip lines that do not start with a name

        print "[$1]", $_;
      }
    }

    $ names.pl < names.inp
    [Mike F.] Mike F. 2332445-withdrawal.pdf
    [Mike F.] Mike F. 43565654-letter.pdf
    [Mike F.] Mike F. 434324.sign.pdf
    [Dawn M. Yang] Dawn M. Yang letter of acceptance.pdf
    [Dawn M. Yang] Dawn M. Yang (01).pdf
    [Dawn M. Yang] Dawn M. Yang 4355434 SOA.pdf


    You can use a hash to convert from name to group, with entries like:

    "Mike F." => "Mike_Forster"
    "Dawn M. Yang" => "Dawn_Yang"

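A possible way to put such a hash to use, sketched as a dry run that only computes destination paths. The table entries repeat the two examples above; the sub name `destination` and the sample filenames are assumptions for illustration. An actual move could be done with `move` from File::Copy once the printed plan looks right.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Name-to-group table, using the two example entries.
my %group_for = (
    'Mike F.'      => 'Mike_Forster',
    'Dawn M. Yang' => 'Dawn_Yang',
);

# Return the destination path for a file, or undef when no name can be
# extracted or the extracted name has no entry in the table.
sub destination {
    my ($file) = @_;
    return unless $file =~ /^(
        [[:upper:]] [[:lower:][:punct:]]+
        (?: \s+ [[:upper:]] [[:lower:][:punct:]]+ )*
    )/x;
    my $group = $group_for{$1};
    return unless defined $group;
    return File::Spec->catfile($group, $file);
}

for my $file ('Mike F. 434324.sign.pdf', 'Dawn M. Yang (01).pdf', 'Bob 1.pdf') {
    my $dest = destination($file);
    print defined($dest) ? "$file -> $dest\n" : "no group for $file\n";
}
```

Names that come back without a group are the ones that still need an entry in the hash.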
    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Nov 17, 2005
    #9
  10. AcCeSsDeNiEd

    AcCeSsDeNiEd Guest

    On 16 Nov 2005 14:35:17 GMT, -berlin.de (Anno Siegel) wrote:

    >Name lists are available from the US Census Bureau, typical file names
    >are dist.all.last, dist.female.first, and dist.male.first.


    Thx for the help. But more than half the names are not English.
    And the whole filename is in caps.
    Sigh...


    To e-mail, remove the obvious
    AcCeSsDeNiEd, Nov 17, 2005
    #10
