document ID tracking

Discussion in 'Perl Misc' started by slash, Jul 24, 2003.

  1. slash

    slash Guest

    Hi,
    I am trying to write a script that will allow me to manipulate words
    in a certain way and also keep track of the documents from which those
    words came. In other words, let's say my corpus consisted of
    these three documents with the following contents.

    DocID 1.TXT
    Compose your message

    DocID 2.TXT
    Use this form to post your message

    DocID 3.TXT
    Remember that it can be viewed by millions

    Now, when I do my processing for all files, I want to be able to see
    that "message" is a word that appears in both DocID 1.TXT and DocID
    2.TXT

    How can I do this in Perl? Is this what an inverted index is minus the
    term frequencies, etc.? I am under pressure and wanted to know if
    there was any way I could perhaps get this code from somewhere else or
    perhaps the pseudocode.
    I would certainly appreciate any help.

    Thanks,
    Satish
    slash, Jul 24, 2003
    #1

  2. slash wrote:
    >
    > I am trying to write a script that will allow me to manipulate words
    > in a certain way and also keep track of the documents from which those
    > words came. In other words, let's say my corpus consisted of
    > these three documents with the following contents.
    >
    > DocID 1.TXT
    > Compose your message
    >
    > DocID 2.TXT
    > Use this form to post your message
    >
    > DocID 3.TXT
    > Remember that it can be viewed by millions
    >
    > Now, when I do my processing for all files, I want to be able to see
    > that "message" is a word that appears in both DocID 1.TXT and DocID
    > 2.TXT
    >
    > How can I do this in Perl? Is this what an inverted index is minus the
    > term frequencies, etc.? I am under pressure and wanted to know if
    > there was any way I could perhaps get this code from somewhere else or
    > perhaps the pseudocode.
    > I would certainly appreciate any help.


    Something like this should work:

    @ARGV = ( 'DocID 1.TXT', 'DocID 2.TXT', 'DocID 3.TXT' );

    my %data;

    while ( <> ) {
        # store the file name and line numbers for each occurrence of a word
        push @{ $data{ $_ }{ $ARGV } }, $. for split;
        # reset $. for the next file
        close ARGV if eof;
    }

    # print each word with every file and line number where it occurs
    for my $word ( sort keys %data ) {
        for my $file ( sort keys %{ $data{ $word } } ) {
            print "$word $file @{$data{$word}{$file}}\n";
        }
    }
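
If only the word-to-document mapping is needed (an inverted index without positions or term frequencies), a minimal sketch along these lines may be enough. The in-memory %corpus, the build_index and docs_for helpers, and the lower-casing step are assumptions made for illustration; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical in-memory corpus standing in for the three files in the question.
my %corpus = (
    'DocID 1.TXT' => 'Compose your message',
    'DocID 2.TXT' => 'Use this form to post your message',
    'DocID 3.TXT' => 'Remember that it can be viewed by millions',
);

# Build word => { doc_id => 1 }; the inner hash acts as a set of documents.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $doc ( keys %$docs ) {
        # Lower-casing is an assumption so that "Message" and "message" match.
        $index{ lc $_ }{$doc} = 1 for split ' ', $docs->{$doc};
    }
    return \%index;
}

# Return the sorted list of documents that contain the given word.
sub docs_for {
    my ( $index, $word ) = @_;
    return sort keys %{ $index->{ lc $word } || {} };
}

my $index = build_index( \%corpus );
print "message: ", join( ', ', docs_for( $index, 'message' ) ), "\n";
# prints "message: DocID 1.TXT, DocID 2.TXT"
```

Storing the document IDs as hash keys means a word that occurs several times in one file is still listed only once for that file.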



    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Jul 24, 2003
    #2