document ID tracking

slash

Hi,
I am trying to write a script that will let me manipulate words
in a certain way and also keep track of the documents those words
came from. In other words, let's say my corpus consists of
these three documents with the following contents:

DocID 1.TXT
Compose your message

DocID 2.TXT
Use this form to post your message

DocID 3.TXT
Remember that it can be viewed by millions

Now, when I process all the files, I want to be able to see
that "message" is a word that appears in both DocID 1.TXT and DocID
2.TXT.

How can I do this in Perl? Is this what an inverted index is, minus the
term frequencies, etc.? I am under pressure and wanted to know whether
I could get this code, or at least the pseudocode, from somewhere.
I would certainly appreciate any help.

Thanks,
Satish
 

John W. Krahn


Something like this should work:

use strict;
use warnings;

# the files to read through the <> operator
@ARGV = ( 'DocID 1.TXT', 'DocID 2.TXT', 'DocID 3.TXT' );

my %data;

while ( <> ) {
    # store the file name and line number for each occurrence of a word
    push @{ $data{ $_ }{ $ARGV } }, $. for split;
    # reset $. for the next file
    close ARGV if eof;
}

for my $word ( sort keys %data ) {
    # look up the files recorded for this word
    for my $file ( sort keys %{ $data{ $word } } ) {
        print "$word $file @{$data{$word}{$file}}\n";
    }
}
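
If you only need to know which documents a word appears in (an inverted index without positions or term frequencies), a hash of hashes keyed by word and then by document is enough. What follows is a minimal sketch, not a definitive implementation. It assumes the three documents are held in a hash instead of being read from disk, and the names %corpus, build_index and docs_for are made up for illustration. It also lowercases words and strips punctuation, which the code above does not do.

```perl
use strict;
use warnings;

# Hypothetical in-memory corpus standing in for the three files in the question.
my %corpus = (
    'DocID 1.TXT' => 'Compose your message',
    'DocID 2.TXT' => 'Use this form to post your message',
    'DocID 3.TXT' => 'Remember that it can be viewed by millions',
);

# build_index is an illustrative helper, not part of any module.
# It maps each lowercased word to a hash whose keys are the documents containing it.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $doc ( keys %$docs ) {
        # \w+ keeps letters, digits and underscores, so punctuation is dropped
        for my $word ( lc( $docs->{$doc} ) =~ /\w+/g ) {
            $index{$word}{$doc} = 1;
        }
    }
    return \%index;
}

# docs_for returns the sorted list of documents that contain a word
# (call it in list context); unknown words give an empty list.
sub docs_for {
    my ( $index, $word ) = @_;
    return sort keys %{ $index->{ lc $word } || {} };
}

my $index = build_index( \%corpus );
print "message: ", join( ', ', docs_for( $index, 'message' ) ), "\n";
```

Using hash keys for the document names means a word that occurs several times in one document is still listed only once for it.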



John
 
