document ID tracking

slash

Hi,
I am trying to write a script that will allow me to manipulate words
in a certain way and also keep track of the documents those words
came from. In other words, let's say my corpus consisted of
these three documents with the following contents.

DocID 1.TXT
Compose your message

DocID 2.TXT
Use this form to post your message

DocID 3.TXT
Remember that it can be viewed by millions

Now, when I do my processing for all files, I want to be able to see
that "message" is a word that appears in both DocID 1.TXT and DocID
2.TXT

How can I do this in Perl? Is this what an inverted index is minus the
term frequencies, etc.? I am under pressure and wanted to know if
there was any way I could perhaps get this code from somewhere else or
perhaps the pseudocode.
I would certainly appreciate any help.

Thanks,
Slash
 
A. Sinan Unur

(e-mail address removed) (slash) wrote in @posting.google.com:
> Hi,
> I am trying to write a script that will allow me to manipulate words
> in a certain way and also keep track of the documents those words
> came from. In other words, let's say my corpus consisted of
> these three documents with the following contents.
>
> DocID 1.TXT
> Compose your message
>
> DocID 2.TXT
> Use this form to post your message
>
> DocID 3.TXT
> Remember that it can be viewed by millions
>
> Now, when I do my processing for all files, I want to be able to see
> that "message" is a word that appears in both DocID 1.TXT and DocID
> 2.TXT

I am sure there is a better way to do this, but you can use a hash keyed
on the words. My quick hack is below. (BTW, I do hope this is not
homework).

# cw: Common Word
# Script to list, for each word, the files it appears in.
# Files are passed on the command line.

use diagnostics;
use strict;
use warnings;

die "$0: file1 ... fileN\n" unless scalar @ARGV;

my %word_to_files;

while (<ARGV>) {
    chomp;
    my @words = split ' ';    # split on whitespace, ignoring leading blanks
    foreach my $word (@words) {
        if (exists $word_to_files{$word}) {
            # compare file names exactly; /$ARGV/ would treat the name as a regex
            unless (grep { $_ eq $ARGV } @{$word_to_files{$word}}) {
                push @{$word_to_files{$word}}, $ARGV;
            }
        } else {
            $word_to_files{$word} = [$ARGV];
        }
    }
}

foreach (sort keys %word_to_files) {
    print "$_: @{$word_to_files{$_}}\n";
}

__END__

C:\develop\perl\misc>cat file?.txt
Compose your message

Use this form to post your message

Remember that it can be viewed by millions

C:\develop\perl\misc>cw.pl file1.txt file2.txt file3.txt
Compose: file1.txt
Remember: file3.txt
Use: file2.txt
be: file3.txt
by: file3.txt
can: file3.txt
form: file2.txt
it: file3.txt
message: file1.txt file2.txt
millions: file3.txt
post: file2.txt
that: file3.txt
this: file2.txt
to: file2.txt
viewed: file3.txt
your: file1.txt file2.txt
 
Steve in NY

This doesn't count frequencies; it only checks whether the word exists in each
file. To count frequencies, I would suggest first splitting each line on
whitespace and then using a hash with the word as the key and the number of
times it appears as the value.

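#!/usr/bin/perl
use strict;
use warnings;

my $word = "message";

my @files = qw(DocID_1.TXT
               DocID_2.TXT
               DocID_3.TXT);

for my $file (@files) {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    while (<$fh>) {
        # \b...\b matches the whole word only; \Q...\E quotes any metacharacters
        if (/\b\Q$word\E\b/) {
            print "$file contains the word $word.\n";
            last;    # report each file once
        }
    }
    close $fh;
}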
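
The frequency count suggested above could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: count_words is a hypothetical helper, the three documents are held in memory instead of being read from the DocID files, and words are lowercased before counting.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each word occurs in each document.
# $docs maps a document name to its text; the result maps
# word => { document name => count }.
sub count_words {
    my ($docs) = @_;
    my %count;
    for my $doc (sort keys %$docs) {
        # split ' ' splits on runs of whitespace and ignores leading blanks
        for my $word (split ' ', lc $docs->{$doc}) {
            $count{$word}{$doc}++;
        }
    }
    return \%count;
}

# In-memory stand-ins for the sample documents from the question (assumption).
my %docs = (
    'DocID_1.TXT' => 'Compose your message',
    'DocID_2.TXT' => 'Use this form to post your message',
    'DocID_3.TXT' => 'Remember that it can be viewed by millions',
);

my $count = count_words(\%docs);
for my $word (sort keys %$count) {
    my $by_doc = $count->{$word};
    print "$word: ",
        join(', ', map { "$_ ($by_doc->{$_})" } sort keys %$by_doc), "\n";
}
```

The nested hash is what makes this an inverted index with term frequencies: the outer key is the word, the inner key is the document, and the value is the count.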
 
Eric J. Roode

> if(exists $word_to_files{$word}) {
>     unless(grep /$ARGV/, @{$word_to_files{$word}}) {
>         push @{$word_to_files{$word}}, ($ARGV);
>     }
> } else {
>     $word_to_files{$word} = [$ARGV];
> }

Why use an array as the second-level data structure -- why not a hash?

$word_to_files{$word}{$ARGV} = 1;

--
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

 
A. Sinan Unur

>> if(exists $word_to_files{$word}) {
>>     unless(grep /$ARGV/, @{$word_to_files{$word}}) {
>>         push @{$word_to_files{$word}}, ($ARGV);
>>     }
>> } else {
>>     $word_to_files{$word} = [$ARGV];
>> }
>
> Why use an array as the second-level data structure -- why not a hash?
>
> $word_to_files{$word}{$ARGV} = 1;

Muddled thinking I guess. And I do remember making a mental note of this
when you pointed out the same thing in another thread, but it looks like
I regarded that as just another deadline reminder :)

Is this better?

# cw: Common Word
# Script to list, for each word, the files it appears in.
# Files are passed on the command line.

use diagnostics;
use strict;
use warnings;

die "$0: file1 ... fileN\n" unless scalar @ARGV;

my %word_to_files;

while (<ARGV>) {
    chomp;
    my @words = split ' ';
    foreach (@words) {
        $word_to_files{$_}{$ARGV} = 1;    # hash keys keep file names unique
    }
}

foreach (sort keys %word_to_files) {
    # sort the file names, since keys returns them in no particular order
    print "$_: ", join(" ", sort keys %{$word_to_files{$_}}), "\n";
}

__END__
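
To get back to the original question (seeing that "message" appears in more than one document), the word-to-files hash can be filtered by how many files each word has. A minimal sketch, assuming a structure shaped like %word_to_files above; shared_words is a hypothetical helper and the small index is typed in by hand rather than built from files.

```perl
use strict;
use warnings;

# Hypothetical helper: given word => { file => 1 }, return only the
# words seen in at least $min files, each with a sorted list of files.
sub shared_words {
    my ($word_to_files, $min) = @_;
    my %shared;
    for my $word (keys %$word_to_files) {
        my @files = sort keys %{ $word_to_files->{$word} };
        $shared{$word} = \@files if @files >= $min;
    }
    return \%shared;
}

# Hand-written index standing in for %word_to_files (assumption).
my %word_to_files = (
    message => { 'file1.txt' => 1, 'file2.txt' => 1 },
    your    => { 'file1.txt' => 1, 'file2.txt' => 1 },
    Compose => { 'file1.txt' => 1 },
    viewed  => { 'file3.txt' => 1 },
);

my $shared = shared_words(\%word_to_files, 2);
print "$_: @{ $shared->{$_} }\n" for sort keys %$shared;
```

With this sample index only "message" and "your" are printed, each followed by file1.txt and file2.txt.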
 
A. Sinan Unur

(e-mail address removed) (slash) wrote in

> Thanks so much for all the helpful responses. (Sinan, this is not a HW
> problem! :) I didn't get any helpful responses in another related
> posting so I am adding this as a followup here hoping that it will
> get reviewed.

You'll need to post something that can be run locally (make sure some
sample input data are included).

> ...
> undef $/;

Are you sure you want to do this here?

> my @words = split /\W+/, <> ;
> my $line_number = 2;
> my $n;
> my $line_num = 2;

What is the difference between $line_number and $line_num and what purpose
do they serve?

> my $n_cols = 5;
> my $col = { align => 'left'}; # no title, left alignment
> my $tb = Text::Table->new( ( $col) x $n_cols);
> my @stack = ( '*' ) x $n_cols;
> foreach $word ( @words ) {
> shift @stack;
> push @stack, $word;
> $tb->add(@stack);
> }

What on earth is going on in here?

> my @lines = $tb->add("$stack[-4]", "$stack[-3]", "$stack[-2]",
> "$stack[-1]", "*");
> my @lines = $tb->add("$stack[-3]", "$stack[-2]", "$stack[-1]","*",
> "*");
> my @lines = $tb->table($line_number, $n);

Why do you keep redeclaring and redefining @lines before you do anything
with it?

> #print @lines;
> my $t1 = $tb->select(2, {is_sep => 2, body => " "}, 1,0,
> {is_sep => 2, body => "\n"},
> 2, {is_sep =>2, body => " "}, 3,4);
> #foreach $textID (@textID) {
> #$t1 = $t1->add($ARGV); }#adds one data line at the end of ngrams not
> a col

I do not understand this comment. Is it supposed to do something else? Did
you read the docs for Text::Table?

    add()
    adds a data line to the table, returns the table.

> my $input = $t1->table($line_num, $n);
> print $input; ...
>
> To recap, I don't know if I really need an inverted index. Perhaps an
> array of arrays might help instead of the table module. Where I can
> have @lines and $ARGV. Would that work? In other words, an array
> consisting of the following:
> (first line of ngram, $ARGV)
> (Second line of ngram, $ARGV)
> .
> .
> .
> (Last line of ngram, $ARGV)
> And perhaps I could put this into a table and do the select statements
> over them to display the desired output. Is this possible or am I just
> dreaming?

No, you are just rambling. The way this works is, you post a specific
problem, and people try to help you solve it. We cannot figure out your
requirements for you because we do not have the information you have
regarding the overall picture.

So, I do not know why you decided the previous solutions we posted to the
problem of associating each word with the file(s) it came from were
inadequate. Before people can help you, you have to clearly communicate
what problem you are trying to solve.

> Any suggestions on how to achieve this would be very much appreciated.

I do not know what you mean by "this". But, would the following help?

# fubar.pl

use strict;
use warnings;

use Text::Table;

my $cols  = 5;
my $col   = { 'align' => 'left' };
my $table = Text::Table->new(($col) x $cols);

{
    local $/;    # slurp mode: each read of <ARGV> returns one whole file
    while (defined(my $text = <ARGV>)) {
        my @words = grep { length } split /\W+/, $text;
        while (@words) {
            # take up to four words, pad short rows, then append the file name
            my @row = splice(@words, 0, $cols - 1);
            push @row, ('') x ($cols - 1 - @row);
            push @row, $ARGV;
            $table->add(@row);
        }
    }
}

print $table->body();
__END__

C:\Home> cat file1
file1a file1b
file1c
file1d
file1e file1f
file1g

C:\Home>cat file2
file2a
file2b
file2c
file2d
file2e
file2f

C:\Home>fubar.pl file1 file2
file1a file1b file1c file1d file1
file1e file1f file1g        file1
file2a file2b file2c file2d file2
file2e file2f               file2
 
slash

Thanks for the help again and sorry for the confusion. I really
appreciate the time people are taking out of their busy schedules to
help me out here. I just don't get references well enough to be able
to figure out how I can get what I want.

What I was trying to do (so ineffectively) was to build 5-grams for
all the words first and then put the filenames next to them so that I
can select specific columns for display.
I will try to explain this in more detail:

My input is essentially a bunch of text files that I am currently
passing in to the script as: perl -n script.pl ./*.TXT

fox.txt
quick brown fox jumped over lazy dog tripped over resting fox

My badly written program was trying to achieve three things:
first, produce complete 5-grams;
second, select specific columns from those 5-grams;
third, track which document each row came from, which I just can't
seem to get working.

First part: 5grams
=================
. . . . quick
. . . quick brown
. . quick brown fox
. quick brown fox jumped
quick brown fox jumped over
brown fox jumped over lazy
fox jumped over lazy dog
jumped over lazy dog tripped
over lazy dog tripped over
lazy dog tripped over resting
dog tripped over resting fox
tripped over resting fox
over resting fox


2nd part: select columns
=====================
quick brown fox
brown quick
brown fox jumped
fox brown quick
fox jumped over
fox resting over
jumped fox brown
jumped over lazy
over jumped fox
over lazy dog
lazy over jumped
lazy dog tripped
dog lazy over
dog tripped over
tripped dog lazy
tripped over resting
over tripped dog
over resting fox
resting tripped over
resting fox

Third part: filenames
===================
quick brown fox fox.txt
brown quick fox.txt
brown fox jumped fox.txt
fox brown quick fox.txt
fox jumped over fox.txt
fox resting over fox.txt
jumped fox brown fox.txt
jumped over lazy fox.txt
over jumped fox fox.txt
over lazy dog fox.txt
lazy over jumped fox.txt
lazy dog tripped fox.txt
dog lazy over fox.txt
dog tripped over fox.txt
tripped dog lazy fox.txt
tripped over resting fox.txt
over tripped dog fox.txt
over resting fox fox.txt
resting tripped over fox.txt
resting fox fox.txt

Any help you could give me to move this forward would be greatly
appreciated.

many thanks,
slash
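
One possible way to get rows like those in the third part, sketched under assumptions: instead of building a 5-gram table and selecting columns from it, this version works directly from word positions, which is a different technique from the Text::Table approach tried in the thread. context_rows is a hypothetical helper, the sample text is held in a string instead of being read from fox.txt, and rows come out in word-position order, which may differ from the ordering in the listing above.

```perl
use strict;
use warnings;

# Hypothetical helper: for each word, emit one row with the word followed
# by up to $width - 1 words to its left (nearest first), and one row with
# the word followed by up to $width - 1 words to its right.
# Every row ends with the document name.
sub context_rows {
    my ($doc, $text, $width) = @_;
    my @words = split ' ', $text;
    my @rows;
    for my $i (0 .. $#words) {
        my $lo = $i - ($width - 1);
        $lo = 0 if $lo < 0;
        my $hi = $i + ($width - 1);
        $hi = $#words if $hi > $#words;

        # left context, read outward from the word (skipped for the first word)
        push @rows, [ reverse(@words[$lo .. $i]), $doc ] if $i > 0;
        # right context (skipped for the last word)
        push @rows, [ @words[$i .. $hi], $doc ] if $i < $#words;
    }
    return @rows;
}

# Sample text from fox.txt, held in memory (assumption).
my $text = 'quick brown fox jumped over lazy dog tripped over resting fox';
print "@$_\n" for context_rows('fox.txt', $text, 3);
```

Each row is an array reference, so the result is the "array of arrays" idea with the file name as the last element; to process several files, the same helper could be called once per file with $ARGV as the document name.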


