document ID tracking

Discussion in 'Perl Misc' started by slash, Jul 24, 2003.

  1. slash

    slash Guest

    Hi,
    I am trying to write a script that will allow me to manipulate words
    in a certain way and also keep track of the documents from which those
    words came. In other words, let's say my corpus consisted of
    these three documents with the following contents.

    DocID 1.TXT
    Compose your message

    DocID 2.TXT
    Use this form to post your message

    DocID 3.TXT
    Remember that it can be viewed by millions

    Now, when I do my processing for all files, I want to be able to see
    that "message" is a word that appears in both DocID 1.TXT and DocID
    2.TXT

    How can I do this in Perl? Is this what an inverted index is minus the
    term frequencies, etc.? I am under pressure and wanted to know if
    there was any way I could perhaps get this code from somewhere else or
    perhaps the pseudocode.
    I would certainly appreciate any help.

    Thanks,
    Slash
     
    slash, Jul 24, 2003
    #1

  2. (slash) wrote in news:30fe9f1e.0307240405.7908ae70@posting.google.com:

    > Hi,
    > I am trying to write a script that will allow me to manipulate words
    > in a certain way and also keep track of the documents from which those
    > words came. In other words, let's say my corpus consisted of
    > these three documents with the following contents.
    >
    > DocID 1.TXT
    > Compose your message
    >
    > DocID 2.TXT
    > Use this form to post your message
    >
    > DocID 3.TXT
    > Remember that it can be viewed by millions
    >
    > Now, when I do my processing for all files, I want to be able to see
    > that "message" is a word that appears in both DocID 1.TXT and DocID
    > 2.TXT


    I am sure there is a better way to do this, but you can use a hash keyed
    on the words. My quick hack is below. (BTW, I do hope this is not
    homework).

    # cw: Common Word
    # Script to list every word in the files passed on the
    # command line, together with the files it appears in

    use diagnostics;
    use strict;
    use warnings;

    die "$0: file1 ... fileN\n" unless scalar @ARGV;

    my %word_to_files;    # word => reference to an array of file names

    while (<ARGV>) {
        chomp;
        my @words = split ' ';    # split on whitespace, skipping leading blanks
        foreach my $word (@words) {
            if (exists $word_to_files{$word}) {
                # Compare file names with eq; using $ARGV as a regex would
                # treat "." as a wildcard and also match substrings.
                unless (grep { $_ eq $ARGV } @{ $word_to_files{$word} }) {
                    push @{ $word_to_files{$word} }, $ARGV;
                }
            } else {
                $word_to_files{$word} = [$ARGV];
            }
        }
    }

    foreach (sort keys %word_to_files) {
        print "$_: @{$word_to_files{$_}}\n";
    }

    __END__

    C:\develop\perl\misc>cat file?.txt
    Compose your message

    Use this form to post your message

    Remember that it can be viewed by millions

    C:\develop\perl\misc>cw.pl file1.txt file2.txt file3.txt
    Compose: file1.txt
    Remember: file3.txt
    Use: file2.txt
    be: file3.txt
    by: file3.txt
    can: file3.txt
    form: file2.txt
    it: file3.txt
    message: file1.txt file2.txt
    millions: file3.txt
    post: file2.txt
    that: file3.txt
    this: file2.txt
    to: file2.txt
    viewed: file3.txt
    your: file1.txt file2.txt

    --
    A. Sinan Unur

    Remove dashes for address
    Spam bait: mailto:
     
    A. Sinan Unur, Jul 24, 2003
    #2

  3. Steve in NY

    Steve in NY Guest

    On 24 Jul 2003 12:53:47 -0700, (slash) wrote:

    >Hi,
    >I am trying to write a script that will allow me to manipulate words
    >in a certain way and also keep track of the documents from which those
    >words came. In other words, let's say my corpus consisted of
    >these three documents with the following contents.
    >
    >DocID 1.TXT
    >Compose your message
    >
    >DocID 2.TXT
    >Use this form to post your message
    >
    >DocID 3.TXT
    >Remember that it can be viewed by millions
    >
    >Now, when I do my processing for all files, I want to be able to see
    >that "message" is a word that appears in both DocID 1.TXT and DocID
    >2.TXT
    >
    >How can I do this in Perl? Is this what an inverted index is minus the
    >term frequencies, etc.? I am under pressure and wanted to know if
    >there was any way I could perhaps get this code from somewhere else or
    >perhaps the pseudocode.
    >I would certainly appreciate any help.
    >
    >Thanks,
    >Slash


    The script below doesn't count frequencies; it only reports whether the
    word occurs in each file. To count frequencies, I would suggest first
    splitting each line on whitespace, and then using a hash with the word as
    the key and the number of times it appears as the value.

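    #!/usr/bin/perl
    use strict;
    use warnings;

    my $word = "message";

    my @files = qw(DocID_1.TXT
                   DocID_2.TXT
                   DocID_3.TXT);

    for my $file (@files) {
        # Lexical file handle, three-argument open, and an error check
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (<$fh>) {
            # \b ... \b matches whole words only; \Q...\E escapes metacharacters
            if (/\b\Q$word\E\b/) {
                print "$file contains the word $word.\n";
                last;    # report each file only once
            }
        }
        close $fh;
    }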
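
    A minimal sketch of the frequency-counting idea described above, assuming
    the documents are held in memory as strings rather than read from disk.
    The helper name count_words and the file names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of file name => text pairs and returns
# a hash of word => { file name => number of occurrences }.
sub count_words {
    my (%docs) = @_;
    my %count;
    for my $file (keys %docs) {
        # Split on whitespace and bump the counter for each word seen
        $count{$_}{$file}++ for split ' ', $docs{$file};
    }
    return %count;
}

my %count = count_words(
    '1.TXT' => 'Compose your message',
    '2.TXT' => 'Use this form to post your message',
    '3.TXT' => 'Remember that it can be viewed by millions',
);

# One line per word: the files it occurs in, with the count in parentheses
for my $word (sort keys %count) {
    my $files = $count{$word};
    print "$word: ",
        join(', ', map { "$_ ($files->{$_})" } sort keys %$files), "\n";
}
```

    Keeping the count per file, rather than one total per word, also records
    which documents each word came from.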
     
    Steve in NY, Jul 25, 2003
    #3
  4. "A. Sinan Unur" <> wrote in
    news:Xns93C2A80F838A4asu1cornelledu@132.236.56.8:

    > if (exists $word_to_files{$word}) {
    >     unless (grep { $_ eq $ARGV } @{ $word_to_files{$word} }) {
    >         push @{ $word_to_files{$word} }, $ARGV;
    >     }
    > } else {
    >     $word_to_files{$word} = [$ARGV];
    > }


    Why use an array as the second-level data structure -- why not a hash?

    $word_to_files{$word}{$ARGV} = 1;
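
    A small sketch of why the hash helps: assigning to a key that already
    exists changes nothing, so the exists/grep bookkeeping disappears. The
    in-memory corpus and its file names are assumptions made for illustration.

```perl
use strict;
use warnings;

my %word_to_files;

# Illustrative corpus (file name => contents). The repeated "message" in
# 1.TXT is added here to show that duplicates collapse on their own.
my %docs = (
    '1.TXT' => 'Compose your message message',
    '2.TXT' => 'Use this form to post your message',
);

for my $file (keys %docs) {
    # Setting an existing key again is harmless, so no duplicate check
    $word_to_files{$_}{$file} = 1 for split ' ', $docs{$file};
}

# Words that occur in more than one file
my @shared = grep { scalar(keys %{ $word_to_files{$_} }) > 1 }
             sort keys %word_to_files;
print "@shared\n";    # prints "message your"
```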

    --
    Eric
    $_ = reverse sort qw p ekca lre Js reh ts
    p, $/.r, map $_.$", qw e p h tona e; print

     
    Eric J. Roode, Jul 25, 2003
    #4
  5. "Eric J. Roode" <> wrote in
    news:Xns93C33D640AD40sdn.comcast@206.127.4.25:

    > "A. Sinan Unur" <> wrote in
    > news:Xns93C2A80F838A4asu1cornelledu@132.236.56.8:
    >
    >> if (exists $word_to_files{$word}) {
    >>     unless (grep { $_ eq $ARGV } @{ $word_to_files{$word} }) {
    >>         push @{ $word_to_files{$word} }, $ARGV;
    >>     }
    >> } else {
    >>     $word_to_files{$word} = [$ARGV];
    >> }

    >
    > Why use an array as the second-level data structure -- why not a hash?
    >
    > $word_to_files{$word}{$ARGV} = 1;


    Muddled thinking I guess. And I do remember making a mental note of this
    when you pointed out the same thing in another thread, but it looks like
    I regarded that as just another deadline reminder :)

    Is this better?

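    # cw: Common Word
    # Script to list every word in the files passed on the
    # command line, together with the files it appears in

    use diagnostics;
    use strict;
    use warnings;

    die "$0: file1 ... fileN\n" unless scalar @ARGV;

    my %word_to_files;    # word => { file name => 1 }

    while (<ARGV>) {
        chomp;
        my @words = split ' ';    # split on whitespace, skipping leading blanks
        foreach (@words) {
            # A hash key can exist only once, so no duplicate check is needed
            $word_to_files{$_}{$ARGV} = 1;
        }
    }

    foreach (sort keys %word_to_files) {
        # Sort the file names so that the output order is predictable
        print "$_: ", join(" ", sort keys %{$word_to_files{$_}}), "\n";
    }

    __END__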
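
    A word-to-files mapping like this is essentially an inverted index without
    term frequencies, and it can be queried directly. A minimal sketch,
    assuming an index of the same shape as %word_to_files; the helper name
    files_with_all and the sample entries are hypothetical.

```perl
use strict;
use warnings;

# Illustrative index with the same shape as above: word => { file => 1 }
my %word_to_files = (
    message => { 'file1.txt' => 1, 'file2.txt' => 1 },
    your    => { 'file1.txt' => 1, 'file2.txt' => 1 },
    post    => { 'file2.txt' => 1 },
);

# Hypothetical helper: the files that contain every one of the given words
sub files_with_all {
    my ($index, @words) = @_;
    my %hits;
    for my $word (@words) {
        # An unknown word contributes no files at all
        $hits{$_}++ for keys %{ $index->{$word} || {} };
    }
    # Keep the files that were hit once for each requested word
    return grep { $hits{$_} == @words } sort keys %hits;
}

print join(' ', files_with_all(\%word_to_files, 'message', 'post')), "\n";
# prints "file2.txt"
```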



    --
    A. Sinan Unur

    Remove dashes for address
    Spam bait: mailto:
     
    A. Sinan Unur, Jul 25, 2003
    #5
  6. (slash) wrote in
    news::

    > Thanks so much for all the helpful responses. (Sinan, this is not a HW
    > problem! :) I didn't get any helpful responses in another related
    > posting so I am adding this as a followup here hoping that it will
    > get reviewed.


    You'll need to post something that can be run locally (make sure some
    sample input data are included).

    ....

    > undef $/;


    Are you sure you want to do this here?

    > my @words = split /\W+/, <> ;
    > my $line_number = 2;
    > my $n;
    > my $line_num = 2;


    What is the difference between $line_number and $line_num and what purpose
    do they serve?

    > my $n_cols = 5;
    > my $col = { align => 'left'}; # no title, left alignment
    > my $tb = Text::Table->new( ( $col) x $n_cols);
    > my @stack = ( '*' ) x $n_cols;
    > foreach $word ( @words ) {
    > shift @stack;
    > push @stack, $word;
    > $tb->add(@stack);
    > }


    What on earth is going on in here?

    > my @lines = $tb->add("$stack[-4]", "$stack[-3]", "$stack[-2]",
    > "$stack[-1]", "*");
    > my @lines = $tb->add("$stack[-3]", "$stack[-2]", "$stack[-1]","*",
    > "*");
    > my @lines = $tb->table($line_number, $n);


    Why do you keep redeclaring and redefining @lines before you do anything
    with it?

    > #print @lines;
    > my $t1 = $tb->select(2, {is_sep => 2, body => " "}, 1,0,
    > {is_sep => 2, body => "\n"},
    > 2, {is_sep =>2, body => " "}, 3,4);
    > #foreach $textID (@textID) {
    > #$t1 = $t1->add($ARGV); }#adds one data line at the end of ngrams not
    > a col


    I do not understand this comment. Is it supposed to do something else? Did
    you read the docs for Text::Table?

    add()
    adds a data line to the table, returns the table.

    > my $input = $t1->table($line_num, $n);
    > print $input;

    ....
    > To recap, I don't know if I really need an inverted index. Perhaps an
    > array of arrays might help instead of the table module. Where I can
    > have @lines and $ARGV. Would that work? In other words, an array
    > consisting of the following:(first line of ngram, $ARGV)
    > (Second line of ngram, $ARGV)
    > .
    > .
    > .
    > (Last line of ngram, $ARGV)
    > And perhaps I could put this into a table and do the select statements
    > over them to display the desired output. Is this possible or am I just
    > dreaming?


    No, you are just rambling. The way this works is, you post a specific
    problem, and people try to help you solve it. We cannot work out your
    requirements for you, because we do not have the information you have
    about the overall picture.

    So, I do not know why you decided the previous solutions we posted to the
    problem of associating each word with the file(s) it came from were
    inadequate. Before people can help you, you have to clearly communicate
    what problem you are trying to solve.

    > Any suggestions on how to achieve this would be very much appreciated.


    I do not know what you mean by "this". But, would the following help?

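    # fubar.pl

    use strict;
    use warnings;

    use Text::Table;

    my $cols  = 5;
    my $col   = { 'align' => 'left' };
    my $table = Text::Table->new(($col) x $cols);

    {
        local $/;    # slurp mode: each read returns one whole file
        # Test for defined so that the end of input does not trigger a warning
        while (defined(my $text = <ARGV>)) {
            my @words = grep { length } split /\W+/, $text;
            while (@words) {
                # Up to four words per row; the file name fills the last column
                my @row = splice(@words, 0, $cols - 1);
                push @row, ('') x ($cols - 1 - @row);    # pad a short final row
                push @row, $ARGV;
                $table->add(@row);
            }
        }
    }

    print $table->body;
    __END__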

    C:\Home> cat file1
    file1a file1b
    file1c
    file1d
    file1e file1f
    file1g

    C:\Home>cat file2
    file2a
    file2b
    file2c
    file2d
    file2e
    file2f

    C:\Home>fubar.pl file1 file2
    file1a file1b file1c file1d file1
    file1e file1f file1g file1
    file2a file2b file2c file2d file2
    file2e file2f file2
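
    On the earlier question about undef $/: the script above uses local $/
    inside a block, so slurp mode ends when the block does. A minimal sketch of
    the same idea wrapped in a sub; the helper name slurp and the in-memory
    file handle are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything left on a file handle into one string
sub slurp {
    my ($fh) = @_;
    local $/;    # undefined only until this sub returns
    return scalar <$fh>;
}

# An in-memory file handle stands in for a real file (illustration only)
my $text = "file1a file1b\nfile1c\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $content = slurp($fh);
my @words   = split ' ', $content;
close $fh;

# $/ is back to its default here, so later line-by-line reads still work
print scalar(@words), " words; separator restored: ",
    ($/ eq "\n" ? "yes" : "no"), "\n";
```

    A bare undef $/ at file scope would instead stay in effect for every read
    that follows it.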

    --
    A. Sinan Unur

    Remove dashes for address
    Spam bait: mailto:
     
    A. Sinan Unur, Jul 26, 2003
    #6
  7. slash

    slash Guest

    Thanks for the help again and sorry for the confusion. I really
    appreciate the time people are taking out of their busy schedules to
    help me out here. I just don't get references well enough to be able
    to figure out how I can get what I want.

    What I was trying to do (so ineffectively) was to build 5grams for all
    the words first, and then put the filenames next to them so that I can
    select specific columns for display.
    I will try to explain this in more detail:

    My input is essentially a bunch of text files that I am currently
    passing in to the script as: perl -n script.pl ./*.TXT

    fox.txt
    quick brown fox jumped over lazy dog tripped over resting fox

    My badly written program was trying to achieve three things:
    first, produce complete 5grams;
    second, select specific columns from those 5grams;
    third, track which document each line came from, which is the part I
    just can't seem to get in.

    First part: 5grams
    =================
    . . . . quick
    . . . quick brown
    . . quick brown fox
    . quick brown fox jumped
    quick brown fox jumped over
    brown fox jumped over lazy
    fox jumped over lazy dog
    jumped over lazy dog tripped
    over lazy dog tripped over
    lazy dog tripped over resting
    dog tripped over resting fox
    tripped over resting fox
    over resting fox
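
    Padded windows like the ones listed above can be produced by sliding a
    fixed-size window over the word list. A minimal sketch, assuming "." as the
    padding symbol; the helper name ngram_windows is hypothetical, and because
    it pads both ends fully it yields two more trailing rows than the listing.

```perl
use strict;
use warnings;

# Hypothetical helper: return every $n-word window over @words as an array
# reference, padding both ends so each word appears in every column.
sub ngram_windows {
    my ($n, $pad, @words) = @_;
    my @padded = (($pad) x ($n - 1), @words, ($pad) x ($n - 1));
    # One window per start position that still leaves $n elements
    return map { [ @padded[$_ .. $_ + $n - 1] ] } 0 .. $#padded - $n + 1;
}

my @words = split ' ',
    'quick brown fox jumped over lazy dog tripped over resting fox';

print "@$_\n" for ngram_windows(5, '.', @words);
```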


    2nd part: select columns
    =====================
    quick brown fox
    brown quick
    brown fox jumped
    fox brown quick
    fox jumped over
    fox resting over
    jumped fox brown
    jumped over lazy
    over jumped fox
    over lazy dog
    lazy over jumped
    lazy dog tripped
    dog lazy over
    dog tripped over
    tripped dog lazy
    tripped over resting
    over tripped dog
    over resting fox
    resting tripped over
    resting fox

    Third part: filenames
    ===================
    quick brown fox fox.txt
    brown quick fox.txt
    brown fox jumped fox.txt
    fox brown quick fox.txt
    fox jumped over fox.txt
    fox resting over fox.txt
    jumped fox brown fox.txt
    jumped over lazy fox.txt
    over jumped fox fox.txt
    over lazy dog fox.txt
    lazy over jumped fox.txt
    lazy dog tripped fox.txt
    dog lazy over fox.txt
    dog tripped over fox.txt
    tripped dog lazy fox.txt
    tripped over resting fox.txt
    over tripped dog fox.txt
    over resting fox fox.txt
    resting tripped over fox.txt
    resting fox fox.txt

    Any help you could give me to move this forward would be greatly
    appreciated.

    many thanks,
    slash
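
    One possible way to get the word, its neighbours, and the file name on
    each line without going through Text::Table. This is only a sketch:
    context_rows is a hypothetical helper, the width of two neighbours per
    side is inferred from the listings above, and the rows come out in text
    order rather than in exactly the order shown.

```perl
use strict;
use warnings;

# Hypothetical helper: for each word, emit the word followed by up to $width
# neighbours on its left (nearest first), then the word followed by up to
# $width neighbours on its right. Every row ends with the file name.
sub context_rows {
    my ($file, $width, @words) = @_;
    my @rows;
    for my $i (0 .. $#words) {
        # Indices of the left neighbours, stopping at the start of the text
        my @left  = grep { $_ >= 0 }       map { $i - $_ } 1 .. $width;
        # Indices of the right neighbours, stopping at the end of the text
        my @right = grep { $_ <= $#words } map { $i + $_ } 1 .. $width;
        for my $ctx (\@left, \@right) {
            next unless @$ctx;    # no neighbours on that side: no row
            push @rows, join ' ', $words[$i], @words[@$ctx], $file;
        }
    }
    return @rows;
}

my @words = split ' ',
    'quick brown fox jumped over lazy dog tripped over resting fox';

print "$_\n" for context_rows('fox.txt', 2, @words);
```

    To cover several documents, the same sub can be called once per file and
    the returned rows collected into a single list.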

     
    slash, Jul 27, 2003
    #7