newbie's question on the text file processing?

Discussion in 'Perl Misc' started by Jim, Dec 7, 2003.

  1. Jim

    Jim Guest

    Hello,

    I am learning Perl and I have come across something. I would like to process
    the text file and calculate the word frequency in it. All analysis is case
    insensitive and all punctuation marks other than hyphens, apostrophe and
    plus and minus signs were substituted by the space. As I am a newbie, I have
    no idea of how to write a complex regular expression to extract the correct
    word one by one from the file. Can anyone help me finish the script?
    Jim, Dec 7, 2003
    #1

  2. "Jim" <> wrote in
    news:bqvj4l$d2h$99.com:

    > Hello,
    >
    > I am learning Perl and I have come across something. I would like to
    > process the text file and calculate the word frequency in it. All
    > analysis is case insensitive and all punctuation marks other than
    > hyphens, apostrophe and plus and minus signs were substituted by the
    > space. As I am a newbie, I have no idea of how to write a complex
    > regular expression to extract the correct word one by one from the
    > file.


    This smells of homework or some other blatant attempt to make others do
    your work for you.

    > Can anyone help me finish the script?


    Show us what you have done so far and ask specific questions.

    --
    A. Sinan Unur

    Remove dashes for address
    Spam bait: mailto:
    A. Sinan Unur, Dec 7, 2003
    #2

  3. Jim

    ww Guest

    hint: what does open() do?
    hint: what does join(split()) do?
    hint: what does grep() return?
    hint: I don't know how to solve your problem.

    -w w



    On Mon, 8 Dec 2003 00:05:51 +0800, "Jim" <>
    wrote:

    >Hello,
    >
    >I am learning Perl and I have come across something. I would like to process
    >the text file and calculate the word frequency in it. All analysis is case
    >insensitive and all punctuation marks other than hyphens, apostrophe and
    >plus and minus signs were substituted by the space. As I am a newbie, I have
    >no idea of how to write a complex regular expression to extract the correct
    >word one by one from the file. Can anyone help me finish the script?
    >
    >
    ww, Dec 7, 2003
    #3
  4. Jim <> wrote:

    > I would like to process
    > the text file and calculate the word frequency in it.


    my %words;
    while ( <> ) {
        $words{$1}++ while /(\w+)/g;
    }
    printf "%9d %s\n", $words{$_}, $_ for sort keys %words;


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 7, 2003
    #4
  5. Jim

    Jim Guest

    while(my $line = <FILE>) {
    $line =~ s/[\+\-\']/_/g;
    $line = lc $line;
    my @array = ($line =~ /\b\w+\b/g);
    foreach(@array) {
    $wordFreq{$_}++;
    }
    }

    Is this correct? I am not sure whether the code fulfills the requirement.

    Jim
    Jim, Dec 7, 2003
    #5
  6. Jim wrote:
    > while(my $line = <FILE>) {
    > $line =~ s/[\+\-\']/_/g;
    > $line = lc $line;
    > my @array = ($line =~ /\b\w+\b/g);
    > foreach(@array) {
    > $wordFreq{$_}++;
    > }
    > }
    >
    > Is this correct? But I am not sure if the code fulfill the
    > requirement.


    How can we say? You don't tell us what the code is supposed to do (i.e. what
    are those ominous requirements you are referring to without actually telling
    us), what kind of problems you have with that code, or why you believe it
    is not correct. Just "question on text file processing" is a bit vague,
    don't you think?

    Posting your code is good, but it is not sufficient.
    Please
    - specify the requirement
    - explain what the code is supposed to do (or what you think the code is
      doing)
    - explain what the code is actually doing and how this is different from
      what you expect it to do
    - quote literally any warning or error message you are getting
    Then we may be able to help you more.

    jue
    Jürgen Exner, Dec 7, 2003
    #6
  7. Jim wrote:
    >
    > I am learning Perl and I have come across something. I would like to process
    > the text file and calculate the word frequency in it. All analysis is case
    > insensitive and all punctuation marks other than hyphens, apostrophe and
    > plus and minus signs were substituted by the space. As I am a newbie, I have
    > no idea of how to write a complex regular expression to extract the correct
    > word one by one from the file. Can anyone help me finish the script?


    my %words;
    while ( <> ) {
        s/[^[:alnum:]'+-]/ /g;
        $words{ lc() }++ for /\S+/g;
    }

    print "$_\t$words{$_}\n" for sort keys %words;
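
    To see what those two steps do, here is a minimal sketch that applies
    the same substitution and token match to an in-memory string instead of
    <>. The helper name count_words and the sample text are assumptions made
    up for illustration; they are not part of the post above.

```perl
use strict;
use warnings;

# count_words is a hypothetical helper used only for this illustration.
sub count_words {
    my ($text) = @_;
    my %words;
    for my $line ( split /\n/, $text ) {
        # Replace everything except letters, digits, ', + and - with a space.
        ( my $clean = $line ) =~ s/[^[:alnum:]'+-]/ /g;
        # Count each whitespace-delimited token, folded to lower case.
        $words{ lc $_ }++ for $clean =~ /\S+/g;
    }
    return \%words;
}

# Made-up sample input.
my $freq = count_words("The cat -- the CAT's hat.\nRock'n'roll, e-mail!\n");
print "$_\t$freq->{$_}\n" for sort keys %$freq;
```

    Note that with this approach a free-standing run such as "--" is also
    counted as a word, because hyphens are kept and the token is simply
    whatever lies between spaces. Whether that is wanted depends on the
    requirement.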



    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Dec 7, 2003
    #7

  8. > Subject: newbie's question on the text file processing?


    Please put the subject of your post in the Subject of your post. If
    in doubt try this simple test. Imagine you could have been bothered
    to have done a search before you posted. Next imagine you found a
    thread with your subject line. Would you have been able to recognise
    it as the same subject?

    Note: the words "newbie" and "question" are red-flag words in subject
    lines.

    "Jim" <> writes:

    [ No context - Please don't overtrim ]

    > while(my $line = <FILE>) {
    > $line =~ s/[\+\-\']/_/g;
    > $line = lc $line;
    > my @array = ($line =~ /\b\w+\b/g);
    > foreach(@array) {
    > $wordFreq{$_}++;
    > }
    > }
    >
    > Is this correct? But I am not sure if the code fulfill the requirement.


    I don't see why you do s/[\+\-\']/_/g

    If I read the requirement correctly, you want to treat hyphen, plus and
    apostrophe as distinct word characters, not replace them with underscores.

    The leading \b in /\b\w+\b/ is redundant because // always favours the
    earliest possible match.

    The trailing \b in /\b\w+\b/ is redundant because + is greedy.

    BTW the variable @array is redundant - you could just use the
    expression directly in the argument of foreach().

    while (my $line = <FILE>) {
        $wordFreq{$_}++ for lc($line) =~ /[-+'\w]+/g;
    }
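
    The points about \b and the character class can be checked on a single
    line of text. This is only a sketch: the sample sentence is made up, and
    it reads from a string rather than from the FILE handle.

```perl
use strict;
use warnings;

# Made-up sample line for illustration.
my $line = "Don't over-think it: C++ is not C, and -5 is not +5.";

# With and without \b, the pattern extracts the same tokens.
my @anchored   = lc($line) =~ /\b\w+\b/g;
my @unanchored = lc($line) =~ /\w+/g;
print "same tokens\n" if "@anchored" eq "@unanchored";

# Adding -, + and ' to the class keeps them inside the words.
my %wordFreq;
$wordFreq{$_}++ for lc($line) =~ /[-+'\w]+/g;
print "$_\t$wordFreq{$_}\n" for sort keys %wordFreq;
```

    With plain \w+ the apostrophe and hyphen split "don't" and "over-think"
    into separate pieces, whereas the wider class keeps them whole.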
    Brian McCauley, Dec 11, 2003
    #8
