newbie's question on the text file processing?

Jim

Hello,

I am learning Perl and have run into a problem. I would like to process
a text file and calculate the frequency of each word in it. The analysis is
case insensitive, and all punctuation marks other than hyphens, apostrophes,
and plus and minus signs are to be replaced by a space. As I am a newbie, I
have no idea how to write a regular expression that extracts the correct
words one by one from the file. Can anyone help me finish the script?
 
A. Sinan Unur

Jim said:
Hello,

I am learning Perl and have run into a problem. I would like to
process a text file and calculate the frequency of each word in it. The
analysis is case insensitive, and all punctuation marks other than
hyphens, apostrophes, and plus and minus signs are to be replaced by a
space. As I am a newbie, I have no idea how to write a regular
expression that extracts the correct words one by one from the
file.

This smells of homework or some other blatant attempt to make others do
your work for you.

Jim said:
Can anyone help me finish the script?

Show us what you have done so far and ask specific questions.
 
ww

hint: what does open() do?
hint: what does join(split()) do?
hint: what does grep() return?
hint: I don't know how to solve your problem.

-w w
 
Tad McClellan

Jim said:
I would like to process
the text file and calculate the word frequency in it.

my %words;
while ( <> ) {
    $words{$1}++ while /(\w+)/g;
}
# print the count first, then the word, to match the "%9d %s" format
printf "%9d %s\n", $words{$_}, $_ for sort keys %words;
 
Jim

while(my $line = <FILE>) {
    $line =~ s/[\+\-\']/_/g;
    $line = lc $line;
    my @array = ($line =~ /\b\w+\b/g);
    foreach(@array) {
        $wordFreq{$_}++;
    }
}

Is this correct? I am not sure whether the code fulfills the requirement.

Jim
 
Jürgen Exner

Jim said:
while(my $line = <FILE>) {
    $line =~ s/[\+\-\']/_/g;
    $line = lc $line;
    my @array = ($line =~ /\b\w+\b/g);
    foreach(@array) {
        $wordFreq{$_}++;
    }
}

Is this correct? I am not sure whether the code fulfills the
requirement.

How can we say? You don't tell us what the code is supposed to do (i.e. what
those ominous requirements are that you refer to without actually telling
us), what kind of problems you have with that code, or why you believe it
is not correct. Just "question on text file processing" is a bit vague,
don't you think?

Posting your code is good, but it is not sufficient.
Please
- specify the requirement
- explain what the code is supposed to do (or what you think the code is
doing)
- explain what the code is actually doing and how this differs from
what you expect it to do
- quote literally any warning or error message you are getting
Then we may be able to help you more.

jue
 
John W. Krahn

Jim said:
I am learning Perl and have run into a problem. I would like to process
a text file and calculate the frequency of each word in it. The analysis is
case insensitive, and all punctuation marks other than hyphens, apostrophes,
and plus and minus signs are to be replaced by a space. As I am a newbie, I
have no idea how to write a regular expression that extracts the correct
words one by one from the file. Can anyone help me finish the script?

my %words;
while ( <> ) {
    s/[^[:alnum:]'+-]/ /g;
    $words{ lc() }++ for /\S+/g;
}

print "$_\t$words{$_}\n" for sort keys %words;
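
To see what the two steps above do to a single line, here is a minimal sketch on a made-up sample string (the sample text and the variable names `$clean`, `@tokens` and `@words` are illustrative assumptions, not part of the posted script). It also shows one side effect worth knowing about: a run of dashes standing on its own survives as a token. The final `grep` is an optional extra filter, assuming such tokens are not wanted.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the thread.
my $line = "Wait -- it's C++, isn't it? (Well-known!)";

# Step 1: replace everything except alphanumerics, ', + and - with a space.
( my $clean = $line ) =~ s/[^[:alnum:]'+-]/ /g;

# Step 2: every run of non-whitespace is a token; lowercase each one.
my @tokens = map { lc($_) } $clean =~ /\S+/g;
print join( '|', @tokens ), "\n";    # wait|--|it's|c++|isn't|it|well-known

# Optional: keep only tokens that contain at least one letter or digit.
my @words = grep { /[[:alnum:]]/ } @tokens;
print join( '|', @words ), "\n";     # wait|it's|c++|isn't|it|well-known
```

Whether a bare "--" should count as a word depends on the requirement, which the original post leaves open.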



John
 
Brian McCauley

Subject: newbie's question on the text file processing?

Please put the subject of your post in the Subject of your post. If
in doubt try this simple test. Imagine you could have been bothered
to have done a search before you posted. Next imagine you found a
thread with your subject line. Would you have been able to recognise
it as the same subject?

Note: the words "newbie" and "question" are red-flag words in subject
lines.


[ No context - Please don't overtrim ]
Jim said:
while(my $line = <FILE>) {
    $line =~ s/[\+\-\']/_/g;
    $line = lc $line;
    my @array = ($line =~ /\b\w+\b/g);
    foreach(@array) {
        $wordFreq{$_}++;
    }
}

Is this correct? I am not sure whether the code fulfills the requirement.

I don't see why you do s/[\+\-\']/_/g.

If I read the requirement correctly, you want to treat hyphen, plus and
apostrophe as distinct word characters, not replace them with underscore.

The leading \b in /\b\w+\b/ is redundant because // always favours the
earliest possible match.

The trailing \b in /\b\w+\b/ is redundant because + is greedy.
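
The two points about \b can be checked with a short sketch. The sample line and variable names below are assumptions made up for illustration. It compares the matches of /\b\w+\b/g and /\w+/g, then shows how a character class that includes hyphen, plus and apostrophe keeps those characters inside the word.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the thread.
my $line = "it's a well-known fact: a+b is not a-b";

my @with_b    = $line =~ /\b\w+\b/g;      # pattern from the quoted code
my @without_b = $line =~ /\w+/g;          # same pattern without \b
my @extended  = $line =~ /[-+'\w]+/g;     # -, + and ' count as word characters

# Both \w+ variants return the same list of matches.
print "identical\n" if "@with_b" eq "@without_b";

print "@without_b\n";   # it s a well known fact a b is not a b
print "@extended\n";    # it's a well-known fact a+b is not a-b
```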

BTW the variable @array is redundant - you could just use the
expression directly in the argument of foreach().

while(my $line = <FILE>) {
    $wordFreq{$_}++ for lc($line) =~ /[-+'\w]+/g;
}
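
For anyone who wants to try the one-line loop in isolation, here is a self-contained sketch under `use strict`. The helper name `count_words`, the sample text and the output ordering (most frequent first) are assumptions added for illustration. The thread itself reads from a filehandle and does not specify an ordering.

```perl
use strict;
use warnings;

# Count case-insensitive word frequencies in a string.
# A "word" is a run of \w characters, hyphens, plus signs or apostrophes.
# count_words is a hypothetical helper name, not from the thread.
sub count_words {
    my ($text) = @_;
    my %freq;
    $freq{$_}++ for lc($text) =~ /[-+'\w]+/g;
    return \%freq;
}

# Hypothetical sample text standing in for the contents of the file.
my $sample = "Don't stop. DON'T stop the well-known C++ demo!";
my $freq   = count_words($sample);

# Most frequent first; ties are broken alphabetically.
for my $word ( sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq ) {
    printf "%5d %s\n", $freq->{$word}, $word;
}
```

To process a real file, the same sub could be called once per line read from a lexical filehandle, adding the counts into one hash.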
 
