File reading and processing


amit_h123

Hello all,
I have a large text file with lots of spaces and newline characters in
it, which I want to remove.
After that I need to build hash tables for the unique words
that are present in the file. That is, I need one hash for
unigrams (one word at a time), one hash for bigrams (2 words at a time),
and the same for 3 words.
I am completely lost with removing the spaces in the text file, and
I am not able to access each word one at a time.
A simple example of what I need to do:

If the text in my file is:

hello how are you all hello how are.

then my unigrams will be:
hello 2
how 2
are 2
you 1...

bigrams will be
hello how 2
how are 2
are you 1
you all 1

trigrams
hello how are 2
how are you 1
are you all 1
... and so on


Can anyone help me with this code?
Thanks
 

Gunnar Hjalmarsson

I have a large text file with lots of spaces and newline characters in
it, which I want to remove.
After that I need to build hash tables for the unique words
that are present in the file. That is, I need one hash for
unigrams (one word at a time), one hash for bigrams (2 words at a time),
and the same for 3 words.
I am completely lost with removing the spaces in the text file

Have you thought of learning a programming language?
 

Xicheng Jia

Learn how to use hashes and how to build hash key-value pairs. Here is a
very general way to count trigrams as you described. Find your own way
to get bigrams and other n-grams.

perl -le '
$_=q(hello how are you all hello how are);
$seen{ "$1$2" }++
while m{ (\S+) (?=( (?: \s+\S+){2} ) ) }xg;
print map{ "$_ => $seen{ $_ }\n" } keys %seen'
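
The same counting idea can be sketched for any n by splitting the text into words first. This is only a minimal sketch, assuming a "word" is any run of non-whitespace characters; the sub name count_ngrams is made up for this example and does not come from any module.

```perl
use strict;
use warnings;

# Count every run of $n consecutive words in $text.
# Assumption: a "word" is any maximal run of non-whitespace characters.
sub count_ngrams {
    my ($text, $n) = @_;
    # split ' ' skips leading whitespace and splits on runs of whitespace
    my @words = split ' ', $text;
    my %count;
    # The range is empty when there are fewer than $n words
    for my $i (0 .. @words - $n) {
        $count{ join ' ', @words[$i .. $i + $n - 1] }++;
    }
    return \%count;
}

my $text = 'hello how are you all hello how are';
for my $n (1 .. 3) {
    my $count = count_ngrams($text, $n);
    print "$n-grams:\n";
    print "  $_ => $count->{$_}\n" for sort keys %$count;
}
```

With n set to 1, 2 and 3 this gives the unigram, bigram and trigram hashes from the question. Note that it also counts n-grams that wrap from "all" back to the second "hello", such as "all hello".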

Xicheng
 

amit_h123

Thanks for your help.
But how do I use this on the whole file? How should I read the
whole text and process it? Can you help me with that, please?
 

Xicheng Jia

1) Read your file in slurp mode, so that the entire contents of the file
are saved in a single string; then you can handle it in a similar way to
the one-liner above.

perl -0777ne '
tr/\n\t / /s;
$seen{ "$1$2" }++
while m{ (\S+) (?=( (?:\s+\S+){2}) ) }xg;
} { # quit from the hidden while-loop
print map{ "$_ => $seen{ $_ }\n" } keys %seen
' myfile.txt
(untested)
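
For a standalone script rather than a one-liner, the slurp step can be written out by hand. The following is a sketch under the same assumptions as the one-liner; the sub names slurp_file and count_trigrams are hypothetical, and the file name is taken from the first command-line argument.

```perl
use strict;
use warnings;

# Read a whole file into one string ("slurp mode").
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undefined record separator: <$fh> returns everything at once
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

# Count trigrams with the same regex as the one-liner.
sub count_trigrams {
    my ($text) = @_;
    $text =~ tr/\n\t / /s;    # squeeze newlines, tabs and spaces into single spaces
    my %seen;
    $seen{"$1$2"}++ while $text =~ m{ (\S+) (?= ( (?:\s+\S+){2} ) ) }xg;
    return \%seen;
}

# Only touch the file system when a file name was given.
if (@ARGV) {
    my $seen = count_trigrams(slurp_file($ARGV[0]));
    print "$_ => $seen->{$_}\n" for sort keys %$seen;
}
```

Saved under a name of your choosing, say trigrams.pl, it would be run as "perl trigrams.pl myfile.txt". Slurping keeps the whole file in memory, which is the same trade-off the -0777 switch makes.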

2) What about words separated by tabs or newlines? That is, how do you
want to handle the following cases: are they treated as the same or
differently?

"how are you" and

"how are
you"

or

"how are you"

If you want to treat them differently, comment out the "tr///"
expression.

See "perldoc perlrun" for the command-line switches when you turn this
into a real script.

3) How do you want to handle punctuation and other graphic characters?
You can figure that out yourself. :)
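
As one possible starting point for the punctuation question, here is a small sketch that lowercases the text and treats everything except ASCII letters, digits and apostrophes as a separator. That rule is purely an assumption for illustration, and the sub name tokenize is made up; pick whatever rule suits your data.

```perl
use strict;
use warnings;

# One possible normalisation (an assumption, not the only sensible choice):
# lowercase the text, then split on anything that is not an ASCII letter,
# digit or apostrophe. grep { length } drops the empty leading field that
# split produces when the text starts with a separator.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9']+/, lc $text;
}

print join('|', tokenize("Hello, how are you all? Hello how are.")), "\n";
# prints: hello|how|are|you|all|hello|how|are
```

With this rule "are." and "are" count as the same word, and so do "Hello" and "hello". Non-ASCII letters would be treated as separators, so text in other scripts needs a different character class.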

Good luck,
Xicheng
 

Tad McClellan

Can anyone help me with this code?


What code?

If you post your code we will help you fix it.

We cannot fix it if you do not post it for us to see.
 
 
