Counting most frequently-occurring n-grams in a file (or over multiple files)

C3

I'm looking for, or willing to write, a program that will take a list of
files as command-line arguments, and then build up a frequency table of
n-grams (individual bytes, or strings of 2 or more bytes) for all these
files.

e.g. ngram 4 file1.txt file2.txt

would return the most frequently occurring sequences of 4 bytes over the two
files.

I am willing to go quick'n'dirty for this. I understand I need to build up a
table of all the n-grams that exist in each file. Can someone help me get
started on this?


cheers,
 
John W. Krahn

C3 said:
I'm looking for, or willing to write, a program that will take a list of
files as command-line arguments, and then build up a frequency table of
n-grams (individual bytes, or strings of 2 or more bytes) for all these
files.

e.g. ngram 4 file1.txt file2.txt

would return the most frequently occurring sequences of 4 bytes over the two
files.

I am willing to go quick'n'dirty for this. I understand I need to build up a
table of all the n-grams that exist in each file. Can someone help me get
started on this?

Well if it's quick'n'dirty that you want:

perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt



John
 
C3

Unmatched curly brace :)

John W. Krahn said:
Well if it's quick'n'dirty that you want:

perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt



John
 
Jeff 'japhy' Pinyan

C3 said:
I'm looking for, or willing to write, a program that will take a list of
files as command-line arguments, and then build up a frequency table of
n-grams (individual bytes, or strings of 2 or more bytes) for all these
files.

e.g. ngram 4 file1.txt file2.txt

would return the most frequently occurring sequences of 4 bytes over the two
files.

Open the file, read it in conveniently sized chunks, and for every group
of four characters, increment $ngram{$g}.
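
A minimal sketch of the chunked approach described above, for anyone who wants a starting point. The sub name count_file, the 64 KiB default chunk size and the top-20 cutoff are illustrative assumptions, not anything specified in this thread. The last n-1 bytes of each chunk are carried over so that groups spanning two chunks are still counted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every overlapping n-byte window of one file into %$counts.
# count_file and the default chunk size are hypothetical choices.
sub count_file {
    my ($n, $path, $counts, $chunk_size) = @_;
    $chunk_size ||= 65536;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                       # treat the file as raw bytes
    my $tail = '';                     # carry-over from the previous chunk
    while (read $fh, my $chunk, $chunk_size) {
        my $buf = $tail . $chunk;
        for my $i (0 .. length($buf) - $n) {
            $counts->{ substr $buf, $i, $n }++;
        }
        # Keep at most n-1 trailing bytes so windows that straddle
        # a chunk boundary are seen exactly once.
        my $keep = $n - 1;
        $keep = length $buf if $keep > length $buf;
        $tail = substr $buf, length($buf) - $keep;
    }
    close $fh;
}

# Usage: ngram N file1 [file2 ...]
if (@ARGV >= 2) {
    my ($n, @files) = @ARGV;
    my %counts;
    count_file($n, $_, \%counts) for @files;
    my @top = sort { $counts{$b} <=> $counts{$a} || $a cmp $b } keys %counts;
    splice @top, 20 if @top > 20;      # show only the 20 most frequent
    printf "%8d  %s\n", $counts{$_}, $_ for @top;
}
```

Because $tail is reset for every file, no group runs from the end of one file into the start of the next. The n-grams are printed as raw bytes here, which is only readable for text input.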

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
Senior Dean, Fall 2004 % have long ago been overpaid?
RPI Corporation Secretary %
http://japhy.perlmonk.org/ % -- Meister Eckhart
 
Bill Smith

C3 said:
I'm looking for, or willing to write, a program that will take a list of
files as command-line arguments, and then build up a frequency table of
n-grams (individual bytes, or strings of 2 or more bytes) for all these
files.
--snip--


Are n-grams restricted to characters on a single line or can they flow
onto the next line? (or even next file?) In the latter case, are the
newline character(s) part of the n-gram?

Bill
 
Larry Felton Johnson

C3 said:
Hmm, seems to run on the command-line, but it produces no output for me.

What sort of environment are you running it in? I cut and pasted his
oneliner and ran it against a number of files on my workstation, and it
worked right away. I haven't really checked the output carefully, but
on trivial files of character sequences it seems to work as I'd expect.

Larry
 
C3

Bill Smith said:
Are n-grams restricted to characters on a single line or can they flow
onto the next line? (or even next file?) In the latter case, are the
newline character(s) part of the n-gram?

n-grams are sequences of bytes, not ASCII characters, so line feeds and
carriage returns are treated like any other character. n-grams may not flow
onto other files.
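
A rough sketch of what those two rules could look like in Perl, assuming each file is small enough to read into memory in one go. The sub names count_string and count_files are made up for illustration. The /s modifier lets "." match line feeds, and each file is read and scanned separately so nothing runs over into the next file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count overlapping n-byte sequences in one in-memory string.
sub count_string {
    my ($n, $data, $counts) = @_;
    # /s makes "." match newlines too; the lookahead lets matches overlap.
    $counts->{$1}++ while $data =~ /(?=(.{$n}))/gs;
}

# Read each file whole, in binary mode, and count it on its own,
# so that no n-gram spans a file boundary.
sub count_files {
    my ($n, @files) = @_;
    my %counts;
    for my $path (@files) {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        binmode $fh;
        local $/;                      # no record separator: read it all
        my $data = <$fh>;
        close $fh;
        count_string($n, $data, \%counts) if defined $data;
    }
    return \%counts;
}

# Usage: ngram N file1 [file2 ...]
if (@ARGV >= 2) {
    my ($n, @files) = @ARGV;
    my $counts = count_files($n, @files);
    print scalar(keys %$counts), " distinct $n-grams\n";
}
```

Without /s, and with the -l switch stripping line endings, a pattern built from "." only ever sees n-grams that sit inside a single line.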

cheers,
 
C3

I'm running Perl 5.6.1 under Debian 3.0. I don't get any output, and have to
kill the app. Incidentally, what would it take to modify the program so that
it printed the ASCII code in hex (or decimal)? After all, it will be run on
binary files.
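
For the hex or decimal output, one possible sketch is below. The helper names hex_gram, dec_gram and print_table are hypothetical, and the sample table at the end is made-up data just to show the format. It assumes the counts are already in a hash keyed by the raw n-gram bytes, and it also prints the count next to each n-gram, most frequent first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render a byte string as space-separated two-digit hex codes.
sub hex_gram {
    my ($gram) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $gram;
}

# Same idea, with decimal byte values.
sub dec_gram {
    my ($gram) = @_;
    return join ' ', unpack 'C*', $gram;
}

# Print a frequency table (hash ref: n-gram => count), highest count first.
sub print_table {
    my ($counts) = @_;
    my @grams = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
                keys %$counts;
    printf "%8d  %s\n", $counts->{$_}, hex_gram($_) for @grams;
}

# Illustrative data only, not real results.
print_table({ "ab\r\n" => 3, "\x00\xff\x00\xff" => 7 });
```

Swapping hex_gram for dec_gram in print_table gives decimal output instead.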


cheers,
 
