How to search a file efficiently (= fast!) for a certain hex value?

Ulf Meinhardt

How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

Ulf
 
Abhishek

How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

Ulf

Do you want to count the number of occurrences of a phrase in a document?
If that's your question, use the regex package (java.util.regex).
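If the data really is plain text read into a String and you are counting a literal phrase, a sketch with java.util.regex might look like this (class and helper names are made up):

import java.util.regex.*;

public class RegexCount {
    // Count non-overlapping occurrences of a literal phrase in a text.
    static int count(String text, String phrase) {
        Matcher m = Pattern.compile(Pattern.quote(phrase)).matcher(text);
        int count = 0;
        while (m.find())
            count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("jjavaj", "j"));   // prints 3
    }
}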
 
Knute Johnson

Ulf said:
How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

Ulf

Get a really fast computer.
 
Joshua Cranmer

Ulf said:
How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

Ulf

If the file came with a precomputed table of the occurrence counts of all
byte values, the lookup would be fast. Chances are you don't have that.

A simple loop:

int countOccurence(InputStream is, byte value) throws IOException {
    int character, total = 0;
    while ((character = is.read()) != -1) {
        if ((byte) character == value)
            total++;
    }
    return total;
}

If you need counts for several different values, build a table of all 256
counts in a single pass and return the answers from that (i.e. declare
int[] counts = new int[256]; and add counts[character]++; inside the loop).

You can't get better than O(n) for only one value.
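A minimal sketch of that all-values variant (class and method names are made up for illustration; buffered single-byte reads are kept for brevity):

import java.io.*;

public class ByteHistogram {
    // One pass over the stream; counts[v] holds how often the unsigned
    // byte value v (0-255) occurred.
    static int[] countAll(InputStream is) throws IOException {
        int[] counts = new int[256];
        int b;
        while ((b = is.read()) != -1)
            counts[b]++;        // read() already returns 0..255
        return counts;
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
        int[] counts = countAll(is);
        is.close();
        System.out.println("x'6A' occurs " + counts[0x6A] + " times");
    }
}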
 
Mark Space

Ulf said:
How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

Ulf

You could spawn several threads, each searching a specific region of the
file.

This loads more of the file into memory at once, keeps the I/O subsystem
busy with several outstanding requests so it can reorder them and fetch
parts of the file faster, and uses more CPUs if available. It also lets
one thread search its part of the file while other threads are blocked
waiting for their regions to load.
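A rough sketch of that idea (not from the original post): each worker opens the file independently with its own RandomAccessFile, which also sidesteps the per-stream locking mentioned in the follow-up below; buffer size and class names are arbitrary.

import java.io.*;
import java.util.*;
import java.util.concurrent.*;

public class ParallelByteCount {
    // Count occurrences of 'target' in the region [start, start+length),
    // using a RandomAccessFile opened by this worker alone.
    static long countRegion(File file, long start, long length, byte target)
            throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            raf.seek(start);
            byte[] buf = new byte[64 * 1024];
            long remaining = length, count = 0;
            while (remaining > 0) {
                int n = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n == -1)
                    break;
                for (int i = 0; i < n; i++)
                    if (buf[i] == target)
                        count++;
                remaining -= n;
            }
            return count;
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws Exception {
        final File file = new File(args[0]);
        final byte target = 0x6A;
        int threads = Runtime.getRuntime().availableProcessors();
        final long chunk = (file.length() + threads - 1) / threads;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<Future<Long>>();
        for (int t = 0; t < threads; t++) {
            final long start = t * chunk;
            final long len = Math.min(chunk, file.length() - start);
            results.add(pool.submit(new Callable<Long>() {
                public Long call() throws IOException {
                    return countRegion(file, start, len, target);
                }
            }));
        }
        long total = 0;
        for (Future<Long> f : results)
            total += f.get();
        pool.shutdown();
        System.out.println("count = " + total);
    }
}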
 
Mark Thornton

Mark said:
You could spawn several threads, each searching a specific region of the
file.

This loads more of the file into memory at once, keeps the I/O subsystem
busy with several outstanding requests so it can reorder them [...]

It might on some operating systems. To get this effect on Windows you
have to open the file multiple times. All the file read methods are
synchronized in Sun's current Windows implementation so that only one
read occurs at a time on a given
FileChannel/RandomAccessFile/FileInputStream. Rather disappointing.

Mark Thornton
 
John W. Kennedy

Ulf said:
How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

java.nio.MappedByteBuffer
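For concreteness, a sketch of what that might look like here (not from the original post): map the whole file read-only and scan the mapping; a single map() call only covers files under 2 GB.

import java.io.*;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedCount {
    public static void main(String[] args) throws IOException {
        FileInputStream fis = new FileInputStream(args[0]);
        FileChannel fc = fis.getChannel();
        // Map the whole file into memory (read-only, no copying into the heap).
        MappedByteBuffer buf =
            fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        long count = 0;
        while (buf.hasRemaining())
            if (buf.get() == (byte) 0x6A)
                count++;
        fis.close();
        System.out.println("x'6A' occurs " + count + " times");
    }
}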

--
John W. Kennedy
"Those in the seat of power oft forget their failings and seek only the
obeisance of others! Thus is bad government born! Hold in your heart
that you and the people are one, human beings all, and good government
shall arise of its own accord! Such is the path of virtue!"
-- Kazuo Koike. "Lone Wolf and Cub: Thirteen Strings" (tr. Dana Lewis)
 
Knute Johnson

Mark said:
It might on some operating systems. To get this effect on Windows you
have to open the file multiple times. All the file read methods are
synchronized in Sun's current Windows implementation so that only one
read occurs at a time on a given
FileChannel/RandomAccessFile/FileInputStream. Rather disappointing.

Mark Thornton

Just out of curiosity, is this documented somewhere or did you glean it
from the source code?

Thanks,
 
Arne Vajhøj

Ulf said:
How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

You will have to read the bytes of the file from disk. Focus on
the fastest read from disk. Looping through the bytes searching
for the byte value is trivial.

Arne
 
Arne Vajhøj

A simple loop:

int countOccurence(InputStream is, byte value) throws IOException {
    int character, total = 0;
    while ((character = is.read()) != -1) {
        if ((byte) character == value)
            total++;
    }
    return total;
}

InputStream.read() does not seem a good candidate for fast IO.

Arne
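For example, reading in blocks with read(byte[]) instead of one call per byte; a sketch with an arbitrary 64 KB buffer, class name made up:

import java.io.*;

public class BlockCount {
    // Count occurrences of 'target' using multi-byte reads.
    static long count(InputStream is, byte target) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = is.read(buf)) != -1)
            for (int i = 0; i < n; i++)
                if (buf[i] == target)
                    total++;
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new FileInputStream(args[0]);
        System.out.println(count(is, (byte) 0x6A));
        is.close();
    }
}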
 
Christian

John said:
java.nio.MappedByteBuffer
Probably this will be even slower than just using an InputStream.

MappedByteBuffer may have its uses, but not for something like this,
especially if the file is too big: the OS might have to cache the file
somewhere else on the disk.



to the op:

what is wrong with

BufferedInputStream bis =
    new BufferedInputStream(new FileInputStream(file));

int read;
int[] timesRead = new int[256];
while (-1 != (read = bis.read())) {
    timesRead[read & 0xff]++;
}

to read in the file. It is simple, you won't get much faster, and
afterwards you can look up the occurrences easily.
 
Arne Vajhøj

Christian said:
Probably this will be even slower than just using an InputStream.

MappedByteBuffer may have its uses, but not for something like this,
especially if the file is too big: the OS might have to cache the file
somewhere else on the disk.

It may be difficult to guarantee anything for something that is
so platform specific.

But common platforms should:
- map the whole file into virtual memory
- keep only some of it in physical memory
- read directly from the file, with no use of the pagefile or
other temporary disk I/O

to the op:

what is wrong with

BufferedInputStream bis = new BufferedInputStream(new
FileInputStream(file));

int read;
int[] timesRead = new int[256];
while (-1 != (read = bis.read())) {
timesRead[read & 0xff]++;
}

to read in the file..
it is simple, you won't get much faster and afterwards you can look up
the occurences easily ...

Even with BufferedInputStream I would go for multi byte read instead
of single byte read.

Arne
 
Christian

Arne said:
It may be difficult to guarantee anything for something that is
so platform specific.

But common platforms should:
- map all the file into virtual memory
- only have some of it in physical memory
- read directly from the file with no use of pagefile or
other temp disk IO

And that would be too slow: first mapping the file to memory, then
letting the algorithm dig through the whole mapping for counting.
Just reading each byte and then discarding it is far more
memory-friendly and faster.
to the op:

what is wrong with

BufferedInputStream bis = new BufferedInputStream(new
FileInputStream(file));

int read;
int[] timesRead = new int[256];
while (-1 != (read = bis.read())) {
timesRead[read & 0xff]++;
}

to read in the file..
it is simple, you won't get much faster and afterwards you can look up
the occurences easily ...

Even with BufferedInputStream I would go for multi byte read instead
of single byte read.

Arne

Why? BufferedInputStream does the multi-byte read; what is left is the
overhead of calling the read() method for each byte.

Though I do assume that the speed of this algorithm is still solely
dependent on the disk's speed: too little computation is done per byte
for anything but the disk to set the pace, so you won't get much faster.
 
Tim Smith

int countOccurence(InputStream is, byte value) throws IOException {
    int character, total = 0;
    while ((character = is.read()) != -1) {
        if ((byte) character == value)
            total++;
    }
    return total;
}

That's doing two conditionals per iteration. If you read the file in
chunks, and put the target value at the end of the chunks as a sentinel,
you can get it down to one conditional for bytes that aren't the target,
and two for bytes that are the target.
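A sketch of that sentinel trick (not from the original post; the buffer is one byte longer than the data read into it, so the inner scan needs no bounds check):

import java.io.*;

public class SentinelCount {
    // Count occurrences of 'target'; a copy of the target stored just past
    // the data acts as a sentinel, so non-matching bytes cost one comparison.
    static long count(InputStream is, byte target) throws IOException {
        byte[] buf = new byte[64 * 1024 + 1];   // one extra slot for the sentinel
        long total = 0;
        int n;
        while ((n = is.read(buf, 0, buf.length - 1)) != -1) {
            buf[n] = target;                    // sentinel: the scan always stops
            int i = 0;
            while (true) {
                while (buf[i] != target)        // one conditional per non-target byte
                    i++;
                if (i == n)                     // reached the sentinel: chunk done
                    break;
                total++;                        // a real match (second conditional)
                i++;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new FileInputStream(args[0]);
        System.out.println(count(is, (byte) 0x6A));
        is.close();
    }
}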
 
Renli

How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

Ulf

Create a FileInputStream. If you call it "fis", as in
FileInputStream fis = new FileInputStream("file.txt");
then you can use:

int value = fis.read();

This will read one byte of data. Now, how can you compare this with a
hex value of 6A? That depends on what form the value "6A" is in. If it's
a String, then convert it into an integer and compare it directly with
the "value" integer:

// hex_as_string holds the hex digits, e.g. "6a"
int hex_as_int = Integer.parseInt(hex_as_string, 16);   // radix 16 for hex

Needless to say, if you are already storing your hex value as an int,
then don't try to convert it with Integer.parseInt.

Next it's a simple matter of comparison: if (value == hex_as_int)
{ whatever; }

It's worth noting you can also convert the integer into a string and
compare it that way:

String value_as_hex_string = Integer.toHexString(value);
if (value_as_hex_string.equals(hex_as_string)) { whatever; }

If you're looking for efficiency you'll want to convert the value
you're looking for and not every byte you read out of the file. So
converting the hex string to an integer is likely your best bet.

Your second question is ambiguous. I'm not sure if you want to count
the number of values you find (just use int count; then count=count+1
when you find one....) or if you want to count them and store the
result as a hex string (Store it as an integer, then use
Integer.toHexString() to display, save or print it).
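Putting those pieces together (a sketch, not from the original post; the file name and hex string are taken from the command line here, and the stream is buffered as suggested elsewhere in the thread):

import java.io.*;

public class HexCount {
    public static void main(String[] args) throws IOException {
        // args[0] = file name, args[1] = hex value to count (defaults to 6A)
        String hexAsString = args.length > 1 ? args[1] : "6A";
        int hexAsInt = Integer.parseInt(hexAsString, 16);   // radix 16 for hex digits

        InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
        int value, count = 0;
        while ((value = in.read()) != -1)
            if (value == hexAsInt)
                count++;
        in.close();

        System.out.println(count + " occurrences of x'" + hexAsString + "'");
    }
}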

-
 
John W. Kennedy

Christian said:
Probably this will be even slower than just using an InputStream.

It isn't. It's faster than BufferedInputStream, using 1-byte read. Using
the combination of BufferedInputStream and a user byte[] buffer might be
faster, but I don't know.
MappedByteBuffer may have its uses, but not for something like this,
especially if the file is too big: the OS might have to cache the file
somewhere else on the disk.

"Too big", in this case, means >2GB. I think you do not know how
MappedByteBuffer works.
to the op:

what is wrong with

BufferedInputStream bis = new BufferedInputStream(new
FileInputStream(file));

int read;
int[] timesRead = new int[256];
while (-1 != (read = bis.read())) {
timesRead[read & 0xff]++;
}

to read in the file..
it is simple, you won't get much faster and afterwards you can look up
the occurences easily ...

MappedByteBuffer is faster.
 
John W. Kennedy

Christian said:
And that would be too slow: first mapping the file to memory, then
letting the algorithm dig through the whole mapping for counting.
Just reading each byte and then discarding it is far more
memory-friendly and faster.

I was right; you don't know how MappedByteBuffer works.

MappedByteBuffer works through the feature, common on modern operating
systems, that allows a user file to be temporarily attached to the
paging system. No buffers have to be copied, I/O is done in whole-page
units, and, typically, a shorter path is taken through the kernel.
 
Christian

John said:
Christian said:
Probably this will be even slower than just using an InputStream.

It isn't. It's faster than BufferedInputStream, using 1-byte read. Using
the combination of BufferedInputStream and a user byte[] buffer might be
faster, but I don't know.
MappedByteBuffer may have its uses, but not for something like this,
especially if the file is too big: the OS might have to cache the file
somewhere else on the disk.

"Too big", in this case, means >2GB. I think you do not know how
MappedByteBuffer works.
Exactly, I don't know; I just thought it would map the file to memory.

I only tried to use it once, for hashing a file, and found that the
performance decreased a lot compared to reading bytes into a ByteBuffer
directly.

My virtual memory file grew while mapping a file whose size was larger
than my 512 MiB of memory. Since the Java API states that it performs
best with larger files, my conclusion was that MappedByteBuffer is
rather something for machines with lots of RAM, i.e. servers.

I still don't see the point in mapping a complete file to RAM just to
read it sequentially. When I read the file, my OS will read ahead anyway
and keep the next bytes I want in memory, probably doing more
intelligent caching.

I know my solution might profit from using a ByteBuffer to read in some
bytes and spare the method call overhead of the single-byte read().
But it won't matter: the limitation will in all cases be the disk speed.
There is just too little computation going on per byte for anything but
the disk to set the pace. Though some benchmarks would be welcome.

to the op:

what is wrong with

BufferedInputStream bis = new BufferedInputStream(new
FileInputStream(file));

int read;
int[] timesRead = new int[256];
while (-1 != (read = bis.read())) {
timesRead[read & 0xff]++;
}

to read in the file..
it is simple, you won't get much faster and afterwards you can look up
the occurences easily ...

MappedByteBuffer is faster.
 
Roedy Green

How can I efficiently (!) and quickly search a file for a given hex value (e.g. x'6A'),
or rather: how can I count the number of occurrences of a given hex value?

indexOf will do it as fast as anything, once the data is in a String.
Or you can just index through the characters like this:

for (int i = 0; i < size; i++) {
    if (s.charAt(i) == thevalue) count++;
}
 
Knute Johnson

Arne said:
Christian said:
Probably this will be even slower than just using an InputStream.

MappedByteBuffer may have its uses, but not for something like this,
especially if the file is too big: the OS might have to cache the file
somewhere else on the disk.

It may be difficult to guarantee anything for something that is
so platform specific.

But common platforms should:
- map all the file into virtual memory
- only have some of it in physical memory
- read directly from the file with no use of pagefile or
other temp disk IO
to the op:

what is wrong with

BufferedInputStream bis = new BufferedInputStream(new
FileInputStream(file));

int read;
int[] timesRead = new int[256];
while (-1 != (read = bis.read())) {
timesRead[read & 0xff]++;
}

to read in the file..
it is simple, you won't get much faster and afterwards you can look up
the occurences easily ...

Even with BufferedInputStream I would go for multi byte read instead
of single byte read.

Arne

Just for curiosity I tried writing some simple programs to test relative
speed. Reading 64k at a whack is about twice as fast as reading a single
byte at a time from a BufferedInputStream with the default buffer size.
I noticed on my computer that that method was only using one core of the
processor. So I thought I would write another program that ran two
threads reading from a FileChannel. The docs imply that concurrent I/O
will block, but I thought that the non-reading thread could be searching
the buffer for the target byte. Only I can't get it to work correctly.
I've never used FileChannel I/O before and I don't think I've got
something right. Please take a look at the example below and tell me
where I've gone wrong. Thanks,

import java.io.*;
import java.nio.*;
import java.nio.channels.*;

public class test3 {
    public static void main(String[] args) throws Exception {
        String fname = "Fedora-8-x86_64-DVD.iso";
        int count = 0;
        FileInputStream fis = new FileInputStream(fname);
        FileChannel fc = fis.getChannel();
        ByteBuffer bybuf = ByteBuffer.allocate(65536);

        long start = System.currentTimeMillis();
        int n;
        do {
            bybuf.clear();
            n = fc.read(bybuf);
            bybuf.flip();
            if (n > 0) {
                try {
                    while (true)
                        if (bybuf.get() == (byte) 0xfc)
                            ++count;
                } catch (BufferUnderflowException bue) { }
            }
        } while (n != -1);

        long end = System.currentTimeMillis();
        System.out.println(end - start);
        System.out.println(count);
    }
}
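One way to get the overlap described above, a reader thread filling ByteBuffers and a second thread scanning them via a BlockingQueue, might look like this sketch (not from the original thread; buffer size and the 0xfc target byte are taken from the test above):

import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.util.concurrent.*;

public class QueuedCount {
    // Empty buffer used as an end-of-file marker on the queue.
    private static final ByteBuffer POISON = ByteBuffer.allocate(0);

    public static void main(String[] args) throws Exception {
        final BlockingQueue<ByteBuffer> queue =
            new ArrayBlockingQueue<ByteBuffer>(4);
        final FileChannel fc = new FileInputStream(args[0]).getChannel();

        // Reader thread: fill a fresh buffer per chunk and queue it.
        Thread reader = new Thread() {
            public void run() {
                try {
                    while (true) {
                        ByteBuffer buf = ByteBuffer.allocate(65536);
                        if (fc.read(buf) == -1)
                            break;
                        buf.flip();
                        queue.put(buf);
                    }
                    queue.put(POISON);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };

        long start = System.currentTimeMillis();
        reader.start();

        // Main thread: scan buffers as they arrive.
        long count = 0;
        while (true) {
            ByteBuffer buf = queue.take();
            if (buf == POISON)
                break;
            while (buf.hasRemaining())
                if (buf.get() == (byte) 0xfc)
                    count++;
        }
        reader.join();
        fc.close();

        System.out.println(System.currentTimeMillis() - start);
        System.out.println(count);
    }
}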
 
