Piggyback Encoding/Decoding on RandomAccessFile

Joshua Cranmer

Should I mingle with file descriptor, and get
associated input and output streams, and then
move forward?

The "standard way" (at least, all of the use cases I've ever had for
RandomAccessFile) effectively uses the methods that are associated with
java.io.DataInput to read data: read(byte[]), and read*().
 
Jan Burse

Joshua said:
The "standard way" (at least, all of the use cases I've ever had for
RandomAccessFile) effectively uses the methods that are associated with
java.io.DataInput to read data: read(byte[]), and read*().

I would like to use an arbitrary encoding/decoding on top of the
byte stream to get a character stream. But since RandomAccessFile
does not implement InputStream/OutputStream, I cannot create
an InputStreamReader/OutputStreamWriter on top.

Bye
 
markspace

Joshua said:
The "standard way" (at least, all of the use cases I've ever had for
RandomAccessFile) effectively uses the methods that are associated with
java.io.DataInput to read data: read(byte[]), and read*().

I would like to use an arbitrary encoding/decoding on top of the
byte stream to get a character stream. But since RandomAccessFile
does not implement InputStream/OutputStream, I cannot create
an InputStreamReader/OutputStreamWriter on top.

Bye

5 minutes, untested:


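package quicktest;

import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

/**
 * Adapts a RandomAccessFile to the InputStream interface.
 *
 * @author Brenden
 */
public class RndFileStream extends InputStream {

    private final RandomAccessFile raf;

    public RndFileStream(RandomAccessFile raf) {
        this.raf = raf;
    }

    /** Reads one byte at the current file position, or -1 at end of file. */
    @Override
    public int read() throws IOException {
        return raf.read();
    }

    /** Bulk read; without this override InputStream falls back to one read() call per byte. */
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return raf.read(b, off, len);
    }

    /** Moves the file position that the next read will start from. */
    public void seek(long pos) throws IOException {
        raf.seek(pos);
    }
}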
 
Lew

Joshua said:
The "standard way" (at least, all of the use cases I've ever had for
RandomAccessFile) effectively uses the methods that are associated with
java.io.DataInput to read data: read(byte[]), and read*().

I would like to use an arbitrary encoding/decoding on top of the
byte stream to get a character stream. But since RandomAccessFile
does not implement InputStream/OutputStream, I cannot create
an InputStreamReader/OutputStreamWriter on top.

No, but you can use the 'DataInput' (and 'DataOutput') methods, as Joshua indicated.

However, the notion of encoding / decoding from a random access file seems fraught with peril. You have to ensure that you start from a valid position in the file, not, for example, in the second byte of a three-byte character.
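For UTF-8 specifically, the "valid position" problem can be handled after the fact, because continuation bytes always have the bit pattern 10xxxxxx and so can be recognised on their own. A minimal sketch, assuming the file holds well-formed UTF-8; the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Sketch: move an arbitrary byte offset back to the start of a UTF-8 character. */
class Utf8Boundary {

    /** True for bytes of the form 10xxxxxx, which only occur inside a multi-byte character. */
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    /** Returns the largest offset <= pos that does not point at a continuation byte. */
    static long alignToCharStart(RandomAccessFile raf, long pos) throws IOException {
        long p = Math.min(pos, raf.length());
        while (p > 0 && p < raf.length()) {
            raf.seek(p);
            if (!isContinuation(raf.read())) {
                break; // p points at a lead byte or a single-byte character
            }
            p--;       // still inside a character, step back one byte
        }
        return p;
    }
}
```

This only works for self-synchronizing encodings such as UTF-8; it does nothing for stateful encodings.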
 
Jan Burse

markspace said:
public void seek( long pos ) throws IOException {
raf.seek(pos);
}


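public long skip(long n) throws IOException {
    long step = Math.max(n, 0);  // treat a negative count as "skip nothing"
    raf.seek(raf.getFilePointer() + step);  // the method is getFilePointer(), not getFilePosition()
    return step;
}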

Or some such....
 
Jan Burse

Jan said:
Should I mingle with file descriptor, and get
associated input and output streams, and then
move forward?

BTW: The file descriptor route works as follows:


RandomAccessFile raf = new RandomAccessFile(..., "r");
FileInputStream fi = new FileInputStream(raf.getFD());

... change file position ...

... piggyback an InputStreamReader ...

... change file position ...

... piggyback another InputStreamReader ...

fi.close();
raf.close();

Works also for FileOutputStream.

But I am not sure whether it is the preferred route...

Also since javadoc for InputStreamReader says:

"To enable the efficient conversion of bytes to
characters, more bytes may be read ahead
from the underlying stream than are necessary
to satisfy the current read operation."

Reading the file position after some read from InputStreamReader
will probably not give a reliable position.

But advantage over a normal InputStreamReader, which has only
a skip(), would be for example that a rewind() can be implemented
via a seek(0).
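A rough sketch of how the route above could look with a fresh reader per seek, so that bytes buffered by an earlier reader are never mixed with the new position. The class and method names are invented for illustration, and the stream is deliberately left unclosed because closing it would also close the descriptor shared with the raf:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.charset.Charset;

/** Sketch: decode characters at a given byte offset through the shared file descriptor. */
class FdReaderSketch {

    /** Seeks to pos, then decodes up to n chars with a reader created for this call only. */
    static String readCharsAt(RandomAccessFile raf, long pos, int n, Charset cs) throws IOException {
        raf.seek(pos);
        // New reader per seek: an older reader may still hold bytes read ahead from elsewhere.
        Reader reader = new InputStreamReader(new FileInputStream(raf.getFD()), cs);
        char[] buf = new char[n];
        int total = 0;
        while (total < n) {
            int got = reader.read(buf, total, n - total);
            if (got < 0) {
                break; // end of file
            }
            total += got;
        }
        // Not closed on purpose: close() would close the descriptor that raf still uses.
        return new String(buf, 0, total);
    }
}
```

After such a call raf.getFilePointer() should not be trusted as "the position after the last character", for exactly the read-ahead reason quoted above; the caller has to track offsets itself.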

Bye
 
Arne Vajhøj

How can I go about and encode/decode bytes read
from a random access file.

I am used to FileInputStream and FileOutputStream,
but I don't see right now how I could piggyback
encoding/decoding on a RandomAccessFile:

http://download.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html

Should I mingle with file descriptor, and get
associated input and output streams, and then
move forward?

I think the cleanest code would separate the logic
into two layers:
- a lower layer that uses RandomAccessFile and uses byte[]
for in and out data
- a higher layer that uses String getBytes and constructor
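Roughly like this, as a sketch only. The class and method names are invented, and the caller is assumed to know the byte offset and byte length of each string:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

/** Sketch of the two layers: raw byte[] access below, String conversion above. */
class LayeredRecordFile {

    // ---- lower layer: positions and byte[] only ----

    static byte[] readBytes(RandomAccessFile raf, long pos, int len) throws IOException {
        byte[] buf = new byte[len];
        raf.seek(pos);
        raf.readFully(buf); // throws EOFException if fewer than len bytes remain
        return buf;
    }

    static void writeBytes(RandomAccessFile raf, long pos, byte[] data) throws IOException {
        raf.seek(pos);
        raf.write(data);
    }

    // ---- higher layer: String.getBytes and the String(byte[], Charset) constructor ----

    static String readString(RandomAccessFile raf, long pos, int byteLen, Charset cs) throws IOException {
        return new String(readBytes(raf, pos, byteLen), cs);
    }

    /** Returns the number of bytes written, which the caller needs for later reads. */
    static int writeString(RandomAccessFile raf, long pos, String s, Charset cs) throws IOException {
        byte[] data = s.getBytes(cs);
        writeBytes(raf, pos, data);
        return data.length;
    }
}
```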

Arne
 
Eric Sosman

Joshua said:
The "standard way" (at least, all of the use cases I've ever had for
RandomAccessFile) effectively uses the methods that are associated with
java.io.DataInput to read data: read(byte[]), and read*().

I would like to use an arbitrary encoding/decoding on top of the
byte stream to get a character stream. But since RandomAccessFile
does not implement InputStream/OutputStream, I cannot create
an InputStreamReader/OutputStreamWriter on top.

For a completely "arbitrary" encoding, I think you're out of luck.
Stateful encodings (where the encoding of byte B[n] is a function of
B[n-1],B[n-2],...) make it difficult to begin in medias res: You cannot
know how to decode the first byte you read without already having seen
all its predecessors.

To support random access, where you'd like to jump directly to B[n]
without plowing through all that goes before, one usually addresses the
problem by restricting the valid n to multiples of some "block size,"
and encoding each "block" independently. You seek to the next lower
multiple of 32K or whatever, set your decryptor/compressor/decoder to
its initial state, and roll merrily along.

There's a problem if the encoding does not always map K input bytes
to f(K) output bytes: compressors, for example, output different amounts
of data depending on the values of the bytes compressed. There are two
principal methods for dealing with this difficulty:

1) Encode the original in blocks of 32K (say), and store each
encoded block in a file region that's sure to be large enough -- 40K,
perhaps. Pad with nulls or other junk values as needed, so long as
your decompressor can recognize and ignore the padding. Then original
byte N is in block number N/32K, whose encoding starts at (N/32K)*40K
in the file; seek to that spot and start decoding.

2) As before, encode the original in fixed-size blocks, but write
them cheek by jowl to the file. As you do so, also write an index file
that's essentially Map<OriginalByteNumber,EncodedByteNumber> for each
block boundary. Then original byte N is in the block beginning at
theMap.get(N/32K); seek to that spot and start decoding.
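The address arithmetic for both methods can be sketched as follows. The 32K and 40K figures are just the example sizes used above, and the long[] index stands in for the index file of method 2 (one entry per block, an assumption for illustration):

```java
/** Sketch: where to seek in the encoded file for original byte number n. */
class BlockAddressing {

    static final int BLOCK = 32 * 1024; // size of each original (unencoded) block
    static final int SLOT = 40 * 1024;  // fixed file region reserved per encoded block (method 1)

    /** Method 1: every encoded block starts at a multiple of the slot size. */
    static long slotOffset(long n) {
        return (n / BLOCK) * SLOT;
    }

    /** Method 2: look up the start of the encoded block in an index built while writing. */
    static long indexedOffset(long[] encodedBlockStarts, long n) {
        return encodedBlockStarts[(int) (n / BLOCK)];
    }

    /** In both methods: how many decoded bytes to discard after decoding from the block start. */
    static int offsetInBlock(long n) {
        return (int) (n % BLOCK);
    }
}
```

For example, original byte 70000 lies in block 2 (70000 / 32768), so method 1 seeks to 2 * 40960 = 81920, decodes from there, and discards the first 4464 decoded bytes.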

Elsethread you mention that RandomAccessFile provides neither
InputStream nor OutputStream. If you think about this a bit, you'll
see it's a natural consequence of the "Random" part: a Stream provides
the abstraction of a linear sequence of things, and does not admit of
leaping forward or backward to unrelated positions. Yes, there are
skip() and mark() and reset(), but I think you'll agree these are of
a different character than "read bytes 3000-3999, then 10000-10999,
then 936-22728." Streams are sequential; Random isn't.
 
Jan Burse

Eric said:
Elsethread you mention that RandomAccessFile provides neither
InputStream nor OutputStream. If you think about this a bit, you'll
see it's a natural consequence of the "Random" part: a Stream provides
the abstraction of a linear sequence of things, and does not admit of
leaping forward or backward to unrelated positions. Yes, there are
skip() and mark() and reset(), but I think you'll agree these are of
a different character than "read bytes 3000-3999, then 10000-10999,
then 936-22728." Streams are sequential; Random isn't.

It seems that the FileInputStream reacts to what is
done with the underlying RandomAccessFile, since it is
not buffering and since it shares the same file descriptor.

But I have only done a small test. Something along these lines:

Writing "Hello World!" to a file.

Random access opening the file.

Doing a FileInputStream on the random access
file via the file descriptor.

Seeking the random access file to position 6.

Reading from FileInputStream, and you get "World!"

So they are somehow interlocked.
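The test described above could be reconstructed roughly as follows. This is a sketch, not the code that was actually run; the class name and temp file are invented:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Sketch: a FileInputStream built on raf.getFD() follows the raf's seek position. */
class SharedPositionTest {

    static String readFromPosition(File file, long pos) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            FileInputStream in = new FileInputStream(raf.getFD());
            raf.seek(pos); // seek on the raf, then read through the stream
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[64];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } // closing raf closes the shared descriptor, so the stream is not closed separately
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("shared-position", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), "Hello World!".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readFromPosition(file, 6));
    }
}
```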

Bye
 
Jan Burse

Jan said:
So they are somehow interlocked.

From the file channel java doc we have:

"Where the file channel is obtained from an existing stream or
random access file then the state of the file channel is
intimately connected to that of the object whose getChannel
method returned the channel. Changing the channel's position,
whether explicitly or by reading or writing bytes, will change
the file position of the originating object, and vice versa.
Changing the file's length via the file channel will change the
length seen via the originating object, and vice versa.
Changing the file's content by writing bytes will change the
content seen by the originating object, and vice versa."

In case that the file input/output stream is interchangeable
with the file channel, then they are interwoven.

Bye
 
Roedy Green

How can I go about and encode/decode bytes read
from a random access file.

A random access file is not generally something you create with a word
processor. It is an internal file created programmatically. So I
would use its native binary format, e.g. RandomAccessFile.writeUTF,
writeLong...

Another way to handle it is to use a ByteArrayOutputStream to
create an array of bytes representing serialised objects and write
them as byte[] with RandomAccessFile.write(byte[]). (writeBytes takes
a String and writes only the low byte of each char.)

See http://mindprod.com/applet/fileio.html for sample code.

If you really want to read and write encoded strings, read them as
byte[] and decode them in RAM to Strings. See
http://mindprod.com/jgloss/encoding.html

Encoded strings are tricky to handle in a RandomAccessFile because it
is hard to pin down the length. You have various flavours of single
byte and multi-byte chars intermixed. You pretty well have to embed
the length in BYTES along with the data. That is pretty much what
RandomAccessFile.readUTF does, except it uses a variant of UTF-8 all
the time.
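A sketch of the "embed the length in bytes" idea for an arbitrary charset. The 4-byte int prefix and the names are illustrative choices, not a fixed format:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

/** Sketch: strings stored as a 4-byte length in BYTES followed by the encoded bytes. */
class LengthPrefixed {

    /** Writes at the current file position and leaves the position after the record. */
    static void writeString(RandomAccessFile raf, String s, Charset cs) throws IOException {
        byte[] bytes = s.getBytes(cs);
        raf.writeInt(bytes.length); // length in bytes, not in chars
        raf.write(bytes);
    }

    /** Reads one record at the current file position and decodes it in memory. */
    static String readString(RandomAccessFile raf, Charset cs) throws IOException {
        int len = raf.readInt();
        byte[] bytes = new byte[len];
        raf.readFully(bytes);
        return new String(bytes, cs);
    }
}
```

To jump straight to a record, remember raf.getFilePointer() before each writeString and seek to that value before readString.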

--
Roedy Green Canadian Mind Products
http://mindprod.com
Capitalism has spurred the competition that makes CPUs faster and
faster each year, but the focus on money makes software manufacturers
do some peculiar things like deliberately leaving bugs and deficiencies
in the software so they can soak the customers for upgrades later.
Whether software is easy to use, or never loses data, when the company
has a near monopoly, is almost irrelevant to profits, and therefore
ignored. The manufacturer focuses on cheap gimmicks like dancing paper
clips to dazzle naive first-time buyers. The needs of existing
experienced users are almost irrelevant. I see software rental as the
best remedy.
 
Eric Sosman

From the file channel java doc we have:

"Where the file channel is obtained from an existing stream or
random access file then the state of the file channel is
intimately connected to that of the object whose getChannel
method returned the channel. Changing the channel's position,
whether explicitly or by reading or writing bytes, will change
the file position of the originating object, and vice versa.
Changing the file's length via the file channel will change the
length seen via the originating object, and vice versa.
Changing the file's content by writing bytes will change the
content seen by the originating object, and vice versa."

In case that the file input/output stream is interchangeable
with the file channel, then they are interwoven.

But how do you maintain the state of the encoder/decoder,
if that state is a function of anything more than just the
file position itself? If the decoder must process all the
bytes prior to B[n] to get itself into the proper state for
decoding B[n], random access just makes no sense.

Well, I guess you *could* make a preliminary sequential
pass over the data, and build a Map<ByteOffset,DecoderState> for
selected positions. You could even build the map incrementally,
adding new entries whenever the random access pattern takes you
into formerly uncharted territory. Either way, though, the first
exploration of each byte position has to be sequential to allow
the decoder to accumulate its state. And if you ever *write* to
the middle of the file (which changes the encoding of all the
bytes that follow), ... I think the cases in which such a scheme
might be practical would be "unusual," if not downright "contrived."
 
Jan Burse

Eric said:
But how do you maintain the state of the encoder/decoder,
if that state is a function of anything more than just the
file position itself? If the decoder must process all the
bytes prior to B[n] to get itself into the proper state for
decoding B[n], random access just makes no sense.

I guess in the optimal case, the decoder only needs state
inside reading a single char. So it needs state until it
has consumed all bytes to produce a single char. This
holds for sure for UTF-8 and UTF-16.

So since I am piggybacking a decoder/encoder and then
reading and writing chars, not much evil should happen.
But of course it depends on the decoder implementation,
and whether it does some extra read ahead, like a buffered
read, which is also a form of state.

Presumably the decoder object has some API. I didn't
yet check. The encoder can be flushed via the normal
character stream flush operation. So there should not be much
of an issue. But the decoder, I am not sure how to control.

Just imagine a lexicon implementation via random access,
so the file includes some index and file offsets. When
presented with a word, I would first go through the index,
and then go via the file offset to the gloss, and read
it by piggybacking a decoder.

But what Roedy and others were suggesting, is of course
also a solution, to first obtain the bytes and then
do a in-memory conversion.

But if the gloss entries are just subtrees in (character
encoded) XML, then I could read/decode and look for the
end tag, and would not need to store an entry length.
I would only need the ability to navigate to entry
offsets.

For one time access a skip() would be ok. But for random
access, I guess this type of file is made for that!

Bye
 
Lew

Roedy Green wrote in his tag line:
Capitalism has spurred the competition that makes CPUs faster and
faster each year, but the focus on money makes software manufacturers
do some peculiar things like deliberately leaving bugs and deficiencies
in the software so they can soak the customers for upgrades later.

Facts not in evidence.

You have not established that bugs are left in deliberately, nor that even if they are it's the "focus on money" that makes them do that.

Pure B.S., like most pseudo-socialist rhetoric.
 
Arne Vajhøj

Roedy Green wrote in his tag line:

Facts not in evidence.

You have not established that bugs are left in deliberately, nor that even if they are it's the "focus on money" that makes them do that.

I don't think it is general practice.

The invisible hand should eventually take care of such companies
if there is any competition.

But hey - we do not know how Roedy does software - we can only say that
we don't do it that way!

Arne
 
Stanimir Stamenkov

Thu, 03 Nov 2011 19:18:50 +0100, /Jan Burse/:
How can I go about and encode/decode bytes read
from a random access file.

I am used to FileInputStream and FileOutputStream,
but I don't see right now how I could piggyback
encoding/decoding on a RandomAccessFile:

http://download.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html

Should I mingle with file descriptor, and get
associated input and output streams, and then
move forward?

I think it is o.k. to use the FileDescriptor obtained from a
RandomAccessFile to create a FileInputStream if you need to read the
contents of the file through InputStream interface further.

One could also obtain an InputStream for the RandomAccessFile using its
FileChannel:

import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;

RandomAccessFile raf;
...
InputStream in = Channels.newInputStream(raf.getChannel());

java.nio.channels.Channels also provides:

import java.io.Reader;

Reader reader = Channels.newReader(raf.getChannel(), "UTF-8");
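Put together as something runnable, a sketch could look like this. The class and method names are made up, and a new reader is created after each seek because the reader may buffer ahead of what it returns:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;

/** Sketch: read one UTF-8 line starting at a byte offset, via the raf's channel. */
class ChannelReaderSketch {

    static String readLineAt(RandomAccessFile raf, long pos) throws IOException {
        raf.seek(pos); // the channel shares its position with the raf
        BufferedReader reader =
                new BufferedReader(Channels.newReader(raf.getChannel(), "UTF-8"));
        // The reader is not closed here: closing it would close the channel and the raf.
        return reader.readLine();
    }
}
```

The offset passed in has to be the start of a character, and the raf's position after the call is wherever the reader's buffering left it, not the end of the line.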
 
Jan Burse

Hi,

Thank you for pointing to this API. Wasn't aware.
I just had a look at the API, and also found:

public static Reader newReader(ReadableByteChannel ch,
CharsetDecoder dec,
int minBufferCap)

So via the minBufferCap parameter the problem of
overreading the underlying raf can also be solved to
some extent, if this is a problem.

Thanks

Bye
 
Lew

Jan said:
Thank you for pointing to this API. Wasn't aware.
I just had a look at the API, and also found:

public static Reader newReader(ReadableByteChannel ch,
CharsetDecoder dec,
int minBufferCap)

So via the minBufferCap parameter the problem of
overreading the underlying raf can also be solved to
some extent, if this is a problem.

This has been a very interesting question and ensuing thread. I've seen this sort of multiple interaction (conflict) of resource clients a couple of times but most of those cases were unintentional and considered bugs.

One lesson is that such situations are always fraught with peril.

Which is why they pay us the big bucks. Fraught with peril is the programmer's bread and butter. But you can't be careless, and the questions addressed in this conversation are key.

Inevitably I wonder if the functional need can be served at a different level of the architecture. Here's the syllogism:

The synchronization between resource clients, in this case a random access mechanism and a stream mechanism attached to the same file descriptor, will require either a unified view that they share or else completely independent views with no interaction between them whatsoever.

The questions here address how the problem would be solved in breadth at the resource-access level. The baseline model is two concurrent clients with shared state.

What about a solution in depth with a serial model?

'DataInput' and 'DataOutput' make good first responders to mixed-format binary files. That's why 'RandomAccessFile' implements them. They're intended for the sort of low-level operations with manual bookkeeping for data location as discussed. They are the clear choice for direct access to the resource.

If you need sequential stream access to all or part of that data, have sequential clients work with scraps piped through the lower-lying direct access instances rather than fight over the same prey.
 
Jan Burse

Lew said:
This has been a very interesting question and
ensuing thread. I've seen this sort of multiple
interaction (conflict) of resource clients a couple
of times but most of those cases were unintentional
and considered bugs.

If you open the raf with mode "r" no such issues
arise. If you open the raf with mode "rw" then
of course you have an update problem if multiple
threads access the raf, not to speak of what happens
across processes.

But the channel also offers locks. You can even
restrict them to byte ranges. And they are even
synchronized across processes. It happens that I
have used the locks for byte based access, but I
guess they are also useful when a character reader/
writer is placed on the byte stream.
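For the byte based case, a range lock could be sketched like this, assuming Java 7 or later (FileLock is AutoCloseable there) and a raf opened in "rw" mode; the names are invented. Note that these locks are held on behalf of the whole JVM, so they coordinate processes, not threads within one process:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

/** Sketch: write a byte range while holding an exclusive lock on exactly that range. */
class RegionLockSketch {

    static void writeLocked(RandomAccessFile raf, long pos, byte[] data) throws IOException {
        FileChannel ch = raf.getChannel();
        // lock(position, size, shared): shared = false requests an exclusive lock.
        try (FileLock lock = ch.lock(pos, data.length, false)) {
            raf.seek(pos);
            raf.write(data);
        } // the lock is released here, the channel and the raf stay open
    }
}
```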

Bye
 
