Dual binary/character streams?

Discussion in 'Java' started by Adam Warner, Nov 6, 2005.

  1. Adam Warner

    Adam Warner Guest

    Hi all,

    Suppose a stream contains text and binary data. The text will describe how
    many bytes to read as binary data before switching back to reading text.
    It appears Java provides no library upon which to reasonably build this
    functionality!

    Let's make up an example:

    "character data" #10 __________"character data continues"
                         ^        ^
                         |10octets|

    The token #10 means: read 10 bytes of binary data. Thereafter continue
    reading characters in the default character set.

    An InputStream supports reading binary data. But an InputStreamReader is
    permitted to act like a BufferedReader: "To enable the efficient
    conversion of bytes to characters, more bytes may be read ahead from the
    underlying stream than are necessary to satisfy the current read operation."

    Thus an InputStreamReader cannot be relied upon to just read a character.
    It may read ahead, removing the binary data from the InputStream.

    Is there a character reader for Java that only reads the number of bytes
    necessary to satisfy a read() request?
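
    The read-ahead can be observed directly. The following is a minimal
    sketch, assuming an OpenJDK-style InputStreamReader; the class name
    ReadAheadDemo and its helper method are hypothetical. It asks the reader
    for one character and then counts how many bytes have disappeared from
    the underlying stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
class ReadAheadDemo {
    /** Returns how many bytes the reader took from the stream to deliver one char. */
    static int bytesConsumedForOneChar(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        InputStreamReader reader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        reader.read();                       // ask for a single character
        return data.length - in.available(); // bytes no longer readable from 'in'
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "A 0123456789 trailing".getBytes(StandardCharsets.US_ASCII);
        System.out.println("bytes consumed for one char: " + bytesConsumedForOneChar(data));
    }
}
```

    If the count is larger than one, any binary data that followed the
    character is already inside the reader's private buffer.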

    Regards,
    Adam
     
    Adam Warner, Nov 6, 2005
    #1

  2. Adam Warner

    Roedy Green Guest

    On Sun, 06 Nov 2005 22:30:37 +1300, Adam Warner wrote,
    quoted or indirectly quoted someone who said:

    >Let's make up an example:
    >
    > "character data" #10 __________"character data continues"
    >                      ^        ^
    >                      |10octets|


    You use a DataInputStream. You read the binary with readInt,
    readDouble, etc.

    You read the character data, presumably 8-bit encoded, as bytes, then
    convert the byte array to a String using the desired encoding.

    // byte[] -> String
    String t = new String( b , "Cp1252" /* encoding */ );

    If you have control over the stream, you get the person sending it to
    you to encode the strings in counted UTF-8 format. Then you can read
    them easily with readUTF.
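
    A minimal sketch of that counted layout, assuming the sender can be
    changed to use it. The class name CountedRecordDemo and the field order
    (string, int count, raw bytes, string) are illustrative assumptions, not
    a format anyone in this thread has specified.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical record layout: counted string, counted binary, counted string.
class CountedRecordDemo {
    final String before;
    final byte[] binary;
    final String after;

    CountedRecordDemo(String before, byte[] binary, String after) {
        this.before = before;
        this.binary = binary;
        this.after = after;
    }

    static byte[] write(String before, byte[] binary, String after) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(before);        // 2-byte length prefix, then modified UTF-8
        out.writeInt(binary.length); // explicit count of binary octets
        out.write(binary);
        out.writeUTF(after);
        out.flush();
        return bytes.toByteArray();
    }

    static CountedRecordDemo read(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        String before = in.readUTF();
        byte[] binary = new byte[in.readInt()];
        in.readFully(binary);        // reads exactly binary.length bytes
        String after = in.readUTF();
        return new CountedRecordDemo(before, binary, after);
    }

    public static void main(String[] args) throws IOException {
        byte[] tenOctets = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        CountedRecordDemo r = read(write("character data", tenOctets, "character data continues"));
        System.out.println(r.before + " #" + r.binary.length + " " + r.after);
    }
}
```

    Because every field carries its own length, nothing ever has to be
    decoded in order to find where the next field starts.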
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 6, 2005
    #2

  3. Adam Warner

    Roedy Green Guest

    On Sun, 06 Nov 2005 09:39:05 GMT, Roedy Green wrote,
    quoted or indirectly quoted someone who said:

    >> "character data" #10 __________"character data continues"
    >>                      ^        ^
    >>                      |10octets|

    >
    >you use a DataInputStream. You read the binary with readInt
    >readDouble etc.


    If these are little-endian, use LEDataInputStream. See
    http://mindprod.com/products1.html#LEDATASTREAM

    Or talk the other end into generating the binary in network
    (big-endian) order.
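
    For readers without LEDataInputStream, here is a minimal sketch of the
    same idea using only the standard library's java.nio.ByteBuffer, which
    is a swapped-in alternative rather than the class recommended above. The
    class name LittleEndianDemo is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class: interprets four bytes as a little-endian int.
class LittleEndianDemo {
    static int readLittleEndianInt(byte[] fourBytes) {
        return ByteBuffer.wrap(fourBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    static int readBigEndianInt(byte[] fourBytes) {
        // Big-endian (network order) is ByteBuffer's default.
        return ByteBuffer.wrap(fourBytes).getInt();
    }

    public static void main(String[] args) {
        byte[] wire = { 0x78, 0x56, 0x34, 0x12 };
        System.out.println(Integer.toHexString(readLittleEndianInt(wire))); // prints 12345678
        System.out.println(Integer.toHexString(readBigEndianInt(wire)));    // prints 78563412
    }
}
```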

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 6, 2005
    #3
  4. Adam Warner

    Adam Warner Guest

    On Sun, 06 Nov 2005 09:39:05 +0000, Roedy Green wrote:
    > On Sun, 06 Nov 2005 22:30:37 +1300, Adam Warner wrote,
    > quoted or indirectly quoted someone who said:
    >
    >>Let's make up an example:
    >>
    >> "character data" #10 __________"character data continues"
    >>                      ^        ^
    >>                      |10octets|

    >
    > you use a DataInputStream. You read the binary with readInt
    > readDouble etc.
    >
    > You read the character data, presumably 8-bit encoded as bytes. then
    > convert the byte array to a string using the desired encoding.


    Thanks for the suggestion Roedy. I'm attempting to avoid any presumption
    about the default character set (it could for example be UTF-8 or UTF-16)
    so this isn't a general solution.

    > // byte[] -> String
    > String t = new String( b , "Cp1252" /* encoding */ );


    At this point one doesn't know where the characters terminate and the
    binary data begins.

    One could say InputStreamReader is missing a readByte() method. It is
    permitted to read ahead bytes yet it provides no way to access
    those subsequent bytes.

    By reading ahead and not providing a readByte method the Java standard
    library appears to provide no reasonable way to decode a char (in an
    arbitrary character encoding) within a binary stream while preserving
    the rest of the binary data.

    > If you have control over the stream, you get the person sending it to
    > you to encode the strings in counted UTF-8 format. Then you can read
    > them easily with readUTF.


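    The character encoding is not fixed. And readUTF is Java-specific junk.
    It's impressive how Sun managed to come up with a way to waste 50% more
    space than four-byte encoded UTF-8 code points: readUTF's modified UTF-8
    writes each supplementary code point as two three-byte surrogates, six
    bytes in all.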
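
    The size difference can be checked with a few lines. This is a minimal
    sketch; the class name ModifiedUtf8SizeDemo is hypothetical, and U+10302
    is used only as an arbitrary supplementary code point.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing writeUTF's encoding with standard UTF-8.
class ModifiedUtf8SizeDemo {
    /** Payload size written by writeUTF, excluding its 2-byte length prefix. */
    static int modifiedUtf8Bytes(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        return bytes.size() - 2;
    }

    static int standardUtf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) throws IOException {
        String supplementary = new String(Character.toChars(0x10302));
        System.out.println("writeUTF payload: " + modifiedUtf8Bytes(supplementary)); // 6
        System.out.println("standard UTF-8:   " + standardUtf8Bytes(supplementary)); // 4
    }
}
```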

    Regards,
    Adam
     
    Adam Warner, Nov 6, 2005
    #4
  5. Adam Warner

    Roedy Green Guest

    On Mon, 07 Nov 2005 00:20:29 +1300, Adam Warner wrote,
    quoted or indirectly quoted someone who said:

    >At this point one doesn't know where the characters terminate and the
    >binary data begins.


    If you can't tell, your protocol is broken. You will have to do
    something to fix it. I suggest using counted UTF strings.

    Maybe you mean you have to KNOW the lengths in your code to read the
    stream, that they are not embedded in the stream and there is no
    format description in the stream.

    There is no way you can process a stream without knowing the encoding.
    The encoding may be 7-bit ASCII, but you still have to know what it
    is.

    You can COPY such a stream, but you can't process it.

    The beauty of UTF-8 is that it works for any platform and you don't
    have to customize it for different locales.

    If this stream is a legacy, and you can't change its format at all,
    and this stream was actually read and processed at one point in
    history, there must be some hidden assumptions you can take advantage
    of. e.g. null terminated strings, the encoding used, fixed lengths
    of fields, a file header ...
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 6, 2005
    #5
  6. Adam Warner

    Chris Uppal Guest

    Adam Warner wrote:

    > Suppose a stream contains text and binary data. The text will describe how
    > many bytes to read as binary data before switching back to reading text.
    > It appears Java provides no library upon which to reasonably build this
    > functionality!


    Does your format have a reliable way of spotting the end of a stream of
    character data /without/ decoding it? E.g. in HTTP the headers can specify
    the length of the (binary) body, but the headers can be separated reliably from
    the body before they are decoded. Or, failing that, is there a hard limit to
    how many bytes of character data are allowed in one "chunk" (so that you can
    make a copy of that data and decode it independently)?
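
    A minimal sketch of the HTTP-like case, where the text can be cut off at
    the byte level before it is decoded. It assumes an encoding in which the
    byte 0x0A only ever means a line feed (true for UTF-8 and the ISO-8859
    family, not for UTF-16). The "length=N" header and the class name
    DelimitedHeaderDemo are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the "length=N" header format is an assumption.
class DelimitedHeaderDemo {
    /** Collects raw bytes up to (not including) the first LF, then decodes them in one go. */
    static String readHeaderLine(InputStream in, Charset cs) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            line.write(b);
        }
        return new String(line.toByteArray(), cs);
    }

    /** Reads exactly 'length' raw bytes; throws EOFException if the stream is too short. */
    static byte[] readBody(InputStream in, int length) throws IOException {
        byte[] body = new byte[length];
        new DataInputStream(in).readFully(body);
        return body;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream message = new ByteArrayOutputStream();
        message.write("length=3\n".getBytes(StandardCharsets.UTF_8));
        message.write(new byte[] { (byte) 0xFF, 0x00, 0x7F });
        message.write("text resumes\n".getBytes(StandardCharsets.UTF_8));

        InputStream in = new ByteArrayInputStream(message.toByteArray());
        String header = readHeaderLine(in, StandardCharsets.UTF_8);
        int length = Integer.parseInt(header.substring("length=".length()));
        byte[] body = readBody(in, length);
        System.out.println(header + " -> " + body.length + " binary bytes, then: "
                + readHeaderLine(in, StandardCharsets.UTF_8));
    }
}
```

    No Reader is involved, so nothing reads past the delimiter.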

    If not then the format is rather awkwardly designed, and you will have to mess
    around with more complicated code to unravel it character-by-character. I
    suggest using a java.nio.charset.CharsetDecoder directly.

    BTW, since you will have to work character-by-character, even if you were able
    to use a stock InputStreamReader (if it didn't read ahead), it wouldn't be
    buying you much at all compared with using your own CharsetDecoder.

    BTW2. Don't forget that Unicode characters, unlike Java chars, are not limited
    to 16 bits. So one logical character of input may require two actual chars of
    output.
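
    A small illustration of that last point, using only java.lang.Character.
    The class name CodePointVsCharDemo is hypothetical.

```java
// Hypothetical demo class: how many Java chars one code point occupies.
class CodePointVsCharDemo {
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(charsNeeded(0x4E8C));  // 1: fits in a single char
        System.out.println(charsNeeded(0x10302)); // 2: needs a surrogate pair
    }
}
```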

    -- chris
     
    Chris Uppal, Nov 6, 2005
    #6
  7. Adam Warner

    Adam Warner Guest

    On Sun, 06 Nov 2005 11:50:31 +0000, Roedy Green wrote:
    > On Mon, 07 Nov 2005 00:20:29 +1300, Adam Warner wrote,
    > quoted or indirectly quoted someone who said:
    >
    >>At this point one doesn't know where the characters terminate and the
    >>binary data begins.

    >
    > If you can't tell, your protocol is broken.


    No. You are simply unable to solve the stated issue: "By reading ahead and
    not providing a readByte method the Java standard library appears to
    provide no reasonable way to decode a char (in an arbitrary character
    encoding) within a binary stream while preserving the rest of the binary
    data."

    > You will have to do something to fix it. I suggest using counted UTF
    > strings.
    >
    > Maybe you mean you have to KNOW the lengths in your code to read the
    > stream., that they are not embedded in the stream and there is no format
    > description in the stream.
    >
    > There is no way you can process a stream without knowing the encoding.
    > The encoding may be 7-bit ASCII, but you still have to know what it is.


    This is not the issue. InputStreamReader has a default encoding and a
    named encoding can also be specified. Unfortunately it may read extra
    bytes from the underlying binary stream without providing a way to access
    them as binary data.

    > You can COPY such a stream, but you can't process it.
    >
    > The beauty of UTF-8 is that it works for any platform and you don't have
    > to customize it for different locales.
    >
    > If this stream is a legacy, and you can't change its format at all, and
    > this stream was actually read and processed at one point in history,
    > there must be some hidden assumptions you can take advantage of. e.g.
    > null terminated strings, the encoding used, fixed lengths of fields, a
    > file header ...


    There is no hidden assumption. The decoding information is contained in
    the character stream. Conceptually it's a type of bivalent stream:
    <http://www.franz.com/support/documentation/7.0/doc/socket.htm#socket-characteristics-1>
    ("Bivalent means that the stream will accept text and binary stream
    functions. That is, you can write-byte or write-char, read-byte or
    read-char.")

    The protocol is: Decode and interpret a string token. The interpretation
    of the token determines whether the next datum in the stream will be read
    as a character or a byte.

    Given the specification of InputStreamReader this protocol appears to be
    difficult to implement. A simple solution is unlikely.

    Regards,
    Adam
     
    Adam Warner, Nov 6, 2005
    #7
  8. Adam Warner

    Chris Uppal Guest

    Adam Warner wrote:

    > One could say InputStreamReader is missing a readByte() method. It is
    > permitted to read ahead bytes yet it provides no way to access
    > those subsequent bytes.


    One other problem -- more than just being unable to retrieve bytes that it
    has read ahead -- is that those bytes might form an invalid or illegal
    sequence for the given decoder. Logically it should not throw an error until
    it is asked for the "character" at the illegal position, but I bet it's not
    implemented that way.

    -- chris
     
    Chris Uppal, Nov 6, 2005
    #8
  9. Adam Warner

    Adam Warner Guest

    On Sun, 06 Nov 2005 12:44:30 +0000, Chris Uppal wrote:
    > Adam Warner wrote:
    >
    >> Suppose a stream contains text and binary data. The text will describe
    >> how many bytes to read as binary data before switching back to reading
    >> text. It appears Java provides no library upon which to reasonably
    >> build this functionality!

    >
    > Does your format have a reliable way of spotting the end of a stream of
    > character data /without/ decoding it ?


    No. While I can come up with a different format (e.g. encoding the binary
    data in base 64) I'd like to solve the problem as specified.

    > E.g. in HTTP the headers can specify the length of the (binary) body,
    > but the headers can be separated reliably from the body before they are
    > decoded. Or, failing that, is there a hard limit to how many bytes of
    > character data are allowed in one "chunk" (so that you can make a copy
    > of that data and decode it independently) ?


    Since I'll be supporting arbitrary precision integers I guess the
    character data is effectively unlimited.

    > If not then the format is rather awkwardly designed, and you will have
    > to mess around with more complicated code to unravel it
    > character-by-character. I suggest using a
    > java.nio.charset.CharsetDecoder directly.


    The NIO could be helpful. But I still wouldn't know where to cut off a
    chunk from the stream without potentially splitting a character and
    breaking the decoding.

    > BTW, since you will have to work character-by-character, even if you
    > were able to use a stock InputStreamReader (if it didn't read ahead), it
    > wouldn't be buying you much at all compared with using your own
    > CharsetDecoder.
    >
    > BTW2. Don't forget that Unicode characters, unlike Java chars, are not
    > limited to 16 bits. So one logical character of input may require two
    > actual chars of output.


    Indeed. Java chars are not only sufficient for building code points but
    also serve as input for decoding graphemes via IBM's ICU4J library:
    <http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/BreakIterator.html>

    Thanks for the ideas Chris.

    Regards,
    Adam
     
    Adam Warner, Nov 6, 2005
    #9
  10. Adam Warner

    Chris Uppal Guest

    Adam Warner wrote:

    > No. While I can come up with a different format (e.g. encoding the binary
    > data in base 64) I'd like to solve the problem as specified.


    Hmm. I'm starting to think that you might want to take that option...


    > > If not then the format is rather awkwardly designed, and you will have
    > > to mess around with more complicated code to unravel it
    > > character-by-character. I suggest using a
    > > java.nio.charset.CharsetDecoder directly.

    >
    > The NIO could be helpful. But I still wouldn't know where to cut off a
    > chunk from the stream without potentially splitting a character and
    > breaking the decoding.


    What I had in mind was a simple loop where, at each step, you feed 1 byte to
    the CharsetDecoder and get back 0, 1, or 2 chars.

    Unfortunately I was wrong. Although the documentation doesn't say so, and
    although the design is clearly set up to be used like that, it doesn't work.
    At least the UTF-8 decoder doesn't work if used like that. It doesn't retain
    enough state to remember that it has seen the start of an encoded character,
    and so it cannot be trusted to decode successfully across buffer boundaries (I
    don't know whether that's a bug or simply that it isn't expected to be able to
    do so). So I think that the loop has to look more like:

    0) clear a small buffer
    1) get the next byte
    2) append it to the small buffer
    3) attempt to decode that into up to 2 chars
    4) if that works[*] then process the chars and goto (0)
    5) goto (1)

    and that -- when expressed using the magic of nio ByteBuffers and
    CharBuffers -- looks as if it'd be extremely messy...

    ([*] by "works" I mean produces at least 1 char)
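
    The steps above can be sketched roughly as follows. This is a minimal
    sketch under stated assumptions, not a tested library: the class name
    ByteAtATimeDecoder, the 16-byte pending buffer, and the choice to call
    decode() with endOfInput=false (so an incomplete sequence is left in the
    input buffer and reported as underflow) are all illustrative choices.
    It also assumes each decoding step yields a single code point.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are illustrative only.
class ByteAtATimeDecoder {
    private final InputStream in;
    private final CharsetDecoder decoder;
    private final ByteBuffer pending = ByteBuffer.allocate(16); // bytes read but not yet decoded
    private final CharBuffer out = CharBuffer.allocate(2);      // room for a surrogate pair

    ByteAtATimeDecoder(InputStream in, CharsetDecoder decoder) {
        this.in = in;
        this.decoder = decoder;
    }

    /** Returns the next code point, or -1 if the stream ends on a character boundary. */
    int readCodePoint() throws IOException {
        out.clear();
        while (true) {
            int b = in.read();                        // step 1: get the next byte
            if (b == -1) {
                if (pending.position() > 0) throw new IOException("stream ended inside a character");
                return -1;
            }
            if (!pending.hasRemaining()) throw new IOException("no character after 16 bytes");
            pending.put((byte) b);                    // step 2: append it to the small buffer
            pending.flip();
            CoderResult result = decoder.decode(pending, out, false); // step 3: attempt to decode
            pending.compact();                        // keep bytes the decoder left unconsumed
            if (result.isError()) result.throwException(); // malformed: more bytes will not help
            if (out.position() > 0) {                 // step 4: at least one char was produced
                return Character.codePointAt(out.array(), 0, out.position());
            }
        }                                             // step 5: otherwise fetch another byte
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { 0x32, (byte) 0xD0, (byte) 0xB0, (byte) 0xFF, 0x00 };
        InputStream in = new ByteArrayInputStream(data);
        ByteAtATimeDecoder d = new ByteAtATimeDecoder(in, StandardCharsets.UTF_8.newDecoder());
        System.out.println(Integer.toHexString(d.readCodePoint())); // 32
        System.out.println(Integer.toHexString(d.readCodePoint())); // 430
        System.out.println(in.read());                              // 255: binary data untouched
    }
}
```

    Since the underlying stream is only ever asked for one byte at a time,
    whatever follows the last decoded character is still there to be read as
    binary.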

    -- chris
     
    Chris Uppal, Nov 6, 2005
    #10
  11. Adam Warner

    Knute Johnson Guest

    Adam Warner wrote:
    > Hi all,
    >
    > Suppose a stream contains text and binary data. The text will describe how
    > many bytes to read as binary data before switching back to reading text.
    > It appears Java provides no library upon which to reasonably build this
    > functionality!
    >
    > Let's make up an example:
    >
    > "character data" #10 __________"character data continues"
    >                      ^        ^
    >                      |10octets|
    >
    > The token #10 means: read 10 bytes of binary data. Thereafter continue
    > reading characters in the default character set.
    >
    > An InputStream supports reading binary data. But an InputStreamReader is
    > permitted to act like a BufferedReader: "To enable the efficient
    > conversion of bytes to characters, more bytes may be read ahead from the
    > underlying stream than are necessary to satisfy the current read operation."
    >
    > Thus an InputStreamReader cannot be relied upon to just read a character.
    > It may read ahead, removing the binary data from the InputStream.
    >
    > Is there a character reader for Java that only reads the number of bytes
    > necessary to satisfy a read() request?
    >
    > Regards,
    > Adam


    I don't know why anybody would create a data file in this format, but you
    are going to have to read it with an InputStream, not a Reader. So the
    answer to your question is no! There must be some method of determining
    when you have found a 'binary is coming' tag or nobody could decode this
    data. Use an InputStream and look for the tag, collect your data and
    proceed. What are you going to do with the binary data? Is it images
    or something like that? Or is it going to be converted to characters too?

    --

    Knute Johnson
    email s/nospam/knute/
     
    Knute Johnson, Nov 6, 2005
    #11
  12. Adam Warner

    Roedy Green Guest

    On Mon, 07 Nov 2005 02:42:44 +1300, Adam Warner wrote,
    quoted or indirectly quoted someone who said:

    >> Does your format have a reliable way of spotting the end of a stream of
    >> character data /without/ decoding it ?

    >
    >No. While I can come up with a different format (e.g. encoding the binary
    >data in base 64) I'd like to solve the problem as specified.


    You say you CAN tell the end in the DECODED stream but not in the byte
    stream. How do you notice the end in the DECODED stream?
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 6, 2005
    #12
  13. Adam Warner

    Adam Warner Guest

    On Sun, 06 Nov 2005 17:17:14 +0000, Chris Uppal wrote:
    > Adam Warner wrote:
    >
    >> No. While I can come up with a different format (e.g. encoding the
    >> binary data in base 64) I'd like to solve the problem as specified.

    >
    > Hmm. I'm starting to think that you might want to take that option...
    >
    >
    >> > If not then the format is rather awkwardly designed, and you will
    >> > have to mess around with more complicated code to unravel it
    >> > character-by-character. I suggest using a
    >> > java.nio.charset.CharsetDecoder directly.

    >>
    >> The NIO could be helpful. But I still wouldn't know where to cut off a
    >> chunk from the stream without potentially splitting a character and
    >> breaking the decoding.

    >
    > What I had in mind was a simple loop where, at each step, you feed 1
    > byte to the CharsetDecoder and get back 0, 1, or 2 chars.


    Nice idea that's unfortunately necessary because CharsetDecoder omits a
    decodeChar() method.

    > Unfortunately I was wrong. Although the documentation doesn't say so,
    > and although the design is clearly set up to be used like that, it
    > doesn't work. At least the UTF-8 decoder doesn't work if used like that.
    > It doesn't retain enough state to remember that it has seen the start
    > of an encoded character, and so it cannot be trusted to decode
    > successfully across buffer boundaries (I don't know whether that's a bug
    > or simply that it isn't expected to be able to do so).


    I've worked around this.

    > So I think that the loop has to look more like
    >
    > 0) clear a small buffer
    > 1) get the next byte
    > 2) append it to the small buffer
    > 3) attempt to decode that into up to 2 chars
    > 4) if that works[*] then process the chars and goto (0)
    > 5) goto (1)
    >
    > and that -- when expressed using the magic of nio ByteBuffers and
    > CharBuffers -- looks as if it'd be extremely messy...
    >
    > ([*] by "works" I mean produces at least 1 char)


    My approach is similar. I fill a byte buffer (currently 1024 bytes) using
    a bulk read operation. Without any further copying I supply successive
    windows of the byte buffer to the charset decoder which places the decoded
    result into a char buffer of size 2. If there is no result the byte buffer
    position is reset to its previous value and the buffer limit is increased
    by 1 (leading to a visible byte window of 1, 2, 3, ... bytes). If the
    limit exceeds the size of the byte buffer then the few bytes yet to be
    decoded are copied back to the start of the byte buffer and the rest of
    the byte buffer is filled via another bulk read operation. Eventually a
    character is read.

    There will be bugs in the implementation below. I have successfully run a
    test class that includes a selection of Unicode code points and 256 bytes
    of binary data. To execute:

    javac BivalentInputStream.java BivalentInputStreamTest.java &&
    java BivalentInputStreamTest

    Many thanks for the feedback Chris.

    Regards,
    Adam


    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;

    public class BivalentInputStream
    {
        public static int bufSize=1024;

        private InputStream in;
        private ByteBuffer bb=ByteBuffer.allocate(bufSize);
        private byte[] ba=bb.array();
        private int maxLimit;

        private CharsetDecoder decoder;
        private CharBuffer cb=CharBuffer.allocate(2); //support surrogate chars
        private char[] ca=cb.array();


        /** @return The number of bytes read into the buffer. */
        private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
            int brokenNumBytesRead=in.read(b, offset, b.length-offset);
            if (brokenNumBytesRead==-1) return 0;
            return brokenNumBytesRead;
        }


        public BivalentInputStream(InputStream in) throws java.io.IOException {
            this.in=in;
            maxLimit=saneBulkRead(ba, 0);
            bb.limit(1);
            decoder=Charset.defaultCharset().newDecoder();
        }

        public BivalentInputStream(InputStream in, Charset cs) throws java.io.IOException {
            this.in=in;
            maxLimit=saneBulkRead(ba, 0);
            bb.limit(1);
            this.decoder=cs.newDecoder();
        }


        private char cachedSurrogate;
        private boolean storedSurrogate=false;

        /** @return '\uFFFF' if the stream is exhausted or the remaining bytes
            do not comprise a 16-bit char. */
        public char readChar() throws java.io.IOException {
            if (storedSurrogate==true) {
                storedSurrogate=false;
                return cachedSurrogate;
            }
            int codePoint=readCodePoint();
            if (codePoint==-1) return '\uFFFF';
            if (codePoint>0xFFFF) {
                char[] chars=Character.toChars(codePoint);
                storedSurrogate=true;
                cachedSurrogate=chars[1];
                return chars[0];
            }
            return (char) codePoint;
        }


        /** @return -1 if the stream is exhausted or the remaining bytes
            do not comprise a Unicode code point. */
        public int readCodePoint() throws java.io.IOException {
            //Buffer refill logic
            if (bb.position()==maxLimit) {
                if (maxLimit==0) return -1;
                //refill the byte buffer after moving the remaining bytes up to position 0
                int remainingBytes=maxLimit-bb.position();
                System.arraycopy(ba, bb.position(), ba, 0, remainingBytes);
                maxLimit=saneBulkRead(ba, remainingBytes);
                if (maxLimit==0) return -1; //remaining bytes do not comprise a code point
                maxLimit+=remainingBytes;
                bb.position(0);
                bb.limit(remainingBytes+1);
            }

            cb.position(0);
            int bbStartPos=bb.position();
            decoder.reset();
            CoderResult result=decoder.decode(bb, cb, true);
            decoder.flush(cb);
            if (result==CoderResult.UNDERFLOW) {
                if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
                return Character.codePointAt(ca, 0);
            }
            bb.position(bbStartPos);
            bb.limit(bb.limit()+1);
            return readCodePoint();
        }


        /** @return -1 if the stream is exhausted. */
        public int readByte() throws java.io.IOException {
            if (bb.position()==maxLimit) {
                if (maxLimit==0) return -1;
                //refill the byte buffer
                maxLimit=saneBulkRead(ba, 0);
                bb.position(0);
                bb.limit(1);
            }
            if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
            return ((int) bb.get()) & 0xFF;
        }
    }

    //////////////////////////////////////////////////////////////////////////////

    import java.io.*;

    public class BivalentInputStreamTest
    {
        static int numCharUnits=0;

        public static byte[] buildTestArray() throws java.io.IOException {
            ByteArrayOutputStream baos=new ByteArrayOutputStream();
            DataOutputStream dos=new DataOutputStream(baos);

            //write code points
            String intro="Hello, World";
            dos.writeChars(intro);
            numCharUnits+=intro.length();

            for (int i=0; i<0x110000; i+=128) {
                //avoid writing lone surrogates
                if (((i>=0xD800 && i<=0xDBFF) || (i>=0xDC00 && i<=0xDFFF))!=true) {
                    char[] chars=Character.toChars(i);
                    dos.writeChars(new String(chars));
                    numCharUnits+=chars.length;
                }
            }
            //write binary data
            for (int i=0; i<256; ++i) {
                dos.writeByte(i);
            }

            dos.flush(); dos.close();
            return baos.toByteArray();
        }


        public static void printByteArrayDifferences(byte[] array1, byte[] array2) {
            System.out.println("array1.length="+array1.length+
                               "; array2.length="+array2.length);
            byte[] smaller=array1, larger=array2;
            if (array1.length>array2.length) { smaller=array2; larger=array1; }
            for(int i=0; i<smaller.length; ++i) {
                if (array1[i]!=array2[i])
                    System.out.println("position "+i+": "+(((int) array1[i]) & 0xFF)+
                                       " "+(((int) array2[i]) & 0xFF));
            }
            for (int i=smaller.length; i<larger.length; ++i) {
                System.out.println("position "+i+": "+(((int) larger[i]) & 0xFF));
            }
        }


        public static void main(String[] args) throws java.io.IOException {
            byte[] ba=buildTestArray();
            ByteArrayInputStream bais=new ByteArrayInputStream(ba);
            BivalentInputStream in=new BivalentInputStream(bais, java.nio.charset.Charset.forName("UTF-16"));
            ByteArrayOutputStream baos=new ByteArrayOutputStream();
            DataOutputStream dos=new DataOutputStream(baos);
            //read char data (for testing purposes using the stored number of char units)
            for (int i=0; i<numCharUnits; ++i) {
                char c=in.readChar();
                dos.writeChar(c);
            }
            //read binary data
            for (int i=0; i<256; ++i) {
                dos.writeByte(in.readByte());
            }
            //Compare the arrays by content (ba.equals(newBA) would only compare references)
            dos.flush(); dos.close();
            byte[] newBA=baos.toByteArray();
            if (!java.util.Arrays.equals(ba, newBA)) printByteArrayDifferences(ba, newBA);
        }
    }
     
    Adam Warner, Nov 7, 2005
    #13
  14. Adam Warner

    Adam Warner Guest

    On Sun, 06 Nov 2005 23:14:03 +0000, Roedy Green wrote:
    > On Mon, 07 Nov 2005 02:42:44 +1300, Adam Warner wrote,
    > quoted or indirectly quoted someone who said:
    >
    >>> Does your format have a reliable way of spotting the end of a stream of
    >>> character data /without/ decoding it ?

    >>
    >>No. While I can come up with a different format (e.g. encoding the binary
    >>data in base 64) I'd like to solve the problem as specified.

    >
    > You say you CAN tell the end in the DECODED stream but not in the byte
    > stream. How do you notice the end in the DECODED stream?


    If I call readCodePoint() upon a BivalentInputStream with valid character
    data then a Unicode code point is returned or -1 to signal the end of the
    stream. This is the first way of noticing the end of the decoded stream.

    Alternatively I could decide that a newline code point terminates
    decoding. Again this is easy to detect.

    More complicated protocols are possible. A programming language could
    provide syntax to switch to binary decoding to reduce overhead when
    transferring code and data over a network. A kind of Binary XML could use
    this approach to switch to binary encoding. A tag such as
    <binary octets="12345"/> could read 12345 octets of binary data
    immediately following the closing > before switching back to reading text.

    A bivalent approach avoids the overhead of encoding binary data in the
    current character set and the high CPU burden of compressing that data for
    transmission and decompressing it again at the other end and finally
    translating the characters back to binary data. One clearly needs control
    over the whole communication process because the transformed data is
    unlikely to be legal text unless the character set is a legacy encoding
    such as ISO-8859-1. And even if the resulting text is legal the binary
    data will be corrupted by different operating system newline conventions.

    Regards,
    Adam
     
    Adam Warner, Nov 7, 2005
    #14
  15. Adam Warner

    Roedy Green Guest

    On Mon, 07 Nov 2005 02:42:44 +1300, Adam Warner wrote,
    quoted or indirectly quoted someone who said:

    >No. While I can come up with a different format (e.g. encoding the binary
    >data in base 64) I'd like to solve the problem as specified.


    If you use counted UTF, the problem goes away. You don't have a slow
    Mickey Mouse solution. The String is handled with equal ease to any
    binary field. Why goof around with baling wire?

    See DataOutputStream.writeUTF and DataInputStream.readUTF
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 7, 2005
    #15
  16. Adam Warner

    Roedy Green Guest

    On Mon, 07 Nov 2005 21:15:06 +1300, Adam Warner wrote,
    quoted or indirectly quoted someone who said:

    > BivalentInputStream


    I am not familiar with that class. Further I have never heard the
    term bivalent used outside the chemistry or genetics contexts.

    What do you mean by "bivalent" in terms of datastreams? Do you just
    mean having two different encodings, e.g. encoded char and binary and
    some mechanism to toggle?
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 7, 2005
    #16
  17. Adam Warner

    Chris Uppal Guest

    Adam Warner wrote:

    > There will be bugs in the implementation below.


    You might like a couple of test inputs. The following byte array defines a
    sequence of 4 Unicode code points, or 5 Java chars (sorry about the layout
    mangling).

    Charset utf8 = Charset.forName("UTF-8");
    byte[] bytes = new byte[] {
        0x32, // = U+000032
        (byte)0xD0, (byte)0xB0, // = U+000430
        (byte)0xE4, (byte)0xBA, (byte)0x8C, // = U+004E8C
        (byte)0xF0, (byte)0x90, (byte)0x8C, (byte)0x82 // = U+010302
    };

    Also, these bytes form an /invalid/ UTF-8 sequence:

    byte[] bytes = new byte[] {
        (byte)0xB0, (byte)0xD0 // = invalid
    };

    A couple of comments, if you want 'em:


    > private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
    > int brokenNumBytesRead=in.read(b, offset,

    I see you prefer self-documenting code ;-) Nice...


    > public int readCodePoint() throws java.io.IOException {
    > [...]
    > if (result==CoderResult.UNDERFLOW) {
    > if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
    > return Character.codePointAt(ca, 0);
    > }
    > bb.position(bbStartPos);
    > bb.limit(bb.limit()+1);
    > return readCodePoint();
    > }


    If the input data is mangled, then 'result' will be isMalformed() and no amount
    of extra data added to the end will fix it, so in that case the recursion will
    continue more-or-less indefinitely.
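
    One way to tell the two cases apart is sketched below, with assumptions
    flagged: the class name MalformedCheckDemo and its helper are
    hypothetical. With endOfInput=false the decoder reports a merely
    incomplete sequence as underflow, whereas malformed bytes come back as
    isMalformed(); with endOfInput=true both are reported as malformed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: distinguishes "need more bytes" from "broken input".
class MalformedCheckDemo {
    static CoderResult decodeOnce(byte[] bytes, boolean endOfInput) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        return decoder.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(2), endOfInput);
    }

    public static void main(String[] args) {
        byte[] invalid = { (byte) 0xB0, (byte) 0xD0 }; // continuation byte first
        byte[] truncated = { (byte) 0xD0 };            // first half of U+0430

        System.out.println(decodeOnce(invalid, false).isMalformed());   // true: retrying is pointless
        System.out.println(decodeOnce(truncated, false).isUnderflow()); // true: supply another byte
        System.out.println(decodeOnce(truncated, true).isMalformed());  // true: the cases look alike
    }
}
```

    A retry loop that checks isMalformed() before asking for more bytes can
    therefore stop instead of recursing.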

    I /think/ you may also have a problem with the bb.limit(..) line. It assumes
    that there is enough space in bb which I don't think is necessarily the case.

    -- chris
     
    Chris Uppal, Nov 7, 2005
    #17
  18. Adam Warner

    Adam Warner Guest

    On Mon, 07 Nov 2005 11:59:57 +0000, Chris Uppal wrote:
    > Adam Warner wrote:
    >
    >> There will be bugs in the implementation below.

    >
    > You might like a couple of test inputs. The following byte array defines
    > a sequence of 4 Unicode code points, or 5 Java chars (sorry about the
    > layout mangling).


    Many thanks. I do have to improve handling of malformed data.

    > A couple of comments, if you want 'em:
    >
    >
    >> private int saneBulkRead(byte[] b, int offset) throws
    >> java.io.IOException {
    >> int brokenNumBytesRead=in.read(b, offset,
    >
    > I see you prefer self-documenting code ;-) Nice...


    An Enterprise API isn't complete until the documentation for x.plus(y)
    reads: /** @return The sum of x and y, unless the sum is 42 then -1 is
    returned. */

    java.io.InputStream.read(byte[] b, int off, int len) returns the number of
    bytes written to the byte array. Except when it doesn't. A better language
    would support seamless multiple return values and their efficient
    implementation. If Java had multiple return values the first return value
    for this method could simply be the number of bytes written to the byte
    array. The second return value, to be optionally captured, could be a
    boolean denoting the end of stream. Instead of conflating two return
    values there could also be a separate isEndofStream() method.
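
    Without language support, the nearest approximation is a small result
    object. The sketch below is illustrative only: ReadResult and its fields
    are hypothetical names, not an existing JDK type.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical result type separating "bytes written" from "end of stream".
class ReadResult {
    final int count;           // bytes written to the array, never negative
    final boolean endOfStream; // true once the stream has nothing more to give

    ReadResult(int count, boolean endOfStream) {
        this.count = count;
        this.endOfStream = endOfStream;
    }

    static ReadResult read(InputStream in, byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);
        return n == -1 ? new ReadResult(0, true) : new ReadResult(n, false);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] { 1, 2, 3 });
        byte[] buffer = new byte[8];
        ReadResult first = read(in, buffer, 0, buffer.length);
        ReadResult second = read(in, buffer, 0, buffer.length);
        System.out.println(first.count + " " + first.endOfStream);   // 3 false
        System.out.println(second.count + " " + second.endOfStream); // 0 true
    }
}
```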

    As JVMs become capable of stack allocating many new objects via escape
    analysis there's potential for the efficient return of multiple values
    within an explicit new array. If Java the language is changed to support
    seamless multiple return values (like the recent introduction of variable
    arguments on the input side) then more consistent libraries are likely.

    Regards,
    Adam
     
    Adam Warner, Nov 7, 2005
    #18
  19. Adam Warner

    Roedy Green Guest

    On Tue, 08 Nov 2005 11:33:40 +1300, Adam Warner wrote,
    quoted or indirectly quoted someone who said:

    >As JVMs become capable of stack allocating many new objects via escape
    >analysis there's potential for the efficient return of multiple values
    >within an explicit new array. If Java the language is changed to support
    >seamless multiple return values (like the recent introduction of variable
    >arguments on the input side) then more consistent libraries are likely.


    Java the language is fine. The Jet people automatically allocate some
    objects on the stack. Allocating objects there would likely require an
    overhaul of the JVM.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 8, 2005
    #19