Problem with getting correct data out of buffer reading from channel

Discussion in 'Java' started by nooneinparticular314159@yahoo.com, Jul 20, 2009.

  1. Guest

    I'm reading data from a socket channel in a network program. To test
    my code, I'm telnetting to the program and typing in some data, which
    I then try to read from the buffer, view as a CharBuffer, and write to
    standard out. Unfortunately, what I type in are English letters and
    numbers, and what I get out seems to be Unicode Chinese! Here's what
    I'm doing:

    try {
        NumberOfBytesReadFromChannel = Channel.read(ReceiveBuffer); // read available data into the buffer
    }

    ReceiveBuffer.flip(); // flip the buffer so it can be read

    // Read the new data out of the buffer and add it to IncomingDataString,
    // which stores unprocessed incoming data
    IncomingMessageBuffer = ReceiveBuffer.asCharBuffer();
    if (IncomingDataString == null) {
        IncomingDataString = IncomingMessageBuffer.toString();
    } else {
        IncomingDataString = IncomingDataString + IncomingMessageBuffer.toString();
    }

    //*************************
    System.out.println("String received was: " + IncomingDataString);

    ReceiveBuffer.clear();

    (Not shown: the IOException catch statement)

    What I get are a series of strings that look like:
    String received was: 摧æ æ‘¦à´Šæœæ æ 

    So somehow, I seem to be reading the data incorrectly, even though I
    am receiving it. Any idea what I'm doing wrong here?

    Thanks!
    nooneinparticular314159, Jul 20, 2009
    #1

  2. Roedy Green Guest

    On Sun, 19 Jul 2009 17:52:24 -0700 (PDT), nooneinparticular314159
    wrote, quoted or indirectly quoted
    someone who said :

    > ReceiveBuffer.asCharBuffer();


    This implies you have 16-bit Unicode, not UTF-8 or some other 8-bit
    encoding. asCharBuffer() does no decoding; it views each pair of
    bytes in the buffer as one 16-bit char.
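To make the point above concrete, here is a minimal sketch of what asCharBuffer() does with ASCII bytes. The class and method names are made up for illustration, and the input "dgdf" is just an assumed example of what a telnet user might type. With the default big-endian byte order, the bytes for 'd' (0x64) and 'g' (0x67) are fused into the single char U+6467, which happens to be a CJK ideograph.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AsCharBufferDemo {
    // Views the bytes as 16-bit chars (big-endian by default), with no decoding.
    static String viewAsChars(byte[] raw) {
        return ByteBuffer.wrap(raw).asCharBuffer().toString();
    }

    // Decodes the same bytes as US-ASCII, one byte per character.
    static String decodeAscii(byte[] raw) {
        return StandardCharsets.US_ASCII.decode(ByteBuffer.wrap(raw)).toString();
    }

    public static void main(String[] args) {
        byte[] raw = "dgdf".getBytes(StandardCharsets.US_ASCII); // 4 bytes

        String viewed = viewAsChars(raw);
        System.out.println("chars seen by asCharBuffer: " + viewed.length());
        for (char c : viewed.toCharArray()) {
            // Print code points in hex so the result does not depend on the console font.
            System.out.println("U+" + Integer.toHexString(c).toUpperCase());
        }

        System.out.println("decoded as US-ASCII: " + decodeAscii(raw));
    }
}
```

Four typed characters come out as two chars, which is consistent with the symptom in the original post.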
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "The industrial civilisation is based on the consumption of energy resources that are inherently limited in quantity, and that are about to become scarce. When they do, competition for what remains will trigger dramatic economic and geopolitical events; in the end, it may be impossible for even a single nation to sustain industrialism as we have known it in the twentieth century."
    ~ Richard Heinberg, The Party’s Over: Oil, War, and the Fate of Industrial Societies
    Roedy Green, Jul 20, 2009
    #2

  3. Re: Problem with getting correct data out of buffer reading from channel

    Hmm...telnet should just be sending raw ASCII. Is there a way to
    force java not to use unicode?

    Thanks!
    nooneinparticular314159, Jul 20, 2009
    #3
  4. Re: Problem with getting correct data out of buffer reading from channel

    Ok, Richard. Looks like you were right! I created a CharsetDecoder
    for ISO-8859-1, and use that to decode my buffer, and what I get out
    is the text I transmitted using telnet!
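For anyone following along, the fix described above might look roughly like the sketch below. It is not the poster's actual code: the class name, the method name and the sample input are assumptions, and the buffer is filled by hand instead of from a real channel.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class DecodeReceivedBytes {
    // Decodes everything between the buffer's position and limit as ISO-8859-1.
    static String decodeLatin1(ByteBuffer received) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
        CharBuffer chars = decoder.decode(received);
        return chars.toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // Stand-in for the data that Channel.read(ReceiveBuffer) would deliver.
        ByteBuffer receiveBuffer = ByteBuffer.allocate(64);
        receiveBuffer.put("hello 123\r\n".getBytes(Charset.forName("ISO-8859-1")));

        receiveBuffer.flip(); // switch from writing into the buffer to reading from it
        System.out.print("String received was: " + decodeLatin1(receiveBuffer));
        receiveBuffer.clear(); // make the buffer ready for the next read
    }
}
```

ISO-8859-1 maps every byte value to a character, so this decoder never fails on arbitrary input; a stricter charset such as US-ASCII would report bytes above 0x7F as malformed.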

    So my remaining questions are: Let's say that I write a program in
    Java to transmit some data to the program above. If I don't
    explicitly change the encoding, will the encoding be correct, since it
    will be using whatever Java natively uses?

    Also, let's say that I want to get some data from something other than
    my own Java programs. Is there a way to detect the encoding that they
    are using, so that I can work with any program that transmits to my
    program? Or do I have to just know what they are using?

    Thanks,
    Michael
    nooneinparticular314159, Jul 20, 2009
    #4
  5. Roedy Green Guest

    Re: Problem with getting correct data out of buffer reading from channel

    On Mon, 20 Jul 2009 02:01:37 -0700 (PDT), nooneinparticular314159
    wrote, quoted or indirectly quoted
    someone who said :

    >Hmm...telnet should just be sending raw ASCII. Is there a way to
    >force java not to use unicode?


    You could process raw bytes with nio or with an InputStream.
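A rough sketch of the InputStream route, for illustration only. A ByteArrayInputStream stands in for the socket's input stream, and the class and method names are invented. Dumping the bytes as hex shows exactly what arrived before any charset is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawBytesDemo {
    // Reads the stream to the end and returns its bytes as space-separated hex.
    static String toHex(InputStream in) throws IOException {
        StringBuilder hex = new StringBuilder();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                // Mask to 0..255 because Java bytes are signed.
                hex.append(String.format("%02x ", chunk[i] & 0xff));
            }
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for socket.getInputStream(); the bytes are "hi" plus CR LF.
        InputStream in = new ByteArrayInputStream(new byte[] {'h', 'i', '\r', '\n'});
        System.out.println(toHex(in));
    }
}
```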
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Roedy Green, Jul 20, 2009
    #5
  6. Roedy Green Guest

    Re: Problem with getting correct data out of buffer reading from channel

    On Mon, 20 Jul 2009 02:10:54 -0700 (PDT), nooneinparticular314159
    wrote, quoted or indirectly quoted
    someone who said :

    >So my remaining questions are: Let's say that I write a program in
    >Java to transmit some data to the program above. If I don't
    >explicitly change the encoding, will the encoding be correct, since it
    >will be using whatever Java natively uses?


    The answer is ugly. See http://mindprod.com/jgloss/encoding.html

    The answer is, in general, data are not tagged with the encoding.
    I assume this is a result of the male propensity to surround himself
    with dirty coffee cups and empty pizza boxes. I can't imagine Martha
    Stewart as computer programmer putting up with such a slovenly state
    of affairs.

    HTTP has some encoding headers, and some ways to request your
    preferred encodings.

    I think the way out is gradually to discard all encodings except
    UTF-8.

    I wrote a little utility to help you guess what encoding was used.
    See http://mindprod.com/applet/officialencoding.html

    Basically the receiver is just supposed to "know" the encoding.
    This might have been reasonable in 1960 when every datacentre had its
    own private encoding and everyone used it, and people rarely exchanged
    data with the outside world. But today, with the international sharing
    on the Internet, it is crazy.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Roedy Green, Jul 20, 2009
    #6
  7. markspace Guest

    Re: Problem with getting correct data out of buffer reading from channel

    nooneinparticular314159 wrote:
    > for ISO-8859-1, and use that to decode my buffer, and what I get out



    As others have mentioned, this probably should be "US-ASCII" for telnet.


    > So my remaining questions are: Let's say that I write a program in
    > Java to transmit some data to the program above. If I don't
    > explicitly change the encoding, will the encoding be correct, since it
    > will be using whatever Java natively uses?



    No, Java IO "natively" uses the platform default. This will be
    different on each platform. Internally, all character encodings are the
    same in Java (as long as it's based on char, not byte) but almost all IO
    will convert Java's internal encoding to the platform default.

    It's possible to write Java's internal raw characters to a stream, but
    you have to be careful to do it correctly or Java will translate
    (encode) that character. It's easier just to specify an encoding, imo.
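As a sketch of "just specify an encoding": the example below writes the same two-character string through an OutputStreamWriter with two explicitly named charsets and compares the byte counts. The class name, the helper and the sample text are assumptions made for illustration. A ByteArrayOutputStream stands in for the socket's output stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingDemo {
    // Encodes text with the given charset by writing it through a Writer.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(sink, charset);
        writer.write(text);
        writer.close(); // flushes the encoder into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u00a35"; // a pound sign followed by the digit 5

        // What an OutputStreamWriter without a charset argument would use on this machine.
        System.out.println("platform default: " + Charset.defaultCharset());

        System.out.println("UTF-8 bytes:      " + encode(text, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + encode(text, StandardCharsets.ISO_8859_1).length);
    }
}
```

The pound sign takes two bytes in UTF-8 and one in ISO-8859-1, so sender and receiver only agree if both name the same charset.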


    > Is there a way to detect the encoding that they
    > are using, so that I can work with any program that transmits to my
    > program? Or do I have to just know what they are using?


    There's no way to figure it out, you have to "just know". That's for
    any language or IO, not just Java. Very few IO operations specify an
    encoding or how to obtain one. Two big exceptions which DO specify an
    encoding are HTTP and XML, which is one reason why they're so popular.
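To illustrate the HTTP case, here is a minimal sketch that pulls the charset parameter out of a Content-Type header value. It is a simplification for illustration, not a full header parser. The class name, method name and fallback parameter are assumptions, and Charset.forName will throw if the header names a charset this JVM does not support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or the fallback if none is given.
    static Charset charsetFrom(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the parameter name and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetFrom("text/plain", StandardCharsets.ISO_8859_1));
    }
}
```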
    markspace, Jul 20, 2009
    #7
  8. Lew Guest

    Re: Problem with getting correct data out of buffer reading from channel

    Roedy Green wrote:
    > The answer is, in general, data are not tagged with the encoding.
    > I assume this is a result of the male propensity to surround himself
    > with dirty coffee cups and empty pizza boxes. I can't imagine Martha
    > Stewart as computer programmer putting up with such a slovenly state
    > of affairs.


    What a profoundly bigoted and sexist thing to say.

    --
    Lew
    Lew, Jul 21, 2009
    #8
  9. Re: Problem with getting correct data out of buffer reading from channel

    Lew wrote:
    > Roedy Green wrote:
    >> The answer is, in general, data are not tagged with the encoding. I
    >> assume this is a result of the male propensity to surround himself
    >> with dirty coffee cups and empty pizza boxes. I can't imagine Martha
    >> Stewart as computer programmer putting up with such a slovenly state
    >> of affairs.

    >
    > What a profoundly bigoted and sexist thing to say.


    The question is whether it is derogatory against women
    or computer programmers.

    :)

    From a more serious perspective, the stereotypical computer
    programmer described is not very common - software engineers
    are professionals like other engineers - they follow office
    dress codes, try to work normal hours when possible, eat healthily
    on doctor's orders, have a wife and house, etc.

    Arne
    Arne Vajhøj, Jul 21, 2009
    #9
  10. OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    markspace wrote:
    > As others have mentioned, this probably should be "US-ASCII" for telnet.


    My first reaction was that US-ASCII is something of an oxymoron, but I
    do recall that there were some early 7-bit character-sets that were
    national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was there a UK
    version with a £ in place of the $? I can't find any references to this.

    I find telnet is mostly character-set agnostic. I can switch character
    sets without the telnet protocol needing to know what I am doing.

    --
    RGB
    RedGrittyBrick, Jul 21, 2009
    #10
  11. Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    RedGrittyBrick wrote:
    >
    > markspace wrote:
    >> As others have mentioned, this probably should be "US-ASCII" for telnet.

    >
    > My first reaction was that US-ASCII is something of an oxymoron, but I
    > do recall that there were some early 7-bit character-sets that were
    > national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was there a UK
    > version with a £ in place of the $? I can't find any references to this.


    Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730 says
    £ in place of #.

    Since these non-American variants of ASCII have their own names I still
    think the US- prefix is, at least, somewhat redundant for ASCII.

    There were significant differences between the 1963, 1965 and 1967 ASCII
    standards which might be more important to highlight than the US-ness of
    the A in ASCII.

    I'll shut up now :)

    --
    RGB
    Picking nits since 19xx.
    RedGrittyBrick, Jul 21, 2009
    #11
  12. Arne Vajhøj Guest

    Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    RedGrittyBrick wrote:
    > RedGrittyBrick wrote:
    >> markspace wrote:
    >>> As others have mentioned, this probably should be "US-ASCII" for telnet.

    >>
    >> My first reaction was that US-ASCII is something of an oxymoron, but I
    >> do recall that there were some early 7-bit character-sets that were
    >> national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was there a
    >> UK version with a £ in place of the $? I can't find any references to
    >> this.

    >
    > Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730 says
    > £ in place of #.
    >
    > Since these non-American variants of ASCII have their own names I still
    > think the US- prefix is, at least, somewhat redundant for ASCII.
    >
    > There were significant differences between the 1963, 1965 and 1967 ASCII
    > standards which might be more important to highlight than the US-ness of
    > the A in ASCII.


    http://www.iana.org/assignments/character-sets

    says:

    Name: ANSI_X3.4-1968 [RFC1345,KXS2]
    MIBenum: 3
    Source: ECMA registry
    Alias: iso-ir-6
    Alias: ANSI_X3.4-1986
    Alias: ISO_646.irv:1991
    Alias: ASCII
    Alias: ISO646-US
    Alias: US-ASCII (preferred MIME name)
    Alias: us
    Alias: IBM367
    Alias: cp367
    Alias: csASCII

    US-ASCII is listed even claiming to be "preferred MIME name" !
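On the Java side, a quick way to see which of these names a given JVM accepts is to ask java.nio.charset.Charset. The sketch below is illustrative only: the class and helper names are invented, and which aliases resolve is up to the JVM implementation, so only the canonical name "US-ASCII" is assumed to work everywhere.

```java
import java.nio.charset.Charset;

public class AsciiAliasesDemo {
    // Returns the canonical charset name for a label, or null if this JVM rejects it.
    static String canonicalName(String label) {
        try {
            return Charset.forName(label).name();
        } catch (IllegalArgumentException e) {
            // Covers both illegal names and names this JVM does not support.
            return null;
        }
    }

    public static void main(String[] args) {
        Charset ascii = Charset.forName("US-ASCII");
        System.out.println("canonical name: " + ascii.name());
        System.out.println("aliases known to this JVM: " + ascii.aliases());

        // A few labels from the IANA entry quoted above; results may vary by JVM.
        for (String label : new String[] {"ASCII", "ISO646-US", "cp367", "csASCII"}) {
            System.out.println(label + " -> " + canonicalName(label));
        }
    }
}
```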

    Arne
    Arne Vajhøj, Jul 22, 2009
    #12
  13. Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    Arne Vajhøj wrote:
    > RedGrittyBrick wrote:
    >> RedGrittyBrick wrote:
    >>> markspace wrote:
    >>>> As others have mentioned, this probably should be "US-ASCII" for
    >>>> telnet.
    >>>
    >>> My first reaction was that US-ASCII is something of an oxymoron, but
    >>> I do recall that there were some early 7-bit character-sets that were
    >>> national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was there a
    >>> UK version with a £ in place of the $? I can't find any references to
    >>> this.

    >>
    >> Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730
    >> says £ in place of #.
    >>
    >> Since these non-American variants of ASCII have their own names I
    >> still think the US- prefix is, at least, somewhat redundant for ASCII.
    >>
    >> There were significant differences between the 1963, 1965 and 1967
    >> ASCII standards which might be more important to highlight than the
    >> US-ness of the A in ASCII.

    >
    > http://www.iana.org/assignments/character-sets
    >
    > says:
    >
    > Name: ANSI_X3.4-1968 [RFC1345,KXS2]
    > MIBenum: 3
    > Source: ECMA registry
    > Alias: iso-ir-6
    > Alias: ANSI_X3.4-1986
    > Alias: ISO_646.irv:1991
    > Alias: ASCII
    > Alias: ISO646-US
    > Alias: US-ASCII (preferred MIME name)
    > Alias: us
    > Alias: IBM367
    > Alias: cp367
    > Alias: csASCII
    >
    > US-ASCII is listed even claiming to be "preferred MIME name" !
    >


    Oh well, If IANA say so, though I think the US prefix is about as
    redundant as one of the numbers in "PIN number".

    --
    RGB
    RedGrittyBrick, Jul 22, 2009
    #13
  14. Lew Guest

    Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    RedGrittyBrick wrote:
    > Oh well, If IANA say so, though I think the US prefix is about as
    > redundant as one of the numbers in "PIN number".


    Not at all. That's no more redundant than "D-Day".

    --
    Lew
    Enter your PIN number into the ATM machine.
    Lew, Jul 22, 2009
    #14
  15. Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    Lew wrote:
    > RedGrittyBrick wrote:
    >> Oh well, If IANA say so, though I think the US prefix is about as
    >> redundant as one of the numbers in "PIN number".

    >
    > Not at all. That's no more redundant than "D-Day".
    >


    :) I'm beginning to regret starting this.

    AIUI "D-Day" is a military term often used to *name* a specific day in a
    military operation. The most famous D-Day (in my locale anyway) is June
    6th 1944. Saying "D-Day" is like saying "Day X" and not like saying
    "Day". So "D-Day" does not mean the exact same thing as "Day". D-Day is
    part of a family of similarly named but distinct days such as VE-Day.
    The prefixes are needed to distinguish amongst those days.

    By contrast my ATM card has a Personal Identification Number. As this is
    usually abbreviated to PIN, I could say my ATM card has a PIN. A PIN
    number would be a Personal Identification Number number. Perhaps we
    could abbreviate that to PINN and start talking about PINN numbers?

    Unlike with "D-Day" and "Day", when people say "PIN number", they would
    lose no information by saying "PIN" instead.

    Therefore, it seems to me, the "number" in "PIN number" is *much* more
    redundant than the "D-" in "D-Day". One is, the other isn't.


    --
    RGB
    RedGrittyBrick, Jul 22, 2009
    #15
  16. Arne Vajhøj Guest

    Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    RedGrittyBrick wrote:
    > Arne Vajhøj wrote:
    >> RedGrittyBrick wrote:
    >>> RedGrittyBrick wrote:
    >>>> markspace wrote:
    >>>>> As others have mentioned, this probably should be "US-ASCII" for
    >>>>> telnet.
    >>>>
    >>>> My first reaction was that US-ASCII is something of an oxymoron, but
    >>>> I do recall that there were some early 7-bit character-sets that
    >>>> were national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was
    >>>> there a UK version with a £ in place of the $? I can't find any
    >>>> references to this.
    >>>
    >>> Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730
    >>> says £ in place of #.
    >>>
    >>> Since these non-American variants of ASCII have their own names I
    >>> still think the US- prefix is, at least, somewhat redundant for ASCII.
    >>>
    >>> There were significant differences between the 1963, 1965 and 1967
    >>> ASCII standards which might be more important to highlight than the
    >>> US-ness of the A in ASCII.

    >>
    >> http://www.iana.org/assignments/character-sets
    >>
    >> says:
    >>
    >> Name: ANSI_X3.4-1968 [RFC1345,KXS2]
    >> MIBenum: 3
    >> Source: ECMA registry
    >> Alias: iso-ir-6
    >> Alias: ANSI_X3.4-1986
    >> Alias: ISO_646.irv:1991
    >> Alias: ASCII
    >> Alias: ISO646-US
    >> Alias: US-ASCII (preferred MIME name)
    >> Alias: us
    >> Alias: IBM367
    >> Alias: cp367
    >> Alias: csASCII
    >>
    >> US-ASCII is listed even claiming to be "preferred MIME name" !
    >>

    >
    > Oh well, If IANA say so, though I think the US prefix is about as
    > redundant as one of the numbers in "PIN number".


    I would agree, but if the choice is between following the standard
    or do what the standard should have been, then one should follow
    the standard.

    Arne
    Arne Vajhøj, Jul 23, 2009
    #16
  17. Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    RedGrittyBrick wrote:
    > :) I'm beginning to regret starting this.


    Too late !

    Arne
    Arne Vajhøj, Jul 23, 2009
    #17
  18. Lew Guest

    Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    RedGrittyBrick wrote:
    >
    > Lew wrote:
    >> RedGrittyBrick wrote:
    >>> Oh well, If IANA say so, though I think the US prefix is about as
    >>> redundant as one of the numbers in "PIN number".

    >>
    >> Not at all. That's no more redundant than "D-Day".
    >>

    >
    > :) I'm beginning to regret starting this.
    >
    > AIUI "D-Day" is a military term often used to *name* a specific day in a
    > military operation. The most famous D-Day (in my locale anyway) is June
    > 6th 1944. Saying "D-Day" is like saying "Day X" and not like saying
    > "Day". So "D-Day" does not mean the exact same thing as "Day". D-Day is
    > part of a family of similarly named but distinct days such as VE-Day.
    > The prefixes are needed to distinguish amongst those days.


    Actually, in (at least U.S.) military parlance, "D-Day" is part of a family of
    terms like "H-Hour" - the first "D" is the specific day of a particular
    operation (e.g., the landing at Normandy Beach). It stands for "Day", as in
    "the Day of the operation", just as "H" in "H-Hour" stands for "Hour".

    There are as many "D-Day"s as there are military operations that have a
    scheduled day.

    > By contrast my ATM card has a Personal Identification Number. As this is
    > usually abbreviated to PIN, I could say my ATM card has a PIN. A PIN
    > number would be a Personal Identification Number number. Perhaps we
    > could abbreviate that to PINN and start talking about PINN numbers?
    >
    > Unlike with "D-Day" and "Day", when people say "PIN number", they would
    > lose no information by saying "PIN" instead.


    Actually, they would. The "N" in "PIN" is generic like the first "D" in
    "D-Day", and it means generally the personal identification number, and "PIN
    number" is the particular person's identifying personal identification number.

    > Therefore, it seems to me, the "number" in "PIN number" is *much* more
    > redundant than the "D-" in "D-Day". One is, the other isn't.


    No more so than saying "machine" in "ATM machine". You have to say "ATM
    machine" so you know you aren't speaking of the "ATM card" or the "ATM machine
    PIN number".

    --
    Lew
    Lew, Jul 23, 2009
    #18
  19. Re: OT: US-ASCII (was Re: Problem with getting correct data out of buffer reading from channel)

    Lew wrote:
    > No more so than saying "machine" in "ATM machine". You have to say "ATM
    > machine" so you know you aren't speaking of the "ATM card" or the "ATM
    > machine PIN number".


    You kitten mass-murderer! ;-)

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Jul 23, 2009
    #19
  20. markspace Guest

    Re: Problem with getting correct data out of buffer reading from channel

    Patricia Shanahan wrote:
    > Lew wrote:
    >> Roedy Green wrote:
    >>> The answer is, in general, data are not tagged with the encoding. I
    >>> assume this is a result of the male propensity to surround himself
    >>> with dirty coffee cups and empty pizza boxes. I can't imagine Martha
    >>> Stewart as computer programmer putting up with such a slovenly state
    >>> of affairs.

    >>
    >> What a profoundly bigoted and sexist thing to say.
    >>

    >
    > Obviously, Roedy has never seen my kitchen. :)



    Dang it, now I'm hungry.
    markspace, Jul 23, 2009
    #20