Zero Byte Terminated Strings

Discussion in 'Java' started by PurpleServerMonkey, Mar 28, 2007.

  1. Hi,

    I'm writing a simple UDP server in Java. It's designed to take an
    initial request packet from a C-based client and perform further
    actions. The networking side of things is fine, but I'm having
    problems dealing with a zero-byte-terminated string being sent from
    the client.

    Code Snippet:
    byte[] data = new byte[1000];
    DatagramSocket serverSocket = new DatagramSocket(1025);
    DatagramPacket packet = new DatagramPacket(data, data.length);
    serverSocket.receive(packet);

    The received packet then gets put onto a queue for pickup by a thread
    pool. It's in the thread pool that I process the packet and extract
    the string information (which represents a filename, mode, etc.).
    Note that the strings in this packet are zero-byte terminated.

    Code Snippet:
    byte[] payload = new byte[1000];
    payload = packet.getData();

    What I'd like to know is, what is the best way to retrieve zero-byte-
    terminated strings from the byte array?

    Thanks in advance for your assistance.
    PurpleServerMonkey, Mar 28, 2007
    #1

  2. PurpleServerMonkey wrote:
    > Hi,
    >
    > I'm writing a simple UDP server in Java, it's designed to take an
    > initial request packet from a C based client and perform further
    > actions. The networking side of things is fine however I'm having
    > problems dealing with a zero byte terminated string being sent from
    > the client.
    >
    > Code Snippet:
    > byte[] data = new byte[1000];
    > DatagramSocket serverSocket = new DatagramSocket(1025);
    > DatagramPacket packet = new DatagramPacket(data, data.length);
    > serverSocket.receive(packet);
    >
    > The received packet then gets put onto a queue for pickup by a thread
    > pool. It's in the threadpool that I look at processing the packet and
    > extracting the string information (represents a filename, mode, etc).
    > Note that the strings in this packet are zero byte terminated.
    >
    > Code Snippet:
    > byte[] payload = new byte[1000];
    > payload = packet.getData();
    >
    > What I'd like to know is, what is the best way to retrieve zero byte
    > terminated strings from the byte array?
    >
    > Thanks in advance for your assistance.
    >


    Actually very easy to do. Just create a String from your byte[] buffer
    and split it on the 0s.

    public class test {
        public static void main (String[] args) throws Exception {
            // "THIS\0IS\0A\0TEST\0" as raw bytes
            byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
                           0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };

            // Decodes with the platform default charset
            String str = new String(buf);
            String[] arr = str.split("\u0000");

            // Print each extracted string on its own line
            for (int i=0; i<arr.length; i++)
                System.out.println(arr[i]);
        }
    }

    --

    Knute Johnson
    email s/nospam/knute/
    Knute Johnson, Mar 28, 2007
    #2

  3. PurpleServerMonkey

    Adam Maass Guest

    "Knute Johnson" <> wrote:
    >>

    >
    > Actually very easy to do. Just create a String from your byte[] buffer
    > and split it on the 0s.
    >
    > public class test {
    > public static void main (String[] args) throws Exception {
    > byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
    > 0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };
    >
    > String str = new String(buf);


    Ahem, it will be critically important to specify the encoding to the String
    constructor!

    String str = new String(buf, "ASCII");

    > String[] arr = str.split("\u0000");
    >
    > for (int i=0; i<arr.length; i++)
    > System.out.println(arr[i]);
    > }
    > }
    >
    > --
    >
    > Knute Johnson
    > email s/nospam/knute/
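    For illustration, here is a minimal sketch of the same split approach with
    the encoding stated explicitly. It assumes the payload really is US-ASCII
    and uses java.nio.charset.StandardCharsets, which only exists from Java 7
    onwards. The class name SplitWithCharset and the helper splitOnNul are
    made up for this example.

```java
import java.nio.charset.StandardCharsets;

class SplitWithCharset {
    // Decode only the first 'length' bytes as US-ASCII, then split on NUL.
    // Note that String.split drops trailing empty strings.
    static String[] splitOnNul(byte[] buf, int length) {
        String text = new String(buf, 0, length, StandardCharsets.US_ASCII);
        return text.split("\u0000");
    }

    public static void main(String[] args) {
        byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
                       0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };
        for (String s : splitOnNul(buf, buf.length)) {
            System.out.println(s);
        }
    }
}
```

    Using a Charset constant rather than a charset name also avoids the
    checked UnsupportedEncodingException that the String-name constructor
    declares.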
    Adam Maass, Mar 28, 2007
    #3
  4. PurpleServerMonkey

    Chris Uppal Guest

    PurpleServerMonkey wrote:

    > What I'd like to know is, what is the best way to retrieve zero byte
    > terminated strings from the byte array?


    There is no easy way to do it. That's to say, the /code/ will be trivially
    simple once you know what you have to do, but finding out what you have to do
    will be tricky unless the C programmers who generate the input are unusually
    knowledgeable.

    There is no equivalence between character data and binary data, so one is
    always turned into the other by using some character encoding or other (often
    called a "charset" or a "code page"). In Java, when you convert bytes to text
    (or vice versa) you /always/ have to tell the system what character encoding to
    use. (There are some "convenience" methods which use a system-default code
    page, but you should avoid those in most circumstances, and you should
    /definitely/ avoid them in this case).

    So how do you find out what character set has been used by the C programmers ?
    The first thing to do is to ask them. The chances are fairly good that they'll
    have no idea what you are talking about. If not, then presumably they haven't
    taken any steps at all to /control/ what code page is being used, and it will
    be either:
    some system default, if they are generating the text themselves
    or
    whatever character set the /real/ source of the data used.

    If they are generating the data themselves, then you can probably get a decent
    guess as to what character set they are using by running the following little
    Java program on the machine where they compile their stuff.

    public class Main
    {
    public static void
    main(String[] args)
    {
    System.out.println(
    "file.encoding: "
    + System.getProperty("file.encoding"));
    }
    }

    That will tell you what character set Java thinks is most likely to be a
    sensible default for that machine, and it /may/ be correct. On my system
    today, that name is "Cp1252" (which cognoscenti will recognise as meaning I
    have a Windows box set up to use an English/Western European character set by
    default).

    If you can't find any sensible information, then it's probably a good idea to
    assume that the data is pure ASCII -- which is a 7-bit encoding which
    (therefore) only defines 128 characters, but those 128 characters are common to
    all (as far as I know) encodings that your UDP packets are likely to be using.
    To use that character encoding use an encoding name of "US-ASCII".
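
    As a rough way to check the claim about the 7-bit range, the sketch below
    decodes every byte value from 0 to 127 under a few charsets and compares
    the result with US-ASCII. It only uses charsets that every Java platform
    is required to support; the class and method names are invented for the
    example, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiSubsetCheck {
    // True if bytes 0..127 decode to the same text under 'cs' as under US-ASCII.
    static boolean agreesWithAscii(Charset cs) {
        byte[] bytes = new byte[128];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) i;
        }
        String ascii = new String(bytes, StandardCharsets.US_ASCII);
        return ascii.equals(new String(bytes, cs));
    }

    public static void main(String[] args) {
        Charset[] candidates = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + " agrees with US-ASCII: " + agreesWithAscii(cs));
        }
    }
}
```

    The 8-bit encodings agree on that range, while a 16-bit encoding such as
    UTF-16BE does not, because it reads the bytes two at a time.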

    Once you have decided what character set is in use, actually decoding it is
    trivial. Just find the start of the text data in your byte[] buffer (which you
    must already know how to do), loop down the buffer looking for the terminating
    byte which has value 0 (but see below), and then pass the resulting data into
    the String constructor:
    String(byte[] bytes, int offset, int length, String charsetName)
    or, if you prefer:
    String(byte[] bytes, int offset, int length, Charset charset)
    which will do the conversion for you.
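
    The loop described above could look something like the following sketch.
    The class name ZeroTerminatedStrings and the method extract are
    hypothetical, and the choice to silently ignore trailing bytes that have
    no terminator is an assumption; a real protocol may want to reject such a
    packet instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ZeroTerminatedStrings {
    // Returns every zero-terminated string found in buf[offset .. offset+length).
    // Bytes after the last terminator are ignored (an assumption, see above).
    static List<String> extract(byte[] buf, int offset, int length, Charset cs) {
        List<String> result = new ArrayList<>();
        int end = offset + length;
        int start = offset;
        for (int i = offset; i < end; i++) {
            if (buf[i] == 0) {
                // Decode the bytes between the previous terminator and this one.
                result.add(new String(buf, start, i - start, cs));
                start = i + 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for a received payload: a filename and a mode, each NUL-terminated.
        byte[] data = "readme.txt\0octet\0".getBytes(StandardCharsets.US_ASCII);
        for (String s : extract(data, 0, data.length, StandardCharsets.US_ASCII)) {
            System.out.println(s);
        }
    }
}
```

    With a DatagramPacket, the offset and length would normally come from
    packet.getOffset() and packet.getLength(), since packet.getData() returns
    the whole backing array rather than just the bytes that were received.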

    (The potential gotcha about looking for the value 0 is that it assumes that the
    data is encoded using an 8-bit (or 7-bit) encoding like "ISO-8859-1", "UTF-8",
    or "Cp1252", rather than a 16-bit encoding like "UTF-16" -- but that seems a
    safe bet or even C programmers would know that there was a potential problem
    and warn you about it.)
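
    To make that gotcha concrete, this small sketch counts the zero bytes
    produced when the same text is encoded in UTF-8 and in UTF-16LE. The class
    name and helper are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ZeroBytesInUtf16 {
    // Counts how many bytes in the array have the value 0.
    static int countZeroBytes(byte[] bytes) {
        int count = 0;
        for (byte b : bytes) {
            if (b == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "TEST";
        Charset[] charsets = { StandardCharsets.UTF_8, StandardCharsets.UTF_16LE };
        for (Charset cs : charsets) {
            int zeros = countZeroBytes(text.getBytes(cs));
            System.out.println(cs.name() + ": " + zeros + " zero byte(s)");
        }
    }
}
```

    In UTF-16LE every ASCII character carries a zero high byte, so scanning
    for a single 0 byte would cut the string off after its first character.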

    If you can, I'd advise getting the C people to send a packet containing /all/
    the potential 255 non-zero byte values, and then compare what you decode it as
    with what they expect it to look like. Needless to say, you'll have to be
    careful about character encoding issues when you do the comparison...
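
    On the Java side, such a comparison might be sketched as below: decode
    each non-zero byte value (1 through 255) on its own and print the
    resulting code point, so the table can be checked against what the C side
    expects. This assumes a single-byte encoding; the class and method names
    are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AllByteValues {
    // Decodes one byte under 'cs' and returns the first code point of the result.
    // Bytes the charset cannot decode come back as U+FFFD (the replacement character).
    static int decodeSingle(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumed candidate encoding; substitute whichever one is under test.
        Charset cs = StandardCharsets.ISO_8859_1;
        for (int value = 1; value <= 255; value++) {
            int codePoint = decodeSingle((byte) value, cs);
            System.out.printf("0x%02X -> U+%04X%n", value, codePoint);
        }
    }
}
```

    Printing code points rather than the characters themselves sidesteps the
    console's own encoding, which is one of the places the comparison could
    otherwise go wrong.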

    -- chris
    Chris Uppal, Mar 28, 2007
    #4
  5. Adam Maass wrote:
    >
    > "Knute Johnson" <> wrote:
    >>>

    >>
    >> Actually very easy to do. Just create a String from your byte[]
    >> buffer and split it on the 0s.
    >>
    >> public class test {
    >> public static void main (String[] args) throws Exception {
    >> byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
    >> 0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };
    >>
    >> String str = new String(buf);

    >
    > Ahem, it will be critically important to specify the encoding to the
    > String constructor!
    >
    > String str = new String(buf, "ASCII");
    >
    >> String[] arr = str.split("\u0000");
    >>
    >> for (int i=0; i<arr.length; i++)
    >> System.out.println(arr[i]);
    >> }
    >> }
    >>


    Only if he doesn't want his system default character set. Mine
    certainly doesn't default to ASCII, or as it is more correctly known
    ANSI_X3.4-1968. What character set does your C compiler default to?

    --

    Knute Johnson
    email s/nospam/knute/
    Knute Johnson, Mar 28, 2007
    #5
  6. On Mar 28, 3:44 pm, Knute Johnson <>
    wrote:
    > Adam Maass wrote:
    >
    > > "Knute Johnson" <> wrote:

    >
    > >> Actually very easy to do. Just create a String from your byte[]
    > >> buffer and split it on the 0s.

    >
    > >> public class test {
    > >> public static void main (String[] args) throws Exception {
    > >> byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
    > >> 0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };

    >
    > >> String str = new String(buf);

    >
    > > Ahem, it will be critically important to specify the encoding to the
    > > String constructor!

    >
    > > String str = new String(buf, "ASCII");

    >
    > >> String[] arr = str.split("\u0000");

    >
    > >> for (int i=0; i<arr.length; i++)
    > >> System.out.println(arr[i]);
    > >> }
    > >> }

    >
    > Only if he doesn't want his system default character set. Mine
    > certainly doesn't default to ASCII, or as it is more correctly known
    > ANSI_X3.4-1968. What character set does your C compiler default to?
    >
    > --
    >
    > Knute Johnson
    > email s/nospam/knute/


    Thanks guys, that has worked a treat.

    The client is an old C based application and is using ASCII encoding,
    the above info has solved the problem and is working well.
    PurpleServerMonkey, Mar 28, 2007
    #6
  7. PurpleServerMonkey

    Chris Uppal Guest

    Knute Johnson wrote:

    > > Ahem, it will be critically important to specify the encoding to the
    > > String constructor!

    [..]
    > Only if he doesn't want his system default character set. Mine
    > certainly doesn't default to ASCII, or as it is more correctly known
    > ANSI_X3.4-1968. What character set does your C compiler default to?


    But using the Java system default charset is almost always going to be a bad
    mistake in this situation. Or do you have a good reason to believe that the
    default charset of the C compiler installation where the code which generates
    the UDP packets was compiled will be the same[*] as the default Java charset
    set on the system where the UDP packets are received ?

    ([*] Note: that is "will be the same", not "is likely to be the same").

    Using the default system charset for real data, in production code, is nothing
    better than lazy and incompetent.

    -- chris
    Chris Uppal, Mar 28, 2007
    #7
  8. Chris Uppal wrote:
    > Knute Johnson wrote:
    >
    >>> Ahem, it will be critically important to specify the encoding to the
    >>> String constructor!

    > [..]
    >> Only if he doesn't want his system default character set. Mine
    >> certainly doesn't default to ASCII, or as it is more correctly known
    >> ANSI_X3.4-1968. What character set does your C compiler default to?

    >
    > But using the Java system default charset is almost always going to be a bad
    > mistake in this situation. Or do you have a good reason to believe that the
    > default charset of the C compiler installation where the code which generates
    > the UDP packets was compiled will be the same[*] as the default Java charset
    > set on the system where the UDP packets are received ?
    >
    > ([*] Note: that is "will be the same", not "is likely to be the same").
    >
    > Using the default system charset for real data, in production code, is nothing
    > better than lazy and incompetent.
    >
    > -- chris


    You know I don't like being called lazy and incompetent this late in the
    evening. The other fellow mentioned nothing about the character set he
    was using. Picking one out of a hat is no better than using the system
    default. Odds are pretty good that system defaults will be the same if
    used on the same computer, albeit different compilers. Specifying the
    wrong character set may very well cause it to not work at all. If he
    said gee this doesn't work for my Chinese clients, they get a bunch of
    ?????? then you can deal with his character set problems. Or you can
    force his Chinese clients to use ANSI_X3.4-1968 and they will get ??????
    right off the bat.

    It's late and this lazy incompetent is going to bed now.

    --

    Knute Johnson
    email s/nospam/knute/
    Knute Johnson, Mar 28, 2007
    #8
  9. PurpleServerMonkey

    Chris Uppal Guest

    Knute Johnson wrote:

    [me:]
    > > Using the default system charset for real data, in production code, is
    > > nothing better than lazy and incompetent.
    > >
    > > -- chris

    >
    > You know I don't like being called lazy and incompetent this late in the
    > evening.


    You won't see this until tomorrow, and I suppose you'll like it even less then.
    But I'm afraid that I'm going to stick by my comment, and if -- by
    implication -- it applies to you, then that's unfortunate because I had meant
    nothing personal, but I will also stand by the implications.

    -- chris
    Chris Uppal, Mar 28, 2007
    #9
  10. PurpleServerMonkey

    Adam Maass Guest

    "Chris Uppal" <-THIS.org> wrote:
    > PurpleServerMonkey wrote:
    >
    >> What I'd like to know is, what is the best way to retrieve zero byte
    >> terminated strings from the byte array?

    > [..]


    Thank you Chris for a thorough, thoughtful, and detailed response.

    If you expect 0-byte terminated strings, you absolutely need to know the
    character encoding in use; some of the more exotic encodings (to those of us
    using Latin charsets) will contain 0-bytes that do not indicate the end of a
    string. If you don't specify the charset and operate on a system that
    defaults to one of these "exotic" encodings, then the String(byte[])
    constructor will not do what you expect.

    In short, when dealing with raw bytes that represent character data, you
    need to know what encoding was used to generate the bytes.
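
    One way to find out early that the assumed encoding is wrong is to decode
    strictly, so that bytes which don't fit the charset raise an error instead
    of being quietly replaced. The following is only a sketch using the
    standard CharsetDecoder API; the class name StrictDecode and the method
    decodeStrict are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Decodes buf[offset .. offset+length) and throws if any byte sequence
    // is not valid in the given charset.
    static String decodeStrict(byte[] buf, int offset, int length, Charset cs)
            throws CharacterCodingException {
        return cs.newDecoder()
                 .onMalformedInput(CodingErrorAction.REPORT)
                 .onUnmappableCharacter(CodingErrorAction.REPORT)
                 .decode(ByteBuffer.wrap(buf, offset, length))
                 .toString();
    }

    public static void main(String[] args) {
        byte[] good = { 0x54, 0x45, 0x53, 0x54 };   // "TEST" in ASCII
        byte[] bad = { 0x54, (byte) 0xE9 };         // 0xE9 is outside 7-bit ASCII
        for (byte[] sample : new byte[][] { good, bad }) {
            try {
                String s = decodeStrict(sample, 0, sample.length, StandardCharsets.US_ASCII);
                System.out.println("decoded: " + s);
            } catch (CharacterCodingException e) {
                System.out.println("not valid US-ASCII: " + e);
            }
        }
    }
}
```

    The plain String constructors never fail in this way; they substitute a
    replacement character, which is why a wrong guess can go unnoticed.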
    Adam Maass, Mar 28, 2007
    #10
  11. Chris Uppal wrote:
    > Knute Johnson wrote:
    >
    > [me:]
    >>> Using the default system charset for real data, in production code, is
    >>> nothing better than lazy and incompetent.
    >>>
    >>> -- chris

    >> You know I don't like being called lazy and incompetent this late in the
    >> evening.

    >
    > You won't see this until tomorrow, and I suppose you'll like it even less then.
    > But I'm afraid that I'm going to stick by my comment, and if -- by
    > implication -- it applies to you, then that's unfortunate because I had meant
    > nothing personal, but I will also stand by the implications.
    >
    > -- chris


    Computer programs are tools, just like any other tool. They have a cost
    and a benefit. You can buy a rusty box-end wrench or you can buy a gold
    plated spanner. They do the same job most of the time. To say that you
    absolutely have to use the gold plated spanner and that you are lazy and
    incompetent if you don't is just plain rude.

    If the default character set wasn't adequate for his purposes he could
    easily change it. As it turns out he was happy with the solution
    provided and it worked just fine.

    And now I'm going to take my lazy butt to town.

    --

    Knute Johnson
    email s/nospam/knute/
    Knute Johnson, Mar 28, 2007
    #11
