Special Character Token

Sameer · Mar 1, 2005

Hello,
In the process of designing a chatting system, I have to send some text
from one machine to another. This text usually contains 3 to 4 parts
separated by a token like ~ or ^ or $. At the other end I use
StringTokenizer to decode the text.
The parts separated by these tokens must not themselves contain the
tokens. We cannot expect users to respect that, though: a user may type
a message which contains these tokens, and that will make the chat
system malfunction.

Can I insert some special token characters which cannot easily be
generated from the keyboard or in general typing?
How do I generate such token characters?
Please answer in a Java and Unicode context.
Please give methods for encoding and decoding the characters and for
embedding them in text.
-Sameer

Eric Sosman · Mar 1, 2005

Sameer said:
Hello,
In the process of designing a chatting system, I have to send some text
from one machine to another. This text usually contains 3 to 4 parts
separated by a token like ~ or ^ or $. At the other end I use
StringTokenizer to decode the text.
It is expected that the texts separated by these tokens must not
contain such tokens. We do not expect such things from users and a user
may type a message which contain these tokens and it will lead to
malfunctioning of the chatting system.

Can I insert some special character tokens which can not be generated
by keyboard easily or in general typing.
How to generate such token characters?
Please give answer in Java and Unicode context.
Give methods for coding and decoding of characters and to embed them in
text.

"Security by obscurity" is not very robust. As soon as
somebody figures out the right ALT sequence or similar trick,
the vandals will have a field day with your chat system.

A better way is to develop an encoding that can handle
all characters, even those that would ordinarily have special
meaning. One simple approach is to double a special character
whenever it appears in a non-special context (e.g., in the
message body). For example, if you use # to delimit the
parts of the message and the three parts are

Knick-knack paddy-whack

Give # dog # bone

This old ### came rolling home

... you could transmit the message as

#
Knick-knack paddy-whack
#
Give ## dog ## bone
#
This old ###### came rolling home
#

When the receiver gets this stream of characters it looks
for each #. If a # is followed by another #, the two become
one # considered as an ordinary data character. But if a #
is followed by something other than a second #, it is a part
separator, not a data character.
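
A minimal Java sketch of the doubling scheme described above, for illustration only. The class and method names (HashDoubling, encode, decode) are made up for this example. It also makes one assumption that goes slightly beyond the description: the separator is written as # followed by a newline, as in the layout of the example above. With a bare # as separator, the parts "a#", "b" and the parts "a", "#b" would both be sent as "a###b", so the receiver could not tell them apart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class HashDoubling {
    private static final char MARK = '#';       // the special character
    private static final char SEP_TAIL = '\n';  // assumption: a separator is "#\n"

    // Doubles every # inside a part and joins the parts with "#\n".
    static String encode(List<String> parts) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                out.append(MARK).append(SEP_TAIL);
            }
            for (char c : parts.get(i).toCharArray()) {
                if (c == MARK) {
                    out.append(MARK); // "#" in the data becomes "##"
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverses encode(): "##" is a data '#', "#\n" ends the current part.
    static List<String> decode(String wire) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < wire.length()) {
            char c = wire.charAt(i);
            if (c != MARK) {
                current.append(c);
                i++;
                continue;
            }
            if (i + 1 >= wire.length()) {
                throw new IllegalArgumentException("dangling # at end of input");
            }
            char next = wire.charAt(i + 1);
            if (next == MARK) {
                current.append(MARK);
            } else if (next == SEP_TAIL) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                throw new IllegalArgumentException("unexpected character after # at index " + i);
            }
            i += 2;
        }
        parts.add(current.toString()); // the last part has no trailing separator
        return parts;
    }
}
```

With this sketch, StringTokenizer is no longer needed on the receiving side: decode(encode(parts)) gives back the original parts, including parts that are empty or that begin or end with #.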

Oscar kind · Mar 1, 2005

Sameer said:
In the process of designing a chatting system, I have to send some text
from one machine to another. This text usually contains 3 to 4 parts
separated by a token like ~ or ^ or $. At the other end I use
StringTokenizer to decode the text.
It is expected that the texts separated by these tokens must not
contain such tokens. We do not expect such things from users and a user
may type a message which contain these tokens and it will lead to
malfunctioning of the chatting system.

Can I insert some special character tokens which can not be generated
by keyboard easily or in general typing.
How to generate such token characters?
Please give answer in Java and Unicode context.
Give methods for coding and decoding of characters and to embed them in
text.

As stated earlier by Eric, such a thing will not work because the text of
the user can include anything. His idea of doubling special characters is
therefore a good one.

<plug mode="shameless">

Another solution is to use CSV records, although implementing this from
scratch would be more work. See my playground project on
http://oscar.stachanov.com/java/
(look for the classes CSVParser & CSVFormatter)

</plug mode="shameless">
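
For readers who want to see what the CSV approach looks like on the sending side, here is a rough sketch. It is not the CSVFormatter from the project mentioned above; the class and method names are invented for this example. It assumes the common CSV conventions: a field is wrapped in double quotes when it contains a comma, a double quote, a line break, or leading/trailing whitespace, and a double quote inside a quoted field is written twice.

```java
import java.util.List;

// Hypothetical helper, not the CSVFormatter class referred to in the post.
final class CsvQuoting {
    // Returns the field as it should appear inside a CSV record.
    static String quoteField(String field) {
        boolean needsQuotes =
                field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0
                || !field.equals(field.trim()); // leading or trailing whitespace
        if (!needsQuotes) {
            return field;
        }
        // Inside quotes, every " is doubled.
        return '"' + field.replace("\"", "\"\"") + '"';
    }

    // Joins the quoted fields with commas into one record (no line terminator).
    static String formatRecord(List<String> fields) {
        StringBuilder record = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                record.append(',');
            }
            record.append(quoteField(fields.get(i)));
        }
        return record.toString();
    }
}
```

The idea is the same as doubling a special character: the user may type commas and quotes freely, and the formatter makes sure they cannot be mistaken for field boundaries.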

shriop · Mar 3, 2005

I went and looked at this project of yours. Do you really think
wrapping up ReadLine.Split(',') inside a class is going to fool anyone?
And your description for the project says that you're cleanly and
correctly handling the csv format. This is totally wrong. I'm sorry.

Oscar kind · Mar 3, 2005

shriop said:
I went and looked at this project of yours. Do you really think
wrapping up ReadLine.Split(',') inside a class is going to fool anyone?
And your description for the project says that you're cleanly and
correctly handling the csv format. This is totally wrong. I'm sorry.

The implementation is correct: it handles the CSV format exactly as
specified here:
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm

This implementation behaves more predictably than, for example, the
one from Microsoft: that one uses the list separator from the regional
settings, but sometimes silently ignores it. Microsoft didn't document
that their implementation does this, let alone when, nor which
separator is used instead.

Also, IMHO, using String.split(String, int) doesn't make an implementation
unclean (and there is no ReadLine class btw). I'm therefore not trying to
fool anyone.

Admittedly, there are improvements possible, and I welcome any
constructive criticism. This requires arguments though. Did you have any?

shriop · Mar 4, 2005

You're absolutely right. I was too quick to judge, and now I see how
you're handling all the situations. On a second look, the only rule I
can find that you still appear to be violating is

Fields with leading or trailing spaces must be delimited with
double-quote characters.

You appear to always be trimming leading and trailing whitespace,
whether in quotes or not. Other than that, and other than the fact
that your class is very string heavy, it does appear correct.

Oscar kind · Mar 4, 2005

shriop said:
You appear to always be trimming leading and trailing whitespace,
whether in quotes or not. Other than that, and other than the fact
that your class is very string heavy, it does appear correct.

It is rather heavy: String.split(String, int) uses regular expressions,
which isn't efficient for a case as simple as this. It's just easy to
understand and maintain.

Also, note that I trim leading and trailing whitespace first, and only
then remove surrounding quotes (if present): the field separator (',')
may be surrounded by whitespace, which isn't considered part of the
fields (hence it's trimmed). This is also the reason that fields with
leading and/or trailing whitespace should be quoted.

If I were to optimize it, I would need to do the following:
- Read the stream character by character (probably buffered, but still)
- Add field values character by character instead of token by token

This works approximately the same, but the algorithm is (IMHO) less easy
to understand, as it is more low-level. I'm not used to that.
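
To give an idea of what the character-by-character variant could look like, here is a rough sketch. It is not the CSVParser from the project discussed above; the class name, the method name and the lenient handling of malformed input are assumptions made for this example. It follows the trimming rule described above: whitespace around the separator is dropped, while whitespace inside quotes is kept.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the CSVParser class discussed in this thread.
final class CharByCharCsv {
    // Reads one record (up to '\n' or end of stream); returns null at end of stream.
    static List<String> readRecord(Reader in) throws IOException {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int protectedEnd = 0;         // characters before this index came from quotes
        boolean inQuotes = false;     // currently inside "..."
        boolean pendingQuote = false; // saw a " inside quotes; next char decides its meaning
        boolean sawInput = false;
        int c;
        while ((c = in.read()) != -1) {
            sawInput = true;
            if (pendingQuote) {
                pendingQuote = false;
                if (c == '"') {           // "" inside quotes is a literal quote
                    field.append('"');
                    continue;
                }
                inQuotes = false;         // the earlier quote closed the quoted section
                protectedEnd = field.length();
            }
            if (inQuotes) {
                if (c == '"') {
                    pendingQuote = true;
                } else {
                    field.append((char) c);
                }
            } else if (c == '"') {
                inQuotes = true;          // lenient: a quote may open anywhere in a field
            } else if (c == ',') {
                fields.add(finish(field, protectedEnd));
                field.setLength(0);
                protectedEnd = 0;
            } else if (c == '\n') {
                break;                    // end of record
            } else if (!(Character.isWhitespace(c) && field.length() == 0)) {
                field.append((char) c);   // leading unquoted whitespace is skipped
            }
        }
        if (!sawInput) {
            return null;
        }
        if (inQuotes) {
            protectedEnd = field.length(); // quote closed at end of input, or never closed
        }
        fields.add(finish(field, protectedEnd));
        return fields;
    }

    // Trims trailing whitespace, but never into the part that was quoted.
    private static String finish(StringBuilder field, int protectedEnd) {
        int end = field.length();
        while (end > protectedEnd && Character.isWhitespace(field.charAt(end - 1))) {
            end--;
        }
        return field.substring(0, end);
    }
}
```

Wrapping the source in a BufferedReader keeps the single-character reads cheap. As noted above, the logic is harder to follow than the split-based version because the state (inside quotes or not, quote just seen or not) has to be tracked by hand.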

shriop · Mar 5, 2005

You got me again, you're right about the trimming.
