Special Character Token

Sameer

Hello,
While designing a chat system, I have to send some text from one
machine to another. This text usually contains 3 to 4 parts separated
by a token such as ~, ^, or $. At the other end I use StringTokenizer
to decode the text.
This assumes that the parts themselves never contain these tokens. We
cannot expect that of users: a user may type a message that contains
one of these tokens, and the chat system will then malfunction.

Can I insert special token characters that cannot easily be generated
from the keyboard or in ordinary typing?
How can I generate such token characters?
Please answer in a Java and Unicode context.
Please include methods for encoding and decoding the characters and
for embedding them in text.
-Sameer
 
Eric Sosman

"Security by obscurity" is not very robust. As soon as
somebody figures out the right ALT sequence or similar trick,
the vandals will have a field day with your chat system.

A better way is to develop an encoding that can handle
all characters, even those that would ordinarily have special
meaning. One simple approach is to double a special character
whenever it appears in a non-special context (e.g., in the
message body). For example, if you use # to delimit the
parts of the message and the three parts are

Knick-knack paddy-whack

Give # dog # bone

This old ### came rolling home

... you could transmit the message as

#
Knick-knack paddy-whack
#
Give ## dog ## bone
#
This old ###### came rolling home
#

When the receiver gets this stream of characters it looks
for each #. If a # is followed by another #, the two become
one # considered as an ordinary data character. But if a #
is followed by something other than a second #, it is a part
separator, not a data character.
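
A minimal Java sketch of this doubling scheme follows. It is not Eric's
code: the class name HashFraming and the method names are made up for
illustration. It also makes one assumption beyond the description above:
every part is terminated by "#" plus a newline, as in the layout of the
example. Without some non-# character after the separator, a part that
begins with # could not be told apart from a previous part that ends
with one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the thread: frames message parts with '#'.
class HashFraming {
    private static final char MARK = '#';

    /** Doubles every '#' inside a part and ends each part with "#\n". */
    static String encode(List<String> parts) {
        StringBuilder out = new StringBuilder();
        for (String part : parts) {
            for (int i = 0; i < part.length(); i++) {
                char c = part.charAt(i);
                out.append(c);
                if (c == MARK) {
                    out.append(MARK); // data '#' is sent as "##"
                }
            }
            out.append(MARK).append('\n'); // assumed terminator: '#' + newline
        }
        return out.toString();
    }

    /** Reverses encode(): "##" is a data '#', any other '#' ends a part. */
    static List<String> decode(String wire) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < wire.length()) {
            char c = wire.charAt(i);
            if (c != MARK) {
                current.append(c);
                i++;
            } else if (i + 1 < wire.length() && wire.charAt(i + 1) == MARK) {
                current.append(MARK);
                i += 2;
            } else {
                parts.add(current.toString());
                current.setLength(0);
                i += 2; // skip the '#' and the newline that follows it
            }
        }
        if (current.length() > 0) {
            throw new IllegalArgumentException("last part is not terminated");
        }
        return parts;
    }
}
```

With this sketch, HashFraming.decode(HashFraming.encode(parts)) should
return the original list even when a part contains #, ~, ^, $ or
newlines, so the user's text no longer needs to be restricted.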
 
Oscar kind

As Eric stated earlier, relying on hard-to-type separator characters will
not work, because the user's text can include anything. His idea of
doubling special characters is therefore a good one.

<plug mode="shameless">

Another solution is to use CSV records, although implementing this from
scratch would be more work. See my playground project on
http://oscar.stachanov.com/java/
(look for the classes CSVParser & CSVFormatter)

</plug mode="shameless">
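
To illustrate the CSV idea without depending on the project above, here
is a small sketch of the formatting side. It is not the CSVFormatter
class mentioned in the plug; the class and method names are hypothetical.
It assumes the common CSV convention that a field is wrapped in double
quotes when it contains a comma, a double quote, a line break, or
leading/trailing whitespace, and that embedded double quotes are doubled.

```java
import java.util.List;

// Hypothetical sketch of CSV field quoting, not the CSVFormatter from the plug.
class CsvFieldSketch {

    /** Quotes a field only when the assumed CSV rules require it. */
    static String quoteIfNeeded(String field) {
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0
                || !field.equals(field.trim()); // leading/trailing whitespace
        if (!needsQuotes) {
            return field;
        }
        // Embedded quotes are doubled, then the whole field is wrapped.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the fields of one record with commas. */
    static String formatRecord(List<String> fields) {
        StringBuilder record = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                record.append(',');
            }
            record.append(quoteIfNeeded(fields.get(i)));
        }
        return record.toString();
    }
}
```

A chat message such as "Give ~ dog, ^ bone" would then travel as a
single quoted field, whatever separators the user typed.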
 
shriop

I went and looked at this project of yours. Do you really think
wrapping up ReadLine.Split(',') inside a class is going to fool anyone?
And your description for the project says that you're cleanly and
correctly handling the csv format. This is totally wrong. I'm sorry.
 
Oscar kind

shriop said:
I went and looked at this project of yours. Do you really think
wrapping up ReadLine.Split(',') inside a class is going to fool anyone?
And your description for the project says that you're cleanly and
correctly handling the csv format. This is totally wrong. I'm sorry.

The implementation is correct: it handles the CSV format exactly as
specified here:
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm

This implementation behaves more predictably than, for example,
Microsoft's: that one uses the list separator from the regional settings,
but sometimes silently ignores it. Microsoft didn't document that their
implementation does this, let alone when, nor what separator is used
instead.

Also, IMHO, using String.split(String, int) doesn't make an implementation
unclean (and there is no ReadLine class btw). I'm therefore not trying to
fool anyone.

Admittedly, there are improvements possible, and I welcome any
constructive criticism. This requires arguments though. Did you have any?
 
shriop

You're absolutely right. I was too quick to judge, and now I see how
you're handling all the situations. On a second look, the only rule I
can find that you still seem to be violating is

Fields with leading or trailing spaces must be delimited with
double-quote characters.

You appear to always trim leading and trailing whitespace, whether in
quotes or not. Other than that, and other than the fact that your
class is very string heavy, it does appear correct.
 
Oscar kind

shriop said:
You appear to always trim leading and trailing whitespace, whether in
quotes or not. Other than that, and other than the fact that your
class is very string heavy, it does appear correct.

It is rather heavy: String.split(String, int) uses regular expressions,
which isn't efficient for a case as simple as this. It's just easy to
understand and maintain.

Also, note that I trim leading and trailing whitespace first, and then
remove surrounding quotes (if present): the field separator (',') may
be surrounded by whitespace, which isn't considered part of the fields
(hence it's trimmed). This is also the reason that fields with leading
and/or trailing whitespace should be quoted.

If I were to optimize it, I would need to do the following:
- Read the stream character by character (probably buffered, but still)
- Add field values character by character instead of token by token

This works approximately the same, but the algorithm is (IMHO) less easy
to understand, as it is more low-level. I'm not used to that.
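
For comparison, a character-by-character version could look roughly like
the sketch below. It is not the CSVParser class discussed in this thread;
the class name is hypothetical, it handles a single record that is
already in memory as a String, and it assumes the rules mentioned above:
whitespace around an unquoted field is trimmed, whitespace inside quotes
is kept, and a doubled quote inside a quoted field is a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the CSVParser discussed in the thread.
class CsvLineSketch {

    /** Splits one CSV record into fields, reading character by character. */
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;  // currently inside a quoted section
        boolean wasQuoted = false; // the current field started with a quote
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote is a literal quote
                    i++;
                } else {
                    inQuotes = false;  // closing quote
                }
            } else if (c == '"' && !wasQuoted && field.toString().trim().isEmpty()) {
                field.setLength(0);    // drop whitespace before the opening quote
                inQuotes = true;
                wasQuoted = true;
            } else if (c == ',') {
                fields.add(finish(field, wasQuoted));
                field.setLength(0);
                wasQuoted = false;
            } else if (!(wasQuoted && Character.isWhitespace(c))) {
                field.append(c);       // whitespace after a closing quote is skipped
            }
        }
        fields.add(finish(field, wasQuoted));
        return fields;
    }

    /** Quoted fields are kept as-is; unquoted fields are trimmed. */
    private static String finish(StringBuilder field, boolean wasQuoted) {
        return wasQuoted ? field.toString() : field.toString().trim();
    }
}
```

The state lives in two booleans instead of in a regular expression,
which is the lower-level style described above. Reading from a buffered
stream instead of a String would follow the same loop.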
 
