marcus wrote:
Please don't top post.
> This is a very confusing post, but I think I see why no-one has
> responded. You are using java.io.Str*eam*Tokenizer and I for one did
> not know that existed.
Which apparently confused you, but probably not any of the many
programmers around here who know all about StreamTokenizer, which has
been in the Java platform since 1.0 (just like StringTokenizer). I, for
one, had not responded because I generally don't read newsgroups on
Saturday or Sunday.
> In fact, I misread your class the first 6 times
> hehe. I use java.util.Str*ing*Tokenizer, which doesn't appear to have
> the same limitations. I just read a line and tokenize it. If you don't
> need the other BS about C++ comments and stuff, try the easy way.
StreamTokenizer has a significant number of attractive features, all
tempered by the fact that it is fundamentally broken. Those programmers
around here who are familiar with StreamTokenizer would have immediately
recognized the problems the OP described.
Not surprising.
More on this later, but what did you expect?
I'm not surprised.
You would think that, wouldn't you? But (1) StreamTokenizer is
fundamentally broken, and (2) you probably don't really want to do that
in the first place.
If it is sufficient for your needs then marcus' suggestion of
StringTokenizer is indeed a reasonable way to go. If you need more
powerful / flexible tokenization then you'll need to write your own
tokenizer (it's not that hard) or find a satisfactory third-party solution.
Either way, you need to read the data via a Reader configured to use the
correct charset. This may in fact mean providing the ability to change
charset on an e-mail by e-mail basis, which is doable but a bit tricky.
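
To make that concrete, here is a minimal sketch of the "read a line and
tokenize it" approach on top of a Reader that decodes with an explicitly
chosen charset. The class name CharsetAwareTokenizing and the method
tokenize are made up for illustration, and taking the charset as a plain
name is only an assumption about how you would get it (from a message's
Content-Type header, for instance).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class CharsetAwareTokenizing {

    // Hypothetical helper: decode raw bytes with the named charset, then
    // split each line on StringTokenizer's default whitespace delimiters.
    static List<List<String>> tokenize(byte[] raw, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        List<List<String>> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                List<String> tokens = new ArrayList<>();
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens()) {
                    tokens.add(st.nextToken());
                }
                lines.add(tokens);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9 au\nlait".getBytes(StandardCharsets.UTF_8);
        // Same bytes, two charsets: only the first decodes the accent correctly.
        System.out.println(tokenize(utf8, "UTF-8"));
        System.out.println(tokenize(utf8, "ISO-8859-1"));
    }
}
```

Decoding the same bytes with the wrong charset still yields tokens, just
garbled ones, which is why the charset has to be chosen per message
rather than assumed.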
StreamTokenizer was designed before Java appreciated the importance of
the distinction between character streams and byte streams, and as a
result it is deeply flawed. In particular, its character attribute
tables cover only the first 256 character codes, 0x00 through 0xff, so
you cannot assign attributes to any character above that range. It was
retrofitted with support for character input via a Reader, but that only
solves a few of its problems.
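
The table limit is easy to see in a small experiment. The sketch below
is illustrative only: the class name AttributeTableSketch and the helper
words are made up, and the behavior it shows is what I understand the
standard JDK implementation to do, not something the API documentation
spells out. It asks the tokenizer to treat one extra character as
whitespace, once with ';' (inside the table) and once with U+3000
IDEOGRAPHIC SPACE (outside it).

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class AttributeTableSketch {

    // Hypothetical helper: collect the word tokens of `input` after asking
    // the tokenizer to treat `extraWhitespace` as a whitespace character.
    static List<String> words(String input, int extraWhitespace) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.whitespaceChars(extraWhitespace, extraWhitespace);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            // Not expected for a StringReader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // ';' is below 0x100, so the whitespace request takes effect.
        System.out.println(words("foo;bar", ';').size());
        // U+3000 is above 0xff, so the request falls outside the table.
        System.out.println(words("foo\u3000bar", 0x3000).size());
    }
}
```

On the implementations I know of, the first call yields two words and
the second yields one, because the request for U+3000 is silently
dropped and the character is then treated as part of a word.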
As an aside, it's not clear to me from your comments that you have a
good grasp of Unicode. I recommend the article "The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)":
http://www.joelonsoftware.com/articles/Unicode.html
It reads well and contains loads of good information for those
insufficiently clued in to the world of character sets.
John Bollinger