questions about StreamTokenizer

Christian Bongiorno · May 8, 2004

I am trying to use StreamTokenizer to parse email (a spam corpus) and I
am running into some problems.

First, string tokenizer evaluates every character between 00->ff: ASCII
and extended ASCII.

However, I am noticing that on occasion, when it tells me it has a
TT_WORD value, that the sval contains characters with a value > ff.

I assumed the tokenizer would just treat these as whitespace since I
set .whitespaceChars(128, 0xffff).

So, I guess what I need to know, is there something I am missing, or is
there a better class that can actually deal with unicode characters?

Christian

marcus · May 10, 2004

This is a very confusing post, but I think I see why no-one has
responded. You are using java.io.Str*eam*Tokenizer and I for one did
not know that existed. In fact, I misread your class the first 6 times
hehe. I use java.util.Str*ing*Tokenizer, which doesn't appear to have
the same limitations. I just read a line and tokenize it. If you don't
need the other BS about C++ comments and stuff, try the easy way.
Cheers!
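
A minimal sketch of the "read a line and tokenize it" approach, assuming
whitespace-separated words are all that is needed. The class name
LineTokens and the sample text are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper: splits one line into tokens with java.util.StringTokenizer.
class LineTokens {
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        // The one-argument constructor uses the default delimiters " \t\n\r\f".
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real mail source here.
        BufferedReader in = new BufferedReader(
                new StringReader("Buy  now\tcheap\nsecond line"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(split(line));
        }
    }
}
```

StringTokenizer only splits on the delimiter characters it is given, so
punctuation stays attached to the neighbouring word unless it is added
to the delimiter set.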

John C. Bollinger · May 10, 2004

marcus wrote:

Please don't top post.

This is a very confusing post, but I think I see why no-one has
responded. You are using java.io.Str*eam*Tokenizer and I for one did
not know that existed.

Which apparently confused you, but probably not any of the many
programmers around here who know all about StreamTokenizer, which has
been in the Java platform since 1.0 (just like StringTokenizer). I, for
one, had not responded because I generally don't read newsgroups on
Saturday or Sunday.

In fact, I misread your class the first 6 times
hehe. I use java.util.Str*ing*Tokenizer, which doesn't appear to have
the same limitations. I just read a line and tokenize it. If you don't
need the other BS about C++ comments and stuff, try the easy way.

StreamTokenizer has a significant number of attractive features, all
tempered by the fact that it is fundamentally broken. Those programmers
around here who are familiar with StreamTokenizer would have immediately
recognized the problems the OP described.

Not surprising.

More on this later, but what did you expect?

I'm not surprised.

You would think that, wouldn't you. But (1) StreamTokenizer is
fundamentally broken, and (2) you probably don't really want to do that
in the first place.

If it is sufficient for your needs then marcus' suggestion of
StringTokenizer is indeed a reasonable way to go. If you need more
powerful / flexible tokenization then you'll need to write your own
tokenizer (it's not that hard) or find a satisfactory third-party solution.
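
As a rough idea of what "write your own" could look like, here is a
minimal sketch that treats any run of Unicode letters or digits as a
word. The class SimpleWordTokenizer and its word rule are assumptions
for illustration, not an existing library. Surrogate chars are kept as
part of a word as a simplification, so supplementary characters are not
classified properly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer: a word is a maximal run of letters/digits.
class SimpleWordTokenizer {
    static List<String> tokenize(Reader in) throws IOException {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            // Character.isLetterOrDigit covers the whole BMP, not just 0x00-0xff.
            if (Character.isLetterOrDigit(ch) || Character.isSurrogate(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    // Convenience overload for in-memory text.
    static List<String> tokenize(String text) {
        try {
            return tokenize(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Buy now!! \u4e2d\u6587 caf\u00e9 123"));
    }
}
```

A real spam tokenizer would probably want extra rules (URLs, header
names, case folding), but those are policy choices layered on the same
loop.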

Either way, you need to read the data via a Reader configured to use the
correct charset. This may in fact mean providing the ability to change
charset on an e-mail by e-mail basis, which is doable but a bit tricky.
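
A minimal sketch of the per-message Reader idea, assuming the charset
name has already been pulled out of the message's Content-Type header
by some other code. The class MailReaders and the ISO-8859-1 fallback
are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: opens a Reader using the charset a message declares.
class MailReaders {
    // Falls back to ISO-8859-1 (an assumption) when the name is missing,
    // malformed, or not supported by this JVM.
    static Charset charsetOrDefault(String name) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Malformed charset name (IllegalCharsetNameException): use the fallback.
        }
        return StandardCharsets.ISO_8859_1;
    }

    static Reader open(InputStream body, String declaredCharset) {
        return new InputStreamReader(body, charsetOrDefault(declaredCharset));
    }

    // Drains a Reader into a String; only used to show the decoded result.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(open(new ByteArrayInputStream(body), "UTF-8")));
    }
}
```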

StreamTokenizer was designed before Java appreciated the importance of
the distinction between character streams and byte streams, and as a
result it is deeply flawed. In particular, its character attribute
tables only support the first 256 characters, 0x00 -> 0xff. It was
retrofitted with support for character input via a Reader, but that only
solves a few of its problems.
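
The 256-entry limit can be seen with a small experiment. This is a
sketch, assuming the behaviour of the standard JDK implementation as I
understand it: whitespaceChars clamps its upper bound to 255, and any
character above 0xff is always treated as a word character. The class
name StreamTokenizerLimitDemo is made up.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: asks for 128..0xFFFF to be whitespace, then lists the words.
class StreamTokenizerLimitDemo {
    static List<String> words(String text) {
        try {
            StreamTokenizer st = new StreamTokenizer(new StringReader(text));
            // Only 128..255 can actually be marked; the table has 256 entries.
            st.whitespaceChars(128, 0xFFFF);
            List<String> out = new ArrayList<>();
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+4E2D and U+6587 are above 0xff; U+00E9 is inside the table.
        System.out.println(words("abc\u4e2d\u6587def caf\u00e9 ok"));
    }
}
```

If the assumption holds, the two Chinese characters stay glued inside
the first word, while the accented e (inside the table) does act as
whitespace and cuts the second word short.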

As an aside, it's not clear to me from your comments that you have a
good grasp of Unicode. I recommend this article "The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)":

http://www.joelonsoftware.com/articles/Unicode.html

It reads well and contains loads of good information for those
insufficiently clued in to the world of character sets.

John Bollinger
(e-mail address removed)

Christian Bongiorno · May 10, 2004

I had thought of the StringTokenizer class but thought the
StreamTokenizer class might be more useful and ready-to-go -- and yes,
it definitely contains attractive features.

It does appear flawed though. I will investigate creating my own
tokenizer if time permits. StreamTokenizer just leaves me in a
situation where I must throw out any string with unrecognized characters
in it (Chinese words and such).

Christian

marcus · May 10, 2004

Actually, a close reading will show the author used the two terms
interchangeably, albeit with a space between the words, which is what
threw me off. fergosh sakes, thanks for telling me I am not a real
programmer because I don't know all the tokenizer classes available --
here I thought I was being helpful.

sheesh!

marcus · May 10, 2004

I think modern anti-spam theory is leaning toward excluding anything not
demonstrably acceptable anyhow.

Roedy Green · May 10, 2004

marcus wrote:
I think modern anti-spam theory is leaning toward excluding anything not
demonstrably acceptable anyhow.

The technique I think that has the best potential is SpamAssassin
which works by people sharing info about spam they have received. The
problem is only Perl geeks can make it work so far.

See http://mindprod.com/jgloss/spam.html

marcus · May 10, 2004

Hehe Roedy -- I am far from a Perl geek. My only book is Perl, a
beginner's guide. I have gotten SpamAssassin and spamass-milter running
on my FreeBSD box, though, and now I am afraid to update sendmail!

FYI, I fed spamassassin about 10K spams to "teach it" and it gets about
70% of my spams. I have another 17K to send it of the new stuff, but I
am too lazy.

Nigel Wade · May 12, 2004

Roedy said:
The technique I think that has the best potential is SpamAssassin
which works by people sharing info about spam they have received. The
problem is only Perl geeks can make it work so far.

I'm sure many people who don't read a word of Perl have installed and run
SA. For example, on RedHat it's simply a matter of installing the
spamassassin package. It's not really that difficult (at least on UNIX/Linux,
I can't speak for Windows).

It's also very, very effective when trained properly. The Bayesian filter is
the best part about it: it trains SA on the spam and legitimate mail which
you receive, so it can differentiate between the two.

Of the 6000 spams I've received in the past month (those are the ones
which get past the first tier filter, which rejects all spam with scores
over 15) I don't think SA has let more than 10 through into my inbox. That's
better than 99% on the spams scoring less than 15. From the logs I can see
that 20% of the spam over the last month has scored above 15, so an overall
estimate would be around 99.8% success.

Also, last time I checked I'd had no false positives either.

Roedy Green · May 12, 2004

Nigel Wade wrote:
I'm sure many people who don't read a word of Perl have installed and run
SA. For example, on RedHat it's simply a matter of installing the
spamassassin package. It's not really that difficult (at least on UNIX/Linux,
I can't speak for Windows).

The guy who wrote James, the Java mailserver, is working on
integrating SpamAssassin into it as we speak. That way people might
be able to use it without even having to think about it and on any
platform.
