questions about StreamTokenizer

  • Thread starter Christian Bongiorno

Christian Bongiorno

I am trying to use StreamTokenizer to parse email (a spam corpus) and I
am running into some problems.

First, string tokenizer evaluates every character between 00->ff: ASCII
and extended ASCII.

However, I am noticing that on occasion, when it tells me it has a
TT_WORD value, the sval contains characters with a value > ff.

I assumed the tokenizer would just treat these as whitespace since I
set .whitespaceChars(128, 0xffff).

So, I guess what I need to know, is there something I am missing, or is
there a better class that can actually deal with Unicode characters?

Christian
 

marcus

This is a very confusing post, but I think I see why no-one has
responded. You are using java.io.Str*eam*Tokenizer and I for one did
not know that existed. In fact, I misread your class the first 6 times
hehe. I use java.util.Str*ing*Tokenizer, which doesn't appear to have
the same limitations. I just read a line and tokenize it. If you don't
need the other BS about C++ comments and stuff, try the easy way.
Cheers!
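
For reference, a minimal sketch of the read-a-line-and-tokenize approach, assuming whitespace is the only delimiter you care about. The class name LineTokens and the method split are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class LineTokens {
    // Splits one already-decoded line on StringTokenizer's default
    // delimiters (space, tab, newline, carriage return, form feed).
    static List<String> split(String line) {
        StringTokenizer st = new StringTokenizer(line);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }
}
```

Because StringTokenizer works on a String that has already been decoded, characters above 0xff pass through untouched. Punctuation stays attached to the neighbouring word unless you pass your own delimiter set to the two-argument constructor.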
 

John C. Bollinger

marcus wrote:

Please don't top post.

This is a very confusing post, but I think I see why no-one has
responded. You are using java.io.Str*eam*Tokenizer and I for one did
not know that existed.

Which apparently confused you, but probably not any of the many
programmers around here who know all about StreamTokenizer, which has
been in the Java platform since 1.0 (just like StringTokenizer). I, for
one, had not responded because I generally don't read newsgroups on
Saturday or Sunday.

In fact, I misread your class the first 6 times
hehe. I use java.util.Str*ing*Tokenizer, which doesn't appear to have
the same limitations. I just read a line and tokenize it. If you don't
need the other BS about C++ comments and stuff, try the easy way.

StreamTokenizer has a significant number of attractive features, all
tempered by the fact that it is fundamentally broken. Those programmers
around here who are familiar with StreamTokenizer would have immediately
recognized the problems the OP described.

Not surprising.

More on this later, but what did you expect?

I'm not surprised.

You would think that, wouldn't you. But (1) StreamTokenizer is
fundamentally broken, and (2) you probably don't really want to do that
in the first place.

If it is sufficient for your needs then marcus' suggestion of
StringTokenizer is indeed a reasonable way to go. If you need more
powerful / flexible tokenization then you'll need to write your own
tokenizer (it's not that hard) or find a satisfactory third-party solution.
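
To give an idea of how little is needed, here is a rough sketch of a hand-written tokenizer that treats any run of Unicode letters or digits as a word and everything else as a separator. The class and method names are hypothetical, and the letters-and-digits rule is only an assumption; a spam filter may well want to keep characters such as '$' or '!' as part of a token.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleWordTokenizer {
    // Returns every maximal run of letters/digits in the text.
    // Works on code points, so characters outside the BMP are handled too.
    static List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A separator ends the word collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Unlike StreamTokenizer's 256-entry attribute table, Character.isLetterOrDigit classifies the whole Unicode range, so Chinese words come back as ordinary tokens instead of forcing you to discard the string.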

Either way, you need to read the data via a Reader configured to use the
correct charset. This may in fact mean providing the ability to change
charset on an e-mail by e-mail basis, which is doable but a bit tricky.
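
A sketch of what that might look like, assuming you have already pulled a charset name out of the message's Content-Type header (that parsing is not shown here). CharsetReaderSketch, charsetOrDefault and decode are illustrative names only, and falling back to a caller-supplied default is just one possible policy.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class CharsetReaderSketch {
    // Looks up a charset by name; returns the fallback when the name is
    // null, malformed, or not supported by this JVM.
    static Charset charsetOrDefault(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    // Decodes the whole byte stream into text using the given charset.
    static String decode(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The point is that the charset is chosen per message before any tokenizing happens; whatever tokenizer you use afterwards then sees real characters rather than raw bytes.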


StreamTokenizer was designed before Java appreciated the importance of
the distinction between character streams and byte streams, and as a
result it is deeply flawed. In particular, its character attribute
tables only support the first 256 characters, 0x00 -> 0xff. It was
retrofitted with support for character input via a Reader, but that only
solves a few of its problems.
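
The behaviour described in the original post can be reproduced with a few lines of code. This is a sketch, not a fix: the class name HighCharDemo is made up, and the explanation in the comments is based on a reading of the JDK sources, where the attribute table has 256 entries and any character outside it appears to be handled as a word character.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class HighCharDemo {
    // Returns every TT_WORD token StreamTokenizer reports for the input.
    static List<String> words(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.whitespaceChars(0, ' ');
        // Request whitespace treatment for everything from 128 upward.
        // Only 128..255 can actually be recorded in the attribute table.
        st.whitespaceChars(128, 0xffff);

        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Calling words("hello \u4e2d\u6587 caf\u00e9s") shows both halves of the problem: the e-acute (0xe9) is inside the table and is treated as whitespace as requested, while the two Chinese characters still come back inside a TT_WORD despite the whitespaceChars call.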

As an aside, it's not clear to me from your comments that you have a
good grasp of Unicode. I recommend this article "The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)":

http://www.joelonsoftware.com/articles/Unicode.html

It reads well and contains loads of good information for those
insufficiently clued in to the world of character sets.


John Bollinger
 

Christian Bongiorno

I had thought of the StringTokenizer class but thought the
StreamTokenizer class might be more useful and ready-to-go -- and yes,
it definitely contains attractive features.

It does appear flawed though. I will investigate creating my own
tokenizer if time permits. StreamTokenizer just leaves me in a
situation where I must throw out any string with unrecognized characters
in it (Chinese words and such).


Christian
 

marcus

Actually, a close reading will show the author used the two terms
interchangeably, albeit with a space between the words, which is what
threw me off. Fergosh sakes, thanks for telling me I am not a real
programmer because I don't know all the tokenizer classes available --
here I thought I was being helpful.

sheesh!
 

marcus

I think modern anti-spam theory is leaning toward excluding anything not
demonstrably acceptable anyhow.
 

Roedy Green

I think modern anti-spam theory is leaning toward excluding anything not
demonstrably acceptable anyhow.

The technique I think that has the best potential is SpamAssassin
which works by people sharing info about spam they have received. The
problem is only Perl geeks can make it work so far.

See http://mindprod.com/jgloss/spam.html
 

marcus

Hehe Roedy -- I am far from a Perl geek. My only book is Perl, a
beginner's guide. I do have spamassassin and spamass-milter running
on my FreeBSD box, though, and now I am afraid to update sendmail!

FYI, I fed spamassassin about 10K spams to "teach it" and it gets about
70% of my spams. I have another 17K to send it of the new stuff, but I
am too lazy.
 

Nigel Wade

Roedy said:
The technique I think that has the best potential is SpamAssassin
which works by people sharing info about spam they have received. The
problem is only Perl geeks can make it work so far.

I'm sure many people who don't read a word of Perl have installed and run
SA. For example, on RedHat it's simply a matter of installing the
spamassassin package. It's not really that difficult (at least on UNIX/Linux,
I can't speak for Windows).

It's also very, very effective when trained properly. The Bayesian filter is
the best part about it: it trains SA on the spam and legitimate mail which
you receive, so it can differentiate between the two.

Of the 6000 spams I have received in the past month (those are the ones
which get past the first tier filter, which rejects all spam with scores
over 15) I don't think SA has let more than 10 through into my inbox. That's
better than 99% on the spams scoring less than 15. From the logs I can see
that 20% of the spam over the last month has scored above 15, so an overall
estimate would be around 99.8% success.
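
Spelling out the arithmetic behind that estimate, using only the figures above. The class and parameter names are just for illustration, and the sketch assumes that "no more than 10" is taken as exactly 10 and that every spam scoring above 15 counts as caught.

```java
class SpamRateEstimate {
    // lowScoring: spams that reached the second tier (score under 15)
    // missed: spams that ended up in the inbox anyway
    // highScoringShare: fraction of all spam rejected outright by the first tier
    static double overallCatchRate(int lowScoring, int missed, double highScoringShare) {
        // The low-scoring spams are the remaining share of the total.
        double total = lowScoring / (1.0 - highScoringShare);
        return 100.0 * (total - missed) / total;
    }
}
```

With 6000 low-scoring spams, 10 misses and a 20% first-tier share, the total is about 7500 spams, which gives an overall rate of roughly 99.87%, consistent with the "around 99.8%" quoted.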

Also, last time I checked I'd had no false positives either.
 

Roedy Green

I'm sure many people who don't read a word of Perl have installed and run
SA. For example, on RedHat it's simply a matter of installing the
spamassassin package. It's not really that difficult (at least on UNIX/Linux,
I can't speak for Windows).

The guy who wrote James, the Java mailserver, is working on
integrating SpamAssassin into it as we speak. That way people might
be able to use it without even having to think about it and on any
platform.
 
