SimpleDateFormat Slow, Looking to Build or Find Faster One

Niko

Hi,

We are finding that SimpleDateFormat is pretty slow if you're trying to
use it to parse millions of records. We improved on it by adding some
caches in the code, for cases where things like the month stay the same,
but overall we find it to be a hog.

For example, we can parse 20,000 records a second if they don't contain
dates, but when you add dates this can drop to 4,000.

So does anyone know of a good class out there, before we go and
build a faster one?

TIA
 
Roedy Green

So does anyone know of a good class out there, before we go and
build a faster one?

You might want to look into BigDate if you are dealing only with dates,
not date/timestamps. It has a couple of toString methods. You could
roll your own on those models, which should be much faster than
SimpleDateFormat.
 
Matt Humphrey

Niko said:
Hi,

We are finding that SimpleDateFormat is pretty slow if you're trying to
use it to parse millions of records. We improved on it by adding some
caches in the code, for cases where things like the month stay the same,
but overall we find it to be a hog.

For example, we can parse 20,000 records a second if they don't contain
dates, but when you add dates this can drop to 4,000.

So does anyone know of a good class out there, before we go and
build a faster one?

If speed is the issue, you might want to consider turning the problem
around. There are only about 365 days in a year, so over a 100-year period
there are only about 36,500 distinct dates. Pre-format those that are most
likely to be in your range of dates and put them in a hash table, or use a
simple indexing method. This completely sidesteps the expensive string
formatting and is especially good if there are many repeated dates.

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
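
A minimal sketch of the "simple indexing method" described above, assuming the timestamps are UTC milliseconds and the range of years is known up front. The class name DateStringTable and its methods are hypothetical, not from any library; SimpleDateFormat is used once per day at start-up to fill the table, never per record.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper: formats every day in a year range once, then serves lookups by index. */
final class DateStringTable {
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    private final long firstDayMillis; // midnight UTC of the first day covered
    private final String[] formatted;  // one pre-formatted string per day

    DateStringTable(int firstYear, int lastYear, String pattern) {
        // Assumption: everything is UTC, so every day is exactly 86,400,000 ms long.
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
        fmt.setTimeZone(utc);

        GregorianCalendar cal = new GregorianCalendar(utc);
        cal.clear();
        cal.set(firstYear, Calendar.JANUARY, 1);
        firstDayMillis = cal.getTimeInMillis();
        cal.clear();
        cal.set(lastYear + 1, Calendar.JANUARY, 1);
        int days = (int) ((cal.getTimeInMillis() - firstDayMillis) / MILLIS_PER_DAY);

        // The slow formatter runs once per day here, not once per record later.
        formatted = new String[days];
        for (int i = 0; i < days; i++) {
            formatted[i] = fmt.format(new Date(firstDayMillis + i * MILLIS_PER_DAY));
        }
    }

    /** Returns the cached date string for a UTC timestamp, or null if it falls outside the table. */
    String dateString(long utcMillis) {
        long index = Math.floorDiv(utcMillis - firstDayMillis, MILLIS_PER_DAY);
        return (index >= 0 && index < formatted.length) ? formatted[(int) index] : null;
    }
}
```

With a table built as new DateStringTable(1990, 1999, "dd MMM yyyy"), each lookup is one subtraction, one division and one array read. A null result signals a date outside the table, where the caller could fall back to SimpleDateFormat.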
 
Niko

Thanks for the BigDate and index lookup ideas. Unfortunately I'm
working with date-times, e.g. 3rd Jun 1993 05:01:43. However, I was
thinking I could produce two hash tables, one for the time and one for
the date (ignoring the year), split the string, look up each part in
its table, and adjust for the year.
 
Roedy Green

However, I was
thinking I could produce two hash tables, one for the time and one for
the date (ignoring the year), split the string, look up each part in
its table, and adjust for the year.

"Adjust for year" means recreating the logic in BigDate.

You could precompute the strings for the days for a period of five
years, index to get the YYYYMMDD part and then plop in the time
part, but that is a rather big chunk of RAM.

You could get BigDate to give you the date part. You have to do the
time part yourself. You also need a time zone adjustment.
 
Matt Humphrey

Niko said:
Thanks for the BigDate and index lookup ideas. Unfortunately I'm
working with date-times, e.g. 3rd Jun 1993 05:01:43. However, I was
thinking I could produce two hash tables, one for the time and one for
the date (ignoring the year), split the string, look up each part in
its table, and adjust for the year.

That's workable and equivalent to forming the string via indexed lookup, but
with more lookup elements. Your lookup tables would be the day/month, the
year, and the hour, minute and second (all the same table, from 0..59).
Assemble them via a StringBuffer. There are 86,400 one-second timestamps in
a day, so that's a bit much for direct lookup. Part of what you're trying to
avoid is the number-to-string conversion and the string assembly. This
technique does not avoid the string assembly problem, but the
number-to-string conversion is reduced to a table index.

Another way to avoid string assembly is to arrange the string to always have
the same layout, e.g. 2 digits, th|rd, sp, 3 letters, sp, dd:dd:dd. This way
you only allocate the buffer once and copy the elements to fixed places. But
I really only suggest this after a run with a serious profiler.

Cheers,
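
As a rough illustration of the fixed-layout idea above, here is a sketch for the time part only, assuming an "hh:mm:ss" layout. The class name FixedLayoutTime is hypothetical. It combines the shared 0..59 table with a buffer whose separator positions never change.

```java
/** Hypothetical sketch: writes time fields into fixed positions of a reusable buffer. */
final class FixedLayoutTime {
    // Lookup table of two-digit strings "00".."59", shared by hour, minute and second.
    private static final char[][] TWO_DIGITS = new char[60][];
    static {
        for (int i = 0; i < 60; i++) {
            TWO_DIGITS[i] = new char[] { (char) ('0' + i / 10), (char) ('0' + i % 10) };
        }
    }

    // Template "hh:mm:ss": the colons are written once, only the digit slots are overwritten.
    private final char[] buf = "00:00:00".toCharArray();

    /** Not thread-safe: the same buffer is reused on every call. */
    String format(int hour, int minute, int second) {
        System.arraycopy(TWO_DIGITS[hour], 0, buf, 0, 2);
        System.arraycopy(TWO_DIGITS[minute], 0, buf, 3, 2);
        System.arraycopy(TWO_DIGITS[second], 0, buf, 6, 2);
        // new String(char[]) still copies the buffer; writing buf straight to an
        // output stream or Writer would avoid even that allocation.
        return new String(buf);
    }
}
```

For example, new FixedLayoutTime().format(5, 1, 43) returns "05:01:43". Whether this beats a plain StringBuffer in practice is exactly what the profiler run should decide.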
 
Jim Sculley

Niko said:
Hi,

We are finding that SimpleDateFormat is pretty slow if you're trying to
use it to parse millions of records. We improved on it by adding some
caches in the code, for cases where things like the month stay the same,
but overall we find it to be a hog.

For example, we can parse 20,000 records a second if they don't contain
dates, but when you add dates this can drop to 4,000.

On what hardware?

So does anyone know of a good class out there, before we go and
build a faster one?

Are you using SimpleDateFormat correctly? You should not create a new
instance for each record. I get a throughput of about 150,000 calls to
format() per second using an array of one million random dates.
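
A minimal sketch of what reusing one instance can look like, assuming UTC timestamps held as millisecond values. The class name ReusedFormatter is hypothetical. Note that SimpleDateFormat is not thread-safe, so a shared instance like this has to stay on a single thread.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical sketch: one SimpleDateFormat for the whole run instead of one per record. */
final class ReusedFormatter {
    // Built once. SimpleDateFormat is not thread-safe, so keep this instance on one thread.
    private final SimpleDateFormat fmt;

    ReusedFormatter(String pattern) {
        fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: input values are UTC
    }

    /** Formats every timestamp with the same formatter and a single scratch Date. */
    String[] formatAll(long[] utcMillis) {
        String[] out = new String[utcMillis.length];
        Date scratch = new Date(0L);
        for (int i = 0; i < utcMillis.length; i++) {
            scratch.setTime(utcMillis[i]);
            out[i] = fmt.format(scratch);
        }
        return out;
    }
}
```

Timing formatAll over a large array of random values gives a per-machine baseline to compare any hand-rolled replacement against.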
 
Chris Uppal

Niko said:
Thanks for the BigDate and index lookup ideas. Unfortunately I'm
working with date-times, e.g. 3rd Jun 1993 05:01:43.

I realise this probably won't help, but do you actually *have* to format all
the dates? If you can arrange to keep them in their initial (not String) form
throughout, and only change them into strings when/if displayed to a user, then
you can avoid the overhead that way. That might well be difficult, but not
necessarily worse than messing around with faster parsing or complex caching
schemes.

-- chris
 
Niko

I create one single instance, but when we look at the profiler we see a
chunk of time spent in SimpleDateFormat. It may only be a few percent,
but when you are loading a file with 50 fields and maybe 8 dates then
you really start to see the chunk grow. We spent a long time
optimizing other parts of the code, and even NIO showed no improvement
over our enhanced buffered IO (though we prefer NIO as it reduces the
amount of custom code), so it seems an awful shame to let
SimpleDateFormat get away without being optimized.

As for the source supplying pure dates, it sometimes can come like
that, but the code is part of a data loading tool which is configurable
for any data source that can come via Streams or Channels, and we only
format the date for display at the very end. It's the parsing that
takes the time, and creating a table with all known value sections
doesn't scare us too much, as memory is cheap and this type of software
is running on big boxes overnight.
 
Wojtek

Thanks for the BigDate and index lookup ideas. Unfortunately I'm
working with date-times, e.g. 3rd Jun 1993 05:01:43. However, I was
thinking I could produce two hash tables, one for the time and one for
the date (ignoring the year), split the string, look up each part in
its table, and adjust for the year.

Why not break the date string into parts using StringTokenizer, then
evaluate each part to build the input for a Calendar object, and read
the resulting time from the Calendar?

3rd -> value of the number, ignore the text
Jun -> lookup on 12 possibilities, less if you use progressive
lookup (ie: check the first letter, if no match check the second, if
no match check the third)
1993 -> value
substring on the time
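
The steps above could be sketched roughly as follows, assuming the input always looks like "3rd Jun 1993 05:01:43", uses English month abbreviations and is in UTC. The class name HandRolledDateParser is hypothetical, and the month lookup uses a single indexOf on a packed string rather than the progressive letter-by-letter check.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.StringTokenizer;
import java.util.TimeZone;

/** Hypothetical sketch: parses "3rd Jun 1993 05:01:43" without SimpleDateFormat. */
final class HandRolledDateParser {
    // Month abbreviations packed at 3-character intervals: index / 3 is the Calendar month (0-based).
    private static final String MONTHS = "JanFebMarAprMayJunJulAugSepOctNovDec";

    // Reused between calls, so this parser is not thread-safe. Assumption: input is UTC.
    private final Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));

    /** Returns milliseconds since the epoch for the given text. */
    long parse(String text) {
        StringTokenizer st = new StringTokenizer(text, " :");
        int day = leadingInt(st.nextToken());            // "3rd" -> 3, suffix ignored
        String monthName = st.nextToken();
        int monthIndex = MONTHS.indexOf(monthName);
        if (monthName.length() != 3 || monthIndex < 0 || monthIndex % 3 != 0) {
            throw new IllegalArgumentException("Unknown month: " + monthName);
        }
        int year = Integer.parseInt(st.nextToken());
        int hour = Integer.parseInt(st.nextToken());
        int minute = Integer.parseInt(st.nextToken());
        int second = Integer.parseInt(st.nextToken());

        cal.clear(); // also resets the millisecond field
        cal.set(year, monthIndex / 3, day, hour, minute, second);
        return cal.getTimeInMillis();
    }

    /** Reads the leading decimal digits of a token and ignores the rest. */
    private static int leadingInt(String token) {
        int value = 0;
        for (int i = 0; i < token.length() && Character.isDigit(token.charAt(i)); i++) {
            value = value * 10 + (token.charAt(i) - '0');
        }
        return value;
    }
}
```

The Calendar still does the date arithmetic here, so this only removes the pattern-matching cost of SimpleDateFormat. Profiling would show whether that is the part that matters.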
 
