Buffering data on disk

carmelo

Hi everybody,
I'm developing an application which reads GPS data from COM port every
1 second. After reading these data I need to send them to a remote
host through a network. For avoiding data loss I thought to use a
buffer, which must store them on disk (to avoid data loss if the pc
turns off).
I'm thinking about using HSQLDB or H2 for storing these data, because
it's a light and fast way. What do you think about?


Cheers
Carmelo
 
carmelo

I forgot a detail:
the application could work in this way:
- a Writer thread writes data into the buffer (into the H2DB)
- a Reader thread reads data from the buffer (from the H2DB) and send
them to a remote host through the network
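The two-thread shape described here, as a minimal non-persistent skeleton (all names are illustrative; a BlockingQueue stands in for the on-disk buffer the rest of the thread discusses):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Writer/Reader thread pair with an in-memory queue standing in for the
// persistent buffer. Names and the sample sentence are illustrative.
public class GpsRelay {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> buffer = new LinkedBlockingQueue<>();

        Thread writer = new Thread(() -> {   // would read the COM port in reality
            try {
                buffer.put("$GPGGA,123519,4807.038,N,...");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread reader = new Thread(() -> {   // would send to the remote host in reality
            try {
                System.out.println("sending: " + buffer.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}
```

The open question, which the replies below tackle, is what to substitute for the in-memory queue so that queued items survive a power cut.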
 
John B. Matthews

carmelo said:
Hi everybody,
I'm developing an application which reads GPS data from COM port
every 1 second. After reading these data I need to send them to a
remote host through a network. For avoiding data loss I thought to
use a buffer, which must store them on disk (to avoid data loss if
the pc turns off). I'm thinking about using HSQLDB or H2 for storing
these data, because it's a light and fast way. What do you think
about?

I've had positive experiences with H2. It sounds like you could have one
task feeding the database and another uploading. You'll need proper
primary key constraints and a little care to track where you left off.
Housekeeping may require scheduled downtime.

On a related note, the DBSDL project is a comprehensive yet accessible
example using H2:

<http://dbsdl.sourceforge.net/>

It uses ant to create, populate and test an H2 database. An included
xslt transforms the schema definition into sql and another generates the
schema documentation. It's particularly convenient during development
and when getting ready for delivery.
 
Tom Anderson

I'm developing an application which reads GPS data from COM port every
1 second. After reading these data I need to send them to a remote
host through a network. For avoiding data loss I thought to use a
buffer, which must store them on disk (to avoid data loss if the pc
turns off).
I'm thinking about using HSQLDB or H2 for storing these data, because
it's a light and fast way. What do you think about?

I think using a normal file would be faster and lighter. Not to mention
simpler, and more open to working with other tools.

tom
 
Lothar Kimmeringer

carmelo said:
I'm developing an application which reads GPS data from COM port every
1 second. After reading these data I need to send them to a remote
host through a network. For avoiding data loss I thought to use a
buffer, which must store them on disk (to avoid data loss if the pc
turns off).
I'm thinking about using HSQLDB or H2 for storing these data, because
it's a light and fast way. What do you think about?

I think a database is overkill, especially in this case where the
easier way is to create a file and simply write the values you
read from the serial port.

If the reading and writing threads work independently of each
other, you might consider creating a file for each entry. That
way, the reading thread already knows what still needs to be
transferred (the files that are still in the filesystem).

The writing thread creates a file, writes the data into it,
closes it and renames it to a pattern the reading thread looks
for when listing the files in the directory. That way
the reading thread never tries to read data that is not
fully written.
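Lothar's create-write-close-rename sequence might look like this with java.nio.file; the `.tmp`/`.nmea` names and the naming pattern are illustrative:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Sketch of the write-then-rename spooling scheme. The writer fills a
// ".tmp" file the reader ignores, then atomically renames it to the
// ".nmea" pattern the reader scans for.
public class SpoolWriter {

    static Path spool(Path dir, String name, String data) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");    // invisible to the reader
        Path done = dir.resolve(name + ".nmea");  // pattern the reader lists
        Files.write(tmp, data.getBytes(StandardCharsets.UTF_8));
        // Rename only after the data is fully on disk. ATOMIC_MOVE fails
        // loudly on filesystems that cannot guarantee atomicity.
        return Files.move(tmp, done, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spool");
        spool(dir, "20240101-000001", "$GPGGA,123519,4807.038,N,...");
        // The reader's side of the contract: list only completed files.
        try (DirectoryStream<Path> ready = Files.newDirectoryStream(dir, "*.nmea")) {
            for (Path p : ready) {
                System.out.println("ready: " + p.getFileName());
            }
        }
    }
}
```

The reading thread deletes each file once the remote host acknowledges it, so the directory listing is also the retransmission backlog.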


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
Martin Gregorie

I think using a normal file would be faster and lighter. Not to mention
simpler, and more open to working with other tools.
If you said one file per GPS sentence, using the timestamp as the file
name so the reader can retrieve the sets in arrival sequence, I'd agree.
Without that separation you can't guarantee no data loss: bear in mind
that we don't know what the OS is and hence whether it implements flush
on open files. The disadvantage is that you need to avoid race conditions
by writing, closing and renaming each file. Even so, one file per second
leaves plenty of time for the file to be written and renamed before the
next sample arrives.

However, I think that a database offers benefits. The set of NMEA
sentences received each second is either fixed or configurable, so table
design is trivial (one column per sentence, timestamp as primary key), you
get good performance and transaction isolation provided the commitment
unit is the GPS sentence set. Best yet, by letting the reading process
access the database remotely, you get the transfer protocol for free and
it can easily retrieve sentence sets in arrival sequence. There are free
databases available too: Derby, HSQLDB, PostgreSQL and MySQL spring to
mind.

Failing that, use MQ if it is available for the OS and the project budget
can cover it.
 
carmelo

I think using a normal file would be faster and lighter. Not to mention
simpler, and more open to working with other tools.

tom

How would you do that with a simple file?
If the Reader thread reads data from the beginning of the file, and the
Writer thread appends data to its end, its size will grow forever...

Would Java Data Objects help me?
 
carmelo

Thank you guys for your interesting suggestions.

But I'm not yet sure that the DB solution is better than writing into
file(s).

In other words, we have 1 producer (the Writer thread) and 1 consumer
(the Reader thread). Usually I would create a Buffer<ClassData> shared
by the producer and the consumer, but in this case I need to guarantee
data persistence.


I hope you can help me
Cheers
Carmelo
 
Martin Gregorie

In other words, we have 1 producer (the Writer thread) and 1 consumer
(the Reader thread). Usually I would create a Buffer<ClassData> shared
by the producer and the consumer, but in this case I need to guarantee
data persistence.
I'm not sure why you mention threads here. Your original post said that
the data would be consumed by a remote host, which to me means that you
need two processes rather than a single multi-threaded one. The first
process grabs sentences from the GPS and writes them to a cache. The
cache uses some sort of atomic write or flush to safe store each set of
NMEA sentences immediately after they have been written. At some point a
second process on a separate host reads and processes the stored
sentences, presumably in arrival sequence.

IOW what you want is a FIFO queue that safe stores each queued item and
preferably allows read and delete access from a remote host.

Is this the requirement or did I misunderstand something?
 
carmelo

I'm not sure why you mention threads here. Your original post said that
the data would be consumed by a remote host, which to me means that you
need two processes rather than a single multi-threaded one. The first
process grabs sentences from the GPS and writes them to a cache. The
cache uses some sort of atomic write or flush to safe store each set of
NMEA sentences immediately after they have been written. At some point a
second process on a separate host reads and processes the stored
sentences, presumably in arrival sequence.

IOW what you want is a FIFO queue that safe stores each queued item and
preferably allows read and delete access from a remote host.

Is this the requirement or did I misunderstand something?


Maybe I was not clear enough.
I need to buffer data, ensuring persistence, allowing a producer to
share its produced data with a consumer.
- The producer is a thread which listens for GPS data and store them
into a buffer
- The consumer is a thread which reads data stored into the buffer for
sending them to a remote host
 
carmelo

Or maybe you can reconsider your objection to a DBMS, where you suggested
using a file(s) might be preferable.  A DBMS would solve the problem you
mention here.

So, do you suggest a DBMS? I'm implementing the H2 Embedded DB for this
purpose... What do you think about it?
 
John B. Matthews

Lew: I think the OP is still on the fence. Tom Anderson helpfully
suggested using the file system above, and that's how I've done remote
telemetry in the past. As you suggest, a local database like H2, Derby,
or PostgreSQL offers many advantages, especially in the face of
uncertain connectivity and bandwidth.
So, do you suggest a DBMS? I'm implementing the H2 Embedded DB for this
purpose... What do you think about it?

OP: You should observe that H2 offers several connection modes:

<http://www.h2database.com/html/features.html?highlight=embedded,mode%2Cusers&search=embedded%20mode%20users#connection_modes>

Mixed mode would allow the fastest local operation while still allowing
other processes (in another JVM, say) to access the data. As
connectivity may be erratic, I'd prefer an outgoing connection to upload
the collected data. Incoming connections are a greater security risk.
 
Tom Anderson

If you said one file per GPS sentence, using the timestamp as the
file name so the reader can retrieve the sets in arrival sequence, I'd
agree. Without that separation you can't guarantee no data loss: bear in
mind that we don't know what the OS is and hence whether it implements
flush on open files.

If it doesn't, then the database doesn't make that guarantee either.

Unless it's using a raw partition, which in this app really, really would
be overkill.
The disadvantage is that you need to avoid race conditions by writing,
closing and renaming each file. Even so, one file per second leaves
plenty of time for the file to be written and renamed before the next
sample arrives.

I think this would be a pretty good architecture, actually. You could also
use a counter instead of a timestamp, right?

I was thinking of appending to a log file and using sync(). You'd roll the
logs to avoid them growing to infinite size.

On a unix machine, another option would be to store the data in symlinks.
This is a cunning hack i came across a while ago: the destination of a
symlink doesn't actually have to be a path, it can be an arbitrary string,
whose length can be up to whatever the maximum path length is on your
system - 1024 on mine. Creation and deletion of symlinks is atomic, and
visible to all processes, so this effectively gives you a simple data
store, similar to Berkeley DB or something.

Since the contents of symlinks live in directory blocks, this means the
system doesn't have to allocate a cluster for each entry, as it would if
they were in their own files, saving some space, so it should even appeal
to efficiency weenies.

However, java doesn't provide convenient access to symlinks, the technique
isn't portable, and it's a grotesque hack, so i'm not seriously suggesting
it.
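For what it's worth, the hack is easier to show today: java.nio.file (Java 7 and later) exposes symlink creation directly, so no JNI is needed. A minimal sketch, POSIX filesystems only, with a made-up key name:

```java
import java.nio.file.*;

// The symlink-as-datastore hack, demonstrable with java.nio.file (which
// postdates this thread). The link "target" never has to exist: it is
// just an arbitrary string stored in the directory block.
public class SymlinkStore {
    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("store");
        Path key = dir.resolve("20090101-000001");     // made-up key name

        // Creation is atomic and immediately visible to all processes.
        Files.createSymbolicLink(key, Paths.get("$GPGGA,123519,4807.038,N"));

        // Reading the "value" back is just reading the link.
        System.out.println(Files.readSymbolicLink(key));
    }
}
```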
However, I think that a database offers benefits. The set of NMEA
sentences received each second is either fixed or configurable, so table
design is trivial (one column per sentence, timestamp as primary key), you
get good performance and transaction isolation provided the commitment
unit is the GPS sentence set. Best yet, by letting the reading process
access the database remotely, you get the transfer protocol for free and
it can easily retrieve sentence sets in arrival sequence. There are free
databases available too: Derby, HSQLDB, PostgreSQL and MySQL spring to
mind.

NFS or similar also gives you a free and convenient transfer protocol.
Failing that, use MQ if it is available for the OS and the project
budget can cover it.

!

tom
 
Martin Gregorie

Maybe I was not clear enough.
I need to buffer data, ensuring persistence, allowing a producer to
share its produced data with a consumer. - The producer is a thread
which listens for GPS data and store them into a buffer
Understood.

- The consumer is a thread which reads data stored into the buffer for
sending them to a remote host
Is there a particular reason for using threads? This would, IMO, be
easier to code and test as two independent processes seeing that there is
apparently no interaction between your GPSListener and Consumer threads.

Is the Consumer doing anything apart from:

while (data-item in buffer)
    send data-item to remote host
    if (ack received from host)
        delete data-item from buffer

If that's all it's doing then, again IMO, you don't need the Consumer
process if you put the data in a database table that's treated as a FIFO
queue and let the remote process fetch data directly from it. Doing this
doesn't affect the logic in the remote process. It still needs a read
loop regardless of whether it's firing off SQL SELECTs or accepting data
from your Consumer process.
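Martin's loop above might be sketched in Java like this, against a hypothetical buffer and sender (both names are illustrative; a real buffer would be the file- or database-backed cache discussed in this thread):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Consumer loop: only delete an item once the remote host has acknowledged
// it, so a crash mid-transfer never loses data. Names are illustrative.
public class Consumer {

    interface Sender { boolean send(String item); }  // true = host acked

    static int drain(Queue<String> buffer, Sender sender) {
        int sent = 0;
        while (!buffer.isEmpty()) {
            String item = buffer.peek();             // look, don't remove yet
            if (!sender.send(item)) break;           // host down: retry later
            buffer.remove();                         // safe to delete now
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        Queue<String> buffer = new ArrayDeque<>();
        buffer.add("$GPGGA,...");
        buffer.add("$GPRMC,...");
        System.out.println(drain(buffer, item -> true) + " items sent");
    }
}
```

Note that the ack-before-delete ordering is the whole point: remove the item first and a failed send loses it; never remove it and you resend forever.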

By eliminating the Consumer you also remove the need to design, implement
and test the message handling protocol you'll need to move data from
Consumer to the remote process. Bear in mind that using a simple
unidirectional data stream largely cancels the data security provided by
the cache: unless the remote process acknowledges each data item as it's
received and processed, how can you know when it's safe to delete an item
from the cache or, after a major failure, where to restart without data
loss or duplication?

If you use transactional SQL to fetch the data, you get everything I
described above for free. The more I think about what I understand to be
your requirements the more likely it is that this is how I'd do it.

Of course you could also use IBM's WebSphere MQ. This is message-handling
middleware which provides caching, transport and restart/recovery
facilities. However, it's not free and may well be OTT for your
requirement.
 
Martin Gregorie

So, do you suggest a DBMS? I'm implementing the H2 Embedded DB for this
purpose... What do you think about it?
That would work, but use mixed mode rather than plain embedded mode: your
GPS Listener accesses the database directly with JDBC in embedded mode,
while your remote process accesses it in server mode (again via JDBC, but
over the network).
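In H2 terms the difference is mostly in the connection URL. A sketch of what the two sides might use, assuming H2's AUTO_SERVER flag for mixed mode (the database path and host name here are made up):

```java
// Illustrative H2 JDBC URLs for mixed mode; the path and host are made up.
public class H2Urls {
    // The GPS Listener, in the same JVM as the database file, connects
    // embedded. AUTO_SERVER=TRUE is H2's mixed mode: the first connection
    // also starts a TCP server so other processes can attach.
    static final String EMBEDDED = "jdbc:h2:/data/gpsbuffer;AUTO_SERVER=TRUE";

    // The remote process connects in server mode over the network.
    static final String REMOTE = "jdbc:h2:tcp://gps-host/data/gpsbuffer";

    public static void main(String[] args) {
        System.out.println(EMBEDDED);
        System.out.println(REMOTE);
    }
}
```

Both sides then use plain JDBC (`DriverManager.getConnection(url, user, password)`); only the URL differs.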
 
Martin Gregorie

If you said one file per GPS sentence, using the timestamp as the
file name so the reader can retrieve the sets in arrival sequence, I'd
agree. Without that separation you can't guarantee no data loss: bear in
mind that we don't know what the OS is and hence whether it implements
flush on open files.
FWIW I've used a variation of this at a much higher volume than the OP's
application can generate. The files were a MB or two each and we got tens
to hundreds of them a day delivered by FTP, which renamed each file when
the transfer completed. This was on a Unix box, so the rename was atomic
and fast. The transfer technique worked well and didn't cause any
problems, even at the very high file arrival rates we could get after a
restart or a backup interruption.
I think this would be a pretty good architecture, actually. You could
also use a counter instead of a timestamp, right?
True, though the timestamp has the advantage that there's nothing to
store between sessions and you'll never get a filename clash after a
restart.
I was thinking of appending to a log file and using sync(). You'd roll
the logs to avoid them growing to infinite size.
...and of course this also may not be implemented on the target OS.
On a unix machine, another option would be to store the data in
symlinks. This is a cunning hack i came across a while ago: the
destination of a symlink doesn't actually have to be a path, it can be
an arbitrary string, whose length can be up to whatever the maximum path
length is on your system - 1024 on mine. Creation and deletion of
symlinks is atomic, and visible to all processes, so this effectively
gives you a simple data store, similar to Berkeley DB or something.
I agree it's a hack, but worth remembering all the same.
Quite! I have used it briefly in the past, though from C. I found that
it does pretty much what it says on the tin. The APIs were well designed
and straightforward to use.
 
Tom Anderson

FWIW I've used a variation of this at a much higher volume than the OP's
application can generate. The files were a MB or two each and we got
tens to hundreds of them a day delivered by FTP, which renamed each file
when the transfer completed. This was on a Unix box, so the rename was
atomic and fast. The transfer technique worked well and didn't cause any
problems, even at the very high file arrival rates we could get after a
restart or a backup interruption.

Good to hear. It's the kind of thing that sounds like it should work well.

One thing to remember is that rename is not necessarily atomic on all
platforms; according to:

http://java.sun.com/javase/6/docs/api/java/io/File.html#renameTo(java.io.File)

"Many aspects of the behavior of this method are inherently
platform-dependent: The rename operation might not be able to move a file
from one filesystem to another, it might not be atomic, and it might not
succeed if a file with the destination abstract pathname already exists.
The return value should always be checked to make sure that the rename
operation was successful."
True, though the timestamp has the advantage that there's nothing to
store between sessions and you'll never get a filename clash after a
restart.

On the other hand, a counter can deal with messages arriving faster than
the clock ticks. In this application, that's not an issue, however. You
could pretty easily use a hybrid timestamp + counter naming scheme to get
around this problem without losing the benefits of timestamps.
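Such a hybrid scheme might be sketched like this (the format is made up): zero-padding keeps a plain lexicographic sort of the directory listing in arrival order, and the counter breaks ties within one clock tick:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hybrid timestamp+counter file names. The "%013d-%06d.nmea" format is
// illustrative; zero-padding makes lexicographic order match arrival order.
public class SpoolNames {
    private static final AtomicLong SEQ = new AtomicLong();

    static String next(long epochMillis) {
        return String.format("%013d-%06d.nmea", epochMillis, SEQ.getAndIncrement());
    }

    public static void main(String[] args) {
        String a = next(1000L);
        String b = next(1000L);   // same millisecond: counter breaks the tie
        String c = next(2000L);
        System.out.println(a + " < " + b + " < " + c);
    }
}
```

Resetting the counter at process start is fine here, since the timestamp component keeps names from clashing across restarts, as Martin notes above.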
...and of course this also may not be implemented on the target OS.

You mean sync()? No, perhaps not.
I agree it's a hack, but worth remembering all the same.

I've implemented the three strategies we discussed:

http://urchin.earth.li/~twic/MessageQueue/

Including a little JNI library to do symlinks on unix. There's a
Mac-specific build script; users of lesser unices will need to work out
the relevant compiler/libtool/etc incantations for their platforms. Or
just not run that version!

tom
 
Roedy Green

I'm thinking about using HSQLDB or H2 for storing these data, because
it's a light and fast way. What do you think about?

It is a rather large hammer for the job. Any job that runs all the time,
I can't help but try to trim down as much as possible, even in the
early design stages.

I would handle it without a database. I would simply write to a
sequential file with a small buffer, and flush after each
record/group/transaction of records, or after x seconds of inactivity.
That way you have a log of everything that happened even if you crash.

The other thread just checks the file length, reads to EOF sending
records off to the server, then goes back to sleep for a while,
possibly to be awakened prematurely by the other thread.

The trickiest part of my approach is writing an after-crash program
that throws away any trailing partial records/transactions. If you put
a length on the front of each, this is pretty easy.
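Roedy's length-on-the-front idea might look like this (all names are illustrative); the read side doubles as the after-crash scan, dropping a trailing partial record:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Length-prefixed record log: each record carries its byte length up
// front, so a recovery scan can stop cleanly at a truncated tail.
public class RecordLog {

    static void append(File log, byte[] record) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(log, true))) {   // append mode
            out.writeInt(record.length);              // length on the front
            out.write(record);
            out.flush();                              // push to the OS now
        }
    }

    // Read back only complete records; a partial tail is ignored, which is
    // exactly the after-crash rule described above.
    static List<byte[]> readComplete(File log) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(log)))) {
            while (true) {
                int len;
                try {
                    len = in.readInt();
                } catch (EOFException e) {
                    break;                            // clean end of log
                }
                byte[] rec = new byte[len];
                try {
                    in.readFully(rec);
                } catch (EOFException e) {
                    break;                            // partial record: drop it
                }
                records.add(rec);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        File log = File.createTempFile("gps", ".log");
        append(log, "$GPGGA,123519".getBytes("UTF-8"));
        append(log, "$GPRMC,123519".getBytes("UTF-8"));
        // Simulate a crash mid-write: a length header with no body.
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(log, true))) {
            out.writeInt(999);
        }
        System.out.println(readComplete(log).size() + " complete records");
    }
}
```

Note that flush() hands the bytes to the OS but does not force them to the platter; for stronger guarantees you would need FileDescriptor.sync(), with the portability caveat raised elsewhere in this thread.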
 
carmelo

It is a rather large hammer for the job.  Any job that runs all the time,
I can't help but try to trim down as much as possible, even in the
early design stages.

I would handle it without a database.  I would simply write to a
sequential file with a small buffer, and flush after each
record/group/transaction of records, or after x seconds of inactivity.
That way you have a log of everything that happened even if you crash.

The other thread just checks the file length, reads to EOF sending
records off to the server, then goes back to sleep for a while,
possibly to be awakened prematurely by the other thread.

The trickiest part of my approach is writing an after-crash program
that throws away any trailing partial records/transactions.  If you put
a length on the front of each, this is pretty easy.

Roedy, how would you manage this log file? If the Producer appends
data to a sequential file, its size will grow forever...
 
Martin Gregorie

One thing to remember is that rename is not necessarily atomic on all
platforms; according to:
True, which is why I said it was a Unix box. However, I should have
qualified it by mentioning that the rename didn't involve a move between
filing systems (which isn't atomic) though it did move the file between
related directories (mv a/b/file a/c/file).

Info for non-*nix folks: the mv utility renames files within a filing
system and is equivalent to "copy a to b, delete a" if the destination is
another filing system. The unix term 'filing system' is equivalent to
'partition' or 'disk' for other OSes.

File renaming within a FS is always atomic if the OS is POSIX compliant.
Does anybody know what the situation is under Windows?

Really random thought: like some others on this ng, I started out using
mainframe computers (ICL 1900, IBM S/360) which didn't have filing
systems. Disk volumes had names, but 'files' were just named partitions.
Some, but not all, programs could subdivide a 'file' into named subfiles.
I just now wondered if Java has been implemented in this environment.
That would mean the big IBM iron running legacy operating systems.
On the other hand, a counter can deal with messages arriving faster than
the clock ticks. In this application, that's not an issue, however. You
could pretty easily use a hybrid timestamp + counter naming scheme to
get around this problem without losing the benefits of timestamps.
Unlikely in most unices, but I take your point. Reading the time once
when the program starts and generating the filename by concatenating the
time with the file number in this run should work on all systems - even
those with short file names if you converted timestamp+fileno to a base
64 representation.
I've implemented the three strategies we discussed:

http://urchin.earth.li/~twic/MessageQueue/

Including a little JNI library to do symlinks on unix. There's a
Mac-specific build script; users of lesser unices will need to work out
the relevant compiler/libtool/etc incantations for their platforms. Or
just not run that version!
Cool. Have you got any timing data on the different methods? I'd guess
the ranking (fast to slow) was symlink, dbms table, files, but then what
do I know?
 
