detecting corrupt/stuffed files (how to?)

Jole

Hi
I'm writing a program that needs to read from a file. In order for the
program to be robust, it should somehow check that the file isn't corrupt,
or stuffed in any way. For example, that file may have been created but a
crash occurred at that point in time (while it was being created), damaging
the file. Now, my program which needs to read from this file, should first
check that it's in good condition, and that it hasn't been stuffed up in
any way. What is the normal way of doing this?

thanks in advance
Jole

I'm using Java and am aware of some of those File.XX methods. Perhaps the
File.isReadable() method will fail if the file has been damaged or
corrupted? Or is there another way?
 
Thomas Weidenfeller

Followup ignored. comp.lang.java is not a valid newsgroup.
Jole said:
I'm writing a program that needs to read from a file. In order for the
program to be robust, it should somehow check that the file isn't corrupt,
or stuffed in any way. For example, that file may have been created but a
crash occurred [...]

What type of crash? The OS crashing? Your application crashing?

Jole said:
[...] at that point in time (while it was being created), damaging
the file.

If the file system got damaged, the OS has to deal with it (once it
becomes aware of the damage). Java has no specific API for finding out
about this, and most general-purpose operating systems don't provide such
information to an application. At most, a read attempt might return an
error at some unexpected point.

If you need some extra protection on the file system level, consider
using a journaling file system.

If just the data in the file is corrupt, and not the file system, it
appears as a normal file to your application. So read on:

Jole said:
Now, my program which needs to read from this file, should first
check that it's in good condition, and that it hasn't been stuffed up in
any way. What is the normal way of doing this?

A common way to ensure integrity is to run a checksum over the file, and
append it to the end of the file. This doesn't protect against deliberate
tampering, but it is good protection against accidental damage.

There are a few tricks, e.g. appending the checksum data in a way that
when running the checksum algorithm again over the combined data, the
resulting second checksum is 0. This simplifies the verification of the
data a little bit: Blindly running the checksum algorithm over the
combined data must return 0, or the file is corrupt.
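As a rough illustration of the appended-checksum idea, here is a minimal Java sketch. It assumes CRC-32 (java.util.zip.CRC32) as the checksum and an 8-byte trailer; the class and method names are made up for this example. It does the plain "recompute and compare" check, not the zero-result trick described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical helper: names and the trailer layout are assumptions of this sketch.
final class ChecksumTrailer {
    private static final int TRAILER_BYTES = Long.BYTES;

    // Returns the payload followed by its CRC-32, stored as an 8-byte big-endian long.
    static byte[] seal(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + TRAILER_BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    // True only if the stored trailer equals the CRC-32 recomputed over the payload part.
    static boolean isIntact(byte[] sealed) {
        if (sealed.length < TRAILER_BYTES) {
            return false; // too short to even hold a checksum
        }
        int payloadLength = sealed.length - TRAILER_BYTES;
        CRC32 crc = new CRC32();
        crc.update(sealed, 0, payloadLength);
        long stored = ByteBuffer.wrap(sealed, payloadLength, TRAILER_BYTES).getLong();
        return stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sealed = seal("schedule data".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact:    " + isIntact(sealed));

        // Simulate a write that was cut short by a crash.
        byte[] truncated = Arrays.copyOf(sealed, sealed.length - 3);
        System.out.println("truncated: " + isIntact(truncated));
    }
}
```

For a real file, the writer would store the result of seal() and the reader would pass the bytes from Files.readAllBytes() to isIntact() before using them.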

If you need to have protection from malicious tampering, look into
digital signatures.
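For the tampering case, a minimal sketch using the standard java.security.Signature class might look like the following. The algorithm choice ("SHA256withRSA"), the 2048-bit key size, and all class and method names are assumptions of this example; key storage and distribution are left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical helper: names and algorithm choices are assumptions of this sketch.
final class SignedData {
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The writer signs the file contents with its private key.
    static byte[] sign(byte[] data, PrivateKey key) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(key);
            signer.update(data);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The reader checks the contents against the signature using the public key.
    static boolean verify(byte[] data, byte[] signature, PublicKey key) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key);
            verifier.update(data);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        byte[] data = "schedule data".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(data, keys.getPrivate());
        System.out.println("original verifies: " + verify(data, signature, keys.getPublic()));

        data[0] ^= 0x01; // simulate tampering
        System.out.println("tampered verifies: " + verify(data, signature, keys.getPublic()));
    }
}
```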
Jole said:
I'm using Java and am aware of some of those File.XX methods. perhaps the
File.isReadable() methods will fail if the files have been damaged or
corrupted?

No, not at all. Read the API documentation.
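To make the point concrete: java.io.File has no isReadable() method; the closest are File.canRead() and, in NIO, Files.isReadable(Path). Both only report whether the file exists and may be read, and say nothing about its contents. A small sketch (class and method names are made up here):

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical demo: shows that canRead() is a permission check, not a content check.
final class CanReadDemo {
    // Writes bytes that no sensible parser would accept, then asks whether the file is readable.
    static boolean garbageFileIsReadable() {
        try {
            File file = File.createTempFile("schedule", ".dat");
            file.deleteOnExit();
            Files.write(file.toPath(), new byte[] {0x00, (byte) 0xFF, 0x13, 0x37});
            return file.canRead();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The file is "readable" even though its contents are meaningless.
        System.out.println("canRead on garbage file: " + garbageFileIsReadable());
    }
}
```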

/Thomas
 
Jole

Hi
I'm a little unsure of what happens to files being copied when the OS
crashes. What does happen? does the filesystem not report the file's
correct size? It's still readable right?

My application consists of 2 parts. It has one simple program running on
one pc which creates a file with data in it. This first program creates
this file on a floppy. Then that disk is transferred to another pc, which
reads that newly copied file. I need to make sure that the file copy on
the floppy isn't damaged in any way.

What if the first program copied it to the disk, and then renamed it? That
way, the other application, which reads from the disk, looks for a file with
the new name. If it finds one, then we know the OS didn't crash while the
file was being copied.

Here is my algorithm:

first program:
create file schedule.dat on floppy
rename file to schedule_new.dat (if the OS crashes during copying, the file
won't be renamed)

client program:
look for the renamed file (schedule_new.dat) on the floppy (if it
finds this file, it means that the OS didn't crash while creating the file)
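The write-then-rename steps above could be sketched in Java roughly as follows. The file names come from the post; the class and method names are made up. The call to FileChannel.force() is an addition not mentioned in the post: it asks the OS to flush the data to the device before the rename, and how strongly that is honoured depends on the OS and the medium.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: a sketch of "write under one name, rename when complete".
final class PublishByRename {
    static final String IN_PROGRESS_NAME = "schedule.dat";
    static final String FINISHED_NAME = "schedule_new.dat";

    // First program: write the data, flush it, then rename to signal completion.
    static Path publish(Path dir, byte[] data) {
        Path inProgress = dir.resolve(IN_PROGRESS_NAME);
        Path finished = dir.resolve(FINISHED_NAME);
        try {
            try (FileChannel channel = FileChannel.open(inProgress,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE)) {
                ByteBuffer buffer = ByteBuffer.wrap(data);
                while (buffer.hasRemaining()) {
                    channel.write(buffer);
                }
                channel.force(true); // request a flush of data and metadata
            }
            return Files.move(inProgress, finished, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client program: only a file with the final name counts as completely written.
    static boolean isPublished(Path dir) {
        return Files.isRegularFile(dir.resolve(FINISHED_NAME));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("floppy");
        publish(dir, new byte[] {1, 2, 3});
        System.out.println("published: " + isPublished(dir));
    }
}
```

This only shows that the writer reached the rename step; it does not show that the bytes on the disk match what was written.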

I realize now that, as you were saying, this doesn't mean the file is
intact, i.e. it's possible for the copy to complete without being identical
to the original (without having the same checksum). I'll be looking into
this. Perhaps I can get away without worrying about file integrity, so long
as corruption doesn't happen too often.


thanks
 
Jole

If you don't mind me asking, what are some of the things that can happen
(generally speaking) if I don't worry about file integrity? I know that
this is largely dependent on my program, and you can't answer this fully.
Still, what has experience taught you? And how important do you think it
is to check integrity for files that are very small (less than 100K) in a
commercial application? These files contain serialized objects
(using the Java serialization API). Maybe the readObject() method will
throw an exception if the data in a file isn't totally correct, allowing me
to catch it and recover. (Looking into this anyhow.)

Jole
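One way to explore the readObject() question is a small experiment like the sketch below, which serializes a String array in memory, cuts the bytes short, and tries to read them back. The class and method names are made up, and the String array merely stands in for the real schedule objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical experiment: how does Java deserialization react to a truncated stream?
final class SerializationCheck {
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // True if the bytes deserialize without an exception.
    static boolean canDeserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            in.readObject();
            return true;
        } catch (IOException | ClassNotFoundException e) {
            // EOFException and StreamCorruptedException are both IOExceptions.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] full = serialize(new String[] {"Monday 9am", "Tuesday 2pm"});
        byte[] truncated = Arrays.copyOf(full, full.length / 2);
        System.out.println("full:      " + canDeserialize(full));
        System.out.println("truncated: " + canDeserialize(truncated));
    }
}
```

Note the limit of this approach: a truncated stream is rejected, but a changed byte inside the text of one of the strings would deserialize without complaint, since the serialization format carries no checksum over the data.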
 
Liz

Jole said:
If you don't mind me asking, what are some of the things that can happen
(generally speaking) if I don't worry about file integrity? I know that
this is largely dependent on my program, and you can't answer this fully.
Still, what has experience taught you? And how important do you think it
is to check integrity for files that are very small (less than 100K) in a
commercial application? These files contain serialized objects
(using the Java serialization API). Maybe the readObject() method will
throw an exception if the data in a file isn't totally correct, allowing me
to catch it and recover. (Looking into this anyhow.)

Jole

"It shouldn't really matter what happens when there is a crash."
Is that not what you want? Consider if you write your file in gzip format
using the standard Java classes. Now, when you go to read it, either
everything is fine, or you get an exception. If the file can't be read,
you get one kind of exception and if the file is corrupted you will get
another kind. The corruption is found by the gzip class since you have
to be able to read all of it and it has to make sense otherwise the
decompression algorithm won't work. As a side benefit the file will be
smaller and use less space.
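A minimal sketch of the gzip approach with the standard java.util.zip classes is below; the class and method names are made up. The gzip format ends with a CRC-32 and length trailer that GZIPInputStream checks, so truncated data typically surfaces as an EOFException and a checksum mismatch as a ZipException. Both are subclasses of IOException, which is what this sketch catches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: uses the gzip container as a built-in integrity check.
final class GzipCheck {
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the decompressed bytes, or null if the input is truncated or corrupt.
    static byte[] decompressOrNull(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] compressed = compress("schedule data".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact ok:    " + (decompressOrNull(compressed) != null));

        byte[] truncated = Arrays.copyOf(compressed, compressed.length / 2);
        System.out.println("truncated ok: " + (decompressOrNull(truncated) != null));
    }
}
```

The same wrapping works around an ObjectOutputStream/ObjectInputStream pair, so serialized objects can be written through GZIPOutputStream and read back through GZIPInputStream.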
 
Chris Smith

Jole said:
I'm a little unsure of what happens to files being copied when the OS
crashes. What does happen? does the filesystem not report the file's
correct size? It's still readable right?

Well, a lot of things can happen, depending on the OS, the filesystem,
and the details of when the crash occurred. But pretty much any modern
OS and filesystem will do metadata journaling, which means you'll likely
get a file that's readable, but doesn't contain all the data that would
have been written. So it will be shorter than it would otherwise be.

How you detect these issues is more difficult. One step is to use a
very strict parser for the file. Designing the format to require some
specific indication of the end of the file would also help. It turns
out XML conveniently provides both of those safeguards, since a well-
formed XML document always ends with the end tag of the document
element, and compliant XML parsers are required to reject any XML stream
that's not well-formed. But you can build those features into your
format and parsing code without using XML, if you like.
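To illustrate the XML point, here is a minimal well-formedness check using the JDK's SAX parser; the class name, method name, and the sample document are made up for this sketch. A document cut off before its closing tag is rejected by the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: relies on the parser rejecting input that is not well-formed.
final class WellFormedCheck {
    static boolean isWellFormed(byte[] xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new ByteArrayInputStream(xml), new DefaultHandler());
            return true;
        } catch (SAXException | IOException e) {
            return false; // not well-formed, or unreadable
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String complete = "<schedule><entry>Monday 9am</entry></schedule>";
        // Simulate a file whose tail was never written.
        String truncated = complete.substring(0, complete.length() - 5);

        System.out.println("complete:  " + isWellFormed(complete.getBytes(StandardCharsets.UTF_8)));
        System.out.println("truncated: " + isWellFormed(truncated.getBytes(StandardCharsets.UTF_8)));
    }
}
```

This catches truncation and broken markup, but not a changed character inside an element's text, which is still well-formed XML.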
Jole said:
My application consists of 2 parts. It has one simple program running on
one pc which creates a file with data in it. This first program creates
this file on a floppy. Then that disk is transferred to another pc, which
reads that newly copied file. I need to make sure that the file copy on
the floppy isn't damaged in any way.

The "damaged in any way" bit is actually impossible. Assuming your file
can contain two distinct data sets (that is, that it's useful at all),
then it will always be theoretically possible for the file to be damaged
in a way that causes it to appear as valid. The trick is to reduce that
probability, since you can never eliminate it.

If you need more validation than is available from what I mentioned
above, then a CRC could be used to reduce the probability of undetected
corruption to a vanishingly small value.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Jole

Thanks for that. Good to get a deeper sense of what goes on behind the
scenes. I might not worry about XML or checksums. If the data in the file
is inaccurate, it shouldn't be too bad a thing, as users will eventually try
the application's 'new schedule' scenario, which will put the app in a proper
working state. Hopefully there won't be any 'holes' in my program that will
make it unusable :).
 
