Writing output to a text file

(-Peter-)

Hi..

I've been using Java to print out 10 million lines like "int
double" (where int and double are numbers, for example 1 and
1,932131293).

I printed it to a .txt file, and the result was a 300 MB file. Isn't
this "too much"?

I'm using the output in Octave (= "free MATLAB"), where I load the file
as a matrix. This takes a very long time and uses all my memory when I
manipulate the data (I have 2 GB, which I think is "enough").

Is it normal that 10 million lines would use that much space?

Is there another way to save the output which does not require that
amount of space (and memory usage in Octave)?

/Peter
 
Roedy Green

I've been using Java to print out 10 million lines like "int
double" (where int and double are numbers, for example 1 and
1,932131293).

Have a look at the file with a hex editor and see roughly how many
characters there are per line and multiply to see if there is
something amiss.

For a more compact file, you could write it out with a DataOutputStream
or an ObjectOutputStream. However, it would not be human-readable.

See http://mindprod.com/applet/fileio.html for how.
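
A minimal sketch of the DataOutputStream route, assuming the data is held in two parallel arrays. The class name BinaryPairWriter and the method writePairs are made up for illustration and are not taken from the linked page. Each pair costs 4 + 8 = 12 bytes, and DataOutputStream always writes big-endian.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BinaryPairWriter {
    // Hypothetical helper: writes each (int, double) pair as 12 big-endian bytes.
    static void writePairs(Path file, int[] ids, double[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int i = 0; i < ids.length; i++) {
                out.writeInt(ids[i]);       // 4 bytes
                out.writeDouble(values[i]); // 8 bytes
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pairs", ".bin");
        writePairs(file, new int[] {1, 2, 3}, new double[] {1.932131293, 2.5, 3.25});
        System.out.println(Files.size(file) + " bytes"); // prints "36 bytes" (3 pairs * 12)
        Files.delete(file);
    }
}
```

At 12 bytes per pair, 10 million pairs would come to 120 million bytes instead of roughly 300 MB of text.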
 
(-Peter-)

Have a look at the file with a hex editor and see roughly how many
characters there are per line and multiply to see if there is
something amiss.

Can you explain how I do that?

For a more compact file, you could write it out with a DataOutputStream
or an ObjectOutputStream. However, it would not be human-readable.

See http://mindprod.com/applet/fileio.html for how.

Thank you for the answer.

And for other users who might be able to help me: the file does not
need to be human-readable. Only Octave (~MATLAB) has to be able to
read it.

/Peter
 
Alan Morgan

Hi..

I've been using Java to print out 10 million lines like "int
double" (where int and double are numbers, for example 1 and
1,932131293).

I printed it to a .txt file, and the result was a 300 MB file. Isn't
this "too much"?

Too much for what? If each line is about 30 bytes (which seems reasonable)
then 10 million lines would be 300 MB. How big do you expect it to be?

I'm using the output in Octave (= "free MATLAB"), where I load the file
as a matrix. This takes a very long time and uses all my memory when I
manipulate the data (I have 2 GB, which I think is "enough").

Obviously it isn't.

Is it normal that 10 million lines would use that much space?

10 million lines will use exactly as much space as the average size
of a line x 10 million. If you have short lines it will be a smaller
number than if you have long lines.

Is there another way to save the output which does not require that
amount of space (and memory usage in Octave)?

Well, you can save the data in binary form. That will reduce the size
of the file by... oh, 50% or more. But Octave might not be able to
read it, and even if it can, I would be surprised if it ends up taking up
less room in memory as a result.

Alan
 
(-Peter-)

     300 megabytes divided by 10 million lines equals 30 bytes
per line.  That sounds a little high, but not out of reason.
How big are these numbers, and how much precision do they
express?  Only you can say whether it's excessive.  Why don't
you look at some of the output and see what it's like?




     Maybe, and maybe.  For the file size, try to find out
whether you're writing more precision than you need, for
example "1,234.5678900000000000000000113" instead of
"1,234.6".  Or, find out if this Octave thing can read a
more compact format and use it instead: an undifferentiated
binary dump would use four bytes per int and eight per
double, for 12/30 = 40% of your current size.
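
To illustrate the precision point, here is a small sketch that writes each line with a fixed number of decimal places. The class CompactLine and the method formatLine are hypothetical names. Locale.ROOT is used on the assumption that the reader wants '.' as the decimal separator and no grouping characters; check what your Octave setup actually expects.

```java
import java.util.Locale;

public class CompactLine {
    // Hypothetical helper: formats one "int double" line with a fixed number of decimals.
    static String formatLine(int id, double value, int decimals) {
        // Locale.ROOT forces '.' as the decimal separator regardless of the default locale.
        return String.format(Locale.ROOT, "%d %." + decimals + "f", id, value);
    }

    public static void main(String[] args) {
        System.out.println(formatLine(1, 1.932131293, 4)); // prints "1 1.9321"
        System.out.println(formatLine(2, 1234.56789, 1));  // prints "2 1234.6"
    }
}
```

Every digit dropped saves one byte per line, so trimming ten digits from 10 million lines would save about 100 MB.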

     As for octave's memory consumption, that's between you
and octave: How much memory does it require for ten million
of these pairs?  Most programs that read external inputs will
translate them into an internal form that has little to do
with their external format, so encoding the file more densely
may not make much difference to octave's memory use.  But as
I say, this is something you'll have to figure out on your
own; it's nothing to do with Java.


Thank you for your comments - I will try to think about it.. :)

And look into octave's memory usage..

/Peter
 
Chase Preuninger

Yeah, that's a lot of data. Just one double takes up 8 bytes in binary
form. If you were to write the values as floats, however, they would take
up half that space (at the cost of precision: a float keeps only about
7 significant digits). Then, if you want an even smaller file, use GZip
to compress it. However, that would cost your app some performance, and
since you would be compressing a bunch of binary numbers it would not
reduce the size by that much.
 
Chase Preuninger

I didn't realise that you wanted to save the data as ASCII text. In that
case all you can do is compress the data with GZip (which works much
better on text than on a list of arbitrary binary numbers). I do
seriously recommend that you save the data as 4-byte floats if you can.
As always, GZip would cost some performance, and the file would only be
usable by apps that can read gzip-compressed input. Sadly, there is not
much you can do to make it faster if you leave it as text. For example,
the string "135335234" is made of 9 chars, each taking up 1 byte in an
ASCII file (2 bytes each only as Java chars in memory), whereas saved in
binary it would take up 4 bytes instead of 9. Note that a 4-byte float
keeps only about 7 significant digits, so a 9-digit value like this one
needs a 4-byte int to be stored exactly. But like I said, if you mean to
open this in Word you need to just use text. I once had a program that
printed out every prime up to a billion, and since Word could barely
handle it when I opened every prime up to a million, I spread the output
over 10,000 text files, one for every 100 thousand numbers, in an
easy-to-navigate directory system.
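
For reference, a minimal sketch of writing the text lines straight through GZip in Java. GzipTextWriter and writeGzipLines are made-up names, and the data is assumed to sit in two parallel arrays. The reader has to be able to handle gzip input, or the file must be decompressed first.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class GzipTextWriter {
    // Hypothetical helper: writes "int double" lines as gzip-compressed ASCII text.
    static void writeGzipLines(Path file, int[] ids, double[] values) throws IOException {
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)),
                StandardCharsets.US_ASCII))) {
            for (int i = 0; i < ids.length; i++) {
                out.write(ids[i] + " " + values[i] + "\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int n = 100_000;
        int[] ids = new int[n];
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            ids[i] = i;
            values[i] = Math.sqrt(i); // arbitrary sample data
        }
        Path file = Files.createTempFile("pairs", ".txt.gz");
        writeGzipLines(file, ids, values);
        System.out.println("compressed size: " + Files.size(file) + " bytes");
        Files.delete(file);
    }
}
```

How much this saves depends on the data, so it is worth measuring on a sample before committing to it.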
 
Arved Sandstrom

(-Peter-) said:
Hi..

I've been using Java to print out 10 million lines like "int
double" (where int and double are numbers, for example 1 and
1,932131293).

I printed it to a .txt file, and the result was a 300 MB file. Isn't
this "too much"?

I'm using the output in Octave (= "free MATLAB"), where I load the file
as a matrix. This takes a very long time and uses all my memory when I
manipulate the data (I have 2 GB, which I think is "enough").

Is it normal that 10 million lines would use that much space?

Is there another way to save the output which does not require that
amount of space (and memory usage in Octave)?

You can obviously save space by writing binary - investigate
DataOutputStream on the Java end. When reading in you'll be using fread();
the most efficient way to use this, if you have different data types being
written out, is perhaps to write out column by column...that is, N ints, N
doubles and so forth. Then you can use fread to read N ints or N doubles
into column or row vectors, at one fell swoop, and concatenate as necessary.
To be honest I don't exactly see how ints and doubles are mingling in a
matrix anyway, so they are probably logically separate.
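
A rough sketch of the column-by-column layout on the Java side. ColumnarWriter and writeColumns are hypothetical names, and the leading count N is my own addition so that the reader knows how many values to expect.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ColumnarWriter {
    // Hypothetical layout: [N as int][N ints][N doubles], all big-endian.
    static void writeColumns(Path file, int[] ids, double[] values) throws IOException {
        if (ids.length != values.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(ids.length);   // header: number of rows (an assumption of this sketch)
            for (int id : ids) {
                out.writeInt(id);       // first column: N ints
            }
            for (double v : values) {
                out.writeDouble(v);     // second column: N doubles
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("columns", ".bin");
        writeColumns(file, new int[] {1, 2, 3}, new double[] {1.5, 2.5, 3.5});
        System.out.println(Files.size(file) + " bytes"); // prints "40 bytes" (4 + 3*4 + 3*8)
        Files.delete(file);
    }
}
```

Since DataOutputStream always writes big-endian, the Octave side would presumably need to open the file with big-endian byte order (I believe fopen accepts "ieee-be" for this) before reading the count, the ints and the doubles with fread.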

Bear in mind that 10 million records composed of one int (4 bytes) and one
double (8 bytes) are going to be 120 million bytes anyhow, which is pretty
big for processing even with 2 GB of RAM on your machine. My gut feeling is
that you may want to look carefully at exactly what it is you're doing with
the data, and see if the processing actually requires that all of the data
be available at once.

AHS
 
Roedy Green

Can you explain how I do that?

Download a hex editor. Look at the file. Add up the lengths of a
dozen lines and compute the average. Multiply by the number of lines
expected. Compare that with the length of the file as seen in a DIR
command. If they are not roughly in the same ballpark, examine the
file randomly looking for anomalies, e.g. the same data being written
over and over.

see http://mindprod.com/jgloss/hex.html
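
The same check can be sketched in Java without a hex editor. LineSizeEstimate and its methods are made-up names, and the sketch assumes single-byte characters and "\n" line endings (add one more byte per line for "\r\n").

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineSizeEstimate {
    // Average size in bytes of the first sampleLines lines, counting one byte for "\n".
    static double averageLineBytes(Path file, int sampleLines) throws IOException {
        // ISO-8859-1 maps every byte to exactly one char, so length() counts bytes.
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            long total = 0;
            int count = 0;
            String line;
            while (count < sampleLines && (line = in.readLine()) != null) {
                total += line.length() + 1;
                count++;
            }
            return count == 0 ? 0.0 : (double) total / count;
        }
    }

    // Expected file size if every line were as long as the sampled average.
    static long estimateBytes(double averageLineBytes, long expectedLines) {
        return Math.round(averageLineBytes * expectedLines);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sample", ".txt");
        Files.write(file, "1 1.932131293\n2 2.718281828\n".getBytes(StandardCharsets.US_ASCII));
        double avg = averageLineBytes(file, 12);
        System.out.println("average bytes per line: " + avg);                      // prints 14.0
        System.out.println("estimate for 10 million lines: "
                + estimateBytes(avg, 10_000_000L));                                // prints 140000000
        System.out.println("actual size of sample: " + Files.size(file));          // prints 28
        Files.delete(file);
    }
}
```

If the estimate and the real file size differ by a large factor, that is the cue to go looking for anomalies in the file.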
 
