moving data in a file without using system memory

U

ulyses

Let's assume I have following file:

2938929384902491233.....
923949919199191919112....

File contains INTs only. What is more they are huge. For example first
row in file may contain integer which size is 50MB and the second 30MB.
Now we come to my problem. Is there possibility to swap this rows
without using system memory (preferably in Unix/Linux)? Is there any
function in C to do this?

Thanks for help,
John
 
N

Nelu

Let's assume I have following file:

2938929384902491233.....
923949919199191919112....

File contains INTs only. What is more they are huge. For example first
row in file may contain integer which size is 50MB and the second 30MB.
Now we come to my problem. Is there possibility to swap this rows
without using system memory (preferably in Unix/Linux)? Is there any
function in C to do this?
I assume you want to make the swap without using char *, because you
need to use some memory.
One way you could do this is to use two files: f1 and f2.
You can read the first line, character by character and put it in the
f1 file using fgetc and fputc.
Read the second line in the same way and put it in f2.
Create f3 or rewrite the initial file, write the content of f2 (also
with fgetc and fputc) to f3, add the EOL and then add the content
of f1 to f3. Your lines should be swapped. Now, you have to pay
attention at what EOL means both when reading and writing. Also,
given the fact that you have integers, you could use isdigit() (pay
attention to negative numbers).
 
A

Alex Colvin

2938929384902491233.....
923949919199191919112....
File contains INTs only. What is more they are huge. For example first
row in file may contain integer which size is 50MB and the second 30MB.
Now we come to my problem. Is there possibility to swap this rows
without using system memory (preferably in Unix/Linux)? Is there any
function in C to do this?

First you have to scan the first 50MB to locate the end-of-line, unless
you happen to know it. That doesn't take a whole lot of memory.

Swapping the lines using a second file is pretty trivial. Just start
copying the second line to the destination. When you're done, rewind the
source and copy the first.

If you can't actually afford a second file, the problem gets more
interesting. You want to slide the second 30MB forward in the file, and
the first 50MB back. Although it's tricky, this too can be done
incrementally, using very little memory. Look up peristaltic in-place
out-of-core permutation.
 
P

Pascal Bourguignon

Nelu said:
I assume you want to make the swap without using char *, because you
need to use some memory.
One way you could do this is to use two files: f1 and f2.
You can read the first line, character by character and put it in the
f1 file using fgetc and fputc.
Read the second line in the same way and put it in f2.
Create f3 or rewrite the initial file, write the content of f2 (also
with fgetc and fputc) to f3, add the EOL and then add the content
of f1 to f3. Your lines should be swapped. Now, you have to pay
attention at what EOL means both when reading and writing. Also,
given the fact that you have integers, you could use isdigit() (pay
attention to negative numbers).

This wouldn't prevent the use of system memory, since any I/O will be
cached by default.

To avoid using system memory you'd have to use the raw device drive on
linux. Perhaps there is something similar on other OSes. Anyways, it's
out of topic on comp.lang.c...

Perhaps the OP could explain what he wants exactly, since it's rather
silly to buy GigaBytes of RAM not to use it later...


--
__Pascal Bourguignon__ http://www.informatimago.com/

NOTE: The most fundamental particles in this product are held
together by a "gluing" force about which little is currently known
and whose adhesive power can therefore not be permanently
guaranteed.
 
W

Walter Roberson

This wouldn't prevent the use of system memory, since any I/O will be
cached by default.
Perhaps the OP could explain what he wants exactly, since it's rather
silly to buy GigaBytes of RAM not to use it later...

My speculation would be that the OP does not want to read the entire
file into memory, as that might be a strain on the system resources.

The OP gave, by the way, no indication that gigabytes of RAM are available
nor that the file contents would be able to fit within the address space
available.


If the OP has numerous rows to exchange, then a "create a new file
each time" algorithm could get rather slow. What might be practical,
though, is to create a couple of I/O wrapper routines that kept
track of the files and which "logically current" portions of the files
are physically somewhere else (probably easiest in this circumstance
if the I/O wrappers operated at the "record" level.) Then, at the
end, create as many temporary files as there were files logically
written to, and for each logically-written file do a linear run
pulling the data out of the original data files in the appropriate order;
when the temporary files are fully generated, then either rename
them into existance where the original data files were, or else
[e.g., for permissions or NFS reasons] then copy the data out of the
temporary files into the original files and trunc() the original files
and throw away the temporary files.
 
K

Keith Thompson

ulyses said:
Let's assume I have following file:

2938929384902491233.....
923949919199191919112....

File contains INTs only. What is more they are huge. For example first
row in file may contain integer which size is 50MB and the second 30MB.
Now we come to my problem. Is there possibility to swap this rows
without using system memory (preferably in Unix/Linux)? Is there any
function in C to do this?

Do you mean that your file is a text file where each line consists of
a sequence decimal digits representing a large integer?

If so, for purposes of you problem, you can just treat them as
sequences of characters (which happen to be decimal digits); the fact
that they represent large integers is irrelevant.

Can the file contain arbitrarily many lines?

Here's one possible approach. Read the input file, keeping track of
where each new line starts. Use ftell() or fgetpos() and build a list
or array of indices.

Then traverse your index in reverse order. For each index, use
fseek() or fsetpos() to jump to that location in the file; read up to
the end of the current line and write to your output file.

(Note that some systems may provide a command to do this -- and 50MB
isn't all that much memory these days. Since you mentioned
Unix/Linux, <OT>"man tac" or "info tac"</OT>.)
 
D

David Schwartz

Let's assume I have following file:

2938929384902491233.....
923949919199191919112....

File contains INTs only. What is more they are huge. For example first
row in file may contain integer which size is 50MB and the second 30MB.
Now we come to my problem. Is there possibility to swap this rows
without using system memory (preferably in Unix/Linux)? Is there any
function in C to do this?

It's not clear what you mean by "without using system memory". Why
wouldn't you want to use memory if this results in faster operation?

For example, you could move 16Kb at a time, and rig it to use very
little system memory. But then the head of the disk would have to jump back
and forth 5,000 times. It would almost certainly make more sense to use more
memory.

Are you concerned about consumption of system cache?

It's a bit tricky, but there are exchange algorithms. You can tweak them
to minimize user memory consumption and then let the system figure out how
much cache is appropriate. You can hint to the cache to help with that.

DS
 
U

ulyses

David Schwartz napisal(a):
It's not clear what you mean by "without using system memory". Why
wouldn't you want to use memory if this results in faster operation?

I thouhg about something that would be some kind of low level function
that would modify pointers not real data, the same way as we do with
pointer to data in memory, e.g.:
int a*, b*, temp*;
....
//the swap
temp = a;
a = b;
b = temp;

And the data was swapped without moving it in memory. I asked if there
is such functionality that would enable me to do something like that
but with rows in file. Swap them without moving them on disk.

Thanks again for help,
John
 
B

Barry Margolin

"ulyses said:
David Schwartz napisal(a):

I thouhg about something that would be some kind of low level function
that would modify pointers not real data, the same way as we do with
pointer to data in memory, e.g.:
int a*, b*, temp*;
...
//the swap
temp = a;
a = b;
b = temp;

And the data was swapped without moving it in memory. I asked if there
is such functionality that would enable me to do something like that
but with rows in file. Swap them without moving them on disk.

No, because the information in files is not organized by lines, it's
just sequences of bytes organized by disk blocks. If you wanted to
rearrange the blocks it would theoretically be possible to update the
block pointers in the inode. However, there's no API for this, so you
would have to do it by writing a new ioctl or device driver, or by
accessing the disk device directly (and you'd need to unmount the
filesystem first, to avoid conflicts with the kernel's in-memory copies
of inodes).

But to rearrange lines you have to read the file into memory, search for
the newline characters, then write the lines back out to the file.
 
K

Keith Thompson

Barry Margolin said:
No, because the information in files is not organized by lines, it's
just sequences of bytes organized by disk blocks. If you wanted to
rearrange the blocks it would theoretically be possible to update the
block pointers in the inode. However, there's no API for this, so you
would have to do it by writing a new ioctl or device driver, or by
accessing the disk device directly (and you'd need to unmount the
filesystem first, to avoid conflicts with the kernel's in-memory copies
of inodes).

But to rearrange lines you have to read the file into memory, search for
the newline characters, then write the lines back out to the file.

If you want to rearrange the lines so you can read the rearranged file
with ordinary stdio calls, you'll need to physically re-write the file
(unless you can manage to do some nasty low-level file system stuff).
But if you want to be able to access the lines in some specified order
other than their physical order in the file, you can just create a
separate index. Do one pass over the file, creating an index of the
position of the start of each line (using ftell() or fgetpos()). Once
you have the index, you can use fseek() or fsetpos() to jump directly
to the beginning of any line you want.

Reading the file in reverse order is likely to be less efficient than
if you had physically reversed the file; the performance tradeoff
depends on how often you read it.
 
P

Pascal Bourguignon

ulyses said:
David Schwartz napisal(a):

I thouhg about something that would be some kind of low level function
that would modify pointers not real data, the same way as we do with
pointer to data in memory, e.g.:
int a*, b*, temp*;
...
//the swap
temp = a;
a = b;
b = temp;

And the data was swapped without moving it in memory. I asked if there
is such functionality that would enable me to do something like that
but with rows in file. Swap them without moving them on disk.

You can build an index yourself.

Read the data file byte by byte, and note the offset of all newline
characters. Save the list of offsets to an index file.

Later you can read the index file, and when you want to read the nth
number, you seek in the data file to the nth offset.

If you want to swap two numbers, you just swap the two offsets in the
index, which involves reading and writing at most two blocks. You can
also easily "delete" or "insert" numbers. To "delete" a number you
just remove its offset in the index (or put it in a "free list" if you
want to be able to reuse the space). To "insert" a number, you just
note the file size as the offset to the new number which you merely
append to the end of the file, and insert the offset in the index.
 
B

Barry Margolin

Pascal Bourguignon said:
You can build an index yourself.

Read the data file byte by byte, and note the offset of all newline
characters. Save the list of offsets to an index file.

Even better would be if you could get the program that creates the file
in the first place to create an index at the same time.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top