A challenging file to parse


david.deram

I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows.

Reading this file left-to-right top-to-bottom is not a problem but my
requirements are to read it top-to-bottom left-to-right (to read each
column in order as follows).

1,4,7
2,5,8
3,6,9

It's an O(n^2) problem if I read each line for each column (it could
take a week for a big file). The file is too big to hold the lines in
memory and I see no strategy where I can hold a subset of lines in
memory.

This seems like a simple problem but I have struggled with lots of
solutions that have come up short.

Any suggestions would be appreciated.

Thank you.
 

santosh

david.deram said:
I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows. [...]

If the lines are of fixed length, you can fseek to the beginning of each
line.
 

Walter Roberson

david.deram said:
I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows. [...] The file is too
big to hold the lines in memory and I see no strategy where I can
hold a subset of lines in memory.

For each row, ftell() to record the file position after the column
has been read. Then, when you are processing columns after the first,
fseek() to the remembered file position for a row, parse out the
next column, and record the new file position; advance to the column
in the next row by fseek()ing to the remembered file position for that
row, and so on.

You will still need to read the file through once in order to
determine the positions for each line and to determine the number
of rows. You may end up with a little bit of realloc() logic
as you discover you have more rows than your original allocation.

You might find the code cleaner if, on your first pass through,
you just look for ends of lines and record the position of each
line, without attempting to parse out the first field on each line.
On the other hand, if someone is watching it run interactively, the
latency before it starts producing any results might not be
considered acceptable.
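A minimal sketch of one step of this scheme (an illustration only, with assumed names: tab-separated fields shorter than 32 bytes, and long offsets from ftell() rather than fpos_t, for brevity):

```c
#include <stdio.h>

/* Extract the next field of every row: for each row, fseek() to its
 * remembered position, copy the next tab- or newline-terminated field
 * into out[i], and remember the new position.  A position of -1 marks
 * an exhausted row.  Returns the number of rows that yielded a field. */
int next_column(FILE *f, long *pos, int rows, char out[][32])
{
    int i, c, got = 0;

    for (i = 0; i < rows; i++) {
        int len = 0;

        out[i][0] = '\0';
        if (pos[i] < 0)
            continue;                       /* this row is exhausted */
        fseek(f, pos[i], SEEK_SET);
        while ((c = fgetc(f)) != EOF && c != '\t' && c != '\n')
            if (len < 31)
                out[i][len++] = (char)c;
        out[i][len] = '\0';
        pos[i] = (c == '\t') ? ftell(f) : -1;   /* resume point, or done */
        got++;
    }
    return got;
}
```

Calling this repeatedly, with pos[] initialized to each line's starting offset, walks the file column by column, reading only one field per row per call.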
 

David Resnick

david.deram said:
I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows. [...] Any suggestions
would be appreciated.

How about opening a temp file for each group of n columns, reading a
line at a time, and appending each line's section of columns to the
corresponding temp file? Picking n is an exercise for you, but I'd
suggest picking a number such that:
1) You can have that many files open at once without exceeding any
O/S limits.
2) You can slurp each file into memory comfortably for processing
after the first pass generates those files.

Say, 1000 temp files, each holding 1,000,000 items....
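A sketch of the first pass under this scheme (an assumption-laden illustration: tab-separated fields, newline-terminated lines, and temp streams that could come from tmpfile()):

```c
#include <stdio.h>

/* Distribute each line's fields among 'ngroups' temp streams so that
 * group g receives columns [g*per_group, (g+1)*per_group), preserving
 * the tab/newline structure so each temp file is itself a small table. */
void split_columns(FILE *in, FILE **groups, int ngroups, int per_group)
{
    int c, col = 0, g;

    while ((c = fgetc(in)) != EOF) {
        g = col / per_group;
        if (g >= ngroups)
            g = ngroups - 1;      /* overflow columns go to the last group */
        if (c == '\n') {
            int k;
            for (k = 0; k < ngroups; k++)   /* end every group's row */
                fputc('\n', groups[k]);
            col = 0;
        } else if (c == '\t') {
            col++;
            if (col % per_group != 0)       /* separator stays in-group */
                fputc('\t', groups[g]);
        } else {
            fputc(c, groups[g]);
        }
    }
}
```

The second pass would then rewind each temp stream in turn, slurp it into memory, and emit its columns.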

-David
 

Keith Thompson

santosh said:
If the lines are of fixed length, you can fseek to the beginning of each
line.

And if they're not of fixed length, you can do a single traversal of
the file to build an index. The simplest approach is to record the
ftell() value for the beginning of each line. Store the indices in
memory if you can. If not, write them to a file, using a fixed amount
of storage for each index. (For a text file, you can only portably
fseek to a location that came from a previous call to ftell, or to the
beginning or end of the file; you might need to make the index file
binary.)

To randomly access a line, fseek to the appropriate location in the
index file, read the index, and fseek to the specified location in the
main file. Keeping the index synchronized with the data file is left
as an exercise.
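A sketch of such an index (assumptions for illustration: the fixed amount of storage per entry is simply a long from ftell(), written in binary, and both streams are opened in binary mode so the offsets can be seeked to portably):

```c
#include <stdio.h>

/* Single traversal of 'data': write the ftell() offset of the start of
 * each line into 'idx' as a fixed-size binary record.  Returns the
 * number of lines indexed. */
long build_line_index(FILE *data, FILE *idx)
{
    long lines = 0;
    int c, prev = '\n';   /* a line starts at offset 0 and after '\n' */

    for (;;) {
        long where = ftell(data);
        c = fgetc(data);
        if (c == EOF)
            break;
        if (prev == '\n') {
            fwrite(&where, sizeof where, 1, idx);
            lines++;
        }
        prev = c;
    }
    return lines;
}

/* Random access: read entry 'n' of the index, then seek the data file
 * to the start of line 'n' (0-based).  Returns 0 on success. */
int seek_line(FILE *data, FILE *idx, long n)
{
    long where;

    if (fseek(idx, n * (long)sizeof where, SEEK_SET) != 0)
        return -1;
    if (fread(&where, sizeof where, 1, idx) != 1)
        return -1;
    return fseek(data, where, SEEK_SET);
}
```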

You might consider storing indices for columns, or perhaps for every
Nth column, rather than just for the beginning of each line. This is
going to be a tradeoff between space and speed.

I'm assuming, of course, that the files are stored on some
random-access medium such as a disk, not on a sequential-access medium
such as tape. fseek() is likely to be O(1), or nearly so, but the
standard says nothing about performance.

<OT>If you're willing to leave the realm of portable C file access, a
database would probably be a better way to store and access this
information.</OT>
 

Malcolm McLean

david.deram said:
I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows. [...] Any suggestions
would be appreciated.

Pass through the file, taking note of the thousand newlines.
Then pass through again: fseek() to each newline and read the first
column, then take note of where you stopped reading.
Continue, fseeking to the thousand positions where you left off.

It will still be rather slow, but not all that bad.
 

Robert Gamble

Walter Roberson said:
For each row, ftell() to record the file position after the column
has been read. Then when you are processing columns after the first,
fseek() to the remembered file position for a row, parse out the
next column, record the new file position; advance to the column
in the next row by fseek() to the remembered file position for that
row, etc.

Given the size of the file and the fact that ftell()/fseek() use long
int for offsets (which creates an artificial limit of only being able
to access the first 2GB on many 32-bit systems), fgetpos()/fsetpos()
or the non-standard lseek() and company may be more appropriate.

Robert Gamble
 

Eric Sosman

Walter Roberson wrote on 08/21/07 16:26:
Let's suppose you can store about 500MB of file data
in RAM at once. With about a thousand lines, that means
you can read the whole file, storing the leftmost 0.5MB
from each line and discarding the rest. You can then
write this portion of the data to the output file in
transposed order. Rewind the original file and make
another pass, this time ignoring the first 0.5MB from each
line, storing the next 0.5MB, and ignoring the tails.
Write that second batch out, rewind, rinse, and repeat.

Let's see: If the data items are ~10 bytes long plus
the tabs between them, each line is about 11MB and you'll
complete the job in about two dozen passes. If you've
got 1.5GB available, you can do it in eight or nine.
 

David Mathog

david.deram said:
I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows. [...]

Hmm, interesting. Too bad it's column/tab delimited. If it could be
done with fixed storage areas in a binary file it would make things a
little easier.
It's an O(n^2) problem if I read each line for each column (it could
take a week for a big file).

You don't have to do it one column per pass. Let

N be the number of rows in the file
C the number of columns in the file
M a number of columns <= C
N*M be an array that will fit nicely in memory

You only need to do C/M passes (um, and maybe one more). That
is, you store columns 1->M of rows 1->N, in the first pass, then emit
col 1 (row 1->N), col 2 (row 1->N) etc. That could cut the number of
passes through the file by a factor of a thousand.

As others have noted, if you keep track of the position of the last
column read in each row on subsequent passes instead of reading N*C
values you'll only have to read N*M.
The file is too big to hold the lines in
memory and I see no strategy where I can hold a subset of lines in
memory.

Hold a subset of columns and rows, not whole lines. If all you're doing
is emitting the values in the other order then this is just the problem
of transposing a matrix which is too big to fit into memory, albeit with
variable width data.
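A minimal sketch of one such pass (the names and the fixed MAXROWS/MAXM/MAXFLD limits are illustrative assumptions only; tab-separated fields assumed):

```c
#include <stdio.h>
#include <string.h>

#define MAXROWS 64   /* illustration-sized limits */
#define MAXM     8
#define MAXFLD  32

/* One pass of the blocked scheme: scan 'in' once, keeping only columns
 * [first, first+m) of every row, then emit those columns in column
 * order, one output line per column.  C/m such passes transpose the
 * whole table.  Returns the number of rows seen. */
int transpose_block(FILE *in, FILE *out, int first, int m)
{
    static char keep[MAXROWS][MAXM][MAXFLD];
    int rows = 0, col = 0, len = 0, c, i, j;

    memset(keep, 0, sizeof keep);
    rewind(in);
    while ((c = fgetc(in)) != EOF) {
        if (c == '\t' || c == '\n') {
            if (col >= first && col < first + m)
                keep[rows][col - first][len] = '\0';
            len = 0;
            if (c == '\n') {
                rows++;
                col = 0;
                if (rows >= MAXROWS)
                    break;
            } else {
                col++;
            }
        } else if (col >= first && col < first + m && len < MAXFLD - 1) {
            keep[rows][col - first][len++] = (char)c;
        }
    }
    for (j = 0; j < m; j++)
        for (i = 0; i < rows; i++)
            fprintf(out, "%s%c", keep[i][j], i + 1 < rows ? '\t' : '\n');
    return rows;
}
```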

Regards,

David Mathog
 

user923005

david.deram said:
I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows. [...] Any suggestions
would be appreciated.

1. Write a program that forms a transpose of the original file.
2. Read the file a row at a time.
 

CBFalconer

I have a group of files in a format that is that is tab delimited
with about a million columns and a thousand rows.

I assume this means the original input has a million lines with a
maximum line length of about one thousand.
Reading this file left-to-right top-to-bottom is not a problem
but my requirements are to read it top-to-bottom left-to-right.
[...] Any suggestions would be appreciated.

I suggest you start by adding line numbers, with leading zeroes, to
each original line. Thus the first 6 (or better, 8) characters of
the original lines are line numbers. Then sort that file on line
length, shortest line first. Now I believe you can solve the
problem with successive pieces. This is just an instinctive first
attempt on my part.
 

user923005

david.deram said:
I have a group of files in a format that is tab delimited with
about a million columns and a thousand rows. [...] This seems like a
simple problem but I have struggled with lots of solutions that have
come up short.

/*
No need to read the table more than once. I would do this:
*/

#include <string.h>
#include <limits.h>
#include <stdlib.h>
#include <ctype.h>

/* The default delimiter is a comma: */
static const char default_delimiters[] = ",";

/*
 * The tokenize() function is similar to a reentrant version of strtok().
 * It parses tokens from 'string', where tokens are substrings separated
 *   by characters from 'delimiter_list'.
 * To get the first token from 'string', tokenize() is called with
 *   'string' as its first parameter.
 * Remaining tokens from 'string' are obtained by calling tokenize()
 *   with NULL for the first parameter.
 * The string of delimiters, identified by 'delimiter_list', can change
 *   from call to call.
 * If the string of delimiters is NULL, then the standard list
 *   'default_delimiters' (see above) is used.
 * tokenize() modifies the memory pointed to by 'string', because it
 *   writes null characters into the buffer.
 */
char *tokenize(char *string, const char *delimiter_list, char **placeholder)
{
    if (delimiter_list == NULL)
        delimiter_list = default_delimiters;

    if (delimiter_list[0] == 0)
        delimiter_list = default_delimiters;

    if (string == NULL)
        string = *placeholder;

    if (string == NULL)
        return NULL;

    /*
     * The strspn() function computes the length of the initial segment
     * of the first string that consists entirely of characters contained
     * in the second string.
     */
    string += strspn(string, delimiter_list);
    if (!string[0]) {
        *placeholder = string;
        return NULL;
    } else {
        char *token = string;
        /*
         * The strpbrk() function finds the first occurrence in the first
         * string of any character contained in the second string.
         */
        string = strpbrk(token, delimiter_list);
        if (string == NULL)
            *placeholder = token + strlen(token);
        else {
            *string++ = 0;
            *placeholder = string;
        }
        return token;
    }
}

#ifdef UNIT_TEST
char test_string0[] = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17";
char test_string1[] = "21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37";
char test_string2[] = "31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47";

#include <stdio.h>

void make_row_col(char *test_string, char *white, int row)
{
    char *p = NULL;
    char *token;
    int col = 0;

    token = tokenize(test_string, white, &p);
    while (token) {
        printf("%d,%d,%s\n", row, col++, token);
        token = tokenize(NULL, white, &p);
    }
}

int main(void)
{
    int row = 0;
    make_row_col(test_string0, NULL, row++);
    make_row_col(test_string1, NULL, row++);
    make_row_col(test_string2, NULL, row++);
    return 0;
}
#endif
/*

-- Now load the file of transposed information into this database table.
-- After that, you can do a SQL query to invert row and col, or select
-- by row, or select by col.

CREATE TABLE Transposer (
    row INTEGER NOT NULL,
    col INTEGER NOT NULL,
    data VARCHAR(20)
);

CREATE UNIQUE INDEX XPKTransposer ON Transposer
(
    row ASC,
    col ASC
);

CREATE INDEX XieTransposerRow ON Transposer
(
    row ASC
);

CREATE INDEX XieTransposerCol ON Transposer
(
    col ASC
);

ALTER TABLE Transposer
    ADD PRIMARY KEY (row, col);

*/
 

Chris Dollin

CBFalconer said:
I assume this means the original input has a million lines with a
maximum line length of about one thousand.

I think you've transposed his arrangement: I'd read that as
a thousand rows with a million elements each.

I'd be interested in why the data is in this somewhat strange
format. A /million/ columns? Even a thousand I find ... odd.
(But then I've little acquaintance with gigadata.) Maybe
knowing how this data came to be would allow a more effective
program to be written.
 

CBFalconer

Chris said:
I think you've transposed his arrangement: I'd read that as
a thousand rows with a million elements each.

That's why I stated the assumption. A thousand lines would have to
be up to a million chars long, and that seems highly inconvenient,
especially since C i/o has possible limitations on line lengths.
C99 can barely handle thousand character lines. C90 can't
guarantee it.
 

david.deram

It *is* a million columns. In some cases it can actually be 8 million
columns. The file is built by an instrument that produces a lot of
data for each element (each row).

The transpose is a good idea but it's not just this one file and there
is a performance consideration with rebuilding.

What I really want to do is just parse the file in column order.

I like the idea of storing whatever will fit in memory...dividing the
file into blocks.

But I think the lightest and fastest solution is to record the
position for each row in some kind of collection that will hold 64 bit
integers. I am going to give this a shot.
Walter Roberson said:
For each row, ftell() to record the file position after the column
has been read. Then when you are processing columns after the first,
fseek() to the remembered file position for a row, parse out the
next column, record the new file position; advance to the column
in the next row by fseek() to the remembered file position for that
row, etc.

It will be interesting because I have no control over how many rows
there might be.

Thank you for your input on this. It has opened up some ideas for me.
 

Walter Roberson

CBFalconer said:
That's why I stated the assumption. A thousand lines would have to
be up to a million chars long, and that seems highly inconvenient,
especially since C i/o has possible limitations on line lengths.
C99 can barely handle thousand character lines. C90 can't
guarantee it.

I can't think at the moment of which constraint you might be
referring to. The only one that comes to mind right now is the
constraint on source line length, but that's a compilation-phase
limit, not an execution-time one.

I do see a C90 environment limit of 509 on the number of
characters produced by fprintf(), and there is an implicit
size limit of INT_MAX for fgets(). I don't see any C90 limit
documented for fscanf() other than the implicit one that
the number of -items- (not number of characters) has to fit
into an int.

Could you expand on the limit you were thinking of?
 

Robert Gamble

david.deram said:
It *is* a million columns. In some cases it can actually be 8 million
columns. [...] But I think the lightest and fastest solution is to
record the position for each row in some kind of collection that will
hold 64 bit integers. I am going to give this a shot. [...]

Below is a transpose program I wrote, such as that suggested by others
in this thread. If you decide that transposing the files is not the
solution you'd prefer, you should be able to easily modify it for your
needs. The program includes full error checking but is only minimally
tested. It currently assumes the last line ends in a newline. The
resulting file will have the same number of columns for every row; if
the input file is missing columns on certain rows, those cells will be
empty in the resulting file. Hope this helps.

transpose.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define DELIMITER '\t'
#define TERMINATOR '\n'

/* Wrapper functions */
void *xmalloc(size_t size)
{
    void *t = malloc(size);
    if (t == NULL) {
        puts("Out of memory!");
        exit(EXIT_FAILURE);
    }
    return t;
}

void *xrealloc(void *t, size_t size)
{
    t = realloc(t, size);
    if (t == NULL) {
        puts("Out of memory!");
        exit(EXIT_FAILURE);
    }
    return t;
}

void xfsetpos(FILE *stream, fpos_t *pos)
{
    if (fsetpos(stream, pos) != 0) {
        perror("fsetpos failed");
        exit(EXIT_FAILURE);
    }
}

void xfgetpos(FILE *stream, fpos_t *pos)
{
    if (fgetpos(stream, pos) != 0) {
        perror("fgetpos failed");
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char *argv[])
{
    fpos_t **fpa = NULL;
    unsigned long rows = 0;
    unsigned long cur_row = 0;
    unsigned long finished_rows = 0;
    FILE *infile;
    FILE *outfile;
    int c;

    if (argc != 3) {
        fprintf(stderr, "Usage: transpose input-file output-file\n");
        exit(EXIT_FAILURE);
    }

    if ((infile = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "Failed to open file %s: %s\n", argv[1],
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    if ((outfile = fopen(argv[2], "w")) == NULL) {
        fprintf(stderr, "Failed to open file %s: %s\n", argv[2],
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    fpa = xrealloc(fpa, sizeof *fpa);
    fpa[cur_row] = xmalloc(sizeof **fpa);
    xfgetpos(infile, fpa[cur_row++]);

    /* Read through the file, recording the starting position of each
       record */
    while ((c = fgetc(infile)) != EOF) {
        if (c == TERMINATOR) {
            fpa = xrealloc(fpa, (cur_row + 1) * sizeof *fpa);
            fpa[cur_row] = xmalloc(sizeof **fpa);
            xfgetpos(infile, fpa[cur_row++]);
        }
    }

    /* FIXME: Need to test whether the last record ends with the
       terminator character */
    rows = cur_row - 1;

    /* Process the next column of data for every row, one row at a time,
       until all rows are finished */
    while (finished_rows < rows) {
        for (cur_row = 0; cur_row < rows; cur_row++) {
            if (!fpa[cur_row]) {
                /* If the pointer is NULL the row has ended but there is
                   more data; output a delimiter character to represent
                   an empty cell */
                if (cur_row + 1 < rows)
                    fputc(DELIMITER, outfile);
                continue;
            }
            xfsetpos(infile, fpa[cur_row]);
            while ((c = fgetc(infile)) != TERMINATOR && c != EOF) {
                if (c == DELIMITER) {
                    if (cur_row + 1 < rows)
                        fputc(DELIMITER, outfile);
                    xfgetpos(infile, fpa[cur_row]);
                    goto end;
                } else {
                    fputc(c, outfile);
                }
            }
            /* Row ended; set the position indicator for this row to
               NULL */
            if (cur_row + 1 < rows)
                fputc(DELIMITER, outfile);
            free(fpa[cur_row]);
            fpa[cur_row] = NULL;
            finished_rows++;
          end: ;
        }
        /* Finished transposing the current column; output a terminator
           character */
        fputc(TERMINATOR, outfile);
    }

    free(fpa[rows]);
    free(fpa);
    fclose(infile);
    fclose(outfile);
    return 0;
}


Robert Gamble
 

Keith Thompson

david.deram said:
It *is* a million columns. In some cases it can actually be 8 million
columns. [...] But I think the lightest and fastest solution is to
record the position for each row in some kind of collection that will
hold 64 bit integers. I am going to give this a shot. [...]

Once you fseek() to the beginning of a row, you still have to scan
linearly to get to the column. Assuming the columns are of variable
width, you might also have your initial scan save an index for every,
say, 1000 columns, and *perhaps* save the index table for each row in
a separate file.

So, if you want to fetch column 12345 from row 512, you could retrieve
the location of column 12000 from the 512th index file, then seek to
that position, then skip 344 columns.
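The lookup arithmetic for such a sparse column index can be sketched as follows (a hypothetical layout: one entry per 1000 columns, where entry n holds the file position at which column n*1000 begins):

```c
#define STRIDE 1000L   /* assumed index granularity */

/* Which index entry holds the nearest preceding saved position. */
long index_entry(long target_col)
{
    return target_col / STRIDE;
}

/* How far into the stride the target column lies, i.e. how much
 * remains to be skipped after seeking to the entry's position. */
long columns_to_skip(long target_col)
{
    return target_col % STRIDE;
}
```

For column 12345 this gives entry 12 and a remainder of 345; whether that means skipping 344 whole intermediate columns or crossing 345 field boundaries depends on exactly which position each entry stores.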

Try several different approaches, measuring the performance of each.

Or, as I suggested earlier, slurp the data into some sort of database
(Berkeley DB would probably suit your purposes) -- but that would most
likely expand your disk usage considerably.
 

umbs.sairam

I understand that this is a C group and solutions tend towards making
use of C language features/libraries. However, I viewed it as an
algorithm problem and found an O(N) run-time and O(N) memory solution,
where N is the total count of numbers in the file.

Assumptions: the numbers are in a matrix-like structure with R rows
and C columns. I realized later that this is too severe a restriction
for your case.

Summary: when coding in C, we parse left-to-right and top-to-bottom.
Do the same thing, but when storing into an array, store each value
at the location it would occupy if we had read top-to-bottom and
left-to-right [array storage is an issue in your case, but anyway,
here is my solution].

To explain better, let's assume:
R = rows in the matrix
C = columns in the matrix
N = R * C = total numbers
SCAN[N] = array that stores all these numbers
i = row index
j = column index
p = j*R + i = index into the array SCAN[]

So, when reading the numbers left-to-right [as in C], store them at
index 'p'. In O(N) you will read the complete file, store it, and can
print it on screen.
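A sketch of that index mapping (illustrative names only; R rows and C columns as defined above):

```c
/* SCAN[j*R + i] is column-major order, so a plain left-to-right,
 * top-to-bottom read fills SCAN[] as if the file had been read
 * top-to-bottom, left-to-right. */
long scan_index(long i, long j, long R)
{
    return j * R + i;
}

/* Fill 'scan' from a row-major table while reading it in row order. */
void fill_scan(const int *rowmajor, int *scan, long R, long C)
{
    long i, j;
    for (i = 0; i < R; i++)
        for (j = 0; j < C; j++)
            scan[scan_index(i, j, R)] = rowmajor[i * C + j];
}
```

Reading SCAN[] front to back then yields the columns in order, matching the 1,4,7 / 2,5,8 / 3,6,9 example above.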

Anyway, this solution is too restrictive.
 
