A challenging file to parse

Mike

This whole problem gets a lot simpler if you have a relational database.

You can read the whole file once in sequential order. For each field, write a database record with three columns: one containing the row
number, one the column number, and one the field value. The resulting table can then be easily processed by column, by row, or even by
field value, and by any combination of the above.
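
A sketch of that one-pass load, using SQLite purely as an illustration (the original post names no particular database; the table name cells, the file names, and the omitted error checking are all my own simplifications):

#include <stdio.h>
#include <sqlite3.h>

/* One sequential pass over the delimited file: every field becomes a
   (row, col, value) record.  Error checking omitted for brevity. */
int load(const char *path)
{
    FILE *fp;
    sqlite3 *db;
    sqlite3_stmt *ins;
    char field[256];
    size_t len = 0;
    long row = 0, col = 0;
    int c;

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    sqlite3_open("cells.db", &db);
    sqlite3_exec(db, "CREATE TABLE cells(row INTEGER, col INTEGER, value TEXT);",
                 NULL, NULL, NULL);
    sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);   /* one big transaction */
    sqlite3_prepare_v2(db, "INSERT INTO cells VALUES(?,?,?);", -1, &ins, NULL);

    while ((c = getc(fp)) != EOF) {
        if (c == '\t' || c == '\n') {               /* end of a field */
            field[len] = '\0';
            sqlite3_bind_int64(ins, 1, row);
            sqlite3_bind_int64(ins, 2, col);
            sqlite3_bind_text(ins, 3, field, -1, SQLITE_TRANSIENT);
            sqlite3_step(ins);
            sqlite3_reset(ins);
            len = 0;
            if (c == '\n') { row++; col = 0; } else col++;
        } else if (len + 1 < sizeof field) {
            field[len++] = (char)c;
        }
    }
    sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
    sqlite3_exec(db, "CREATE INDEX by_col ON cells(col, row);", NULL, NULL, NULL);
    sqlite3_finalize(ins);
    sqlite3_close(db);
    fclose(fp);
    return 0;
}

After the load, a query such as SELECT value FROM cells ORDER BY col, row walks the data in exactly the column-major order the OP asked for.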

Mike Sicilian



|I have a group of files in a format that is tab delimited with
| about a million columns and a thousand rows.
|
| Reading this file left-to-right top-to-bottom is not a problem but my
| requirements are to read it top-to-bottom left-to-right (to read each
| column in order as follows).
|
| 1,4,7
| 2,5,8
| 3,6,9
|
| It's an O(n^2) problem if I read each line for each column (it could
| take a week for a big file). The file is too big to hold the lines in
| memory and I see no strategy where I can hold a subset of lines in
| memory.
|
| This seems like a simple problem but I have struggled with lots of
| solutions that have come up short.
|
| Any suggestions would be appreciated.
|
| Thank you.
|
 
CBFalconer

Walter said:
| I can't think at the moment of which constraint you might be
| referring to? The only one that comes to mind right now is the
| constraint on source line length, but that's compilation phase,
| not execution time.
|
| I do see a C90 environment limit of 509 on the number of
| characters produced by fprintf(), and there is an implicit
| size limit of INT_MAX for fgets(). I don't see any C90 limit
| documented for fscanf() other than the implicit one that
| the number of -items- (not number of characters) has to fit
| into an int.

I may well be thinking of that fprintf limit. On considering
things slightly further, your attitude seems to make sense. I
haven't researched the std.
 
CBFalconer

| I understand that this is a C group and solutions tend towards making
| use of C language features/libraries. However, I viewed it as an
| algorithm problem and found an O(N) run-time and O(N) memory solution,
| where N is the total count of numbers in the file.
|
| Assumptions: the numbers are in a matrix-like structure with R rows
| and C columns. I realized later that this is too severe a restriction
| for your case.
|
| Summary: When coding in C, we parse from left-to-right and top-to-
| bottom. Do the same thing, but when storing in an array, store it at
| locations that appear as if we have read top-to-bottom and left-to-
| right [array storing is an issue in your case, but anyway, here is my
| solution]
|
| To explain better, let's assume
| R = Rows in the matrix
| C = Columns in the matrix
| N = R * C = total numbers
| SCAN[N] = Array that is storing all these numbers
| i = row index
| j = column index
| p = j*R + i; index into array SCAN[]
|
| So, when reading the numbers from left-to-right [as in C], store them
| at index 'p'. In O(N) you will read the complete file, store it and
| print it on screen.
|
| Anyway, this solution is too restrictive.
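
For reference, a minimal sketch of that quoted scheme in C, assuming every field is a single character followed by one separator, and assuming R*C bytes actually fit in memory (which, as noted below, they do not for the OP's files):

#include <stdio.h>
#include <stdlib.h>

/* Read an R x C matrix of one-character fields in file order
   (left-to-right, top-to-bottom) but store at p = j*R + i, so a
   sequential walk of scan[] yields top-to-bottom, left-to-right. */
char *read_transposed(FILE *fp, long R, long C)
{
    char *scan = malloc((size_t)R * (size_t)C);
    long i, j;

    if (scan == NULL)
        return NULL;
    for (i = 0; i < R; i++) {           /* i = row index */
        for (j = 0; j < C; j++) {       /* j = column index */
            int ch = getc(fp);
            if (ch == EOF) {
                free(scan);
                return NULL;
            }
            scan[j * R + i] = (char)ch; /* p = j*R + i */
            getc(fp);                   /* consume the tab or newline */
        }
    }
    return scan;
}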

You are omitting the real bind of the storage medium, i.e. the file
system. Unfortunately this message has lost the original
specifications, due to the foolish use of top-posting. Please
don't do that.

To the OP: please describe the actual format in terms of lines and
columns within the lines. Also the data format. So far the whole
thing is up in the air.
 
Peter J. Holzer

| I understand that this is a C group and solutions tend towards making
| use of C language features/libraries. However, I viewed it as an
| algorithm problem and found an O(N) run-time and O(N) memory solution,
| where N is the total count of numbers in the file.
|
| Assumptions: the numbers are in a matrix-like structure with R rows
| and C columns. I realized later that this is too severe a restriction
| for your case.
|
| Summary: When coding in C, we parse from left-to-right and top-to-
| bottom. Do the same thing, but when storing in an array, store it at
| locations that appear as if we have read top-to-bottom and left-to-
| right [array storing is an issue in your case, but anyway, here is my
| solution]
|
| To explain better, let's assume
| R = Rows in the matrix
| C = Columns in the matrix
| N = R * C = total numbers

The OP wrote:

| I have a group of files in a format that is tab delimited with
| about a million columns and a thousand rows.

so N would be about 1 billion.

| SCAN[N] = Array that is storing all these numbers

The OP didn't specify what is stored in the files, but since each
element of an array takes at least one byte of storage, the whole array
would occupy at least a gigabyte of memory,

so your algorithm (which the OP probably thought of) doesn't help here.

hp
 
Peter J. Holzer

| I can't think at the moment of which constraint you might be
| referring to? The only one that comes to mind right now is the
| constraint on source line length, but that's compilation phase,
| not execution time.

N1124, section 7.19.2 Streams:

| An implementation shall support text files with lines containing at
| least 254 characters, including the terminating new-line character.
| The value of the macro BUFSIZ shall be at least 256.

I think C90 had a similar limit, but I don't have that at hand.

However, since most systems support text files with arbitrarily long
lines, this is no argument against the existence of files with
ludicrously long lines. All you need is one implementation which
supports long lines and this doesn't even have to be a C implementation.

hp
 
Walter Roberson

| N1124, section 7.19.2 Streams:
| | An implementation shall support text files with lines containing at
| | least 254 characters, including the terminating new-line character.
| | The value of the macro BUFSIZ shall be at least 256.
| I think C90 had a similar limit, but I don't have that at hand.

Thanks for pointing that out. The wording is the same in C90 4.9.2
"Environmental Limits". A footnote on the same page supports the same:

111. An implementation need not distinguish between text streams
and binary streams. In such an implementation, there need be no
new-line characters in a text stream nor any limit to the
length of a line.

But "need be" is just a license to do that or not do that, so for
portability, one cannot assume that lines can be indefinitely long
even on systems that do not distinguish between text and binary streams.

C90 defines the macro BUFSIZ on the previous page, 4.9.1 Introduction.

BUFSIZ
which expands to an integral constant expression which is the
size of the buffer used by the setbuf function.

This raises a question: if the developer codes setbuf() to a larger
buffer, then does the size of that new buffer then become the line
length supported (on systems that have limits), or would a system that
has line length limits be permitted to continue to limit to 254? If the
limit would be required to rise, then the implication would be that
there would be a portable way to increase the line limit to any amount
of memory that could be portably malloc()'d. And what effect
would setting the buffer size to smaller have? Or to setting the
stream to unbuffered?

(Say... is there a clause in the standard regarding the minimum maximum
size that can be portably malloc()'d? Could an implementation be
conforming if it allowed the malloc of only 1 * sizeof(largest builtin
type) ? Could an implementation be conforming if it always returned
NULL instead of malloc()'ing any memory? [Though I seem to recall that
a couple of standard functions such as printf() are described as using
dynamically allocated memory... though that clause about 509 as the
minimum length that fprintf() needs to support could imply a fixed size
buffer.])
 
Peter J. Holzer

| Thanks for pointing that out. The wording is the same in C90 4.9.2
| "Environmental Limits". A footnote on the same page supports the same:
|
|     111. An implementation need not distinguish between text streams
|     and binary streams. In such an implementation, there need be no
|     new-line characters in a text stream nor any limit to the
|     length of a line.
|
| But "need be" is just a license to do that or not do that, so for
| portability, one cannot assume that lines can be indefinitely long
| even on systems that do not distinguish between text and binary streams.
|
| C90 defines the macro BUFSIZ on the previous page, 4.9.1 Introduction.
|
|     BUFSIZ
|     which expands to an integral constant expression which is the
|     size of the buffer used by the setbuf function.
|
| This raises a question: if the developer codes setbuf() to a larger
| buffer, then does the size of that new buffer then become the line
| length supported (on systems that have limits), or would a system that
| has line length limits be permitted to continue to limit to 254?

The latter I think. A limit on the line length is probably imposed by
the operating system, either because it uses fixed length records for
text files (in which case you probably don't want to waste 64k for each
line just because the programmer thought 64k would be a nice buffer
size to write many short lines) or because it uses variable length
records with a small (e.g. 1 byte) length field (in which case the line
length limit can't be raised).

hp
 
david.deram

The format is a comma or tab delimited ascii text file with rows in
the thousands and potentially up to 5 million columns.

The data in the columns tends to be a single letter.

The goal is simply to parse it O(n) (to never read the same data
twice).
 
Walter Roberson

| The format is a comma or tab delimited ascii text file with rows in
| the thousands and potentially up to 5 million columns.
| The data in the columns tends to be a single letter.
| The goal is simply to parse it O(n) (to never read the same data
| twice).

O(n) means "the work required is dominated by the linear size of
the data, for sufficiently big data". So if you had to read
the data (say) 973 times each [no matter how much data there was]
then it would still be O(n). Reading data twice is O(n).

The strategy I outlined of keeping track of seek offsets is O(n)
(provided that the cost of seeking is more or less constant),
but involves reading the data twice, one of the passes being
required to establish the line boundaries.

If the data does not fit within memory, I don't think you will be
able to find any strategy that does not involve reading
the portion that does not fit at least twice -- not if the lines
may be different lengths.
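
A sketch of that first pass, recording an fpos_t for the start of every line so later passes can seek straight to any row (the caller-supplied table and its cap are illustrative; error handling is omitted):

#include <stdio.h>

/* Pass 1: scan the whole file once, remembering where each line
   starts.  fgetpos/fsetpos are used rather than ftell/fseek since
   these files can exceed what a long can index. */
long index_lines(FILE *fp, fpos_t *pos, long max_rows)
{
    fpos_t here;
    long rows = 0;
    int c, at_bol = 1;              /* at beginning of a line? */

    for (;;) {
        if (at_bol)
            fgetpos(fp, &here);
        c = getc(fp);
        if (c == EOF)
            break;
        if (at_bol && rows < max_rows)
            pos[rows++] = here;     /* this character starts a line */
        at_bol = (c == '\n');
    }
    return rows;
}

The table costs one fpos_t per row, which is cheap for a few thousand rows.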
 
Eric Sosman

| The format is a comma or tab delimited ascii text file with rows in
| the thousands and potentially up to 5 million columns.
|
| The data in the columns tends to be a single letter.
|
| The goal is simply to parse it O(n) (to never read the same data
| twice).

Has O-abuse become some kind of plague, borne from
who knows where by infected birds? There seems to be an
awful lot of it these days ... (Two points about this
particular "O(n):" first, you have not said whether "n"
is the number of rows, the number of columns, their product,
or the number of programmers who cast the result of malloc().
Second, doing something twice or thrice or a million times
has the same "O" as doing it once.)

The information you've now provided about the file
is a little different from that given originally. "Rows
in the thousands" is not "a thousand rows," one rather
important difference being the implication that you may
not know the row count when you start. Also, "up to five
million columns" is not the same as "about a million
columns" -- it's the same O, but a factor of five is not
to be ignored ...

I cannot think of a method that does not read the
data (most of it, anyhow) at least twice. If you want to
seek to different places in the file, you first need to
locate the line boundaries -- and to do that, you need to
have read the '\n' characters. I don't see any way around
that issue, unless there's still more about the file you
haven't yet told us.

But I don't think you need to make more than about
two full passes over the file -- one full pass plus N
partial passes that each read 1/N'th of the data. On the
first pass, you read straight through the whole thing,
just counting lines and doing an fgetpos() at the start
of each one. From the row count and the available memory
you can calculate how much data you can afford to store
from each row; the "band width" of a vertical stripe of
the matrix.

Then you start the partial passes. Use fsetpos() to
go to the start of each row, read its leftmost "band,"
and use fgetpos() again to remember where the unread part
of each row starts. Output or process the stored data in
column-wise fashion. Then make another partial pass, now
seeking to the updated file positions from the first pass.
Lather, rinse, repeat.
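
A sketch of those partial passes, assuming one-character fields and a line-start table pos[] built on the first pass; BAND and the direct putchar output are illustrative choices:

#include <stdio.h>
#include <stdlib.h>

enum { BAND = 4096 };   /* columns held in memory per partial pass */

/* Each iteration reads one vertical stripe: seek to where every row
   left off, grab up to BAND fields, remember the new position, then
   emit the stripe column by column. */
void banded_passes(FILE *fp, fpos_t *pos, long rows, long cols)
{
    char *band = malloc((size_t)rows * BAND);
    long done, r, c;

    if (band == NULL)
        return;
    for (done = 0; done < cols; done += BAND) {
        long width = (cols - done < BAND) ? cols - done : BAND;

        for (r = 0; r < rows; r++) {
            fsetpos(fp, &pos[r]);          /* resume row r */
            for (c = 0; c < width; c++) {
                band[r * BAND + c] = (char)getc(fp);
                getc(fp);                  /* skip tab or newline */
            }
            fgetpos(fp, &pos[r]);          /* where row r's unread part starts */
        }
        for (c = 0; c < width; c++)        /* column-wise output */
            for (r = 0; r < rows; r++)
                putchar(band[r * BAND + c]);
    }
    free(band);
}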

The underlying file system will probably "read" a little
more data than your program does, because it most likely
does I/O in fixed-size blocks: When you seek to the middle
of a block and start reading, the first part of the block
will be read even though you don't want it. Similarly, when
you stop reading in the middle of a block, the end of that
block has probably already been read even though you don't
use it. Edge effects of this kind can add maybe 8KB or so
to each row's band during the partial passes, so if the line
count is large and the band width is narrow you might find
that N full passes (taking advantage of read-ahead) outperform
N partial passes (each penalized by edge-effect slop). You'll
need to decide for yourself.
 
Willem

(e-mail address removed) wrote:
) I have a group of files in a format that is tab delimited with
) about a million columns and a thousand rows.

Are you allowed to write a temporary file ?

If so, parse the file once, in sequential order, and write out a temp file
with the same values, but with fixed-length records.
Then you can use random-access on that file to parse the records in any
order you desire.
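
A sketch of that conversion, assuming each field fits in a small fixed width (WIDTH here is invented; real code would size it from the data or a first pass):

#include <stdio.h>
#include <string.h>

enum { WIDTH = 8 };     /* bytes per fixed-length record, padded with '\0' */

/* Sequential pass: copy each tab/comma-delimited field into a
   fixed-width slot of the temp file. */
void make_fixed(FILE *in, FILE *tmp)
{
    char rec[WIDTH] = {0};
    size_t len = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\t' || c == ',' || c == '\n') {
            fwrite(rec, 1, WIDTH, tmp);
            memset(rec, 0, sizeof rec);
            len = 0;
        } else if (len < WIDTH) {
            rec[len++] = (char)c;
        }
    }
}

/* Random access: field (row, col) now lives at a computable offset.
   Note fseek takes a long, which can overflow on multi-gigabyte
   files; a real program would need fsetpos or a platform-specific
   64-bit seek. */
int get_field(FILE *tmp, long row, long col, long cols, char rec[WIDTH])
{
    long off = (row * cols + col) * WIDTH;

    if (fseek(tmp, off, SEEK_SET) != 0)
        return -1;
    return fread(rec, 1, WIDTH, tmp) == WIDTH ? 0 : -1;
}

Whether this wins depends on the field sizes, as the reply below points out.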


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Walter Roberson

| (e-mail address removed) wrote:
| ) I have a group of files in a format that is tab delimited with
| ) about a million columns and a thousand rows.
|
| Are you allowed to write a temporary file ?
| If so, parse the file once, in sequential order, and write out a temp file
| with the same values, but with fixed-length records.
| Then you can use random-access on that file to parse the records in any
| order you desire.

That's an interesting idea, but if some fields are strings with
no fixed maximum length, then at the time one reads the first line,
one doesn't know what the appropriate fixed-length record size
would be. But if the fields are each readily convertible to fixed-size
binary, then the temporary file you propose could end up faster to
work with than the original file.
 
user923005

| I cannot think of a method that does not read the
| data (most of it, anyhow) at least twice. If you want to
| seek to different places in the file, you first need to
| locate the line boundaries -- and to do that, you need to
| have read the '\n' characters. I don't see any way around
| that issue, unless there's still more about the file you
| haven't yet told us.

The C I posted to parse as row+col+value and then stick it into a
database would only read the raw data once.

Then you have the ability to read back by rows or columns or whatever
as well.

In a sense, you really would be reading and writing again, whenever
you wrote to and read from the database, but it would be efficient
retrieval [you could create an index on (row), (col), (row:col),
(col:row)].

[snip]
 
Eric Sosman

user923005 wrote on 09/05/07 19:43:
| The C I posted to parse as row+col+value and then stick it into a
| database would only read the raw data once.

... and the database then indexes it, cross-indexes
it, shuffles it, slices it, and dices it how many times?

Obviously I can do *any* processing in "one pass over
the raw data" by using my one pass to make a copy and
thereafter making as many passes as I like over the copy.

| Then you have the ability to read back by rows or columns or whatever
| as well.
|
| In a sense, you really would be reading and writing again, whenever
| you wrote to and read from the database, but it would be efficient
| retrieval [you could create an index on (row), (col), (row:col),
| (col:row)].

If the O.P. needs repeated access to the same data sets
along multiple dimensions, this might make sense. If he
only needs one shot at each data set, or if he doesn't need
all that flexibility, or if he doesn't feel like tripling
the size of his data[*] before loading the database ...

[*] "Thousands of rows" means at least ten bits for a
row number. "Up to five million columns" means at
least 23 bits for the column number. Call it two
and three bytes, respectively. Average item size
is one byte (multi-byte items are rare), so average
"decorated" item is six bytes. Original file used
tabs or commas as item separators; a more efficient
encoding for the rare multi-byte items thus cuts
the file size in half. Net result: size triples.
 
user923005

Eric Sosman wrote:
| user923005 wrote on 09/05/07 19:43:
[snip]
| | | I cannot think of a method that does not read the
| | | data (most of it, anyhow) at least twice. If you want to
| | | seek to different places in the file, you first need to
| | | locate the line boundaries -- and to do that, you need to
| | | have read the '\n' characters. I don't see any way around
| | | that issue, unless there's still more about the file you
| | | haven't yet told us.
| | The C I posted to parse as row+col+value and then stick it into a
| | database would only read the raw data once.
|
| ... and the database then indexes it, cross-indexes
| it, shuffles it, slices it, and dices it how many times?
|
| Obviously I can do *any* processing in "one pass over
| the raw data" by using my one pass to make a copy and
| thereafter making as many passes as I like over the copy.
| | Then you have the ability to read back by rows or columns or whatever
| | as well.
| | In a sense, you really would be reading and writing again, whenever
| | you wrote to and read from the database, but it would be efficient
| | retrieval [you could create an index on (row), (col), (row:col),
| | (col:row)].
|
| If the O.P. needs repeated access to the same data sets
| along multiple dimensions, this might make sense. If he
| only needs one shot at each data set, or if he doesn't need
| all that flexibility, or if he doesn't feel like tripling
| the size of his data[*] before loading the database ...
|
| [*] "Thousands of rows" means at least ten bits for a
|     row number. "Up to five million columns" means at
|     least 23 bits for the column number. Call it two
|     and three bytes, respectively. Average item size
|     is one byte (multi-byte items are rare), so average
|     "decorated" item is six bytes. Original file used
|     tabs or commas as item separators; a more efficient
|     encoding for the rare multi-byte items thus cuts
|     the file size in half. Net result: size triples.

It seems unlikely to me that only one pass over the data will be
needed when the operations are done.
If that were the case, then the best solution is to add one more
processing step to the program that writes the file and have all the
processing done without any additional reads.
 
David Thompson

Walter Roberson said:
| I can't think at the moment of which constraint you might be
| referring to? The only one that comes to mind right now is the
| constraint on source line length, but that's compilation phase,
| not execution time.

The possible text-file line length limit is already cited elsethread.

| I do see a C90 environment limit of 509 on the number of
| characters produced by fprintf(), and there is an implicit

A _minimum_ limit aka guarantee of 509 (C99 4095) chars produced by
_one conversion_ in fprintf, and hence the other *printf's.

| size limit of INT_MAX for fgets(). I don't see any C90 limit

Huh? The size passed to fgets() is size_t. And of course nothing says
you have to use fgets, or for that matter *printf/*scanf; all C I/O
can be done, and is defined as-if done, by [f]getc/putc. Plus
ftell/fseek (and rewind) and fgetpos/fsetpos, as already discussed.

| documented for fscanf() other than the implicit one that
| the number of -items- (not number of characters) has to fit
| into an int.

Number of characters (per call) must fit in appropriate integer type
if you use %<mod>n, but of course you don't have to do so.

And there may be (sometimes is) a limit on the number of _arguments_
when calling _any_ function including *printf/scanf; this must be at
least 31 in C90 and 127 in C99 for the "one program"; the standard
doesn't guarantee it for other programs, but any implementation that
doesn't do so will quickly get a well-deserved death.

Also *printf/scanf format strings are _often_ string literal values,
and the minimum limit on those is similarly C90 509 or C99 4095. For
format strings in variables, the minimum limit (guarantee) on object
size hence string length is C90 32767 or C99 65535 -- again formally
only for the "one program" but it had better be for others.

But there is no need for *printf/scanf calls to correspond to lines;
they can easily be more than a line, or less, or both. In fact (FAQt?)
one of the more common newbie issues posted here is getting scanf
_unintentionally_ out of sync with line input, and the suggested fix
is fgets + sscanf, or fgets + {strtok,strto{l,ll,d,ld,etc}}.
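
For completeness, the usual shape of that fix (the format and field names are just an illustration):

#include <stdio.h>

/* Read one whole line with fgets, then parse it in memory with
   sscanf: a malformed line can no longer drag the stream position
   out of sync with line boundaries. */
int read_pair(FILE *fp, int *x, int *y)
{
    char line[256];

    if (fgets(line, sizeof line, fp) == NULL)
        return 0;                           /* EOF or read error */
    return sscanf(line, "%d %d", x, y) == 2;
}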
| Could you expand on the limit you were thinking of?

- formerly david.thompson1 || achar(64) || worldnet.att.net
 
Ben Bacarisse

David Thompson said:
| A _minimum_ limit aka guarantee of 509 (C99 4095) chars produced by
| _one conversion_ in fprintf, and hence the other *printf's.
|
| Huh? The size passed to fgets() is size_t.

The non-authoritative documents I have to hand (for C90 and C99) all
have the second parameter of fgets as an int.

I should maybe wait for someone with a minted copy of an actual
standard to hand to reply to this, but if int survived at least into a
draft of C99 it is a reasonable mistake to make, even if it had been
changed to size_t at the last minute. If it *has* changed, gcc's
library is not conforming.
 
Richard Heathfield

Ben Bacarisse said:
| David Thompson <[email protected]> writes:
|
| The non-authoritative documents I have to hand (for C90 and C99) all
| have the second parameter of fgets as an int.

The official release of C99 also has int:

7.19.7.2 The fgets function
Synopsis
1 #include <stdio.h>
char *fgets(char * restrict s, int n, FILE * restrict stream);

You've caught David Thompson in a rare error. As Darth Vader once said:
"hmmm - impressive".
 
David Thompson

Ben Bacarisse said:

Sorry, I don't know what I was thinking there.

| You've caught David Thompson in a rare error. As Darth Vader once said:
| "hmmm - impressive".

Please don't send Vader after me for errors, however. That's getting
rather closer to a real (??^??^) DeathStation than I want.

Note: ??^ is not a trigraph. I checked. <G>

- formerly david.thompson1 || achar(64) || worldnet.att.net
 
