Analogue of ReadLine() in C

  • Thread starter Максим Фомин

Morris Keesan

The reason that your code fails is that you can't generally realloc what
you haven't [m|c|0]alloc'd.

Wrong. From the C99 standard, ISO/IEC 9899:1999 (E)

7.20.3.4 The realloc function

Synopsis
#include <stdlib.h>
void *realloc(void *ptr, size_t size);
....
If ptr is a null pointer, the realloc function behaves like the
malloc function for the specified size.
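
As a minimal illustration of that clause: the sketch below builds up a line in
a buffer whose pointer starts out as NULL, so the very first realloc call acts
as malloc. The helper name and the byte-at-a-time growth are purely
illustrative, not anything from the thread.

#include <stdio.h>
#include <stdlib.h>

/* Append one byte to a buffer that starts life as a null pointer.
 * Since realloc(NULL, size) behaves like malloc(size), no separate
 * "first allocation" case is needed.
 */
static char *append_byte(char *buf, size_t *len, int ch)
{
    char *tmp = realloc(buf, *len + 1);   /* buf may be NULL here */
    if (tmp == NULL) {
        free(buf);
        return NULL;
    }
    tmp[(*len)++] = (char)ch;
    return tmp;
}

int main(void)
{
    char *line = NULL;    /* never explicitly malloc'd */
    size_t len = 0;
    int c;

    while ((c = getchar()) != EOF && c != '\n') {
        line = append_byte(line, &len, c);
        if (line == NULL)
            return EXIT_FAILURE;
    }
    printf("read %lu characters\n", (unsigned long)len);
    free(line);
    return 0;
}

Growing one byte at a time keeps the sketch short; the growth strategy itself
is discussed further down the thread.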
 

BartC

Malcolm McLean said:
The problem is that it is impossible to define a reasonable line
length.

A good guide might be how long you are prepared to spend scrolling from one
end to another.

My rough calculations show that a line 30,000 characters long would take
about a minute to scroll from one end to the other (paging a full width,
twice a second). 30,000 characters seems perfectly reasonable (I use 2048
myself and rarely have problems).

(30,000 characters also requires a virtual screen 200 feet wide. A line
buffer 1 billion characters long requires a virtual screen some 1300 miles
across and would take 3 weeks to scroll. No-one can possibly pretend that
that is still one line of text!)

If a limit of a few thousand characters is not enough, then a different
approach is needed; something in between ReadLine and ReadFile; ReadBlock
perhaps, as the file cannot be considered line-oriented, not in the
conventional, human-readable sense.
 

Keith Thompson

BartC said:
Malcolm McLean said:
The problem is that it is impossible to define a reasonable line
length.

A good guide might be how long you are prepared to spend scrolling from one
end to another. [snip]
If a limit of a few thousand characters is not enough, then a different
approach is needed; something in between ReadLine and ReadFile; ReadBlock
perhaps, as the file cannot be considered line-oriented, not in the
conventional, human-readable sense.

Text isn't necessarily meant to be human-readable.
 

Ian Collins

BartC said:
Malcolm McLean said:
Someone doing 'ReadLine' is probably interested in line-oriented files.

A file containing a single multi-GB line is likely not
line-oriented!

Or is not using the expected newline sequence.

The problem is that it is impossible to define a reasonable line
length.

A good guide might be how long you are prepared to spend scrolling from one
end to another. [snip]
If a limit of a few thousand characters is not enough, then a different
approach is needed; something in between ReadLine and ReadFile; ReadBlock
perhaps, as the file cannot be considered line-oriented, not in the
conventional, human-readable sense.

Text isn't necessarily meant to be human-readable.

Indeed. The longest single "line" I have in a file is 1.3GB (a JSON
representation of a rather large filesystem).
 

Jorgen Grahn

I am not sure how you decided that this suits the OP's use case. I
think the original posted code was just a cut down example, so it's not
obvious that silently splitting lines that are longer than 255
characters suits the OP.


I think you'd be annoyed if, say, awk behaved like this. Given a big
enough buffer, your shell might get away with it, but I expect it
doesn't try.

I'm too lazy to look for the context, but I assume you are talking
about varying input line lengths here.

I think it's a bit more complicated than that. Assuming some
line-oriented input format:

- It's usually wrong to set a limit (say, 8192 bytes) and pretend
anything longer is two or more lines.

- It's usually wrong to accept *any* length by using malloc()/realloc(),
and then start swapping and crashing when someone feeds you a
10GB line.

I assume a decent awk sets some limit (far higher than any reasonable
input) and exits gracefully with an error message when it gets
anything larger.

- Many file/data formats put a limit to the line length anyway.
NNTP, SMTP. IIRC also the C language says a compiler is allowed
to bail out on lines longer than N characters.

/Jorgen
 

BartC

Keith Thompson said:
Text isn't necessarily meant to be human-readable.

In that case there should be no pretence that a ReadLine function actually
reads a 'line' at all, but any sequence of characters of any length.

That's why I suggested such a function should be called Readblock, with all
the problems that that would introduce (such as possibly bringing down the
system in the case of a rogue file with 'lines' too long to fit into either
physical or virtual memory).

Then 'ReadLine' can be used for the straightforward cases such as reading
configuration files or command lines. Lines which are too long are reported
as errors (which in all probability they are).
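
A sketch of that kind of ReadLine, assuming an arbitrary cap of a couple of
thousand characters and invented return codes, might look like this; an
over-long line comes back as a distinct error instead of being silently split:

#include <stdio.h>

#define READLINE_MAX 2048    /* arbitrary cap, per the "few thousand" figure */

enum { RL_OK, RL_EOF, RL_TOO_LONG, RL_ERROR };

/* Read one line (without its newline) into buf, which must hold
 * READLINE_MAX + 1 characters.  A line that does not fit is reported
 * as RL_TOO_LONG rather than being split into several "lines".
 */
static int read_line(FILE *f, char *buf)
{
    size_t n = 0;
    int c;

    while ((c = fgetc(f)) != EOF && c != '\n') {
        if (n == READLINE_MAX) {
            buf[n] = '\0';
            return RL_TOO_LONG;
        }
        buf[n++] = (char)c;
    }
    buf[n] = '\0';
    if (c == EOF)
        return ferror(f) ? RL_ERROR : (n == 0 ? RL_EOF : RL_OK);
    return RL_OK;
}

A caller that gets RL_TOO_LONG can then decide whether to skip ahead to the
next newline or abandon the file altogether.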
 

Stefan Ram

Jorgen Grahn said:
- It's usually wrong to set a limit (say, 8192 bytes) and pretend
anything longer is two or more lines.

Untested:

int readline
( FILE * const source,
int( * const target )( int ch, void * obj ),
void * const object )
{ int c; int result = 1; while( result == 1 )
{ c = fgetc( source ); result =
c == EOF ? 2 :
ferror( source )? 3 :
c == '\n' ? 0 :
target( c, object )? 4 : result; }
return result; }
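
For anyone who wants to try the function above, one possible target callback
is sketched below; the struct, the names and the doubling growth are mine
rather than anything from the post, and it assumes the readline definition
above is compiled alongside it:

#include <stdio.h>
#include <stdlib.h>

/* Prototype of the readline() defined above. */
int readline(FILE *const source,
             int (*const target)(int ch, void *obj),
             void *const object);

/* A growing byte buffer to pass to readline() as its "object". */
struct line { char *s; size_t len, cap; };

/* target callback: append one character, doubling the buffer as needed.
 * Returning nonzero on allocation failure makes readline() return 4.
 */
static int append(int ch, void *obj)
{
    struct line *l = obj;
    if (l->len == l->cap) {
        size_t ncap = l->cap ? l->cap * 2 : 64;
        char *tmp = realloc(l->s, ncap);
        if (tmp == NULL)
            return 1;
        l->s = tmp;
        l->cap = ncap;
    }
    l->s[l->len++] = (char)ch;
    return 0;
}

int main(void)
{
    struct line l = { NULL, 0, 0 };
    /* 0: newline seen, 2: EOF, 3: stream error, 4: callback failed */
    int r = readline(stdin, append, &l);

    if (r == 0 || r == 2)
        printf("got %lu characters\n", (unsigned long)l.len);
    free(l.s);
    return 0;
}

The doubling keeps the total amount of copying proportional to the final line
length.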
 

Malcolm McLean

In that case there should be no pretence that a ReadLine function actually
reads a 'line' at all, but any sequence of characters of any length.

There's a difference between human-inspectable and human-readable.

In the case of my program, no-one can really make sense of a matrix
with 1000 factors and 6000 cases. However you can certainly look at it
to see that the factors have a reasonable balance of set and missing,
the labels are right, there's the right number of cases (one per
gene), and so on.
 

Keith Thompson

BartC said:
In that case there should be no pretence that a ReadLine function
actually reads a 'line' at all, but any sequence of characters of any
length.

That's why I suggested such a function should be called Readblock,
with all the problems that that would introduce (such as possibly
bringing down the system in the case of a rogue file with 'lines' too
long to fit into either physical or virtual memory).

Then 'ReadLine' can be used for the straightforward cases such as
reading configuration files or command lines. Lines which are too long
are reported as errors (which in all probability they are).

A block of characters terminated by a newline character is what we call
a "line".
 

Ben Bacarisse

Untested:

int readline
( FILE * const source,
int( * const target )( int ch, void * obj ),
void * const object )
{ int c; int result = 1; while( result == 1 )
{ c = fgetc( source ); result =
c == EOF ? 2 :
ferror( source )? 3 :
c == '\n' ? 0 :
target( c, object )? 4 : result; }
return result; }

Unread. :)
 

Ben Bacarisse

Jorgen Grahn said:
I'm too lazy to look for the context, but I assume you are talking
about varying input line lengths here.

I was talking about the remark I quoted -- that dynamic allocation is
not a good idea because no buffer is big enough.

I think it's a bit more complicated than that. Assuming some
line-oriented input format:

- It's usually wrong to set a limit (say, 8192 bytes) and pretend
anything longer is two or more lines.

- It's usually wrong to accept *any* length by using malloc()/realloc(),
and then start swapping and crashing when someone feeds you a
10GB line.

I assume a decent awk sets some limit (far higher than any reasonable
input) and exits gracefully with an error message when it gets
anything larger.

With a general purpose program like awk, I think there are only two good
strategies:

(a) Just keep allocating because the system does something helpful when
the process gets too large.

(b) Limit the total memory allocation, without limiting any one kind
(like line length).

The trouble is that there are systems that don't do (a) and
implementations that don't (or can't) do (b). I don't think limiting
the line length is really the right solution.
- Many file/data formats put a limit to the line length anyway.
NNTP, SMTP. IIRC also the C language says a compiler is allowed
to bail out on lines longer than N characters.

If a sane upper limit can be picked then there is no problem.
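
As a rough sketch of strategy (b), a pair of allocation wrappers could track
the total handed out against a global budget; the cap, the names and the
size-header trick are my own illustration (and the header only guarantees
size_t alignment, which is fine for character buffers):

#include <stdlib.h>
#include <string.h>

#define MEM_BUDGET (64UL * 1024 * 1024)   /* arbitrary 64 MB cap on total allocation */

static size_t mem_used;

/* Allocate n bytes, refusing once the overall budget would be exceeded.
 * The requested size is stashed in a header so xfree() can keep the
 * running total accurate.
 */
static void *xmalloc(size_t n)
{
    unsigned char *p;

    if (n > MEM_BUDGET - mem_used)
        return NULL;                 /* over budget: fail cleanly, don't thrash */
    p = malloc(sizeof(size_t) + n);
    if (p == NULL)
        return NULL;
    memcpy(p, &n, sizeof n);
    mem_used += n;
    return p + sizeof(size_t);
}

static void xfree(void *ptr)
{
    if (ptr != NULL) {
        unsigned char *base = (unsigned char *)ptr - sizeof(size_t);
        size_t n;

        memcpy(&n, base, sizeof n);
        mem_used -= n;
        free(base);
    }
}

/* realloc-like wrapper built on the two above (always copies). */
static void *xrealloc(void *ptr, size_t n)
{
    size_t old = 0;
    void *p;

    if (ptr != NULL)
        memcpy(&old, (unsigned char *)ptr - sizeof(size_t), sizeof old);
    p = xmalloc(n);
    if (p != NULL && ptr != NULL) {
        memcpy(p, ptr, old < n ? old : n);
        xfree(ptr);
    }
    return p;
}

A line reader built on xrealloc() then fails when the whole program's budget
runs out, rather than at some arbitrary per-line length.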
 

BartC

Keith Thompson said:
A block of characters terminated by a newline character is what we call
a "line".

Yes, but is a line a few hundred characters long, as it might have been once
(e.g. 132), or could it conceivably be terabytes long? I think that in most
cases where you want to read a line, assuming that the line can be,
effectively, of infinite length, adds unnecessary complications.
 

Ian Collins

Yes, but is a line a few hundred characters long, as it might have been once
(e.g. 132), or could it conceivably be terabytes long? I think that in most
cases where you want to read a line, assuming that the line can be,
effectively, of infinite length, adds unnecessary complications.

"where you want to read a line" is the key there, data can be stored in
an infinitely long "line", but if that data isn't intended to be read as
one line (my 1.3GB JSON line for example) the argument is moot.
 

Stefan Ram

Ian Collins said:
an infinitely long "line"

ISO/IEC 9899:1999 (E), 7.19.2p2 requires lines to be
terminated by a new-line character. An infinitely long
ordered sequence of characters is not terminated by a
new-line character.
 

Malcolm McLean

Yes, but is a line a few hundred characters long, as it might have been once
(e.g. 132), or could it conceivably be terabytes long? I think that in most
cases where you want to read a line, assuming that the line can be,
effectively, of infinite length, adds unnecessary complications.

Not really. Library functions make it slightly easier, in C, to read a
line into a fixed buffer. But if you go for an allocated buffer, the
natural thing is to keep expanding until you run out of memory. On
decent systems, behaviour when fed an infinite sequence of characters
is perfectly acceptable.
 

Markus Wichmann

The total time and space is O(n). Changing the multiplier leaves you with O(n).
Starting with size>0 reduces the time and space by O(k), however
O(n)-O(k) = O(n). You're worrying about optimising at the wrong level which is a
waste of time unless you've done measurement showing this is a hot loop. Also
realloc(0,n) is the same as malloc(n) so one initial malloc is no different than
a realloc on the first loop; but it requires more code.
[...]

If the resize doesn't increase geometrically, the run time switches to O(n^2).
Otherwise it is O(n) + whatever code complexity you want to pretend you can beat
O(n).

One of the problems with O-notation is that it tells you nothing when the
complexity stays the same. If I had an algorithm that _always_ beat your
algorithm by half the runtime, the complexity would still be the same.

O-notation is only meaningful if it shows a difference! Otherwise, more
insight is needed.

CYA,
Markus
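
To put rough numbers on the quoted point about geometric resizing, here is a
tiny model (mine, not from the thread) that just adds up how many bytes would
get copied while a buffer grows to hold a one-million-character line under the
two policies:

#include <stdio.h>

/* Model the total copying cost of growing a buffer to n bytes: each
 * enlargement is assumed to copy the current contents, as realloc may
 * have to.  This is a back-of-the-envelope model only.
 */
static unsigned long long bytes_copied(unsigned long n, int geometric)
{
    unsigned long cap = 64;
    unsigned long long total = 0;

    while (cap < n) {
        total += cap;
        cap = geometric ? cap * 2 : cap + 64;
    }
    return total;
}

int main(void)
{
    unsigned long n = 1000000;   /* a one-million-character "line" */

    printf("doubling:  %llu bytes copied\n", bytes_copied(n, 1));
    printf("fixed +64: %llu bytes copied\n", bytes_copied(n, 0));
    return 0;
}

On this model the doubling policy copies a total comparable to n itself
(bounded by about 2n), while fixed 64-byte steps copy on the order of n^2/128,
which is the O(n) versus O(n^2) difference being described.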
 

Richard Damon

The total time and space is O(n). Changing the multiplier leaves you with O(n).
Starting with size>0 reduces the time and space by O(k), however
O(n)-O(k) = O(n). You're worrying about optimising at the wrong level which is a
waste of time unless you've done measurement showing this is a hot loop. Also
realloc(0,n) is the same as malloc(n) so one initial malloc is no different than
a realloc on the first loop; but it requires more code.
[...]

If the resize doesn't increase geometrically, the run time switches to O(n^2).
Otherwise it is O(n) + whatever code complexity you want to pretend you can beat
O(n).

One of the problems with O-notation is that it tells you nothing when the
complexity stays the same. If I had an algorithm that _always_ beat your
algorithm by half the runtime, the complexity would still be the same.

O-notation is only meaningful if it shows a difference! Otherwise, more
insight is needed.

CYA,
Markus

And O-notation is also only meaningful if n is going to get "large". If
your problem domain inherently has only "small" n, then using a simpler
O(n^2) algorithm may be significantly better than an O(n) one. It may have
a smaller constant, or may just be simple enough that the extra resources
used don't matter (if whatever is being measured isn't a significant
bottleneck for the program).
 

David Thompson

On 21 Oct 2011 08:58:55 GMT, Jorgen Grahn <[email protected]>
wrote:
I think it's a bit more complicated than that. Assuming some
line-oriented input format:

- It's usually wrong to set a limit (say, 8192 bytes) and pretend
anything longer is two or more lines.

I'd say always wrong, or very nearly always.
- It's usually wrong to accept *any* length by using malloc()/realloc(),
and then start swapping and crashing when someone feeds you a
10GB line.

I assume a decent awk sets some limit (far higher than any reasonable
input) and exits gracefully with an error message when it gets
anything larger.

I recently had occasion to do some testing on this.

gawk on Linux died around 1G with an error message clearly indicating
that reallocation failed (as expected for a 2G-ish address space).
gawk on mingw failed somewhere in the millions, which on experiment
was confirmed to be very nearly the same size as a program that just
realloc'ed one buffer upward; presumably this has to do with Windows
address space as seen by mingw and I didn't investigate in detail.

Other awks on several Solarises*, HPUX, and an elderly AIX all died
in the thousands, except that Solaris's most-ancient 'oawk' did the very
bad thing of *breaking* lines over some length, IIRC about 500.

(* I think the plural should actually be Solares IIRC my high-school
Latin, but I doubt anybody cares.)
- Many file/data formats put a limit to the line length anyway.
NNTP, SMTP. IIRC also the C language says a compiler is allowed
to bail out on lines longer than N characters.

Although NNTP and SMTP, at least, limit line lengths 'on the wire'
after encoding. From early on there were lots of attempts at encodings
allowing 'real' lines to be longer than the transmitted ones, although
none really successful until MIME's QP and B64.

A C implementation is not required to support more than 4095 chars
(509 in C89) in a 'logical' source line (after backslash splicing) nor
more than 254 per line in *text* files at runtime. The basic source
charset doesn't include newline, but some end-of-line 'indicator' is
"treat[ed] ... as if ... newline", so it's not clear if the 4095
includes that. The basic execution charset includes newline, which
terminates lines in text files, and is included in the 254.

In the 70s and into the 80s when C was developed (and most of the
major Internet protocols also) there were lots of important systems
and especially filesystems that had definite limits on the sizes of
records, which conventionally mapped to lines of text (although a C
implementor could break that mapping if they had to).

The Tandem NonStop defined a rather odd format for text files
(officially EDIT format, commonly called code-101 because that's how
it is identified in the directory) which compresses runs of spaces and
can handle actual line lengths from about 239 with no spaces to over
3000 with many spaces. This was created before C existed, and when
Tandem implemented C (belatedly, right around '88) they added
'C-format' files (bag-o-bytes with NL) as code-180, and explicitly
noted in their manual that reading EDIT files, although supported as
an extension (and extremely useful to sites with lots of those files)
was not quite exactly conforming.
 

Shao Miller

I need a function similar to ReadLine() in other languages.
The program below fails[1] when more data than LAG_SIZE is passed

[...code...]

I guess the biggest line you can handle is limited by the biggest
contiguous range of memory that you can allocate. Maybe you could
allocate as large a buffer as possible when your program starts, then
each time you read a line, try to fill the buffer from the beginning and
worry about what happens when you don't find a newline before that
buffer has been filled.

Here's a silly little bit of code to allocate a large buffer, which is
also available at http://codepad.org/nLoYZzej:

/* For use with C89 */

#include <stddef.h>
#include <stdlib.h>

static void * biggest_buf(void);

int main(void) {
    char * foo;

    foo = biggest_buf();
    free(foo);
    return EXIT_SUCCESS;
}

static void * biggest_buf(void) {
    char * big_buf;
    size_t max_size, big_size, test_size, highest_test;

    /* Find the maximum size_t */
    max_size = 0;
    --max_size;

    switch (1) while (1) {
        /* big_size < test_size < highest_test */
        big_buf = malloc(test_size);
        if (!big_buf) {
            if (test_size == big_size) {
                /* Maximum allocation has changed. Start over */
                case 1:
                test_size = highest_test = max_size;
                big_size = 0;
                continue;
            }
            /*
             * We couldn't allocate a bigger buffer than last time.
             * Split the difference and try a smaller allocation
             */
            highest_test = test_size;
            test_size = (test_size - big_size) / 2 + big_size;
            continue;
        }

        /* Check if we've found the biggest allocation */
        if (test_size == big_size) {
            /* All done */
            break;
        }

        /* Otherwise, we might be able to allocate more */
        free(big_buf);
        big_size = test_size;
        test_size = (highest_test - big_size) / 2 + big_size;
        continue;
    }
    return big_buf;
}

- Shao Miller
 
