read huge text file from end


quickcur

Hi,

I have very large text files and I am only interested in the last 200
lines in each file. How can I read a huge text file line by line from
the end, something like the "tail" command in Unix?

Thanks,

qq
 

Eric Sosman

quickcur said:

Hi,

I have very large text files and I am only interested in the last 200
lines in each file. How can I read a huge text file line by line from
the end, something like the "tail" command in Unix?

Do as "tail" does: Get the size of the file, seek to
a position (200 * average_line_length + safety_margin) bytes
before the end, and start reading. Be prepared for some
glitches if you land in the middle of a multi-byte sequence;
you may need to be tolerant of a malformed line and/or
character decoding errors when you start reading.
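In Java, that might look something like the following (a rough sketch:
the average-line-length guess and the safety margin are made-up numbers,
and RandomAccessFile.readLine assumes one byte per character, so this is
only safe for single-byte encodings):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class Tail {
        /** Print roughly the last 'wanted' lines of a text file. */
        public static void tail(String fileName, int wanted) throws IOException {
            final int avgLineLength = 80;   // guess; tune for your data
            final int safetyMargin = 4096;  // slack in case the guess is low
            try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
                long start = raf.length()
                        - (long) wanted * avgLineLength - safetyMargin;
                if (start > 0) {
                    raf.seek(start);
                    raf.readLine();   // discard the probably-partial first line
                }
                // Keep only the last 'wanted' lines as we read forward.
                Deque<String> last = new ArrayDeque<>(wanted);
                String line;
                while ((line = raf.readLine()) != null) {
                    last.addLast(line);
                    if (last.size() > wanted) {
                        last.removeFirst();
                    }
                }
                for (String s : last) {
                    System.out.println(s);
                }
            }
        }
    }

If the window turns out to hold fewer than the wanted number of lines,
seek further back and read again.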

Of course, this simply isn't going to work for files
that contain statefully-encoded regions, or that have been
progressively compressed or encrypted. For "very large"
files, compression is distinctly likely -- even if you're
not using it now, you might want to ponder before committing
to a strategy that would prevent using it in the future.
 

Oliver Wong

Eric Sosman said:
Do as "tail" does: Get the size of the file, seek to
a position (200 * average_line_length + safety_margin) bytes
before the end, and start reading. Be prepared for some
glitches if you land in the middle of a multi-byte sequence;
you may need to be tolerant of a malformed line and/or
character decoding errors when you start reading.

Of course, this simply isn't going to work for files
that contain statefully-encoded regions, or that have been
progressively compressed or encrypted. For "very large"
files, compression is distinctly likely -- even if you're
not using it now, you might want to ponder before committing
to a strategy that would prevent using it in the future.

Hopefully, the compression would be handled by the underlying OS, and it
would all work "transparently" to your application.

Otherwise, you're no longer dealing with text files (in the traditional
sense), and if you've got custom file formats, you could do tricks like
actually encode the offset of the 200th line from the end into the header.
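For example (a purely hypothetical header layout, just to illustrate the
idea): reserve eight bytes at the front of the file, and have the writer
keep them pointed at the 200th-from-last line on each append:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Hypothetical format: the first 8 bytes of the file hold the offset
    // of the 200th-from-last line; the writer updates them on each append.
    class TailHeader {
        static final int HEADER_SIZE = 8;

        static long readTailOffset(RandomAccessFile raf) throws IOException {
            raf.seek(0);
            return raf.readLong();
        }

        static void writeTailOffset(RandomAccessFile raf, long offset)
                throws IOException {
            raf.seek(0);
            raf.writeLong(offset);
        }
    }

A reader would then read the header, seek straight to that offset, and
read lines to the end of the file.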

- Oliver
 

Eric Sosman

Oliver Wong wrote on 10/31/06 17:23:
(e-mail address removed) wrote on 10/31/06 15:45:
Hi,

I have very large text files and I am only interested in the last 200
lines in each file. How can I read a huge text file line by line from
the end, something like the "tail" command in Unix?

Do as "tail" does: Get the size of the file, seek to
a position (200 * average_line_length + safety_margin) bytes
before the end, [...]

Of course, this simply isn't going to work for files
that contain statefully-encoded regions, or that have been
progressively compressed or encrypted. For "very large"
files, compression is distinctly likely -- even if you're
not using it now, you might want to ponder before committing
to a strategy that would prevent using it in the future.


Hopefully, the compression would be handled by the underlying OS, and it
would all work "transparently" to your application.

It might "work" in the sense of "get to the data as
desired," but only by reading and decompressing everything
before that point -- which sort of vitiates the performance
advantage of the seek, don't you think?
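Java's own GZIPInputStream makes the point: it offers no random access,
so the only way to reach an offset is to skip() past, and therefore
decompress, everything before it. A sketch:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    class GzipSeek {
        // "Seeking" in a gzip stream: skip() must inflate every byte it
        // passes over, so the cost is linear in the target offset.
        static InputStream openAt(String fileName, long target)
                throws IOException {
            InputStream in = new GZIPInputStream(new FileInputStream(fileName));
            long remaining = target;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    break;          // reached end of stream early
                }
                remaining -= skipped;
            }
            return in;
        }
    }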
 

Mike Schilling

Eric Sosman said:
It might "work" in the sense of "get to the data as
desired," but only by reading and decompressing everything
before that point -- which sort of vitiates the performance
advantage of the seek, don't you think?

But that's not how OS file compression works. Generally, there's a page
size (8K or thereabouts), and each page is compressed separately, with the
OS keeping track of where each compressed page actually starts. A
random-access read requires figuring out where the pages containing the byte
range live and decompressing only those pages.
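Schematically (an invented page-table layout, not any particular OS's
on-disk format):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    class PagedRead {
        /**
         * pageStart has numPages + 1 entries; pageStart[i] is the file
         * offset where compressed page i begins, and the final entry is
         * the end of the file. (Invented layout, for illustration only.)
         */
        static byte[] readPage(RandomAccessFile raf, long[] pageStart,
                               int page, int pageSize)
                throws IOException, DataFormatException {
            byte[] comp = new byte[(int) (pageStart[page + 1] - pageStart[page])];
            raf.seek(pageStart[page]);
            raf.readFully(comp);
            Inflater inf = new Inflater();
            inf.setInput(comp);
            byte[] plain = new byte[pageSize];
            inf.inflate(plain);     // decompress only this one page
            inf.end();
            return plain;
        }
    }

A read of an arbitrary byte range touches only the pages that overlap it,
so the cost stays proportional to the amount of data requested.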
 

Eric Sosman

Mike said:
But that's not how OS file compression works. Generally, there's a page
size (8K or thereabouts), and each page is compressed separately, with the
OS keeping track of where each compressed page actually starts. A
random-access read requires figuring out where the pages containing the byte
range live and decompressing only those pages.

Look among the bits and pieces of snippage lying about on the
cutting-room floor, and you'll notice I wrote about files that
were "progressively compressed" or "progressively encrypted."
My terminology is probably inexact, but I meant "progressively"
to describe the sort of compressor/encryptor whose state at a
given point in the data stream is a function of the entire history
of the stream up to that point. gzip, for example.
 
