Reading LAST line from text file without iterating through the file?


Arne Vajhøj

The thing about CR and LF is that lineprinters, and things which are
pretending to be lineprinters, like terminal emulators and text editors,
know how to deal with them; they write the next character lower down
and/or at the start of the line. They aren't record separators, they're
format effectors (ASCII does have record separators - an impressive
range of them, in fact - but i don't know of anybody using them).

What happens if you send one of these alleged text files from a
mainframe to a printer or a shell? Do the printers and shells in
mainframe land handle those formats, or does there have to be a program
that reads the format and then talks to the printer? Or does that all
happen down in the OS? How does the lineprinter know to move the golf
ball across the paper when it gets to the end of a record?

It is mostly transparent.

The program that needs to read the file reads the file
just as it would on any other platform.

Java readLine / C fgets / Fortran READ or whatever returns a line.

The underlying system calls handle the record format used for
physical storage of the line.
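
For illustration, a minimal sketch of that portable view (assuming Java 11+; the file path argument is hypothetical): the program only asks for lines, and the runtime maps them onto whatever the platform actually stores.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineCount {
    public static void main(String[] args) throws IOException {
        // readLine() hides the on-disk record format: LF, CRLF, or the
        // record-oriented storage used on platforms such as OpenVMS or IBM i.
        try (BufferedReader in = Files.newBufferedReader(Path.of(args[0]))) {
            long lines = 0;
            while (in.readLine() != null) {
                lines++;
            }
            System.out.println(lines + " lines");
        }
    }
}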

Arne
 

Arved Sandstrom

On 11-02-25 06:46 PM, Tom Anderson wrote:
[ SNIP ]
That a file has text somewhere in it does not make it a text file.

tom
Without getting completely anal over it all, I'm reasonably content with
what Wikipedia has to say about text files:

1. Structured as a sequence of lines, so either fixed-length lines or EOL
characters, and often an EOF marker;

2. No additional metadata required to interpret a text file; they stand
by themselves. This I think is more important than point 1;

As far as I am concerned, point 2 is central. To me "rich text" formats
are an oxymoron.

AHS
 

Wojtek

Ken Wesson wrote :
Record formats are not relevant here, since text files do not have record
formats; they are raw sequences in some character set more or less by
definition. Anything with additional structure over and above that is
something other than a text file. Generically we call such things "binary
files" though commonly binary files do *contain* text. But all contain
additional structure that cannot be represented in, say, a
java.lang.String without resort to some form of escaping or encoding. And
that makes them not pure text, but text-and-some-other-stuff or some-
other-stuff-that-happens-to-contain-text.

There are text files which are pure text yet are structured, in that they
have either:
- comma-delimited fields, or
- fixed-length field columns.

java.lang.String quite happily holds this information in that a single
line is a "record" and can be manipulated as pure text.

Breaking out "fields" is fairly trivial using standard character
routines. Building them up and writing them out is also a purely text
operation.

I will grant you that comma delimited files have "coding" in that the
comma has special meaning, but it still falls within the ASCII set.
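
A minimal sketch of that field-splitting, with an invented record and the assumption that no field contains an embedded comma:

public class CsvFields {
    public static void main(String[] args) {
        // Hypothetical comma-delimited record (no quoting assumed).
        String record = "Smith,John,1980-04-02,Toronto";
        String[] fields = record.split(",", -1);  // -1 keeps trailing empty fields
        System.out.println("surname=" + fields[0] + ", given name=" + fields[1]);
        // Building a record back up is equally a pure text operation.
        System.out.println(String.join(",", fields));
    }
}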

I have also used HL7 (a medical information file format) which is also
pure text yet has many information delimiters and needs to be
extensively parsed to extract the information. If I remember right, a
line feed is one of the delimiters and a carriage return is a record
delimiter.
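
A rough sketch of that kind of delimiter-driven parsing, using the commonly cited HL7 v2 conventions (carriage return between segments, '|' between fields); the message below is invented, and the exact delimiters should be treated as an assumption:

public class Hl7Sketch {
    public static void main(String[] args) {
        // Hypothetical two-segment HL7 v2 message.
        String msg = "MSH|^~\\&|SENDER|FAC|RECEIVER|FAC|20110225||ADT^A01|123|P|2.3\r"
                   + "PID|1||555443333||Doe^John";
        for (String segment : msg.split("\r")) {
            String[] fields = segment.split("\\|");  // '|' is a regex metacharacter
            System.out.println(fields[0] + ": " + fields.length + " fields");
        }
    }
}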

Let's see, CSS, *ML, log files, property files, all are pure ASCII text
which nevertheless hold "field" information.

Emails (newsgroup postings) are also 7-bit ASCII and are pure text yet
the headers hold record information. To get binary information you need
to have an encoding standard, which is itself 7-bit.
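
For example, Base64 is the usual encoding standard that lets arbitrary bytes travel inside a 7-bit text body; a small sketch with an invented payload:

import java.util.Base64;

public class SevenBitPayload {
    public static void main(String[] args) {
        // Hypothetical binary payload to carry inside 7-bit text.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        String wire = Base64.getMimeEncoder().encodeToString(binary);  // pure 7-bit ASCII
        System.out.println(wire);
        byte[] back = Base64.getMimeDecoder().decode(wire);
        System.out.println(back.length + " bytes recovered");
    }
}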

I used to say that you could not get a virus from an email just by
reading it, because 7-bit text could not hold a program. Then some innocent
at Microsoft decided to implement auto-running scripts...
 

Daniele Futtorovic

Without getting completely anal over it all, I'm reasonably content with
what Wikipedia has to say about text files:

1. Structured as a sequence of lines, so either fixed-length lines or EOL
characters, and often an EOF marker;

2. No additional metadata required to interpret a text file; they stand
by themselves. This I think is more important than point 1;

As far as I am concerned, point 2 is central. To me "rich text" formats
are an oxymoron.

Good one. So how about we define "text file" as a sequence of binary
data that:
1. Contains no metadata;
2. The whole content of which can be transformed into character data
using a single character encoding scheme;
3. Adheres to some convention as to what constitutes a line or separates two
lines.

This raises, as far as I can see, two questions:
a) How do we define "character"?
b) What about BOMs? Aren't they metadata? Wouldn't they violate
constraint number two? Does this mean we must exclude from being
candidates for text file encodings such character encodings as make
provision for variable endianness?
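
A quick sketch of how constraint 2 might be checked mechanically: try to decode the whole byte sequence with one charset in strict mode (UTF-8 here is an arbitrary choice, purely for illustration):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SingleEncodingCheck {
    // True if every byte decodes under the single encoding chosen.
    static boolean decodesCleanly(byte[] content) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodesCleanly("plain text".getBytes(StandardCharsets.UTF_8)));
        System.out.println(decodesCleanly(new byte[] {(byte) 0xC0, (byte) 0x00}));  // malformed UTF-8
    }
}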
 

Arne Vajhøj

Good one. So how about we define "text file" as a sequence of binary
data that:
1. Contains no metadata;
2. The whole content of which can be transformed into character data
using a single character encoding scheme;
3. Adheres to some convention as to what constitutes a line or separates two
lines.

This raises, as far as I can see, two questions:
a) How do we define "character"?
b) What about BOMs? Aren't they metadata? Wouldn't they violate
constraint number two? Does this mean we must exclude from being
candidates for text file encodings such character encodings as make
provision for variable endianness?

re a)

Unicode codepoints or Java char.

re b)

The BOM is not metadata about how to read the bytes.

It is metadata for the app on how to convert the bytes to
chars (codepoints).

The file content does not say ISO-8859-1 or UTF-8.

The app logic/programming language/IO libraries provide
that.

The BOM is part of the line content. Seen from the file/record
format perspective it is data, not metadata.

Only at the higher level does it become metadata.
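
A small sketch of that division of labour (assuming a UTF-8 file named on the command line): the stock decoder hands the BOM to the application as ordinary content, U+FEFF at the start of the first line, and it is the application that decides to treat it as metadata and drop it.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomAsContent {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(
                Path.of(args[0]), StandardCharsets.UTF_8)) {
            String first = in.readLine();
            // The UTF-8 decoder delivers the BOM as a normal character.
            if (first != null && !first.isEmpty() && first.charAt(0) == '\uFEFF') {
                first = first.substring(1);  // the app chooses to drop the BOM
            }
            System.out.println(first);
        }
    }
}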

Arne
 

Lew

Wojtek said:
Emails (newsgroup postings) are also 7-bit ASCII and are pure text yet the

That is quite the outré claim, and completely false. This newsgroup posting
is certainly not in 7-bit ASCII. It has a soupçon of 8-bit characters. I
don't think I've ever seen a 7-bit ASCII post from Arne Vajhøj, but he sure
has a lot of posts in this newsgroup.

As for emails, I embed JPGs in email all the time. Is that 7-bit ASCII? Or
even pure text?
 

Arved Sandstrom

On 11-02-25 11:15 PM, Peter Duniho wrote:
[ SNIP ]
To me, the real question is: if we do define "text file" in a rigorous
way, how in the world does that help any of us write better Java
programs? What's the point?

Pete

The usefulness of the term "text file" for me is that it describes a
file that can be opened, viewed and used by every application, tool and
utility that purports to be a "text editor", on every OS and platform.
The absolute baseline of "text editor" programs can deal with a text
file. Tools like cat, grep, awk, more etc and their equivalents on
non-UNIX/Linux systems can handle them with ease. vi and Notepad and
gedit and pico and every other self-proclaimed "text editor" faithfully
show all the content.

To me the term "text file" indicates that I have the widest possible
options on any system to view it, and to process it as _generic_ text.
Clearly I may not be able to *process* any given text file in the manner
that it's _intended to be processed_, but I can certainly *view* it
without a specific application.

It's with this meaning that I think it's a useful term. But I don't
think any of this helps us write better Java programs, no. :)

AHS
 

Martin Gregorie

Not at all.


By that definition the concept of "record-based" vs. "not-record-based"
becomes completely meaningless.
It is pretty much meaningless unless you're referring to the way a
program handles data. Consider a file containing nothing but printable
characters:

- if a C or Java program reads the file byte by byte or parses it
by reading words separated by whitespace then line delimiters are
utterly meaningless and the program doesn't care whether the file
contains records or not.

- OTOH if a different program reads the same file a line at a time, e.g.
C using fgets(), Java using BufferedReader.readLine(), then this is
pure record-level access.
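
A minimal sketch of those two views of the same file (the path argument is hypothetical): read it once as raw bytes, where delimiters are just more data, and once a line at a time, which is record-level access.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TwoViews {
    public static void main(String[] args) throws IOException {
        // Byte-oriented view: line delimiters are just more bytes.
        long bytes = 0;
        try (InputStream in = new FileInputStream(args[0])) {
            while (in.read() != -1) bytes++;
        }
        // Record-oriented view: each readLine() call is one record.
        long records = 0;
        try (BufferedReader in = Files.newBufferedReader(Path.of(args[0]))) {
            while (in.readLine() != null) records++;
        }
        System.out.println(bytes + " bytes, " + records + " records");
    }
}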
But most of us use "records" to mean a structure that involves out-of-
band boundaries of some sort.
Not necessarily. A CSV file is generally treated as containing a fixed
number of variable length fields with the last field terminated by a
newline. In this case, both commas and newlines are out-of-band (and so
are some quote marks if the implementation allows fields to contain
commas).

However, fixed length records made up of fixed length fields contain no
out-of-band structure. You want an example? How about the two magnetic
stripe tracks on a credit card - 40 bytes and containing fields whose
content and meaning are defined by their position.
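
A sketch of position-defined fields, using an invented 40-byte layout (the offsets are assumptions, not the real track format): there is nothing in-band to search for, you just slice by position.

public class FixedFields {
    public static void main(String[] args) {
        // Hypothetical fixed-length record: positions alone define the fields.
        // Assumed layout: account 0-15, name 16-35, expiry 36-39.
        String record = String.format("%-16s%-20s%4s", "4000123456789010", "DOE/JOHN", "2605");
        String account = record.substring(0, 16).trim();
        String name    = record.substring(16, 36).trim();
        String expiry  = record.substring(36, 40);
        System.out.println(account + " | " + name + " | " + expiry);
    }
}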
That's text plus file metadata.
Indeed it is. Technically it is made up of fixed length fields with no
delimiters. Apart from the record description that forms part of every
file and the member separators the only metadata is similar to a UNIX
directory entry plus the i-node. OS/400 and Z/OS text files are closer to
a tar or zip file than what a Unix or Windows user considers to be a text
file because you can store many separate chunks of text in a single text
file.
What makes it not *quite* a legitimate text file is that the file's
actual content contains a line break that is distinct from 0x0A, 0x0D,
No it doesn't. The editor won't let you put newlines into an OS/400 text
file - it automatically starts another text line record and assigns a
line number to it.
Database rows need an ID field so there's something you can uniquely key
on, and you said the system stores text in database rows, so there's
your explanation. The thing that makes no sense is it storing text in
database rows instead of as native text.
Nice guess, but that's not how it works. That role is taken by the line
number (which can be a decimal value - when you add lines between lines
0002 and 0003 they'll be numbered 0002.01, 0002.02 etc. until you ask the
editor to renumber the member). Unlike Unix and Windows systems, the line
numbers in compilation errors aren't screwed up by editing the source.

The ID is a complete mystery - most people and programs don't use it and
IIRC it's not accessible via the editor so you can't change it, though it
may be possible to ask the editor to maintain it.
Actually C is already broken here even on "normal" systems, because C
strings can't properly represent text containing NUL characters.
By definition they can't be included in 'text files' - they can be
handled perfectly well in files via the read() and write() functions.
Nope; see above. If everything you've told me is accurate then it is
possible to write an OS/400 "text file" that encodes some information
that will be destroyed in a copy made by simply reading it character by
character through a java.io.Reader and outputting it character by
character, unaltered, through a java.io.Writer.
Incorrect assumption because you can't put non-printable characters in an
OS/400 source file member - the editor and other programs won't let you.

The OS/400 is a database machine. There are no files that aren't
databases. Every file has defining metadata which is automatically
generated for standard file types, e.g. source files and compiled
binaries. The field types control what byte values can appear in every
field, so you might limit a text field to upper case. Violating these
rules generally causes an exception which, of course, can be caught and
acted on.
 

Martin Gregorie

What happens if you send one of these alleged text files from a
mainframe to a printer or a shell?
You'd need to convert it before it could be handled, just like you use
unix2dos and dos2unix to convert newlines when you move files between
*nixen and DOS/Windows, though the translation is more thorough since
you'd be converting the file to or from the EBCDIC character encoding. As you
might expect, EBCDIC, like ASCII and Unicode, has a similar collection of
format effectors as well as field and record separators, though they
occupy 0x00 to 0x3f rather than the ASCII/Unicode 0x00 to 0x1f.

Conversion to/from EBCDIC can only be done with a lookup table because it
is a bit of a mess - A-Z and a-z are not contiguous (their encoding is
related to the way that letters were encoded on punched cards with gaps
between A-I, J-R, S-Z) and the numbers are 0xf0 - 0xf9.
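
For a taste of that, a sketch using the JDK's table-driven converter; the IBM037 charset ships as an extended charset in most full JDKs (the jdk.charsets module), so its availability here is an assumption:

import java.nio.charset.Charset;

public class EbcdicRoundTrip {
    public static void main(String[] args) {
        // IBM037 is an EBCDIC code page; availability is assumed, not guaranteed.
        Charset ebcdic = Charset.forName("IBM037");
        byte[] onMainframe = "HELLO, WORLD 123".getBytes(ebcdic);
        // Note the non-contiguous letters and the 0xF0-0xF9 digits in the dump.
        for (byte b : onMainframe) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
        System.out.println(new String(onMainframe, ebcdic));
    }
}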
 

Arved Sandstrom

Then I think you need to define "text file" more narrowly than what is
actually out there. In this thread alone, there have been mentioned a
number of true text file formats that are simply not readable in your
average or even above-average text editor found on mainstream OSs.

I'm a bit hard-nosed about this one I guess. A text file for me is a
stream of variable-length lines with a reasonably common line separator.

To take VMS for example (and I've used VMS off and on since about 1980),
of the 4 record formats in VMS RMS - fixed-length records, variable-length
records with a count byte per record, variable-length records with a
fixed-length control block, and stream (variable-length records with line
delimiters) - my personal view is that only the stream format with a
common delimiter choice (LF or CR) qualifies as a "text file".

You may also have gathered that I am referring to plain text, let's say
Unicode these days, as opposed to any type of formatted text.
[...]
It's with this meaning that I think it's a useful term. But I don't
think any of this helps us write better Java programs, no. :)

Okay. Just checking. :)

Pete

AHS
 

Tom Anderson

Then I think you need to define "text file" more narrowly than what is
actually out there. In this thread alone, there have been mentioned a
number of true text file formats that are simply not readable in your
average or even above-average text editor found on mainstream OSs.

Either (a) according to Arved's definition, which is highly appealing to
me, they are not true text file formats, or (b) they *are* readable with
the standard text editors *on the OSs on which they are found*, in which
case, perhaps they are.

Perhaps talking about plain text is a bit like talking about plain
speaking. I might deny that saying "alea iacta est" is plain speaking, and
i would be correct where i'm standing, because i'm standing in an
English-speaking country, but in the Vatican or ancient Rome, it would
have been perfectly plain. The term 'plain' need not mean the same thing
in all times and places.

Nonetheless, in general discourse on a newsgroup like this, there is a
presumption that we're standing in the lands of the tribe of Ken Thompson,
which has come to occupy the greater part of the world, and that plain
text means ASCII or one of its successors, with lines terminated by CR
and/or LF, and no funny business. This is not a universal truth, but it is
a truth where we are right now.

tom
 

Arved Sandstrom

Not necessarily. A CSV file is generally treated as containing a fixed
number of variable length fields with the last field terminated by a
newline. In this case, both commas and newlines are out-of-band (and so
are some quote marks if the implementation allows fields to contain
commas).

However, fixed length records made up of fixed length fields contain no
out-of-band structure. You want an example? How about the two magnetic
stripe tracks on a credit card - 40 bytes and containing fields whose
content and meaning are defined by their position.
[ SNIP ]

I think the point is that every file scheme has out-of-band structure,
explicit or implied. There's the structure information and there's the
data. At one end of the spectrum you've got files that are almost fully
self-describing (a category which itself subsumes everything from
fixed-length record/fixed-length field files to stream-oriented plain
text files whose variable-length records are delimited by EOL markers);
at the other end, files that are totally non-self-describing. In every case, including
that of the magnetic stripes example, there is out-of-band structure;
it's just that it may be implied.

So if the records don't contain that structure, something else (either
the file or the processor) does.

AHS
 

Martin Gregorie

Not necessarily. A CSV file is generally treated as containing a fixed
number of variable length fields with the last field terminated by a
newline. In this case, both commas and newlines are out-of-band (and so
are some quote marks if the implementation allows fields to contain
commas).

However, fixed length records made up of fixed length fields contain no
out-of-band structure. You want an example? How about the two magnetic
stripe tracks on a credit card - 40 bytes and containing fields whose
content and meaning are defined by their position.
[ SNIP ]

I think the point is that every file scheme has out-of-band structure,
explicit or implied. There's the structure information and there's the
data. At one end of the spectrum you've got files that are almost fully
self-describing (a category which itself subsumes everything from
fixed-length record/fixed-length field files to stream-oriented plain
text files whose variable-length records are delimited by EOL markers);
at the other end, files that are totally non-self-describing. In every case, including
that of the magnetic stripes example, there is out-of-band structure;
it's just that it may be implied.

So if the records don't contain that structure, something else (either
the file or the processor) does.
Well put.

KW seemed to be saying that a record had to have structure, which is
true, and that this had to be in the form of metadata included within the
record, which isn't true. I was attempting to say that, though not as
succinctly as yourself.
 

Daniel Pitts

Yes, but it's tricky. You need a random-access file and seek backwards
to a newline.
You can do something a little better than seeking backwards. You can
make some guesses about line length. If it is a typical text file, you
can guess that the length of that line is < 1024 (for instance). Seek to
that location before the end of the file and then perform the typical
"tail" operation.

If you don't find the EOL as expected, you would then do the same thing,
but start further back.

Also, beware of the special case that the final line may or may not end
with an EOL. Many text files have an end of line before end of file, but
not always. So really you want to match "(start-of-file or EOL)(final
line)(optional EOL)(EOF)".
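
Putting that together, a minimal sketch (assuming a single-byte encoding such as ASCII/ISO-8859-1 and LF or CRLF line ends, so byte offsets line up with characters); the initial guess is doubled whenever no EOL turns up in the window, and the stripping loop handles the optional trailing EOL mentioned above.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class LastLine {
    // Read the last line without scanning the whole file.
    static String lastLine(String path, int guess) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long len = raf.length();
            if (len == 0) return "";
            long chunk = guess;
            while (true) {
                long start = Math.max(0, len - chunk);
                raf.seek(start);
                byte[] buf = new byte[(int) (len - start)];
                raf.readFully(buf);
                String text = new String(buf, StandardCharsets.ISO_8859_1);
                // Ignore an optional trailing EOL on the final line.
                int end = text.length();
                while (end > 0 && (text.charAt(end - 1) == '\n' || text.charAt(end - 1) == '\r')) {
                    end--;
                }
                int nl = text.lastIndexOf('\n', end - 1);
                if (nl >= 0 || start == 0) {
                    return text.substring(nl + 1, end);
                }
                chunk *= 2;  // no EOL in the window: back up further and retry
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lastLine(args[0], 1024));  // hypothetical path argument
    }
}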
 

Arne Vajhøj

Except, of course, that you don't read "a counted approach" as lines of
text; you read it as binary integers mixed with text strings. It is not,
physically, a text file.

It is a text file and can be used as a text file.
Yes. A line break in the middle of a line is utter nonsense, the logical
equivalent of an odd even number or an endpoint of a circle or a corner
of a disc.

It is actually very logical that something that is not considered
a line break in that record format can be in the middle of a line.
[ASCII 10] is perfectly valid as content in the middle of a line on
older MacOS systems

Sophistry. Those just use ASCII 13 to mean the same thing.

Yes.

And in a counted prefix format both LF and CR are valid in lines.
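
A sketch of reading such a count-prefixed format (the 2-byte big-endian length followed by that many bytes of text is an assumption here, not any particular system's layout): since nothing scans for LF or CR, both are legal inside a line.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CountedRecords {
    public static void main(String[] args) throws IOException {
        // Hypothetical format: 2-byte length prefix, then that many bytes of text.
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            while (true) {
                int len;
                try { len = in.readUnsignedShort(); } catch (EOFException eof) { break; }
                byte[] rec = new byte[len];
                in.readFully(rec);
                System.out.println(new String(rec, StandardCharsets.ISO_8859_1));
            }
        }
    }
}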
No, a text file is a single string.

No. That is why it is called a file of lines.
I said "the formats commonly used to store, e.g., C source files". No
"count prefix line format" is *commonly* used to store C source files --
99.99% or more of C source files residing on hard disks in this world are
undoubtedly in fact LF-delimited, and most of the rest CRLF delimited
(Windows wackiness strikes again).

Your assumption that the amount of C code on non-*nix platforms is
less than 0.01% is rather amusing.

How can anyone be so out of touch with the real world?

Arne
 

Arne Vajhøj

Then I think you need to define "text file" more narrowly than what is
actually out there. In this thread alone, there have been mentioned a
number of true text file formats that are simply not readable in your
average or even above-average text editor found on mainstream OSs.

They are edited fine by any text editor on those systems.

This includes cross platform editors that are also available
on *nix and Windows.

If the files are FTP'ed to a Unix box in text mode they can be
edited with any Unix text editor.

If their location is mounted as a Samba drive, then they can be
edited from Windows with any Windows text editor.

For obvious reasons notepad.exe can not be run on the systems.

Arne
 

Arne Vajhøj

Either (a) according to Arved's definition, which is highly appealing to
me, they are not true text file formats, or (b) they *are* readable with
the standard text editors *on the OSs on which they are found*, in which
case, perhaps they are.

If Java, C, Fortran etc. read them as text files, then it seems weird
not to consider them text files.
Nonetheless, in general discourse on a newsgroup like this, there is a
presumption that we're standing in the lands of the tribe of Ken
Thompson, which has come to occupy the greater part of the world, and
that plain text means ASCII or one of its successors, with lines
terminated by CR and/or LF, and no funny business. This is not a
universal truth, but it is a truth where we are right now.

I really don't see any reason to redefine the concept of
lines due to many people having very limited experience
with OSes other than Windows or Unix.

Java has not been standardized and made as well defined as it is
just to end up with "if it happens to work on *nix and Windows then it
is portable".

Arne
 

Arne Vajhøj

# Since those days, the world has standardized on ASCII flat files for text files.

# Windows text files are flat ASCII files (with CRLF line ends). Mac text
# files are flat ASCII files (with CR line ends). Unix text files are flat
# ASCII files (with LF line ends).

I don't see an argument of any kind in your post. Forget to include one?

The quotation answers your question to Sosman.

You said that.

Arne
 
