File-Reading Best Practices?

Andreas Wenzke · Apr 3, 2010

I want to parse an XML file manually (but my question would be the same
for any other file format):
What are best-practice guidelines for doing that?

I currently use a char buffer in conjunction with istream::read and then
walk through the buffer step by step.
However, problems will arise when tags span across the buffer, i.e. when
the buffer contains "<h" at the end and the next characters to be read
from the stream are "tml>".
I'm considering using memmove, but I just think there has to be a better
option.

As this is for a university project, I'm not allowed to use the STL
(std::string and so on).

Stefan Ram · Apr 3, 2010

Andreas Wenzke said:
I want to parse an XML file manually (but my question would be the same
for any other file format):
What are best-practice guidelines for doing that?
I currently use a char buffer in conjunction with istream::read and then
walk through the buffer step by step.

You seem to think about implementations ("char buffer") early.
I prefer to think about interfaces (.getNextSymbol()) early.

A char is a byte, while XML files are composed of Unicode
characters (code points). If you read them as chars, you
will first have to decode them, so you should at least
implement a UTF-8 reader.

However, problems will arise when tags span across the buffer, i.e. when
the buffer contains "<h" at the end and the next characters to be read
from the stream are "tml>".
I'm considering using memmove, but I just think there has to be a better
option.

Again, it seems strange to me, to mention parsing and then
mention memmove, too low-level thinking. You are thinking
about low-level implementation details too early. They should
be hidden behind interfaces, so that they can be changed
later.

As this is for a university project, I'm not allowed to use the STL
(std::string and so on).

This newsgroup is about using C++, and when you are not
allowed to use ::std::string and so on, you are not allowed
to use C++, so you are in the wrong newsgroup. In C++, also,
there is nothing that is being called »STL« by
ISO/IEC 14882:2003(E), so you possibly are being taught
out-dated terms. Maybe that university also is too low-level.

Carlo Milanesi · Apr 3, 2010

Andreas said:
I want to parse an XML file manually (but my question would be the same
for any other file format):
What are best-practice guidelines for doing that?

I currently use a char buffer in conjunction with istream::read and then
walk through the buffer step by step.
However, problems will arise when tags span across the buffer, i.e. when
the buffer contains "<h" at the end and the next characters to be read
from the stream are "tml>".
I'm considering using memmove, but I just think there has to be a better
option.

As this is for a university project, I'm not allowed to use the STL
(std::string and so on).

Why do universities prohibit the STL?

I think the simplest way to read a file is by using a memory-mapped
file. Such files are not standard, though. Does your university allow them?
Here you can find a useful library:
http://en.wikibooks.org/wiki/Optimi...on_techniques/Input/Output#Memory-mapped_file
You may use its class InputMemoryFile to read a file that can fit into
your address space.

Andreas Wenzke · Apr 3, 2010

Christian said:
What are you allowed to use at all, then?

"STL" is not a synonym for "standard library". In particular,
std::string is considered a different part of the library than the
container/algorithm part. If your lecturer does not allow you to use the
entire standard library except for the C part, then of course streams
cannot be used, either.

Sorry said:
Anyway, I think that with such course requirements best-practice
guidelines for file reading in C++ simply cannot be met. (I originally
learned C++ that way, too, and later had to unlearn much of what had
been taught to us. It bothers me that C++ is still treated this way at
universities.)

STL will be taught in detail, though not in this class where the
lecturer wants us to understand the implementation first.

Andreas Wenzke · Apr 3, 2010

Stefan said:
You seem to think about implementations ("char buffer") early.
I prefer to think about interfaces (.getNextSymbol()) early.

Care to elaborate a little on this?

A char is a byte, while XML files are composed of Unicode
characters (code points). If you read them as chars, you
will first have to decode them, so you should at least
implement a UTF-8 reader.

The file-reading part is only a very small part of the whole project.
Implementing UTF-8 parsing isn't likely to have any benefits for my
program (strings will be stored "as is" anyway) and probably isn't going
to earn me many bonus points. However, it would probably make things
more complicated as I'd have to distinguish between ANSI and Unicode chars.

Again, it seems strange to me, to mention parsing and then
mention memmove, too low-level thinking. You are thinking
about low-level implementation details too early. They should
be hidden behind interfaces, so that they can be changed
later.

I understand your objection, and I don't really know how to implement
that for my current task.

This newsgroup is about using C++, and when you are not
allowed to use ::std::string and so on, you are not allowed
to use C++, so you are in the wrong newsgroup. In C++, also,
there is nothing that is being called »STL« by
ISO/IEC 14882:2003(E), so you possibly are being taught
out-dated terms. Maybe that university also is too low-level.

<iostream> and C libraries like <string.h> are allowed.
Other "STL" classes like std::string, std::vector will be allowed in
follow-up classes.

Also, I am of course allowed to implement my own string class etc.

Andreas Wenzke · Apr 3, 2010

Carlo said:
Why do universities prohibit the STL?

Because they want the students to understand the implementation details
first.
The STL will be allowed in follow-up classes.

I think the simplest way to read a file is by using a memory-mapped
file. Such files are not standard, though. Does your university allow them?

If they're not standard, probably not.

Here you can find a useful library:

Third-party libraries aren't allowed...

1jam · Apr 3, 2010

Stefan said:
This newsgroup is about using C++, and when you are not
allowed to use ::std::string and so on, you are not allowed
to use C++, so you are in the wrong newsgroup.

Not true; in embedded C++ development the STL is still usually shunned. Plus, C++
was used for decades before STL implementations finally matured and became
widely used.

Stefan Ram · Apr 3, 2010

Andreas Wenzke said:
Care to elaborate a little on this?

I separate the code into sub-units.

To parse an XML file, the obvious sub-units would be: a
characters source (a source for the Unicode code points),
then, a scanner (lexical analyzer) then, a parser (syntactical
analyzer). But you also need to know whether you want to
create a DOM (document object model) parser or calls to
client functions (like a SAX parser) or something else.

Anyway, between those units, there are interfaces.
Interfaces are also known as APIs and similar to abstract
datatypes, they are sets of documented calls. So I start by
writing them.

Only then, I will start to write implementations of these
calls.
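
As a rough illustration of "interfaces first", the sub-units named above might be declared like this before any implementation exists. All class and member names other than getNextSymbol are hypothetical, and MemoryCharSource is only a stand-in implementation for trying the interface out.

```cpp
// Hypothetical interfaces; the names are illustrative only.

// Character source: hands out one symbol (code point) at a time.
class CharSource {
public:
    virtual ~CharSource() {}
    // Returns the next symbol, or -1 when the input is exhausted.
    virtual long getNextSymbol() = 0;
};

// Scanner (lexical analyzer): groups symbols into tokens.
enum TokenKind { TOKEN_START_TAG, TOKEN_END_TAG, TOKEN_TEXT, TOKEN_END_OF_INPUT };

class Scanner {
public:
    virtual ~Scanner() {}
    virtual TokenKind getNextToken() = 0;
    virtual const char* tokenText() const = 0;  // text of the last token
};

// Client callbacks for a SAX-like parser (syntactical analyzer).
class ElementHandler {
public:
    virtual ~ElementHandler() {}
    virtual void startElement(const char* name) = 0;
    virtual void endElement(const char* name) = 0;
    virtual void characters(const char* text) = 0;
};

// A trivial source over a NUL-terminated byte string, e.g. for unit tests.
// It treats each byte as one symbol, so it is only correct for ASCII input.
class MemoryCharSource : public CharSource {
public:
    explicit MemoryCharSource(const char* text) : p_(text) {}
    virtual long getNextSymbol()
    {
        if (*p_ == '\0') return -1;
        return static_cast<unsigned char>(*p_++);
    }
private:
    const char* p_;
};
```

With this split, a file-backed or UTF-8-decoding source can later replace MemoryCharSource without touching the scanner or parser.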

Some German language notes about software design by me:

http://www.purl.org/stefan_ram/pub/aufbau_grosser_programme

The file-reading part is only a very small part of the whole project.
Implementing UTF-8 parsing isn't likely to have any benefits for my
program (strings will be stored "as is" anyway) and probably isn't going
to earn me many bonus points. However, it would probably make things
more complicated as I'd have to distinguish between ANSI and Unicode chars.

The XML specification says:

»All XML processors MUST accept the UTF-8 and UTF-16
encodings of Unicode [Unicode]« (uppercase emphasis
was done by the W3C, not by me [Stefan Ram])

http://www.w3.org/TR/REC-xml/

(ISO-8859-1 processing, on the other hand is not required.)

Reading the XML specification and then writing a correct
implementation is a huge project. Now, you tell me this is
only a very small part of the whole project. You are to use C++,
but then are not allowed to use C++, you are to read XML,
but then are not required to read XML as it's specified.

Such an attitude of doing a huge project in such a messy way
(calling »C++« what is not C++, calling »XML« what is not XML)
seems to be highly inappropriate for a scientific university.
It even would be inappropriate for any other teaching situation,
like, say, a »university of applied science« (»Fachhochschule«).

Let me end this post by a quote from Rob Walling:

»I've known smart developers who don't pay attention to detail.
The result is misspelled database columns, uncommented code,
projects that aren't checked into source control,
software that's not unit tested, unimplemented features,
and so on. All of these can be easily dealt with if
you're building a Google mash-up or a five page website.
But in corporate development each of these screw-ups is
a death knell.

So I'll say it very loud, but I promise I'll only say it once:

I have /never, ever, ever/ seen a great software
developer who does not have amazing attention to detail.«

James Kanze · Apr 3, 2010

I want to parse an XML file manually (but my question would be
the same for any other file format):
What are best-practice guidelines for doing that?

I currently use a char buffer in conjunction with
istream::read and then walk through the buffer step by step.
However, problems will arise when tags span across the buffer,
i.e. when the buffer contains "<h" at the end and the next
characters to be read from the stream are "tml>". I'm
considering using memmove, but I just think there has to be a
better option.

As this is for a university project, I'm not allowed to use
the STL (std::string and so on).

The most obvious solution is to ensure that the buffer never
does end in the middle of a token. Say by using getline to read
it. This has the additional advantage of making it trivial to
output the line number in error messages. In the case of real
XML, it's probably not a good idea, since WWW requires
recognizing several different line ending conventions (although
it wouldn't be that difficult to write a custom getline which
recognized them all), but I doubt that that's relevant for a
school project (at least at a level where you aren't allowed to
use the STL).

Another solution is to read character by character, using a
state machine to determine where the token ends, and put each
character into your final buffer. In this way, you never have
more than one token in the buffer, and the buffering in filebuf
least at a level where you aren't allowed to use the STL).

Another solution is to read character by character, using a
state machine to determine where the token ends, and put each
character into your final buffer. In this way, you never have
more than one token in the buffer, and the buffering in filebuf
takes care of the least at a level where you aren't allowed to
use the STL).

Another solution is to read character by character, using a
state machine to determine where the token ends, and put each
character into your final buffer. In this way, you never have
more than one token in the buffer, and filebuf takes care of the
actual IO buffering.

Andreas Wenzke · Apr 4, 2010

Stefan said:
To parse an XML file, the obvious sub-units would be: a
characters source (a source for the Unicode code points),
then, a scanner (lexical analyzer) then, a parser (syntactical
analyzer). But you also need to know whether you want to
create a DOM (document object model) parser or calls to
client functions (like a SAX parser) or something else.

As I only want to parse one certain format, I think this isn't necessary.
Usually, a specific expected token has to be read, otherwise a parsing
error would occur.

Anyway, between those units, there are interfaces.
Interfaces are also known as APIs and similar to abstract
datatypes, they are sets of documented calls. So I start by
writing them.

I have several years of programming experience in C#, so I'm generally
used to developing against interfaces.

But one thing is that I lack experience in C++ and the other is that I
want to get this XML parser done as quickly as possible, so I can
concentrate on the actual project task.

Some German language notes about software design by me:

http://www.purl.org/stefan_ram/pub/aufbau_grosser_programme

"You ain't gonna need it"

I generally understand your objection, and in this case I just want to
get this (pseudo) parser done.

The XML specification says:

»All XML processors MUST accept the UTF-8 and UTF-16
encodings of Unicode [Unicode]« (uppercase emphasis
was done by the W3C, not by me [Stefan Ram])

Actually, I don't think this is an emphasis, but rather the normal RFC
way of pointing out that "MUST", "MAY" etc. are to be interpreted as
keywords (see also RFC 2119).

But that aside, I do accept those encodings, I just don't decode them.

Such an attitude of doing a huge project in such a messy way
(calling »C++« what is not C++, calling »XML« what is not XML)
seems to be highly inappropriate for a scientific university.
It even would be inappropriate for any other teaching situation,
like, say, a »university of applied science« (»Fachhochschule«).

You have to start /somewhere/. You can't just put everything into a
three-hours-per-week class.

The lecturer is very good (and believe me, I have seen bad classes like
someone teaching a C# "beginner's class" where she would teach "design
patterns" without even explaining what polymorphism or interfaces are),
and whilst I don't think using XML as the input format was quite
necessary, he does a good job.

Alf P. Steinbach · Apr 4, 2010

* Andreas Wenzke:

I want to parse an XML file manually (but my question would be the same
for any other file format):
What are best-practice guidelines for doing that?

I currently use a char buffer in conjunction with istream::read and then
walk through the buffer step by step.
However, problems will arise when tags span across the buffer, i.e. when
the buffer contains "<h" at the end and the next characters to be read
from the stream are "tml>".
I'm considering using memmove, but I just think there has to be a better
option.

As this is for a university project, I'm not allowed to use the STL
(std::string and so on).

The abstraction you're looking for seems to be "get next character".

This is provided by the C standard library. <g>

Build your lexer on top of that and your parser on top of the lexer.

Cheers & hth.,

- Alf

Andreas Wenzke · Apr 4, 2010

James said:
The most obvious solution is to ensure that the buffer never
does end in the middle of a token. Say by using getline to read
it.

<foo
attr="value"
/>

is valid XML, as far as I know.

Another solution is to read character by character, using a
state machine to determine where the token ends, and put each
character into your final buffer. In this way, you never have
more than one token in the buffer, and the buffering in filebuf
least at a level where you aren't allowed to use the STL).

Another solution is to read character by character, using a
state machine to determine where the token ends, and put each
character into your final buffer. In this way, you never have
more than one token in the buffer, and the buffering in filebuf
takes care of the least at a level where you aren't allowed to
use the STL).

Another solution is to read character by character, using a
state machine to determine where the token ends, and put each
character into your final buffer. In this way, you never have
more than one token in the buffer, and filebuf takes care of the
actual IO buffering.

Am I mistaken or is this three times the same suggestion?

I initially wanted to implement a finite-state machine (using an enum
for the states), but soon realized there essentially always is a fixed
order:

1. Try to read a BOM
2. Try to read an XML declaration
3. Ignore any whitespace
4. Read the root element
5. Read the first child element
....

So what I have so far are several SkipXXX methods (SkipBOM,
SkipWhitespace) and so on, each of which advances the char pointer in
the buffer.
As soon as it's tried to move to/past the end of the buffer, the
buffer's contents are memmove'd to the beginning of the buffer and the
remainder is refilled with data from the stream.

What do you think of that approach?

Stefan Ram · Apr 4, 2010

Andreas Wenzke said:
You have to start /somewhere/.

Being allowed to use all of C++ and to use third-party libraries,
it is easier to get something done than when this is forbidden.
It is more difficult to have to implement many things on a low level
oneself than to use given and tested implementations.

Therefore, it would seem more natural to me, to be allowed to use
all of C++ and to use third-party libraries /in the first semester/,
and then to do the more difficult part of implementing »everything«
oneself /in the second semester/, because it seems natural to me to
start with easier tasks and then proceed toward more difficult
tasks.

Andreas Wenzke · Apr 4, 2010

Stefan said:
Being allowed to use all of C++ and to use third-party libraries,
it is easier to get something done than when this is forbidden.

In fact, I think we are allowed to use an XML-parsing library, since the
lecturer thought that integrating it well would be at least as hard as
writing the parser by oneself.
I nevertheless decided against that as I think I will get more credits
for writing my own (albeit imperfect) parser.

Jonathan Lee · Apr 4, 2010

I currently use a char buffer in conjunction with istream::read and then
walk through the buffer step by step.
However, problems will arise when tags span across the buffer, i.e. when
the buffer contains "<h" at the end and the next characters to be read
from the stream are "tml>".
I'm considering using memmove, but I just think there has to be a better
option.

You could write an LL(k) parser using the EBNF grammar provided by the
XML specification. I think I read somewhere that XML is LL(1), so you
could get by reading a char at a time.

--Jonathan

Alf P. Steinbach · Apr 4, 2010

* Pete Becker:

or maybe eight, I haven't checked the grammar:

< foo attr = " value " />

How about seven?

< foo attr = "value" / >

Well I haven't checked the grammar either. ;-)

Cheers,

- Alf

Andrew Poelstra · Apr 4, 2010

STL will be taught in detail, though not in this class where the
lecturer wants us to understand the implementation first.

So he's teaching C++ as a "shitty C"? I hope for others'
sake that this lecturer is alone in being such an idiot.

You need to write a lexer whose job is to read entire tokens.
You can do that by reading individual characters into a token
class. The lexer's job is to make sure that an "<" and an ">"
and a "html" are all distinct, complete entities.

The actual XML-parsing work will then be done on those tokens.

Jonathan Lee · Apr 4, 2010

So he's teaching C++ as a "shitty C"? I hope for others'
sake that this lecturer is alone in being such an idiot.

I think he means the teacher is, say, asking students to implement
a sort algorithm before coming to rely on std::sort().

Personally, I don't think that's idiotic at all.

--Jonathan

Stefan Ram · Apr 4, 2010

Jonathan Lee said:
I think he means the teacher is, say, asking students to implement
a sort algorithm before coming to rely on std::sort().

Once one has used ::std::sort(), one's mind will be
deformed and be less able to implement a sort algorithm?

You do not have to make up examples (»... is, say, ...«)
for what the teacher asks, because it was already given in
a more specific way in this thread. It is a project of which an XML
parser is only a small part. This is quite different
from studying a single algorithm in isolation.

Jonathan Lee · Apr 4, 2010

Once one has used ::std::sort(), one's mind will be
deformed and be less able to implement a sort algorithm?

Er... no.

You do not have to make up examples (»... is, say, ...«)
for what the teacher asks,

Is there some mandate that obliges me to stay within
the exact context?

--Jonathan
