File-Reading Best Practices?

Discussion in 'C++' started by Andreas Wenzke, Apr 3, 2010.

  1. I want to parse an XML file manually (but my question would be the same
    for any other file format):
    What are best-practice guidelines for doing that?

    I currently use a char buffer in conjunction with istream::read and then
    walk through the buffer step by step.
    However, problems will arise when tags span across the buffer, i.e. when
    the buffer contains "<h" at the end and the next characters to be read
    from the stream are "tml>".
    I'm considering using memmove, but I just think there has to be a better
    option.

    As this is for a university project, I'm not allowed to use the STL
    (std::string and so on).
    Andreas Wenzke, Apr 3, 2010
    #1

  2. Stefan Ram Guest

    Andreas Wenzke <> writes:
    >I want to parse an XML file manually (but my question would be the same
    >for any other file format):
    >What are best-practice guidelines for doing that?
    >I currently use a char buffer in conjunction with istream::read and then
    >walk through the buffer step by step.


    You seem to think about implementations ("char buffer") early.
    I prefer to think about interfaces (.getNextSymbol()) early.

    A char is a byte, while XML files are composed of Unicode
    characters (code points). If you read them as chars, you
    will first have to decode them, so you should at least
    implement a UTF-8 reader.
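
To make the decoding step concrete, here is a minimal sketch of a UTF-8 decoder that works on raw bytes and needs no std::string. The name decode_utf8 and its signature are made up for this illustration. It only checks lead and continuation bytes; it does not reject overlong encodings or surrogate code points, which a conforming XML processor would also have to do.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: decodes one UTF-8 sequence starting at s, where n
// bytes are available. Stores the code point in *cp and returns the number
// of bytes consumed, or 0 if the input is malformed or truncated.
std::size_t decode_utf8(const unsigned char* s, std::size_t n, unsigned long* cp)
{
    assert(cp != 0);
    if (n == 0) return 0;

    const unsigned char b0 = s[0];
    std::size_t len;
    unsigned long value;

    // The lead byte tells us how long the sequence is.
    if (b0 < 0x80)                { len = 1; value = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; value = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; }
    else return 0;  // stray continuation byte or invalid lead byte

    if (n < len) return 0;  // sequence is cut off, caller must read more

    // Every following byte must look like 10xxxxxx and adds six bits.
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}
```

A return value of 0 for a truncated sequence is the same "token spans the buffer" situation as in the original question, only at the byte level.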

    >However, problems will arise when tags span across the buffer, i.e. when
    >the buffer contains "<h" at the end and the next characters to be read
    >from the stream are "tml>".
    >I'm considering using memmove, but I just think there has to be a better
    >option.


    Again, it seems strange to me, to mention parsing and then
    mention memmove, too low-level thinking. You are thinking
    about low-level implementation details too early. They should
    be hidden behind interfaces, so that they can be changed
    later.

    >As this is for a university project, I'm not allowed to use the STL
    >(std::string and so on).


    This newsgroup is about using C++, and when you are not
    allowed to use ::std::string and so on, you are not allowed
    to use C++, so you are in the wrong newsgroup. In C++, also,
    there is nothing that is being called »STL« by
    ISO/IEC 14882:2003(E), so you possibly are being taught
    out-dated terms. Maybe that university also is too low-level.
    Stefan Ram, Apr 3, 2010
    #2

  3. Andreas Wenzke wrote:
    > I want to parse an XML file manually (but my question would be the same
    > for any other file format):
    > What are best-practice guidelines for doing that?
    >
    > I currently use a char buffer in conjunction with istream::read and then
    > walk through the buffer step by step.
    > However, problems will arise when tags span across the buffer, i.e. when
    > the buffer contains "<h" at the end and the next characters to be read
    > from the stream are "tml>".
    > I'm considering using memmove, but I just think there has to be a better
    > option.
    >
    > As this is for a university project, I'm not allowed to use the STL
    > (std::string and so on).


    Why do universities prohibit the STL?

    I think the simplest way to read a file is by using memory-mapped
    files. They are not standard though. Does your university allow them?
    Here you can find a useful library:
    http://en.wikibooks.org/wiki/Optimi...on_techniques/Input/Output#Memory-mapped_file
    You may use its class InputMemoryFile to read a file that can fit into
    your address space.

    --

    Carlo Milanesi
    http://digilander.libero.it/carlmila
    Carlo Milanesi, Apr 3, 2010
    #3
  4. Christian Hackl wrote:
    > What are you allowed to use at all, then?


    <iostream> and C libraries like <string.h>.

    > "STL" is not a synonym for "standard library". In particular,
    > std::string is considered a different part of the library than the
    > container/algorithm part. If your lecturer does not allow you to use the
    > entire standard library except of the C part, then of course streams
    > cannot be used, either.


    Sorry, <iostream> can be used, of course.

    > Anyway, I think that with such course requirements best-practice
    > guidelines for file reading in C++ simply cannot be met. (I originally
    > learned C++ that way, too, and later had to unlearn much of what had
    > been taught to us. It bothers me that C++ is still treated this way at
    > universities.)


    STL will be taught in detail, though not in this class where the
    lecturer wants us to understand the implementation first.
    Andreas Wenzke, Apr 3, 2010
    #4
  5. Stefan Ram wrote:
    > You seem to think about implementations ("char buffer") early.
    > I prefer to think about interfaces (.getNextSymbol()) early.


    Care to elaborate a little on this?

    > A char is a byte, while XML files are composed of Unicode
    > characters (code points). If you read them as chars, you
    > will first have to decode them, so you should at least
    > implement a UTF-8 reader.


    The file-reading part is only a very small part of the whole project.
    Implementing UTF-8 parsing isn't likely to have any benefits for my
    program (strings will be stored "as is" anyway) and probably isn't going
    to earn me many bonus points. However, it would probably make things
    more complicated as I'd have to distinguish between ANSI and Unicode chars.

    > Again, it seems strange to me, to mention parsing and then
    > mention memmove, too low-level thinking. You are thinking
    > about low-level implementation details too early. They should
    > be hidden behind interfaces, so that they can be changed
    > later.


    I understand your objection, and I don't really know how to implement
    that for my current task.

    >> As this is for a university project, I'm not allowed to use the STL
    >> (std::string and so on).

    >
    > This newsgroup is about using C++, and when you are not
    > allowed to use ::std::string and so on, you are not allowed
    > to use C++, so you are in the wrong newsgroup. In C++, also,
    > there is nothing that is being called »STL« by
    > ISO/IEC 14882:2003(E), so you possibly are being taught
    > out-dated terms. Maybe that university also is too low-level.


    <iostream> and C libraries like <string.h> are allowed.
    Other "STL" classes like std::string, std::vector will be allowed in
    follow-up classes.

    Also, I am of course allowed to implement my own string class etc.
    Andreas Wenzke, Apr 3, 2010
    #5
  6. Carlo Milanesi wrote:
    > Why do universities prohibit the STL?


    Because they want the students to understand the implementation details
    first.
    The STL will be allowed in follow-up classes.

    > I think the simplest way to read a file is by using memory-mapped
    > files. They are not standard though. Does your university allow them?


    If they're not standard, probably not.

    > Here you can find a useful library:


    Third-party libraries aren't allowed...
    Andreas Wenzke, Apr 3, 2010
    #6
  7. 1jam Guest

    Stefan Ram wrote:

    >
    >>As this is for a university project, I'm not allowed to use the STL
    >>(std::string and so on).

    >
    > This newsgroup is about using C++, and when you are not
    > allowed to use ::std::string and so on, you are not allowed
    > to use C++, so you are in the wrong newsgroup.


    Not true. In embedded C++ development the STL is still usually shunned.
    Plus, C++ was used for decades before STL implementations finally matured
    and became widely used.
    1jam, Apr 3, 2010
    #7
  8. Stefan Ram Guest

    Andreas Wenzke <> writes:
    >>You seem to think about implementations ("char buffer") early.
    >>I prefer to think about interfaces (.getNextSymbol()) early.

    >Care to elaborate a little on this?


    I separate the code into sub-units.

    To parse an XML file, the obvious sub-units would be: a
    characters source (a source for the Unicode code points),
    then, a scanner (lexical analyzer) then, a parser (syntactical
    analyzer). But you also need to know whether you want to
    create a DOM (document object model) parser or calls to
    client functions (like a SAX parser) or something else.

    Anyway, between those units, there are interfaces.
    Interfaces are also known as APIs and similar to abstract
    datatypes, they are sets of documented calls. So I start by
    writing them.

    Only then, I will start to write implementations of these
    calls.
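
A minimal sketch of what "interfaces first" could look like here, without std::string or containers. Only getNextSymbol is a name taken from this thread; CharSource, MemorySource, Scanner and the Symbol values are invented for illustration, and the scanner classifies single characters only, which is far simpler than a real XML scanner.

```cpp
#include <cassert>
#include <cstddef>

// Interface 1: a source of characters. next() returns -1 at end of input.
class CharSource {
public:
    virtual ~CharSource() {}
    virtual long next() = 0;
};

// One interchangeable implementation: plain bytes from memory, no decoding.
// A file-backed or UTF-8-decoding source could replace it later.
class MemorySource : public CharSource {
public:
    MemorySource(const char* data, std::size_t size)
        : p_(data), end_(data + size) { assert(data != 0); }
    virtual long next() {
        return p_ == end_ ? -1L
                          : static_cast<long>(static_cast<unsigned char>(*p_++));
    }
private:
    const char* p_;
    const char* end_;
};

enum Symbol { SYM_OPEN, SYM_CLOSE, SYM_SLASH, SYM_EQUALS, SYM_CHAR, SYM_END };

// Interface 2: the scanner. It depends only on CharSource, so buffering
// details (memmove, istream::read, ...) stay hidden behind next().
class Scanner {
public:
    explicit Scanner(CharSource& source) : source_(source) {}
    Symbol getNextSymbol() {
        switch (source_.next()) {
        case -1:  return SYM_END;
        case '<': return SYM_OPEN;
        case '>': return SYM_CLOSE;
        case '/': return SYM_SLASH;
        case '=': return SYM_EQUALS;
        default:  return SYM_CHAR;
        }
    }
private:
    CharSource& source_;
};
```

A parser would then be written against Scanner alone, for example by constructing a MemorySource over a test string and calling getNextSymbol() until it returns SYM_END.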

    Some German language notes about software design by me:

    http://www.purl.org/stefan_ram/pub/aufbau_grosser_programme

    >The file-reading part is only a very small part of the whole project.
    >Implementing UTF-8 parsing isn't likely to have any benefits for my
    >program (strings will be stored "as is" anyway) and probably isn't going
    >to earn me many bonus points. However, it would probably make things
    >more complicated as I'd have to distinguish between ANSI and Unicode chars.


    The XML specification says:

    »All XML processors MUST accept the UTF-8 and UTF-16
    encodings of Unicode [Unicode]« (uppercase emphasis
    was done by the W3C, not by me [Stefan Ram])

    http://www.w3.org/TR/REC-xml/

    (ISO-8859-1 processing, on the other hand is not required.)

    Reading the XML specification and then writing a correct
    implementation is a huge project. Now, you tell me this is
    only a very small part of the whole project. You are to use C++,
    but then are not allowed to use C++, you are to read XML,
    but then are not required to read XML as it's specified.

    Such an attitude of doing a huge project in such a messy way
    (calling »C++« what is not C++, calling »XML« what is not XML)
    seems to be highly inappropriate for a scientific university.
    It even would be inappropriate for any other teaching situation,
    like, say, a »university of applied science« (»Fachhochschule«).

    Let me end this post by a quote from Rob Walling:

    »I've known smart developers who don't pay attention to detail.
    The result is misspelled database columns, uncommented code,
    projects that aren't checked into source control,
    software that's not unit tested, unimplemented features,
    and so on. All of these can be easily dealt with if
    you're building a Google mash-up or a five page website.
    But in corporate development each of these screw-ups is
    a death knell.

    So I'll say it very loud, but I promise I'll only say it once:

    I have /never, ever, ever/ seen a great software
    developer who does not have amazing attention to detail.«
    Stefan Ram, Apr 3, 2010
    #8
  9. James Kanze Guest

    On Apr 3, 10:32 am, Andreas Wenzke <> wrote:
    > I want to parse an XML file manually (but my question would be
    > the same for any other file format):
    > What are best-practice guidelines for doing that?


    > I currently use a char buffer in conjunction with
    > istream::read and then walk through the buffer step by step.
    > However, problems will arise when tags span across the buffer,
    > i.e. when the buffer contains "<h" at the end and the next
    > characters to be read from the stream are "tml>". I'm
    > considering using memmove, but I just think there has to be a
    > better option.


    > As this is for a university project, I'm not allowed to use
    > the STL (std::string and so on).


    The most obvious solution is to ensure that the buffer never
    does end in the middle of a token. Say by using getline to read
    it. This has the additional advantage of making it trivial to
    output the line number in error messages. In the case of real
    XML, it's probably not a good idea, since WWW requires
    recognizing several different line ending conventions (although
    it wouldn't be that difficult to write a custom getline which
    recognized them all), but I doubt that that's relevant for a
    school project (at least at a level where you aren't allowed to
    use the STL).

    Another solution is to read character by character, using a
    state machine to determine where the token ends, and put each
    character into your final buffer. In this way, you never have
    more than one token in the buffer, and the buffering in filebuf
    least at a level where you aren't allowed to use the STL).

    Another solution is to read character by character, using a
    state machine to determine where the token ends, and put each
    character into your final buffer. In this way, you never have
    more than one token in the buffer, and the buffering in filebuf
    takes care of the least at a level where you aren't allowed to
    use the STL).

    Another solution is to read character by character, using a
    state machine to determine where the token ends, and put each
    character into your final buffer. In this way, you never have
    more than one token in the buffer, and filebuf takes care of the
    actual IO buffering.
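
The character-by-character idea can be sketched as a small state machine that is fed one character at a time. TokenReader and its members are hypothetical names, and the token model (a tag is whatever sits between '<' and '>', text is everything else) is a deliberate simplification rather than real XML. In practice the characters would come from something like istream::get, so the stream's own buffer does the I/O buffering.

```cpp
#include <cassert>
#include <cstddef>

class TokenReader {
public:
    enum Kind { TEXT, TAG };

    TokenReader() : inTag_(false), done_(false), kind_(TEXT), len_(0) {
        buf_[0] = '\0';
    }

    // Feeds one character. Returns true when token() holds a complete token.
    bool feed(char c) {
        if (done_) {            // previous token was handed out: start fresh
            len_ = 0;
            buf_[0] = '\0';
            done_ = false;
        }
        if (inTag_) {
            if (c == '>') { inTag_ = false; return finish(TAG); }
            append(c);
            return false;
        }
        if (c == '<') {         // a tag starts, so pending text is complete
            inTag_ = true;
            return len_ > 0 ? finish(TEXT) : false;
        }
        append(c);
        return false;
    }

    Kind kind() const { return kind_; }
    const char* token() const { return buf_; }  // tag content without < and >

private:
    bool finish(Kind k) { kind_ = k; done_ = true; return true; }

    void append(char c) {
        // Tokens longer than the buffer are silently truncated in this sketch.
        if (len_ + 1 < sizeof buf_) {
            buf_[len_++] = c;
            buf_[len_] = '\0';
        }
        assert(len_ < sizeof buf_);
    }

    bool inTag_;
    bool done_;
    Kind kind_;
    std::size_t len_;
    char buf_[64];
};
```

Because the buffer never holds more than the token currently being built, there is no boundary for a token to straddle.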

    --
    James Kanze
    James Kanze, Apr 3, 2010
    #9
  10. Stefan Ram wrote:
    > To parse an XML file, the obvious sub-units would be: a
    > characters source (a source for the Unicode code points),
    > then, a scanner (lexical analyzer) then, a parser (syntactical
    > analyzer). But you also need to know whether you want to
    > create a DOM (document object model) parser or calls to
    > client functions (like a SAX parser) or something else.


    As I only want to parse one certain format, I think this isn't necessary.
    Usually, a specific expected token has to be read, otherwise a parsing
    error would occur.

    > Anyway, between those units, there are interfaces.
    > Interfaces are also known as APIs and similar to abstract
    > datatypes, they are sets of documented calls. So I start by
    > writing them.


    I have several years of programming experience in C#, so I'm generally
    used to developing against interfaces.

    But one thing is that I lack experience in C++ and the other is that I
    want to get this XML parser done as quickly as possible, so I can
    concentrate on the actual project task.

    > Some German language notes about software design by me:
    >
    > http://www.purl.org/stefan_ram/pub/aufbau_grosser_programme


    "You ain't gonna need it"

    I generally understand your objection, and in this case I just want to
    get this (pseudo) parser done.

    > The XML specification says:
    >
    > »All XML processors MUST accept the UTF-8 and UTF-16
    > encodings of Unicode [Unicode]« (uppercase emphasis
    > was done by the W3C, not by me [Stefan Ram])


    Actually, I don't think this is an emphasis, but rather the normal RFC
    way of pointing out that "MUST", "MAY" etc. are to be interpreted as
    keywords (see also RFC 2119).

    But that aside, I do accept those encodings, I just don't decode them.

    > Such an attitude of doing a huge project in such a messy way
    > (calling »C++« what is not C++, calling »XML« what is not XML)
    > seems to be highly inappropriate for a scientific university.
    > It even would be inappropriate for any other teaching situation,
    > like, say, a »university of applied science« (»Fachhochschule«).


    You have to start /somewhere/. You can't just put everything into a
    three-hours-per-week class.

    The lecturer is very good (and believe me, I have seen bad classes like
    someone teaching a C# "beginner's class" where she would teach "design
    patterns" without even explaining what polymorphism or interfaces are),
    and whilst I don't think using XML as the input format was quite
    necessary, he does a good job.
    Andreas Wenzke, Apr 4, 2010
    #10
  11. * Andreas Wenzke:
    > I want to parse an XML file manually (but my question would be the same
    > for any other file format):
    > What are best-practice guidelines for doing that?
    >
    > I currently use a char buffer in conjunction with istream::read and then
    > walk through the buffer step by step.
    > However, problems will arise when tags span across the buffer, i.e. when
    > the buffer contains "<h" at the end and the next characters to be read
    > from the stream are "tml>".
    > I'm considering using memmove, but I just think there has to be a better
    > option.
    >
    > As this is for a university project, I'm not allowed to use the STL
    > (std::string and so on).


    The abstraction you're looking for seems to be "get next character".

    This is provided by the C standard library. <g>

    Build your lexer on top of that and your parser on top of the lexer.


    Cheers & hth.,

    - Alf
    Alf P. Steinbach, Apr 4, 2010
    #11
  12. James Kanze wrote:
    > The most obvious solution is to ensure that the buffer never
    > does end in the middle of a token. Say by using getline to read
    > it.


    <foo
    attr="value"
    />

    is valid XML, as far as I know.

    > Another solution is to read character by character, using a
    > state machine to determine where the token ends, and put each
    > character into your final buffer. In this way, you never have
    > more than one token in the buffer, and the buffering in filebuf
    > least at a level where you aren't allowed to use the STL).
    >
    > Another solution is to read character by character, using a
    > state machine to determine where the token ends, and put each
    > character into your final buffer. In this way, you never have
    > more than one token in the buffer, and the buffering in filebuf
    > takes care of the least at a level where you aren't allowed to
    > use the STL).
    >
    > Another solution is to read character by character, using a
    > state machine to determine where the token ends, and put each
    > character into your final buffer. In this way, you never have
    > more than one token in the buffer, and filebuf takes care of the
    > actual IO buffering.


    Am I mistaken or is this three times the same suggestion?

    I initially wanted to implement a finite-state machine (using an enum
    for the states), but soon realized there essentially always is a fixed
    order:

    1. Try to read a BOM
    2. Try to read an XML declaration
    3. Ignore any whitespace
    4. Read the root element
    5. Read the first child element
    ....

    So what I have so far are several SkipXXX methods (SkipBOM,
    SkipWhitespace and so on), each of which advances the char pointer in
    the buffer.
    As soon as a method tries to move to or past the end of the buffer, the
    buffer's remaining contents are memmove'd to the beginning of the buffer
    and the remainder is refilled with data from the stream.

    What do you think of that approach?
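
For reference, the slide-and-refill step described above might look roughly like this. RefillBuffer, ensure, cursor and advance are invented names, and the 8-byte capacity is chosen only to make the boundary case easy to trigger; this is a sketch of the idea, not the code from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>   // only needed to try the sketch on an in-memory stream

class RefillBuffer {
public:
    explicit RefillBuffer(std::istream& in) : in_(in), pos_(0), end_(0) {}

    // Tries to make at least `want` unread bytes available.
    // Returns the number of unread bytes actually in the buffer afterwards.
    std::size_t ensure(std::size_t want) {
        assert(want <= sizeof data_);
        const std::size_t unread = end_ - pos_;
        if (unread >= want) return unread;

        // Slide the unread tail to the front of the buffer ...
        std::memmove(data_, data_ + pos_, unread);
        pos_ = 0;
        end_ = unread;

        // ... and refill the free space behind it from the stream.
        in_.read(data_ + end_, static_cast<std::streamsize>(sizeof data_ - end_));
        end_ += static_cast<std::size_t>(in_.gcount());
        return end_;
    }

    const char* cursor() const { return data_ + pos_; }

    void advance(std::size_t n) {
        assert(n <= end_ - pos_);
        pos_ += n;
    }

private:
    std::istream& in_;
    std::size_t pos_;   // index of the next unread byte
    std::size_t end_;   // one past the last valid byte
    char data_[8];      // tiny on purpose
};
```

With the input "123456<html>", the first fill ends in "<h"; after the six digits are consumed, ensure(6) moves "<h" to the front and appends "tml>", so the whole tag can be inspected in one piece. The limitation is that no single item may be longer than the buffer.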
    Andreas Wenzke, Apr 4, 2010
    #12
  13. Stefan Ram Guest

    Andreas Wenzke <> writes:
    >You have to start /somewhere/.


    Being allowed to use all of C++ and to use third-party libraries
    it is easier to get something done than when this is forbidden.
    It is more difficult to have to implement many things on a low level
    oneself than to use given and tested implementations.

    Therefore, it would seem more natural to me, to be allowed to use
    all of C++ and to use third-party libraries /in the first semester/,
    and then to do the more difficult part of implementing »everything«
    oneself /in the second semester/, because it seems natural to me to
    start with easier tasks and then proceed toward more difficult
    tasks.
    Stefan Ram, Apr 4, 2010
    #13
  14. Stefan Ram wrote:
    > Being allowed to use all of C++ and to use third-party libraries
    > it is easier to get something done than when this is forbidden.


    In fact, I think we are allowed to use an XML-parsing library, since the
    lecturer thought that integrating it well would be at least as hard as
    writing the parser by oneself.
    I nevertheless decided against that as I think I will get more credits
    for writing my own (albeit imperfect) parser.
    Andreas Wenzke, Apr 4, 2010
    #14
  15. Jonathan Lee Guest

    On Apr 3, 5:32 am, Andreas Wenzke <> wrote:
    > I currently use a char buffer in conjunction with istream::read and then
    > walk through the buffer step by step.
    > However, problems will arise when tags span across the buffer, i.e. when
    > the buffer contains "<h" at the end and the next characters to be read
    > from the stream are "tml>".
    > I'm considering using memmove, but I just think there has to be a better
    > option.


    You could write an LL(k) parser using the EBNF grammar provided by the
    XML specification. I think I read somewhere that XML is LL(1) so you
    could get by reading a char at a time.

    --Jonathan
    Jonathan Lee, Apr 4, 2010
    #15
  16. * Pete Becker:
    > Pete Becker wrote:
    >> Andreas Wenzke wrote:
    >>> James Kanze wrote:
    >>>> The most obvious solution is to ensure that the buffer never
    >>>> does end in the middle of a token. Say by using getline to read
    >>>> it.
    >>>
    >>> <foo
    >>> attr="value"
    >>> />
    >>>
    >>> is valid XML, as far as I know.
    >>>

    >>
    >> Yes, and it's made up of six tokens:
    >>
    >> < foo attr = "value" />
    >>

    >
    > or maybe eight, I haven't checked the grammar:
    >
    > < foo attr = " value " />


    How about seven?

    < foo attr = "value" / >

    Well I haven't checked the grammar either. ;-)


    Cheers,

    - Alf
    Alf P. Steinbach, Apr 4, 2010
    #16
  17. On 2010-04-03, Andreas Wenzke <> wrote:
    > Christian Hackl schrieb:
    >> What are you allowed to use at all, then?

    >
    ><iostream> and C libraries like <string.h>.
    >
    >> "STL" is not a synonym for "standard library". In particular,
    >> std::string is considered a different part of the library than the
    >> container/algorithm part. If your lecturer does not allow you to use the
    >> entire standard library except of the C part, then of course streams
    >> cannot be used, either.

    >
    > Sorry, <iostream> can be used, of course.
    >
    >> Anyway, I think that with such course requirements best-practice
    >> guidelines for file reading in C++ simply cannot be met. (I originally
    >> learned C++ that way, too, and later had to unlearn much of what had
    >> been taught to us. It bothers me that C++ is still treated this way at
    >> universities.)

    >
    > STL will be taught in detail, though not in this class where the
    > lecturer wants us to understand the implementation first.


    So he's teaching C++ as a "shitty C"? I hope for others'
    sake that this lecturer is alone in being such an idiot.

    You need to write a lexer whose job is to read entire tokens.
    You can do that by reading individual characters into a token
    class. The lexer's job is to make sure that a "<" and a ">"
    and an "html" are all distinct, complete entities.

    The actual XML-parsing work will then be done on those tokens.
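
A rough sketch of such a lexer, pulling one character at a time from an std::istream with get() and peek(). Token, next_token and is_delimiter are invented names, and the token set (punctuation, names, quoted strings) is a toy split that has not been checked against the XML grammar. Text is stored in a fixed char array because std::string is not allowed in the assignment.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>   // only needed to try the sketch on an in-memory stream

struct Token {
    enum Kind { LT, GT, SLASH, EQ, NAME, STRING, END, BAD };
    Kind kind;
    char text[64];   // filled for NAME and STRING; longer text is truncated
};

static bool is_delimiter(int c) {
    return c == '<' || c == '>' || c == '/' || c == '=' || c == '"';
}

// Reads exactly one complete token from the stream.
Token next_token(std::istream& in) {
    const int eof = std::istream::traits_type::eof();
    Token t;
    t.kind = Token::END;
    t.text[0] = '\0';
    std::size_t len = 0;

    // Whitespace (including line breaks) only separates tokens.
    while (in.peek() != eof && std::isspace(in.peek())) in.get();

    int c = in.get();
    if (c == eof) return t;
    if (c == '<') { t.kind = Token::LT;    return t; }
    if (c == '>') { t.kind = Token::GT;    return t; }
    if (c == '/') { t.kind = Token::SLASH; return t; }
    if (c == '=') { t.kind = Token::EQ;    return t; }

    if (c == '"') {
        // Quoted string: collect everything up to the closing quote.
        while ((c = in.get()) != eof && c != '"') {
            if (len + 1 < sizeof t.text) t.text[len++] = static_cast<char>(c);
        }
        t.text[len] = '\0';
        t.kind = (c == '"') ? Token::STRING : Token::BAD;  // BAD: unterminated
        return t;
    }

    // Name: runs until whitespace, a delimiter, or end of input.
    t.text[len++] = static_cast<char>(c);
    while (in.peek() != eof && !std::isspace(in.peek()) && !is_delimiter(in.peek())) {
        c = in.get();
        if (len + 1 < sizeof t.text) t.text[len++] = static_cast<char>(c);
    }
    assert(len < sizeof t.text);
    t.text[len] = '\0';
    t.kind = Token::NAME;
    return t;
}
```

Under this toy split, an element written across three lines as `<foo`, `attr="value"`, `/>` comes out as the seven tokens LT, NAME, NAME, EQ, STRING, SLASH, GT, regardless of where the line breaks or buffer boundaries fall.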

    --
    Andrew Poelstra
    http://www.wpsoftware.net/andrew
    Andrew Poelstra, Apr 4, 2010
    #17
  18. Jonathan Lee Guest

    On Apr 4, 12:10 pm, Andrew Poelstra <>
    wrote:
    > On 2010-04-03, Andreas Wenzke <> wrote:
    > > STL will be taught in detail, though not in this class where the
    > > lecturer wants us to understand the implementation first.

    > So he's teaching C++ as a "shitty C"? I hope for others'
    > sake that this lecturer is alone in being such an idiot.


    I think he means the teacher is, say, asking students to implement
    a sort algorithm before coming to rely on std::sort().

    Personally, I don't think that's idiotic at all.

    --Jonathan
    Jonathan Lee, Apr 4, 2010
    #18
  19. Stefan Ram Guest

    Jonathan Lee <> writes:
    >I think he means the teacher is, say, asking students to implement
    >a sort algorithm before coming to rely on std::sort().


    Once one has used ::std::sort(), one's mind will be
    deformed and be less able to implement a sort algorithm?

    You do not have to make up examples (»... is, say, ...«)
    for what the teacher asks, because it was already given in
    a more specific way in this thread. It is a project of which an XML
    parser is only a small part. This is quite different
    from studying a single algorithm in isolation.
    Stefan Ram, Apr 4, 2010
    #19
  20. Jonathan Lee Guest

    On Apr 4, 12:37 pm, -berlin.de (Stefan Ram) wrote:
    >   Once one has used ::std::sort(), one's mind will be
    >   deformed and be less able to implement a sort algorithm?


    Er... no.

    >   You do not have to make up examples (»... is, say, ...«)
    >   for what the teacher asks,


    Is there some mandate that obliges me to stay within
    the exact context?

    --Jonathan
    Jonathan Lee, Apr 4, 2010
    #20