Parsing file to extraction records

Discussion in 'C Programming' started by M, Mar 9, 2006.

  1. M

    M Guest

    Hi,

    I need to parse text files to extract data records. The files will
    consist of a header, zero or more data records, and a trailer. I can
    discard the header and trailer but I must split the data records up
    and return them to an application.

    The complexity here is that I won't know the exact format of the files
    until run time. The files may or may not contain headers and trailers
    and the format is not known yet. The records may have clearly defined
    start and end markers but they may not. There may be a fixed separator
    between the records or there may not. (Separators will be used if
    there are no record start and end markers).

    The current idea is to use UNIX regular expressions to define the
    format of the parts of the file and match them up at run time. However
    it is not clear whether it would be possible to develop single
    expressions for the whole file or whether I would have to use separate
    regular expressions for each part of the file (header, trailer,
    separator, begin/end record etc.). If a single expression is used I
    would imagine the expression would match all the data records rather
    than being able to recognise individual records.
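
    A minimal sketch of the "separate expressions" route, assuming POSIX
    regcomp()/regexec() from <regex.h> are available on the target and that
    records carry start and end markers. The helper name next_record and
    the marker patterns used with it are hypothetical, not part of any
    existing application. Each call returns one record, so individual
    records are recognised without a whole-file expression, and anything
    outside the markers (header, trailer) is skipped implicitly.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Find the next record at or after *cursor.
 * A record is taken to be the text between a match of start_re and the
 * following match of end_re, markers excluded. On success the record's
 * start and length are stored, *cursor is moved past the end marker and
 * 1 is returned. Returns 0 when no further complete record exists.
 * Assumption: the patterns do not rely on ^ or $ anchors. */
static int next_record(const regex_t *start_re, const regex_t *end_re,
                       const char **cursor, const char **rec, size_t *len)
{
    regmatch_t m;
    const char *body;

    assert(cursor != NULL && *cursor != NULL);

    if (regexec(start_re, *cursor, 1, &m, 0) != 0)
        return 0;                       /* no start marker left */
    body = *cursor + m.rm_eo;           /* record text begins after it */

    if (regexec(end_re, body, 1, &m, 0) != 0)
        return 0;                       /* start marker without an end */
    if (body + m.rm_eo == *cursor)
        return 0;                       /* both matches empty: no progress */

    *rec = body;
    *len = (size_t)m.rm_so;             /* text up to the end marker */
    *cursor = body + m.rm_eo;           /* resume after the end marker */
    return 1;
}
```

    The caller would compile the two patterns once with regcomp(), call
    next_record() in a loop until it returns 0, and release the patterns
    with regfree().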

    This code is to extend an application already written in C running on
    UNIX (and OpenVMS) platforms.

    I would be grateful for some thoughts on how this could be achieved.
     
    M, Mar 9, 2006
    #1

  2. M wrote:
    > Hi,
    >
    > I need to parse text files to extract data records. The files will
    > consist of a header, zero or more data records, and a trailer. I can
    > discard the header and trailer but I must split the data records up
    > and return them to an application.


    I believe this question is better suited for comp.programming or
    similar...

    > The complexity here is that I won't know the exact format of the files
    > until run time. The files may or may not contain headers and trailers
    > and the format is not known yet. The records may have clearly defined
    > start and end markers but they may not. There may be a fixed separator
    > between the records or there may not. (Separators will be used if
    > there are no record start and end markers).


    I don't really understand how you're going to cater for this level of
    indeterminacy.

    > The current idea is to use UNIX regular expressions to define the
    > format of the parts of the file and match them up at run time. However
    > it is not clear whether it would be possible to develop single
    > expressions for the whole file or whether I would have to use separate
    > regular expressions for each part of the file (header, trailer,
    > separator, begin/end record etc.). If a single expression is used I
    > would imagine the expression would match all the data records rather
    > than being able to recognise individual records.


    If you at least know the limits of what can be expected, why don't you
    come up with a simple(ish) file description language, and pre-pend it
    (or use it as a header).

    Still, nothing C-specific here. Try some other groups.

    --
    BR, Vladimir
     
    Vladimir S. Oka, Mar 9, 2006
    #2

  3. M

    M Guest

    Thanks for your response.

    > I believe this question is better suited for comp.programming or
    > similar...


    It is posted to comp.programming (and crossposted to comp.lang.c)

    > If you at least know the limits of what can be expected, why don't you
    > come up with a simple(ish) file description language, and pre-pend it
    > (or use it as a header).


    This seems even more difficult than the ideas I discussed. Maybe I did
    not explain the requirements well. The program has to cope with a
    variety of different file formats, hence the need to make the program
    flexible. The file format would be specified in a database or
    configuration file and would be fixed for any particular instance of
    the program. However, there will be many such programs running on
    different installations, all reading different file formats.

    > Still, nothing C-specific here. Try some other groups.


    It's got to be written in C. I think that is specific :)

    M
     
    M, Mar 9, 2006
    #3
  4. NB: Posted just to comp.lang.c

    M wrote:
    > Thanks for your response.
    >
    > > I believe this question is better suited for comp.programming or
    > > similar...

    >
    > It is posted to comp.programming (and crossposted to comp.lang.c)


    Sorry, I did not see this.

    > > If you at least know the limits of what can be expected, why don't you
    > > come up with a simple(ish) file description language, and pre-pend it
    > > (or use it as a header).

    >
    > This seems even more difficult than the ideas I discussed. Maybe I did
    > not explain the requirements well. The program has to cope with a variety
    > of different file formats. Hence the need to make the program flexible.
    > The file format would be specified in a database or configuration file and
    > would be fixed for any particular instance of the program. However there will
    > be many such programs running on different installations all reading different
    > file formats.


    You suggested regular expressions. I suggested a simplified form (in
    different words), specific to your implementation. Where the
    description is stored is really immaterial.

    > > Still, nothing C-specific here. Try some other groups.

    >
    > It's got to be written in C. I think that is specific :)


    You're really after the method, which can be implemented in any
    language.

    This group (c.l.c) discusses the C language only. Once you implement
    this in C (or start implementing it), and have a question about
    /implementation/ using standard C, this is the place to ask about it.
    (Although, as you will have noticed, we do tend to give it a stab,
    while pointing to the better place to ask. ;-) )

    --
    BR, Vladimir
     
    Vladimir S. Oka, Mar 9, 2006
    #4
  5. M said:

    > Hi,
    >
    > I need to parse text files to extract data records. The files will
    > consist of a header, zero or more data records, and a trailer. I can
    > discard the header and trailer but I must split the data records up
    > and return them to an application.
    >
    > The complexity here is that I won't know the exact format of the files
    > until run time.


    Been there, done that, got the tee-shirt in several different shapes and
    sizes. We ended up writing a data language. (Well, I say we, but I had very
    little to do with it actually.) I'm fairly sure I've described it here
    before. A descriptor file (text, of course) was used to identify which
    fields were present in which locations and how wide they were, that sort of
    thing.
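
    As a rough illustration of what "which locations and how wide" buys
    you, here is a small sketch of pulling one fixed-width field out of a
    record once its offset and width are known. The struct field_pos type
    and the copy_field function are hypothetical names, and the sketch
    assumes records are held as NUL-terminated strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Position of one field inside a fixed-width record (hypothetical type). */
struct field_pos {
    size_t offset;   /* index of the field's first character */
    size_t width;    /* number of characters in the field */
};

/* Copy one field out of record into dst and NUL-terminate it.
 * Returns the number of characters copied, or -1 if the field lies
 * outside the record or dst is too small to hold it. */
static int copy_field(const char *record, struct field_pos f,
                      char *dst, size_t dstsize)
{
    size_t reclen;

    assert(record != NULL && dst != NULL);
    reclen = strlen(record);

    if (f.offset > reclen || f.width > reclen - f.offset)
        return -1;                      /* field runs past the record */
    if (f.width >= dstsize)
        return -1;                      /* no room for the terminator */

    memcpy(dst, record + f.offset, f.width);
    dst[f.width] = '\0';
    return (int)f.width;
}
```

    The offsets themselves would come from the descriptor file, typically
    by summing the widths of the preceding fields.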

    > The files may or may not contain headers and trailers
    > and the format is not known yet.


    You just said they would have a header and a trailer. The exact format may
    be a moveable feast, but you need to establish a consistent meta-format
    early on.

    > I would be grateful for some thoughts on how this could be achieved.


    Let's say you wanted to write a C interpreter. (Analogy alert!) To process a
    struct definition, you'd have to read it in from the text file, identify
    the type of each member, and its name, and (if it's an array) its size. And
    you'd have to have some way of finding or updating a particular member's
    value, given its name.

    You have much the same deal here. Your record is like a C struct, in a way.
    (But not in another way. For reading and processing, you will almost
    certainly want to be able to access the various fields of a record in a
    loop - at least sometimes.) So that gives you a clue about your
    configuration file structure. Say, for example, that you are dealing with
    orders for nuts and bolts from fifteen different large customers, all of
    whom send their orders to you electronically. You might want to have a
    config file structure something like this:

    FILETYPE Orders
    CUSTOMER NutsNBoltsRUs
    DEF RECORD Header
    CHAR Type
    DATE Created
    INTEGER RecordCount
    ENDDEF
    DEF RECORD Bolts
    CHAR Type
    DATE OrderDate
    CHAR 16 ProductCode
    STRING Description *
    INTEGER Height
    INTEGER TopDiameter
    CHAR 3 DontCareA
    INTEGER TipDiameter
    CHAR 3 DontCareB
    INTEGER PitchCode
    CHAR 6 DontCareC
    INTEGER PriceCode
    ENDDEF
    DEF RECORD Nuts
    CHAR Type
    DATE OrderDate
    CHAR 14 ProductCode
    STRING Description *
    INTEGER MatCode
    INTEGER Depth
    INTEGER ExternalDiameter
    INTEGER InternalDiameter
    INTEGER PitchCode
    INTEGER PriceCode
    CHAR 12 DontCareD
    INTEGER ColourCode
    ENDDEF

    As you can see, this is easily extensible, and its purpose is to describe
    the file format supplied by a particular customer. Thus, its layout will
    vary depending on that format. The above example contains some fields that
    we simply aren't interested in, but we have to know enough about them to be
    able to ignore them - hence the "DontCare" entries. And at runtime, you
    simply read the config file to find out where in a record the relevant
    field information was. You'll end up with functions to read a record, work
    out what record type it is, find a field within a given record either by
    name or by index, etc etc. Nothing terribly hard, but needs careful
    planning.
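
    A minimal sketch of reading one DEF RECORD ... ENDDEF block of the
    descriptor above into a table and then finding a field by name. All of
    the names here (struct field_def, struct record_def, parse_field_line,
    load_record, find_field) are hypothetical. The sketch also assumes that
    a line without a number has no stated width and that the trailing "*"
    on STRING lines can be ignored for layout purposes. A real
    implementation would need to decide what those mean.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 32
#define MAX_NAME   32

struct field_def {
    char type[MAX_NAME];   /* e.g. "CHAR", "DATE", "INTEGER" */
    char name[MAX_NAME];
    int  width;            /* 0 = not stated in the descriptor (assumption) */
};

struct record_def {
    char name[MAX_NAME];
    struct field_def fields[MAX_FIELDS];
    int nfields;
};

/* Parse "TYPE NAME", "TYPE WIDTH NAME" or "TYPE NAME *". Returns 1 if ok. */
static int parse_field_line(const char *line, struct field_def *out)
{
    char a[MAX_NAME], b[MAX_NAME], c[MAX_NAME];
    int n = sscanf(line, "%31s %31s %31s", a, b, c);

    if (n < 2)
        return 0;
    memset(out, 0, sizeof *out);
    strcpy(out->type, a);
    if (n == 3 && strcmp(c, "*") != 0) {       /* TYPE WIDTH NAME */
        char *end;
        long w = strtol(b, &end, 10);
        if (*end != '\0' || w <= 0 || w > 10000)
            return 0;
        out->width = (int)w;
        strcpy(out->name, c);
    } else {                                   /* TYPE NAME [*] */
        strcpy(out->name, b);
    }
    return 1;
}

/* Load the first DEF RECORD ... ENDDEF block found in lines[]. */
static int load_record(const char *const *lines, int nlines,
                       struct record_def *rec)
{
    int i, in_def = 0;
    char word[MAX_NAME];

    assert(lines != NULL && rec != NULL);
    memset(rec, 0, sizeof *rec);
    for (i = 0; i < nlines; i++) {
        if (!in_def) {
            /* Skip everything until a record definition starts. */
            if (sscanf(lines[i], " DEF RECORD %31s", word) == 1) {
                strcpy(rec->name, word);
                in_def = 1;
            }
            continue;
        }
        if (sscanf(lines[i], "%31s", word) == 1 && strcmp(word, "ENDDEF") == 0)
            return 1;
        if (rec->nfields == MAX_FIELDS ||
            !parse_field_line(lines[i], &rec->fields[rec->nfields]))
            return 0;
        rec->nfields++;
    }
    return 0;                                  /* ENDDEF never seen */
}

/* Return the index of the named field, or -1 if there is none. */
static int find_field(const struct record_def *rec, const char *name)
{
    int i;
    for (i = 0; i < rec->nfields; i++)
        if (strcmp(rec->fields[i].name, name) == 0)
            return i;
    return -1;
}
```

    Looking a field up by index is then just rec.fields[i], which covers
    the "access the fields in a loop" case mentioned earlier.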


    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
     
    Richard Heathfield, Mar 9, 2006
    #5
  6. It is impossible to use regex w/o knowing the file formats.

    If you can provide further information on what you want to do with your
    program, I will try to provide some further assistance.
     
    Programming Master, Mar 10, 2006
    #6
  7. Oliver Wong

    Oliver Wong Guest

    "Programming Master" <> wrote in message
    news:...
    > It is impossible to use regex w/o knowing the file formats.
    >
    > If you can provide further information on what you want to do with your
    > program, I will try to provide some further assistance.


    I think the OP is saying the program WILL know the file formats...
    except only at runtime, instead of at compile time.
     
    Oliver Wong, Mar 10, 2006
    #7
  8. M

    M Guest

    > I think the OP is saying the program WILL know the file formats...
    > except only at runtime, instead of at compile time.


    Correct. The program will have to cope with many different file
    formats (conforming to the specification from my original post). The
    exact format will be known at run time and may be specified in terms of
    regular expressions.

    The purpose of this application is to interpret data files from many
    different clients. Each client uses a slightly different file format.
    My program has to be able to read all the files.

    I have now completed a prototype, based on the provision of five
    different regular expressions to define a file format. It would be
    nice to reduce the number of expressions necessary - but I can't see a
    way of doing this. This is really what the original post was about -
    using a single RE.
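
    For the separator case, a sketch along these lines shows how one of
    the expressions might be applied on its own. It is not the prototype
    itself: struct span and split_records are hypothetical names, POSIX
    <regex.h> is assumed, and header and trailer removal is assumed to
    have happened before the call.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* One piece of the input: a pointer into the text plus a length. */
struct span {
    const char *start;
    size_t len;
};

/* Split text on every match of sep_re.
 * Stores at most max pieces in out[] and returns the number of pieces
 * found, which may be larger than max. A piece may be empty if the text
 * starts with a separator or two separators are adjacent.
 * Assumption: the separator pattern does not use ^ or $ anchors. */
static size_t split_records(const regex_t *sep_re, const char *text,
                            struct span *out, size_t max)
{
    size_t n = 0;
    regmatch_t m;
    const char *p = text;

    assert(sep_re != NULL && text != NULL);

    while (*p != '\0' && regexec(sep_re, p, 1, &m, 0) == 0) {
        if (m.rm_eo == m.rm_so)
            break;                  /* empty match: stop, do not loop forever */
        if (n < max) {
            out[n].start = p;
            out[n].len = (size_t)m.rm_so;   /* text before the separator */
        }
        n++;
        p += m.rm_eo;               /* continue after the separator */
    }
    if (*p != '\0') {               /* whatever follows the last separator */
        if (n < max) {
            out[n].start = p;
            out[n].len = strlen(p);
        }
        n++;
    }
    return n;
}
```

    Because the separator is the only thing that has to be described, a
    format without start and end markers needs one expression fewer.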

    Mark
     
    M, Mar 13, 2006
    #8