Advice on dealing with a legacy file format

Discussion in 'Java' started by The Frog, Sep 30, 2010.

  1. The Frog

    The Frog Guest

    Hi Everyone,

    Before I begin trudging away at this I thought it prudent to ask for
    some advice. I have a legacy file format from an application that is
    common to both the company I work for and the business partners. The
    file is 'sort of' structured, in that it is text-based, and there
    are paragraphs of information, each following its own structural bent.

    I am going to write a parser for this file type so that we can do more
    than we are currently limited to with the existing application. My
    question is therefore: What would be an 'elegant' approach to reading
    this file and its various paragraphs? An example is below.

    <This is a 'blank slate' file>
    PROSPACE SCHEMATIC FILE
    ; Version 2006.3.2
    Project,Space Planning-Projekt,,
    0,,7,1.5,1.5,1,1,0,,1,0,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,

    <This is with some basic data in it>
    PROSPACE SCHEMATIC FILE
    ; Version 2006.3.2
    Project,Space Planning-Projekt,,
    0,,7,1.5,1.5,1,1,0,,1,1,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,
    Planogram,Test,,
    100,200,60,16777215,2,1,100,10,60,1,16777215,0,10,2,0,32896,1,0,0,1,0,,,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,0,1,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,1,,,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,,,,0,0,0,0,0,0,0,0,,,,0.5,30,30,0,0,0,0,0,0,1.5,1.5,1,1,0,0,0,0,0,0,0,0,,,,,,F766F39C-5610-4ced-
    AB0A-FDFD4D6DC166,,,,,0
    Segment,,,
    0,100,0,0,0,0,0,0,0,0,0,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,,

    What I am hoping to do is to eliminate the need to do a brute force
    approach. I was hoping that perhaps someone might have a suggestion
    using regular expressions. The goal at this stage is to read the file
    and place the data into suitable class objects with some sort of
    hierarchy. I understand the elements themselves (or mostly;
    unfortunately there is no official documentation for the file type),
    so interpreting the values into suitable objects isn't the issue. The
    issue is reading the file in the first place and isolating those
    entities from the text in a 'clean' way.

    Any advice would be greatly appreciated.

    The Frog
     
    The Frog, Sep 30, 2010
    #1

  2. Tom Anderson

    Tom Anderson Guest

    On Thu, 30 Sep 2010, The Frog wrote:

    > The file is 'sort of' structured, in that it is text-based, and there
    > are paragraphs of information, each following its own structural bent.


    Could you expand on what you mean by that? The lexical structure looks
    rather regular to me.

    > An example is below....
    >
    > <This is a 'blank slate' file>
    > PROSPACE SCHEMATIC FILE
    > ; Version 2006.3.2
    > Project,Space Planning-Projekt,,
    > 0,,7,1.5,1.5,1,1,0,,1,0,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,
    >
    > <This is with some basic data in it>
    > PROSPACE SCHEMATIC FILE
    > ; Version 2006.3.2
    > Project,Space Planning-Projekt,,
    > 0,,7,1.5,1.5,1,1,0,,1,1,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,
    > Planogram,Test,,
    > 100,200,60,16777215,2,1,100,10,60,1,16777215,0,10,2,0,32896,1,0,0,1,0,,,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,0,1,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,1,,,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,,,,0,0,0,0,0,0,0,0,,,,0.5,30,30,0,0,0,0,0,0,1.5,1.5,1,1,0,0,0,0,0,0,0,0,,,,,,F766F39C-5610-4ced-
    > AB0A-FDFD4D6DC166,,,,,0
    > Segment,,,
    > 0,100,0,0,0,0,0,0,0,0,0,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,,


    Ye gods.

    > What I am hoping to do is to eliminate the need to do a brute force
    > approach. I was hoping that perhaps someone might have a suggestion
    > using regular expressions.


    "Now, they have two problems."

    Okay, so AFAICT, the format of a file is that it's a file header followed
    by a sequence of paragraphs. The file header is the string "PROSPACE
    SCHEMATIC FILE" followed by a line break, followed by a version info line,
    followed by a line break (perhaps more than one info line, but each
    starting with a semicolon and terminated by a line break). A paragraph is
    a metadata line followed by a data line. Lines are sequences of values,
    separated by commas, terminated by a line break. The first value in each
    metadata line is the type of the paragraph, which governs the semantics of
    the values on the data line.

    Is that about right?

    If so, i wouldn't bother with anything except a manual parse, because this
    is a very straightforward format. I can't see any helpful way to use
    regular expressions.
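
    If the format is as described above, a manual parse can be quite
    short. The following is a minimal sketch, not the code at the link
    below. The names Paragraph and ProspaceReader are hypothetical, and
    the sketch assumes exactly one data line per paragraph, no quoted or
    escaped commas, and that lines starting with a semicolon can be
    skipped wherever they occur.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical holder for one paragraph: a metadata line plus its data line. */
final class Paragraph {
    final String type;            // first value of the metadata line, e.g. "Project"
    final List<String> metadata;  // all values of the metadata line
    final List<String> data;      // all values of the data line

    Paragraph(List<String> metadata, List<String> data) {
        this.type = metadata.get(0);
        this.metadata = metadata;
        this.data = data;
    }
}

/** Hypothetical line-oriented reader; assumes the layout described in the post. */
final class ProspaceReader {
    static final String MAGIC = "PROSPACE SCHEMATIC FILE";

    /** Splits on commas; the -1 limit keeps trailing empty values. */
    static List<String> splitLine(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    static List<Paragraph> parse(List<String> lines) {
        Iterator<String> it = lines.iterator();
        if (!it.hasNext() || !MAGIC.equals(it.next().trim())) {
            throw new IllegalArgumentException("Missing file header");
        }
        List<Paragraph> paragraphs = new ArrayList<>();
        while (it.hasNext()) {
            String line = it.next().trim();
            if (line.isEmpty() || line.startsWith(";")) {
                continue; // blank line or info line such as "; Version 2006.3.2"
            }
            if (!it.hasNext()) {
                throw new IllegalArgumentException("Metadata line without data line: " + line);
            }
            paragraphs.add(new Paragraph(splitLine(line), splitLine(it.next().trim())));
        }
        return paragraphs;
    }
}
```

    The lines could come from java.nio.file.Files.readAllLines; a
    wrapped data line, like the one in the quoted sample, would need to
    be rejoined before this sketch could handle it.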

    I have the day off, and i really should be packing to go on holiday, so
    naturally, as a means of positive procrastination, i have written a little
    parser to show how i would do it:

    http://urchin.earth.li/~twic/Code/Prospace/

    So far, it only deals with the Project paragraph, but you can see how to
    extend it. The interesting entry point is BuildProspace.main.

    tom

    --
    We must perform a quirkafleeg
     
    Tom Anderson, Sep 30, 2010
    #2

  3. The Frog

    The Frog Guest

    Thank you, Tom, for the guidance. I understand the approach you are
    taking in your illustration, and thank you for taking the time to
    provide a coded example. I really appreciate the effort you have made.

    Let me see if I can clarify my desire to use regular expressions,
    though perhaps that is no longer necessary. Simply put, the file's
    structure has been revised several times over the years, and my
    understanding was that I could provide a version-based reader,
    with different 'rules' for each paragraph type in each version. I was
    hoping to let a user construct such a rule definition, with an XML
    file acting as the descriptor for the data file's construction. The
    reader would then parse the data file and return the information to
    the calling application.

    To clarify the file itself, the paragraphs are basically someone's idea
    of an XML-like approach to storing data. Each line in a paragraph
    starts with a descriptor of sorts that identifies the type of data
    that follows. The data that follows is in CSV format (minus any
    headers). Various types of paragraphs are then used 'in order' to
    construct the final data structure. If anyone is interested, this is
    for a planogram (shelf design used for retail stores) software tool.
    The order of the paragraphs and lines follows the shelf logic build
    order for the software tool. In the end you have a file that tells you
    what products (and all their descriptive characteristics) belong on
    what shelf, in what position, and how they are placed or stacked, per
    segment of each gondola. That's a mouthful. When you walk down a
    supermarket aisle you are most likely seeing the result of one of
    these tools (and its processes), and all that data is stored in a file
    like this.

    What I am ultimately hoping to achieve is to take a data file and
    parse it into a hierarchy of objects that represent the same shelf
    design principles, then hand those objects back to the calling
    application, while keeping the parsing logic encapsulated and
    extensible. I thought that XML definition files for the object
    definitions and regular expressions might be a good way to go for
    this.

    I thank you once again for your most instructive feedback. I will have
    a play with this and see how I do.

    Cheers

    The Frog
     
    The Frog, Sep 30, 2010
    #3
  4. Roedy Green

    Roedy Green Guest

    On Thu, 30 Sep 2010 02:16:17 -0700 (PDT), The Frog wrote, quoted or
    indirectly quoted someone who said:

    >An example is below....


    That looks like an ordinary CSV file. See
    http://mindprod.com/jgloss/csv.html
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    You encapsulate not just to save typing, but more importantly, to make it easy and safe to change the code later, since you then need change the logic in only one place. Without it, you might fail to change the logic in all the places it occurs.
     
    Roedy Green, Oct 1, 2010
    #4
  5. The Frog

    The Frog Guest

    Hi Roedy,

    Thanks for the input. I at first thought the same, but it turns out
    that the structure of the file is not as regular as a CSV. Each of these
    'sentences' has a different number of fields, and the fields per
    sentence have their own unique order. Unfortunately it's not as nicely
    regular as a CSV - that would have been wonderful!

    I have managed to discern from the file's descriptor document that
    there are 14 different types of sentence:
    - Project (247 Fields)
    - Planogram (259 Fields)
    - Fixture (177 Fields)
    - Product (320 Fields)
    - Position (180 Fields)
    - Divider (22 Fields)
    - Supplier (31 Fields)
    - Segment (54 Fields)
    - Performance (189 Fields)
    - Drawing (50 Fields)
    - EmbeddedObject (15 Fields)
    - Configuration (494 Fields)
    - Peg (16 Fields)
    - Point (4 Fields)
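
    The field counts listed above could be captured as a lookup table
    and used as a sanity check while parsing. This is only a sketch:
    the class name FieldCounts is hypothetical, the counts are copied
    from the list as given (presumably for a single version of the
    format), and whether they include the values on the metadata line
    is an assumption that would need checking against real files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical table of expected field counts per sentence type. */
final class FieldCounts {
    static final Map<String, Integer> EXPECTED = new LinkedHashMap<>();
    static {
        // Counts as listed in the post; not verified against any specification.
        EXPECTED.put("Project", 247);
        EXPECTED.put("Planogram", 259);
        EXPECTED.put("Fixture", 177);
        EXPECTED.put("Product", 320);
        EXPECTED.put("Position", 180);
        EXPECTED.put("Divider", 22);
        EXPECTED.put("Supplier", 31);
        EXPECTED.put("Segment", 54);
        EXPECTED.put("Performance", 189);
        EXPECTED.put("Drawing", 50);
        EXPECTED.put("EmbeddedObject", 15);
        EXPECTED.put("Configuration", 494);
        EXPECTED.put("Peg", 16);
        EXPECTED.put("Point", 4);
    }

    /** True only if the type is known and the count matches the table. */
    static boolean matches(String type, int actualFieldCount) {
        Integer expected = EXPECTED.get(type);
        return expected != null && expected == actualFieldCount;
    }
}
```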

    These represent objects in a hierarchical order. I am still working out
    which ones belong to which, but everything belongs to a project (top
    level). My hope is to develop a package that can parse
    these files and deliver to an application a set of objects
    that represents the 'real world' structure the data file describes.
    In researching this file structure, it turns out that it has
    undergone some revisions over the years. Part of what I am trying to
    figure out is how I can provide an XML document as part of a Java
    package that would contain the necessary descriptive information to
    allow the class objects to be built on the fly. The parser would
    determine the data file's version number, 'load' the appropriate XML
    schema document so it could generate the appropriate objects, then
    parse the data file and return to the calling code a set of objects
    that represent the contents of the data file. If a new version of the
    data file is released, it is much easier to add a new XML schema
    document to the package than to rewrite the code.
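
    One way to picture the descriptor idea, as a sketch only: the XML
    layout (format, paragraph and field elements with type and name
    attributes) and the class name FormatDescriptor are invented here
    for illustration. Rather than generating classes on the fly, this
    simpler variant binds each value to a field name in a Map, which
    sidesteps code generation at the cost of type safety. Choosing the
    descriptor by version string is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical descriptor: ordered field names for each paragraph type. */
final class FormatDescriptor {
    private final Map<String, List<String>> fieldNames = new LinkedHashMap<>();

    /** Reads the invented layout: format > paragraph[type] > field[name]. */
    static FormatDescriptor fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            FormatDescriptor descriptor = new FormatDescriptor();
            NodeList paragraphs = doc.getElementsByTagName("paragraph");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element paragraph = (Element) paragraphs.item(i);
                List<String> names = new ArrayList<>();
                NodeList fields = paragraph.getElementsByTagName("field");
                for (int j = 0; j < fields.getLength(); j++) {
                    names.add(((Element) fields.item(j)).getAttribute("name"));
                }
                descriptor.fieldNames.put(paragraph.getAttribute("type"), names);
            }
            return descriptor;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read descriptor", e);
        }
    }

    /** Pairs each value with the field name at the same position. */
    Map<String, String> bind(String type, List<String> values) {
        List<String> names = fieldNames.get(type);
        if (names == null) {
            throw new IllegalArgumentException("Unknown paragraph type: " + type);
        }
        if (names.size() != values.size()) {
            throw new IllegalArgumentException(type + ": expected " + names.size()
                    + " values, got " + values.size());
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            record.put(names.get(i), values.get(i));
        }
        return record;
    }
}
```

    Supporting a new file version would then mean adding one more
    descriptor document, with the parsing code left unchanged.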

    Tom has given me a great head start in doing the 'physical' side of
    the parsing, but I am still at a loss for the XML to Object side. Is
    there anything that you might point me to?

    Cheers

    The Frog
     
    The Frog, Oct 4, 2010
    #5
  6. Roedy Green

    Roedy Green Guest

    On Mon, 4 Oct 2010 00:01:55 -0700 (PDT), The Frog wrote, quoted or
    indirectly quoted someone who said:

    >Each of these
    >'sentences' has a different number of fields, and the fields per
    >sentence have their own unique order. Unfortunately it's not as nicely
    >regular as a CSV - that would have been wonderful!


    You can read that with my CSV package. Just read the first field on
    the line, then branch to the code to read that format of line.
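
    The branching step might look something like the sketch below. It
    uses plain String.split instead of the CSV package mentioned above,
    so it does not handle quoted commas. LineDispatcher and its methods
    are hypothetical names, and the sketch assumes the type is the
    first value on the metadata line, with the data on the line after.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher: routes a data line to the handler for its type. */
final class LineDispatcher {
    private final Map<String, Function<List<String>, Object>> handlers = new HashMap<>();

    /** Registers the code that knows how to read one type of line. */
    void register(String type, Function<List<String>, Object> handler) {
        handlers.put(type, handler);
    }

    /** Reads the first field of the metadata line, then branches on it. */
    Object dispatch(String metadataLine, String dataLine) {
        String type = metadataLine.split(",", -1)[0];
        Function<List<String>, Object> handler = handlers.get(type);
        if (handler == null) {
            throw new IllegalArgumentException("No handler for type: " + type);
        }
        // The -1 limit keeps trailing empty values, so positions stay aligned.
        return handler.apply(Arrays.asList(dataLine.split(",", -1)));
    }
}
```

    Each handler would build the object for its own line type, for
    example one registered under "Segment" that reads the Segment
    fields by position.
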
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

     
    Roedy Green, Oct 4, 2010
    #6
  7. The Frog

    The Frog Guest

    Hi Roedy,

    Nice tool, thank you for sharing this with the world. Looks like it is
    up to the task of reading the file itself, but I still have no idea
    how to feed some form of structural information to an app so that it
    can build objects from that structural data and populate the members /
    fields with the values in the data file. I have sniffed around
    serialization but am not sure if this is the way to go, or perhaps
    there is a 'better' approach.

    As I see it, the package will evolve with the evolution of the data
    files themselves. As a new format becomes available / known, some form
    of descriptive document is added to the package that allows the
    package to correctly interpret the data file and build appropriate
    objects. I am just not sure how to approach this last part. Is there
    anything you can point me to that might help me solve this?

    Cheers (and many thanks)

    The Frog
     
    The Frog, Oct 5, 2010
    #7
  8. Lew

    Lew Guest

    The Frog wrote:
    > Nice tool, thank you for sharing this with the world. Looks like it is
    > up to the task of reading the file itself, but I still have no idea
    > how to feed some form of structural information to an app so that it
    > can build objects from that structural data and populate the members /
    > fields with the values in the data file. I have sniffed around
    > serialization but am not sure if this is the way to go, or perhaps
    > there is a 'better' approach.
    >
    > As I see it, the package will evolve with the evolution of the data
    > files themselves. As a new format becomes available / known, some form
    > of descriptive document is added to the package that allows the
    > package to correctly interpret the data file and build appropriate
    > objects. I am just not sure how to approach this last part. Is there
    > anything you can point me to that might help me solve this?


    Your questions are the logical guideposts for your next iteration. You are
    showing good software design sense.

    Forgive my taking a side topic here, but often it pays big benefits to
    consider a problem in holistic terms and let analysis control your thinking.

    Instead of considering implementations - serialization, CSV, a 'better'
    approach (how can you even tell?) - without understanding the behavior these
    specific implementations must support, document the behavior itself.

    For your application I would draw a state diagram, among other things.

    Based on that diagram, I'd write a code framework - I wouldn't pick libraries
    to support an unwritten code framework for an undocumented algorithm.

    To support the framework you likely will benefit most from a lower-level
    library that does part or most of what you want, but not all, encapsulated in
    custom code that handles the particulars of your situation, matching your
    state diagram and other documentation.
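
    As an illustration of how a state diagram can carry over into code,
    here is one possible set of states for reading such a file line by
    line. It is a sketch under assumptions: the state names are
    invented, and the transitions follow the layout described earlier
    in this thread (file header, info lines starting with a semicolon,
    then alternating metadata and data lines), which has not been
    checked against real files.

```java
/** Hypothetical parser states; each line read moves to the next state. */
enum ParseState {
    EXPECT_HEADER, EXPECT_INFO_OR_METADATA, EXPECT_METADATA, EXPECT_DATA;

    /** Returns the state after consuming one line, or throws if the line is not allowed here. */
    ParseState next(String line) {
        switch (this) {
            case EXPECT_HEADER:
                if (line.equals("PROSPACE SCHEMATIC FILE")) {
                    return EXPECT_INFO_OR_METADATA;
                }
                throw new IllegalStateException("Expected file header, got: " + line);
            case EXPECT_INFO_OR_METADATA:
                // Info lines such as "; Version 2006.3.2" may repeat before the first paragraph.
                return line.startsWith(";") ? EXPECT_INFO_OR_METADATA : EXPECT_DATA;
            case EXPECT_METADATA:
                if (line.startsWith(";")) {
                    throw new IllegalStateException("Info line after first paragraph: " + line);
                }
                return EXPECT_DATA;
            case EXPECT_DATA:
                return EXPECT_METADATA;
            default:
                throw new AssertionError("Unhandled state: " + this);
        }
    }
}
```

    A file that ends in EXPECT_DATA would have a metadata line with no
    data line, which is the kind of case a diagram makes visible early.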

    Your approach of understanding candidate libraries and iteratively
    refining your solution is a good one. But there is no one-size-fits-all
    'better' approach independent of the purpose at hand. Your situation is
    one for which the 'better' approach is to analyze and document the
    problem in detail prior to coding, and to consider custom code in the mix.

    That's a pattern for you to solve your own problem rather than a solution to
    your problem. Others in this thread have already excelled at suggesting
    particulars; my aim is to describe literally what you asked for, an approach.

    --
    Lew
     
    Lew, Oct 5, 2010
    #8
  9. The Frog

    The Frog Guest

    Lew,

    Thank you for the guidance. It is indeed approaching the scenario
    top-down rather than bottom-up, and your words make a lot
    of sense. I will do as you have suggested and come back another time
    (thread) with more specific issues.

    To all who have helped guide me, a very big thank you.

    Cheers

    The Frog
     
    The Frog, Oct 6, 2010
    #9
