Legacy data parsing

Discussion in 'Python' started by gov, Jul 8, 2005.

  1. gov

    gov Guest

    Hi,

    I've just started to learn programming and was told this was a good
    place to ask questions :)

    Where I work, we receive large quantities of data which is currently
    all printed on large, obsolete, dot matrix printers. This is a problem
    because the replacement parts will not be available for much longer.

    So I'm trying to create a program which will capture the fixed width
    text file data and convert as well as sort the data (there are several
    different report types) into a different format which would allow it to
    be printed normally, or viewed on a computer.

    I've been reading up on the Regular Expression module and ways in which
    to manipulate strings however it has been difficult to think of a way
    in which to extract an address.

    Here's an example of the raw text that I have to work with:


    ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    ****************************

    FOR/POUR AL/LA: 20
    CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
    LANG: E CONS/REGR: #######
    MRS XXX X XXXXXXX
    ### XXXXXXXXX ST DD TYP: P:6
    CHNGD/CHANG
    MONCTON NB LANG: E CONS/REGR:
    #######
    MRS XXX X XXXXXXX
    #####
    ####
    ###-###-#

    ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    ****************************

    FOR/POUR AL/LA: 30
    BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
    LANG: E CONS/REGR: #######
    MISS XXXX XXXXX
    ### XXXXXXXX ST
    MONCTON NB

    EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
    ***********

    (the # = any number, and the X's are just regular text)
    I would like to extract the address information, but the two different
    text objects on the right hand side are difficult to remove. I think
    it would be easier if I could just extract a fixed square of
    information, but I don't have a clue as to how to go about it.

    If anyone could give me suggestions as to methods in sorting this type
    of data, it would be appreciated.
     
    gov, Jul 8, 2005
    #1

  2. gov

    Jeremy Jones Guest

    gov wrote:

    >Hi,
    >
    >
    >

    <snip>

    >If anyone could give me suggestions as to methods in sorting this type
    >of data, it would be appreciated.
    >
    >
    >

    Maybe it's overkill, but I'd *highly* recommend David Mertz's excellent
    book "Text Processing in Python": http://gnosis.cx/TPiP/ I don't know
    everything you need to do, but that small snippet smells like it needs
    a state machine, and the book has an excellent, simple one in (I
    think) chapter 4.

    Jeremy Jones
     
    Jeremy Jones, Jul 8, 2005
    #2

  3. gov

    Miki Tebeka Guest

    Hello gov,

    > Here's an example of the raw text that I have to work with:
    >
    >
    > ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    > ****************************
    >
    > FOR/POUR AL/LA: 20
    > CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
    > LANG: E CONS/REGR: #######
    > MRS XXX X XXXXXXX
    > ### XXXXXXXXX ST DD TYP: P:6
    > CHNGD/CHANG
    > MONCTON NB LANG: E CONS/REGR:
    > #######
    > MRS XXX X XXXXXXX
    > #####
    > ####
    > ###-###-#
    >
    > ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    > ****************************
    >
    > FOR/POUR AL/LA: 30
    > BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
    > LANG: E CONS/REGR: #######
    > MISS XXXX XXXXX
    > ### XXXXXXXX ST
    > MONCTON NB
    >
    > EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
    > ***********
    >
    > (the # = any number, and the X's are just regular text)
    > I would like to extract the address information, but the two different
    > text objects on the right hand side are difficult to remove. I think
    > it would be easier if I could just extract a fixed square of
    > information, but I don't have a clue as to how to go about it.
    >
    > If anyone could give me suggestions as to methods in sorting this type
    > of data, it would be appreciated.

    Maybe regular expressions are too difficult for this. I'd try one of the
    parsing toolkits (such as PLY or PyParsing); one of them might be more
    suitable for the job.

    HTH.
    --
    ------------------------------------------------------------------------
    Miki Tebeka <>
    http://tebeka.bizhat.com
    The only difference between children and adults is the price of the toys

     
    Miki Tebeka, Jul 8, 2005
    #3
  4. gov wrote:
    > Hi,
    >
    > I've just started to learn programming and was told this was a good
    > place to ask questions :)
    >
    > Where I work, we receive large quantities of data which is currently
    > all printed on large, obsolete, dot matrix printers. This is a problem
    > because the replacement parts will not be available for much longer.
    >
    > So I'm trying to create a program which will capture the fixed width
    > text file data and convert as well as sort the data (there are several
    > different report types) into a different format which would allow it to
    > be printed normally, or viewed on a computer.


    Are these reports all of the same page-wise format, with fixed-width
    columns? If so, then the suggestion about a state machine sounds good
    -- just run a state machine to figure out which linetype you're on, then
    extract the fixed width fields via slices.

    name = line[x:y]

    If that doesn't work, then pyparsing or DParser might work for you as a
    more general-purpose parser.
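
A minimal sketch of the state-machine-plus-slices idea, not a definitive implementation. The column range in `ADDRESS_COLS` and the function name `parse_addresses` are made up for illustration; the real column positions would have to come from the actual report layout.

```python
# Hypothetical layout: assume the address text occupies columns 0-29 and
# anything further right belongs to the TYP/LANG/CONS fields.
ADDRESS_COLS = slice(0, 30)

def parse_addresses(text):
    """Walk the report line by line, tracking which section we are in."""
    state = "outside"
    records = []
    for line in text.splitlines():
        if line.startswith("ADDRESS INFORMATION"):
            state = "address"          # a new address record begins
            records.append([])
        elif line.startswith("EARNINGS VITAL INFORMATION"):
            state = "earnings"         # stop collecting address lines
        elif state == "address" and not line.startswith("*"):
            field = line[ADDRESS_COLS].strip()
            if field:                  # skip lines with a blank left-hand column
                records[-1].append(field)
    return records
```

Each address record comes back as a list of left-hand column strings, so the right-hand fields never get mixed in as long as the assumed column boundary is correct.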
     
    Christopher Subich, Jul 8, 2005
    #4
  5. "gov" <> wrote in message
    news:...
    > Hi,
    >
    > I've just started to learn programming and was told this was a good
    > place to ask questions :)
    >
    > Where I work, we receive large quantities of data which is currently
    > all printed on large, obsolete, dot matrix printers. This is a problem
    > because the replacement parts will not be available for much longer.
    >
    > So I'm trying to create a program which will capture the fixed width
    > text file data and convert as well as sort the data (there are several
    > different report types) into a different format which would allow it to
    > be printed normally, or viewed on a computer.


    Text file data has no concept of "fixed width". Somewhere in your system,
    text file data is being thrown at your dot matrix printer. It would seem a
    trivial exercise to simply plug in a newer and probably inexpensive
    replacement printer.

    What am I missing here?

    > I've been reading up on the Regular Expression module and ways in which
    > to manipulate strings however it has been difficult to think of a way
    > in which to extract an address.
    >
    > Here's an example of the raw text that I have to work with:
    >

    <snip>

    How are you intercepting this text data?
    Are you replacing your old printer with a Python speaking computer?
    How will you deliver this data to your Python program?

    > (the # = any number, and the X's are just regular text)
    > I would like to extract the address information, but the two different
    > text objects on the right hand side are difficult to remove. I think
    > it would be easier if I could just extract a fixed square of
    > information, but I don't have a clue as to how to go about it.


    Assuming you know how your Python code will "see" this data -

    You would need no more than standard Python string handling to perform these
    tasks.

    There is no concept of a "fixed square" here. This is a continuous stream
    of (probably ASCII) characters. If you could pick the data up from a file,
    you would use readlines() to build a list of individual lines. If you were
    picking the data up from a serial port, you might assemble the whole thing
    into one big string and use split('\n') to build your list of lines.

    Once you had a full record (print page?) as a list of individual lines, you
    could identify each line by its position in the list *if*, as is likely,
    each item arrives at the same line position. If not, your code can read
    each line and test. For example:
    The line
    "#######"
    seems to immediately precede several address lines
    " MRS XXX X XXXXXXX"
    " #####"
    " ####"
    " ###-###-#"

    If you can rely on this, you would know that the line "#######" is
    immediately followed by several lines of an address, up until the empty
    line. And you can look at each of those address lines and use strip() to
    remove leading and trailing blanks.

    Similarly, the line that begins " LANG:" would seem to immediately precede
    another address.
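
A rough sketch of the marker-line approach described above, using only standard string handling plus one small regex. It assumes the "#######" marker is a line of exactly seven digits and that the first blank line ends the address; both are guesses from the sample, and `addresses_after_marker` is an illustrative name.

```python
import re

# Assumption: the marker line consists of exactly seven digits.
MARKER = re.compile(r"^\s*\d{7}\s*$")

def addresses_after_marker(text):
    """Collect the stripped lines that follow each marker, up to a blank line."""
    found = []
    current = None                     # None means "not inside an address"
    for line in text.split("\n"):
        if current is None:
            if MARKER.match(line):
                current = []           # the next lines belong to an address
        elif line.strip():
            current.append(line.strip())
        else:
            found.append(current)      # a blank line closes the address
            current = None
    if current is not None:            # input ended without a blank line
        found.append(current)
    return found
```

If the marker turns out not to be reliable, the same loop could test for the line beginning with "LANG:" instead.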

    None of this is particularly difficult with standard Python.
    But then - if we are merely replacing an old printer -

    We are already working way too hard!
    Thomas Bartkus
     
    Thomas Bartkus, Jul 8, 2005
    #5
  6. gov

    Guest


    >
    > Where I work, we receive large quantities of data which is currently
    > all printed on large, obsolete, dot matrix printers. This is a problem
    > because the replacement parts will not be available for much longer.
    >
    > So I'm trying to create a program which will capture the fixed width
    > text file data and convert as well as sort the data (there are several
    > different report types) into a different format which would allow it to
    > be printed normally, or viewed on a computer.
    >


    Do you have access to the programs that generate these reports? If so,
    it's probably a simple fixed format, and you can pull the fields out
    with the slice operator (e.g. name = line[30:40]), no regular
    expressions necessary. I've done this in a couple of cases, and it's
    easy *if* you know exactly what the report format is.
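
As an illustration of the slice approach, the field positions can be kept in one table so they are easy to correct once the real format is known. The names and column numbers in `LAYOUT` below are invented for the example, not taken from the actual report.

```python
# Hypothetical layout: field name -> (start, end) columns.
LAYOUT = {
    "name": (0, 30),
    "lang": (30, 38),
    "cons": (38, 50),
}

def split_fields(line, layout=LAYOUT):
    """Cut one fixed-width line into a dict of stripped field values."""
    width = max(end for _, end in layout.values())
    padded = line.ljust(width)         # short lines yield empty fields
    return {field: padded[start:end].strip()
            for field, (start, end) in layout.items()}
```

Padding the line first means a short line simply produces empty strings for the fields it does not reach.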

    Or, consider using another tool. I've also used Monarch (a purchased
    program) for parsing reports, and it works well on most formats.

    Brian.
     
    , Jul 8, 2005
    #6
  7. On 8 Jul 2005 11:31:14 -0700, "gov" <> wrote:

    >Hi,
    >
    >I've just started to learn programming and was told this was a good
    >place to ask questions :)
    >
    >Where I work, we receive large quantities of data which is currently
    >all printed on large, obsolete, dot matrix printers. This is a problem
    >because the replacement parts will not be available for much longer.
    >
    >So I'm trying to create a program which will capture the fixed width
    >text file data and convert as well as sort the data (there are several
    >different report types) into a different format which would allow it to
    >be printed normally, or viewed on a computer.
    >
    >I've been reading up on the Regular Expression module and ways in which
    >to manipulate strings however it has been difficult to think of a way
    >in which to extract an address.
    >
    >Here's an example of the raw text that I have to work with:
    >
    >
    >ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    >****************************
    >
    >FOR/POUR AL/LA: 20
    > CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
    > LANG: E CONS/REGR: #######
    > MRS XXX X XXXXXXX
    > ### XXXXXXXXX ST DD TYP: P:6
    >CHNGD/CHANG
    > MONCTON NB LANG: E CONS/REGR:
    >#######
    > MRS XXX X XXXXXXX
    > #####
    > ####
    > ###-###-#
    >
    >ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    >****************************
    >
    >FOR/POUR AL/LA: 30
    > BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
    > LANG: E CONS/REGR: #######
    > MISS XXXX XXXXX
    > ### XXXXXXXX ST
    > MONCTON NB
    >
    >EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
    >***********
    >
    >(the # = any number, and the X's are just regular text)
    >I would like to extract the address information, but the two different
    >text objects on the right hand side are difficult to remove. I think
    >it would be easier if I could just extract a fixed square of
    >information, but I don't have a clue as to how to go about it.
    >
    >If anyone could give me suggestions as to methods in sorting this type
    >of data, it would be appreciated.
    >

    If this is all fixed-width font characters and fixed record formats, you
    might get some ideas about extracting a "fixed square". I re-joined the
    strings of the fixed square with '\n'.join(<lines_of_the_square>) to print it,
    but you could extract data from the lines in various ways with regexes and such.

    I used your data example and added some under the alternate header.
    (Not tested beyond what you see ;-)

    ----< legacy_data_parsing.py >---------------------------------------------------
    data = """\
    ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    ****************************

    FOR/POUR AL/LA: 20
    CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
    LANG: E CONS/REGR: #######
    MRS XXX X XXXXXXX
    ### XXXXXXXXX ST DD TYP: P:6
    CHNGD/CHANG
    MONCTON NB LANG: E CONS/REGR:
    #######
    MRS XXX X XXXXXXX
    #####
    ####
    ###-###-#

    ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    ****************************

    FOR/POUR AL/LA: 30
    BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
    LANG: E CONS/REGR: #######
    MISS XXXX XXXXX
    ### XXXXXXXX ST
    MONCTON NB

    EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
    ***********
    1 [Don't know what [<- 1,34 This is a box of
    2 goes in this kind text with top/left
    3 of record, but this character row/col 1,34
    4 is some text to show and bottom/right at 4,62 ->]
    5 how it might get
    6 extracted]

    """

    record_headers = [
    """\
    ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
    """,
    """\
    EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
    """
    ]

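    import re

    # Split on the known record headers; the capturing group keeps each
    # header in the resulting list, so headers land at the odd indexes.
    recsplitter = re.compile('('+ '|'.join(map(re.escape,record_headers))+')')

    def extract_block(tl, br, data):
        # tl and br are (row, col) of the top-left and bottom-right corners,
        # both inclusive. Lines are padded so short ones still slice cleanly.
        lines = [s.ljust(br[1]+1) for s in data.splitlines()]
        return '\n'.join([line[tl[1]:br[1]+1] for line in lines[tl[0]:br[0]+1]])

    # Single-argument print(...) behaves the same under Python 2 and 3.
    for i, hdr_or_body in enumerate(recsplitter.split(data)):
        if i==0:
            print('='*10 + ' file prefix ' + '='*30)
            data_type = ''
        elif i%2:
            print('='*10 + ' record hdr ' + '='*30)
            data_type = hdr_or_body
        else:
            print('='*10 + ' record data ' + '='*30)
        print(hdr_or_body)
        print('='*10)
        if not i%2 and data_type == record_headers[1]: # EARNINGS etc
            print('---- earnings record right block ----')
            print(extract_block((1,34),(4,62), hdr_or_body))
            print('----')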
    ---------------------------------------------------------------------------------

    Produces:

    [15:33] C:\pywk\clp>py24 legacy_data_parsing.py
    ========== file prefix ==============================

    ==========
    ========== record hdr ==============================
    ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

    ==========
    ========== record data ==============================
    ****************************

    FOR/POUR AL/LA: 20
    CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
    LANG: E CONS/REGR: #######
    MRS XXX X XXXXXXX
    ### XXXXXXXXX ST DD TYP: P:6
    CHNGD/CHANG
    MONCTON NB LANG: E CONS/REGR:
    #######
    MRS XXX X XXXXXXX
    #####
    ####
    ###-###-#


    ==========
    ========== record hdr ==============================
    ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

    ==========
    ========== record data ==============================
    ****************************

    FOR/POUR AL/LA: 30
    BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
    LANG: E CONS/REGR: #######
    MISS XXXX XXXXX
    ### XXXXXXXX ST
    MONCTON NB


    ==========
    ========== record hdr ==============================
    EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:

    ==========
    ========== record data ==============================
    ***********
    1 [Don't know what [<- 1,34 This is a box of
    2 goes in this kind text with top/left
    3 of record, but this character row/col 1,34
    4 is some text to show and bottom/right at 4,62 ->]
    5 how it might get
    6 extracted]


    ==========
    ---- earnings record right block ----
    [<- 1,34 This is a box of
    text with top/left
    character row/col 1,34
    and bottom/right at 4,62 ->]
    ----

    HTH

    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 8, 2005
    #7
  8. gov

    Jorgen Grahn Guest

    On Fri, 8 Jul 2005 15:03:45 -0500, Thomas Bartkus <> wrote:
    > "gov" <> wrote in message
    > news:...
    >> Hi,
    >>
    >> I've just started to learn programming and was told this was a good
    >> place to ask questions :)
    >>
    >> Where I work, we receive large quantities of data which is currently
    >> all printed on large, obsolete, dot matrix printers. This is a problem
    >> because the replacement parts will not be available for much longer.
    >>
    >> So I'm trying to create a program which will capture the fixed width
    >> text file data and convert as well as sort the data (there are several
    >> different report types) into a different format which would allow it to
    >> be printed normally, or viewed on a computer.

    >
    > Text file data has no concept of "fixed width". Somewhere in your system,
    > text file data is being thrown at your dot matrix printer. It would seem a
    > trivial exercise to simply plug in a newer and probably inexpensive
    > replacement printer.
    >
    > What am I missing here?


    I was just wondering the same thing.

    Until we get an answer, here are two hypotheses:

    - The text file is too wide for modern-day laser printers to print properly,
      or the printer isn't configured to accept plain text (accented characters,
      line feeds and so on).
      -> feed it through 'enscript' or a similar utility, which can
         scale it down and manipulate it in various ways into a PostScript
         file, and print that one
    - The text file isn't really a text file, but full of escape codes for
      the matrix printer (boldfacing and so on).
      -> attempt to clean it with a utility like the standard unix 'col' command
      -> ... and/or write custom code to do it. Python is a good choice.
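
For the custom-code route, a small sketch of what such cleaning might look like. Which codes the matrix printer actually uses is unknown here, so this assumes two common cases only: overstriking with backspace, and escape sequences of ESC plus a single command character. The name `strip_printer_codes` is illustrative.

```python
import re

def strip_printer_codes(raw):
    """Remove simple printer control codes, keeping newlines and tabs."""
    # Overstrike: drop a character together with the backspace after it,
    # so only the character printed last in that position survives.
    text = re.sub(r".\x08", "", raw)
    # Assumption: each escape sequence is ESC plus one command character.
    text = re.sub(r"\x1b.", "", text)
    # Drop the remaining control characters (including CR and form feed).
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
```

Longer escape sequences with parameters would need their own patterns once the printer's command set is known.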

    In general, this is an area where it's wise to use existing software.
    The hard part is to know what's available!

    /Jorgen

    --
    // Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
    \X/ algonet.se> R'lyeh wgah'nagl fhtagn!
     
    Jorgen Grahn, Jul 9, 2005
    #8
  9. On Friday 08 July 2005 01:31 pm, gov wrote:
    > Where I work, we receive large quantities of data which is currently
    > all printed on large, obsolete, dot matrix printers. This is a problem
    > because the replacement parts will not be available for much longer.
    >
    > So I'm trying to create a program which will capture the fixed width
    > text file data and convert as well as sort the data (there are several
    > different report types) into a different format which would allow it to
    > be printed normally, or viewed on a computer.


    If this is really your reason for wanting to do this, it seems like your
    solution is overkill. If you really just want the data to get
    reformatted for printing on a modern printer, it would be trivial to
    do this with a text-formatter like "enscript" (see, e.g.:
    http://people.ssh.com/mtr/genscript/ ) which produces Postscript
    output from ASCII text.

    On a typical Linux system, this sort of tool is usually part of your
    printer installation, after which it runs more or less invisibly.

    OTOH, if the *real* reason is that you don't like the look of the
    dot matrix output and you want it *rearranged* and reformatted
    for aesthetic reasons, then you might reasonably want to use
    Python to do that as you suggest.


    --
    Terry Hancock ( hancock at anansispaceworks.com )
    Anansi Spaceworks http://www.anansispaceworks.com
     
    Terry Hancock, Jul 11, 2005
    #9
  10. gov

    gov Guest

    Actually, we receive the data in the form of a text file. The original
    data is sent from an IBM mainframe then to Ottawa where it is captured
    by an "SNA Print Server that receives the VPS print jobs, writes them
    to disk and then runs a PERL script program on the disk file. This
    PERL script program scans the file's VPS banner page for key words
    (e.g. JobName, Destination, Form) and then creates a Plain Text and a
    Rich Text Format (RTF)." This system is available Nationally for every
    region in Canada. It is unfortunate that our government has been so
    slow in updating such an old process.

    Since I don't really know (or have access to) the inner workings of the
    mainframe or the conversion process, I can't really do much there.

    The reason I don't wish to simply replace the printer, or simply convert
    the data so it can be printed on newer printers, is that the data will
    also be used to automate tasks (such as creating form letters to
    clients).
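
As a sketch of that last step, once an address has been extracted as a list of lines it could be dropped into a letter with the standard library's `string.Template`. The letter wording, the `make_letter` name and the assumption that the first three lines are name, street and city are all invented for illustration.

```python
from string import Template

# Hypothetical letter text; the real wording would come from the office.
LETTER = Template(
    "$name\n$street\n$city\n\n"
    "Dear $name,\n\n"
    "Our records show the address above.\n"
)

def make_letter(address_lines):
    """Fill the template from the first three extracted address lines."""
    # Assumption: lines arrive in the order name, street, city.
    name, street, city = address_lines[:3]
    return LETTER.substitute(name=name, street=street, city=city)
```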
     
    gov, Jul 11, 2005
    #10
