Legacy data parsing

gov

Hi,

I've just started to learn programming and was told this was a good
place to ask questions :)

Where I work, we receive large quantities of data which is currently
all printed on large, obsolete, dot matrix printers. This is a problem
because the replacement parts will not be available for much longer.

So I'm trying to create a program which will capture the fixed width
text file data and convert as well as sort the data (there are several
different report types) into a different format which would allow it to
be printed normally, or viewed on a computer.

I've been reading up on the Regular Expression module and ways in which
to manipulate strings; however, it has been difficult to think of a way
in which to extract an address.

Here's an example of the raw text that I have to work with:


ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 20
CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MRS XXX X XXXXXXX
### XXXXXXXXX ST DD TYP: P:6
CHNGD/CHANG
MONCTON NB LANG: E CONS/REGR:
#######
MRS XXX X XXXXXXX
#####
####
###-###-#

ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 30
BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MISS XXXX XXXXX
### XXXXXXXX ST
MONCTON NB

EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
***********

(the # = any number, and the X's are just regular text)
I would like to extract the address information, but the two different
text objects on the right hand side are difficult to remove. I think
it would be easier if I could just extract a fixed square of
information, but I don't have a clue as to how to go about it.

If anyone could give me suggestions as to methods in sorting this type
of data, it would be appreciated.
 
Jeremy Jones

gov said:
If anyone could give me suggestions as to methods in sorting this type
of data, it would be appreciated.

Maybe it's overkill, but I'd *highly* recommend David Mertz's excellent
book "Text Processing in Python": http://gnosis.cx/TPiP/ I don't know
everything you need to do, but that small snippet smells like it needs
a state machine, and the book has an excellent, simple one in (I
think) chapter 4.

Jeremy Jones
 
Miki Tebeka

Hello gov,
Maybe regular expressions are too difficult for this. I'd try one of the
parsing toolkits (such as PLY or PyParsing); they might be more suitable
for the job.

HTH.
--
------------------------------------------------------------------------
Miki Tebeka <[email protected]>
http://tebeka.bizhat.com
The only difference between children and adults is the price of the toys

 
Christopher Subich

gov said:
Hi,

I've just started to learn programming and was told this was a good
place to ask questions :)

Where I work, we receive large quantities of data which is currently
all printed on large, obsolete, dot matrix printers. This is a problem
because the replacement parts will not be available for much longer.

So I'm trying to create a program which will capture the fixed width
text file data and convert as well as sort the data (there are several
different report types) into a different format which would allow it to
be printed normally, or viewed on a computer.

Are these reports all in the same page-wise format, with fixed-width
columns? If so, then the suggestion about a state machine sounds good:
just run a state machine to figure out which line type you're on, then
extract the fixed-width fields via slices.

name = line[x:y]

If that doesn't work, then pyparsing or DParser might work for you as a
more general-purpose parser.
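
A minimal sketch of that idea, assuming the header lines from the sample mark the start of each record type. The column offsets in FIELDS, the field names, and the earnings line are hypothetical; the real offsets would have to be measured from the actual reports.

```python
# Hypothetical layout: each record type maps field names to (start, end)
# column slices. These offsets are made up for illustration.
FIELDS = {
    "ADDRESS": {"name": (0, 20), "city": (20, 32)},
    "EARNINGS": {"year": (0, 4), "amount": (5, 15)},
}


def parse(lines):
    """Tiny state machine: headers switch the state, other lines get sliced."""
    state = None  # current record type; None until the first header is seen
    records = []
    for line in lines:
        if line.startswith("ADDRESS INFORMATION"):
            state = "ADDRESS"
        elif line.startswith("EARNINGS VITAL INFORMATION"):
            state = "EARNINGS"
        elif state and line.strip() and not line.startswith("*"):
            # A data line: cut out every field defined for the current state.
            record = {"type": state}
            for field, (start, end) in FIELDS[state].items():
                record[field] = line[start:end].strip()
            records.append(record)
    return records


sample = [
    "ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:",
    "****************************",
    "MRS XXX X XXXXXXX   MONCTON NB",
    "EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:",
    "***********",
    "2004   12345.67",
]
records = parse(sample)
for record in records:
    print(record)
```

Keeping the offsets in one table means a changed report layout only needs a change to FIELDS, not to the parsing loop.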
 
Thomas Bartkus

gov said:
Hi,

I've just started to learn programming and was told this was a good
place to ask questions :)

Where I work, we receive large quantities of data which is currently
all printed on large, obsolete, dot matrix printers. This is a problem
because the replacement parts will not be available for much longer.

So I'm trying to create a program which will capture the fixed width
text file data and convert as well as sort the data (there are several
different report types) into a different format which would allow it to
be printed normally, or viewed on a computer.

Text file data has no concept of "fixed width". Somewhere in your system,
text file data is being thrown at your dot matrix printer. It would seem a
trivial exercise to simply plug in a newer and probably inexpensive
replacement printer.

What am I missing here?

I've been reading up on the Regular Expression module and ways in which
to manipulate strings however it has been difficult to think of a way
in which to extract an address.

Here's an example of the raw text that I have to work with:
<snip>

How are you intercepting this text data?
Are you replacing your old printer with a Python speaking computer?
How will you deliver this data to your Python program?

(the # = any number, and the X's are just regular text)
I would like to extract the address information, but the two different
text objects on the right hand side are difficult to remove. I think
it would be easier if I could just extract a fixed square of
information, but I don't have a clue as to how to go about it.

Assuming you know how your Python code will "see" this data -

You would need no more than standard Python string handling to perform these
tasks.

There is no concept of a "fixed square" here. This is a continuous stream
of (probably ASCII) characters. If you could pick the data up from a file,
you would use readlines() to build a list of individual lines. If you were
picking the data up from a serial port, you might assemble the whole thing into
one big string and use split('\n') to build your list of lines.
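
Both routes to a list of lines can be sketched as follows. The sample text and the helper names are made up for illustration, and io.StringIO only stands in for a real file on disk.

```python
import io

# Hypothetical stand-in for a captured report.
raw = "FOR/POUR AL/LA: 20\nMRS XXX X XXXXXXX\nMONCTON NB\n"


def lines_from_file(fileobj):
    """Read a file-like object into a list of lines, dropping the newlines."""
    return [line.rstrip("\n") for line in fileobj]


def lines_from_string(text):
    """Split one big string into lines; splitlines() drops the line endings."""
    return text.splitlines()


file_lines = lines_from_file(io.StringIO(raw))
string_lines = lines_from_string(raw)
print(file_lines == string_lines)
```

Unlike split('\n'), splitlines() does not leave an empty string at the end of the list when the text finishes with a newline.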

Once you had a full record (print page?) as a list of individual lines, you
could identify each line by its position in the list *if*, as is likely,
each item arrives at the same line position. If not, your code can read
each line and test. For example:
The line
"#######"
seems to immediately precede several address lines
" MRS XXX X XXXXXXX"
" #####"
" ####"
" ###-###-#"

If you can rely on this, you would know that the line "#######" is
immediately followed by several lines of an address, up until the empty
line. And you can look at each of those address lines and use strip() to
remove leading and trailing blanks.

Similarly, the line that begins " LANG:" would seem to immediately precede
another address.
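
A minimal sketch of that line-by-line test, assuming the marker really is a line holding exactly seven digits and that a blank line ends the address. The function name, the marker test, and the sample page are all hypothetical.

```python
def addresses_after_marker(lines, is_marker):
    """Collect each run of non-blank lines that directly follows a marker line."""
    addresses = []
    current = None  # None means "not inside an address right now"
    for line in lines:
        text = line.strip()  # remove leading and trailing blanks
        if current is not None:
            if text:
                current.append(text)
            else:
                # A blank line closes the address that was being collected.
                if current:
                    addresses.append(current)
                current = None
        elif is_marker(text):
            current = []
    if current:  # the input ended while an address was still open
        addresses.append(current)
    return addresses


def seven_digits(text):
    """Assumed marker test: the line is exactly seven digits."""
    return len(text) == 7 and text.isdigit()


page = [
    "MONCTON NB          LANG: E CONS/REGR:",
    "                    1234567",
    "                    MRS XXX X XXXXXXX",
    "                    12345",
    "                    1234",
    "                    123-456-7",
    "",
    "ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:",
]
found = addresses_after_marker(page, seven_digits)
print(found)
```

Passing the marker test in as a function keeps the loop reusable for other triggers, such as a line that begins with "LANG:".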

None of this is particularly difficult with standard Python.
But then - if we are merely replacing an old printer -

We are already working way too hard!
Thomas Bartkus
 
brian

Where I work, we receive large quantities of data which is currently
all printed on large, obsolete, dot matrix printers. This is a problem
because the replacement parts will not be available for much longer.

So I'm trying to create a program which will capture the fixed width
text file data and convert as well as sort the data (there are several
different report types) into a different format which would allow it to
be printed normally, or viewed on a computer.

Do you have access to the programs that generate these reports? If so,
it's probably a simple fixed format, and you can pull the fields out
with the slice operator (e.g. name = line[30:40]); no regular
expressions necessary. I've done this in a couple of cases, and it's
easy *if* you know exactly what the report format is.

Or, consider using another tool. I've also used Monarch (a purchased
program) for parsing reports, and it works well on most formats.

Brian.
 
Bengt Richter

If this is all fixed-width font characters and fixed record formats, you
might get some ideas about extracting a "fixed square". I re-joined the
strings of the fixed square with '\n'.join(<lines_of_the_square>) to print it,
but you could extract data from the lines in various ways with regexes and such.

I used your data example and added some under the alternate header.
(Not tested beyond what you see ;-)

----< legacy_data_parsing.py >---------------------------------------------------
data = """\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 20
CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MRS XXX X XXXXXXX
### XXXXXXXXX ST DD TYP: P:6
CHNGD/CHANG
MONCTON NB LANG: E CONS/REGR:
#######
MRS XXX X XXXXXXX
#####
####
###-###-#

ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 30
BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MISS XXXX XXXXX
### XXXXXXXX ST
MONCTON NB

EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
***********
1 [Don't know what [<- 1,34 This is a box of
2 goes in this kind text with top/left
3 of record, but this character row/col 1,34
4 is some text to show and bottom/right at 4,62 ->]
5 how it might get
6 extracted]

"""

record_headers = [
"""\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
""",
"""\
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
"""
]

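import re

# Split on the known record headers; the capturing group makes
# re.split() keep each header in the resulting list.
recsplitter = re.compile('(' + '|'.join(map(re.escape, record_headers)) + ')')

def extract_block(tl, br, data):
    # tl and br are 0-based (row, col) corners, both inclusive.
    # Pad every line so that short lines can still be sliced to full width.
    lines = [s.ljust(br[1] + 1) for s in data.splitlines()]
    return '\n'.join([line[tl[1]:br[1] + 1] for line in lines[tl[0]:br[0] + 1]])

# Single-argument print(...) behaves the same on Python 2 and Python 3.
for i, hdr_or_body in enumerate(recsplitter.split(data)):
    if i == 0:
        print('=' * 10 + ' file prefix ' + '=' * 30)
        data_type = ''
    elif i % 2:
        print('=' * 10 + ' record hdr ' + '=' * 30)
        data_type = hdr_or_body
    else:
        print('=' * 10 + ' record data ' + '=' * 30)
    print(hdr_or_body)
    print('=' * 10)
    if not i % 2 and data_type == record_headers[1]:  # EARNINGS etc
        print('---- earnings record right block ----')
        print(extract_block((1, 34), (4, 62), hdr_or_body))
        print('----')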
---------------------------------------------------------------------------------

Produces:

[15:33] C:\pywk\clp>py24 legacy_data_parsing.py
========== file prefix ==============================

==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

==========
========== record data ==============================
****************************

FOR/POUR AL/LA: 20
CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MRS XXX X XXXXXXX
### XXXXXXXXX ST DD TYP: P:6
CHNGD/CHANG
MONCTON NB LANG: E CONS/REGR:
#######
MRS XXX X XXXXXXX
#####
####
###-###-#


==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

==========
========== record data ==============================
****************************

FOR/POUR AL/LA: 30
BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MISS XXXX XXXXX
### XXXXXXXX ST
MONCTON NB


==========
========== record hdr ==============================
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:

==========
========== record data ==============================
***********
1 [Don't know what [<- 1,34 This is a box of
2 goes in this kind text with top/left
3 of record, but this character row/col 1,34
4 is some text to show and bottom/right at 4,62 ->]
5 how it might get
6 extracted]


==========
---- earnings record right block ----
[<- 1,34 This is a box of
text with top/left
character row/col 1,34
and bottom/right at 4,62 ->]
----

HTH

Regards,
Bengt Richter
 
Jorgen Grahn

Text file data has no concept of "fixed width". Somewhere in your system,
text file data is being thrown at your dot matrix printer. It would seem a
trivial exercise to simply plug in a newer and probably inexpensive
replacement printer.

What am I missing here?

I was just wondering the same thing.

Until we get an answer, here are two hypotheses:

- The text file is too wide for modern-day laser printers to print properly,
  or the printer isn't configured to accept plain text (accented characters,
  line feeds and so on).
  -> feed it through 'enscript' or a similar utility, which can
     scale it down and manipulate it in various ways into a PostScript
     file, and print that one
- The text file isn't really a text file, but full of escape codes for
  the matrix printer (boldfacing and so on).
  -> attempt to clean it with a utility like the standard Unix 'col' command
  -> ... and/or write custom code to do it. Python is a good choice.
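
The custom-code option could start out like the sketch below. It makes two loud assumptions that have to be checked against the real files: bold or underlined text is produced by backspace overstrikes, and every escape code is ESC followed by exactly one character. Real printer command sets often use longer sequences, so ESC_SEQ would need adjusting.

```python
import re

# Assumption: every escape sequence is ESC plus exactly one character.
ESC_SEQ = re.compile(r"\x1b.")
# Remaining control characters; tab (\x09) and newline (\x0a) are kept.
OTHER_CONTROLS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def apply_backspaces(line):
    """Resolve overstrikes: each backspace erases the character before it."""
    out = []
    for ch in line:
        if ch == "\b":
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


def clean_line(line):
    """Strip assumed escape codes, overstrikes and stray control characters."""
    line = ESC_SEQ.sub("", line)
    line = apply_backspaces(line)
    return OTHER_CONTROLS.sub("", line)


# Made-up input: ESC E / ESC F around a word, then "BOLD" printed by overstrike.
dirty = "\x1bETOTAL\x1bF: B\bBO\bOL\bLD\bD"
cleaned = clean_line(dirty)
print(cleaned)
```

Keeping only the last character written to each position is also what 'col -b' does, so the two approaches should agree on simple overstrikes.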

In general, this is an area where it's wise to use existing software.
The hard part is to know what's available!

/Jorgen
 
Terry Hancock

Where I work, we receive large quantities of data which is currently
all printed on large, obsolete, dot matrix printers. This is a problem
because the replacement parts will not be available for much longer.

So I'm trying to create a program which will capture the fixed width
text file data and convert as well as sort the data (there are several
different report types) into a different format which would allow it to
be printed normally, or viewed on a computer.

If this is really your reason for wanting to do this, it seems like your
solution is overkill. If you really just want the data to get
reformatted for printing on a modern printer, it would be trivial to
do this with a text formatter like "enscript" (see, e.g.:
http://people.ssh.com/mtr/genscript/ ) which produces PostScript
output from ASCII text.

On a typical Linux system, this sort of tool is usually part of your
printer installation, after which it runs more or less invisibly.

OTOH, if the *real* reason is that you don't like the look of the
dot matrix output and you want it *rearranged* and reformatted
for aesthetic reasons, then you might reasonably want to use
Python to do that as you suggest.
 
gov

Actually, we receive the data in the form of a text file. The original
data is sent from an IBM mainframe then to Ottawa where it is captured
by an "SNA Print Server that receives the VPS print jobs, writes them
to disk and then runs a PERL script program on the disk file. This
PERL script program scans the file's VPS banner page for key words
(e.g. JobName, Destination, Form) and then creates a Plain Text and a
Rich Text Format (RTF)." This system is available Nationally for every
region in Canada. It is unfortunate that our government has been so
slow in updating such an old process.

Since I don't really know (or have access to) the inner workings of the
mainframe or the conversion process, I can't really do much there.

The reason I don't wish to simply replace the printer, or simply
convert the output so it can be used on newer printers, is that the data
will also be used to automate tasks (such as creating form letters to
clients).
 
