Translate tab-delimited to fixed width text

mike beck · Sep 21, 2004

I'm writing a script to take a tab-delimited input file where the
format is known and transform it to a fixed-width text file. I thought
that the script might be easier to troubleshoot or edit later (if the
file formats change, for example) if I specified the column names for
the input and output files, which is why I didn't just map input
column 1 -> export column 14.

Even though I've spent some quality time with the archives and the
fine manual, I'm pretty sure that I'm doing some WRONG THINGS. Or
perhaps some SUB-OPTIMAL THINGS. And possibly REALLY WEIRD THINGS. Or
just maybe THINGS THAT WILL WORK FOR A LITTLE WHILE BUT WILL SURELY
CAUSE DIRE PROBLEMS AT SOME LATER, INCREDIBLY INCONVENIENT TIME
RESULTING IN HORRIBLE, TRAGIC CONSEQUENCES FOR EVERYONE INVOLVED.

The script works, but the use of multiple dictionaries seems ... how
should I say ... unintuitive. Is there a better way to move values
from one dictionary to another than what I've done here? I'd
appreciate any other tips (especially those that fit into the
CATEGORIES listed above).

And I've omitted the read from file and write to file business for the
sake of brevity...

Standing on the shoulders of giants (meaning, you all!),
Mike

---

"""
Transforms a tab delimited text file to a fixed-width text file.
"""
import copy, string

DEFFIELDSEP = '\t';
DEFLINESEP = '\n';
DEFHASHEADER = 1; # 1 if yes, 0 if no

import_field_order = ( 'Email',
'First Name(s)',
'Last Name',
'Address Line 1',
'Address Line 2',
'City',
'State',
'Zip Code', );

# export columns (keys) to import columns (values)
mapped_fields = { 'EMAIL_ADDRESS_1':'Email',
'FIRST_NAME_1':'First Name(s)',
'LAST_NAME_1':'Last Name',
'ADDRESS_1':'Address Line 1',
'CITY':'City',
'STATE':'State',
'ZIP_CODE':'Zip Code', };

# required length and justification of export fields
export_fields = { 'RECORD_TYPE':[1,'left'],
'ACCOUNT_ID':[8,'left'],
'TITLE_1':[8,'left'],
'FIRST_NAME_1':[20,'left'],
'MIDDLE_NAME_1':[20,'left'],
'LAST_NAME_1':[20,'left'],
'SUFFIX_1':[8,'left'],
'NAME_ID_1':[2,'left'],
'ADDRESS_1':[256,'left'],
'CITY':[30,'left'],
'STATE':[2,'left'],
'ZIP_CODE':[5,'left'],
'ZIP_EXTENSION':[4,'left'],
'EMAIL_ADDRESS_1':[100,'left'],
'EMAIL_ADDRESS_TYPE_1':[2,'left'], };

export_field_order = ( 'RECORD_TYPE',
'ACCOUNT_ID',
'TITLE_1',
'FIRST_NAME_1',
'MIDDLE_NAME_1',
'LAST_NAME_1',
'SUFFIX_1',
'NAME_ID_1',
'ADDRESS_1',
'CITY',
'STATE',
'ZIP_CODE',
'ZIP_EXTENSION',
'EMAIL_ADDRESS_1',
'EMAIL_ADDRESS_TYPE_1', );

def pad ( elem, max_length, justification=None):
"""
Pad a string to a predetermined length. Truncate the string
if necessary. By default, left justify and pad with spaces.
"""
padded_value = '';
# what if elem is none?
if (justification is 'right'):
padded_value = elem.rjust(max_length)[:max_length]
else:
padded_value = elem.ljust(max_length)[:max_length]
return padded_value

def glue ( var, ref, order ):
"""
Build a string with padded dictionary values.
Var and ref are dictionaries, order is a tuple describing the
field order.
"""
finished_line = []
# pad the items
for k in var.keys():
# pad (value, length, justification)
var[k] = pad( str(var[k]), ref[k][0], ref[k][1])
# build the line in order
for elem in order:
finished_line.append(dict.get(var, elem))
# join the elements and return a string
return ''.join([x for x in finished_line])

def build ( line, field_sep = DEFFIELDSEP ):
"""
Create a single record by transposing fields via
multiple dictionaries.
"""
fields = string.split(line, field_sep);
# trim whitespace
fields = map(string.strip, fields)
# create dict of import column names (keys) and import
# data (values)
step_one = dict([(q,a) for q,a in zip(import_field_order,
fields)])
# create a 'throwaway' copy of the mapped fields dict to
# do the transform
step_two = copy.deepcopy(mapped_fields)
# replace the column names of the import file with the
# values from the import file
for k,v in step_two.items():
step_two[k] = dict.get(step_one, v)
# create an empty dict with the export column names
step_three = dict([(q,'') for q in export_field_order])
dict.update(step_three, step_two)
# special cases
step_three['RECORD_TYPE'] = 'D'
step_three['EMAIL_ADDRESS_TYPE_1'] = 'I'
return glue(step_three, export_fields, export_field_order)

def decapitate(infile,has_header=DEFHASHEADER,line_sep=DEFLINESEP):
"""
Return file without header row
"""
if (has_header):
sans_head = string.split(infile, line_sep)[1:]
else:
sans_head = string.split(infile, line_sep)
return sans_head

if __name__ == '__main__':

# init vars
out_lst = []

# sample data. Names have been changed ...

fin = {'data': 'Email\tFirst Name(s)\tLast Name\tAddress Line
1\tAddress Line 2\tCity\tState\tZip
Code\[email protected]\tBill\tMunroe\t416
Main\t\tManhattan\tNY\t10471\[email protected]\tGeorge\tScott\t111
Blue Street 2005\t\tSan
Francisco\tCA\t94144\[email protected]\tMike\tBork\t22 Oak
Rd\t\tAlbuquerque\tNM\t01720-5303\[email protected]\tIma\tCrazy\t1111 E
Maple Rd\t\tDenver\tCo\t80206-6139'};

# get a list of lines without a header row
lines = decapitate(fin['data']);

for line in lines:
# check for blank lines
if (line):
out_lst.append(build(line))

print out_lst;

Larry Bates · Sep 21, 2004

What's up with all the semi-colon (

line terminators?
Seems like some mixture of Javascript and Python here.

I tend to do this with classes. Then I can take
advantage of using the __str__ method to build the
output string.

class outputrec:
#
# default = default value if not given (also defines length)
# fmt = format string to build output (%s, %i, %.2f, etc.)
# justify = None (left) or right
# values = list of allowable values
#

_record={'Email': {'default':100*' ',
'fmt': '%s',
'justify': None,
'values': None}},
'FirstName':{'default':20*' ',
'fmt': '%s',
'justify': None,
'values': None}},
#
# Insert remainder of fields here
#
}

_attributes=('Email','FirstName',...)

def __init__(self, **kwargs):
#-----------------------------------------------------------------------
# If any keyword arguments were passed in, update local attributes
with
# their values.
#-----------------------------------------------------------------------
for attribute in [a for a in kwargs.keys() if a in
self._attributes]:
self.__setattr__(attribute, kwargs[attribute])

return

def __setattr__(self, attribute, value):
'''
Override the __setattr__ method so that I can limit the attributes
to only those in the _record defined and so I can properly justify
the attributes according to the justification entry.
'''
#-----------------------------------------------------------------------
# Create shortcut to self.__dict__
#-----------------------------------------------------------------------
D=self.__dict__
#-----------------------------------------------------------------------
# Make sure that caller is setting a legal attribute
#-----------------------------------------------------------------------
if attribute not in self._attributes:
raise AttributeError, "accpacPRbase: Illegal attribute=%s" % \
attribute

#-----------------------------------------------------------------------
# See if the attribute has a list of allowable values and check the
# value against that list.
#-----------------------------------------------------------------------
values=self._record[attribute]['values']
if values and (value not in values):
print "attribute=%s, value=%s not in " \
"values list=%s" % (attribute, value, values))
sys.exit(2)

#-----------------------------------------------------------------------
# Check the length of string attributes to make sure they are not
too
# long
#-----------------------------------------------------------------------
if not self._record[attribute]['fmt'] == '"%s"': D[attribute]=value
else:
max_length=len(self._record[attribute]['default'])
if len(value) > max_length:
print "attribute=%s, too long, " \
"value='%s' truncated" % (attribute, value))
print "len(%s)=%i, maxlen=%i" % \
(attribute, len(attribute), max_length))
#---------------------------------------------------------------
# Chop it off at the maximum length
#---------------------------------------------------------------
value=value[:max_length]

#-------------------------------------------------------------------
# See if this value is to be justified to the right, if so
justify
# it to the length of the default value.
#-------------------------------------------------------------------
if self._record[attribute]['justify']:
D[attribute]=value.rstrip().rjust(max_length)
else:
D[attribute]=value.rstrip().ljust(max_length)

return

def __str__(self):
'''
Create and return a string representation of this record for output.
'''
#-----------------------------------------------------------------------
# Create shortcut to self.__dict__
#-----------------------------------------------------------------------
D=self.__dict__
outputlist=[]
#-----------------------------------------------------------------------
# Loop over every possible attribute and use the defined format
string
# to format it for output record. Output record is then joined and
# returned.
#-----------------------------------------------------------------------
for attribute in self._attributes:
outputlist.append(self._record[attribute]['fmt'] % D[attribute])

return ''.join(outputlist)

OREC=outputrec()
OREC.Email='(e-mail address removed)'
OREC.FirstName='Whatever'
#
# Put more assignments here
#
# or you can do in single statement
#
OREC=outputrec(Email='(e-mail address removed)', FirstName='Whatever')

Then you can get output record as

f.writelines(OREC)

There are LOTS of ways to accomplish this and I've tried most of
them, this method just works well for me. I do basically the same
thing for the input records. Once the record classes are debugged
the program is very easy to write. It also "hides" the record
formats from the main program which makes it easy should they
change or if you want to support multiple formats (just write
another one for the new format and conditionally instantiate it
in your main program). I have actually created a base class for
this and I define my record class with it as the baseclass. This
way much of the baseclass code (that is debugged) is reusable.

Hope this helps.

Larry Bates

mike beck said:
I'm writing a script to take a tab-delimited input file where the
format is known and transform it to a fixed-width text file. I thought
that the script might be easier to troubleshoot or edit later (if the
file formats change, for example) if I specified the column names for
the input and output files, which is why I didn't just map input
column 1 -> export column 14.

Even though I've spent some quality time with the archives and the
fine manual, I'm pretty sure that I'm doing some WRONG THINGS. Or
perhaps some SUB-OPTIMAL THINGS. And possibly REALLY WEIRD THINGS. Or
just maybe THINGS THAT WILL WORK FOR A LITTLE WHILE BUT WILL SURELY
CAUSE DIRE PROBLEMS AT SOME LATER, INCREDIBLY INCONVENIENT TIME
RESULTING IN HORRIBLE, TRAGIC CONSEQUENCES FOR EVERYONE INVOLVED.

The script works, but the use of multiple dictionaries seems ... how
should I say ... unintuitive. Is there a better way to move values
from one dictionary to another than what I've done here? I'd
appreciate any other tips (especially those that fit into the
CATEGORIES listed above).

And I've omitted the read from file and write to file business for the
sake of brevity...

Standing on the shoulders of giants (meaning, you all!),
Mike

---

"""
Transforms a tab delimited text file to a fixed-width text file.
"""
import copy, string

DEFFIELDSEP = '\t';
DEFLINESEP = '\n';
DEFHASHEADER = 1; # 1 if yes, 0 if no

import_field_order = ( 'Email',
'First Name(s)',
'Last Name',
'Address Line 1',
'Address Line 2',
'City',
'State',
'Zip Code', );

# export columns (keys) to import columns (values)
mapped_fields = { 'EMAIL_ADDRESS_1':'Email',
'FIRST_NAME_1':'First Name(s)',
'LAST_NAME_1':'Last Name',
'ADDRESS_1':'Address Line 1',
'CITY':'City',
'STATE':'State',
'ZIP_CODE':'Zip Code', };

# required length and justification of export fields
export_fields = { 'RECORD_TYPE':[1,'left'],
'ACCOUNT_ID':[8,'left'],
'TITLE_1':[8,'left'],
'FIRST_NAME_1':[20,'left'],
'MIDDLE_NAME_1':[20,'left'],
'LAST_NAME_1':[20,'left'],
'SUFFIX_1':[8,'left'],
'NAME_ID_1':[2,'left'],
'ADDRESS_1':[256,'left'],
'CITY':[30,'left'],
'STATE':[2,'left'],
'ZIP_CODE':[5,'left'],
'ZIP_EXTENSION':[4,'left'],
'EMAIL_ADDRESS_1':[100,'left'],
'EMAIL_ADDRESS_TYPE_1':[2,'left'], };

export_field_order = ( 'RECORD_TYPE',
'ACCOUNT_ID',
'TITLE_1',
'FIRST_NAME_1',
'MIDDLE_NAME_1',
'LAST_NAME_1',
'SUFFIX_1',
'NAME_ID_1',
'ADDRESS_1',
'CITY',
'STATE',
'ZIP_CODE',
'ZIP_EXTENSION',
'EMAIL_ADDRESS_1',
'EMAIL_ADDRESS_TYPE_1', );

def pad ( elem, max_length, justification=None):
"""
Pad a string to a predetermined length. Truncate the string
if necessary. By default, left justify and pad with spaces.
"""
padded_value = '';
# what if elem is none?
if (justification is 'right'):
padded_value = elem.rjust(max_length)[:max_length]
else:
padded_value = elem.ljust(max_length)[:max_length]
return padded_value

def glue ( var, ref, order ):
"""
Build a string with padded dictionary values.
Var and ref are dictionaries, order is a tuple describing the
field order.
"""
finished_line = []
# pad the items
for k in var.keys():
# pad (value, length, justification)
var[k] = pad( str(var[k]), ref[k][0], ref[k][1])
# build the line in order
for elem in order:
finished_line.append(dict.get(var, elem))
# join the elements and return a string
return ''.join([x for x in finished_line])

def build ( line, field_sep = DEFFIELDSEP ):
"""
Create a single record by transposing fields via
multiple dictionaries.
"""
fields = string.split(line, field_sep);
# trim whitespace
fields = map(string.strip, fields)
# create dict of import column names (keys) and import
# data (values)
step_one = dict([(q,a) for q,a in zip(import_field_order,
fields)])
# create a 'throwaway' copy of the mapped fields dict to
# do the transform
step_two = copy.deepcopy(mapped_fields)
# replace the column names of the import file with the
# values from the import file
for k,v in step_two.items():
step_two[k] = dict.get(step_one, v)
# create an empty dict with the export column names
step_three = dict([(q,'') for q in export_field_order])
dict.update(step_three, step_two)
# special cases
step_three['RECORD_TYPE'] = 'D'
step_three['EMAIL_ADDRESS_TYPE_1'] = 'I'
return glue(step_three, export_fields, export_field_order)

def decapitate(infile,has_header=DEFHASHEADER,line_sep=DEFLINESEP):
"""
Return file without header row
"""
if (has_header):
sans_head = string.split(infile, line_sep)[1:]
else:
sans_head = string.split(infile, line_sep)
return sans_head

if __name__ == '__main__':

# init vars
out_lst = []

# sample data. Names have been changed ...
fin = {'data': 'Email\tFirst Name(s)\tLast Name\tAddress Line
1\tAddress Line 2\tCity\tState\tZip
Code\[email protected]\tBill\tMunroe\t416
Main\t\tManhattan\tNY\t10471\[email protected]\tGeorge\tScott\t111
Blue Street 2005\t\tSan
Francisco\tCA\t94144\[email protected]\tMike\tBork\t22 Oak
Rd\t\tAlbuquerque\tNM\t01720-5303\[email protected]\tIma\tCrazy\t1111 E
Maple Rd\t\tDenver\tCo\t80206-6139'};

# get a list of lines without a header row
lines = decapitate(fin['data']);

for line in lines:
# check for blank lines
if (line):
out_lst.append(build(line))

print out_lst;

mike beck · Sep 30, 2004

This is very cool and useful. Thanks for the help!

XML -> Tab-delimited text file (using lxml)	2	Nov 19, 2008
Parsing html :: output to comma delimited	3	Jul 16, 2005
A Script To Convert iTunes XML To Tab Delimited Format	7	May 27, 2006
Using re to get data from text file	6	Sep 10, 2004
A data transformation framework. A presentation inviting commentary.	0	Aug 21, 2013
Importing from text file to Excel	0	Dec 15, 2006
newbie: Need Help Moving Data from Kirbybase to Gadfly	3	Apr 10, 2006
(OT) wxPython & resizers, html text (code included)	1	Jun 21, 2004

Translate tab-delimited to fixed width text

mike beck

Larry Bates

mike beck

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads