splitting tables

robsom

Hi, I have a problem with a small python program I'm trying to write
and I hope somebody may help me. I'm working on tables of this kind:

CGA 1988 06 21 13 48 G500-050 D 509.62 J.. R1 1993 01 28 00 00 880006
CGA 1988 06 21 14 04 G500-051 D 550.62 J.. R1 1993 01 28 00 00 880007

I have to read each line of the table and put it into comma-separated
lists like these for later manipulation:

CGA,1988,06,21,13,48,G500-050,D,509.62,J..,R1,1993,01,28,00,00,880006
CGA,1988,06,21,14,04,G500-051,D,550.62,J..,R1,1993,01,28,00,00,880007

The 'split' function works pretty well, except when there is an error in
the original data table. For example, if an element is missing in a line,
like this:

CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065

This error happens quite often in my dataset and the tables are too
large to check for it manually. In this case what I get splitting the
line string is of course this:

CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
CGA,1990,08,16,01,22,D,508.06,J..,R1,1993,01,27,00,00,900065

And when the program tries to work on the second list it stops (of course!).
Is there any way to avoid this problem? Thanks a lot for any suggestions.

R
 
Skip Montanaro

r> The 'split' function works pretty well, except when there is an error in
r> the original data table. For example, if an element is missing in a line,
r> like this:

r> CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
r> CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065

If the data are truly fixed width, just slice the strings:
>>> s = 'CGA 1990 08 16 01 22          D 508.06 J.. R1 1993 01 27 00 00 900065'
>>> s[0:3], s[4:8], s[9:11], s[12:14], s[15:17], s[18:20], s[21:29], s[30:31]
('CGA', '1990', '08', '16', '01', '22', '        ', 'D')

Skip
 
Scott David Daniels

robsom said:
... The 'split' function works pretty well, except when there is an error
in the original data table. For example, if an element is missing in a line,
like this:

CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065

A way to stay reasonably fast is to use split when it works:

lengths = 3, 4, 2, 2, 2, 2, 8, 1, 6, 3, 2, 4, 2, 2, 2, 2, 6
...
result = line.split()
if len(result) != len(lengths):
    # split() lost a column, so fall back to fixed-width slicing
    result, pos = [], 0
    for length in lengths:
        end = pos + length + 1
        result.append(line[pos : end].strip())
        pos = end
...
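
A minimal sketch of this fallback in present-day Python 3. The names parse_line and WIDTHS are illustrative, and the widths are read off the sample rows, assuming a single space between fixed-width columns:

```python
# Field widths read off the sample rows (an assumption about the real files).
WIDTHS = (3, 4, 2, 2, 2, 2, 8, 1, 6, 3, 2, 4, 2, 2, 2, 2, 6)


def parse_line(line, widths=WIDTHS):
    """Split on whitespace when that yields every field; otherwise slice by width."""
    fields = line.split()
    if len(fields) == len(widths):
        return fields
    fields, pos = [], 0
    for width in widths:
        # A blank column comes back as an empty string after strip().
        fields.append(line[pos:pos + width].strip())
        pos += width + 1  # skip the single separator space
    return fields


good = "CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069"
# Same layout, but the 8-character column is blank (built explicitly so the
# spaces cannot be lost).
bad = "CGA 1990 08 16 01 22 " + " " * 8 + " D 508.06 J.. R1 1993 01 27 00 00 900065"

print(",".join(parse_line(good)))
print(",".join(parse_line(bad)))
```

The slicing branch only helps if the blank column is still padded with spaces in the file; if the whitespace has been collapsed, the position of the missing field cannot be recovered this way.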
 
Karl Pflästerer

And when the program tries to work on the second list it stops (of course!).
Is there any way to avoid this problem? This kind of error happens quite

What do you want to be done? Seeing whether an item is missing is trivial:
just check the length of the split line (a list). But the right
action in that case is up to you: should the user be asked? Is it
always the same column that is missing? Is it possible to tell the
entries apart reliably, so that the program can decide which
column is missing?



KP
 
Bengt Richter

... CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
... CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065
... """
... '(...) (....) (..) (..) (..) (..) (........) (.) '
... '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
... )
...
CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
CGA,1990,08,16,01,22, ,D,508.06,J..,R1,1993,01,27,00,00,900065
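
The opening lines of the session above did not survive, so here is a hedged sketch of the same idea in Python 3 rather than a reconstruction of the original code. It keeps the pattern from the post; the names ROW, rows and sample are illustrative, and unlike the output above it strips blank fields to empty strings:

```python
import csv
import re
import sys

# One group per fixed-width column; "." also matches a space, so a blank
# column still matches as long as its padding is present.
ROW = re.compile(
    r"(...) (....) (..) (..) (..) (..) (........) (.) "
    r"(......) (...) (..) (....) (..) (..) (..) (..) (......)"
)


def rows(lines):
    """Yield a list of stripped fields for each line that fits the layout."""
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # short or malformed line; real code might log it instead
        yield [field.strip() for field in match.groups()]


sample = [
    "CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069",
    "CGA 1990 08 16 01 22 " + " " * 8 + " D 508.06 J.. R1 1993 01 27 00 00 900065",
]
csv.writer(sys.stdout).writerows(rows(sample))
```

Checking the match before using it covers the short-line case mentioned below, since a line that does not fit the layout is simply skipped.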

To write the csv lines to a file instead of sys.stdout, substitute (untested)
file('your_csv_output_file.csv', 'w') in place of sys.stdout in the above, and get your
lines from something like (note chopping off the trailing newline)

for line in file('your_table_file'):
    line = line.rstrip('\n')

instead of

for line in s.splitlines():

If you have possible short lines that create no match, you'll need to check for those
before unpacking (by using the prefixed *) into writer.writerow's arg list.

That's it for clp today ;-)

Regards,
Bengt Richter
 
robsom

On Sat, 07 Feb 2004 22:10:17 +0100, Karl Pflästerer wrote:
What do you want to be done? To see if an item is missing is trivial:
just check the length of the split line (a list). But what the right
action in that case is belongs to you; should the user be asked? is
always the same column missing? is it possible to distinguish the
entries without errors from each other so the program can decide which
column is missing?

Ok, I'll try to give some more detail. These are files with data from
field measurements and contain information about location, time,
measurement, measurement flag, error, detection limit, calibration and
other stuff like that. The problem is that they are made by different
groups and are not always consistent in their format, and I'm trying to
write code that is as general as possible.
When a table has fixed-width columns and each element is correctly aligned
as in the example I showed you, the problem is solved by slicing the
string as Skip Montanaro suggested in his answer (thanks!), but this is
not always the case. For example I can have something like this:

47.455677 456.67
47.4558 453.8
47.46789 -9999
47.4567 456

where -9999 (or something similar) indicates a blank, one space
divides the columns, and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I run into the other
problem when there is a missing value.
Any suggestions will be much appreciated, thanks

R
 
Diez B. Roggisch

where -9999 (or something similar) indicates there is a blank, one space
divides the columns and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I fall into the other
problem, when there is a missing value.
Any suggestions will be much appreciated, thanks


The data looks like the different fields have different types - e.g. floats,
years, fixed width integers and so on - so you might be able to create a
regular expression like this:

"(CGA|) ([0-9]{4}|) ([0-9]{2}|)"

You could then retrieve the individual groups of the match. If a line doesn't
match, you either adapt the expression or, if that's not possible, write the
line to an error file that you can look over manually.
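
A small sketch of that approach in Python 3, using only the three-column pattern above. The names PREFIX and sort_lines are illustrative, a real pattern would describe every column, and the empty alternatives only help when a blank field still leaves both of its separators in the line:

```python
import re

# Each group may be empty, so a blank column can still match.
PREFIX = re.compile(r"(CGA|) ([0-9]{4}|) ([0-9]{2}|)")


def sort_lines(lines):
    """Return (parsed, rejected): field tuples for matching lines, raw text otherwise."""
    parsed, rejected = [], []
    for line in lines:
        match = PREFIX.match(line)
        if match:
            parsed.append(match.groups())
        else:
            rejected.append(line)  # candidates for an error file to review by hand
    return parsed, rejected


parsed, rejected = sort_lines(["CGA 1990 08 15", "CGA  08 15", "XYZ 1990 08 15"])
print(parsed)    # [('CGA', '1990', '08'), ('CGA', '', '08')]
print(rejected)  # ['XYZ 1990 08 15']
```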
 
Bengt Richter

47.455677 456.67
47.4558 453.8
47.46789 -9999
47.4567 456

where -9999 (or something similar) indicates there is a blank, one space
divides the columns and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I fall into the other
problem, when there is a missing value.
Any suggestions will be much appreciated, thanks
Maybe a modified regex that takes into account particular field formats?
A regex will search for things in order, so you can set one up to match
special things like -9999 while still allowing -9999.9 etc.
What do you know about each field and the separations? Is there always a full
set of fields, even if some are blank?

(BTW, my other post misleads in implying that line.rstrip('\n') is necessary
to get the regex to match).
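
One way such an ordered regex might look, as a sketch only. It assumes purely numeric columns like the two-column example; the names NUMBER and parse_numbers are made up for the illustration:

```python
import re

# The "missing" alternative is tried first, but the lookahead stops it from
# swallowing the front of a longer number such as -9999.9 or -99990.
NUMBER = re.compile(
    r"(?P<missing>-9999(?![\d.]))|(?P<value>-?\d+(?:\.\d+)?)"
)


def parse_numbers(line):
    """Return floats for ordinary values and None where the marker appears."""
    values = []
    for match in NUMBER.finditer(line):
        if match.group("missing"):
            values.append(None)
        else:
            values.append(float(match.group("value")))
    return values


print(parse_numbers("47.46789 -9999"))    # [47.46789, None]
print(parse_numbers("47.4558 -9999.9"))   # [47.4558, -9999.9]
```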

Regards,
Bengt Richter
 
robsom

On Sat, 07 Feb 2004 21:56:23 +0000, Bengt Richter wrote:

... CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
... CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065
... """
... '(...) (....) (..) (..) (..) (..) (........) (.) '
... '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
... )
...
CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
CGA,1990,08,16,01,22, ,D,508.06,J..,R1,1993,01,27,00,00,900065

To write the csv lines to a file instead of sys.stdout, substitute
(untested) file('your_csv_output_file.csv', 'w') in place of sys.stdout in
the above, and get your lines from something like (note chopping off the
trailing newline)

for line in file('your_table_file'):
    line = line.rstrip('\n')

instead of

for line in s.splitlines():

If you have possible short lines that create no match, you'll need to
check for those before unpacking (by using the prefixed *) into
writer.writerow's arg list.

I'm not sure I completely understand what you are suggesting. Use a sort
of "mask" to extract information from the line and put it into a string?
And then use some function of the csv module (which I don't have) to
manipulate this string and output it?
If I got it right, this would work only if all the elements of the
columns have the same number of characters, which is not always the case, as
I was saying in my reply to Karl Pflästerer above.
Any other ideas? Thanks

R
 
Karl Pflästerer

47.455677 456.67
47.4558 453.8
47.46789 -9999
47.4567 456
where -9999 (or something similar) indicates there is a blank, one space
divides the columns and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I fall into the other
problem, when there is a missing value.

I still can't see the problem sorry; if the columns are always separated
by space you can split them with:
.... 47.4558 453.8
.... 47.46789 -9999
.... 47.4567 456""")
.... print ','.join(map(lambda dat: dat != '-9999' and dat or '', line.split()))
....
47.455677,456.67
47.4558,453.8
47.46789,
47.4567,456
.... print ','.join(map(lambda dat: dat != '-9999' and dat or 'MISSING', line.split()))
....
47.455677,456.67
47.4558,453.8
47.46789,MISSING
47.4567,456

I used your -9999 here as the value of a missing item. Before you map that
function across the split line you could check for the correct
length, to see whether you always have, e.g., two columns as here. If the
length differs, it depends on whether it's possible to find the
missing column programmatically and just insert a space or something else.
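
The length check could look something like this in Python 3. This is only a sketch: find_bad_lines and table are illustrative names, and the expected column count is assumed to be known in advance:

```python
def find_bad_lines(lines, expected):
    """Return (line_number, field_count) for every line whose column count is off."""
    problems = []
    for number, line in enumerate(lines, start=1):
        count = len(line.split())
        if count != expected:
            problems.append((number, count))
    return problems


table = ["47.455677 456.67", "47.4558", "47.46789 -9999", "47.4567 456"]
print(find_bad_lines(table, expected=2))  # [(2, 1)]
```

Reporting the line numbers first leaves the decision about what to do with each bad line (skip, repair, ask the user) to a later step.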



KP
 
robsom

On Sat, 07 Feb 2004 13:04:05 -0800, Scott David Daniels wrote:
A way to stay reasonably fast is to use split when it works:

lengths = 3, 4, 2, 2, 2, 2, 8, 1, 6, 3, 2, 4, 2, 2, 2, 2, 6
...
result = line.split()
if len(result) != len(lengths):
    result, pos = [], 0
    for length in lengths:
        end = pos + length + 1
        result.append(line[pos : end].strip())
        pos = end
...

I think I understand what you mean, but what data type is lengths?
Is it a list or a tuple? In that case there should be parentheses,
shouldn't there?
Sorry, I'm really asking silly newbie questions :), but I couldn't find
that in my reference manual.
In any case this would work only when all the elements in a column have
the same length, which, as I said, is not always the case. Thanks a lot
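
For reference, a short Python 3 illustration of the tuple question (the variable names are only for the example): it is the commas that make a tuple, and the parentheses are optional in an assignment like this.

```python
lengths = 3, 4, 2        # the commas make this a tuple
same = (3, 4, 2)         # parentheses are optional grouping
as_list = [3, 4, 2]      # square brackets make a list instead

print(type(lengths).__name__, lengths == same)  # tuple True
print(len(lengths), len(as_list))               # 3 3
```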

R
 
robsom

Sorry guys, I thank all of you, but I'm a beginner and my knowledge of
Python is not yet enough to follow you.
...print ','.join(map(lambda dat: dat != '-9999' and dat or '', line.split()))
...

As far as I understand you are using a 'map' function to apply the
function "lambda dat: dat != '-9999' and dat or ''" to the elements
obtained with "line.split()".
First question: shouldn't it be split(line)?
Now, my book says that lambda works by applying (in this example) the
instructions in "dat != '-9999' and dat or ''" to the dat variable. Does
this mean that (I'm more or less guessing now) this code splits the line and
assigns each element to a dat variable, which could have three types of
value (i.e. its "normal" value or -9999 or a space)? And then you use a
join function to put together again the elements you have separated with
map, with a comma as separator? And here I have a second question:
shouldn't it be "join(map(....),',')" instead of "','.join(map(...))"?
I used here your -9999 as value of a missing item. Before you map that
function across the split line you could check for the correct length
to see if you have e.g. like here always two columns. If the length
differs it depends if it's possible to find programmatically the missing
column and just insert a space or something other.

Ok, but before going on, I'd like to know if I understand the code.
Probably I got it all wrong, but this is quite a few chapters beyond where
I am now in my book :)
Moreover, I wonder if it is possible to achieve the same thing without
using functional programming.

many thanks again

R
 
Karl Pflästerer

Sorry guys, I thank all of you, but I'm a beginner and my knowledge of
Python is not yet enough to follow you.

If you're interested there's a very good mailing list for Python
starters (and others): Python Tutor (I don't know the correct address
at the moment but it should be easy to find on the Python web site).
As far as I understand you are using a 'map' function to apply the
function "lambda dat: dat != '-9999' and dat or ''" to the elements
obtained with "line.split()".
Right.

First question: shouldn't it be split(line)?

No. The split method is a method of strings. Since line is a string
you call here a method of a string object.
Now, my book says that lambda works by applying (in this example) the
instructions in "dat != '-9999' and dat or ''" to the dat variable. Does
this mean that (I'm more or less guessing now) this code splits the line

It gets split on whitespace.
assign each element to a dat variable which could have three types of
value (i.e. its "normal" value or -9999 or a space)? And then you use a

That's nearly right. This trick with `and' and `or' is necessary since
you can't have statements in a Python lambda. Written as a function with an `if'
statement, the above would be:
def f(dat):
    if dat != '-9999':
        return dat
    else:
        return ''
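
For readers on later versions: Python 2.5 added a conditional expression that does the same job inside a lambda. The sketch below (function names are illustrative) puts the two spellings side by side and shows the one case where the and/or idiom gives a different answer:

```python
def blank_sentinel_old(dat):
    # The and/or idiom: safe here because dat is never empty after split().
    return dat != '-9999' and dat or ''


def blank_sentinel(dat):
    # Conditional expression, available since Python 2.5.
    return dat if dat != '-9999' else ''


# The idiom's known pitfall: a falsy "true" branch falls through to the "or" part.
def pick_old(flag):
    return flag and 0 or 1


def pick(flag):
    return 0 if flag else 1


print(blank_sentinel_old('-9999') == blank_sentinel('-9999'))  # True
print(pick_old(True), pick(True))                              # 1 0
```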
join function to put together again the elements you have separated with

`join' is here again (like `split') a method of the string object.
map with a comma as separator? And here I have a second question:
shouldn't it be "join(map(....),',')" instead of "','.join(map(...))"?

No since join is a method. If you have an older Python book it may be
written differently.

[...]
Ok, but before going on, I'd like to know if I understand the code.
Probably I got it all wrong, but this is quite a few chapters beyond where
I am now in my book :)

I didn't know you were a beginner otherwise I would have written more
verbosely. The Tutor list might be the right place for you to start.
Moreover, I wonder if it is possible to achieve the same thing without
using the functional programming.

Yes, but it would be less fun :)
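
For comparison, a plain-loop version of the same one-liner, written as a Python 3 sketch. The function name convert and its parameters are made up for the example:

```python
def convert(line, sentinel="-9999", placeholder=""):
    """Turn a whitespace-separated line into a comma-separated one."""
    fields = []
    for dat in line.split():          # split() is a method of the string
        if dat == sentinel:
            fields.append(placeholder)
        else:
            fields.append(dat)
    return ",".join(fields)           # join() is a method of the separator


print(convert("47.455677 456.67"))                       # 47.455677,456.67
print(convert("47.46789 -9999"))                         # 47.46789,
print(convert("47.46789 -9999", placeholder="MISSING"))  # 47.46789,MISSING
```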

KP
 
robsom

On Tue, 10 Feb 2004 18:09:33 +0100, Karl Pflästerer wrote:
If you're interested there's a very good mailing list for Python
starters (and others): Python Tutor (I don't know the correct address
at the moment but it should be easy to find on the Python web site).

I'll have a look, I'm beginning to appreciate Python even though at first
I hated this indentation thing and I (still) don't like the "for in
range()" way to build loops.

No. The split method is a method of strings. Since line is a string
you call here a method of a string object.

Ok, but then why if I write this

for line in fin.readlines():
    line = split(line)

it does work anyway? Are there two ways to use the same function?
It gets split on whitespace.


That's nearly right. This trick with `and' and `or' is necessary since
you can't have statements in a Python lambda. Written as a function with an `if'
statement, the above would be:
def f(dat):
    if dat != '-9999':
        return dat
    else:
        return ''

So in the end here is how this piece of code works:
it takes a line, splits the line on spaces and puts each element into a dat
variable. If the dat variable is equal to '-9999', or whatever other symbol
is used to indicate a blank, it substitutes a space; otherwise it leaves the
value alone. Then it puts the dat values together (into a list?) separated by commas.
I got it! maybe...
No since join is a method. If you have an older Python book it may be
written differently.

well I'm using "Learning to Program" by Alan Gauld which I found on the
python website. But I found the description of the join function on the
python documentation of my system. Anyway, as I said it works both ways
(at least for split, haven't tried with join).
I was thinking of buying "Learning Python" of the O'Reilly series. What do
you think?
I didn't know you were a beginner otherwise I would have written more
verbosely. The Tutor list might be the right place for you to start.

don't worry I should have said :)
Yes, but it would be less fun :)

uhm... maybe... I'll tell you when I get there... :D:D
thanks a lot

R
 
Ben Finney

I'm beginning to appreciate Python even though at first I hated this
indentation thing

Yup. Just because something seems limiting doesn't mean it is.
and I (still) don't like the "for in range()" way to build loops.

The "for index in range( size )" is mostly recommended as a way to loop
over a list; range( size ) will elegantly generate a sequence of indices
into a list of size "size".

What is it you don't like?

On Tue, 10 Feb 2004 18:09:33 +0100, Karl Pflästerer wrote:

Ok, but then why if I write this

for line in fin.readlines():
    line = split(line)

it does work anyway?

Because somewhere earlier in that code you must have imported the name
'split' into the base namespace. It's not there to begin with (and
importing it is needlessly polluting the base namespace).

Are there two ways to use the same function?

They're two different functions. One is a method of string objects, one
is a function you've imported from somewhere. (This confusion is partly
why importing symbols into the base namespace is a bad idea.)
 
Skip Montanaro

robsom> ... I (still) don't like the "for in range()" way to build
robsom> loops.

That's needed only when you need a numeric index. Much of the time you can
just iterate over a container object:

for element in mylist:
    print element

or

for key in mydict:
    print (key, mydict[key])

Skip
 
robsom

Yup. Just because something seems limiting doesn't mean it is.

Right, I'm just saying that I was used to indent when I wanted and what I
wanted and that I missed the "closing instruction"
But, to be fair, I've adapted much faster than I thought. Just a matter of
getting used to it.
The "for index in range( size )" is mostly recommended as a way to loop
over a list; range( size ) will elegantly generate a sequence of indices
into a list of size "size".

What is it you don't like?

well, I just think that a kind of syntax like: for i=x to y
is much more intuitive and easy to use. It caused me a lot of errors,
particularly in the beginning, because I wasn't getting the values I
wanted for the index. Again, it is probably just a matter of getting used to
it.
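
A common source of those wrong index values is that range stops before its upper bound. A tiny Python 3 sketch, with x and y chosen arbitrarily:

```python
x, y = 3, 6

# range(x, y) stops before y, so "for i = x to y" needs y + 1.
inclusive = list(range(x, y + 1))
exclusive = list(range(x, y))

print(inclusive)  # [3, 4, 5, 6]
print(exclusive)  # [3, 4, 5]
```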
Because somewhere earlier in that code you must have imported the name
'split' into the base namespace. It's not there to begin with (and
importing it is needlessly polluting the base namespace).
They're two different functions. One is a method of string objects, one
is a function you've imported from somewhere. (This confusion is partly
why importing symbols into the base namespace is a bad idea.)

in fact there is a "from string import *" instruction at the beginning.
ok, so you suggest using the method because it results in faster code?
thanks

R
 
Ben Finney

[why do str.split() and split(str) both work?]
Because somewhere earlier in that code you must have imported the name
'split' into the base namespace. It's not there to begin with (and
importing it is needlessly polluting the base namespace).

in fact there is a "from string import *" instruction at the beginning.

Yes, the "from module import *" is a deprecated usage; it pollutes the
base namespace, which leads to confusions like this and others.

The recommended way is to:

import module

module.function()

So, in your case, this would be:

import string

string.split( str )

Then it becomes clear in each instance that the "split" function comes
from the "string" module, and not some other arbitrary module.
[are they the same function?]
They're two different functions. One is a method of string objects,
one is a function you've imported from somewhere. (This confusion is
partly why importing symbols into the base namespace is a bad idea.)

ok, so you suggest using the method because it results in a faster
code?

No, I recommend using the method because it's already part of the string
objects, and results in clearer code.
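
A note for anyone reading this with Python 3: the module-level string.split function discussed here no longer exists there, so the method is the only spelling left. A quick check, with an arbitrary sample line:

```python
import string

line = "47.4558 453.8"

by_method = line.split()     # method looked up on the string object
by_type = str.split(line)    # the same method, reached through the str type

print(by_method == by_type)       # True
print(hasattr(string, "split"))   # False on Python 3
```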
 
John J. Lee

Skip Montanaro said:
robsom> ... I (still) don't like the "for in range()" way to build
robsom> loops.

That's needed only when you need a numeric index. Much of the time you can
just iterate over a container object:

for element in mylist:
    print element

or

for key in mydict:
    print (key, mydict[key])

also (Python 2.3):

for index, element in enumerate(mylist):
    # blah, blah
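
The three loop styles side by side, as a runnable Python 3 sketch (print is a function there; the sample data is made up):

```python
mylist = ["CGA", "1990", "08"]
mydict = {"station": "CGA", "year": "1990"}

# Iterate over the elements directly; no index needed.
for element in mylist:
    print(element)

# Iterating over a dict yields its keys.
for key in mydict:
    print(key, mydict[key])

# enumerate() supplies the index alongside each element.
for index, element in enumerate(mylist):
    print(index, element)

pairs = list(enumerate(mylist))
```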


John
 
