Parsing of a file

Tommy Grav

I have a file with the format

Field f29227: Ra=20:23:46.54 Dec=+67:30:00.0 MJD=53370.06797690 Frames 5 Set 1
Field f31448: Ra=20:24:58.13 Dec=+79:39:43.9 MJD=53370.06811620 Frames 5 Set 2
Field f31226: Ra=20:24:45.50 Dec=+78:26:45.2 MJD=53370.06823860 Frames 5 Set 3
Field f31004: Ra=20:25:05.28 Dec=+77:13:46.9 MJD=53370.06836020 Frames 5 Set 4
Field f30782: Ra=20:25:51.94 Dec=+76:00:48.6 MJD=53370.06848210 Frames 5 Set 5
Field f30560: Ra=20:27:01.82 Dec=+74:47:50.3 MJD=53370.06860400 Frames 5 Set 6
Field f30338: Ra=20:28:32.35 Dec=+73:34:52.0 MJD=53370.06872620 Frames 5 Set 7
Field f30116: Ra=20:30:21.70 Dec=+72:21:53.6 MJD=53370.06884890 Frames 5 Set 8
Field f29894: Ra=20:32:28.54 Dec=+71:08:55.0 MJD=53370.06897070 Frames 5 Set 9
Field f29672: Ra=20:34:51.89 Dec=+69:55:56.6 MJD=53370.06909350 Frames 5 Set 10

I would like to parse this file by extracting the field id, ra, dec
and mjd from each line. It is not certain, however, that each value of
the field id, ra, dec or mjd has the same width in every line. Is there
a way to do this such that a line where Ra=****** and MJD=******** were
swapped would still be parsed correctly?

Cheers
Tommy
 
Mike Driscoll

I'm sure Python can handle this. Try the PyParsing module or learn
Python regular expression syntax.

http://pyparsing.wikispaces.com/

You could probably do it very crudely by just iterating over each line
and then using the string's find() method.
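
A rough sketch of that find()-based idea, written for Python 3. The helper name extract_value is made up for illustration, and the sample line has its fields swapped on purpose. It assumes values never contain spaces and that no other token ends in "Ra=", "Dec=" or "MJD=":

```python
def extract_value(line, key):
    """Return the text after 'key=' up to the next space, or None if absent."""
    marker = key + "="
    start = line.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = line.find(" ", start)
    if end == -1:  # the value runs to the end of the line
        end = len(line)
    return line[start:end]

# Sample line with MJD and Ra swapped relative to the posted file.
line = "Field f31448: MJD=53370.06811620 Dec=+79:39:43.9 Ra=20:24:58.13 Frames 5 Set 2"
ra = extract_value(line, "Ra")
dec = extract_value(line, "Dec")
mjd = extract_value(line, "MJD")
print(ra, dec, mjd)
```

Because each key is searched for by name, the order of the fields on the line does not matter.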

Mike
 
John Machin

Perhaps you and the OP could spend some time becoming familiar with
built-in functions and str methods. In particular, str.split is your
friend:

C:\junk>type tommy_grav.py
# Look, Ma, no imports!

guff = """\
Field f29227: Ra=20:23:46.54 Dec=+67:30:00.0 MJD=53370.06797690 Frames 5 Set 1
Field f31448: MJD=53370.06811620123 Dec=+79:39:43.9 Ra=20:24:58.13 Frames 5 Set 2
Field f31226: Ra=20:24:45.50 Dec=+78:26:45.2 MJD=53370.06823860 Frames 5 Set 3
Field f31004: Ra=20:25:05.28 Dec=+77:13:46.9 MJD=53370.06836020 Frames 5 Set 4
Field f30782: Ra=20:25:51.94 Dec=+76:00:48.6 MJD=53370.06848210 Frames 5 Set 5

Field f30560: Ra=20:27:01.82 Dec=+74:47:50.3 MJD=53370.06860400 Frames 5 Set 6
Field f30338: Ra=20:28:32.35 Dec=+73:34:52.0 MJD=53370.06872620 Frames 5 Set 7
Field f30116: Ra=20:30:21.70 Dec=+72:21:53.6 MJD=53370.06884890 Frames 5 Set 8
Field f29894: Ra=20:32:28.54 Dec=+71:08:55.0 MJD=53370.06897070 Frames 5 Set 9
Field f29672: Ra=20:34:51.89 Dec=+69:55:56.6 MJD=53370.06909350 Frames 5 Set 10

"""

is_angle = {
    'ra': True,
    'dec': True,
    'mjd': False,
}

def convert_angle(text):
    deg, min, sec = map(float, text.split(':'))
    return (sec / 60. + min) / 60. + deg

def parse_line(line):
    t = line.split()
    assert t[0].lower() == 'field'
    assert t[1].startswith('f')
    assert t[1].endswith(':')
    field_id = t[1].rstrip(':')
    rdict = {}
    for f in t[2:]:
        parts = f.split('=')
        if len(parts) == 2:
            key = parts[0].lower()
            value = parts[1]
            assert key not in rdict
            if is_angle[key]:
                rvalue = convert_angle(value)
            else:
                rvalue = float(value)
            rdict[key] = rvalue
    return field_id, rdict['ra'], rdict['dec'], rdict['mjd']

for line in guff.splitlines():
    line = line.strip()
    if not line:
        continue
    field_id, ra, dec, mjd = parse_line(line)
    print field_id, ra, dec, mjd


C:\junk>tommy_grav.py
f29227 20.3962611111 67.5 53370.0679769
f31448 20.4161472222 79.6621944444 53370.0681162
f31226 20.4126388889 78.4458888889 53370.0682386
f31004 20.4181333333 77.2296944444 53370.0683602
f30782 20.4310944444 76.0135 53370.0684821
f30560 20.4505055556 74.7973055556 53370.068604
f30338 20.4756527778 73.5811111111 53370.0687262
f30116 20.5060277778 72.3648888889 53370.0688489
f29894 20.5412611111 71.1486111111 53370.0689707
f29672 20.5810805556 69.9323888889 53370.0690935
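
One caveat with convert_angle as posted: the sign is carried only by the first component, so a negative declination such as -10:30:00 would come out as -9.5, and -00:30:00 as +0.5. The sample data has only positive values, so the output above is unaffected. A sign-aware variant might look like the sketch below (Python 3; the name sexagesimal_to_decimal is invented here, and the unit of the result is whatever the first component is in, so Ra stays in hours if it was given in hours):

```python
def sexagesimal_to_decimal(text):
    """Convert 'dd:mm:ss.s' (optionally signed) to a single decimal number."""
    text = text.strip()
    # Take the sign from the leading character so it applies to all three parts.
    sign = -1.0 if text.startswith("-") else 1.0
    first, minutes, seconds = (float(p) for p in text.lstrip("+-").split(":"))
    return sign * (first + minutes / 60.0 + seconds / 3600.0)

print(sexagesimal_to_decimal("+67:30:00.0"))  # 67.5
print(sexagesimal_to_decimal("-00:30:00.0"))  # -0.5
```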

Cheers,
John
 
bearophileHUGS

Using something like PyParsing is probably better, but if you don't
want to use it you may use something like this:

raw_data = """
Field f29227: Ra=20:23:46.54 Dec=+67:30:00.0 MJD=53370.06797690 Frames 5 Set 1
Field f31448: Ra=20:24:58.13 Dec=+79:39:43.9 MJD=53370.06811620 Frames 5 Set 2
Field f31226: Ra=20:24:45.50 Dec=+78:26:45.2 MJD=53370.06823860 Frames 5 Set 3
Field f31004: Ra=20:25:05.28 Dec=+77:13:46.9 MJD=53370.06836020 Frames 5 Set 4
Field f30782: Ra=20:25:51.94 Dec=+76:00:48.6 MJD=53370.06848210 Frames 5 Set 5
Field f30560: Ra=20:27:01.82 Dec=+74:47:50.3 MJD=53370.06860400 Frames 5 Set 6
Field f30338: Ra=20:28:32.35 Dec=+73:34:52.0 MJD=53370.06872620 Frames 5 Set 7
Field f30116: Ra=20:30:21.70 Dec=+72:21:53.6 MJD=53370.06884890 Frames 5 Set 8
Field f29894: Ra=20:32:28.54 Dec=+71:08:55.0 MJD=53370.06897070 Frames 5 Set 9
Field f29672: Ra=20:34:51.89 Dec=+69:55:56.6 MJD=53370.06909350 Frames 5 Set 10"""

# from each line extract the fields: id, ra, dec, mjd
# even if they are swapped

data = []
for line in raw_data.lower().splitlines():
    if line.startswith("field"):
        parts = line.split()
        record = {"id": int(parts[1][1:-1])}
        for part in parts[2:]:
            if "=" in part:
                title, field = part.split("=")
                record[title] = field
        data.append(record)
print data

-----------------

Stefan Behnel:
> You can use named groups in a single regular expression.

Can you show how to use them in this situation when fields can be
swapped?
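
One possible way to do that, offered only as a sketch (Python 3): put each named group inside its own lookahead, so every key is searched for independently across the rest of the line. The pattern name RECORD_RE and the second sample line (fields swapped on purpose) are invented for illustration, and this is not necessarily what Stefan had in mind:

```python
import re

# Each lookahead scans the remainder of the line for one key, so the
# key=value pairs may appear in any order after the field id.
RECORD_RE = re.compile(
    r"Field\s+(?P<id>\S+):"
    r"(?=.*\bRa=(?P<ra>\S+))"
    r"(?=.*\bDec=(?P<dec>\S+))"
    r"(?=.*\bMJD=(?P<mjd>\S+))"
)

lines = [
    "Field f29227: Ra=20:23:46.54 Dec=+67:30:00.0 MJD=53370.06797690 Frames 5 Set 1",
    "Field f31448: MJD=53370.06811620 Dec=+79:39:43.9 Ra=20:24:58.13 Frames 5 Set 2",
]

# Lines that do not match (blank lines, missing keys) are silently skipped.
records = [m.groupdict() for m in map(RECORD_RE.match, lines) if m]
for record in records:
    print(record)
```

The trade-off is readability: the split-based versions in this thread are arguably easier to follow, and the regex rejects a line outright if any of the three keys is missing.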

Bye,
bearophile
 
John Machin

Slightly less ugly:

C:\junk>diff tommy_grav.py tommy_grav_2.py
18,23d17
< is_angle = {
<     'ra': True,
<     'dec': True,
<     'mjd': False,
< }
<
27a22,27
> converter = {
>     'ra': convert_angle,
>     'dec': convert_angle,
>     'mjd': float,
> }
>
41,44c41
<             if is_angle[key]:
<                 rvalue = convert_angle(value)
<             else:
<                 rvalue = float(value)
---
>             rvalue = converter[key](value)
 
Henrique Dante de Almeida

Did you consider changing the file format in the first place, so that
you don't have to go through any contortions to parse it?

Anyway, here is a solution with regular expressions (I'm a beginner
with regular expressions in Python, so please correct it if it is wrong
and suggest better solutions):

import re
s = """Field f29227: Ra=20:23:46.54 Dec=+67:30:00.0 MJD=53370.06797690 Frames 5 Set 1
Field f31448: Ra=20:24:58.13 Dec=+79:39:43.9 MJD=53370.06811620 Frames 5 Set 2
Field f31226: Ra=20:24:45.50 Dec=+78:26:45.2 MJD=53370.06823860 Frames 5 Set 3
Field f31004: Ra=20:25:05.28 Dec=+77:13:46.9 MJD=53370.06836020 Frames 5 Set 4
Field f30782: Ra=20:25:51.94 Dec=+76:00:48.6 MJD=53370.06848210 Frames 5 Set 5
Field f30560: Dec=+74:47:50.3 Ra=20:27:01.82 MJD=53370.06860400 Frames 5 Set 6
Field f30338: Ra=20:28:32.35 Dec=+73:34:52.0 MJD=53370.06872620 Frames 5 Set 7
Field f30116: Ra=20:30:21.70 Dec=+72:21:53.6 MJD=53370.06884890 Frames 5 Set 8
Field f29894: Ra=20:32:28.54 Dec=+71:08:55.0 MJD=53370.06897070 Frames 5 Set 9
Field f29672: Ra=20:34:51.89 Dec=+69:55:56.6 MJD=53370.06909350 Frames 5 Set 10"""

s = s.split('\n')
# Accepts Ra and Dec in either order; MJD must come right after them.
r = re.compile(r'Field (\S+): (?:(?:Ra=(\S+) Dec=(\S+))|(?:Dec=(\S+) Ra=(\S+))) MJD=(\S+)')
for i in s:
    match = r.findall(i)
    field = match[0][0]
    # Only one of the two alternatives matches; the other's groups are empty.
    Ra = match[0][1] or match[0][4]
    Dec = match[0][2] or match[0][3]
    MJD = match[0][5]
    print field, Ra, Dec, MJD
 
Bruno Desthuilliers

Quick and dirty:

src = open('/path/to/yourfile.ext')
parsed = []
for line in src:
    line = line.strip()
    if not line:
        continue
    # split at the first colon only; the Ra and Dec values contain colons too
    head, rest = line.split(':', 1)
    field_id = head.split()[1]
    data = dict(field_id=field_id)
    parts = rest.split()
    for part in parts:
        try:
            key, val = part.split('=')
        except ValueError:
            # not a key=value token (e.g. "Frames", "5"), skip it
            continue
        data[key] = val
    parsed.append(data)
src.close()
 
Mike Driscoll

John Machin wrote:
> Perhaps you and the OP could spend some time becoming familiar with
> built-in functions and str methods. In particular, str.split is your
> friend:

I'm well aware of the split() method and built-ins, however since this
appeared to be a homework-type question and I was at work, I didn't
spend any time on the issue. The only reason I mentioned McGuire's
PyParsing module was because I had just finished reading his article
on the subject in Python Magazine and it sounded like something the OP
might find interesting.

Here's my own implementation based on what's already been done here.
I'm sure one could have some fun doing it with itertools or list
comprehensions to get really fancy.

<code>

raw_data = """
Field f29227: Ra=20:23:46.54 Dec=+67:30:00.0 MJD=53370.06797690 Frames 5 Set 1
Field f31448: Ra=20:24:58.13 Dec=+79:39:43.9 MJD=53370.06811620 Frames 5 Set 2
Field f31226: Ra=20:24:45.50 Dec=+78:26:45.2 MJD=53370.06823860 Frames 5 Set 3
Field f31004: Ra=20:25:05.28 Dec=+77:13:46.9 MJD=53370.06836020 Frames 5 Set 4
Field f30782: Ra=20:25:51.94 Dec=+76:00:48.6 MJD=53370.06848210 Frames 5 Set 5
Field f30560: Ra=20:27:01.82 Dec=+74:47:50.3 MJD=53370.06860400 Frames 5 Set 6
Field f30338: Ra=20:28:32.35 Dec=+73:34:52.0 MJD=53370.06872620 Frames 5 Set 7
Field f30116: Ra=20:30:21.70 Dec=+72:21:53.6 MJD=53370.06884890 Frames 5 Set 8
Field f29894: Ra=20:32:28.54 Dec=+71:08:55.0 MJD=53370.06897070 Frames 5 Set 9
Field f29672: Ra=20:34:51.89 Dec=+69:55:56.6 MJD=53370.06909350 Frames 5 Set 10
""".splitlines()

myList = []
for line in raw_data:
    items = line.split()
    myDict = {}
    for item in items:
        if '=' in item:
            key, value = item.split('=')
            myDict[key] = value
        elif item[:1].lower() == 'f' and item[-1:] == ':':
            myDict['id'] = item[1:-1]
    if myDict:  # skip blank lines, which would yield an empty dict
        myList.append(myDict)

print myList

</code>

This doesn't have any type checking or error handling, but it works
with the data provided.
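
For what it's worth, here is a sketch of how the comprehension idea mentioned above could be combined with some basic error handling (Python 3). The function name parse_record and the set of required keys are assumptions for illustration, not something taken from the posts:

```python
def parse_record(line):
    """Parse one 'Field fNNNNN: key=value ...' line into a dict of strings."""
    # Split at the first colon only; the Ra and Dec values contain colons too.
    head, sep, rest = line.partition(":")
    words = head.split()
    if not sep or len(words) != 2 or words[0] != "Field":
        raise ValueError("not a Field line: %r" % line)
    # Dict comprehension over the key=value tokens; other tokens are ignored.
    record = {key.lower(): value
              for key, eq, value in (tok.partition("=") for tok in rest.split())
              if eq}
    missing = {"ra", "dec", "mjd"}.difference(record)
    if missing:
        raise ValueError("missing %s in %r" % (sorted(missing), line))
    record["id"] = words[1]
    return record

sample = "Field f29227: Ra=20:23:46.54 Dec=+67:30:00.0 MJD=53370.06797690 Frames 5 Set 1"
print(parse_record(sample))
```

Raising ValueError on malformed lines is just one choice; skipping or logging them would work equally well, depending on how much you trust the file.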

Mike
 
Tommy Grav

Thanks to everyone who responded; I learned a lot about text parsing
from the responses. I just wanted to respond to Mike and let him know
that this was not a homework problem. I was given a file in this format
by a colleague for a project that I am working on (it contains a list of
fields observed by the LINEAR asteroid search project during 2005 and
2006). I could have parsed it using slices of each line, but the unusual
format of each line got me thinking about whether there was another way
to do it. I had tried a few approaches, but I had not considered
.split() and .split("="). Of course the list members quickly came up
with a simple and elegant solution. And I learned a lot in the process :)

Cheers
Tommy Grav
+----------------------------------------------------------------+
Associate Research Scientist     Dept. of Physics and Astronomy
Johns Hopkins University         Bloomberg 243
(e-mail address removed)         3400 N. Charles St.
(410) 516-7683                   Baltimore, MD 21218
+----------------------------------------------------------------+
 
