Multiline regex help


Yatima

Hey Folks,

I've got some info in a bunch of files that kind of looks like so:

Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

and so on...

Anyhow, these "fields" repeat several times in a given file (number of
repetitions varies from file to file). The number on the line following the
"RelevantInfo" lines is really what I'm after. Ideally, I would like to have
something like so:

RelevantInfo1 = 10/10/04 # The variable name isn't actually important
RelevantInfo3 = 23 # it's just there to illustrate what info I'm
# trying to snag.

Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

Collected from all of the files.

So, there would be several of these "scores" per file and there are a bunch
of files. Ultimately, I am interested in printing them out as a csv file but
that should be relatively easy once they are trapped in my array of doom
<cue evil laughter>.

I've got a fairly ugly "solution" (I am using this term *very* loosely)
using awk and his faithful companion sed, but I would prefer something in
python.

Thanks for your time.
 

Kent Johnson

Yatima said:
Hey Folks,

I've got some info in a bunch of files that kind of looks like so:

Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

and so on...

Anyhow, these "fields" repeat several times in a given file (number of
repetitions varies from file to file). The number on the line following the
"RelevantInfo" lines is really what I'm after. Ideally, I would like to have
something like so:

RelevantInfo1 = 10/10/04 # The variable name isn't actually important
RelevantInfo3 = 23 # it's just there to illustrate what info I'm
# trying to snag.

Here is a way to create a list of [RelevantInfo, value] pairs:
import cStringIO

raw_data = '''Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34'''
raw_data = cStringIO.StringIO(raw_data)

data = []
for line in raw_data:
    if line.startswith('RelevantInfo'):
        key = line.strip()
        value = raw_data.next().strip()
        data.append([key, value])

print data
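
For readers on Python 3, the same pairing idea still works, but cStringIO, the .next() method and the print statement are gone. A minimal sketch, assuming Python 3 and a shortened copy of the sample data (the name pairs is only illustrative):

```python
import io

raw_data = io.StringIO("""Gibberish
53
RelevantInfo1
10/10/04
NothingImportant
RelevantInfo2
22
RelevantInfo3
23
Crap
34""")

pairs = []
for line in raw_data:
    if line.startswith("RelevantInfo"):
        key = line.strip()
        # next() pulls the following line from the same iterator,
        # so the for loop will not see that line again.
        value = next(raw_data, "").strip()
        pairs.append([key, value])

print(pairs)
```

The default argument to next() is an addition here: it keeps the loop from raising StopIteration if a marker happens to be the last line of the input.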

Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

I'm not sure what you mean by this. Do you want to build a Score dictionary as well?

Kent
 

Steven Bethard

Yatima said:
Hey Folks,

I've got some info in a bunch of files that kind of looks like so:

Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

and so on...

Anyhow, these "fields" repeat several times in a given file (number of
repetitions varies from file to file). The number on the line following the
"RelevantInfo" lines is really what I'm after. Ideally, I would like to have
something like so:

RelevantInfo1 = 10/10/04 # The variable name isn't actually important
RelevantInfo3 = 23 # it's just there to illustrate what info I'm
# trying to snag.

Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

A possible solution, using the re module:

py> s = """\
... Gibberish
... 53
... MoreGarbage
... 12
... RelevantInfo1
... 10/10/04
... NothingImportant
... ThisDoesNotMatter
... 44
... RelevantInfo2
... 22
... BlahBlah
... 343
... RelevantInfo3
... 23
... Hubris
... Crap
... 34
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
...                    .*
...                    ^RelevantInfo2\n([^\n]*)
...                    .*
...                    ^RelevantInfo3\n([^\n]*)""",
...                re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'23': '22'}}

Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE
to have ^ apply at the start of each line, and VERBOSE to allow me to
write the re in a more readable form.
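
To see what each of the three flags contributes on its own, here is a small sketch on a made-up four-line string (the text and the "Key" marker are invented for illustration, and the code assumes Python 3):

```python
import re

text = "junk\nKey\nvalue\nmore junk\n"

# MULTILINE: ^ matches at the start of every line, not just the start of the string.
without_m = re.findall(r"^Key", text)                # []
with_m = re.findall(r"^Key", text, re.MULTILINE)     # ['Key']

# DOTALL: . also matches a newline, so .* can run across lines.
one_line = re.search(r"Key.*", text).group()         # 'Key'
to_the_end = re.search(r"Key.*", text, re.DOTALL).group()  # 'Key\nvalue\nmore junk\n'

# VERBOSE: whitespace in the pattern is ignored and # starts a comment.
pattern = re.compile(r"""
    ^Key\n        # the marker line
    ([^\n]*)      # everything on the following line
    """, re.MULTILINE | re.VERBOSE)
values = pattern.findall(text)                       # ['value']

print(without_m, with_m, repr(one_line), repr(to_the_end), values)
```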

If I didn't get your dict update quite right, hopefully you can see how
to fix it!

HTH,

STeVe
 

Yatima

A possible solution, using the re module:

py> s = """\
... Gibberish
... 53
... MoreGarbage
... 12
... RelevantInfo1
... 10/10/04
... NothingImportant
... ThisDoesNotMatter
... 44
... RelevantInfo2
... 22
... BlahBlah
... 343
... RelevantInfo3
... 23
... Hubris
... Crap
... 34
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
... .*
... ^RelevantInfo2\n([^\n]*)
... .*
... ^RelevantInfo3\n([^\n]*)""",
... re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'23': '22'}}

Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE
to have ^ apply at the start of each line, and VERBOSE to allow me to
write the re in a more readable form.

If I didn't get your dict update quite right, hopefully you can see how
to fix it!

Thanks! That was very helpful. Unfortunately, I wasn't completely clear when
describing the problem. Is there any way to extract multiple scores from the
same file and from multiple files? (I will probably use the "fileinput"
module to deal with multiple files.) So, if I've got, say:

Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

SecondSetofGarbage
2423
YouGetThePicture
342342
RelevantInfo1
10/10/04
HoHum
343
MoreStuffNotNeeded
232
RelevantInfo2
33
RelevantInfo3
44
sdfsdf
RelevantInfo1
10/11/04
InsertBoringFillerHere
43234
Stuff
MoreStuff
RelevantInfo2
45
ExcitingIsntIt
324234
RelevantInfo3
60
Lalala

Sorry for the long and painful example input. Notice that the first two
"RelevantInfo1" fields have the same info but that the RelevantInfo2 and
RelevantInfo3 fields have different info. Also, there will be cases where
RelevantInfo3 might be the same with a different RelevantInfo2. What I'm
hoping for is something along the lines of being able to organize it like
so (don't worry about the format of the output -- I'll deal with that
later; "RelevantInfo" shortened to "Info" for readability):

Info1[0], Info1[1], Info1[2] ...
Info3[0] Info2[Info1[0],Info3[0]] Info2[Info1[1],Info3[0]] ...
Info3[1] Info2[Info1[0],Info3[1]] ...
Info3[2] Info2[Info1[0],Info3[2]] ...
....

I don't really care if it's a list, dictionary, array etc.
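
One way the table sketched above could be produced is with a dict of dicts and the csv module. This is only a sketch under assumptions: the score dict is typed in by hand from the sample input, the layout (one column per Info1 value, one row per Info3 value) is my reading of the description, and the code assumes Python 3.

```python
import csv
import io

# Assumed structure: score[info1][info3] = info2, values taken from the sample input.
score = {
    "10/10/04": {"23": "22", "44": "33"},
    "10/11/04": {"60": "45"},
}

dates = sorted(score)                                             # column headers (Info1)
subjects = sorted({k for row in score.values() for k in row})     # row labels (Info3)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["Info3"] + dates)
for subject in subjects:
    # Missing (Info1, Info3) combinations become empty cells.
    writer.writerow([subject] + [score[d].get(subject, "") for d in dates])

print(out.getvalue())
```

Sorting works here because the keys are plain strings; dates in mm/dd/yy form that span more than one year would need a real date sort.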

Thanks again for your help. The multiline option in the re module is very
useful.

Take care.
 

James Stroud

Have a look at "martel", part of biopython. The world of bioinformatics is
filled with files with structure like this.

http://www.biopython.org/docs/api/public/Martel-module.html

James

 

Yatima

Here is a way to create a list of [RelevantInfo, value] pairs:
import cStringIO

raw_data = '''Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34'''
raw_data = cStringIO.StringIO(raw_data)

data = []
for line in raw_data:
    if line.startswith('RelevantInfo'):
        key = line.strip()
        value = raw_data.next().strip()
        data.append([key, value])

print data

Thank you. This isn't exactly what I'm looking for (I wasn't clear in
describing the problem -- please see my reply to Steve for a, hopefully,
better explanation) but it does give me a few ideas.

Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

I'm not sure what you mean by this. Do you want to build a Score dictionary as well?

Sure... Uhhh.. I think. Okay, what I want is some kind of awk-like
associative array because the raw data files will have repeats for certain
field values such that there would be, for example, multiple RelevantInfo2's
and RelevantInfo3's for the same RelevantInfo1 (i.e. on the same date). To
make matters more exciting, there will be multiple RelevantInfo1's (dates)
for the same RelevantInfo3 (e.g. a subject ID). RelevantInfo2 will be the
value for all unique combinations of RelevantInfo1 and RelevantInfo3. There
will be multiple occurrences of these fields in the same file (original data
sample was not very good for this reason) and multiple files as well. The
interesting three fields will always be repeated in the same order although
the amount of irrelevant data in between may vary. So:

RelevantInfo1
10/10/04
<snipped crap>
RelevantInfo2
12
<more snippage>
RelevantInfo3
43
<more snippage>
RelevantInfo1
10/10/04 <- The same as the first occurrence of RelevantInfo1
<snipped>
RelevantInfo2
22
<snipped>
RelevantInfo3
25
<snipped>
RelevantInfo1
10/11/04
<snipped>
RelevantInfo2
34
<snipped>
RelevantInfo3
28
<snipped>
RelevantInfo1
10/12/04
<snipped>
RelevantInfo2
98
<snipped>
RelevantInfo3
25 <- The same as the second occurrence of RelevantInfo3
....

Sorry for the long and tedious "data" example.

There will be missing values for some combinations of RelevantInfo1 and
RelevantInfo3 so hopefully that won't be an issue.
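
For what it's worth, the closest Python analogue to an awk array indexed as arr[a,b] is probably a dict keyed by a tuple. A minimal sketch, assuming Python 3 and with the entries typed in by hand from the example above (the "n/a" placeholder is an arbitrary choice):

```python
# (RelevantInfo1, RelevantInfo3) -> RelevantInfo2, hand-typed from the example data
score = {
    ("10/10/04", "43"): "12",
    ("10/10/04", "25"): "22",
    ("10/11/04", "28"): "34",
    ("10/12/04", "25"): "98",
}

# A combination that exists.
present = score[("10/12/04", "25")]

# A combination that never occurred: .get() returns a default instead of raising KeyError.
missing = score.get(("10/11/04", "25"), "n/a")

# All dates recorded for one subject ID.
by_subject = {date: value for (date, subj), value in score.items() if subj == "25"}

print(present, missing, by_subject)
```

With this layout a missing combination is simply an absent key, so nothing special has to be stored for it.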

Thanks again for your reply.

Take care.
 

Steven Bethard

Yatima said:
A possible solution, using the re module:

py> s = """\
... Gibberish
... 53
... MoreGarbage
... 12
... RelevantInfo1
... 10/10/04
... NothingImportant
... ThisDoesNotMatter
... 44
... RelevantInfo2
... 22
... BlahBlah
... 343
... RelevantInfo3
... 23
... Hubris
... Crap
... 34
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
... .*
... ^RelevantInfo2\n([^\n]*)
... .*
... ^RelevantInfo3\n([^\n]*)""",
... re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'23': '22'}}

Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE
to have ^ apply at the start of each line, and VERBOSE to allow me to
write the re in a more readable form.

If I didn't get your dict update quite right, hopefully you can see how
to fix it!


Thanks! That was very helpful. Unfortunately, I wasn't completely clear when
describing the problem. Is there any way to extract multiple scores from the
same file and from multiple files

I think if you use the non-greedy .*? instead of the greedy .*, you'll
get this behavior. For example:

py> s = """\
... Gibberish
... 53
... MoreGarbage
[snip a whole bunch of stuff]
... RelevantInfo3
... 60
... Lalala
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
...                    .*?
...                    ^RelevantInfo2\n([^\n]*)
...                    .*?
...                    ^RelevantInfo3\n([^\n]*)""",
...                re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'44': '33', '23': '22'}, '10/11/04': {'60': '45'}}
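
The difference between the greedy and non-greedy forms is easier to see on a tiny input. The sketch below assumes Python 3 and uses invented two-field data with "A" and "B" markers rather than the real file format:

```python
import re

s = """A
1
B
2
A
3
B
4
"""
flags = re.DOTALL | re.MULTILINE
greedy = re.compile(r"^A\n([^\n]*).*^B\n([^\n]*)", flags)
lazy = re.compile(r"^A\n([^\n]*).*?^B\n([^\n]*)", flags)

# Greedy .* runs to the end of the string and backtracks to the LAST "B",
# so the first "A" value gets paired with the last "B" value.
greedy_pairs = greedy.findall(s)   # [('1', '4')]

# Non-greedy .*? stops at the FIRST "B" after each "A", giving one match per record.
lazy_pairs = lazy.findall(s)       # [('1', '2'), ('3', '4')]

print(greedy_pairs, lazy_pairs)
```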

If you might have multiple info2 values for the same (info1, info3)
pair, you can try something like:

py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {}).setdefault(info3, []).append(info2)
...
py> score
{'10/10/04': {'44': ['33'], '23': ['22']}, '10/11/04': {'60': ['45']}}
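
An alternative to chaining setdefault calls is collections.defaultdict. A sketch, assuming Python 3; the triples list stands in for what m.findall(s) returns, and the last triple is made up to show two values landing under the same (info1, info3) pair:

```python
from collections import defaultdict

# Hypothetical (info1, info2, info3) triples in the order findall() yields them.
triples = [
    ("10/10/04", "22", "23"),
    ("10/10/04", "33", "44"),
    ("10/11/04", "45", "60"),
    ("10/10/04", "99", "23"),   # invented duplicate pair
]

# Outer dict creates an inner defaultdict on demand; inner dict creates a list on demand.
score = defaultdict(lambda: defaultdict(list))
for info1, info2, info3 in triples:
    score[info1][info3].append(info2)

print(score["10/10/04"]["23"])
```

One trade-off: merely looking up a missing key in a defaultdict creates it, so use .get() or an "in" test when checking whether a combination exists.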

HTH,

STeVe
 

Kent Johnson

Here is another attempt. I'm still not sure I understand what form you want the data in. I made a
dict -> dict -> list structure, so if you look up e.g. scores['10/11/04']['60'] you get a list of all
the RelevantInfo2 values for RelevantInfo1='10/11/04' and RelevantInfo3='60'.

The parser is a simple-minded state machine that will misbehave if the input does not have entries
in the order RelevantInfo1, RelevantInfo2, RelevantInfo3 (with as many intervening lines as you like).

All three values are available when RelevantInfo3 is detected, so you could do something else with them
if you want.

HTH
Kent

import cStringIO

raw_data = '''Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

SecondSetofGarbage
2423
YouGetThePicture
342342
RelevantInfo1
10/10/04
HoHum
343
MoreStuffNotNeeded
232
RelevantInfo2
33
RelevantInfo3
44
sdfsdf
RelevantInfo1
10/11/04
InsertBoringFillerHere
43234
Stuff
MoreStuff
RelevantInfo2
45
ExcitingIsntIt
324234
RelevantInfo3
60
Lalala'''
raw_data = cStringIO.StringIO(raw_data)

scores = {}
info1 = info2 = info3 = None

for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

print scores
print scores['10/11/04']['60']
print scores['10/10/04']['23']

## prints:
{'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
['45']
['22', '22']
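
Since multiple files were mentioned, here is a sketch of the same state machine wrapped in a function that takes any iterable of lines. It assumes Python 3; the function name parse_scores and the rule of skipping incomplete records are my own additions, not part of the code above.

```python
def parse_scores(lines):
    """Collect scores[info1][info3] -> list of info2 values from an iterable of lines."""
    lines = iter(lines)   # the for loop and next() must share one iterator
    scores = {}
    info1 = info2 = None
    for line in lines:
        if line.startswith("RelevantInfo1"):
            info1 = next(lines, "").strip()
        elif line.startswith("RelevantInfo2"):
            info2 = next(lines, "").strip()
        elif line.startswith("RelevantInfo3"):
            info3 = next(lines, "").strip()
            # Assumption: a record missing Info1 or Info2 is skipped
            # rather than stored under a None key.
            if info1 is not None and info2 is not None:
                scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
            info1 = info2 = None
    return scores


sample = """RelevantInfo1
10/10/04
HoHum
RelevantInfo2
33
RelevantInfo3
44
sdfsdf
RelevantInfo1
10/11/04
RelevantInfo2
45
ExcitingIsntIt
RelevantInfo3
60
""".splitlines(keepends=True)

scores = parse_scores(sample)
print(scores)
```

Because the function only needs an iterable of lines, something like parse_scores(fileinput.input(files=paths)) should work for several files at once, where paths is a hypothetical list of file names. A record that straddles two files would be merged, so this assumes every file contains whole records.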
 

Yatima

I think if you use the non-greedy .*? instead of the greedy .*, you'll
get this behavior. For example:

py> s = """\
... Gibberish
... 53
... MoreGarbage
[snip a whole bunch of stuff]
... RelevantInfo3
... 60
... Lalala
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
... .*?
... ^RelevantInfo2\n([^\n]*)
... .*?
... ^RelevantInfo3\n([^\n]*)""",
... re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'44': '33', '23': '22'}, '10/11/04': {'60': '45'}}

If you might have multiple info2 values for the same (info1, info3)
pair, you can try something like:

py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {}).setdefault(info3, []).append(info2)
...
py> score
{'10/10/04': {'44': ['33'], '23': ['22']}, '10/11/04': {'60': ['45']}}

Perfect! Thank you so much. This is the behaviour I'm looking for. I will
fiddle around with this some more tonight but the rest should be okay.

Take care.
 

Yatima

Here is another attempt. I'm still not sure I understand what form you want the data in. I made a
dict -> dict -> list structure, so if you look up e.g. scores['10/11/04']['60'] you get a list of all
the RelevantInfo2 values for RelevantInfo1='10/11/04' and RelevantInfo3='60'.

The parser is a simple-minded state machine that will misbehave if the input does not have entries
in the order RelevantInfo1, RelevantInfo2, RelevantInfo3 (with as many intervening lines as you like).

All three values are available when RelevantInfo3 is detected, so you could do something else with them
if you want.

HTH
Kent

import cStringIO

raw_data = '''Gibberish
53
MoreGarbage [mass snippage]
60
Lalala'''
raw_data = cStringIO.StringIO(raw_data)

scores = {}
info1 = info2 = info3 = None

for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

print scores
print scores['10/11/04']['60']
print scores['10/10/04']['23']

## prints:
{'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
['45']
['22', '22']

Thank you so much. Your solution and Steve's both give me what I'm looking
for. I appreciate both of your incredibly quick replies!

Take care.
 

Steven Bethard

Kent said:
for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

Very pretty. =) I have to say, I hadn't ever used iterators this way
before, that is, calling their next method from within a for-loop. I
like it. =)

Thanks for opening my mind. ;)

STeVe
 

Kent Johnson

Steven said:
Kent said:
for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None


Very pretty. =) I have to say, I hadn't ever used iterators this way
before, that is, calling their next method from within a for-loop. I
like it. =)

I confess I have a nagging suspicion that someone who actually knows something about CPython
internals will tell me why it's a bad idea...but it sure is handy!
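
As far as the language itself goes, the trick relies on documented behaviour rather than CPython internals: a for loop asks its object for an iterator once and then calls next on it, and an iterator returns itself when asked. The two pitfalls I'm aware of are sketched below, assuming Python 3 and a made-up list of lines:

```python
lines = ["Key\n", "value\n", "junk\n", "Key\n"]   # the last marker has no value line

# Pitfall 1: a list is iterable but not an iterator, so next(lines) would fail.
# Wrapping it in iter() gives one iterator shared by the for loop and next().
it = iter(lines)

found = []
for line in it:
    if line.startswith("Key"):
        # Pitfall 2: a bare next(it) raises StopIteration when the marker is
        # the very last line. Passing a default avoids that.
        value = next(it, None)
        found.append(None if value is None else value.strip())

print(found)
```

A related caveat from the Python 2 file object documentation: because file iteration used a read-ahead buffer, mixing next() with readline() on the same file did not work reliably. Sticking to next() alone, as the parser does, avoids that.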

Thanks for opening my mind. ;)

My pleasure :)

Kent
 
