split and regexp on textfile

Flyzone

Hi,
I have a problem with the split function and regexps.
I have a file that I want to split using the date as the delimiter.
Here is a sample:
-----
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
----

I'm trying to put all the lines into one string and then split it
(it would be better not to delete the \n, if possible...):
while 1:
    line = ftoparse.readline()
    if not line: break
    if line[-1]=='\n': line=line[:-1]
    file_str += line
matchobj=re.compile('[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]')
matchobj=matchobj.split(file_str)
print(matchobj)

I have also tried
matchobj=re.split(r"^[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]",file_str)
and reading everything at once with:
file_str=ftoparse.readlines()
but the split doesn't work... where am I going wrong?
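
Two details of the attempts above can be checked in isolation. The following is only a minimal sketch (Python 3; the sample string is made up from dates of the same shape, and the variable names do not come from the original script). It shows that ^ matches only at the very start of the string unless re.MULTILINE is passed, and that a | inside [...] is a literal pipe character rather than an alternation. Separately, readlines() returns a list of lines, while re.split expects a single string.

```python
import re

# Hypothetical sample: two entries, each a date line followed by text.
sample = "Mon Feb 26 11:25:04 2007\ntext\nMon Feb 26 11:25:16 2007\ntext\n"

# The second pattern from the post, anchored with ^.
pattern = r"^[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]"

# Without re.MULTILINE, ^ matches only at the start of the whole string,
# so only the first date is treated as a separator.
without_flag = re.split(pattern, sample)

# With re.MULTILINE, ^ also matches right after every newline.
with_flag = re.split(pattern, sample, flags=re.MULTILINE)

# Inside a character class, | is an ordinary character.
pipe_is_literal = re.fullmatch(r"[0-9| ]", "|") is not None

print(len(without_flag), len(with_flag), pipe_is_literal)
```

Because the pattern stops after the hour, the rest of each date (minutes, seconds, year) stays in the resulting pieces; extending the pattern to cover the whole date line avoids that.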
 
mik3l3374

You're trying to match the date part, right? If re is what you want,
here's one example:
import re
data = open("file").read()
pat = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} \d{1,2}\s+\d{1,2}:\d{2}:\d{2} \d{4}", re.M|re.DOTALL)
print(pat.findall(data))
['Mon Apr 9 22:30:18 2007', 'Mon Apr 9 22:31:10 2007']
 
Flyzone

Um, no! I need to get the text block between the two dates, not the
dates themselves! :)
 
mik3l3374

Change it to pat.split(data) then.
I get this:

['', '\ntext\ntext\n', '\ntext\ntext ']
 
Flyzone

That is what I tried originally, but it is not working; here is my
result:

["Mon Feb 26 11:25:04 2007\ntext\n text\ntext\nMon Feb 26 11:25:16 2007\ntext\n text\n text\nMon Feb 26 17:06:41 2007\ntext"]

All together :(
 
bearophileHUGS

Flyzone:
I have a problem with the split function and regexps.
I have a file that I want to split using the date as the delimiter.

My first try:

data = """
error text
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""

import re
# A date line ends with "hh:mm:ss yyyy"
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

section = []
for line in data.splitlines():
    if date_find.search(line):
        # A new date starts a new section: print the previous one first
        if section:
            print("\n" + "-" * 10 + "\n" + "\n".join(section))
        section = [line]
    else:
        if line:
            section.append(line)

# Print the last section
print("\n" + "-" * 10 + "\n" + "\n".join(section))



itertools.groupby() is well suited to splitting sequences like:
1111100011111100011100101011111
into:
11111 000 111111 000 111 00 1 0 1 0 11111
Here, instead, we have a sequence like:
100001000101100001000000010000
that has to be split into:
10000 1000 10 1 10000 10000000 10000
A standard itertool could be added for this fairly common situation too.
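
To make the difference concrete, here is a small check (Python 3; the variable names are just for illustration) of what groupby() does with the two bit strings above. On the first it gives exactly the wanted runs. On the second it still groups runs of equal characters, so each leading 1 is separated from its zeros and the two adjacent 1s are merged, which is not the split wanted here.

```python
from itertools import groupby

first = "1111100011111100011100101011111"
second = "100001000101100001000000010000"

# groupby() with no key function groups consecutive equal items.
first_runs = ["".join(g) for _, g in groupby(first)]
second_runs = ["".join(g) for _, g in groupby(second)]

print(" ".join(first_runs))
print(" ".join(second_runs))
```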

Along those lines I have devised this different (and maybe
over-engineered) version:


from itertools import groupby
import re

class Splitter(object):
    # Not tested much
    def __init__(self, predicate):
        self.predicate = predicate
        self.precedent_el = None
        self.state = True
    def __call__(self, el):
        # Flip the state at every element that starts a new section,
        # so groupby sees a different key for each consecutive section
        if self.predicate(el):
            self.state = not self.state
        self.precedent_el = el
        return self.state

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
splitter = Splitter(date_find.search)

sections = ("\n".join(g) for h,g in groupby(data.splitlines(),
                                             key=splitter))
for section in sections:
    if section:
        print("\n" + "-" * 10 + "\n" + section)


The Splitter class + the groupby can become a single simpler
generator, like in this version:


def grouper(seq, key=bool):
    # A fast identity function can be used instead of bool()
    # Not tested much
    group = []
    for part in seq:
        if key(part):
            # part starts a new group: emit the previous one first
            if group: yield group
            group = [part]
        else:
            group.append(part)
    yield group

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

for section in grouper(data.splitlines(), date_find.search):
    print("\n" + "-" * 10 + "\n" + "\n".join(section))


Maybe that grouper can be modified to manage group lazily, like
groupby does, instead of building a true list.
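
One possible way to get that laziness, sketched here under the assumption that each group is consumed before the next one is requested (the same restriction groupby itself has): let groupby do the work and give it a key that counts how many section starts have been seen so far. The name lazy_grouper and the sample lines are made up for illustration, and the code is Python 3 (it uses nonlocal).

```python
from itertools import groupby
import re

def lazy_grouper(seq, key=bool):
    # Number of section starts seen so far; all items of one
    # section share the same number, so groupby keeps them together.
    count = 0

    def section_number(item):
        nonlocal count
        if key(item):
            count += 1
        return count

    for _, group in groupby(seq, section_number):
        # group is a lazy iterator over the section, not a list
        yield group

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
lines = [
    "error text",
    "Mon Apr 9 22:30:18 2007", "text", "text",
    "Mon Apr 9 22:31:10 2007", "text",
]
sections = [list(g) for g in lazy_grouper(lines, date_find.search)]
print(sections)
```

Like the list-based grouper, this keeps the date line as the first item of its section; lines before the first date form a section of their own.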


Flyzone (seen later):
Um, no! I need to get the text block between the two dates, not the dates themselves! :)

Then you can modify the code like this:

def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [] # changed
        else:
            group.append(part)
    yield group

Bye,
bearophile
 
Flyzone

Damn, my regexp was wrong:
pat = re.compile("[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:][0-9][0-9]",re.M|re.DOTALL)

Now it is working! :)
Great! Thanks a lot for the help!

A little question: can pat.split split without deleting the date?
 
mik3l3374

Not that I know of.
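
For what it's worth, re.split does return the separators too when the pattern is enclosed in a capturing group. A minimal sketch (Python 3), assuming every date sits on a line of its own and looks like the sample in the first post; the pattern and the names data, date_pat, parts and pairs are illustrative, not taken from the real file:

```python
import re

# Hypothetical data in the shape of the sample from the first post.
data = (
    "Mon Apr 9 22:30:18 2007\ntext\ntext\n"
    "Mon Apr 9 22:31:10 2007\ntext\ntext\n"
)

# The parentheses make re.split keep each date in the result.
# The trailing newline sits outside the group, so it is dropped.
date_pat = re.compile(
    r"^([A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\n",
    re.MULTILINE,
)

parts = date_pat.split(data)

# parts[0] is whatever precedes the first date (empty here);
# after that, dates and text blocks alternate.
pairs = list(zip(parts[1::2], parts[2::2]))

for date, block in pairs:
    print(date, repr(block))
```

The text blocks keep their newlines this way, since only the newline that ends each date line is consumed by the pattern.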
 
Gabriel Genellina

A little question: can pat.split split without deleting the date?

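Not with the pattern as it stands (re.split only returns the separators
when the pattern contains a capturing group), but instead of reading the
whole file and splitting on dates, you could iterate over the file and
detect block endings: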

def split_on_dates(ftoparse):
    # fancy_date_regexp: a compiled regexp that matches a date line
    block = None
    for line in ftoparse:
        if fancy_date_regexp.match(line):
            # a new block begins, yield the previous one
            if block is not None:
                yield current_date, block
            current_date = line
            block = []
        elif block is not None:
            # accumulate lines for current block
            # (lines before the first date are skipped)
            block.append(line)
    # don't forget the last block
    if block is not None:
        yield current_date, block

for date, block in split_on_dates(ftoparse):
    pass # process block
 
