split and regexp on textfile

Flyzone · Apr 13, 2007

Hi,
I have a problem with the split function and regular expressions.
I have a file that I want to split, using the date lines as delimiters.
Here is a sample:
-----
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
-----

I'm trying to put all the lines into one string and then split it
(it would be better not to delete the \n, if possible):

while 1:
    line = ftoparse.readline()
    if not line: break
    if line[-1]=='\n': line=line[:-1]
    file_str += line
matchobj=re.compile('[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]')
matchobj=matchobj.split(file_str)
print(matchobj)

I have also tried

matchobj=re.split(r"^[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]",file_str)

and reading everything at once with

file_str=ftoparse.readlines()

but the split doesn't work. What am I doing wrong?

mik3l3374 · Apr 13, 2007

You're trying to match the date part, right? If a regular expression is
what you want, here's one example:

import re

data = open("file").read()
pat = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} \d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2} \d{4}",re.M|re.DOTALL)
print(pat.findall(data))

['Mon Apr 9 22:30:18 2007', 'Mon Apr 9 22:31:10 2007']

Flyzone · Apr 13, 2007

Um, no! I need to get the text block between two dates, not the
dates themselves!

mik3l3374 · Apr 13, 2007

Change it to pat.split(data) then.
I get this:

['', '\ntext\ntext\n', '\ntext\ntext ']

Flyzone · Apr 13, 2007

That's what I tried originally, but it is not working. Here is my
result:

["Mon Feb 26 11:25:04 2007\ntext\n text\ntext\nMon Feb 26 11:25:16 2007\ntext\n text\n text\nMon Feb 26 17:06:41 2007\ntext"]

Everything is still together in one string.

bearophileHUGS · Apr 13, 2007

Flyzone:

i have a problem with the split function and regexp.
I have a file that i want to split using the date as token.

My first try:

data = """
error text
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

section = []
for line in data.splitlines():
    if date_find.search(line):
        # A date line starts a new section: print the previous one first
        if section:
            print("\n" + "-" * 10 + "\n" + "\n".join(section))
        section = [line]
    else:
        if line:
            section.append(line)

# Print the last section
print("\n" + "-" * 10 + "\n" + "\n".join(section))

itertools.groupby() is suited to splitting sequences like:
1111100011111100011100101011111
into:
11111 000 111111 000 111 00 1 0 1 0 11111
Here, instead, we have a sequence like:
100001000101100001000000010000
that has to be split into:
10000 1000 10 1 10000 10000000 10000
A standard itertools function could be added for this quite common
situation too.
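
To make the difference concrete, here is a small sketch (Python 3, not part of the original post) of what groupby returns for the two sequences above when the key is simply the character itself. The variable names are illustrative only.

```python
from itertools import groupby

runs = "1111100011111100011100101011111"
marked = "100001000101100001000000010000"

# groupby starts a new group whenever the key (here the character) changes
run_groups = ["".join(g) for _, g in groupby(runs)]
print(" ".join(run_groups))

# On the second sequence every 1 is separated from the zeros that follow it,
# and adjacent 1s are merged into a single group
marked_groups = ["".join(g) for _, g in groupby(marked)]
print(" ".join(marked_groups))
```

In the second result the two adjacent 1s come back as one group "11", whereas the desired split treats every 1 as the start of a new group. That is why plain groupby does not fit this case directly.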

Along those lines I have devised this different (and maybe
over-engineered) version:

from itertools import groupby
import re

class Splitter(object):
    # Not tested much
    def __init__(self, predicate):
        self.predicate = predicate
        self.precedent_el = None
        self.state = True
    def __call__(self, el):
        # Flip the state on every element that starts a new group,
        # so that groupby() sees a key change there
        if self.predicate(el):
            self.state = not self.state
        self.precedent_el = el
        return self.state

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
splitter = Splitter(date_find.search)

sections = ("\n".join(g) for h,g in groupby(data.splitlines(),
                                            key=splitter))
for section in sections:
    if section:
        print("\n" + "-" * 10 + "\n" + section)

The Splitter class + the groupby can become a single simpler
generator, like in this version:

def grouper(seq, key=bool):
    # A fast identity function can be used instead of bool()
    # Not tested much
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [part]
        else:
            group.append(part)
    yield group

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

for section in grouper(data.splitlines(), date_find.search):
    print("\n" + "-" * 10 + "\n" + "\n".join(section))

Maybe that grouper can be modified to produce each group lazily, as
groupby does, instead of building a real list.

Flyzone (seen later):

Amm..not! I need to get the text-block between the two data, not the data!

Then you can modify the code like this:

def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [] # changed
        else:
            group.append(part)
    yield group

Bye,
bearophile

Flyzone · Apr 13, 2007

Damn, my regexp was wrong:
pat = re.compile("[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:][0-9][0-9]",re.M|re.DOTALL)

Now it is working!

Great! Thanks a lot for the help!

One small question: can pat.split split without deleting the date?

mik3l3374 · Apr 13, 2007

Not that I know of.
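
For what it's worth, re.split does return the delimiters as well when the pattern contains a capturing group: the text matched by each group is included in the result list. Below is a minimal sketch (Python 3), assuming the dates look like the sample in the first post. The pattern, the sample string, and the names date_pat, parts, and blocks are illustrative and not taken from the thread.

```python
import re

# Sample text modelled on the first post (an assumption, not the real file)
data = (
    "Mon Apr 9 22:30:18 2007\n"
    "text\n"
    "text\n"
    "Mon Apr 9 22:31:10 2007\n"
    "text\n"
    "text\n"
)

# The parentheses form a capturing group, so split() keeps each date
date_pat = re.compile(
    r"^([A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]?\d \d\d:\d\d:\d\d \d{4})\n",
    re.MULTILINE,
)

parts = date_pat.split(data)
# parts[0] is whatever precedes the first date; after that, dates and
# text blocks alternate, so they can be paired up
blocks = list(zip(parts[1::2], parts[2::2]))

for date, block in blocks:
    print(date)
    print(block, end="")
```

Pairing the odd and even slices gives one (date, text) tuple per block, with the newlines inside each block left intact.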

Gabriel Genellina · Apr 15, 2007

One small question: can pat.split split without deleting the date?

No, but instead of reading the whole file and splitting on dates, you
could iterate over the file and detect block endings:

def split_on_dates(ftoparse):
    # fancy_date_regexp is a placeholder for a compiled date pattern
    block = None
    for line in ftoparse:
        if fancy_date_regexp.match(line):
            # a new block begins, yield the previous one
            if block is not None:
                yield current_date, block
            current_date = line
            block = []
        elif block is not None:
            # accumulate lines for current block
            # (lines before the first date are skipped)
            block.append(line)
    # don't forget the last block
    if block is not None:
        yield current_date, block

for date, block in split_on_dates(ftoparse):
    pass  # process block
