Record separator

greymaus · Aug 26, 2011

Is there an equivalent for the AWK RS in Python?

as in RS='\n\n'
will separate a file at two blank line intervals

D'Arcy J.M. Cain · Aug 26, 2011

Is there an equivalent for the AWK RS in Python?

as in RS='\n\n'
will separate a file at two blank line intervals

open("file.txt").read().split("\n\n")
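
A minimal sketch of the same one-liner wrapped in a function, assuming the file fits comfortably in memory. The name read_records and the sample contents are made up for illustration; the with block is only there so the file gets closed promptly.

```python
import os
import tempfile

def read_records(path, sep="\n\n"):
    """Return the file's contents split on sep (the whole file is read into memory)."""
    with open(path) as f:
        return f.read().split(sep)

# Demonstration with a throwaway file; the contents are invented sample data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\n\nc\n\nd\ne\n")
records = read_records(tmp.name)
os.remove(tmp.name)
print(records)
```

Note that the last record keeps its trailing newline, because split() only removes the separator itself.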

greymaus · Aug 27, 2011

open("file.txt").read().split("\n\n")

Ta!.. bit awkard.

)))))

Steven D'Aprano · Aug 27, 2011

greymaus said:
Ta!.. bit awkard. )))))

Er, is that meant to be a pun? "Awk[w]ard", as in awk-ward?

In any case, no, the Python line might be a handful of characters longer
than the AWK equivalent, but it isn't awkward. It is logical and easy to
understand. It's embarrassingly easy to describe what it does:

(open("file.txt")    # opens the file
    .read()          # reads the contents of the file
    .split("\n\n"))  # splits the text on double-newlines.

The only tricky part is knowing that \n means newline, but anyone familiar
with C, Perl, AWK etc. should know that.

The Python code might be "long" (but only by the standards of AWK, which can
be painfully concise), but it is simple, obvious and readable. A few extra
characters is the price you pay for making your language readable. At the
cost of a few extra key presses, you get something that you will be able to
understand in 10 years' time.

AWK is a specialist text processing language. Python is a general scripting
and programming language. They have different values: AWK values short,
concise code; Python is willing to pay a little more in source code.

Roy Smith · Aug 27, 2011

Steven D'Aprano said:
(open("file.txt")    # opens the file
    .read()          # reads the contents of the file
    .split("\n\n"))  # splits the text on double-newlines.

The biggest problem with this code is that read() slurps the entire file
into a string. That's fine for moderately sized files, but will fail
(or at least be grossly inefficient) for very large files.

It's always annoyed me a little that while it's easy to iterate over the
lines of a file, it's more complicated to iterate over a file character
by character. You could write your own generator to do that:

def getchar(f):
    for line in f:
        for c in line:
            yield c

for c in getchar(open("file.txt")):
    pass  # whatever

but that's annoyingly verbose (and probably not hugely efficient).

Of course, the next obstacle for the specific problem at hand is that
even with an iterator over the characters of a file, split() only works
on strings. It would be nice to have a version of split which took an
iterable and returned an iterator over the split components. Maybe
there is such a thing and I'm just missing it?
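
One way such a lazy split could look, offered as a sketch and not as anything from the standard library. The name isplit is hypothetical; it accepts any iterable of single characters and is meant to mirror str.split for a non-empty separator.

```python
def isplit(chars, sep):
    """Lazily split an iterable of single characters on the string sep."""
    if not sep:
        raise ValueError("empty separator")
    n = len(sep)
    buf = []
    for c in chars:
        buf.append(c)
        # A separator has just been completed if the last n characters match it.
        if len(buf) >= n and "".join(buf[-n:]) == sep:
            yield "".join(buf[:-n])
            buf = []
    # Whatever follows the last separator is the final piece (possibly empty).
    yield "".join(buf)


text = "a\nb\n\nc\n\nd"   # invented sample data
pieces = list(isplit(iter(text), "\n\n"))
print(pieces)
```

Only the current piece is held in memory, so this would pair naturally with a character iterator over a large file, at the cost of doing the work one character at a time in Python.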

ChasBrown · Aug 27, 2011

The biggest problem with this code is that read() slurps the entire file
into a string. That's fine for moderately sized files, but will fail
(or at least be grossly inefficient) for very large files.

It's always annoyed me a little that while it's easy to iterate over the
lines of a file, it's more complicated to iterate over a file character
by character. You could write your own generator to do that:

def getchar(f):
    for line in f:
        for c in line:
            yield c

for c in getchar(open("file.txt")):
    pass  # whatever

but that's annoyingly verbose (and probably not hugely efficient).

read() takes an optional size parameter; so f.read(1) is another
option...
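
A short sketch of the read(1) idea, using the two-argument form of the built-in iter(). The io.StringIO object is only a stand-in for a real file, and this getchar is an illustrative variant, not the generator quoted above.

```python
import io
from functools import partial

def getchar(f):
    """Iterate over a file one character at a time using read(1)."""
    # iter(callable, sentinel) keeps calling f.read(1) until it returns "".
    return iter(partial(f.read, 1), "")

f = io.StringIO("ab\ncd")   # stand-in for open("file.txt")
chars = list(getchar(f))
print(chars)
```

This avoids the nested loops, though one read() call per character is unlikely to be fast on large files.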

Of course, the next obstacle for the specific problem at hand is that
even with an iterator over the characters of a file, split() only works
on strings. It would be nice to have a version of split which took an
iterable and returned an iterator over the split components. Maybe
there is such a thing and I'm just missing it?

I don't know if there is such a thing; but for the OP's problem you
could read the file in chunks, e.g.:

def readgroup(f, delim, buffsize=8192):
    tail = ''
    while True:
        s = f.read(buffsize)
        if not s:
            # End of file: whatever is left over is the final group.
            yield tail
            break
        groups = (tail + s).split(delim)
        # The last piece may be incomplete; carry it into the next chunk.
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

for group in readgroup(open('file.txt'), '\n\n'):
    pass  # do something

Cheers - Chas
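
A quick way to convince yourself that the chunked reader above copes with a delimiter that straddles two chunks. This is only a sketch: io.StringIO stands in for a real file, the sample text is invented, and the generator is repeated so the block runs on its own.

```python
import io

def readgroup(f, delim, buffsize=8192):
    # Same logic as the generator above, repeated so this block is self-contained.
    tail = ''
    while True:
        s = f.read(buffsize)
        if not s:
            yield tail
            break
        groups = (tail + s).split(delim)
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

text = "a\nb\n\nc\n\nd"   # invented sample data
expected = text.split("\n\n")
# Try several tiny buffer sizes so the delimiter lands on chunk boundaries.
results = [list(readgroup(io.StringIO(text), "\n\n", buffsize=n))
           for n in range(1, 12)]
print(all(r == expected for r in results))
```

Carrying the unfinished tail into the next chunk is what makes the boundary case work: a lone "\n" at the end of one chunk is rejoined with the "\n" that starts the next.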

Terry Reedy · Aug 27, 2011

The biggest problem with this code is that read() slurps the entire file
into a string. That's fine for moderately sized files, but will fail
(or at least be grossly inefficient) for very large files.

I read the above as separating the file into paragraphs, as indicated by
blank lines.

def paragraphs(file):
    para = []
    for line in file:
        if line:
            para.append(line)
        else:
            yield para  # or ''.join(para), as desired
            para = []

Chris Angelico · Aug 27, 2011

yield para # or ''.join(para), as desired

Or possibly '\n'.join(para) if you want to keep the line breaks inside
paragraphs.

ChrisA
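
A small illustration of the two join styles, assuming the lines come from iterating over a file (so each still ends in a newline). The sample lines are invented.

```python
# Lines as produced by iterating over a text file: each keeps its "\n".
para = ["first line\n", "second line\n"]

# With the newlines still attached, ''.join already preserves the line breaks.
joined_raw = "".join(para)

# If the lines have been stripped first, '\n'.join puts the breaks back.
stripped = [line.rstrip("\n") for line in para]
joined_stripped = "\n".join(stripped)

print(repr(joined_raw))
print(repr(joined_stripped))
```

Which one is appropriate depends on whether the line endings were stripped before the lines were collected.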

Roy Smith · Aug 27, 2011

Terry Reedy said:
The biggest problem with this code is that read() slurps the entire file
into a string. That's fine for moderately sized files, but will fail
(or at least be grossly inefficient) for very large files.

I read the above as separating the file into paragraphs, as indicated by
blank lines.

def paragraphs(file):
    para = []
    for line in file:
        if line:
            para.append(line)
        else:
            yield para  # or ''.join(para), as desired
            para = []

Plus or minus the last paragraph in the file

Terry Reedy · Aug 28, 2011

Terry Reedy said:

(open("file.txt")    # opens the file
    .read()          # reads the contents of the file
    .split("\n\n"))  # splits the text on double-newlines.

The biggest problem with this code is that read() slurps the entire file
into a string. That's fine for moderately sized files, but will fail
(or at least be grossly inefficient) for very large files.

I read the above as separating the file into paragraphs, as indicated by
blank lines.

def paragraphs(file):
    para = []
    for line in file:
        if line:
            para.append(line)
        else:
            yield para  # or ''.join(para), as desired
            para = []

Plus or minus the last paragraph in the file

Oh right, I forgot the last line, which is a repeat of the yield after
the for loop finishes.
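
For reference, a sketch of the generator with that final yield added. It also makes one assumption that is not in the posts above: because lines read from a file keep their trailing newline, a blank line arrives as "\n" and is still truthy, so the test here is on line.strip(). Runs of blank lines are treated as a single separator, which differs slightly from split("\n\n").

```python
import io

def paragraphs(file):
    """Yield each blank-line-separated paragraph as a list of lines."""
    para = []
    for line in file:
        if line.strip():      # non-blank line: part of the current paragraph
            para.append(line)
        elif para:            # blank line ends a paragraph; extra blanks are skipped
            yield para        # or ''.join(para), as desired
            para = []
    if para:                  # the repeated yield, for the last paragraph
        yield para

sample = io.StringIO("a\nb\n\nc\n\n\nd\n")   # stand-in for an open file
result = list(paragraphs(sample))
print(result)
```

Since the file is consumed line by line, only one paragraph is held in memory at a time.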

greymaus · Aug 28, 2011

greymaus said:

Ta!.. bit awkard. )))))

Er, is that meant to be a pun? "Awk[w]ard", as in awk-ward?

Yup, misspelled it and realized the error

In any case, no, the Python line might be a handful of characters longer
than the AWK equivalent, but it isn't awkward. It is logical and easy to
understand. It's embarrassingly easy to describe what it does:

(open("file.txt")    # opens the file
    .read()          # reads the contents of the file
    .split("\n\n"))  # splits the text on double-newlines.

The only tricky part is knowing that \n means newline, but anyone familiar
with C, Perl, AWK etc. should know that.

The Python code might be "long" (but only by the standards of AWK, which can
be painfully concise), but it is simple, obvious and readable. A few extra
characters is the price you pay for making your language readable. At the
cost of a few extra key presses, you get something that you will be able to
understand in 10 years' time.

AWK is a specialist text processing language. Python is a general scripting
and programming language. They have different values: AWK values short,
concise code; Python is willing to pay a little more in source code.

RS, and its Perl equivalent, which I forget, mean that you can read in
full multiline records.
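
A rough Python counterpart to reading multiline records, as a sketch only. The sample data and the one-field-per-line layout are assumptions made for illustration. (For what it's worth, Perl's input record separator variable is $/.)

```python
# Invented sample: two records separated by a blank line, one field per line.
text = "Alice\n555-1234\n\nBob\n555-9876\n"

# Split into records on the blank line, then split each record into its lines.
records = [rec.splitlines() for rec in text.split("\n\n") if rec.strip()]

for name, phone in records:
    print(name, phone)
```

The inner splitlines() plays roughly the role that setting the field separator to a newline would play in AWK.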

(I am coming into Python via Perl from AWK, and trying to get a grip
on the language and its idioms)

Thanks to All

Oh, and Awk is far more than a text processing language; it may be old (like me!)
but it is useful (ditto)
