itertools.groupby

Jason Friedman

I have a file such as:

$ cat my_data
Starting a new group
a
b
c
Starting a new group
1
2
3
4
Starting a new group
X
Y
Z
Starting a new group

I want a list of lists:
['a', 'b', 'c']
['1', '2', '3', '4']
['X', 'Y', 'Z']
[]

I wrote this:
------------------------------------
#!/usr/bin/python3
from itertools import groupby

def get_lines_from_file(file_name):
    # Yield the lines of the file, stripped of whitespace.
    with open(file_name) as reader:
        for line in reader:
            yield line.strip()

counter = 0
def key_func(x):
    # Bump the counter on each header line, so that a header and the
    # lines following it all share one group key.
    if x.startswith("Starting a new group"):
        global counter
        counter += 1
    return counter

for key, group in groupby(get_lines_from_file("my_data"), key_func):
    print(list(group)[1:])  # drop the header line from each group
------------------------------------

I get the output I desire, but I'm wondering if there is a solution
without the global counter.
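For the record, one way to keep groupby but drop the global is to close
over the counter with nonlocal (a sketch, not from the thread;
key_func_factory is an illustrative name, and nonlocal needs Python 3):

from itertools import groupby

def key_func_factory(header):
    counter = 0
    def key_func(line):
        nonlocal counter
        if line.startswith(header):
            counter += 1  # each header line starts a new group key
        return counter
    return key_func

for key, group in groupby(get_lines_from_file("my_data"),
                          key_func_factory("Starting a new group")):
    print(list(group)[1:])  # drop the header line from each group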
 
Steven D'Aprano

> I have a file such as: [...]
>
> I want a list of lists: [...]
>
> I wrote this: [...]
>
> I get the output I desire, but I'm wondering if there is a solution
> without the global counter.


I wouldn't use groupby. It's a hammer; not every grouping job is a nail.

Instead, use a simple accumulator:


def group(lines):
    accum = []
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum:  # Don't bother if there are no accumulated lines.
                yield accum
            accum = []
        else:
            accum.append(line)
    # Don't forget the last group of lines.
    if accum: yield accum
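A quick check on the sample file might look like this (note that it
drops the final empty group, a point Joshua picks up below):

with open('my_data') as f:
    for g in group(f):
        print(g)
# ['a', 'b', 'c']
# ['1', '2', '3', '4']
# ['X', 'Y', 'Z']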
 
Joshua Landau

On 21 April 2013 01:13, Steven D'Aprano wrote:
> I wouldn't use groupby. It's a hammer; not every grouping job is a nail.
>
> Instead, use a simple accumulator: [...]

Whilst yours is the simplest (bar Dennis Lee Bieber's) and nicer in that
it yields, neither of yours works properly for empty groups.

I recommend the simple change:

def group(lines):
    accum = None
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum is not None:  # None only before the first header.
                yield accum
            accum = []
        else:
            accum.append(line)
    # Don't forget the last group of lines, even if it is empty.
    yield accum
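With the sample file this yields the trailing empty group as well:
['a', 'b', 'c'], ['1', '2', '3', '4'], ['X', 'Y', 'Z'] and finally [].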

But I will recommend my own small twist (because I think it is clever):

def group(lines):
    lines = (line.strip() for line in lines)

    if next(lines) != "Starting a new group":
        raise ValueError("First line must be 'Starting a new group'")

    while True:
        accum = []

        for line in lines:
            if line == "Starting a new group":
                break
            accum.append(line)
        else:
            # for/else: the iterator ran dry without hitting another
            # header, so this is the final group.
            yield accum
            break

        yield accum
 
Neil Cerutti

> I have a file such as: [...]
>
> I want a list of lists: [...]

Hrmmm, hoomm. Nobody cares for slicing any more.

def headered_groups(lst, header):
    b = lst.index(header) + 1
    while True:
        try:
            e = lst.index(header, b)
        except ValueError:
            yield lst[b:]
            break
        yield lst[b:e]
        b = e + 1

for group in headered_groups([line.strip() for line in open('data.txt')],
                             "Starting a new group"):
    print(group)
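On the sample data this prints all four groups, including the trailing
[]: after the last header, lst.index() raises ValueError and the final
slice lst[b:] is empty.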
 
Oscar Benjamin

> Hrmmm, hoomm. Nobody cares for slicing any more.
>
> def headered_groups(lst, header): [...]

This requires the whole file to be read into memory. Iterators are
typically preferred over list slicing for sequential text file access
since you can avoid loading the whole file at once. This means that
you can process a large file while only using a constant amount of
memory.
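For example, an iterator-friendly variant of headered_groups (a sketch,
not from the thread) could drop list.index in favour of a running
accumulator, so it accepts any iterable:

def headered_groups_iter(lines, header):
    accum = None  # None until the first header is seen
    for line in lines:
        if line == header:
            if accum is not None:
                yield accum  # emit the group that just ended
            accum = []
        elif accum is not None:
            accum.append(line)  # ignore anything before the first header
    if accum is not None:
        yield accum  # emit the final group (possibly empty)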
> for group in headered_groups([line.strip() for line in open('data.txt')],
>                              "Starting a new group"):
>     print(group)

The list comprehension above loads the entire file into memory.
Assuming that .strip() is just being used to remove the newline at the
end, it would be better to use read().splitlines(), since that loads
everything into memory and removes the newlines in one step. To remove
them without reading everything you can use map (or itertools.imap in
Python 2):

# Note: map() returns an iterator; the list-based headered_groups above
# relies on lst.index(), so this call needs an iterator-friendly grouping
# function (or a list(map(...)) wrapper) to work as written.
with open('data.txt') as inputfile:
    for group in headered_groups(map(str.strip, inputfile),
                                 "Starting a new group"):
        print(group)


Oscar
 
Neil Cerutti

>> Hrmmm, hoomm. Nobody cares for slicing any more.
>>
>> def headered_groups(lst, header): [...]

> This requires the whole file to be read into memory. Iterators are
> typically preferred over list slicing for sequential text file
> access [...]

I agree, but this application processes unknown-sized slices; you have
to build lists anyhow. I find slicing much more convenient than
accumulating in this case, but it's possibly a tradeoff.
> with open('data.txt') as inputfile:
>     for group in headered_groups(map(str.strip, inputfile),
>                                  "Starting a new group"):
>         print(group)

Thanks, that's a nice improvement.
 
Chris Angelico

> Iterators are typically preferred over list slicing for sequential
> text file access since you can avoid loading the whole file at once.
> [...]

And, perhaps even more importantly, iterators allow you to pipe text in
and out. Obviously some operations (e.g. grep) lend themselves better to
this than others (e.g. sort), but with this approach it ought at least
to output each group as it comes.
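For illustration, a minimal sketch of such a pipeline, reusing Joshua's
corrected group() generator on stdin (the script name in the comment is
hypothetical):

#!/usr/bin/python3
# Usage: cat my_data | ./groups.py
import sys

def group(lines):
    accum = None
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum is not None:
                yield accum
            accum = []
        else:
            accum.append(line)
    yield accum

for g in group(sys.stdin):
    print(g)  # each group prints as soon as the next header (or EOF) arrives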

ChrisA
 
