converting a sed / grep / awk / . . . bash pipe line into python

hofer · Sep 2, 2008

Hi,

Something I have to do very often is filtering / transforming line-based
file contents and storing the result in an array or a dictionary.

Very often the functionality already exists in the form of a shell script
with sed / awk / grep, ...
and I would like to have the same implementation in my script.

What's a compact, efficient (no intermediate arrays generated /
regexps compiled only once) way in Python
for this kind of 'pipeline'?

Example 1 (in bash), annotated with comments (and thus not working if
copied / pasted):
#-------------------------------------------------------------------------------------------
cat file \                           ### read from file
| sed 's/\.\..*//' \                 ### remove '//' comments
| sed 's/#.*//' \                    ### remove '#' comments
| grep -v '^\s*$' \                  ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \   ### knowing that all remaining lines always contain at least
                                     ### two integers, calculate the sum and 'keep' the second number
| grep '^42 ' \                      ### keep lines for which the sum is 42
| awk '{ print $2 }'                 ### print number

Same example in perl:
# I guess (but didn't try) that the perl example will create more
# intermediate data structures than necessary.
# Ideally the python implementation shouldn't do this, but just
# 'chain' iterators.
#-------------------------------------------------------------------------------------------
my $filename= "file";
open(my $fh,$filename) or die "failed opening file $filename";

# order of 'pipeline' is syntactically reversed (if compared to shell script)
my @numbers =
    map  { $_->[1] }                       # extract num 2
    grep { $_->[0] == 42 }                 # keep lines with result 42
    map  { [ $_->[0]+$_->[1],$_->[1] ] }   # calculate sum of first two nums and keep second num
    map  { [ split(' ',$_,3) ] }           # split by white space
    grep { ! ($_ =~ /^\s*$/) }             # remove empty lines
    map  { $_ =~ s/#.*// ; $_}             # strip '#' comments
    map  { $_ =~ s/\/\/.*// ; $_}          # strip '//' comments
    <$fh>;
print "Numbers are:\n",join("\n",@numbers),"\n";

thanks in advance for any suggestions of how to code this (keeping the
comments)

H

Marc 'BlackJack' Rintsch · Sep 2, 2008

sed 's/\.\..*//' \ ### remove '//' comments | sed 's/#.*//'

Comment does not match the code. Or vice versa.

Untested:

from __future__ import with_statement
from itertools import ifilter, ifilterfalse, imap

def is_junk(line):
    line = line.rstrip()
    return not line or line.startswith('//') or line.startswith('#')

def extract_numbers(line):
    result = map(int, line.split()[:2])
    assert len(result) == 2
    return result

def main():
    with open('test.txt') as lines:
        clean_lines = ifilterfalse(is_junk, lines)
        pairs = imap(extract_numbers, clean_lines)
        # str.join() needs strings, so convert each number first
        print '\n'.join(str(b) for a, b in pairs if a + b == 42)

if __name__ == '__main__':
    main()

Ciao,
Marc 'BlackJack' Rintsch
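
For readers on Python 3, where ifilterfalse and imap no longer exist in itertools, the same "chain the iterators" idea can be sketched with plain generators. This is only a sketch: the function names and the sample data are made up for illustration, and it assumes (as the original pipeline does) that every line left after cleaning starts with at least two integers.

```python
import re

# Compiled once, as the original question asked.
SLASH_COMMENT = re.compile(r"//.*")
HASH_COMMENT = re.compile(r"#.*")


def strip_comments(lines):
    """Remove '//' and '#' comments from each line (the two sed steps)."""
    for line in lines:
        line = SLASH_COMMENT.sub("", line)
        yield HASH_COMMENT.sub("", line)


def non_blank(lines):
    """Drop lines that are empty or whitespace only (the grep -v step)."""
    return (line for line in lines if line.strip())


def first_two_ints(lines):
    """Yield the first two whitespace-separated fields as integers."""
    for line in lines:
        a, b = line.split()[:2]
        yield int(a), int(b)


def second_numbers_summing_to(lines, target=42):
    """Chain the stages lazily; nothing is read until the result is iterated."""
    pairs = first_two_ints(non_blank(strip_comments(lines)))
    return (b for a, b in pairs if a + b == target)


# Illustrative input; any iterable of lines (such as an open file) works.
sample = [
    "1 2 3\n",
    "# a comment line\n",
    "\n",
    "23 19 // trailing comment\n",
    "41 1 97\n",
    "40 3\n",
]
print(list(second_numbers_summing_to(sample)))  # [19, 1]
```

Each stage pulls one line at a time from the stage before it, so no intermediate list is built, and both regexps are compiled exactly once.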

Paul McGuire · Sep 3, 2008

hofer said:
Something I have to do very often is filtering / transforming line
based file contents and storing the result in an array or a
dictionary.

Very often the functionality exists already in form of a shell script
with sed / awk / grep , . . .
and I would like to have the same implementation in my script

All that sed'ing, grep'ing and awk'ing, you might want to take a look
at pyparsing. Here is a pyparsing take on your posted problem:

from pyparsing import LineEnd, Word, nums, LineStart, OneOrMore, restOfLine

test = """

1 2 3
47 23 // this will never match
# blank lines are not of any interest
91 26

23 19

41 1 97 26 // extra numbers don't matter
"""

# define pyparsing expressions to match a line of integers
EOL = LineEnd()
integer = Word(nums)

# by default, pyparsing will implicitly skip over whitespace and
# newlines, so EOL is skipped over by default - this would mix together
# integers on consecutive lines - we only want OneOrMore integers as long
# as they are on the same line, that is, integers with no intervening
# EOL's
line_of_integers = (LineStart() + integer + OneOrMore(~EOL + integer))

# use a parse action to identify the target lines
def select_significant_values(t):
    v1, v2 = map(int, t[:2])
    if v1+v2 == 42:
        print v2
line_of_integers.setParseAction(select_significant_values)

# skip over comments, wherever they are
line_of_integers.ignore( '//' + restOfLine )
line_of_integers.ignore( '#' + restOfLine )

# use the line_of_integers expression to search through the test text
# the parse action will print the matching values
line_of_integers.searchString(test)

-- Paul

Peter Otten · Sep 3, 2008

hofer said:
Something I have to do very often is filtering / transforming line
based file contents and storing the result in an array or a
dictionary.

Very often the functionality exists already in form of a shell script
with sed / awk / grep , . . .
and I would like to have the same implementation in my script

What's a compact, efficient (no intermediate arrays generated /
regexps compiled only once) way in python
for such kind of 'pipe line'

Example 1 (in bash): (annotated with comment (thus not working) if
copied / pasted

cat file \                           ### read from file
| sed 's/\.\..*//' \                 ### remove '//' comments
| sed 's/#.*//' \                    ### remove '#' comments
| grep -v '^\s*$' \                  ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \   ### knowing, that all remaining lines contain always at least
                                     ### two integers calculate sum and 'keep' second number
| grep '^42 ' \                      ### keep lines for which sum is 42
| awk '{ print $2 }'                 ### print number

thanks in advance for any suggestions of how to code this (keeping the
comments)

for line in open("file"):  # read from file
    try:
        a, b = map(int, line.split(None, 2)[:2])  # remove extra columns,
                                                  # convert to integer
    except ValueError:
        pass  # remove comments, get rid of empty lines,
              # skip lines with less than two integers
    else:
        # line did start with two integers
        if a + b == 42:  # keep lines for which the sum is 42
            print b  # print number

The hard part was keeping the comments

Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

Peter
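
The same loop can be wrapped in a generator so that the matches end up in a list rather than on stdout, which is what the original question asked for. A minimal Python 3 sketch: matching_numbers is a hypothetical name, and io.StringIO stands in for the real file.

```python
import io


def matching_numbers(lines, target=42):
    """Yield the second integer of every line whose first two integers sum to target."""
    for line in lines:
        try:
            # Too few fields and non-integer fields both raise ValueError.
            a, b = map(int, line.split(None, 2)[:2])
        except ValueError:
            continue  # comments, blank lines, malformed lines
        if a + b == target:
            yield b


# Stand-in for open("file"); any iterable of lines works.
text = io.StringIO(
    "1 2 3\n"
    "47 23 // never matches\n"
    "# comment\n"
    "\n"
    "23 19\n"
    "41 1 97 26\n"
)
numbers = list(matching_numbers(text))
print(numbers)  # [19, 1]
```

Because the function is a generator, the caller decides whether to build a list, fill a dictionary, or just loop over the results.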

Roy Smith · Sep 3, 2008

Peter Otten said:
Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

I'm philosophically opposed to one-liners like:

a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as well.
So, you immediately get down to:

a, b = map(int, line.split()[:2])

which isn't too bad. I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)
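
As a quick check that the three spellings discussed here behave alike, the following Python 3 sketch runs each one over a few sample lines and compares the outcomes. The function names and sample lines are invented for illustration; a ValueError is reported as None.

```python
def parse_with_maxsplit(line):
    a, b = map(int, line.split(None, 2)[:2])
    return a, b


def parse_with_map(line):
    fields = line.split()[:2]
    a, b = map(int, fields)
    return a, b


def parse_stepwise(line):
    a, b = line.split()[:2]
    a = int(a)
    b = int(b)
    return a, b


PARSERS = (parse_with_maxsplit, parse_with_map, parse_stepwise)


def outcome(parser, line):
    """Return the parsed pair, or None if the parser raised ValueError."""
    try:
        return parser(line)
    except ValueError:
        return None


samples = ["23 19\n", "41 1 97 26\n", "7\n", "\n", "# comment\n", "4 x\n"]
for line in samples:
    results = {outcome(parser, line) for parser in PARSERS}
    # One element in the set means all three variants agreed on this line.
    print(repr(line), results)
```

On these inputs all three agree, including on which lines are rejected, so the choice between them is one of readability rather than behaviour.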

Peter Otten · Sep 3, 2008

Roy said:
Peter Otten said:

Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

I'm philosophically opposed to one-liners

I'm not, as long as you don't /force/ the code into one line.

like:

a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as well.
So, you immediately get down to:

a, b = map(int, line.split()[:2])

I agree that the above is an improvement.

which isn't too bad. I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Peter

bearophileHUGS · Sep 3, 2008

Roy Smith:

No reason to limit how many splits get done if you're
explicitly going to slice the first two.

You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have fewer parts at the
end.

Bye,
bearophile
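
To make the "fewer parts" point concrete, here is a small Python 3 sketch comparing split() with and without a maxsplit argument on a long line. The 1000-field line is an arbitrary test input, and the timings are only indicative, since they depend on the machine and the Python version.

```python
import timeit

# An artificial line with 1000 whitespace-separated fields.
line = " ".join(str(n) for n in range(1000))

all_parts = line.split()         # every field becomes its own string
limited = line.split(None, 2)    # two fields plus one unsplit remainder

print(len(all_parts), len(limited))  # 1000 3

# Both approaches give the same first two fields.
print(all_parts[:2] == limited[:2])  # True

# Rough timing comparison; absolute numbers will vary.
full_time = timeit.timeit(lambda: line.split()[:2], number=2000)
limited_time = timeit.timeit(lambda: line.split(None, 2)[:2], number=2000)
print("split()        : %.4f s" % full_time)
print("split(None, 2) : %.4f s" % limited_time)
```

For two-column input the difference is negligible, which is the case Roy Smith had in mind; it only starts to matter when lines carry many fields.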

Roy Smith · Sep 3, 2008

I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.

Roy Smith · Sep 3, 2008

bearophileHUGS said:

You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have fewer parts at the
end.

Bye,
bearophile

Sounds like premature optimization to me. Make it work and be easy to
understand first. Then worry about how fast it is.

But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster.

Peter Otten · Sep 3, 2008

Roy said:
I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.

As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter
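
The point about losing track of which step failed can be illustrated with a Python 3 sketch that gives each step its own try...except. The function name, the return convention, and the error messages are all invented for this example.

```python
def parse_pair(line):
    """Return ((a, b), None) on success or (None, reason) on failure."""
    fields = line.split()[:2]

    # Step 1: unpacking fails if there are fewer than two fields.
    try:
        a, b = fields
    except ValueError:
        return None, "fewer than two fields"

    # Step 2: conversion fails if a field is not an integer.
    try:
        return (int(a), int(b)), None
    except ValueError:
        return None, "non-integer field"


for line in ["23 19", "7", "4 x", ""]:
    print(repr(line), parse_pair(line))
```

With a single shared except ValueError both failures look the same, which is fine for a throwaway filter but can hide a genuine bug in longer code.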

Roy Smith · Sep 3, 2008

Peter Otten said:
Roy said:

Peter Otten said:

I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.

As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter

Well, what I really wanted was two conceptual steps, to make it easier for
a reader of the code to follow what it's doing. My standard for code being
adequately comprehensible is not that the reader *can* figure it out, but
that the reader doesn't have to exert any effort to figure it out. Or even
be aware that there's any figuring-out going on. He or she just reads it.

bearophileHUGS · Sep 3, 2008

Roy Smith:

But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster.

Given the hypothetical .xsplit() string method I was talking about,
it's then easy to use islice() on it to skip the first items:

islice(sometext.xsplit(), 10, None)

Bye,
bearophile
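
Python's str has no xsplit() method, so the following is purely a sketch of how such a lazy splitter might look as a standalone Python 3 function. The names xsplit and first_fields are hypothetical, and the regexp assumes fields are separated by ordinary whitespace.

```python
import re
from itertools import islice

# Compiled once; matches one run of non-whitespace characters.
_FIELD = re.compile(r"\S+")


def xsplit(text):
    """Lazily yield whitespace-separated fields (hypothetical helper)."""
    for match in _FIELD.finditer(text):
        yield match.group()


def first_fields(text, n):
    """Return the first n fields without splitting the rest of the text."""
    return list(islice(xsplit(text), n))


sometext = " ".join(str(n) for n in range(20))

print(first_fields(sometext, 2))                 # ['0', '1']
print(list(islice(xsplit(sometext), 10, None)))  # fields from index 10 onward
```

first_fields() covers the "take the first N fields and throw away the rest" case raised earlier in the thread, while islice() with a start index skips leading fields.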
