converting a sed / grep / awk / . . . bash pipe line into python

hofer

Hi,

Something I have to do very often is filtering / transforming line
based file contents and storing the result in an array or a
dictionary.

Very often the functionality already exists in the form of a shell script
with sed / awk / grep, ...
and I would like to have the same implementation in my script.

What's a compact, efficient (no intermediate arrays generated /
regexps compiled only once) way in python
for this kind of 'pipeline'?

Example 1 (in bash), annotated with comments (and thus not working if
copied / pasted):
#-------------------------------------------------------------------------------------------
cat file \                         ### read from file
| sed 's/\.\..*//' \               ### remove '//' comments
| sed 's/#.*//' \                  ### remove '#' comments
| grep -v '^\s*$' \                ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \ ### knowing that all remaining lines always contain at least
                                   ### two integers, calculate sum and 'keep' second number
| grep '^42 '                      ### keep lines for which sum is 42
| awk '{ print $2 }'               ### print number

Same example in perl:
# I guess (but didn't try) that the perl example will create more
# intermediate data structures than necessary.
# Ideally the python implementation shouldn't do this, but just
# 'chain' iterators.
#-------------------------------------------------------------------------------------------
my $filename= "file";
open(my $fh,$filename) or die "failed opening file $filename";

# order of 'pipeline' is syntactically reversed (compared to the shell
# script)
my @numbers =
map { $_->[1] } # extract num 2
grep { $_->[0] == 42 } # keep lines with result 42
map { [ $_->[0]+$_->[1],$_->[1] ] } # calculate sum of first two nums and keep second num
map { [ split(' ',$_,3) ] } # split by white space
grep { ! ($_ =~ /^\s*$/) } # remove empty lines
map { $_ =~ s/#.*// ; $_} # strip '#' comments
map { $_ =~ s/\/\/.*// ; $_} # strip '//' comments
<$fh>;
print "Numbers are:\n",join("\n",@numbers),"\n";

thanks in advance for any suggestions of how to code this (keeping the
comments)


H
 
Marc 'BlackJack' Rintsch

hofer said:
| sed 's/\.\..*//' \ ### remove '//' comments
| sed 's/#.*//'

Comment does not match the code. Or vice versa. :)

Untested:

from __future__ import with_statement
from itertools import ifilterfalse, imap


def is_junk(line):
    line = line.rstrip()
    return not line or line.startswith('//') or line.startswith('#')


def extract_numbers(line):
    result = map(int, line.split()[:2])
    assert len(result) == 2
    return result


def main():
    with open('test.txt') as lines:
        clean_lines = ifilterfalse(is_junk, lines)
        pairs = imap(extract_numbers, clean_lines)
        # join() needs strings, so convert each number back with str()
        print '\n'.join(str(b) for a, b in pairs if a + b == 42)


if __name__ == '__main__':
    main()

Ciao,
Marc 'BlackJack' Rintsch
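
For readers on Python 3, where itertools.ifilterfalse and imap no longer exist, the same lazy chaining can be sketched with generator expressions. This is only a sketch under assumptions, not Marc's code: the name numbers_summing_to is made up for illustration, and comments are stripped anywhere on a line (as the shell comments describe) rather than only at the start.

```python
import re

# Compiled once at import time; matches a '//' or '#' comment up to end of line.
COMMENT = re.compile(r"(//|#).*")


def numbers_summing_to(lines, target=42):
    """Return a lazy iterator over the second integer of each matching line."""
    stripped = (COMMENT.sub("", line) for line in lines)      # strip comments
    non_empty = (line for line in stripped if line.strip())   # drop blank lines
    fields = (line.split()[:2] for line in non_empty)         # first two columns
    # As in the original pipeline, every remaining line is assumed to hold
    # at least two integers; anything else raises ValueError here.
    pairs = ((int(a), int(b)) for a, b in fields)
    return (b for a, b in pairs if a + b == target)           # keep matching sums


sample = [
    "40 2 // kept\n",
    "# only a comment\n",
    "\n",
    "1 2 3\n",
    "19 23\n",
]
result = list(numbers_summing_to(sample))
print(result)
```

Each stage is a generator, so no intermediate list is built until list() consumes the chain. Passing an open file object instead of sample works the same way.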
 
Paul McGuire

hofer said:
Something I have to do very often is filtering / transforming line
based file contents and storing the result in an array or a
dictionary.

Very often the functionality already exists in the form of a shell script
with sed / awk / grep, ...
and I would like to have the same implementation in my script

With all that sed'ing, grep'ing and awk'ing, you might want to take a look
at pyparsing. Here is a pyparsing take on your posted problem:

from pyparsing import (LineEnd, Word, nums, LineStart, OneOrMore,
                       restOfLine)

test = """

1 2 3
47 23 // this will never match
# blank lines are not of any interest
91 26

23 19

41 1 97 26 // extra numbers don't matter
"""

# define pyparsing expressions to match a line of integers
EOL = LineEnd()
integer = Word(nums)

# by default, pyparsing will implicitly skip over whitespace and
# newlines, so EOL is skipped over by default - this would mix together
# integers on consecutive lines - we only want OneOrMore integers as long
# as they are on the same line, that is, integers with no intervening
# EOL's
line_of_integers = (LineStart() + integer + OneOrMore(~EOL + integer))

# use a parse action to identify the target lines
def select_significant_values(t):
    v1, v2 = map(int, t[:2])
    if v1+v2 == 42:
        print v2
line_of_integers.setParseAction(select_significant_values)

# skip over comments, wherever they are
line_of_integers.ignore( '//' + restOfLine )
line_of_integers.ignore( '#' + restOfLine )

# use the line_of_integers expression to search through the test text
# the parse action will print the matching values
line_of_integers.searchString(test)


-- Paul
 
Peter Otten

hofer said:
Something I have to do very often is filtering / transforming line
based file contents and storing the result in an array or a
dictionary.

Very often the functionality already exists in the form of a shell script
with sed / awk / grep, ...
and I would like to have the same implementation in my script

What's a compact, efficient (no intermediate arrays generated /
regexps compiled only once) way in python
for such kind of 'pipe line'

Example 1 (in bash), annotated with comments (and thus not working if
copied / pasted):
cat file \                         ### read from file
| sed 's/\.\..*//' \               ### remove '//' comments
| sed 's/#.*//' \                  ### remove '#' comments
| grep -v '^\s*$' \                ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \ ### knowing that all remaining lines always contain at least
                                   ### two integers, calculate sum and 'keep' second number
| grep '^42 '                      ### keep lines for which sum is 42
| awk '{ print $2 }'               ### print number
thanks in advance for any suggestions of how to code this (keeping the
comments)

for line in open("file"):  # read from file
    try:
        # remove extra columns, convert to integer
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        # remove comments, get rid of empty lines,
        # skip lines with less than two integers
        pass
    else:
        # line did start with two integers
        if a + b == 42:  # keep lines for which the sum is 42
            print b  # print number

The hard part was keeping the comments ;)

Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

Peter
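
Since the original question asked for the results to end up in an array, here is a hedged Python 3 sketch of the same loop wrapped in a function that collects the matches. The function name second_numbers and the sample lines are assumptions made for illustration only.

```python
def second_numbers(lines, target=42):
    """Collect b for every line that starts with two integers a, b with a + b == target."""
    found = []
    for line in lines:
        try:
            a, b = map(int, line.split(None, 2)[:2])
        except ValueError:
            # Comment lines, blank lines and lines with fewer than
            # two leading integers all end up here and are skipped.
            continue
        if a + b == target:
            found.append(b)
    return found


sample = ["# comment\n", "\n", "40 2 trailing text\n", "7\n", "19 23\n"]
numbers = second_numbers(sample)
print(numbers)
```

Both failure modes raise ValueError: int() on a non-numeric field, and tuple unpacking when fewer than two fields are present.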
 
Roy Smith

Peter Otten said:
Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

I'm philosophically opposed to one-liners like:
a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as well.
So, you immediately get down to:
a, b = map(int, line.split()[:2])

which isn't too bad. I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)
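
A quick check, written for Python 3, that the simplified split gives the same two fields as the version with a split limit. The sample line is made up for illustration.

```python
line = "40 2 some trailing text\n"

with_limit = line.split(None, 2)[:2]   # at most two splits, then slice
without_limit = line.split()[:2]       # split everything, then slice

print(with_limit, without_limit)

# The longer, step-by-step form of the conversion.
a, b = without_limit
a = int(a)
b = int(b)
```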
 
Peter Otten

Roy said:
Peter Otten said:
Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

I'm philosophically opposed to one-liners

I'm not, as long as you don't /force/ the code into one line.

Roy said:
like:
a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as
well. So, you immediately get down to:
a, b = map(int, line.split()[:2])

I agree that the above is an improvement.

Roy said:
which isn't too bad. I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Peter
 
bearophileHUGS

Roy Smith:
No reason to limit how many splits get done if you're
explicitly going to slice the first two.

You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have fewer parts at the
end.

Bye,
bearophile
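
A small Python 3 illustration of the point about the split limit: with a long line, the limited split creates three strings instead of one per field. The synthetic line is an assumption for the demo, and no timing claim is made here.

```python
# A line with two leading fields followed by 1000 more.
line = "40 2 " + " ".join(str(n) for n in range(1000))

all_parts = line.split()             # one string per field
limited_parts = line.split(None, 2)  # two fields plus one unsplit remainder

print(len(all_parts), len(limited_parts))
```

Slicing either result with [:2] yields the same two fields; only the amount of intermediate data differs.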
 
Roy Smith

I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)

Peter Otten said:
If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.
 
Roy Smith

bearophileHUGS said:
You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have fewer parts at the
end.

Bye,
bearophile

Sounds like premature optimization to me. Make it work and be easy to
understand first. Then worry about how fast it is.

But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster :)
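
str.split() has no such option, but the idea can be sketched as a small helper in Python 3. The name first_fields and its signature are hypothetical, not part of any library.

```python
def first_fields(text, n, sep=None):
    """Return at most the first n fields of text, discarding the rest.

    Hypothetical helper: limits the number of splits to n so the tail is
    never broken up, then slices off the unsplit remainder.
    """
    return text.split(sep, n)[:n]


print(first_fields("40 2 some trailing text", 2))
print(first_fields("a,b,c,d", 3, sep=","))
```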
 
Peter Otten

Roy said:
I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.

As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter
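
What keeping the two steps apart could look like, as a Python 3 sketch: one try block per step, so the caller can tell which one failed. The function name parse_pair and the reason strings are assumptions for illustration.

```python
def parse_pair(line):
    """Return ((a, b), None) on success, or (None, reason) when a step fails."""
    try:
        first, second = line.split()[:2]      # step 1: unpack two fields
    except ValueError:
        return None, "fewer than two fields"
    try:
        return (int(first), int(second)), None  # step 2: convert to integers
    except ValueError:
        return None, "field is not an integer"


for text in ["40 2 extra", "7", "# comment", ""]:
    print(repr(text), parse_pair(text))
```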
 
Roy Smith

Peter Otten said:
Roy said:
Peter Otten said:
I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.

As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter

Well, what I really wanted was two conceptual steps, to make it easier for
a reader of the code to follow what it's doing. My standard for code being
adequately comprehensible is not that the reader *can* figure it out, but
that the reader doesn't have to exert any effort to figure it out. Or even
be aware that there's any figuring-out going on. He or she just reads it.
 
bearophileHUGS

Roy Smith:
But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster :)

Given the hypothetical .xsplit() string method I was talking about,
it's then easy to use islice() on it to skip the first items:

islice(sometext.xsplit(), 10, None)

Bye,
bearophile
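
Python strings have no xsplit() method, so the following Python 3 sketch stands in for it with a generator function built on re.finditer. The function xsplit here is hypothetical and only handles whitespace separation.

```python
import re
from itertools import islice

# Compiled once; a field is any run of non-whitespace characters.
_FIELD = re.compile(r"\S+")


def xsplit(text):
    """Yield whitespace-separated fields one at a time (hypothetical lazy split)."""
    for match in _FIELD.finditer(text):
        yield match.group()


sometext = " ".join(str(n) for n in range(15))

tail = list(islice(xsplit(sometext), 10, None))  # skip the first ten fields
head = list(islice(xsplit(sometext), 2))         # take only the first two

print(tail, head)
```

Taking only the first two fields stops scanning after the second match, which is the lazy counterpart of slicing a full split() result.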
 