converting a sed / grep / awk / . . . bash pipe line into python

Discussion in 'Python' started by hofer, Sep 2, 2008.

  1. hofer

    hofer Guest

    Hi,

    Something I have to do very often is filtering / transforming line
    based file contents and storing the result in an array or a
    dictionary.

    Very often the functionality already exists in the form of a shell script
    using sed / awk / grep, ...
    and I would like to have the same implementation in my script.

    What's a compact, efficient (no intermediate arrays generated /
    regexps compiled only once) way in python
    to write this kind of 'pipeline'?

    Example 1 (in bash), annotated with comments (and therefore not working
    if copied / pasted):
    #-------------------------------------------------------------------------------------------
    cat file \ ### read from file
    | sed 's/\.\..*//' \ ### remove '//' comments
    | sed 's/#.*//' \ ### remove '#' comments
    | grep -v '^\s*$' \ ### get rid of empty lines
    | awk '{ print $1 + $2 " " $2 }' \ ### knowing that all remaining lines always
                                     \ ### contain at least two integers, calculate
                                     \ ### the sum and 'keep' the second number
    | grep '^42 ' ### keep lines for which sum is 42
    | awk '{ print $2 }' ### print number

    Same example in perl:
    # I guess (but didn't try) that the perl example will create more
    # intermediate data structures than necessary.
    # Ideally the python implementation shouldn't do this, but just
    # 'chain' iterators.
    #-------------------------------------------------------------------------------------------
    my $filename= "file";
    open(my $fh,$filename) or die "failed opening file $filename";

    # order of 'pipeline' is syntactically reversed (compared to the shell script)
    my @numbers =
        map { $_->[1] }                      # extract num 2
        grep { $_->[0] == 42 }               # keep lines with result 42
        map { [ $_->[0]+$_->[1],$_->[1] ] }  # calculate sum of first two nums and keep second num
        map { [ split(' ',$_,3) ] }          # split by white space
        grep { ! ($_ =~ /^\s*$/) }           # remove empty lines
        map { $_ =~ s/#.*// ; $_}            # strip '#' comments
        map { $_ =~ s/\/\/.*// ; $_}         # strip '//' comments
        <$fh>;
    print "Numbers are:\n",join("\n",@numbers),"\n";

    Thanks in advance for any suggestions on how to code this (keeping the
    comments).


    H
     
    hofer, Sep 2, 2008
    #1
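One way to get the 'chained iterators' asked for above is a set of small generator functions, one per shell stage. The following is only a minimal sketch in Python 3 (the thread itself predates it); all function and variable names are made up for illustration, and the '//' pattern follows the comments in the bash example rather than its `\.\.` regex. The regular expressions are compiled once at module level and no intermediate list is built until the final result is collected.

```python
import re

# Compiled once, at import time (hypothetical names, not from the thread).
SLASH_COMMENT = re.compile(r'//.*')
HASH_COMMENT = re.compile(r'#.*')


def strip_comments(lines):
    """Remove '//' and '#' comments from each line."""
    for line in lines:
        line = SLASH_COMMENT.sub('', line)
        yield HASH_COMMENT.sub('', line)


def non_blank(lines):
    """Drop lines that are empty or contain only whitespace."""
    return (line for line in lines if line.strip())


def sum_and_second(lines):
    """Yield (first + second, second) for the first two integers on a line."""
    for line in lines:
        first, second = line.split()[:2]
        yield int(first) + int(second), int(second)


def numbers_summing_to(lines, target=42):
    """Chain the stages lazily; nothing is read until the result is iterated."""
    pairs = sum_and_second(non_blank(strip_comments(lines)))
    return (second for total, second in pairs if total == target)


# Sample input standing in for the lines of a file.
sample = [
    "1 2 3\n",
    "47 23 // this will never match\n",
    "# blank lines are not of any interest\n",
    "\n",
    "23 19\n",
    "41 1 97 26 // extra numbers don't matter\n",
]
numbers = list(numbers_summing_to(sample))
print(numbers)
```

Because an open file is itself an iterator over lines, the same function should work on a file object passed in place of `sample`. Like the awk stage, this sketch assumes every remaining line starts with two integers.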

  2. Re: converting a sed / grep / awk / . . . bash pipe line into python

    On Tue, 02 Sep 2008 10:36:50 -0700, hofer wrote:

    > sed 's/\.\..*//' \ ### remove '//' comments | sed 's/#.*//'


    Comment does not match the code. Or vice versa. :)

    Untested:

    from __future__ import with_statement
    from itertools import ifilterfalse, imap


    def is_junk(line):
        # blank lines and lines that start with a comment marker
        line = line.strip()
        return not line or line.startswith('//') or line.startswith('#')


    def extract_numbers(line):
        # first two whitespace-separated fields, converted to integers
        result = map(int, line.split()[:2])
        assert len(result) == 2
        return result


    def main():
        with open('test.txt') as lines:
            clean_lines = ifilterfalse(is_junk, lines)
            pairs = imap(extract_numbers, clean_lines)
            # join() needs strings, so convert each number back with str()
            print '\n'.join(str(b) for a, b in pairs if a + b == 42)


    if __name__ == '__main__':
        main()

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Sep 2, 2008
    #2

  3. hofer

    Paul McGuire Guest

    On Sep 2, 12:36 pm, hofer <> wrote:
    > Hi,
    >
    > Something I have to do very often is filtering / transforming line
    > based file contents and storing the result in an array or a
    > dictionary.
    >
    > Very often the functionallity exists already in form of a shell script
    > with sed / awk / grep , . . .
    > and I would like to have the same implementation in my script
    >


    With all that sed'ing, grep'ing and awk'ing, you might want to take a
    look at pyparsing. Here is a pyparsing take on your posted problem:

    from pyparsing import (LineEnd, Word, nums, LineStart, OneOrMore,
                           restOfLine)

    test = """

    1 2 3
    47 23 // this will never match
    # blank lines are not of any interest
    91 26

    23 19

    41 1 97 26 // extra numbers don't matter
    """

    # define pyparsing expressions to match a line of integers
    EOL = LineEnd()
    integer = Word(nums)

    # by default, pyparsing will implicitly skip over whitespace and
    # newlines, so EOL is skipped over by default - this would mix
    # together integers on consecutive lines - we only want OneOrMore
    # integers as long as they are on the same line, that is, integers
    # with no intervening EOL's
    line_of_integers = (LineStart() + integer + OneOrMore(~EOL + integer))

    # use a parse action to identify the target lines
    def select_significant_values(t):
        v1, v2 = map(int, t[:2])
        if v1+v2 == 42:
            print v2
    line_of_integers.setParseAction(select_significant_values)

    # skip over comments, wherever they are
    line_of_integers.ignore( '//' + restOfLine )
    line_of_integers.ignore( '#' + restOfLine )

    # use the line_of_integers expression to search through the test text
    # the parse action will print the matching values
    line_of_integers.searchString(test)


    -- Paul
     
    Paul McGuire, Sep 3, 2008
    #3
  4. hofer

    Peter Otten Guest

    hofer wrote:

    > Something I have to do very often is filtering / transforming line
    > based file contents and storing the result in an array or a
    > dictionary.
    >
    > Very often the functionallity exists already in form of a shell script
    > with sed / awk / grep , . . .
    > and I would like to have the same implementation in my script
    >
    > What's a compact, efficient (no intermediate arrays generated /
    > regexps compiled only once) way in python
    > for such kind of 'pipe line'
    >
    > Example 1 (in bash): (annotated with comment (thus not working) if
    > copied / pasted


    > cat file \ ### read from file
    > | sed 's/\.\..*//' \ ### remove '//' comments
    > | sed 's/#.*//' \ ### remove '#' comments
    > | grep -v '^\s*$' \ ### get rid of empty lines
    > | awk '{ print $1 + $2 " " $2 }' \ ### knowing, that all remaining
    > lines contain always at least
    > \ ### two integers calculate
    > sum and 'keep' second number
    > | grep '^42 ' ### keep lines for which sum is 42
    > | awk '{ print $2 }' ### print number
    > thanks in advance for any suggestions of how to code this (keeping the
    > comments)


    for line in open("file"): # read from file
        try:
            a, b = map(int, line.split(None, 2)[:2]) # remove extra columns,
                                                     # convert to integer
        except ValueError:
            pass # remove comments, get rid of empty lines,
                 # skip lines with less than two integers
        else:
            # line did start with two integers
            if a + b == 42: # keep lines for which the sum is 42
                print b # print number

    The hard part was keeping the comments ;)

    Without them it looks better:

    import sys
    for line in sys.stdin:
        try:
            a, b = map(int, line.split(None, 2)[:2])
        except ValueError:
            pass
        else:
            if a + b == 42:
                print b

    Peter
     
    Peter Otten, Sep 3, 2008
    #4
  5. hofer

    Roy Smith Guest

    In article <g9ldi5$2ea$03$-online.com>,
    Peter Otten <> wrote:

    > Without them it looks better:
    >
    > import sys
    > for line in sys.stdin:
    >     try:
    >         a, b = map(int, line.split(None, 2)[:2])
    >     except ValueError:
    >         pass
    >     else:
    >         if a + b == 42:
    >             print b


    I'm philosophically opposed to one-liners like:

    > a, b = map(int, line.split(None, 2)[:2])


    because they're difficult to understand at a glance. You need to visually
    parse it and work your way out from the inside to figure out what's going
    on. Better to keep it longer and simpler.

    Now that I've got my head around it, I realized there's no reason to make
    the split part so complicated. No reason to limit how many splits get done
    if you're explicitly going to slice the first two. And since you don't
    need to supply the second argument, the first one can be defaulted as well.
    So, you immediately get down to:

    > a, b = map(int, line.split()[:2])


    which isn't too bad. I might take it one step further, however, and do:

    > fields = line.split()[:2]
    > a, b = map(int, fields)


    in fact, I might even get rid of the very generic, but conceptually
    overkill, use of map() and just write:

    > a, b = line.split()[:2]
    > a = int(a)
    > b = int(b)
     
    Roy Smith, Sep 3, 2008
    #5
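The simplification discussed above rests on the two spellings returning the same first two fields. A small Python 3 check, using a sample line borrowed from the pyparsing test data earlier in the thread (the variable names are illustrative only):

```python
line = "41 1 97 26 // extra numbers don't matter\n"

limited = line.split(None, 2)   # at most two splits; the rest stays in one piece
unlimited = line.split()        # every whitespace-separated field

# Both spellings agree on the first two fields.
assert limited[:2] == unlimited[:2] == ['41', '1']

# The three-line version: slice first, then convert each field explicitly.
a, b = unlimited[:2]
a = int(a)
b = int(b)
print(a + b)
```

The only difference is how much of the rest of the line gets split, which matters for speed and memory but not for the result.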
  6. hofer

    Peter Otten Guest

    Roy Smith wrote:

    > In article <g9ldi5$2ea$03$-online.com>,
    > Peter Otten <> wrote:
    >
    >> Without them it looks better:
    >>
    >> import sys
    >> for line in sys.stdin:
    >>     try:
    >>         a, b = map(int, line.split(None, 2)[:2])
    >>     except ValueError:
    >>         pass
    >>     else:
    >>         if a + b == 42:
    >>             print b

    >
    > I'm philosophically opposed to one-liners


    I'm not, as long as you don't /force/ the code into one line.

    > like:
    >
    >> a, b = map(int, line.split(None, 2)[:2])

    >
    > because they're difficult to understand at a glance. You need to visually
    > parse it and work your way out from the inside to figure out what's going
    > on. Better to keep it longer and simpler.
    >
    > Now that I've got my head around it, I realized there's no reason to make
    > the split part so complicated. No reason to limit how many splits get
    > done if you're explicitly going to slice the first two. And since you
    > don't need to supply the second argument, the first one can be defaulted
    > as well. So, you immediately get down to:
    >
    >> a, b = map(int, line.split()[:2])


    I agree that the above is an improvement.

    > which isn't too bad. I might take it one step further, however, and do:
    >
    >> fields = line.split()[:2]
    >> a, b = map(int, fields)

    >
    > in fact, I might even get rid of the very generic, but conceptually
    > overkill, use of map() and just write:
    >
    >> a, b = line.split()[:2]
    >> a = int(a)
    >> b = int(b)


    If you go that route your next step is to introduce another try...except,
    one for the unpacking and another for the integer conversion...

    Peter
     
    Peter Otten, Sep 3, 2008
    #6
  7. hofer

    bearophile Guest

    Roy Smith:
    > No reason to limit how many splits get done if you're
    > explicitly going to slice the first two.


    You are probably right for this problem, because most lines are 2
    items long, but in scripts that have to process lines potentially
    composed of many parts, setting a maximum number of splits speeds up
    your script and reduces the memory used, because you end up with
    fewer parts.

    Bye,
    bearophile
     
    bearophile, Sep 3, 2008
    #7
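To illustrate the point about long lines, here is a small Python 3 sketch with a made-up line of 1000 fields. It only counts the pieces each call creates; it does not measure time or memory, so any actual speed-up is left as an assumption to be checked with a profiler.

```python
# A hypothetical long line: 1000 whitespace-separated integers.
line = " ".join(str(n) for n in range(1000))

all_parts = line.split()            # one string object per field
capped_parts = line.split(None, 2)  # two fields plus the unsplit remainder

print(len(all_parts), len(capped_parts))

# The first two fields are the same either way.
assert all_parts[:2] == capped_parts[:2]
```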
  8. hofer

    Roy Smith Guest

    In article <g9lvc5$8qq$03$-online.com>,
    Peter Otten <> wrote:

    > > I might take it one step further, however, and do:
    > >
    > >> fields = line.split()[:2]
    > >> a, b = map(int, fields)

    > >
    > > in fact, I might even get rid of the very generic, but conceptually
    > > overkill, use of map() and just write:
    > >
    > >> a, b = line.split()[:2]
    > >> a = int(a)
    > >> b = int(b)

    >
    > If you go that route your next step is to introduce another try...except,
    > one for the unpacking and another for the integer conversion...


    Why another try/except? The potential unpack and conversion errors exist
    in both versions, and the existing try block catches them all. Splitting
    the one line up into three with some intermediate variables doesn't change
    that.
     
    Roy Smith, Sep 3, 2008
    #8
  9. hofer

    Roy Smith Guest

    In article
    <>,
    wrote:

    > Roy Smith:
    > > No reason to limit how many splits get done if you're
    > > explicitly going to slice the first two.

    >
    > You are probably right for this problem, because most lines are 2
    > items long, but in scripts that have to process lines potentially
    > composed of many parts, setting a max number of parts speeds up your
    > script and reduces memory used, because you have less parts at the
    > end.
    >
    > Bye,
    > bearophile


    Sounds like premature optimization to me. Make it work and be easy to
    understand first. Then worry about how fast it is.

    But, along those lines, I've often thought that split() needed a way to not
    just limit the number of splits, but to also throw away the extra stuff.
    Getting the first N fields of a string is something I've done often enough
    that refactoring the slicing operation right into the split() code seems
    worthwhile. And, it would be even faster :)
     
    Roy Smith, Sep 3, 2008
    #9
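`str.split()` has no option to discard the remainder, but a small wrapper gets close to what is described above. This is a sketch in Python 3; `first_fields` is a hypothetical helper name, not an existing method.

```python
def first_fields(text, count, sep=None):
    """Return at most `count` leading fields of `text`, discarding the rest.

    Limiting the number of splits to `count` avoids splitting the tail of
    the line, and the slice then throws the unsplit remainder away.
    """
    return text.split(sep, count)[:count]


line = "41 1 97 26 // extra numbers don't matter\n"
a, b = map(int, first_fields(line, 2))
print(a, b)
```

If the line has fewer than `count` fields the result is simply shorter, so unpacking into `a, b` still raises `ValueError` for short lines, as in the earlier examples.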
  10. hofer

    Peter Otten Guest

    Roy Smith wrote:

    > In article <g9lvc5$8qq$03$-online.com>,
    > Peter Otten <> wrote:
    >
    >> > I might take it one step further, however, and do:
    >> >
    >> >> fields = line.split()[:2]
    >> >> a, b = map(int, fields)
    >> >
    >> > in fact, I might even get rid of the very generic, but conceptually
    >> > overkill, use of map() and just write:
    >> >
    >> >> a, b = line.split()[:2]
    >> >> a = int(a)
    >> >> b = int(b)

    >>
    >> If you go that route your next step is to introduce another try...except,
    >> one for the unpacking and another for the integer conversion...

    >
    > Why another try/except? The potential unpack and conversion errors exist
    > in both versions, and the existing try block catches them all. Splitting
    > the one line up into three with some intermediate variables doesn't change
    > that.


    As I understood it you didn't just split a line of code into three, but
    wanted two processing steps. These logical steps are then somewhat remixed
    by the shared error handling. You lose the information which step failed.
    In the general case you may even mask a bug.

    Peter
     
    Peter Otten, Sep 3, 2008
    #10
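A sketch of what the two separate error-handling steps could look like, in Python 3. The function name and the reason strings are invented for illustration; the point is only that each `try` block covers a single step, so the caller can tell which one failed.

```python
def parse_pair(line):
    """Return ((a, b), None) on success or (None, reason) on failure."""
    try:
        # Step 1: unpacking fails if the line has fewer than two fields.
        first, second = line.split()[:2]
    except ValueError:
        return None, "fewer than two fields"
    try:
        # Step 2: conversion fails if a field is not an integer.
        return (int(first), int(second)), None
    except ValueError:
        return None, "field is not an integer"


for sample in ["23 19\n", "\n", "# a comment line\n", "7\n"]:
    print(repr(sample), parse_pair(sample))
```

With a single shared `except ValueError`, the second and third outcomes would be indistinguishable.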
  11. hofer

    Roy Smith Guest

    In article <g9m6at$a71$01$-online.com>,
    Peter Otten <> wrote:

    > Roy Smith wrote:
    >
    > > In article <g9lvc5$8qq$03$-online.com>,
    > > Peter Otten <> wrote:
    > >
    > >> > I might take it one step further, however, and do:
    > >> >
    > >> >> fields = line.split()[:2]
    > >> >> a, b = map(int, fields)
    > >> >
    > >> > in fact, I might even get rid of the very generic, but conceptually
    > >> > overkill, use of map() and just write:
    > >> >
    > >> >> a, b = line.split()[:2]
    > >> >> a = int(a)
    > >> >> b = int(b)
    > >>
    > >> If you go that route your next step is to introduce another try...except,
    > >> one for the unpacking and another for the integer conversion...

    > >
    > > Why another try/except? The potential unpack and conversion errors exist
    > > in both versions, and the existing try block catches them all. Splitting
    > > the one line up into three with some intermediate variables doesn't change
    > > that.

    >
    > As I understood it you didn't just split a line of code into three, but
    > wanted two processing steps. These logical steps are then somewhat remixed
    > by the shared error handling. You lose the information which step failed.
    > In the general case you may even mask a bug.
    >
    > Peter


    Well, what I really wanted was two conceptual steps, to make it easier for
    a reader of the code to follow what it's doing. My standard for code being
    adequately comprehensible is not that the reader *can* figure it out, but
    that the reader doesn't have to exert any effort to figure it out. Or even
    be aware that there's any figuring-out going on. He or she just reads it.
     
    Roy Smith, Sep 3, 2008
    #11
  12. hofer

    bearophile Guest

    Roy Smith:
    > But, along those lines, I've often thought that split() needed a way to not
    > just limit the number of splits, but to also throw away the extra stuff.
    > Getting the first N fields of a string is something I've done often enough
    > that refactoring the slicing operation right into the split() code seems
    > worthwhile. And, it would be even faster :)


    Given the hypothetical .xsplit() string method I was talking about,
    it's then easy to use islice() on it to skip the first items:

    islice(sometext.xsplit(), 10, None)

    Bye,
    bearophile
     
    bearophile, Sep 3, 2008
    #12
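Python strings have no `.xsplit()` method, so the following Python 3 sketch uses a stand-in generator function of the same name, built on `re.finditer`. It is an assumption about what such a lazy split might look like, not an existing API.

```python
import re
from itertools import islice

_FIELD = re.compile(r'\S+')  # one run of non-whitespace characters


def xsplit(text):
    """Lazily yield whitespace-separated fields, one at a time."""
    for match in _FIELD.finditer(text):
        yield match.group()


sometext = "a b c d e f"

# Take only the first two fields; the rest of the text is never scanned.
first_two = list(islice(xsplit(sometext), 2))

# Skip the first two fields and keep the remainder.
after_two = list(islice(xsplit(sometext), 2, None))

print(first_two, after_two)
```

`islice(..., n)` covers the "first N fields" case, while `islice(..., n, None)` skips the first items as in the post above.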
