Record separator

Discussion in 'Python' started by greymaus, Aug 26, 2011.

  1. greymaus

    greymaus Guest

    Is there an equivalent for the AWK RS in Python?


    as in RS='\n\n'
    will separate a file at two blank line intervals


    --
    maus
    greymaus, Aug 26, 2011
    #1

  2. On 26 Aug 2011 18:39:07 GMT
    greymaus <> wrote:
    >
    > Is there an equivalent for the AWK RS in Python?
    >
    >
    > as in RS='\n\n'
    > will separate a file at two blank line intervals


    open("file.txt").read().split("\n\n")

    --
    D'Arcy J.M. Cain <> | Democracy is three wolves
    http://www.druid.net/darcy/ | and a sheep voting on
    +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
    D'Arcy J.M. Cain, Aug 26, 2011
    #2
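For readers who want to try the one-liner above, here is a minimal sketch. The helper name `read_records` and the sample file contents are made up for illustration; the only technique used is the `read().split("\n\n")` call from the post, wrapped in a `with` block so the file is closed promptly.

```python
import os
import tempfile


def read_records(path, sep="\n\n"):
    """Return the file's contents split on sep (the whole file is read into memory)."""
    with open(path) as f:  # the with-block closes the file when done
        return f.read().split(sep)


# Demonstration on a small throwaway file (the contents are invented).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("name: a\nage: 1\n\nname: b\nage: 2\n")
    path = tmp.name

records = read_records(path)
os.remove(path)
print(records)  # ['name: a\nage: 1', 'name: b\nage: 2\n']
```

Note that the final record keeps the file's trailing newline; call `.strip()` on each record if that matters.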

  3. greymaus

    greymaus Guest

    On 2011-08-26, D'Arcy J.M. Cain <> wrote:
    > On 26 Aug 2011 18:39:07 GMT
    > greymaus <> wrote:
    >>
    >> Is there an equivalent for the AWK RS in Python?
    >>
    >>
    >> as in RS='\n\n'
    >> will separate a file at two blank line intervals

    >
    > open("file.txt").read().split("\n\n")
    >



    Ta!.. bit awkard. :))))))


    --
    maus
    greymaus, Aug 27, 2011
    #3
  4. greymaus wrote:

    > On 2011-08-26, D'Arcy J.M. Cain <> wrote:
    >> On 26 Aug 2011 18:39:07 GMT
    >> greymaus <> wrote:
    >>>
    >>> Is there an equivalent for the AWK RS in Python?
    >>>
    >>>
    >>> as in RS='\n\n'
    >>> will separate a file at two blank line intervals

    >>
    >> open("file.txt").read().split("\n\n")
    >>

    >
    >
    > Ta!.. bit awkard. :))))))


    Er, is that meant to be a pun? "Awk[w]ard", as in awk-ward?

    In any case, no, the Python line might be a handful of characters longer
    than the AWK equivalent, but it isn't awkward. It is logical and easy to
    understand. It's embarrassingly easy to describe what it does:

    (open("file.txt")    # opens the file
        .read()          # reads the contents of the file
        .split("\n\n"))  # splits the text on double-newlines

    The only tricky part is knowing that \n means newline, but anyone familiar
    with C, Perl, AWK etc. should know that.

    The Python code might be "long" (but only by the standards of AWK, which can
    be painfully concise), but it is simple, obvious and readable. A few extra
    characters is the price you pay for making your language readable. At the
    cost of a few extra key presses, you get something that you will be able to
    understand in 10 years' time.

    AWK is a specialist text processing language. Python is a general scripting
    and programming language. They have different values: AWK values short,
    concise code; Python is willing to pay a little more in source code.


    --
    Steven
    Steven D'Aprano, Aug 27, 2011
    #4
  5. Roy Smith

    Roy Smith Guest

    In article <4e592852$0$29965$c3e8da3$>,
    Steven D'Aprano <> wrote:

    > open("file.txt") # opens the file
    > .read() # reads the contents of the file
    > .split("\n\n") # splits the text on double-newlines.


    The biggest problem with this code is that read() slurps the entire file
    into a string. That's fine for moderately sized files, but will fail
    (or at least be grossly inefficient) for very large files.

    It's always annoyed me a little that while it's easy to iterate over the
    lines of a file, it's more complicated to iterate over a file character
    by character. You could write your own generator to do that:

    for c in getchar(open("file.txt")):
        whatever

    def getchar(f):
        for line in f:
            for c in line:
                yield c

    but that's annoyingly verbose (and probably not hugely efficient).

    Of course, the next problem for the specific problem at hand is that
    even with an iterator over the characters of a file, split() only works
    on strings. It would be nice to have a version of split which took an
    iterable and returned an iterator over the split components. Maybe
    there is such a thing and I'm just missing it?
    Roy Smith, Aug 27, 2011
    #5
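As a rough sketch of the "split over an iterable" idea raised above: the function name `isplit` is hypothetical (nothing by that name is claimed to exist in the standard library), and `itertools.chain.from_iterable` is used here as a shorter stand-in for the `getchar` generator. It follows the same left-to-right matching as `str.split` with an explicit separator, but only ever holds one record in memory.

```python
import io
from itertools import chain


def isplit(chars, sep):
    """Lazily split an iterable of characters on the non-empty string sep."""
    if not sep:
        raise ValueError("empty separator")
    sep_chars = list(sep)
    n = len(sep_chars)
    buf = []
    for c in chars:
        buf.append(c)
        if buf[-n:] == sep_chars:      # the buffer now ends with the separator
            yield "".join(buf[:-n])    # emit everything before it
            buf = []
    yield "".join(buf)                 # whatever follows the last separator


# io.StringIO stands in for an open text file; the sample text is invented.
f = io.StringIO("a\nb\n\nc\n\nd")
chars = chain.from_iterable(f)         # flattens lines into single characters
pieces = list(isplit(chars, "\n\n"))
print(pieces)  # ['a\nb', 'c', 'd']
```

This is character-at-a-time, so it is unlikely to be fast; it only shows that the interface is easy to write.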
  6. ChasBrown

    ChasBrown Guest

    On Aug 27, 10:45 am, Roy Smith <> wrote:
    > In article <4e592852$0$29965$c3e8da3$>,
    >  Steven D'Aprano <> wrote:
    >
    > > open("file.txt")   # opens the file
    > >  .read()           # reads the contents of the file
    > >  .split("\n\n")    # splits the text on double-newlines.

    >
    > The biggest problem with this code is that read() slurps the entire file
    > into a string.  That's fine for moderately sized files, but will fail
    > (or at least be grossly inefficient) for very large files.
    >
    > It's always annoyed me a little that while it's easy to iterate over the
    > lines of a file, it's more complicated to iterate over a file character
    > by character.  You could write your own generator to do that:
    >
    > for c in getchar(open("file.txt")):
    >    whatever
    >
    > def getchar(f):
    >    for line in f:
    >       for c in line:
    >          yield c
    >
    > but that's annoyingly verbose (and probably not hugely efficient).


    read() takes an optional size parameter; so f.read(1) is another
    option...
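A minimal sketch of that option, assuming a text-mode file: the two-argument form of the built-in `iter()` keeps calling `f.read(1)` until it returns the sentinel `""`, which is what `read` returns at end of file. The name `chars` is made up, and `io.StringIO` stands in for a real file.

```python
import io
from functools import partial


def chars(f):
    """Iterate over a text file one character at a time via f.read(1)."""
    return iter(partial(f.read, 1), "")  # stops when read(1) returns ""


f = io.StringIO("ab\nc")
result = list(chars(f))
print(result)  # ['a', 'b', '\n', 'c']
```

For a file opened in binary mode the sentinel would need to be `b""` instead.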

    >
    > Of course, the next problem for the specific problem at hand is that
    > even with an iterator over the characters of a file, split() only works
    > on strings.  It would be nice to have a version of split which took an
    > iterable and returned an iterator over the split components.  Maybe
    > there is such a thing and I'm just missing it?


    I don't know if there is such a thing; but for the OP's problem you
    could read the file in chunks, e.g.:

    def readgroup(f, delim, buffsize=8192):
        tail = ''
        while True:
            s = f.read(buffsize)
            if not s:
                yield tail
                break
            groups = (tail + s).split(delim)
            tail = groups[-1]
            for group in groups[:-1]:
                yield group

    for group in readgroup(open('file.txt'), '\n\n'):
        pass  # do something

    Cheers - Chas
    ChasBrown, Aug 27, 2011
    #6
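To see why `readgroup` carries `tail` over between reads, it can be run with a deliberately tiny buffer so that the delimiter straddles two chunks. The function below is copied from the post with indentation restored; the sample text, the `buffsize=2` setting, and the use of `io.StringIO` in place of a real file are assumptions made for the demonstration.

```python
import io


def readgroup(f, delim, buffsize=8192):
    """Yield delim-separated groups from f, reading buffsize characters at a time."""
    tail = ""
    while True:
        s = f.read(buffsize)
        if not s:
            yield tail                    # end of file: emit what is left
            break
        groups = (tail + s).split(delim)
        tail = groups[-1]                 # may be an incomplete group; keep it
        for group in groups[:-1]:
            yield group


text = "a\n\nb\n\nc"
# With buffsize=2 the reads are "a\n", "\nb", "\n\n", "c", so the first
# delimiter is split across two reads.
chunked = list(readgroup(io.StringIO(text), "\n\n", buffsize=2))
print(chunked)                        # ['a', 'b', 'c']
print(chunked == text.split("\n\n"))  # True
```

One consequence of the final `yield tail` is that an empty file produces a single empty string, which matches what `"".split("\n\n")` returns.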
  7. Terry Reedy

    Terry Reedy Guest

    On 8/27/2011 1:45 PM, Roy Smith wrote:
    > In article<4e592852$0$29965$c3e8da3$>,
    > Steven D'Aprano<> wrote:
    >
    >> open("file.txt") # opens the file
    >> .read() # reads the contents of the file
    >> .split("\n\n") # splits the text on double-newlines.

    >
    > The biggest problem with this code is that read() slurps the entire file
    > into a string. That's fine for moderately sized files, but will fail
    > (or at least be grossly inefficient) for very large files.


    I read the above as separating the file into paragraphs, as indicated by
    blank lines.

    def paragraphs(file):
        para = []
        for line in file:
            if line:
                para.append(line)
            else:
                yield para # or ''.join(para), as desired
                para = []

    --
    Terry Jan Reedy
    Terry Reedy, Aug 27, 2011
    #7
  8. On Sun, Aug 28, 2011 at 6:03 AM, Terry Reedy <> wrote:
    >      yield para # or ''.join(para), as desired
    >


    Or possibly '\n'.join(para) if you want to keep the line breaks inside
    paragraphs.

    ChrisA
    Chris Angelico, Aug 27, 2011
    #8
  9. Roy Smith

    Roy Smith Guest

    In article <>,
    Terry Reedy <> wrote:

    > On 8/27/2011 1:45 PM, Roy Smith wrote:
    > > In article<4e592852$0$29965$c3e8da3$>,
    > > Steven D'Aprano<> wrote:
    > >
    > >> open("file.txt") # opens the file
    > >> .read() # reads the contents of the file
    > >> .split("\n\n") # splits the text on double-newlines.

    > >
    > > The biggest problem with this code is that read() slurps the entire file
    > > into a string. That's fine for moderately sized files, but will fail
    > > (or at least be grossly inefficient) for very large files.

    >
    > I read the above as separating the file into paragraphs, as indicated by
    > blank lines.
    >
    > def paragraphs(file):
    >     para = []
    >     for line in file:
    >         if line:
    >             para.append(line)
    >         else:
    >             yield para # or ''.join(para), as desired
    >             para = []


    Plus or minus the last paragraph in the file :)
    Roy Smith, Aug 27, 2011
    #9
  10. Terry Reedy

    Terry Reedy Guest

    On 8/27/2011 5:07 PM, Roy Smith wrote:
    > In article<>,
    > Terry Reedy<> wrote:
    >
    >> On 8/27/2011 1:45 PM, Roy Smith wrote:
    >>> In article<4e592852$0$29965$c3e8da3$>,
    >>> Steven D'Aprano<> wrote:
    >>>
    >>>> open("file.txt") # opens the file
    >>>> .read() # reads the contents of the file
    >>>> .split("\n\n") # splits the text on double-newlines.
    >>>
    >>> The biggest problem with this code is that read() slurps the entire file
    >>> into a string. That's fine for moderately sized files, but will fail
    >>> (or at least be grossly inefficient) for very large files.

    >>
    >> I read the above as separating the file into paragraphs, as indicated by
    >> blank lines.
    >>
    >> def paragraphs(file):
    >>     para = []
    >>     for line in file:
    >>         if line:
    >>             para.append(line)
    >>         else:
    >>             yield para # or ''.join(para), as desired
    >>             para = []

    >
    > Plus or minus the last paragraph in the file :)


    Oh right, I forgot the last line, which is a repeat of the yield after
    the for loop finishes.

    --
    Terry Jan Reedy
    Terry Reedy, Aug 28, 2011
    #10
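Putting the pieces of this sub-thread together, a line-based paragraph generator might look like the sketch below. It adds the final yield mentioned above. Two further changes are assumptions of this sketch and not something stated in the posts: the blank-line test uses `line.strip()`, because lines read from a file keep their trailing `"\n"` and are therefore never empty strings, and runs of several blank lines are treated as a single separator.

```python
import io


def paragraphs(lines):
    """Yield blank-line-separated paragraphs from an iterable of lines."""
    para = []
    for line in lines:
        if line.strip():           # non-blank line: part of the current paragraph
            para.append(line)
        elif para:                 # blank line ends a paragraph, if one is open
            yield "".join(para)
            para = []
    if para:                       # the last paragraph has no blank line after it
        yield "".join(para)


# io.StringIO stands in for an open text file; the sample text is invented.
sample = io.StringIO("one\ntwo\n\nthree\n\n\nfour\n")
result = list(paragraphs(sample))
print(result)  # ['one\ntwo\n', 'three\n', 'four\n']
```

Because each line still carries its newline, `"".join(para)` already preserves the line breaks inside a paragraph; `"\n".join(...)` would be the right choice only if the lines were stripped first.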
  11. greymaus

    greymaus Guest

    On 2011-08-27, Steven D'Aprano <> wrote:
    > greymaus wrote:
    >
    >> On 2011-08-26, D'Arcy J.M. Cain <> wrote:
    >>> On 26 Aug 2011 18:39:07 GMT
    >>> greymaus <> wrote:
    >>>>
    >>>> Is there an equivalent for the AWK RS in Python?
    >>>>
    >>>>
    >>>> as in RS='\n\n'
    >>>> will separate a file at two blank line intervals
    >>>
    >>> open("file.txt").read().split("\n\n")
    >>>

    >>
    >>
    >> Ta!.. bit awkard. :))))))

    >
    > Er, is that meant to be a pun? "Awk[w]ard", as in awk-ward?


    Yup, misspelled it and realized the error :)
    >
    > In any case, no, the Python line might be a handful of characters longer
    > than the AWK equivalent, but it isn't awkward. It is logical and easy to
    > understand. It's embarrassingly easy to describe what it does:
    >
    > open("file.txt") # opens the file
    > .read() # reads the contents of the file
    > .split("\n\n") # splits the text on double-newlines.
    >
    > The only tricky part is knowing that \n means newline, but anyone familiar
    > with C, Perl, AWK etc. should know that.
    >
    > The Python code might be "long" (but only by the standards of AWK, which can
    > be painfully concise), but it is simple, obvious and readable. A few extra
    > characters is the price you pay for making your language readable. At the
    > cost of a few extra key presses, you get something that you will be able to
    > understand in 10 years time.
    >
    > AWK is a specialist text processing language. Python is a general scripting
    > and programming language. They have different values: AWK values short,
    > concise code, Python is willing to pay a little more in source code.
    >
    >


    RS, and its Perl equivalent, which I forget, mean that you can read in
    full multiline records.

    (I am coming into Python via Perl from AWK, and trying to get a grip
    on the language and its idioms)

    Thanks to all

    Oh, AWK is far more than a text processing language. It may be old (like me!)
    but it is useful (ditto).



    --
    maus
    greymaus, Aug 28, 2011
    #11