groveling over a file for Q:: and A:: stmts

Discussion in 'Python' started by paul618, Jul 24, 2012.

  1. paul618

    paul618 Guest

    #!/usr/bin/env python
    # grep_for_QA.py I am only looking to isolate uniq Q:: and A:: stmts from my daily files
    #
    # note: This algorithm will fail if there are any blank lines within the Q and A area of interest (a paragraph)

    # D. Beazley is my fav documentation

    import re, glob
    import pprint as pp

    sampledata = '''
    A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
    Q:: About 1 in 518400, Professor.
    A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>


    '''

    pattern0 = re.compile("Q::")
    pattern1 = re.compile("A::") # objects of interest can start with A:: ;; not always Q::
    END_OF_PARAGRAPH_pat = "\n\s*\n"

    path = "/Users/paultaney/dailies2012/0722" # an example of real data set.

    toggle = False
    L = []
    M = []

    #file = open(path, "r")
    try:
        #for line in file.readlines():
        for line in sampledata:
            try:
                # Later, I also need to treat Unicode -- and I am clueless.

                # falsestarts::
                #line.encode("utf8").decode('xxx', 'ignore')
                #line.encode("utf8", 'ignore')
                #line.decode('8859')
                #line.decode('8859') # 8859, Latin-1 doesn't cover my CJK pastings AT ALL
                #line.decode('GB18030') # 171006 -- ack
                #encoded_line = line # xxx line.encode("utf8")

                mo0 = re.search(pattern0, line)
                mo1 = re.search(pattern1, line)
                mo2 = re.search(END_OF_PARAGRAPH_pat, line)

                if mo0:
                    if 1: print ("I see pattern 0")
                    toggle = True
                    if 1: print(line)
                    M.append(mo0.group())

                if mo1:
                    if 1: print ("I see pattern 1")
                    toggle = True
                    M.append(mo1.group())

                if mo2 and toggle:
                    if 1: print ("I see pattern 2 AND toggle is set")
                    # got one. save it for uniqifying, and empty the container
                    toggle = False
                    L.append(M)
                    M = []

            except Exception as e:
                print("--- " + e + " ---")

    except UnicodeDecodeError:
        #encoded_line = encoded_line.urlsafe_b64encode(re.replace("asdf", encoded_line))
        #line = re.sub(".+", "--- asdf ---", line)
        pass

    L.sort
    print (L)

    # and what"s wrong with some of this, here!
    #myHash = set(L) # uniqify
    #pp.pprint(myHash) # july 23, 131001 hike!
    paul618, Jul 24, 2012
    #1

  2. On Tue, 24 Jul 2012 00:50:22 -0700, paul618 wrote:

    > #!/usr/bin/env python
    > # grep_for_QA.py I am only looking to isolate uniq Q:: and A:: stmts from my daily files
    > #
    > # note: This algorithm will fail if there are any blank lines within the Q and A area of interest (a paragraph)
    >
    > # D. Beazley is my fav documentation



    If you are going to ask a question, please ask a question. Don't just
    dump a whole pile of code in our laps and expect us to work out what your
    question is.

    It may help if you read this page:

    http://sscce.org/

    Some further comments below:

    > import re, glob
    > import pprint as pp
    >
    > sampledata = '''
    > A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
    > Q:: About 1 in 518400, Professor.
    > A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>
    >
    >
    > '''
    >
    > pattern0 = re.compile("Q::")


    There is no point in using a regular expression for something as trivial
    as that. That is like swinging a 20 kg sledge-hammer to crack a peanut.

    Just use a string method:

    if my_string.startswith("Q::"): ...
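
    As a minimal sketch of that suggestion: str.startswith also accepts a tuple of prefixes, so both markers can be tested in one call. The sample lines below are hypothetical; only the "Q::" and "A::" prefixes come from the original post.

```python
# Hypothetical sample lines; only the "Q::" / "A::" prefixes come from the thread.
lines = [
    "A:: What are the chances, DTMB?",
    "Q:: About 1 in 518400, Professor.",
    "Some unrelated note that mentions Q:: in passing.",
]

# str.startswith accepts a tuple, so both prefixes are checked in one call.
PREFIXES = ("Q::", "A::")

# Keep only the lines that begin with one of the prefixes.
tagged = [line for line in lines if line.startswith(PREFIXES)]
print(tagged)
```

    Unlike re.search, this ignores a "Q::" that appears in the middle of a line, which is probably what is wanted here.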


    [...]
    > # Later, I also need to treat Unicode -- and I am clueless.


    If you have a question about Unicode, you should ask it.

    If you have not already read this page, you should read it now:

    http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
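
    One common approach, sketched here under the assumption that the daily files are stored as UTF-8 (the thread does not say), is to decode once when the file is opened rather than line by line. io.open takes encoding and errors arguments; the temporary file and its contents are made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical file contents: UTF-8 text with CJK plus one deliberately invalid byte.
raw = u"Q:: 你好吗?\n".encode("utf-8") + b"A:: bad byte \xff here\n"

# Write the bytes to a temporary file standing in for a real daily file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(raw)

# Decode at the file boundary. errors="replace" substitutes U+FFFD for
# bytes that are not valid UTF-8 instead of raising UnicodeDecodeError.
with io.open(path, encoding="utf-8", errors="replace") as handle:
    lines = handle.read().splitlines()

os.remove(path)
print(lines)
```

    If the files turn out to be in another encoding, only the encoding argument should need to change.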



    > except Exception as e:
    > print("--- " + e + " ---")


    Please don't throw away useful debugging information.

    You should learn to read exception tracebacks, not hide them. They
    contain a lot of very useful information to help you debug your code.
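
    A small sketch of the difference, using a made-up failing function named risky. Note that "--- " + e + " ---" itself raises TypeError, because a str cannot be concatenated with an exception object, so the quoted handler hides the original error behind a new one. traceback.format_exc() keeps the full report.

```python
import traceback

def risky():
    # Hypothetical failure standing in for whatever goes wrong in the loop body.
    return int("not a number")

# The handler from the post: concatenating str and an exception is itself an error.
try:
    try:
        risky()
    except Exception as e:
        print("--- " + e + " ---")
except TypeError as secondary:
    masked = secondary  # the TypeError raised inside the handler

# Keeping the traceback instead: format_exc() returns the full text as a string.
try:
    risky()
except Exception:
    report = traceback.format_exc()
    print(report)
```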

    > except UnicodeDecodeError:
    > #encoded_line = encoded_line.urlsafe_b64encode(re.replace("asdf", encoded_line))
    > #line = re.sub(".+", "--- asdf ---", line)
    > pass


    This will never be caught because any UnicodeDecodeError will already be
    caught by the "except Exception" line above.
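
    To illustrate, assuming both clauses belong to the same try statement: UnicodeDecodeError is a subclass of Exception, and Python uses the first except clause that matches, so the more specific clause has to come first. The two function names are hypothetical.

```python
def general_first():
    try:
        b"\xff".decode("utf-8")  # raises UnicodeDecodeError
    except Exception:
        return "Exception"
    except UnicodeDecodeError:
        return "UnicodeDecodeError"  # never reached: the clause above matches first

def specific_first():
    try:
        b"\xff".decode("utf-8")  # raises UnicodeDecodeError
    except UnicodeDecodeError:
        return "UnicodeDecodeError"
    except Exception:
        return "Exception"

print(general_first())
print(specific_first())
```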


    > L.sort
    > print (L)
    >
    > # and what"s wrong with some of this, here!
    > #myHash = set(L) # uniqify
    > #pp.pprint(myHash) # july 23, 131001 hike!


    I don't know what's wrong with it. What do you expect it to do, and what
    does it actually do instead?
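
    Whatever the intended behaviour, two things in the quoted lines can be checked in isolation: L.sort without parentheses never calls the method, and set(L) raises TypeError when L holds lists, since lists are unhashable. A sketch with hypothetical data shaped like L in the original post:

```python
# Hypothetical collected data shaped like L in the original post: a list of lists.
L = [["Q::", "A::"], ["A::"], ["Q::", "A::"]]

L.sort  # only looks up the bound method; nothing is called, nothing is sorted
unsorted_copy = list(L)

L.sort()  # the parentheses perform the in-place sort

try:
    set(L)  # lists are mutable and therefore unhashable
except TypeError as err:
    set_error = str(err)

# Converting each inner list to a tuple makes the items hashable.
unique = sorted(set(tuple(item) for item in L))
print(unique)
```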



    --
    Steven
    Steven D'Aprano, Jul 24, 2012
    #2

  3. paul618

    paul618 Guest

    Hi Steve:


    Thank you for your quick response.

    Ah, indeed I failed to ask my question:: Why doesn't this code print the sampledata? Instead, it prints the empty list.

    The answer is probably quite simple, as I really am an idiot.


    Thanks again,
    paul
    paul618, Jul 24, 2012
    #3
  4. MRAB

    MRAB Guest

    On 24/07/2012 08:50, paul618 wrote:
    > #!/usr/bin/env python
    > # grep_for_QA.py I am only looking to isolate uniq Q:: and A:: stmts from my daily files
    > #
    > # note: This algorithm will fail if there are any blank lines within the Q and A area of interest (a paragraph)
    >
    > # D. Beazley is my fav documentation
    >
    > import re, glob
    > import pprint as pp
    >
    > sampledata = '''
    > A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
    > Q:: About 1 in 518400, Professor.
    > A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>
    >
    >
    > '''
    >
    > pattern0 = re.compile("Q::")
    > pattern1 = re.compile("A::") # objects of interest can start with A:: ;; not alway Q::
    > END_OF_PARAGRAPH_pat = "\n\s*\n"
    >
    > path = "/Users/paultaney/dailies2012/0722" # an example of real data set.
    >
    > toggle = False
    > L = []
    > M = []
    >
    > #file = open(path, "r")
    > try:
    > #for line in file.readlines():
    > for line in sampledata:


    sampledata is a string, therefore this is iterating over the string,
    which yields characters, not lines. Try using sampledata.splitlines():

    for line in sampledata.splitlines():
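
    A quick way to see the difference, using a shortened stand-in for sampledata:

```python
# Shortened, hypothetical stand-in for the sampledata string in the post.
sampledata = "A:: first line\nQ:: second line\n"

# Iterating a str yields one-character strings.
chars = [item for item in sampledata]

# splitlines() yields whole lines with the newline characters removed.
lines = [item for item in sampledata.splitlines()]

print(chars[:3])
print(lines)
```

    Because each item in the original loop is a single character, "Q::" and "A::" can never match, which is why L stays empty.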

    > try:
    > # Later, I also need to treat Unicode -- and I am clueless.
    >
    > # falsestarts::
    > #line.encode("utf8").decode('xxx', 'ignore')
    > #line.encode("utf8", 'ignore')
    > #line.decode('8859')
    > #line.decode('8859') # 8859, Latin-1 doesn't cover my CJK pastings AT ALL
    > #line.decode('GB18030') # 171006 -- ack
    > #encoded_line = line # xxx line.encode("utf8")
    >
    > mo0 = re.search(pattern0, line)


    This searches for pattern0 anywhere in the line. You really want to
    check whether the line starts with pattern0, which is better done with:

    line.startswith("Q::")

    > mo1 = re.search(pattern1, line)
    > mo2 = re.search(END_OF_PARAGRAPH_pat, line)
    >
    > if mo0:
    > if 1: print ("I see pattern 0")
    > toggle = True
    > if 1: print(line)
    > M.append(mo0.group())
    >
    > if mo1:
    > if 1: print ("I see pattern 1")
    > toggle = True
    > M.append(mo1.group())
    >
    > if mo2 and toggle:
    > if 1: print ("I see pattern 2 AND toggle is set")
    > # got one. save it for uniqifying, and empty the container
    > toggle = False
    > L.append(M)
    > M = []
    >
    > except Exception as e:
    > print("--- " + e + " ---")
    >
    > except UnicodeDecodeError:
    > #encoded_line = encoded_line.urlsafe_b64encode(re.replace("asdf", encoded_line))
    > #line = re.sub(".+", "--- asdf ---", line)
    > pass
    >
    > L.sort
    > print (L)
    >
    > # and what"s wrong with some of this, here!
    > #myHash = set(L) # uniqify
    > #pp.pprint(myHash) # july 23, 131001 hike!
    >
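
    One further point: once the loop runs over sampledata.splitlines(), no single line contains a newline, so END_OF_PARAGRAPH_pat ("\n\s*\n") cannot match inside a line. A possible restructuring, offered only as a sketch, is to split the text into paragraphs first and then pick out the tagged lines. The function name qa_paragraphs, the repeated paragraph, and the untagged note in the sample are assumptions added for illustration.

```python
import re

# The sample from the post, plus a hypothetical repeat and an untagged note.
sampledata = '''
A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
Q:: About 1 in 518400, Professor.
A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>


A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
Q:: About 1 in 518400, Professor.
A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>

Just a note with no tagged lines.
'''

def qa_paragraphs(text):
    """Return each paragraph's Q::/A:: lines as a tuple, skipping paragraphs with none."""
    result = []
    # A paragraph break is a newline, optional whitespace, then another newline.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        tagged = tuple(
            line.strip()
            for line in paragraph.splitlines()
            if line.strip().startswith(("Q::", "A::"))
        )
        if tagged:
            result.append(tagged)
    return result

found = qa_paragraphs(sampledata)
unique = sorted(set(found))  # tuples are hashable, so set() can remove duplicates
print(unique)
```

    Storing each paragraph as a tuple rather than a list is what allows set() to do the uniqifying.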
    MRAB, Jul 24, 2012
    #4
