String Manipulation Help!

Discussion in 'Python' started by Dave, Jan 28, 2006.

  1. Dave

    Dave Guest

    OK, I'm stumped.

    I'm trying to find newline characters (\n, specifically) that are NOT
    in comments.

    So, for example (where "<-" = a newline character):
    ==========================================
    1: <-
    2: /*<-
    3: ----------------------<-
    4: comment<-
    5: ----------------------<-
    6: */<-
    7: <-
    8: CODE CODE CODE<-
    9: <-
    ==========================================

    I want to return the newline characters at lines 1, 6, 7, 8, and 9 but
    NOT the others.

    I've tried using regular expressions but I dislike them because they
    aren't immediately readable (and also I don't bloody understand the
    things). I'm not opposed to using them, though, if they provide a
    solution to this problem!

    Thanks in advance for any suggestions anyone can provide.

    - Dave
     
    Dave, Jan 28, 2006
    #1

  2. Dave wrote:
    > OK, I'm stumped.
    >
    > I'm trying to find newline characters (\n, specifically) that are NOT
    > in comments.
    >
    > So, for example (where "<-" = a newline character):
    > ==========================================
    > 1: <-
    > 2: /*<-
    > 3: ----------------------<-
    > 4: comment<-
    > 5: ----------------------<-
    > 6: */<-
    > 7: <-
    > 8: CODE CODE CODE<-
    > 9: <-
    > ==========================================


    [snip]

    Well, I'm sure there is some regex that'll do it, but here's a stupid
    iterative solution:

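    def newlines(s):
        """Return the indices of newlines in s that lie outside /* */ comments."""
        nl = []
        inComment = False
        for i in range(len(s)):
            # Toggle the comment state on the two-character delimiters.
            if s[i:i+2] == '/*':
                inComment = True
            if s[i:i+2] == '*/':
                inComment = False
            if inComment:
                continue
            # Test the current character, not the whole string.
            if s[i] == '\n':
                nl.append(i)
        return tuple(nl)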

    Your example returns:
    (0, 64, 65, 80, 81)

    This probably isn't as fast as a regex, but at least it works.
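
    For comparison, here is a rough regex-based sketch of the same idea. It is
    only an illustration: the name newlines_re and the shortened sample string
    are made up for this example, and it assumes comments do not nest and that
    "/*" never occurs inside a string literal. The pattern matches either a
    whole comment or a single newline, so newlines inside a comment are
    swallowed by the first alternative and never reported.

```python
import re

# Either a complete C-style comment (non-greedy, spanning lines) or one newline.
# Comments are tried first, so their inner newlines are consumed with them.
_TOKEN = re.compile(r'/\*.*?\*/|\n', re.DOTALL)


def newlines_re(s):
    """Return the indices of newlines in s that are outside /* */ comments."""
    return tuple(m.start() for m in _TOKEN.finditer(s) if m.group() == '\n')


# Shortened stand-in for the example in the question (hypothetical input).
sample = "\n/*\nc\n*/\n\nCODE\n\n"
print(newlines_re(sample))  # (0, 8, 9, 14, 15)
```

    Note that an unterminated "/*" is not treated as a comment by this pattern,
    so it is worth checking that case separately if it can occur in your input.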

    -Kirk McDonald
     
    Kirk McDonald, Jan 28, 2006
    #2

  3. Dave

    Dave Guest

    This is great, thanks!
     
    Dave, Jan 28, 2006
    #3
  4. Paul McGuire

    Paul McGuire Guest

    "Dave" <> wrote in message
    news:...
    > OK, I'm stumped.
    >
    > I'm trying to find newline characters (\n, specifically) that are NOT
    > in comments.
    >
    > So, for example (where "<-" = a newline character):
    > ==========================================
    > 1: <-
    > 2: /*<-
    > 3: ----------------------<-
    > 4: comment<-
    > 5: ----------------------<-
    > 6: */<-
    > 7: <-
    > 8: CODE CODE CODE<-
    > 9: <-
    > ==========================================
    >
    > I want to return the newline characters at lines 1, 6, 7, 8, and 9 but
    > NOT the others.
    >


    Dave -

    Pyparsing has built-in support for detecting line breaks and comments, and
    the syntax is pretty simple, I think. Here's a pyparsing program that gives
    your desired results:

    ===============================
    from pyparsing import lineEnd, cStyleComment, lineno

    testsource = """
    /*
    ----------------------
    comment
    ----------------------
    */

    CODE CODE CODE

    """

    # define the expression you want to search for
    eol = lineEnd

    # specify that you don't want to match within C-style comments
    eol.ignore(cStyleComment.leaveWhitespace())

    # loop through all the occurrences returned by scanString
    # and print the line number of that location within the original string
    for toks,startloc,endloc in eol.scanString(testsource):
        print(lineno(startloc, testsource))
    ===============================

    The expression you are searching for is pretty basic, just a plain
    end-of-line, or pyparsing's built-in expression, lineEnd. The curve you are
    throwing is that you *don't* want eol's inside of C-style comments.
    Pyparsing allows you to designate an "ignore" expression to skip undesirable
    content, and fortunately, ignoring comments happens so often during parsing
    that pyparsing includes common comment expressions for C, C++, Java, Python,
    and HTML. Next, pyparsing's analogue of re.finditer is scanString. scanString
    returns a generator that gives the matching tokens, start location, and end
    location of every occurrence of the given parse expression, in your case,
    eol. Finally, in the body of our for loop, we use pyparsing's lineno
    function to give us the line number of a string location within the original
    string.
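
    The last step, turning a string location into a line number, comes down to
    counting the newlines that precede that location. A plain-Python sketch of
    the idea (the helper name line_number and the shortened sample string are
    illustrative, not part of pyparsing) that also works on raw character
    offsets from any other approach:

```python
def line_number(loc, text):
    """Return the 1-based line number of character offset loc within text."""
    # Every newline before loc pushes the offset one line further down.
    return text.count('\n', 0, loc) + 1


# Shortened stand-in for the example in the question (hypothetical input).
sample = "\n/*\nc\n*/\n\nCODE\n\n"

# Offsets of the newlines that sit outside the comment in sample.
offsets = (0, 8, 9, 14, 15)

lines = [line_number(i, sample) for i in offsets]
print(lines)  # [1, 4, 5, 6, 7]
```

    A newline is attributed to the line it terminates, which is the same
    convention the question uses when it asks for lines 1, 6, 7, 8, and 9.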

    About the only real wart on all this is that pyparsing implicitly skips over
    leading whitespace, even when looking for expressions to be ignored. In
    order not to lose eols that are just before a comment (like your line 1), we
    have to modify cStyleComment to leave leading whitespace.

    Download pyparsing at http://pyparsing.sourceforge.net.

    -- Paul
     
    Paul McGuire, Jan 28, 2006
    #4
  5. I really enjoyed your article. I will try to understand this.
    Will you be doing more of this in the future with more complicated examples?

     
    Richard Schneiderman, Jan 28, 2006
    #5
  6. Paul McGuire

    Paul McGuire Guest

    "Richard Schneiderman" <> wrote in message
    news:ViSCf.561664$...
    > I really enjoyed your article. I will try to understand this.
    > Will you be doing more of this in the future with more complicated
    > examples?
    >

    I'm giving two presentations at PyCon at the end of February, so I think
    those will be published after the conference.

    Otherwise, I'll be answering pyparsing questions as they come up on c.l.py
    or on the pyparsing forums on SourceForge. I'd like to compile these into
    more of a book form at some point, but my work schedule is pretty crazy
    right now.

    Glad you liked the article,

    -- Paul
     
    Paul McGuire, Jan 28, 2006
    #6