(mostly-)POSIX regular expressions

Discussion in 'Python' started by Sébastien Boisgérault, May 27, 2006.

  1. Hi,

    I'm searching for a POSIX 1003.2 compatible regular expression engine.
    The Python binding "pregex" by Neal Becker may do the job, but I did
    not manage to download it as the original link
    ftp://ftp.ctd.comsat.com/pub/
    seems dead.

    Does any old-timer (<wink>) have a copy of this package ?

    Cheers,

    SB
     
    Sébastien Boisgérault, May 27, 2006
    #1

  2. Paddy Guest

    Maybe this: http://www.pcre.org/pcre.txt and ctypes might work for you?
    (I was surprised to find out that PCRE supports POSIX, but I don't know
    which version it supports or how well.)

    - Pad
     
    Paddy, May 28, 2006
    #2

  3. Very good hint ! I wouldn't have found it alone ...

    I have to study the doc, but the "THE DFA MATCHING ALGORITHM" section may
    do what I need. Obviously, I didn't expect the Perl-Compatible Regular
    Expressions library to implement
    "an alternative algorithm, provided by the pcre_dfa_exec() function,
    that operates in a different way, and is not Perl-compatible".

    Maybe the lib should be renamed PCREWSO, for:
    Perl-compatible regular expressions ... well, sort of :)

    Cheers,

    SB
     
    Sébastien Boisgérault, May 28, 2006
    #3
  4. Paddy wrote:

    > maybe this: http://www.pcre.org/pcre.txt and ctypes might work for you?


    Well, in the end it doesn't fit. What I need is a "longest match" policy
    in patterns like "(a)|(b)|(c)" and NOT a "left-to-right" policy.
    Additionally, I need to be able to obtain the matched ("captured")
    substring, and PCRE does not allow this in DFA mode.
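
To illustrate the two policies, here is a minimal sketch using Python's
standard re module. The patterns and the input string are made up for the
example. Alternatives are tried left to right, and the first one that lets
the match succeed wins, even when a later alternative would match more text.

```python
import re

text = "abc"

# Python's re tries alternatives left to right: "a" succeeds first,
# so the longer alternatives are never considered.
first = re.match(r"(a)|(ab)|(abc)", text)
print(first.group(0))      # a

# Reordering the alternatives by hand changes the result. This manual
# ordering is what a leftmost-longest engine would make unnecessary.
reordered = re.match(r"(abc)|(ab)|(a)", text)
print(reordered.group(0))  # abc
```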

    Too bad ...

    SB
     
    Sébastien Boisgérault, May 28, 2006
    #4
  5. John Machin Guest

    On 29/05/2006 7:46 AM, Sébastien Boisgérault wrote:
    > Paddy wrote:
    >
    >> maybe this: http://www.pcre.org/pcre.txt and ctypes might work for you?

    >
    > Well, in the end it doesn't fit. What I need is a "longest match" policy
    > in patterns like "(a)|(b)|(c)" and NOT a "left-to-right" policy.
    > Additionally, I need to be able to obtain the matched ("captured")
    > substring, and PCRE does not allow this in DFA mode.
    >


    Perhaps you might like to be somewhat more precise with your
    requirements. "POSIX-compliant" made me think of yuckies like [:fubar:]
    in character classes :)

    The operands of | are such that the length is not fixed and so you can't
    write them in descending length order? Care to tell us some more detail
    about those operands?

    If those operands are simple strings (LOGICAL|LOGIC|LOG) and you've got
    more than a handful of them, try Danny Yoo's ahocorasick module.
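
For the simple-string case, the "descending length order" idea can be
sketched with the standard re module alone. This is not the ahocorasick
module, and the helper name longest_first_alternation is made up for the
example. It only applies when every operand is a fixed literal.

```python
import re

def longest_first_alternation(words):
    """Build a pattern that tries longer literals before shorter ones."""
    ordered = sorted(set(words), key=len, reverse=True)
    # re.escape keeps characters such as "+" or "." literal.
    return re.compile("|".join(re.escape(w) for w in ordered))

pattern = longest_first_alternation(["LOG", "LOGIC", "LOGICAL"])
print(pattern.match("LOGICAL").group(0))  # LOGICAL
print(pattern.match("LOGIC!").group(0))   # LOGIC
print(pattern.match("LOGS").group(0))     # LOG
```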

    HTH,
    John
     
    John Machin, May 28, 2006
    #5
  6. John Machin wrote:
    > On 29/05/2006 7:46 AM, Sébastien Boisgérault wrote:
    > > Paddy wrote:
    > >
    > >> maybe this: http://www.pcre.org/pcre.txt and ctypes might work for you?

    > >
    > > Well, in the end it doesn't fit. What I need is a "longest match" policy
    > > in patterns like "(a)|(b)|(c)" and NOT a "left-to-right" policy.
    > > Additionally, I need to be able to obtain the matched ("captured")
    > > substring, and PCRE does not allow this in DFA mode.
    > >

    >
    > Perhaps you might like to be somewhat more precise with your
    > requirements.


    Sure. More on this below.

    > "POSIX-compliant" made me think of yuckies like [:fubar:]
    > in character classes :)


    Yep. I do not need POSIX *syntax* for regular expressions but POSIX
    *semantics*, at least the "leftmost-longest" part (in contrast to the
    "first then longest" used in Python, Perl, .NET, etc.)

    > The operands of | are such that the length is not fixed and so you can't
    > write them in descending length order? Care to tell us some more detail
    > about those operands?


    Basically, I'd like to use John Aycock's (excellent) Python module
    SPARK to build an (extended) C lexer. To do so, I need
    to specify the patterns that match my tokens as well as a priority
    between them. SPARK then builds one big alternation of patterns
    that begins with the high-priority patterns and ends with the
    low-priority patterns, and runs a match.

    The problem is that you have to be very careful and specify the
    priorities explicitly to get the desired results: "<=" must rank higher
    than "<", decimal stuff higher than integers, etc., when most of the time
    what you really want is to match the longest pattern ...

    Worse, the priority work-around does not work well when you
    compare keywords and (other) identifiers. To match "fortune"
    as an identifier, you would need to give identifiers a higher
    priority than keywords, and that is a problem: "for" would then be
    matched as an identifier when it is a keyword.

    I can come up with possible work-arounds for the "id vs
    keyword" issue, but nothing that really makes me happy ...
    Therefore, I was studying the possible replacement of the
    Python native regular expression engine with a "POSIX
    semantics" regular expression engine that would give the
    longest match and save me a lot of extra work ...
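
One possible work-around that stays within the standard re module is a
"maximal munch" loop: try every token pattern at the current position and
keep the longest match, using priority only to break ties. The sketch
below is not SPARK's API. The token table, the rule names and the
tokenize() helper are all hypothetical. Note that inside a single rule,
alternation is still left to right.

```python
import re

# Hypothetical token table, highest priority first. Priority is only
# used to break ties between matches of equal length.
TOKEN_SPECS = [
    ("KEYWORD", r"for|while|if|else"),
    ("IDENT",   r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INT",     r"[0-9]+"),
    ("FLOAT",   r"[0-9]+\.[0-9]*"),
    ("LT",      r"<"),
    ("LE",      r"<="),
    ("SPACE",   r"\s+"),
]
COMPILED = [(kind, re.compile(rx)) for kind, rx in TOKEN_SPECS]

def tokenize(text):
    """Yield (kind, lexeme) pairs, always taking the longest match."""
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, rx in COMPILED:
            m = rx.match(text, pos)
            # Strict ">" keeps the earlier (higher-priority) rule on ties.
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if best_kind != "SPACE":
            yield best_kind, text[pos:best_end]
        pos = best_end

tokens = list(tokenize("for fortune <= 3.5 < 42"))
print(tokens)
```

With this table, "for" is reported as KEYWORD (a tie with IDENT, broken
by priority) and "fortune" as IDENT (the longer match), while "<=" and
"3.5" win over "<" and "3" even though LT and INT are listed first. The
cost is one match attempt per rule at every position.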

    I hope it's clearer now :)

    Any advice ?

    Cheers

    SB
     
    Sébastien Boisgérault, May 29, 2006
    #6
  7. Anthony Guest

    os.times() problem

    I have a problem with the os.times() function: on different Python
    versions and machines, I get different output:

    Server1# python
    Python 2.3.4 (#1, Feb 2 2005, 11:44:13)
    [GCC 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import time
    >>> import os
    >>>
    >>> print os.times()[4]

    4880406.62


    ----------------------------------
    Server2% python
    Python 2.3.2 (#4, Sep 14 2004, 09:41:45) [C] on sunos5
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import time
    >>> import os
    >>>
    >>> print os.times()[4]

    -21464227.74


    ---------------
    Server3% python
    Python 2.4.1 (#1, May 16 2005, 15:19:29)
    [GCC 4.0.0 20050512 (Red Hat 4.0.0-5)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import time
    >>> import os
    >>>
    >>> print os.times()[4]

    18390711.21



    And on all 3 servers, the shell command "date" returns the same
    value.

    Any suggestions?
     
    Anthony, May 29, 2006
    #7
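
A possible explanation, offered as an assumption and not as a diagnosis
of these three machines: the Python documentation describes the fifth
field of os.times() as elapsed real time since a fixed point in the past.
That reference point is platform-dependent and unrelated to the calendar
date, so absolute values need not agree between hosts or with "date".
Only the difference between two readings on the same host is meaningful,
as in this minimal sketch:

```python
import os
import time

# Only differences between two os.times()[4] readings are meaningful.
start = os.times()[4]
time.sleep(0.2)
elapsed = os.times()[4] - start
print("elapsed (os.times): %.2f s" % elapsed)

# For calendar time comparable to the "date" command, use time.time().
print("seconds since the epoch: %.0f" % time.time())
```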
