Re: ignore case only for a part of the regex?

Discussion in 'Python' started by Roy Smith, Dec 30, 2012.

  1. Roy Smith

    Roy Smith Guest

    Helmut Jarausch <> wrote:

    > is there a means to specify that 'ignore-case' should only apply to a part
    > of a regex?


    Not that I'm aware of.

    > the regex should match Msg-id:, Msg-Id, ... but not msg-id: and so on.


    What's the use-case for this?

    The way I would typically do something like this is build my regexes in
    all lower case and .lower() the text I was matching against them. I'm
    curious what you're doing where you want to enforce case sensitivity in
    one part of a header, but not in another.
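A minimal sketch of the lowercase-everything approach described above, using only the standard `re` module. The function name `matches_by_lowercasing` and the sample header lines are made up for illustration.

```python
import re

# Pattern written entirely in lower case, as the post suggests.
HEADER = re.compile(r"^msg-id:")


def matches_by_lowercasing(line):
    """Return True if the line starts with 'msg-id:' in any mix of case."""
    return HEADER.match(line.lower()) is not None


print(matches_by_lowercasing("Msg-Id: <abc@example.org>"))
print(matches_by_lowercasing("msg-id: <abc@example.org>"))
```

Both calls succeed, because lowercasing the input discards all case information. That is why this approach cannot, on its own, accept "Msg-id:" while rejecting "msg-id:".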
     
    Roy Smith, Dec 30, 2012
    #1

  2. On Sun, Dec 30, 2012 at 10:20 AM, Roy Smith <> wrote:

    > Helmut Jarausch <> wrote:
    >
    > > is there a means to specify that 'ignore-case' should only apply to a
    > > part of a regex?


    Python has excellent string methods. There seems to be a split between
    people who always reach for a regex first when parsing strings and those
    who do not. If you go with your regex, I think you can comment what you
    have and move on. I glaze over looking at regexes. That's just me. The
    code to first search for "Msg-", then check what follows would take a
    couple of lines, but might be easier to understand later. I've been
    writing Python for a couple of years, and although I feel comfortable
    with it, there is much more for me to learn. One thing I have learned
    over many years of programming is that figuring out what a piece of code
    is trying to accomplish takes more time than writing it originally.

    Do you really want to match "Msg-iD" (lower case i)? Or are you only
    allowing "ID" or "Id"?
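The string-method idea above (check for "Msg-" first, then look at what follows) could look roughly like this. It is only a sketch: the function name `matches_with_str_methods` is hypothetical, and it assumes that any mix of case is acceptable for "id", which is exactly the open question asked above.

```python
def matches_with_str_methods(line):
    """Case-sensitive 'Msg-' prefix followed by a case-insensitive 'id:'."""
    prefix = "Msg-"
    if not line.startswith(prefix):
        return False
    # Take the three characters after the prefix and compare them ignoring case.
    rest = line[len(prefix):len(prefix) + 3]
    return rest.lower() == "id:"


for candidate in ["Msg-id: 1", "Msg-Id: 2", "msg-id: 3", "MSG-ID: 4"]:
    print(candidate, matches_with_str_methods(candidate))
```

If only "ID" and "Id" should be allowed, the last line of the function would become a membership test such as `rest in ("ID:", "Id:")`.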

    >
    > Not that I'm aware of.
    >
    > > the regex should match Msg-id:, Msg-Id, ... but not msg-id: and so on.

    >
    > What's the use-case for this?
    >
    > The way I would typically do something like this is build my regexes in
    > all lower case and .lower() the text I was matching against them. I'm
    > curious what you're doing where you want to enforce case sensitivity in
    > one part of a header, but not in another.
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >




    --
    Joel Goldstick
     
    Joel Goldstick, Dec 30, 2012
    #2

  3. On Sun, 30 Dec 2012 10:20:19 -0500, Roy Smith wrote:

    > The way I would typically do something like this is build my regexes in
    > all lower case and .lower() the text I was matching against them. I'm
    > curious what you're doing where you want to enforce case sensitivity in
    > one part of a header, but not in another.


    Well, sometimes you have things that are case sensitive, and other things
    which are not, and sometimes you need to match them at the same time. I
    don't think this is any more unusual than (say) wanting to match an
    otherwise lowercase word whether or not it comes at the start of a
    sentence:

    "[Pp]rogramming"

    is conceptually equivalent to "match case-insensitive `p`, and case-
    sensitive `rogramming`".
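The same character-class trick applies to the header from the original question. This is a sketch with the standard `re` module; the sample lines are invented for illustration.

```python
import re

# 'Msg-' must match exactly; each letter of 'id' may be upper or lower case.
MSG_ID = re.compile(r"^Msg-[Ii][Dd]:")

lines = ["Msg-id: 1", "Msg-Id: 2", "Msg-ID: 3", "msg-id: 4", "MSG-ID: 5"]
matched = [line for line in lines if MSG_ID.match(line)]
print(matched)
```

Only the first three lines match. Spelling out each letter gets tedious for long words, but it works in every regex engine and needs no flags.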


    By the way, although there is probably nothing you can (easily) do about
    this prior to Python 3.3, converting to lowercase is not the right way to
    do case-insensitive matching. It happens to work correctly for ASCII, but
    it is not correct for all alphabetic characters.


    py> 'Straße'.lower()
    'straße'
    py> 'Straße'.upper()
    'STRASSE'


    The right way is to casefold first, then match:

    py> 'Straße'.casefold()
    'strasse'
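A small sketch of the difference when comparing two strings, assuming Python 3.3 or later (where `str.casefold` exists). The helper name `same_ignoring_case` is made up for this example.

```python
def same_ignoring_case(a, b):
    """Compare two strings caselessly by casefolding both sides."""
    return a.casefold() == b.casefold()


# casefold() maps 'ß' to 'ss', so the two spellings compare equal.
print(same_ignoring_case("Straße", "STRASSE"))

# lower() leaves 'ß' alone, so the same comparison fails.
print("Straße".lower() == "STRASSE".lower())
```

When combining this with a regex, the pattern text has to be casefolded as well, otherwise a literal 'ß' in the pattern will never match the folded input.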


    Curiously, there is an uppercase ß in old German. In recent years some
    typographers have started using it instead of SS, but it's still rare,
    and the official German rules have ß transform into SS and vice versa.
    It's in Unicode, but few fonts show it:

    py> import unicodedata
    py> unicodedata.lookup('LATIN CAPITAL LETTER SHARP S')
    'ẞ'



    --
    Steven
     
    Steven D'Aprano, Jan 1, 2013
    #3
  4. 2013/1/1 Steven D'Aprano <>:
    > On Sun, 30 Dec 2012 10:20:19 -0500, Roy Smith wrote:
    >
    >> The way I would typically do something like this is build my regexes in
    >> all lower case and .lower() the text I was matching against them. I'm
    >> curious what you're doing where you want to enforce case sensitivity in
    >> one part of a header, but not in another.

    >
    > Well, sometimes you have things that are case sensitive, and other things
    > which are not, and sometimes you need to match them at the same time. I
    > don't think this is any more unusual than (say) wanting to match an
    > otherwise lowercase word whether or not it comes at the start of a
    > sentence:
    >
    > "[Pp]rogramming"
    >
    > is conceptually equivalent to "match case-insensitive `p`, and case-
    > sensitive `rogramming`".
    >
    >
    > By the way, although there is probably nothing you can (easily) do about
    > this prior to Python 3.3, converting to lowercase is not the right way to
    > do case-insensitive matching. It happens to work correctly for ASCII, but
    > it is not correct for all alphabetic characters.
    >
    >
    > py> 'Straße'.lower()
    > 'straße'
    > py> 'Straße'.upper()
    > 'STRASSE'
    >
    >
    > The right way is to casefold first, then match:
    >
    > py> 'Straße'.casefold()
    > 'strasse'
    >
    >
    > Curiously, there is an uppercase ß in old German. In recent years some
    > typographers have started using it instead of SS, but it's still rare,
    > and the official German rules have ß transform into SS and vice versa.
    > It's in Unicode, but few fonts show it:
    >
    > py> unicodedata.lookup('LATIN CAPITAL LETTER SHARP S')
    > 'ẞ'
    >
    >
    >
    > --
    > Steven


    Hi,
    just for completeness, the mentioned regex library can take care of
    casefolding in case-insensitive matching (in all supported versions:
    Python 2.5-2.7 and 3.1-3.3), e.g.:

    # case-sensitive match:
    >>> for m in regex.findall(ur"Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
    ...
    Straße

    # case-insensitive match:
    >>> for m in regex.findall(ur"(?i)Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
    ...
    STRAßE
    STRAẞE
    Straße

    # case-insensitive match with casefolding:
    >>> for m in regex.findall(ur"(?if)Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
    ...
    STRAßE
    STRASSE
    STRAẞE
    Strasse
    Straße
    >>>

    # after enabling the backwards-incompatible modern matching behaviour,
    # casefolding is turned on by default for case-insensitive matches:
    >>> for m in regex.findall(ur"(?V1i)Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
    ...
    STRAßE
    STRASSE
    STRAẞE
    Strasse
    Straße
    >>>



    As a small addition, the originally posted pattern r'^Msg-(?:(?i)id):'
    would actually work as expected in this modern matching mode in regex,
    which is enabled with the V1 flag. In this mode the flag-setting (?i)
    only affects the parts of the pattern that follow it, not the whole
    pattern as in the current "re" and in V0-compatibility-mode "regex".

    >>> regex.findall(r"(?V1)Msg-(?:(?i)id):", "the regex should match Msg-id:, Msg-Id:, ... but not msg-id:, MSG-ID: and so on")
    ['Msg-id:', 'Msg-Id:']
    >>>
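For readers on a newer Python: since Python 3.6 the standard `re` module supports scoped inline flags of the form `(?i:...)`, which gives the same effect without the third-party regex library. The sketch below assumes Python 3.6 or later and reuses the test sentence from the example above. Note that it is not the pattern originally posted: a bare `(?i)` in the middle of a pattern is rejected by `re` from Python 3.11 on.

```python
import re

text = ("the regex should match Msg-id:, Msg-Id:, ... "
        "but not msg-id:, MSG-ID: and so on")

# (?i:id) makes only 'id' case-insensitive; 'Msg-' stays case-sensitive.
found = re.findall(r"Msg-(?i:id):", text)
print(found)
```

This prints the same two matches as the V1-mode regex call, 'Msg-id:' and 'Msg-Id:'.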


    regards,
    vbr
     
    Vlastimil Brom, Jan 1, 2013
    #4
  5. Roy Smith

    Guest

    On Wednesday, 2 January 2013 00:09:45 UTC+1, Vlastimil Brom wrote:

    ------

    Vlastimil:

    Excellent.

    -----

    Steven:

    > It's in Unicode, but few fonts show it:

    The capital Eszett (das grosse Eszett) is a member of the Unicode
    subsets MES-2 and WGL-4. Good, serious fonts are MES-2 or WGL-4
    compliant via OpenType, so it is not a problem.

    I do not know (and I did not check) whether the code point, 1e9e, is
    part of the UTF-32 table.
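On the UTF-32 question: UTF-32 is an encoding form that can represent every Unicode code point, so U+1E9E is covered like any other. A quick check in Python 3 (the variable names are just for illustration):

```python
import unicodedata

capital_sharp_s = "\u1e9e"

# UTF-32 stores each code point as one 4-byte unit; big-endian, no BOM here.
encoded = capital_sharp_s.encode("utf-32-be")

print(unicodedata.name(capital_sharp_s))
print(encoded.hex())
print(encoded.decode("utf-32-be") == capital_sharp_s)
```

The encoded bytes are simply the code point value 0x00001E9E, and decoding them returns the original character.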

    jmf
     
    , Jan 2, 2013
    #5
