How to escape # hash character in regex match strings

Discussion in 'Python' started by 504crank@gmail.com, Jun 10, 2009.

  1. Guest

    I've encountered a problem with my RegEx learning curve -- how to
    escape hash characters # in strings being matched, e.g.:

    >>> string = re.escape('123#abc456')
    >>> match = re.match('\d+', string)
    >>> print match
    <_sre.SRE_Match object at 0x00A6A800>
    >>> print match.group()
    123

    The correct result should be:

    123456

    I've tried to escape the hash symbol in the match string without
    result.

    Any ideas? Is the answer something I overlooked in my lurching Python
    schooling?
     
    , Jun 10, 2009
    #1

  2. Peter Otten Guest

    wrote:

    > I've encountered a problem with my RegEx learning curve -- how to
    > escape hash characters # in strings being matched, e.g.:
    >
    > >>> string = re.escape('123#abc456')
    > >>> match = re.match('\d+', string)
    > >>> print match
    > <_sre.SRE_Match object at 0x00A6A800>
    > >>> print match.group()
    > 123
    >
    > The correct result should be:
    >
    > 123456


    >>> "".join(re.findall(r"\d+", "123#abc456"))
    '123456'

    > I've tried to escape the hash symbol in the match string without
    > result.
    >
    > Any ideas? Is the answer something I overlooked in my lurching Python
    > schooling?


    re.escape() is used to build the regex from a string that may contain
    characters that have a special meaning in regular expressions but that you
    want to treat as literals. You can for example search for r"C:\dir" with

    >>> re.compile(re.escape(r"C:\dir")).findall(r"C:\dir C:7ir")
    ['C:\\dir']

    Without escaping you'd get

    >>> re.compile(r"C:\dir").findall(r"C:\dir C:7ir")
    ['C:7ir']
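
A minimal sketch of the same point in Python 3 syntax, assuming the goal is to collect the digits on both sides of the hash. `re.escape()` belongs on text that is being turned into a pattern, not on the string being searched. The variable names are illustrative only.

```python
import re

subject = "123#abc456"

# Escaping the subject only changes the data being searched: a backslash is
# inserted before the hash, and \d+ still stops at the first non-digit.
escaped_subject = re.escape(subject)
first_run = re.match(r"\d+", escaped_subject).group()

# Escaping is useful when literal text becomes part of a pattern.
literal = r"C:\dir"
found = re.findall(re.escape(literal), r"C:\dir C:7ir")

# Collecting every run of digits gives the result the question asked for.
digits = "".join(re.findall(r"\d+", subject))

print(escaped_subject, first_run, found, digits)
```
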

    Peter
     
    Peter Otten, Jun 10, 2009
    #2

  3. David Shapiro Guest

    Maybe using a Unicode equivalent of # would do the trick.

     
    David Shapiro, Jun 10, 2009
    #3
  4. Lie Ryan Guest

    wrote:
    > I've encountered a problem with my RegEx learning curve -- how to
    > escape hash characters # in strings being matched, e.g.:
    >
    > >>> string = re.escape('123#abc456')
    > >>> match = re.match('\d+', string)
    > >>> print match
    > <_sre.SRE_Match object at 0x00A6A800>
    > >>> print match.group()
    > 123
    >
    > The correct result should be:
    >
    > 123456
    >
    > I've tried to escape the hash symbol in the match string without
    > result.
    >
    > Any ideas? Is the answer something I overlooked in my lurching Python
    > schooling?


    As you're not being clear on what you wanted, I'm just guessing this is
    what you wanted:

    >>> s = '123#abc456'
    >>> re.match(r'\d+', re.sub(r'#\D+', '', s)).group()
    '123456'
    >>> s = '123#this is a comment and is ignored456'
    >>> re.match(r'\d+', re.sub(r'#\D+', '', s)).group()
    '123456'
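
The two-step approach above can be wrapped in a small helper. This is only a sketch in Python 3 syntax; the function name `digits_around_hash` is hypothetical, and the `None` check is an addition so that input without leading digits does not raise `AttributeError`.

```python
import re

def digits_around_hash(text):
    """Drop a hash and the non-digit run after it, then read the leading digits."""
    cleaned = re.sub(r"#\D+", "", text)
    match = re.match(r"\d+", cleaned)
    # re.match returns None when the cleaned text does not start with a digit.
    return match.group() if match else None

print(digits_around_hash("123#abc456"))
print(digits_around_hash("abc#123"))
```

Note that `#\D+` needs at least one non-digit after the hash, so a hash followed directly by a digit is left in place.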
     
    Lie Ryan, Jun 11, 2009
    #4
  5. Brian D Guest

    On Jun 11, 2:01 am, Lie Ryan <> wrote:
    > As you're not being clear on what you wanted, I'm just guessing this is
    > what you wanted:
    >
    > >>> s = '123#abc456'
    > >>> re.match('\d+', re.sub('#\D+', '', s)).group()
    > '123456'
    > >>> s = '123#this is a comment and is ignored456'
    > >>> re.match('\d+', re.sub('#\D+', '', s)).group()
    > '123456'


    Sorry I wasn't clearer. I really appreciate your reply. It
    provides half of what I'm hoping to learn. The hash character is
    actually a desirable hook to identify a data entity in a scraping
    routine I'm developing, but not a character I want in the scrubbed
    data.

    In my application, the hash makes a string of alphanumeric characters
    unique from other alphanumeric strings. The strings I'm looking for
    are actually manually-entered identifiers, but a real machine-created
    identifier shouldn't contain that hash character. The correct pattern
    should be 'A1234509', but is instead often merely entered as '#12345'
    when the first character, representing an alphabet sequence for the
    month, and the last two characters, representing a two-digit year, can
    be assumed. Identifying the hash character in a RegEx match is a way
    of trapping the string and transforming it into its correct machine-
    generated form.
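
Using the hash as a hook might look like the sketch below, written in Python 3 syntax. It assumes a hand-entered identifier is a hash followed directly by digits, as in the '#12345' example; the sample text and the names `HASH_ID` and `serials` are made up for illustration.

```python
import re

# Assumption: a hand-entered identifier is a hash followed by digits.
# The hash needs no backslash in an ordinary pattern.
HASH_ID = re.compile(r"#(\d+)")

sample = "Ticket #12345 was reopened; see also item 67890 and #54321."

# findall returns only the captured group, so the hash itself is dropped.
serials = HASH_ID.findall(sample)
print(serials)
```

The bare number 67890 is skipped because it has no hash in front of it, which is the trapping behaviour described above.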

    I'm surprised it's been so difficult to find an example of the hash
    character in a RegEx string -- for exactly this type of situation,
    since it's so common in the real world that people want to put a pound
    symbol in front of a number.

    Thanks!
     
    Brian D, Jun 11, 2009
    #5
  6. Brian D Guest


    By the way, here are other forms the strings can take when entered
    manually:

    A#12345
    #1234509

    Garbage in, garbage out -- I know. I wish I could tell the people
    entering the data how challenging it is to work with what they
    provide, but it is, after all, a screen-scraping routine.
     
    Brian D, Jun 11, 2009
    #6
  8. Rhodri James Guest

    On Thu, 11 Jun 2009 15:22:44 +0100, Brian D <> wrote:

    > I'm surprised it's been so difficult to find an example of the hash
    > character in a RegEx string -- for exactly this type of situation,
    > since it's so common in the real world that people want to put a pound
    > symbol in front of a number.


    It's a character with no special meaning to the regex engine, so I'm not
    in the least surprised that there aren't many examples containing it.
    You could just as validly claim that there aren't many examples involving
    the letter 'q'.
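
A short Python 3 sketch of this point. In an ordinary pattern the hash is matched literally. The one case where it does carry meaning in Python's `re` module is `re.VERBOSE` mode, where an unescaped hash starts a comment. The sample string is illustrative.

```python
import re

# In a normal pattern the hash is an ordinary character.
plain = re.search(r"#\d+", "order #12345")

# Under re.VERBOSE an unescaped hash starts a comment, so the pattern below
# is effectively just \d+ followed by a comment.
verbose_unescaped = re.search(r"\d+  # digits", "order #12345", re.VERBOSE)

# Escaping the hash keeps it literal even in verbose mode.
verbose_escaped = re.search(r"\# \d+", "order #12345", re.VERBOSE)

print(plain.group(), verbose_unescaped.group(), verbose_escaped.group())
```
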

    By the way, I don't know what you're doing but I'm seeing all of your
    posts twice, from two different addresses. This is a little confusing,
    to put it mildly, and doesn't half break the threading.

    --
    Rhodri James *-* Wildebeest Herder to the Masses
     
    Rhodri James, Jun 11, 2009
    #8
  9. Lie Ryan Guest

    Brian D wrote:
    > By the way, other forms the strings can take in their manually created
    > forms:
    >
    > A#12345
    > #1234509
    >
    > Garbage in, garbage out -- I know. I wish I could tell the people
    > entering the data how challenging it is to work with what they
    > provide, but it is, after all, a screen-scraping routine.


    Perhaps it's like this?

    >>> # you can use re.search if that suits better
    >>> a = re.match(r'([A-Z]?)#(\d{5})(\d\d)?', 'A#12345')
    >>> b = re.match(r'([A-Z]?)#(\d{5})(\d\d)?', '#1234509')
    >>> a.group(0)
    'A#12345'
    >>> a.group(1)
    'A'
    >>> a.group(2)
    '12345'
    >>> a.group(3)
    >>> b.group(0)
    '#1234509'
    >>> b.group(1)
    ''
    >>> b.group(2)
    '12345'
    >>> b.group(3)
    '09'
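
Building on the groups shown in this session, one way to rebuild the machine-style identifier is sketched below in Python 3 syntax. The function name `normalize` and the fallback values `default_month="A"` and `default_year="09"` are hypothetical; the thread does not say how the missing month letter and year should be chosen.

```python
import re

# The pattern from the session above, as a compiled raw string.
ID_PATTERN = re.compile(r"([A-Z]?)#(\d{5})(\d\d)?")

def normalize(raw, default_month="A", default_year="09"):
    """Return the canonical identifier, or None if raw does not fit the pattern.

    default_month and default_year are hypothetical fallbacks; a real routine
    would take them from the record being scraped.
    """
    m = ID_PATTERN.fullmatch(raw)
    if m is None:
        return None
    # group(1) is '' and group(3) is None when the part was left out.
    month = m.group(1) or default_month
    year = m.group(3) or default_year
    return month + m.group(2) + year

for raw in ("#12345", "A#12345", "#1234509", "A1234509"):
    print(raw, "->", normalize(raw))
```

Strings without a hash, such as an already well-formed 'A1234509', return `None` here, so the caller can pass them through unchanged.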
     
    Lie Ryan, Jun 14, 2009
    #9