Weird problem matching with REs

Discussion in 'Python' started by Andrew Berg, May 29, 2011.

  1. Andrew Berg

    Andrew Berg Guest

    I have an RE that should work (it even works in Kodos [1], but not in my
    code), but it keeps failing to match characters after a newline.

    I'm writing a little program that scans the webpage of an arbitrary
    application and gets the newest version advertised on the page.


    test3.py:
    > # -*- coding: utf-8 -*-
    >
    > import configparser
    > import re
    > import urllib.request
    > import os
    > import sys
    > import logging
    > import collections
    >
    >
    > class CouldNotFindVersion(Exception):
    >     def __init__(self, app_name, reason, exc_value):
    >         self.value = 'The latest version of ' + app_name + ' could not be determined because ' + reason
    >         self.cause = exc_value
    >     def __str__(self):
    >         return repr(self.value)
    >
    > class AppUpdateItem():
    >     def __init__(self, config_file_name, config_file_section):
    >         self.section = config_file_section
    >         self.name = self.section['Name']
    >         self.url = self.section['URL']
    >         self.filename = self.section['Filename']
    >         self.file_re = re.compile(self.section['FileURLRegex'])
    >         self.ver_re = re.compile(self.section['VersionRegex'])
    >         self.prev_ver = self.section['CurrentVersion']
    >         try:
    >             self.page = str(urllib.request.urlopen(self.url).read(), encoding='utf-8')
    >             self.file_URL = self.file_re.findall(self.page)[0] #here is where it fails
    >             self.last_ver = self.ver_re.findall(self.file_URL)[0]
    >         except urllib.error.URLError:
    >             self.error = str(sys.exc_info()[1])
    >             logging.info('[' + self.name + ']' + ' Could not load URL: ' + self.url + ' : ' + self.error)
    >             self.success = False
    >             raise CouldNotFindVersion(self.name, self.error, sys.exc_info()[0])
    >         except IndexError:
    >             logging.warning('Regex did not return a match.')
    >     def update_ini(self):
    >         self.section['CurrentVersion'] = self.last_ver
    >         with open(config_file_name, 'w') as configfile:
    >             config.write(configfile)
    >     def rollback_ini(self):
    >         self.section['CurrentVersion'] = self.prev_ver
    >         with open(config_file_name, 'w') as configfile:
    >             config.write(configfile)
    >     def download_file(self):
    >         self.__filename = self.section['Filename']
    >         with open(self.__filename, 'wb') as file:
    >             self.__file_req = urllib.request.urlopen(self.file_URL).read()
    >             file.write(self.__file_req)
    >
    >
    > if __name__ == '__main__':
    >     config = configparser.ConfigParser()
    >     config_file = 'checklist.ini'
    >     config.read(config_file)
    >     queue = collections.deque()
    >     for section in config.sections():
    >         try:
    >             queue.append(AppUpdateItem(config_file, config[section]))
    >         except CouldNotFindVersion as exc:
    >             logging.warning(exc.value)
    >     for elem in queue:
    >         if elem.last_ver != elem.prev_ver:
    >             elem.update_ini()
    >             try:
    >                 elem.download_file()
    >             except IOError:
    >                 logging.warning('[' + elem.name + '] Download failed.')
    >             except:
    >                 elem.rollback_ini()
    >             print(elem.name + ' succeeded.')


    checklist.ini:
    > [x264_64]
    > name = x264 (64-bit)
    > filename = x264.exe
    > url = http://x264.nl/x264_main.php
    > fileurlregex = http://x264.nl/x264/64bit/8bit_depth/revision\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}.exe
    > versionregex = [0-9]{4}
    > currentversion = 1995


    The part it's supposed to match in http://x264.nl/x264_main.php:
    > <a href="http://x264.nl/x264/64bit/8bit_depth/revision
    > 1995
    > /x264
    >
    > .exe <view-source-tab:http://x264.nl/x264/64bit/8bit_depth/revision%0A1995%0A/x264%0A%0A.exe>"

    I was able to make a regex that matches in my code, but it shouldn't:
    http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
    I have to add a dot before each "\n". There is no character not
    accounted for before those newlines, but I don't get a match without the
    dots. I also need both those ".\n{1,3}" sequences before the ".exe". I'm
    really confused.

    Using Python 3.2 on Windows, in case it matters.


    [1] http://kodos.sourceforge.net/ (using the compiled Win32 version
    since it doesn't work with Python 3)
     
    Andrew Berg, May 29, 2011
    #1

  2. On Sun, 29 May 2011 06:45:30 -0500, Andrew Berg wrote:

    > I have an RE that should work (it even works in Kodos [1], but not in my
    > code), but it keeps failing to match characters after a newline.


    Not all regexes are the same. Different regex engines accept different
    symbols, and sometimes behave differently, or have different default
    behavior. That your regex works in Kodos but not Python might mean you're
    writing a Kodos regex instead of a Python regex.

    > I'm writing a little program that scans the webpage of an arbitrary
    > application and gets the newest version advertised on the page.


    Firstly, most of the code you show is irrelevant to the problem. Please
    simplify it to the shortest, most simple example you can give. That would
    be a simplified piece of text (not the entire web page!), the regex, and
    the failed attempt to use it. The rest of your code is just noise for the
    purposes of solving this problem.

    Secondly, you probably should use a proper HTML parser, rather than a
    regex. Resist the temptation to use regexes to rip out bits of text from
    HTML, it almost always goes wrong eventually.
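
    As a rough illustration of the parser route, here is a minimal sketch
    using the standard library's html.parser on a small, well-formed
    fragment. The fragment, the link text and the HrefCollector class name
    are assumptions made up for this example; the real page may not parse
    this cleanly.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, with all whitespace removed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # The attribute value may contain line breaks; drop them.
                self.hrefs.append(''.join(value.split()))


# Assumed sample fragment, modelled on the excerpt quoted in the question.
sample = ('<p><a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
          '1995\r\n/x264\r\n\r\n.exe">64-bit build</a></p>')

collector = HrefCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

    Once the whitespace is stripped from the attribute value, the version
    can be pulled out of the cleaned URL with a much simpler regex such as
    [0-9]{4}.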


    > I was able to make a regex that matches in my code, but it shouldn't:
    > http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe

    What makes you think it shouldn't match?

    By the way, you probably should escape the dots, otherwise it will match
    strings containing any arbitrary character, rather than *just* dots:

    http://x264Znl ...blah blah blah



    --
    Steven
     
    Steven D'Aprano, May 29, 2011
    #2

  3. Andrew Berg

    Andrew Berg Guest

    On 2011.05.29 08:00 AM, Ben Finney wrote:
    > You are aware that most text-emitting processes on Windows, and Internet
    > text protocols like the HTTP standard, use the two-character “CR LF”
    > sequence (U+000D U+000A) for terminating lines?

    Yes, but I was not having trouble with just '\n' before, and the pattern
    did match in Kodos, so I figured Python was doing its newline magic like
    it does with the write() method for file objects.
    http://x264.nl/x264/64bit/8bit_depth/revision[\r\n]{1,3}[0-9]{4}[\r\n]{1,3}/x264[\r\n]{1,3}.exe
    does indeed match. One thing that confuses me, though (and one reason I
    dismissed the possibility of it being a newline issue): isn't '.'
    supposed to not match '\r'?
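
    The behaviour described in this thread can be reproduced without
    fetching the page. The sketch below assumes the link is served with
    CR LF line endings (the sample string is reconstructed from the excerpt
    in the first post, not copied from the live page). The last pattern
    uses * instead of the {1,3} quoted above so that the number of line
    breaks does not matter; that change is an assumption for the example.

```python
import re

# Assumed sample: the link from the page, with CR LF line endings.
page = ('<a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
        '1995\r\n'
        '/x264\r\n'
        '\r\n'
        '.exe"')

prefix = r'http://x264\.nl/x264/64bit/8bit_depth/'

# Only '\n' is allowed between the parts, so each '\r' blocks the match.
lf_only = re.compile(
    prefix + r'revision\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}\.exe')

# The workaround: every extra '.' consumes the '\r' in front of a '\n'.
dot_lf = re.compile(
    prefix + r'revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}\.exe')

# Accept any run of CR and LF characters between the parts.
cr_or_lf = re.compile(
    prefix + r'revision[\r\n]*[0-9]{4}[\r\n]*/x264[\r\n]*\.exe')

print(lf_only.search(page) is None)
print(dot_lf.search(page) is not None)
print(cr_or_lf.search(page) is not None)
```

    With this sample the '\n'-only pattern fails while the other two match,
    which is consistent with the dots silently matching '\r'.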
     
    Andrew Berg, May 29, 2011
    #3
  4. Andrew Berg

    Andrew Berg Guest

    On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
    > On Sun, 29 May 2011 06:45:30 -0500, Andrew Berg wrote:
    >
    > > I have an RE that should work (it even works in Kodos [1], but not in my
    > > code), but it keeps failing to match characters after a newline.

    >
    > Not all regexes are the same. Different regex engines accept different
    > symbols, and sometimes behave differently, or have different default
    > behavior. That your regex works in Kodos but not Python might mean you're
    > writing a Kodos regex instead of a Python regex.

    Kodos is written in Python and uses Python's regex engine. In fact, it
    is specifically intended to debug Python regexes.
    > Firstly, most of the code you show is irrelevant to the problem. Please
    > simplify it to the shortest, most simple example you can give. That would
    > be a simplified piece of text (not the entire web page!), the regex, and
    > the failed attempt to use it. The rest of your code is just noise for the
    > purposes of solving this problem.

    I wasn't sure how much would be relevant since it could've been a
    problem with other code. I do apologize for not putting more effort into
    trimming it down, though.
    > Secondly, you probably should use a proper HTML parser, rather than a
    > regex. Resist the temptation to use regexes to rip out bits of text from
    > HTML, it almost always goes wrong eventually.

    I find this a much simpler approach, especially since I'm dealing with
    broken HTML. I guess I don't see how the effort put into learning a
    parser and adding the extra code to use it pays off in this particular
    endeavor.
    > > I was able to make a regex that matches in my code, but it shouldn't:
    > > http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
    >
    > What makes you think it shouldn't match?

    AFAIK, dots aren't supposed to match carriage returns or any other
    whitespace characters.
    > By the way, you probably should escape the dots, otherwise it will match
    > strings containing any arbitrary character, rather than *just* dots:

    You're right; I overlooked the dots in the URL.
     
    Andrew Berg, May 29, 2011
    #4
  5. On Sun, 29 May 2011 08:41:16 -0500, Andrew Berg wrote:

    > On 2011.05.29 08:09 AM, Steven D'Aprano wrote:

    [...]
    > Kodos is written in Python and uses Python's regex engine. In fact, it
    > is specifically intended to debug Python regexes.


    Fair enough.

    >> Secondly, you probably should use a proper HTML parser, rather than a
    >> regex. Resist the temptation to use regexes to rip out bits of text
    >> from HTML, it almost always goes wrong eventually.

    >
    > I find this a much simpler approach, especially since I'm dealing with
    > broken HTML. I guess I don't see how the effort put into learning a
    > parser and adding the extra code to use it pays off in this particular
    > endeavor.


    The temptation to take short-cuts leads to the Dark Side :)

    Perhaps you're right, in this instance. But if you need to deal with
    broken HTML, try BeautifulSoup.


    >> What makes you think it shouldn't match?

    >
    > AFAIK, dots aren't supposed to match carriage returns or any other
    > whitespace characters.


    They won't match *newlines* \n unless you pass the DOTALL flag, but they
    do match whitespace:

    >>> re.search('abc.efg', '----abc efg----').group()

    'abc efg'
    >>> re.search('abc.efg', '----abc\refg----').group()

    'abc\refg'
    >>> re.search('abc.efg', '----abc\nefg----') is None

    True
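
    For comparison, a short sketch of the DOTALL flag mentioned above, plus
    a loop that checks which common whitespace characters a bare dot
    accepts. The variable names are only for illustration, and
    re.fullmatch needs Python 3.4 or later.

```python
import re

text = 'abc\nefg'

# Without DOTALL the dot refuses the line feed.
without_flag = re.search('abc.efg', text)

# With DOTALL the dot matches any character, including '\n'.
with_flag = re.search('abc.efg', text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group()))

# Which common whitespace characters does a bare '.' match?
dot_matches = {ch: bool(re.fullmatch('.', ch)) for ch in ' \t\r\n\f\v'}
for ch, matched in dot_matches.items():
    print(repr(ch), matched)
```

    Only '\n' is rejected by the plain dot; '\r' is treated like any other
    character.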


    --
    Steven
     
    Steven D'Aprano, May 29, 2011
    #5
  6. Andrew Berg

    Andrew Berg Guest

    On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
    > >> What makes you think it shouldn't match?

    > >
    > > AFAIK, dots aren't supposed to match carriage returns or any other
    > > whitespace characters.

    >
    > They won't match *newlines* \n unless you pass the DOTALL flag, but they
    > do match whitespace:
    >
    > >>> re.search('abc.efg', '----abc efg----').group()

    > 'abc efg'
    > >>> re.search('abc.efg', '----abc\refg----').group()

    > 'abc\refg'
    > >>> re.search('abc.efg', '----abc\nefg----') is None

    > True

    I got things mixed up there (was thinking whitespace instead of
    newlines), but I thought dots aren't supposed to match '\r' (carriage
    return). Why is '\r' not considered a newline character?
     
    Andrew Berg, May 29, 2011
    #6
  7. Roy Smith

    Roy Smith Guest

    In article <>,
    Andrew Berg <> wrote:

    > Kodos is written in Python and uses Python's regex engine. In fact, it
    > is specifically intended to debug Python regexes.


    Named after the governor of Tarsus IV?
     
    Roy Smith, May 29, 2011
    #7
  8. Andrew Berg

    Andrew Berg Guest

    Andrew Berg, May 29, 2011
    #8
  9. John S

    John S Guest

    On May 29, 10:35 am, Andrew Berg <> wrote:
    > On 2011.05.29 09:18 AM, Steven D'Aprano wrote:> >> What makes you think it shouldn't match?
    >
    > > > AFAIK, dots aren't supposed to match carriage returns or any other
    > > > whitespace characters.

    >
    > I got things mixed up there (was thinking whitespace instead of
    > newlines), but I thought dots aren't supposed to match '\r' (carriage
    > return). Why is '\r' not considered a newline character?


    Dots don't match end-of-line-for-your-current-OS is how I think of
    it.

    While I almost always nod my head at Steven D'Aprano's comments, in
    this case I have to say that if you just want to grab something from a
    chunk of HTML, full-blown HTML parsers are overkill. True, malformed
    HTML can throw you off, but it can also throw a parser off.

    I could not make your regex work on my Linux box with Python 2.6.

    In your case, and because x264 might change their HTML, I suggest the
    following code, which works great on my system. YMMV. I changed your
    newline matches to use \s and put some capturing parentheses around
    the version number, so you could grab it.

    >>> import urllib2
    >>> import re
    >>>
    >>> content = urllib2.urlopen("http://x264.nl/x264_main.php").read()
    >>>
    >>> rx_x264version = re.compile(r"http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe")
    >>>
    >>> m = rx_x264version.search(content)
    >>> if m:
    ...     print m.group(1)
    ...
    1995
    >>>



    \s is your friend -- it matches any whitespace character, including
    space, tab, newline, and carriage return. \s* says match 0 or more
    whitespace characters, which is what's needed here in case the web site
    decides to *not* put whitespace in the middle of a URL...

    As Steven said, when you want to match a dot, it needs to be escaped,
    although it will work by accident much of the time. Also, be sure to
    use a raw string when composing REs, so you don't run into backslash
    issues.
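
    A small sketch of both points, using made-up test strings: an unescaped
    dot accepts any character, and a non-raw literal lets Python interpret
    the backslash before the regex engine ever sees it.

```python
import re

# An unescaped dot matches any character except '\n', not just a dot.
loose = re.compile('x264.nl')
strict = re.compile(r'x264\.nl')

print(bool(loose.search('http://x264Znl/')))    # matches by accident
print(bool(strict.search('http://x264Znl/')))   # rejected
print(bool(strict.search('http://x264.nl/')))   # the intended match

# re.escape() adds the backslash for you when the text is meant literally.
escaped = re.escape('x264.nl')
print(escaped)

# In a normal literal '\b' is a backspace character; in a raw literal it
# stays as backslash + 'b', which the regex engine reads as a word boundary.
print(len('\bexe'), len(r'\bexe'))
print(bool(re.search(r'\bexe', 'x264.exe')))
print(bool(re.search('\bexe', 'x264.exe')))
```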

    HTH,
    John Strickler
     
    John S, May 29, 2011
    #9
  10. Andrew Berg

    Andrew Berg Guest

    On 2011.05.29 10:48 AM, John S wrote:
    > Dots don't match end-of-line-for-your-current-OS is how I think of
    > it.

    IMO, the docs should say the dot matches any character except a line
    feed ('\n'), since that is more accurate.
    > True, malformed
    > HTML can throw you off, but it can also throw a parser off.

    That was part of my point. html.parser.HTMLParser from the standard
    library will definitely not work on x264.nl's broken HTML, and fixing it
    requires lxml (I'm working with Python 3; I've looked into
    BeautifulSoup, and it does not work with Python 3 at all). Admittedly,
    fixing x264.nl's HTML only requires one or two lines of code, but really
    nasty HTML might require quite a bit of work.
    > In your case, and because x264 might change their HTML, I suggest the
    > following code, which works great on my system. YMMV. I changed your
    > newline matches to use \s and put some capturing parentheses around
    > the version number, so you could grab it.

    I've been meaning to learn how to use parenthesis groups.
    > Also, be sure to
    > use a raw string when composing REs, so you don't run into backslash
    > issues.

    How would I do that when grabbing strings from a config file (via the
    configparser module)? Or rather, if I have a predefined variable
    containing a string, how do I change it into a raw string?
     
    Andrew Berg, May 29, 2011
    #10
  11. John S

    John S Guest

    On May 29, 12:16 pm, Andrew Berg <> wrote:
    >
    > I've been meaning to learn how to use parenthesis groups.
    > > Also, be sure to
    > > use a raw string when composing REs, so you don't run into backslash
    > > issues.

    >
    > How would I do that when grabbing strings from a config file (via the
    > configparser module)? Or rather, if I have a predefined variable
    > containing a string, how do I change it into a raw string?

    When reading the RE from a file it's not an issue. Only literal
    strings can be raw. If the data is in a file, the data will not be
    parsed by the Python interpreter. This was just a general warning to
    anyone working with REs. It didn't apply in this case.
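
    To illustrate, here is a minimal sketch that reads a pattern through
    configparser from an in-memory string instead of checklist.ini. The
    sample text being searched is an assumption, and interpolation=None is
    an optional choice made here so that a '%' in a pattern would not be
    treated specially. The raw triple-quoted literal is only needed because
    the ini text is embedded in Python source for the example.

```python
import configparser
import re

# Stand-in for the contents of an ini file on disk.
ini_text = r"""
[x264_64]
fileurlregex = http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe
"""

config = configparser.ConfigParser(interpolation=None)
config.read_string(ini_text)

# The value arrives with its backslashes intact; no raw-string step needed.
pattern = config['x264_64']['fileurlregex']
print(pattern)

# Assumed sample of the text to search, with CR LF line endings.
sample = ('http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
          '1995\r\n/x264\r\n\r\n.exe')
m = re.compile(pattern).search(sample)
print(m.group(1))
```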

    --john strickler
     
    John S, May 29, 2011
    #11
  12. Andrew Berg wrote:

    > On 2011.05.29 10:19 AM, Roy Smith wrote:
    >> Named after the governor of Tarsus IV?

    > Judging by the graphic at http://kodos.sourceforge.net/help/kodos.html ,
    > it's named after the Simpsons character.


    <OT>

    I don't think that's a coincidence; both are from other planets and both are
    rather evil[tm]. Kodos the Executioner, arguably human, became a dictator
    who had thousands killed (by his own account, not to let the rest die of
    hunger); Kodos the slimy extra-terrestrial is a conqueror (and he likes to
    zap humans as well ;-))

    [BTW, Tarsus IV, a planet where thousands (would) have died of hunger and
    have died in executions was probably yet another hidden Star Trek euphemism.
    I have found out that Tarsus is, among other things, the name of a
    collection of bones in the human foot next to the heel. Bones as a
    reference to death aside, see also Achilles for the heel. But I'm only
    speculating here.]

    </OT>

    --
    \\//, PointedEars (F'up2 trek)

    Bitte keine Kopien per E-Mail. / Please do not Cc: me.
     
    Thomas 'PointedEars' Lahn, May 29, 2011
    #12
