Partition Recursive

Discussion in 'Python' started by macm, Dec 23, 2010.

  1. macm

    macm Guest

    Hi Folks

    I have this:

    url = 'http://docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition'

    I want to convert it to

    myList =
    ['http',':','//','docs','.','python','.','org','/','dev','/','library','/','stdtypes','.','html','?','highlight','=','partition','#','str','.','partition']

    The reserved characters are:

    specialMeaning = ["//",";","/", "?", ":", "@", "=" , "&","#"]

    Regards

    Mario
    macm, Dec 23, 2010
    #1

  2. macm

    MRAB Guest

    On 23/12/2010 17:26, macm wrote:
    > Hi Folks
    >
    > I have this:
    >
    > url = 'http://docs.python.org/dev/library/stdtypes.html?
    > highlight=partition#str.partition'
    >
    > So I want convert to
    >
    > myList =
    > ['http',':','//','docs','.','python','.','org','/','dev','/','library','/','stdtypes','.','html','?','highlight','=','partition','#','str','.','partition']
    >
    > The reserved char are:
    >
    > specialMeaning = ["//",";","/", "?", ":", "@", "=" , "&","#"]
    >

    I would use re.findall.
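
One way the re.findall suggestion could look is sketched below, written for Python 3. The pattern, the name tokenize, and the decision to treat "." as a delimiter are assumptions made for illustration; the post itself gives no pattern.

```python
import re

url = ("http://docs.python.org/dev/library/stdtypes.html"
       "?highlight=partition#str.partition")

# A token is either a delimiter ("//" is tried before "/") or a run of
# characters that are not delimiters. Including "." is an assumption based
# on the desired output, which splits on dots.
TOKEN = re.compile(r"//|[;/?:@=&#.]|[^;/?:@=&#.]+")


def tokenize(text):
    """Return delimiters and the text between them, in original order."""
    return TOKEN.findall(text)


print(tokenize(url))
```

Because the pattern contains no capturing groups, findall returns the full text of each match, so delimiters and words come back interleaved in the order they appear.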
    MRAB, Dec 23, 2010
    #2

  3. macm

    Jon Clements Guest

    On Dec 23, 5:26 pm, macm <> wrote:
    > Hi Folks
    >
    > I have this:
    >
    > url = 'http://docs.python.org/dev/library/stdtypes.html?
    > highlight=partition#str.partition'
    >
    > So I want convert to
    >
    > myList =
    > ['http',':','//','docs','.','python','.','org','/','dev','/','library','/','stdtypes','.','html','?','highlight','=','partition','#','str','.','partition']
    >
    > The reserved char are:
    >
    > specialMeaning = ["//",";","/", "?", ":", "@", "=" , "&","#"]
    >
    > Regards
    >
    > Mario


    I would use urlparse.urlsplit, then split further, if required.

    >>> from urlparse import urlsplit
    >>> urlsplit(url)
    SplitResult(scheme='http', netloc='docs.python.org', path='/dev/library/stdtypes.html', query='highlight=partition', fragment='str.partition')
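
A rough sketch of what "split further" might mean, for Python 3, where urlsplit lives in urllib.parse rather than urlparse. The variable names are illustrative only. Note that str.split discards the delimiters themselves, so this yields the components but not the interleaved delimiter list from the original question.

```python
from urllib.parse import urlsplit

url = ("http://docs.python.org/dev/library/stdtypes.html"
       "?highlight=partition#str.partition")

parts = urlsplit(url)

# Split each component on the delimiter that is meaningful inside it.
host_labels = parts.netloc.split(".")
path_segments = [seg for seg in parts.path.split("/") if seg]
query_pairs = [pair.split("=", 1) for pair in parts.query.split("&") if pair]

print(parts.scheme, host_labels, path_segments, query_pairs, parts.fragment)
```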



    Jon.
    Jon Clements, Dec 23, 2010
    #3
  4. macm

    macm Guest

    Hi

    urlparse isn't an option.

    My result must be:

    myList =
    ['http',':','//','docs','.','python','.','org','/','dev','/','library','/',
    'stdtypes','.','html','?','highlight','=','partition','#','str','.','partition']

    The re module is slow.

    Even if I loop over the result of urlparse.urlsplit, I can lose the order
    of the specialMeaning delimiters.

    It seems easy, but the best approach will be recursive.
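
A recursive version built on str.partition might look roughly like the sketch below (Python 3). The function name partition_recursive is made up for illustration, and "." has been added to the reserved list on the assumption that the desired output, which splits on dots, is what matters.

```python
# Reserved delimiters from the question, plus "." (an assumption, see above).
SPECIAL = ["//", ";", "/", "?", ":", "@", "=", "&", "#", "."]


def partition_recursive(text, seps=SPECIAL):
    """Split text on the earliest delimiter, then recurse on the remainder."""
    if not text:
        return []
    best_pos, best_sep = len(text), None
    for sep in seps:
        pos = text.find(sep)
        if pos == -1:
            continue
        # Prefer the earliest delimiter; on a tie prefer the longer one,
        # so that "//" wins over "/".
        if pos < best_pos or (pos == best_pos and len(sep) > len(best_sep)):
            best_pos, best_sep = pos, sep
    if best_sep is None:
        return [text]  # no delimiter left
    head, sep, tail = text.partition(best_sep)
    return ([head] if head else []) + [sep] + partition_recursive(tail, seps)


url = ("http://docs.python.org/dev/library/stdtypes.html"
       "?highlight=partition#str.partition")
print(partition_recursive(url))
```

The recursion goes one level deeper per delimiter, which is fine for ordinary URLs but would hit Python's recursion limit on input with around a thousand delimiters.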

    Regards

    Mario




    macm, Dec 23, 2010
    #4
  5. macm

    kj Guest

    In <> macm <> writes:

    >url = 'http://docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition'


    >So I want convert to


    >myList =
    >['http',':','//','docs','.','python','.','org','/','dev','/','library','/','stdtypes','.','html','?','highlight','=','partition','#','str','.','partition']


    >The reserved char are:


    >specialMeaning = ["//",";","/", "?", ":", "@", "=" , "&","#"]



    You forgot '.'.

    >>> import re # sorry
    >>> sp = re.compile('(//?|[;?:@=&#.])')
    >>> filter(len, sp.split(url))
    ['http', ':', '//', 'docs', '.', 'python', '.', 'org', '/', 'dev', '/', 'library', '/', 'stdtypes', '.', 'html', '?', 'highlight', '=', 'partition', '#', 'str', '.', 'partition']
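
For readers on Python 3, where filter returns an iterator rather than a list, the same idea can be sketched as follows. The short "http://docs" input is only there to show why filtering is needed: two adjacent delimiters leave an empty string between them.

```python
import re

url = ("http://docs.python.org/dev/library/stdtypes.html"
       "?highlight=partition#str.partition")

# The capturing group makes re.split keep the delimiters in its result.
sp = re.compile(r"(//?|[;?:@=&#.])")

# ":" is immediately followed by "//", so an empty string appears between them.
raw = sp.split("http://docs")
print(raw)  # ['http', ':', '', '//', 'docs']

# Drop the empty strings to get the final token list.
tokens = [t for t in sp.split(url) if t]
print(tokens)
```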

    ~kj
    kj, Dec 24, 2010
    #5
  6. macm

    Ian Kelly Guest

    On 12/23/2010 10:03 PM, kj wrote:
    >>>> import re # sorry
    >>>> sp = re.compile('(//?|[;?:@=&#.])')
    >>>> filter(len, sp.split(url))


    Perhaps I'm being overly pedantic, but I would likely have written that
    as "filter(None, sp.split(url))" for the same reason that "if string:"
    is generally preferred to "if len(string):".
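
A small illustration of the difference, written for Python 3 (hence the list() calls). The sample lists are invented for the example. For a list of strings the two filters agree, but filter(None, ...) tests truthiness directly and therefore also copes with items that have no length.

```python
pieces = ["http", ":", "", "//", "docs"]

kept_by_len = list(filter(len, pieces))     # calls len() on every item
kept_by_truth = list(filter(None, pieces))  # tests truthiness directly
print(kept_by_len == kept_by_truth)

# With an item that has no len(), only the truthiness test still works.
mixed = ["a", None, "", "b"]
kept_mixed = list(filter(None, mixed))
try:
    list(filter(len, mixed))
    len_failed = False
except TypeError:
    len_failed = True
print(kept_mixed, len_failed)
```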

    Cheers,
    Ian
    Ian Kelly, Dec 24, 2010
    #6
  7. macm

    macm Guest

    Thanks all


    In [11]: reps = 5
    In [12]: t = Timer("url = 'http://docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition'; sp = re.compile('(//?|[;?:@=&#.])'); filter(len, sp.split(url))", 'import re')
    In [13]: print sum(t.repeat(repeat=reps, number=1)) / reps
    4.94003295898e-05

    In [65]: t = Timer("url = 'http://docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition'; sp = re.compile('(//?|[;?:@=&#.])'); filter(None, sp.split(url))", 'import re')
    In [66]: print sum(t.repeat(repeat=reps, number=1)) / reps
    3.50475311279e-05


    Ian's version with None is a little faster. Thanks, kj!

    Hi Mr. James, speed is always important, but OK, re is fine (although it
    could be around e-07).

    As a next step I'll move to Cython to gain some speed.
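
The timings above compile the pattern inside the timed statement and run it only once per repeat, so they include lookup or compilation cost and are fairly noisy. A possible way to measure just the split, sketched for Python 3, is shown below. The helper name best_time and the number and repeat defaults are arbitrary choices for illustration, and no particular result is implied.

```python
import re
import timeit

url = ("http://docs.python.org/dev/library/stdtypes.html"
       "?highlight=partition#str.partition")

# Compile once, outside the timed code.
SP = re.compile(r"(//?|[;?:@=&#.])")


def split_filter_len():
    return list(filter(len, SP.split(url)))


def split_filter_none():
    return list(filter(None, SP.split(url)))


def best_time(func, number=10000, repeat=5):
    """Seconds per call, taking the fastest of several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


print("filter(len):  %.3e s" % best_time(split_filter_len))
print("filter(None): %.3e s" % best_time(split_filter_none))
```

Taking the minimum over repeats rather than the mean is the approach the timeit documentation recommends, since higher values mostly reflect interference from other processes.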

    Regards

    Mario



    macm, Dec 24, 2010
    #7
  8. macm

    DevPlayer Guest

    # parse_url11.py
    #
    # 2010-12 (Dec)-27
    # A brute force ugly hack from a novice programmer.

    # You're welcome to use the code, clean it up, make positive suggestions
    # for improvement.

    """
    Parse a url string into a list using a generator.
    """

    #special_itemMeaning = ";?:@=&#."
    #"//",
    #"/",
    special_item = [";", "?", ":", "@", "=", "&", "#", ".", "/", "//"]

    # drop urls with obviously bad formatting - NOTIMPLEMENTED
    drop_item = ["|", "localhost", "..", "///"]
    ignore_urls_containing = ["php", "cgi"]

    def url_parser_generator(url):
        len_text = len(url)
        index = 0
        start1 = 0    # required here if url contains ONLY specials
        start2 = 0    # required here if url contains ONLY non specials
        while index < len_text:

            # LOOP1 == Get an item in the special_item list; can be any length
            if url[index] in special_item:
                start1 = index
                inloop1 = True
                while inloop1:
                    if inloop1:
                        if url[start1:index+1] in special_item:
                            #print "[",start1, ":", index+1, "] = ", url[start1:index+1]
                            inloop1 = True
                        else:    # not in ANYMORE, but was in special_item
                            #print "[",start1, ":", index, "] = ", url[start1:index]
                            yield url[start1:index]
                            start1 = index
                            inloop1 = False

                    if inloop1:
                        if index < len_text-1:
                            index = index + 1
                        else:
                            #yield url[start1:index] # NEW
                            inloop1 = False

            elif url[index] in drop_item:
                # not properly implemented at all
                # (NotImplementedError is the exception; NotImplemented is not callable)
                raise NotImplementedError(
                    "Processing items in the drop_item list is not "
                    "implemented.", url[index])

            elif url[index] in ignore_urls_containing:
                # not properly implemented at all
                raise NotImplementedError(
                    "Processing items in the ignore_urls_containing list "
                    "is not implemented.", url[index])

            # LOOP2 == Get any item not in the special_item list; can be any length
            elif not url[index] in special_item:
                start2 = index
                inloop2 = True
                while inloop2:
                    if inloop2:
                        #if not url[start2:index+1] in special_item: #<- doesn't work
                        if not url[index] in special_item:
                            #print "[",start2, ":", index+1, "] = ", url[start2:index+1]
                            inloop2 = True
                        else:    # not in ANYMORE, but item was not in special_item before
                            #print "[",start2, ":", index, "] = ", url[start2:index]
                            yield url[start2:index]
                            start2 = index
                            inloop2 = False

                    if inloop2:
                        if index < len_text-1:
                            index = index + 1
                        else:
                            #yield url[start2:index] # NEW
                            inloop2 = False

            else:
                print url[index], "Not Implemented"    # should not get here
                index = index + 1

            if index >= len_text-1:
                break

        # Process any remaining part of URL and yield it to caller.
        # Don't know if last item in url is a special or non special.
        # Used start1 and start2 instead of start and
        # used inloop1 and inloop2 instead of inloop
        # to help debug, as using just "start" and "inloop" can be
        # harder to track in a generator.
        if start1 >= start2:
            start = start1
        else:
            start = start2
        yield url[start: index+1]

    def parse(url):
        mylist = []
        words = url_parser_generator(url)
        for word in words:
            mylist.append(word)
            #print word
        return mylist

    def test():
        urls = {
            0: (True,"http://docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition"),

            1: (True,"/http:///docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition"),
            2: (True,"//http:///docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition"),
            3: (True,"///http:///docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition"),

            4: (True,"/http:///docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition/"),
            5: (True,"//http:///docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition//"),
            6: (True,"///http:///docs.python.org/dev/library/stdtypes.html?highlight=partition#str.partition///"),

            7: (True,"/#/http:///#docs.python..org/dev//////library/stdtypes./html??highlight=p=partition#str.partition///"),

            8: (True,"httpdocspythonorgdevlibrarystdtypeshtmlhighlightpartitionstrpartition"),
            9: (True,"httpdocs.pythonorgdevlibrarystdtypeshtmlhighlightpartitionstrpartition"),
            10: (True,":httpdocspythonorgdevlibrarystdtypeshtmlhighlightpartitionstrpartition"),
            11: (True,"httpdocspythonorgdevlibrarystdtypeshtmlhighlightpartitionstrpartition/"),

            12: (True,"///:;#.???"),     # only special_items
            13: (True,"///a:;#.???"),    # only 1 non special_item
            14: (True,"///:;#.???a"),    # only 1 non special_item
            15: (True,"a///:;#.???"),    # only 1 non special_item
            16: (True,"http://docs.python.php"),
            17: (True,"http://php.python.org"),
            18: (True,"http://www.localhost.com"),
        }

        # test various combinations of special_item characters possible in urls
        for url_num in range(len(urls)):
            value = urls[url_num]
            test, url = value
            if test:    # allow for single testing
                mylist = parse(url)
                print
                print
                print "url:", url_num, " ", url
                print
                print mylist
                print
        return mylist

    test()
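
In the spirit of the invitation to clean the code up, the same character-scanning idea can probably be expressed more compactly. The sketch below is for Python 3 and is not a drop-in replacement: the name iter_tokens is invented here, and the drop_item and ignore_urls_containing lists are left out because the original does not implement them either.

```python
SPECIAL = (";", "?", ":", "@", "=", "&", "#", ".", "/", "//")


def iter_tokens(url, specials=SPECIAL):
    """Yield delimiters and the words between them, in order."""
    # Try longer delimiters first so that "//" wins over "/".
    ordered = sorted(specials, key=len, reverse=True)
    word_start = 0
    i = 0
    while i < len(url):
        match = next((s for s in ordered if url.startswith(s, i)), None)
        if match is None:
            i += 1              # ordinary character, keep scanning the word
            continue
        if word_start < i:
            yield url[word_start:i]   # the word that ended at this delimiter
        yield match
        i += len(match)
        word_start = i
    if word_start < len(url):
        yield url[word_start:]        # trailing word, if any


print(list(iter_tokens("///a:;#.???")))
```

Unlike the original, this version yields nothing for an empty string, since there is no final unconditional yield.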
    DevPlayer, Dec 28, 2010
    #8
