String splitting with exceptions

John Levine

I have a crufty old DNS provisioning system that I'm rewriting and I
hope improving in python. (It's based on tinydns if you know what
that is.)

The record formats are, in the worst case, like this:

foo.[DOM]::[IP6::4361:6368:6574]:600::

What I would like to do is to split this string into a list like this:

[ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]

Colons are separators except when they're inside square brackets. I
have been messing around with re.split() and re.findall() and haven't
been able to come up with either a working separator pattern for
split() or a working field pattern for findall(). I came pretty
close with findall() but can't get it to reliably match the
nothing between two adjacent colons not inside brackets.

Any suggestions? I realize I could do it in a loop where I pick stuff
off the front of the string, but yuck.

This is in python 2.7.5.
 
random832

Can you have brackets within brackets? If so, this is impossible to deal
with within a regex.

Otherwise:
>>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

I'm not sure why _your_ list only has one empty string at the end. Is
the record always terminated by a colon that is not meant to imply an
empty field after it? If so, remove the question mark:
>>> re.findall('((?:[^[:]|\[[^]]*\])*):',s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

I've done this kind of thing (for validation, not capturing) for email
addresses (there are some obscure bits of email address syntax that need
it) before, so it came to mind immediately.
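
The two patterns above can be tried side by side with the following minimal sketch. The names `FIELD`, `fields_optional_colon` and `fields_required_colon` are made up for illustration and are not from the post; the regular expression itself is copied unchanged.

```python
import re

# Field pattern from the post above: a field is any run of characters that
# are neither "[" nor ":", mixed with complete [...] groups (no nesting).
FIELD = r'((?:[^[:]|\[[^]]*\])*)'

def fields_optional_colon(record):
    """Variant with ':?'. Also produces an empty match at the end of the string."""
    return re.findall(FIELD + r':?', record)

def fields_required_colon(record):
    """Variant with ':'. Only returns fields that are followed by a colon."""
    return re.findall(FIELD + r':', record)

record = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
print(fields_optional_colon(record))
print(fields_required_colon(record))
```

Note that the second variant only suits records that always end with a colon: for a string such as `'a:b'` it returns just `['a']`, because `b` has no colon after it.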
 
Tim Chase

random832 said:
> I'm not sure why _your_ list only has one empty string at the end.

I wondered that. I also wondered about bracketed quoting that
doesn't start at the beginning of a field:

foo.[one:two]::[IP6::1234:5678:9101]:600::
    ^
This might be bogus, or one might want to catch this case.

-tkc
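
If one did want to catch this case, a check along the following lines could be run on each field after splitting. This is only a sketch: the helper name `midfield_colon_groups` is hypothetical, and it assumes brackets do not nest.

```python
import re

# A non-nested bracket group that contains at least one colon.
_COLON_GROUP = re.compile(r"\[[^\]]*:[^\]]*\]")

def midfield_colon_groups(field):
    """Return colon-bearing [...] groups that do not start at the field's beginning."""
    return [m.group() for m in _COLON_GROUP.finditer(field) if m.start() > 0]

print(midfield_colon_groups('foo.[one:two]'))          # group starts mid-field
print(midfield_colon_groups('[IP6::4361:6368:6574]'))  # group starts the field
print(midfield_colon_groups('foo.[DOM]'))              # no colon in the group
```

A caller could treat a non-empty result as an error or simply log it, depending on whether such records are considered bogus.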
 
Neil Cerutti

A little parser, as Skip suggested, is a good way to go.

The brackets make your string context-sensitive, a difficult
concept to cleanly parse with a regex.

I initially hoped a csv module dialect could work, but the quote
character is (currently) hard-coded to be a single, simple
character, i.e., I can't tell it to treat [xxx] as "xxx".

What about Skip's suggestion? A little parser. It might seem
crass or something, but it really is easier than muscling a
regex into a context-sensitive grammar.

def dns_split(s):
    in_brackets = False
    b = 0 # index of beginning of current string
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ':':
                yield s[b:i]
                b = i+1
        elif c == "]":
            in_brackets = False

['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

It'll gag on nested brackets (fixable with a counter) and has no
error handling (requires thought), but it's a start.
 
Neil Cerutti

random832 said:
> I'm not sure why _your_ list only has one empty string at the end.

Tim Chase said:
> I wondered that.

Good point. My little parser fails on that, too. It'll miss *all*
final fields. My parser needs "if s: yield s[b:]" at the end, to
operate like str.split, where the empty string is special.
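
Putting the two fixes mentioned so far together (a final yield, and a counter instead of a flag for nested brackets) might look like the sketch below. It is not the code from the post: the variable names are changed, and the final yield is unconditional here, so an empty string gives `['']` the way `''.split(':')` does, rather than using the `if s:` guard suggested above.

```python
def dns_split(record):
    """Yield colon-separated fields, ignoring colons inside [...] groups."""
    depth = 0   # current bracket nesting level
    start = 0   # index where the current field begins
    for i, ch in enumerate(record):
        if ch == "[":
            depth += 1
        elif ch == "]":
            if depth:   # a stray "]" is silently ignored in this sketch
                depth -= 1
        elif ch == ":" and depth == 0:
            yield record[start:i]
            start = i + 1
    # Whatever follows the last top-level colon is the final field.
    yield record[start:]

print(list(dns_split("foo.[DOM]::[IP6::4361:6368:6574]:600::")))
print(list(dns_split("a[b:c[:]]:d")))
```

There is still no error handling: an unclosed `[` simply swallows the rest of the record into the last field.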
 
Peter Otten

Neil said:
> def dns_split(s):
>     in_brackets = False
>     b = 0 # index of beginning of current string
>     for i, c in enumerate(s):
>         if not in_brackets:
>             if c == "[":
>                 in_brackets = True
>             elif c == ':':
>                 yield s[b:i]
>                 b = i+1
>         elif c == "]":
>             in_brackets = False

I think you need one more yield outside the loop.

> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
>
> It'll gag on nested brackets (fixable with a counter) and has no
> error handling (requires thought), but it's a start.

Something similar on top of regex:
>>> import re
>>> def split(s):
...     start = level = 0
...     for m in re.compile(r"[[:\]]").finditer(s):
...         if m.group() == "[": level += 1
...         elif m.group() == "]":
...             assert level
...             level -= 1
...         elif level == 0:
...             yield s[start:m.start()]
...             start = m.end()
...     yield s[start:]
...
>>> list(split("a[b:c:]:d"))
['a[b:c:]', 'd']
>>> list(split("a[b:c[:]]:d"))
['a[b:c[:]]', 'd']
>>> list(split(""))
['']
>>> list(split(":"))
['', '']
>>> list(split(":x"))
['', 'x']
>>> list(split("[:x]"))
['[:x]']
>>> list(split(":[:x]"))
['', '[:x]']
>>> list(split(":[:[:]:x]"))
['', '[:[:]:x]']
>>> list(split("[:::]"))
['[:::]']
>>> s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
>>> list(split(s))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

Note that there is one more empty string which I believe the OP forgot.
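
For reference, here is a sketch of the same idea packaged as a plain function, with two additions that are assumptions rather than anything stated in the thread: a hypothetical `terminated` flag that drops the final empty field when a record is known to end with a terminating colon (which would give the five-element list from the original question), and a `ValueError` in place of the `assert`. The character class is written as `[\[\]:]`; it matches the same three characters, and as far as I know it avoids the "possible nested set" FutureWarning that newer Python 3 releases emit for a class starting with `[[`.

```python
import re

# Matches "[", "]" or ":" (the only characters the splitter cares about).
_DELIMS = re.compile(r"[\[\]:]")

def split_record(record, terminated=False):
    """Split on colons outside (possibly nested) brackets.

    terminated=True is a hypothetical option: it assumes the record ends with
    a terminating colon and drops the empty field that colon would create.
    """
    fields = []
    start = level = 0
    for m in _DELIMS.finditer(record):
        token = m.group()
        if token == "[":
            level += 1
        elif token == "]":
            if level == 0:
                raise ValueError("unmatched ']' at index %d" % m.start())
            level -= 1
        elif level == 0:   # a colon outside all brackets
            fields.append(record[start:m.start()])
            start = m.end()
    if level:
        raise ValueError("unclosed '[' in record")
    fields.append(record[start:])
    if terminated and fields[-1] == "":
        fields.pop()
    return fields

s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
print(split_record(s))
print(split_record(s, terminated=True))
```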
 
wxjmfauth

On Wednesday, 28 August 2013 at 18:44:53 UTC+2, John Levine wrote:
> [...]
> This is in python 2.7.5.
>
> --
> Regards,
> John Levine, (e-mail address removed), Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. http://jl.ly

----------

Basic idea: protect -> split -> unprotect
>>> s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
>>> r = s.replace('[IP6::', '***')
>>> a = r.split('::')
>>> a
['foo.[DOM]', '***4361:6368:6574]:600', '']
>>> a[1] = a[1].replace('***', '[IP6::')
>>> a
['foo.[DOM]', '[IP6::4361:6368:6574]:600', '']

jmf
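
The protect -> split -> unprotect idea can be generalised so that it splits on single colons and does not depend on the literal `[IP6::` prefix. The sketch below is one way to do that, not jmf's code: it hides every colon inside a bracket group behind a sentinel character, splits, then restores the colons. The `protect_split` name and the choice of `"\0"` as sentinel are assumptions, and the sentinel must never occur in a real record. Bracket groups are assumed not to nest.

```python
import re

SENTINEL = "\0"  # assumed never to appear in a record

# One non-nested [...] group.
_GROUP = re.compile(r"\[[^\]]*\]")

def protect_split(record):
    """Split on colons, leaving colons inside [...] groups untouched."""
    # Protect: hide colons that sit inside a bracket group.
    protected = _GROUP.sub(lambda m: m.group().replace(":", SENTINEL), record)
    # Split on the remaining (top-level) colons, then unprotect each field.
    return [field.replace(SENTINEL, ":") for field in protected.split(":")]

print(protect_split('foo.[DOM]::[IP6::4361:6368:6574]:600::'))
```

Like `str.split(':')`, this returns a trailing empty field for a record that ends with a colon.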
 
