String splitting with exceptions

John Levine · Aug 28, 2013

I have a crufty old DNS provisioning system that I'm rewriting, and I
hope improving, in Python. (It's based on tinydns, if you know what
that is.)

The record formats are, in the worst case, like this:

foo.[DOM]::[IP6::4361:6368:6574]:600::

What I would like to do is to split this string into a list like this:

[ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]

Colons are separators except when they're inside square brackets. I
have been messing around with re.split() and re.findall() and haven't
been able to come up with either a working separator pattern for
split() or a working field pattern for findall(). I came pretty
close with findall() but can't get it to reliably match the
nothing between two adjacent colons not inside brackets.

Any suggestions? I realize I could do it in a loop where I pick stuff
off the front of the string, but yuck.

This is in Python 2.7.5.

random832 · Aug 28, 2013

Can you have brackets within brackets? If so, this is impossible to deal
with within a regex.

Otherwise:

re.findall(r'((?:[^[:]|\[[^]]*\])*):?', s)

['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

I'm not sure why _your_ list only has one empty string at the end. Is
the record always terminated by a colon that is not meant to imply an
empty field after it? If so, remove the question mark:

re.findall(r'((?:[^[:]|\[[^]]*\])*):', s)

['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

I've done this kind of thing (for validation, not capturing) for email
addresses (there are some obscure bits of email address syntax that need
it) before, so it came to mind immediately.
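
For anyone trying this on a newer Python, here is a minimal sketch of the same two patterns written as raw strings, so the backslashes reach the regex engine untouched. The names FIELD, with_optional_colon and with_required_colon are labels made up for this example, and the brackets inside the character classes are escaped only for readability.

```python
import re

s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"

# A field is any mix of characters other than '[' and ':' and whole [...] groups.
FIELD = r"((?:[^\[:]|\[[^\]]*\])*)"

# Optional trailing colon: also produces a final empty match at end of string.
with_optional_colon = re.findall(FIELD + r":?", s)

# Required trailing colon: treats the last colon as a record terminator.
with_required_colon = re.findall(FIELD + r":", s)

print(with_optional_colon)
print(with_required_colon)
```

The first call gives six fields and the second gives five, matching the two outputs shown above.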

Tim Chase · Aug 28, 2013

random832 said:
I'm not sure why _your_ list only has one empty string at the end.

I wondered that. I also wondered about bracketed quoting that
doesn't start at the beginning of a field:

foo.[one:two]::[IP6::1234:5678:9101]:600::
    ^
This might be bogus, or one might want to catch this case.
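
If one did want to catch this case, a rough sketch might look like the following. It assumes the record is terminated by a colon and that brackets do not nest; the function name midfield_bracket_fields is hypothetical, and the field pattern is the findall pattern quoted above rewritten as a raw string.

```python
import re

# Field pattern from the thread, with a required terminating colon.
FIELD = r"((?:[^\[:]|\[[^\]]*\])*):"

# A bracketed group that contains at least one colon.
COLON_GROUP = re.compile(r"\[[^\]]*:[^\]]*\]")

def midfield_bracket_fields(record):
    """Return the fields in which a colon-bearing [...] group starts after index 0."""
    suspicious = []
    for field in re.findall(FIELD, record):
        if any(m.start() > 0 for m in COLON_GROUP.finditer(field)):
            suspicious.append(field)
    return suspicious

print(midfield_bracket_fields("foo.[one:two]::[IP6::1234:5678:9101]:600::"))
print(midfield_bracket_fields("foo.[DOM]::[IP6::4361:6368:6574]:600::"))
```

The first record is flagged because of 'foo.[one:two]'; the second is not, since '[DOM]' holds no colon and the IP6 group starts its field.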

-tkc

Neil Cerutti · Aug 28, 2013

A little parser, as Skip suggested, is a good way to go.

The brackets make your string context-sensitive, a difficult
concept to cleanly parse with a regex.

I initially hoped a csv module dialect could work, but the quote
character is (currently) hard-coded to be a single, simple
character, i.e., I can't tell it to treat [xxx] as "xxx".

What about Skip's suggestion? A little parser. It might seem
crass or something, but it really is easier than muscling a
regex into a context-sensitive grammar.

def dns_split(s):
    in_brackets = False
    b = 0 # index of beginning of current string
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ':':
                yield s[b:i]
                b = i+1
        elif c == "]":
            in_brackets = False

print(list(dns_split("foo.[DOM]::[IP6::4361:6368:6574]:600::")))

['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

It'll gag on nested brackets (fixable with a counter) and has no
error handling (requires thought), but it's a start.
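
As a sketch of what "fixable with a counter" plus a little error handling could look like: the version below tracks bracket depth instead of a boolean and raises ValueError on unbalanced brackets. The name dns_split_nested and the choice of ValueError are assumptions for illustration. Unlike the generator above, it also yields whatever follows the last colon, the way str.split(':') would.

```python
def dns_split_nested(s):
    """Split s on colons that sit at bracket depth 0.

    Raises ValueError if a ']' has no matching '[' or a '[' is never closed.
    """
    depth = 0   # how many '[' are currently open
    begin = 0   # index where the current field starts
    for i, c in enumerate(s):
        if c == "[":
            depth += 1
        elif c == "]":
            if depth == 0:
                raise ValueError("unmatched ']' at index %d" % i)
            depth -= 1
        elif c == ":" and depth == 0:
            yield s[begin:i]
            begin = i + 1
    if depth:
        raise ValueError("unclosed '[' in %r" % s)
    yield s[begin:]  # text after the last top-level colon (may be empty)

print(list(dns_split_nested("foo.[DOM]::[IP6::4361:6368:6574]:600::")))
print(list(dns_split_nested("a[b:c[:]]:d")))
```

Because it is a generator, the errors only surface once the result is consumed, for example by list().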

Neil Cerutti · Aug 28, 2013

random832 said:
I'm not sure why _your_ list only has one empty string at the end.

Tim Chase said:
I wondered that.

Good point. My little parser fails on that, too. It'll miss *all*
final fields. My parser needs "if s: yield s[b:]" at the end, to
operate like str.split, where the empty string is special.
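
Spelled out, the parser with that extra line might read as below. This is only a sketch of the fix described above, with a small loop added to compare it against str.split(':') on bracket-free input.

```python
def dns_split(s):
    in_brackets = False
    b = 0  # index of beginning of current field
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ":":
                yield s[b:i]
                b = i + 1
        elif c == "]":
            in_brackets = False
    if s:  # emit the final field unless the input was empty
        yield s[b:]

# Compare with str.split(':') on inputs that contain no brackets.
for sample in ["a:b", "a:b:", ":", ""]:
    print(repr(sample), list(dns_split(sample)), sample.split(":"))
```

The two agree on everything except the empty string: dns_split('') yields nothing, which matches ''.split() with no separator, whereas ''.split(':') returns [''].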

Peter Otten · Aug 28, 2013

Neil said:
def dns_split(s):
    in_brackets = False
    b = 0 # index of beginning of current string
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ':':
                yield s[b:i]
                b = i+1
        elif c == "]":
            in_brackets = False

I think you need one more yield outside the loop.

Something similar on top of regex:
>>> import re
>>> def split(s):
...     start = level = 0
...     for m in re.compile(r"[[:\]]").finditer(s):
...         if m.group() == "[": level += 1
...         elif m.group() == "]":
...             assert level
...             level -= 1
...         elif level == 0:
...             yield s[start:m.start()]
...             start = m.end()
...     yield s[start:]
...
>>> list(split("a[b:c:]:d"))
['a[b:c:]', 'd']
>>> list(split("a[b:c[:]]:d"))
['a[b:c[:]]', 'd']
>>> list(split(""))
['']
>>> list(split(":"))
['', '']
>>> list(split(":x"))
['', 'x']
>>> list(split("[:x]"))
['[:x]']
>>> list(split(":[:x]"))
['', '[:x]']
>>> list(split(":[:[:]:x]"))
['', '[:[:]:x]']
>>> list(split("[:::]"))
['[:::]']
>>> s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
>>> list(split(s))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

Note that there is one more empty string which I believe the OP forgot.

John Levine · Aug 28, 2013

random832 said:
Can you have brackets within brackets? If so, this is impossible to deal
with within a regex.

Nope. It's a regular language, not a CFL.

random832 said:
Otherwise:

re.findall(r'((?:[^[:]|\[[^]]*\])*):?', s)

['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

That seems to do it, thanks.
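
Since brackets never nest here, a separator pattern for re.split() also seems workable: treat a colon as a separator unless a ']' follows it before any other bracket. The sketch below is one way to write that, not something posted in the thread; the name SEPARATOR is made up, and it assumes every '[' is properly closed.

```python
import re

# A colon is a separator unless, scanning right, we reach a ']' before any
# other '[' or ']' (which would mean the colon sits inside a bracket pair).
SEPARATOR = re.compile(r":(?![^\[\]]*\])")

s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
fields = SEPARATOR.split(s)
print(fields)
```

Like str.split(':'), this keeps the empty field after the final colon, so it returns six fields for the sample record.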

wxjmfauth · Aug 29, 2013

On Wednesday, 28 August 2013 at 18:44:53 UTC+2, John Levine wrote:

[...]

Basic idea: protect -> split -> unprotect

>>> s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
>>> r = s.replace('[IP6::', '***')
>>> a = r.split('::')
>>> a
['foo.[DOM]', '***4361:6368:6574]:600', '']
>>> a[1] = a[1].replace('***', '[IP6::')
>>> a
['foo.[DOM]', '[IP6::4361:6368:6574]:600', '']
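
The session above protects only the literal '[IP6::' and splits on '::'. A more general take on the same protect, split, unprotect idea could stash every bracketed group behind a placeholder, split on single colons, and then restore the groups. The sketch below does that; the name protect_split and the NUL-delimited placeholder are assumptions for illustration, and it presumes records never contain NUL characters and brackets do not nest.

```python
import re

def protect_split(record):
    """Split record on ':' while leaving [...] groups intact."""
    groups = []

    def stash(match):
        # Remember the bracketed text and leave a numbered placeholder behind.
        groups.append(match.group(0))
        return "\x00%d\x00" % (len(groups) - 1)

    def restore(field):
        # Swap each placeholder back for the text it stands for.
        return re.sub(r"\x00(\d+)\x00", lambda m: groups[int(m.group(1))], field)

    protected = re.sub(r"\[[^\]]*\]", stash, record)   # protect
    fields = protected.split(":")                      # split
    return [restore(field) for field in fields]        # unprotect

print(protect_split("foo.[DOM]::[IP6::4361:6368:6574]:600::"))
```

For the sample record this gives the six-field result that str.split(':') semantics imply, with both bracketed groups restored.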

jmf
