OT: novice regular expression question

It's me

I have never been very good with regular expressions. My head always hurts
whenever I need to use them.

I need to read a data file and parse each data record. Each item in the
data record begins with either a string or a list of strings. I searched
around and didn't see any existing Python package that does that.
scanf.py, for instance, can do standard items but doesn't know about lists.
So I figure I might have to write a lex engine for it, and of course I have
to deal with REs again.

But I ran into a problem right from the start. To recognize a list, I need an
RE for a string that:

1) begins with [" (a left bracket followed by a double quote, with zero or more
spaces in between)
2) is followed by any characters up to ], but only if that right bracket is not
preceded by the escape character \.

So, I tried:

^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]

and tested with:

["This line\] works"]

but it fails with:

["This line fails"]

I would have thought that:

(\\\])*

should work because it matches zero or more occurrences of the pattern \]
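
For what it's worth, a quick check with Python's re module suggests that the group does behave as "zero or more" and that the full pattern accepts both test lines. This is only a sketch: the use of Python's re is an assumption, and a lex tool may treat the same pattern differently.

```python
import re

# The pattern exactly as posted, written as a raw string for Python's re.
pattern = re.compile(r'^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]')

with_escape = r'["This line\] works"]'
without_escape = '["This line fails"]'

for line in (with_escape, without_escape):
    # bool(...) is True when the pattern matches from the start of the line.
    print(line, bool(pattern.match(line)))
```

If both lines print True, the failure may come from somewhere other than the (\\\])* group, for example from how the pattern is quoted before it reaches the regex engine.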

Any help is greatly appreciated.

Sorry for being OT. I posted this question to the lex group and didn't get
any response. I figured maybe somebody around here would know.
 
Steve Holden

It's me said:
> I have never been very good with regular expressions. My head always hurts
> whenever I need to use them.
Well, they are a pain to more than just you, and the conventional advice
is "even when you are convinced you need to use REs, try and find
another way".
> I need to read a data file and parse each data record. Each item in the
> data record begins with either a string or a list of strings. I searched
> around and didn't see any existing Python package that does that.
> scanf.py, for instance, can do standard items but doesn't know about lists.
> So I figure I might have to write a lex engine for it, and of course I have
> to deal with REs again.
Well, you haven't yet convinced me that you *have* to. Personally, I
think you just like trouble :)
> But I ran into a problem right from the start. To recognize a list, I need an
> RE for a string that:
>
> 1) begins with [" (a left bracket followed by a double quote, with zero or more
> spaces in between)
> 2) is followed by any characters up to ], but only if that right bracket is not
> preceded by the escape character \.
So the pattern is

1. If the line begins with a "[" it should end with a "]"

2. Otherwise, it shouldn't?

I'm trying to gently point out that the syntax you want to accept isn't
actually very clear. If the format is "Python strings and lists of
strings" then you might want to use the Python lexer to parse them, but
that's quite an advanced topic. [too advanced for me :-]

The problem is matching "up to a right bracket not preceded by a
backslash". This seems to require what's technically referred to as a
"negative lookbehind assertion" - in other words, a pattern that doesn't
match anything, but checks that a specific condition is false or fails.
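
As a minimal sketch of what such an assertion looks like in Python's re module (the variable names are purely illustrative): (?<!\\) succeeds only at a position that is not immediately preceded by a backslash, and it consumes no characters itself.

```python
import re

# (?<!\\) is a negative lookbehind: it matches no text, and succeeds only
# when the character just before the current position is not a backslash.
unescaped_close = re.compile(r'(?<!\\)\]')

line = r'["This line\] works"]'
m = unescaped_close.search(line)

# The escaped bracket in the middle is skipped; the match is the final ].
print(m.start(), len(line) - 1)   # 20 20
```

One caveat: this simple form also rejects a bracket that follows an escaped backslash (\\]), which may or may not matter for your data.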
> So, I tried:
>
> ^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]
>
> and tested with:
>
> ["This line\] works"]
>
> but it fails with:
>
> ["This line fails"]
>
> I would have thought that:
>
> (\\\])*
>
> should work because it matches zero or more occurrences of the pattern \]
>
> Any help is greatly appreciated.
>
> Sorry for being OT. I posted this question to the lex group and didn't get
> any response. I figured maybe somebody around here would know.

I'd start with baby steps. First of all, make sure that you can match
the individual strings. Then use that pattern, parenthesized to turn it
into a group, as a component in a more complex pattern.

Do you want to treat "this is also \" a string" as an allowable string?
In that case you need a pattern that matches "up to the first quotation
mark not preceded by a backslash" as well!

Let's try matching a single string first:
('"s1"',)
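
A pattern along the following lines reproduces that result. Treat it as a sketch under assumptions: the name string_re and the test input are illustrative, not necessarily what was typed in the original session.

```python
import re

# Assumed pattern: an opening quote, as few characters as possible (.*?),
# then a closing quote that is not preceded by a backslash.
string_re = re.compile(r'(".*?(?<!\\)")')

# match() anchors at the start of the input; only the first string is taken.
print(string_re.match('"s1", "s2"').groups())   # ('"s1"',)
```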

Note that I followed the "*" with a "?" to stop it being greedy, and
matching as many characters as it could. OK, does that work when we have
escaped quotation marks?
('"s1\\"\\""',)
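
To see both points at once, here is a sketch comparing the non-greedy pattern with a greedy variant on input that contains escaped quotes. The names lazy and greedy are illustrative assumptions.

```python
import re

lazy = re.compile(r'(".*?(?<!\\)")')    # stops at the first unescaped quote
greedy = re.compile(r'(".*(?<!\\)")')   # runs on to the last unescaped quote

text = r'"s1\"\"", "s2"'

# The lookbehind rejects both \" pairs, so the string ends at the third quote.
print(lazy.match(text).groups())     # ('"s1\\"\\""',)

# Without the "?", the match swallows the separator and the second string.
print(greedy.match(text).groups())   # ('"s1\\"\\"", "s2"',)
```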

Apparently so. The negative lookbehind assertion stops a quote from
matching when it's preceded by a backslash. Can we match a
comma-separated list of such strings?

This is a bit trickier: here the second grouping beginning with "(?:" is
intended to ensure that only the strings that get matched are included
in the groups, not the separators, even though they must be grouped
together. The list *must* be separated by ", ", but you could alter the
pattern to allow zero or more whitespace characters.
('"s1\\"\\""', '"s2"')
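
A sketch of such a list pattern, assuming the same string pattern as a building block (STRING and list_re are illustrative names):

```python
import re

# Assumed building block: a double-quoted string whose closing quote is
# not preceded by a backslash.
STRING = r'".*?(?<!\\)"'

# The first string is captured; each further ", string" sits inside a
# non-capturing (?: ... ) group, so the separator never becomes a group.
list_re = re.compile('(' + STRING + ')(?:, (' + STRING + '))*')

print(list_re.match(r'"s1\"\"", "s2"').groups())
# ('"s1\\"\\""', '"s2"')

# A repeated capturing group only remembers its last repetition, so
# findall is one way to collect every string in a longer list.
print(list_re.match('"a", "b", "c"').groups())   # ('"a"', '"c"')
print(re.findall(STRING, '"a", "b", "c"'))       # ['"a"', '"b"', '"c"']
```

Changing ", " to r",\s*" in list_re would allow zero or more whitespace characters after the comma.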

Well, that seems to work. Note that these patterns all ignore bracket
characters, so all you need to do now is to surround them with patterns
to match the opening and closing brackets, and you're done (I hope).
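
One way that last step might look, as a sketch only. parse_list is a hypothetical helper, and allowing optional whitespace around the brackets and after commas is an assumption about the data format.

```python
import re

STRING = r'".*?(?<!\\)"'

# Opening bracket, one or more comma-separated strings, closing bracket.
LIST = re.compile(r'\[\s*(' + STRING + r'(?:,\s*' + STRING + r')*)\s*\]')

def parse_list(text):
    """Return the quoted strings of a bracketed list, or None if no match."""
    m = LIST.match(text)
    if m is None:
        return None
    # The body is already validated; findall pulls out each string in turn.
    return re.findall(STRING, m.group(1))

print(parse_list(r'["s1\"\"", "s2"]'))        # ['"s1\\"\\""', '"s2"']
print(parse_list('["This line fails"]'))      # ['"This line fails"']
print(parse_list(r'["This line\] works"]'))   # ['"This line\\] works"']
```

Because the strings are quoted, a ] inside them needs no special treatment here. Be aware that the lazy ".*?" can still backtrack past an unescaped quote when that is the only way to reach the closing bracket, so some malformed input will be accepted.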

Anyway, it'll give you a few ideas to work with.

regards
Steve
 
RyanMorillo

Check jgsoft dot com; they have two things which may help: EditPad Pro
(the test version has a good tutorial) or PowerGREP (if you do a lot
of regexes). There is also the Mastering Regular Expressions book from
O'Reilly (if you do a lot of regex work).

Also, the Perl group would be good for regexes (Python's are Perl 5
compatible).
 
It's me

I'll chew on this. Thanks, got to go.


 
M.E.Farmer

Hello me,
Have you tried shlex.py? It is a tokenizer for writing lexical
parsers.
It should be a breeze to whip something up with it.
An example of tokenizing:
py>import shlex
py># fake an open record
py>import cStringIO
py>myfakeRecord = cStringIO.StringIO()
py>myfakeRecord.write("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n")
py>myfakeRecord.seek(0)
py>lexer = shlex.shlex(myfakeRecord)

py>lexer.get_token()
'['
py>lexer.get_token()
'1'
py>lexer.get_token()
','
py>lexer.get_token()
'2'
py>lexer.get_token()
']'
py>lexer.get_token()
'fdfdfdfd'

You can do a lot with it; that is just a teaser.
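
On Python 3, cStringIO no longer exists and io.StringIO plays the same role. As a rough sketch of how the tokens might then be grouped into strings and lists of strings: parse_items is a hypothetical helper, and posix=True is a choice made here so that shlex strips the quote characters (in the default non-POSIX mode they stay attached to each token).

```python
import io
import shlex

def parse_items(text):
    """Group shlex tokens into strings and lists of strings.

    Hypothetical helper; assumes well-formed, non-nested lists.
    """
    lexer = shlex.shlex(io.StringIO(text), posix=True)
    items = []
    current = None              # list being built, or None outside brackets
    for token in lexer:
        if token == '[':
            current = []
        elif token == ']':
            items.append(current)
            current = None
        elif token == ',':
            continue            # separators carry no data
        elif current is not None:
            current.append(token)
        else:
            items.append(token)
    return items

record = "['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n"
print(parse_items(record))
# [['1', '2'], 'fdfdfdfd', 'dfdfdfdfd', ['1', '2']]
```

Note that in POSIX mode a quoted '[' and a bare [ yield the same token, so this sketch cannot tell them apart.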
M.E.Farmer
 
It's me

shlex.py needs quite a number of .py files. I tried to hunt down a few
of them and got really tired.

Is there one batch of .py files that I can download from somewhere?

Thanks,
 
M.E.Farmer

It's me said:
> shlex.py needs quite a number of .py files. I tried to hunt down a few
> of them and got really tired.
>
> Is there one batch of .py files that I can download from somewhere?
>
> Thanks,
Not sure what you mean by this.
shlex is a standard library module.
It imports only os and sys, which are also standard library modules.
If you have Python, you have them already.
If you mean cStringIO, it is in the standard library too (at least on my
system).
You don't have to use it; just feed shlex an open file.
py>lexer = shlex.shlex(open('myrecord.txt', 'r'))
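
A slightly fuller sketch of the same idea, with the file closed automatically by a with block. The temporary file and its contents are stand-ins for a real myrecord.txt; this uses shlex's default (non-POSIX) mode, in which the quote characters remain part of each token.

```python
import os
import shlex
import tempfile

# Write a throwaway record file (made-up contents) so the sketch runs anywhere.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("['1','2'] 'fdfdfdfd'\n")
    path = tmp.name

# A shlex object is iterable; iteration stops at end of file.
with open(path, 'r') as record_file:
    tokens = list(shlex.shlex(record_file))

os.remove(path)
print(tokens)
# ['[', "'1'", ',', "'2'", ']', "'fdfdfdfd'"]
```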

Hth,
M.E.Farmer
 
It's me

Oops!

Sorry, didn't realize that.

Thanks,

 
