re beginner

SuperHik

hi all,

I'm trying to understand regex for the first time, and it would be very
helpful to get an example. I have an old(er) script with the following
task - takes a string I copy-pasted and which always has the same format:

>>> print stuff
Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4
>>> stuff
'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

I want to put items from stuff into a dict like this:

>>> print mydict
{'Wireless cord!': 2, 'Green pants': 1, 'Blue shirt': 1, 'White socks':
4, 'Mobile phone': 4, 'Two for the show': 4, 'One for the money': 7,
'Blue bag': 4, 'Wrist watch': 7, 'Nice perfume': 3, 'Yellow hat': 2,
'Building tools': 3}

Here's how I did it:

>>> def putindict(items):
...     items = items.replace('\n', '\t')
...     items = items.split('\t')
...     d = {}
...     for x in xrange( len(items) ):
...         if not items[x].isdigit(): d[items[x]] = int(items[x+1])
...     return d
...
>>> mydict = putindict(stuff)

I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?

thanks!
 
bearophileHUGS

SuperHik said:
I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?

This is a way to do the same thing without REs:

data = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
        'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
        'Wireless cord!\t2\tBuilding tools\t3\n'
        'One for the money\t7\tTwo for the show\t4')

data2 = data.replace("\n","\t").split("\t")
result1 = dict( zip(data2[::2], map(int, data2[1::2])) )

Or if you want to be light:

from itertools import imap, izip, islice
data2 = data.replace("\n","\t").split("\t")
strings = islice(data2, 0, len(data), 2)
numbers = islice(data2, 1, len(data), 2)
result2 = dict( izip(strings, imap(int, numbers)) )

Bye,
bearophile
 
faulkner

you could write a function which takes a match object and modifies d,
pass the function to re.sub, and ignore what re.sub returns.

# untested code
d = {}
def record(match):
    s = match.string[match.start() : match.end()]
    i = s.index('\t')
    print s, i # debugging
    d[s[:i]] = int(s[i+1:])
    return ''
re.sub('\w+\t\d+\t', record, stuff)
# end code

it may be a bit faster, but it's very roundabout and difficult to
debug.
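
For reference, here is a sketch of the same callback idea that does run. It assumes Python 3 and swaps in a different pattern with two capturing groups, because `\w+` does not match keys containing spaces or "!" and the trailing `\t` is missing at line ends. The names `record` and `found` are only illustrative.

```python
import re

stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')

found = {}

def record(match):
    # group(1) is the key (anything except tab/newline), group(2) the digits
    found[match.group(1)] = int(match.group(2))
    return ''

# The substituted string is thrown away; only the side effect on `found` matters.
re.sub(r'([^\t\n]+)\t(\d+)', record, stuff)
```

Filling a dict as a side effect of re.sub works, but it stays as roundabout as the post says.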
 
bearophileHUGS

strings = islice(data2, 0, len(data), 2)
numbers = islice(data2, 1, len(data), 2)

This probably has to be:

strings = islice(data2, 0, len(data2), 2)
numbers = islice(data2, 1, len(data2), 2)

Sorry,
bearophile
 
John Machin

SuperHik wrote:
hi all,

I'm trying to understand regex for the first time, and it would be
very helpful to get an example. I have an old(er) script with the
following task - takes a string I copy-pasted and which always has the
same format:
print stuff
Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4
'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

I want to put items from stuff into a dict like this:
print mydict
{'Wireless cord!': 2, 'Green pants': 1, 'Blue shirt': 1, 'White
socks': 4, 'Mobile phone': 4, 'Two for the show': 4, 'One for the
money': 7, 'Blue bag': 4, 'Wrist watch': 7, 'Nice perfume': 3, 'Yellow
hat': 2, 'Building tools': 3}

Here's how I did it:
>>> def putindict(items):
...     items = items.replace('\n', '\t')
...     items = items.split('\t')
...     d = {}
...     for x in xrange( len(items) ):
...         if not items[x].isdigit(): d[items[x]] = int(items[x+1])
...     return d
...
>>> mydict = putindict(stuff)


I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?

There are better ways. One of them avoids the for loop, and even the re
module:

def to_dict(items):
    items = items.replace('\t', '\n').split('\n')

In case there are leading/trailing spaces on the keys:

    items = [x.strip() for x in items.replace('\t', '\n').split('\n')]
    return dict(zip(items[::2], map(int, items[1::2])))

HTH

Fantastic -- at least for the OP's carefully copied-and-pasted input.
Meanwhile back in the real world, there might be problems with multiple
tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.
In that case a loop approach that validated as it went and was able to
report the position and contents of any invalid input might be better.
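
A rough sketch of what such a validating loop might look like, assuming Python 3. The function name `parse_items` and the shape of the error tuples are invented for illustration; nothing in the thread prescribes them.

```python
def parse_items(text):
    """Return (items, errors) for tab-separated 'name<TAB>qty' pairs."""
    items, errors = {}, []
    for lineno, line in enumerate(text.split('\n'), 1):
        # Drop empty fields so runs of tabs used for alignment are tolerated.
        fields = [f.strip() for f in line.split('\t') if f.strip()]
        if len(fields) % 2:
            errors.append((lineno, 'odd number of fields', line))
            continue
        for name, qty in zip(fields[::2], fields[1::2]):
            try:
                items[name] = int(qty)
            except ValueError:
                # Record where the bad value was and what it looked like.
                errors.append((lineno, 'non-integer quantity', qty))
    return items, errors

items, errors = parse_items('Yellow hat\t\t2\tBlue shirt\t1\nWhite socks\tfour')
```

Here the doubled tab on the first line is accepted, while "four" on the second line ends up in `errors` with its line number instead of raising.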
 
Paul McGuire

John Machin said:
Fantastic -- at least for the OP's carefully copied-and-pasted input.
Meanwhile back in the real world, there might be problems with multiple
tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.
In that case a loop approach that validated as it went and was able to
report the position and contents of any invalid input might be better.

Yeah, for that you'd need more like a real parser... hey, wait a minute!
What about pyparsing?!

Here's a pyparsing version. The definition of the parsing patterns takes
little more than the re definition does - the bulk of the rest of the code
is parsing/scanning the input and reporting the results.

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul


stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')
print "Original input string:"
print stuff
print

from pyparsing import *

# define low-level elements for parsing
itemWord = Word(alphas, alphanums+".!?")
itemDesc = OneOrMore(itemWord)
integer = Word(nums)

# add parse action to itemDesc to merge separate words into single string
itemDesc.setParseAction( lambda s,l,t: " ".join(t) )

# define macro element for an entry
entry = itemDesc.setResultsName("item") + integer.setResultsName("qty")

# scan through input string for entry's, print out their named fields
print "Results when scanning for entries:"
for t,s,e in entry.scanString(stuff):
    print t.item,t.qty
print

# parse entire string, building ParseResults with dict-like access
results = dictOf( itemDesc, integer ).parseString(stuff)
print "Results when parsing entries as a dict:"
print "Keys:", results.keys()
for item in results.items():
    print item
for k in results.keys():
    print k,"=", results[k]


prints:

Original input string:
Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4

Results when scanning for entries:
Yellow hat 2
Blue shirt 1
White socks 4
Green pants 1
Blue bag 4
Nice perfume 3
Wrist watch 7
Mobile phone 4
Wireless cord! 2
Building tools 3
One for the money 7
Two for the show 4

Results when parsing entries as a dict:
Keys: ['Wireless cord!', 'Green pants', 'Blue shirt', 'White socks', 'Mobile
phone', 'Two for the show', 'One for the money', 'Blue bag', 'Wrist watch',
'Nice perfume', 'Yellow hat', 'Building tools']
('Wireless cord!', '2')
('Green pants', '1')
('Blue shirt', '1')
('White socks', '4')
('Mobile phone', '4')
('Two for the show', '4')
('One for the money', '7')
('Blue bag', '4')
('Wrist watch', '7')
('Nice perfume', '3')
('Yellow hat', '2')
('Building tools', '3')
Wireless cord! = 2
Green pants = 1
Blue shirt = 1
White socks = 4
Mobile phone = 4
Two for the show = 4
One for the money = 7
Blue bag = 4
Wrist watch = 7
Nice perfume = 3
Yellow hat = 2
Building tools = 3
 
Bruno Desthuilliers

SuperHik wrote:
I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?

There are better ways. One of them avoids the for loop, and even the re
module:

def to_dict(items):
    items = items.replace('\t', '\n').split('\n')
    return dict(zip(items[::2], map(int, items[1::2])))

HTH
 
Bruno Desthuilliers

(e-mail address removed) wrote:
This probably has to be:

strings = islice(data2, 0, len(data2), 2)
numbers = islice(data2, 1, len(data2), 2)

try with islice(data2, 0, None, 2)
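
A small sketch of that suggestion, assuming Python 3, where the built-in zip and map are already lazy, so imap and izip are not needed. The short `data2` list is a stand-in for the split input.

```python
from itertools import islice

data2 = ['Yellow hat', '2', 'Blue shirt', '1']

# A stop of None means "run to the end", so no len() call is needed.
strings = islice(data2, 0, None, 2)   # every second field, starting at index 0
numbers = islice(data2, 1, None, 2)   # every second field, starting at index 1
result = dict(zip(strings, map(int, numbers)))
```

With None as the stop value there is no length argument left to get wrong.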
 
John Machin

Yeah, for that you'd need more like a real parser... hey, wait a minute!
What about pyparsing?!

Here's a pyparsing version. The definition of the parsing patterns takes
little more than the re definition does - the bulk of the rest of the code
is parsing/scanning the input and reporting the results.

[big snip]

I didn't see any evidence of error handling in there anywhere.
 
Paul McGuire

John Machin said:
Yeah, for that you'd need more like a real parser... hey, wait a minute!
What about pyparsing?!

Here's a pyparsing version. The definition of the parsing patterns takes
little more than the re definition does - the bulk of the rest of the code
is parsing/scanning the input and reporting the results.

[big snip]

I didn't see any evidence of error handling in there anywhere.

Pyparsing has a certain amount of error reporting built in, raising a
ParseException when a mismatch occurs.

This particular "grammar" is actually pretty error-tolerant. To force an
error, I replaced "One for the money" with "1 for the money", and here is
the exception reported by pyparsing, along with a diagnostic method,
markInputline:


stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')
badstuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
            'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
            'Wireless cord!\t2\tBuilding tools\t3\n'
            '1 for the money\t7\tTwo for the show\t4')
pattern = dictOf( itemDesc, integer ) + stringEnd
print pattern.parseString(stuff)
print
try:
    print pattern.parseString(badstuff)
except ParseException, pe:
    print pe
    print pe.markInputline()

Gives:
[['Yellow hat', '2'], ['Blue shirt', '1'], ['White socks', '4'], ['Green
pants', '1'], ['Blue bag', '4'], ['Nice perfume', '3'], ['Wrist watch',
'7'], ['Mobile phone', '4'], ['Wireless cord!', '2'], ['Building tools',
'3'], ['One for the money', '7'], ['Two for the show', '4']]

Expected stringEnd (at char 210), (line:6, col:1)
!<1 for the money 7 Two for the show 4

-- Paul
 
Fredrik Lundh

John said:
Fantastic -- at least for the OP's carefully copied-and-pasted input.
Meanwhile back in the real world, there might be problems with multiple
tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.

yeah, that's probably why the OP stated "which always has the same format".

and the "trying to understand regex for the first time, and it would be
very helpful to get an example" part was obviously mostly irrelevant to
the "smarter than thou" crowd; only one thread contributor was silly
enough to actually provide an RE-based example.

</F>
 
Fredrik Lundh

SuperHik said:
I'm trying to understand regex for the first time, and it would be very
helpful to get an example. I have an old(er) script with the following
task - takes a string I copy-pasted and wich always has the same format:

Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4

'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

the first thing you need to do is to figure out exactly what the syntax
is. given your example, the format of the items you are looking for
seems to be "some text" followed by a tab character followed by an integer.

an initial attempt would be "\w+\t\d+" (one or more word characters,
followed by a tab, followed by one or more digits). to try this out,
you can do:

>>> import re
>>> re.findall('\w+\t\d+', stuff)
['hat\t2', 'shirt\t1', 'socks\t4', ...]

as you can see, using \w+ isn't good enough here; the "keys" in this
case may contain whitespace as well, and findall simply skips stuff that
doesn't match the pattern. if we assume that a key consists of words
and spaces, we can replace the single \w with [\w ] (either word
character or space), and get
>>> re.findall('[\w ]+\t\d+', stuff)
['Yellow hat\t2', 'Blue shirt\t1', 'White socks\t4', ...]

which looks a bit better. however, if you check the output carefully,
you'll notice that the "Wireless cord!" entry is missing: the "!" isn't
a letter or a digit. the easiest way to fix this is to look for
"non-tab characters" instead, using "[^\t]" (this matches anything
except a tab):
>>> len(re.findall('[\w ]+\t\d+', stuff))
11
>>> len(re.findall('[^\t]+\t\d+', stuff))
12

now, to turn this into a dictionary, you could split the returned
strings on a tab character (\t), but RE provides a better mechanism:
capturing groups. by adding () to the pattern string, you can mark the
sections you want returned:
>>> re.findall('([^\t]+)\t(\d+)', stuff)
[('Yellow hat', '2'), ('Blue shirt', '1'), ('White socks', ...]

turning this into a dictionary is trivial:
>>> dict(re.findall('([^\t]+)\t(\d+)', stuff))
{'Green pants': '1', 'Blue shirt': '1', 'White socks': ...}
>>> len(dict(re.findall('([^\t]+)\t(\d+)', stuff)))
12

or, in function terms:

def putindict(items):
    return dict(re.findall('([^\t]+)\t(\d+)', items))
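
A Python 3 take on the function above, as a sketch rather than a drop-in. It uses a raw string for the pattern and converts the quantities with int(), since the original script wanted integer values. It also excludes \n as well as \t from the key, because [^\t] on its own also matches the newline in front of the first key on each later line. The name `PAIR` is made up for this example.

```python
import re

stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')

# key: one or more characters that are neither tab nor newline; value: digits
PAIR = re.compile(r'([^\t\n]+)\t(\d+)')

def putindict(items):
    # findall returns (key, digits) tuples because the pattern has two groups
    return {name: int(qty) for name, qty in PAIR.findall(items)}

mydict = putindict(stuff)
```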

hope this helps!

</F>
 
John Machin

yeah, that's probably why the OP stated "which always has the same format".

Such statements by users are in the same category as "The cheque is
in the mail" and "Of course I'll still love you in the morning".
 
Bruno Desthuilliers

John Machin wrote:
(snip)

> In case there are leading/trailing spaces on the keys:

There aren't. Test passes.

(snip)
> Fantastic -- at least for the OP's carefully copied-and-pasted input.

That was the spec, and my code passes the test.

> Meanwhile back in the real world,

The "real world" is mostly defined by the customer's test set (is that the
correct translation for "jeu d'essai"?). Code passes the test. Period.

> there might be problems with multiple
> tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.

Which means that the spec and the customer's test set are wrong. Not my
responsibility. Anyway, I refuse to change anything in the parsing
algorithm before having another test set.

> In that case a loop approach that validated as it went and was able to
> report the position and contents of any invalid input might be better.

One doesn't know what *will* be better without actual facts. You can be
right (and, from my experience, you probably are !-), *but* you can be
wrong as well. Until you have a correct spec and test data set on which
the code fails, writing any other code is a waste of time. Better to
work on other parts of the system, and come back to this if and when the
need arises.

<ot>
Kind of reminds me of a former employer that paid me 2 full months to
work on a very hairy data migration script (the original data set was so
f... up and incoherent even a human parser could barely make any sense of
it), before discovering that none of the users of the old system were
interested in migrating that part of the data. Talk about a waste of
time and money...
</ot>

Now FWIW, there's actually something else bugging me with this code: it
loads the whole data set in memory. It's ok for a few lines, but
obviously wrong if one is to parse huge files. *That* would be the first
thing I would change - it takes a couple of minutes to do, so no real
waste of time, but it obviously implies rethinking the API, which is
better done now than once client code has been written.
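
One way such a streaming API might look, sketched in Python 3. The name `iter_items` is hypothetical, and the sketch assumes each line holds complete name/quantity pairs, as in the sample data.

```python
import io

def iter_items(lines):
    """Yield (name, qty) pairs from an iterable of tab-separated lines."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # Fields alternate: name, quantity, name, quantity, ...
        for name, qty in zip(fields[::2], fields[1::2]):
            yield name, int(qty)

# Any iterable of lines works: an open file, or StringIO for a quick check.
sample = io.StringIO('Yellow hat\t2\tBlue shirt\t1\n'
                     'White socks\t4\tGreen pants\t1\n')
mydict = dict(iter_items(sample))
```

Because the function takes any iterable of lines, client code can pass an open file and only one line is held in memory at a time.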

My 2 cents....
 
John Machin

> John Machin wrote:
>
> There aren't. Test passes.
>
> (snip)
>
> That was the spec, and my code passes the test.
>
> The "real world" is mostly defined by the customer's test set (is that the
> correct translation for "jeu d'essai"?). Code passes the test. Period.

"Jeu d'essai" could be construed as "toss a coin" -- yup, that fits some
user test sets I've seen.

In the real world, you are lucky to get a test set that covers all the
user-expected "good" cases. They have to be driven with whips to think
about the "bad" cases. Never come across a problem caused by "FOO " !=
"FOO"? You *have* led a charmed life, so far.

> Which means that the spec and the customer's test set are wrong. Not my
> responsibility.

That's what you think. The users, the pointy-haired boss, and the evil
HR director may have other ideas :)

> Anyway, I refuse to change anything in the parsing
> algorithm before having another test set.
>
> One doesn't know what *will* be better without actual facts. You can be
> right (and, from my experience, you probably are !-), *but* you can be
> wrong as well. Until you have a correct spec and test data set on which
> the code fails, writing any other code is a waste of time. Better to
> work on other parts of the system, and come back to this if and when the
> need arises.

Unfortunately one is likely to be told in a Sunday 03:00 phone call that
the "test data set on which the code fails" is somewhere in the
production database :-(

Cheers,
John
 
Bruno Desthuilliers

Fredrik Lundh wrote:
> yeah, that's probably why the OP stated "which always has the same format".

Lol.

> and the "trying to understand regex for the first time, and it would be
> very helpful to get an example" part

Yeps, I missed that part when answering yesterday. My bad.
 
SuperHik

WOW!
Thanks for all the answers, even those not related to regular
expressions taught me some stuff I wasn't aware of.
I appreciate it very much.