re beginner

SuperHik · Jun 4, 2006

hi all,

I'm trying to understand regex for the first time, and it would be very
helpful to get an example. I have an old(er) script with the following
task - takes a string I copy-pasted and which always has the same format:

>>> print stuff
Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4
>>> stuff
'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

I want to put items from stuff into a dict like this:

>>> print mydict
{'Wireless cord!': 2, 'Green pants': 1, 'Blue shirt': 1, 'White socks':
4, 'Mobile phone': 4, 'Two for the show': 4, 'One for the money': 7,
'Blue bag': 4, 'Wrist watch': 7, 'Nice perfume': 3, 'Yellow hat': 2,
'Building tools': 3}

Here's how I did it:

>>> def putindict(items):
...     items = items.replace('\n', '\t')
...     items = items.split('\t')
...     d = {}
...     for x in xrange( len(items) ):
...         if not items[x].isdigit(): d[items[x]] = int(items[x+1])
...     return d
...
>>> mydict = putindict(stuff)

I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?

thanks!

bearophileHUGS · Jun 4, 2006

SuperHik said:
I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?

This is a way to do the same thing without REs:

data = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
        'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
        'Wireless cord!\t2\tBuilding tools\t3\n'
        'One for the money\t7\tTwo for the show\t4')

data2 = data.replace("\n","\t").split("\t")
result1 = dict( zip(data2[::2], map(int, data2[1::2])) )

Or if you want to be light:

from itertools import imap, izip, islice
data2 = data.replace("\n","\t").split("\t")
strings = islice(data2, 0, len(data), 2)
numbers = islice(data2, 1, len(data), 2)
result2 = dict( izip(strings, imap(int, numbers)) )
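For readers on Python 3, here is a minimal sketch of the same two approaches. It assumes the same tab/newline-separated input; the name `fields` is only illustrative. In Python 3 `zip` and `map` are already lazy, and `itertools.imap`/`izip` no longer exist, so only `islice` needs importing.

```python
from itertools import islice

data = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
        'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
        'Wireless cord!\t2\tBuilding tools\t3\n'
        'One for the money\t7\tTwo for the show\t4')

# Flatten the rows into one list: [name, qty, name, qty, ...]
fields = data.replace("\n", "\t").split("\t")

# Eager version: each slice copies every other element of the list.
result1 = dict(zip(fields[::2], map(int, fields[1::2])))

# Lazy version: islice steps through the list without copying it.
names = islice(fields, 0, len(fields), 2)
numbers = islice(fields, 1, len(fields), 2)
result2 = dict(zip(names, map(int, numbers)))
```

Both versions build the same dictionary; the lazy one only matters when the field list is large.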

Bye,
bearophile

faulkner · Jun 4, 2006

you could write a function which takes a match object and modifies d,
pass the function to re.sub, and ignore what re.sub returns.

# untested code
import re

d = {}
def record(match):
    s = match.string[match.start() : match.end()]
    i = s.index('\t')
    print s, i # debugging
    d[s[:i]] = int(s[i+1:])
    return ''
# [^\t\n]+ rather than \w+, so that keys may contain spaces and '!'
re.sub('[^\t\n]+\t\d+', record, stuff)
# end code

it may be a bit faster, but it's very roundabout and difficult to
debug.
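A minimal Python 3 sketch of the callback idea, for comparison. The pattern with two capturing groups is an assumption about the input format (a name without tabs or newlines, a tab, then digits), not something tested against other data.

```python
import re

stuff = 'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1'

d = {}

def record(match):
    # group(1) is the item name, group(2) the digits after the tab
    d[match.group(1)] = int(match.group(2))
    return ''  # replacement text; the string re.sub builds is discarded

# re.sub calls record() once per match, which fills d as a side effect
re.sub(r'([^\t\n]+)\t(\d+)', record, stuff)
```

Using capturing groups removes the need to search for the tab again inside the callback.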

bearophileHUGS · Jun 4, 2006

strings = islice(data2, 0, len(data), 2)

numbers = islice(data2, 1, len(data), 2)

This probably has to be:

strings = islice(data2, 0, len(data2), 2)
numbers = islice(data2, 1, len(data2), 2)

Sorry,
bearophile

John Machin · Jun 5, 2006

SuperHik wrote:

hi all,

I'm trying to understand regex for the first time, and it would be
very helpful to get an example. I have an old(er) script with the
following task - takes a string I copy-pasted and which always has the
same format:

>>> print stuff
Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4
>>> stuff
'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

I want to put items from stuff into a dict like this:

>>> print mydict
{'Wireless cord!': 2, 'Green pants': 1, 'Blue shirt': 1, 'White
socks': 4, 'Mobile phone': 4, 'Two for the show': 4, 'One for the
money': 7, 'Blue bag': 4, 'Wrist watch': 7, 'Nice perfume': 3, 'Yellow
hat': 2, 'Building tools': 3}

Here's how I did it:

>>> def putindict(items):
...     items = items.replace('\n', '\t')
...     items = items.split('\t')
...     d = {}
...     for x in xrange( len(items) ):
...         if not items[x].isdigit(): d[items[x]] = int(items[x+1])
...     return d
...
>>> mydict = putindict(stuff)

I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?


There are better ways. One of them avoids the for loop, and even the re
module:

def to_dict(items):
    items = items.replace('\t', '\n').split('\n')

In case there are leading/trailing spaces on the keys:

    items = [x.strip() for x in items.replace('\t', '\n').split('\n')]

    return dict(zip(items[::2], map(int, items[1::2])))

HTH

Fantastic -- at least for the OP's carefully copied-and-pasted input.
Meanwhile back in the real world, there might be problems with multiple
tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.
In that case a loop approach that validated as it went and was able to
report the position and contents of any invalid input might be better.
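
A minimal Python 3 sketch of such a validating loop. The function name `parse_items` and its rules are assumptions made for illustration: runs of tabs are collapsed, keys are stripped, and anything that is not an integer raises an error naming the line and field.

```python
def parse_items(text):
    """Return {name: qty}; raise ValueError naming the offending field."""
    result = {}
    for lineno, line in enumerate(text.split('\n'), 1):
        # Repeated tabs produce empty fields; drop them and strip the rest.
        fields = [f.strip() for f in line.split('\t') if f.strip()]
        if len(fields) % 2:
            raise ValueError('line %d: odd number of fields: %r'
                             % (lineno, fields))
        for pos in range(0, len(fields), 2):
            name, qty = fields[pos], fields[pos + 1]
            try:
                result[name] = int(qty)
            except ValueError:
                raise ValueError('line %d, field %d: expected an integer, got %r'
                                 % (lineno, pos + 2, qty)) from None
    return result
```

For example, `parse_items('Yellow hat\t\t2\tBlue shirt\t1')` tolerates the doubled tab, while a quantity such as `'x'` is reported with its line and field number instead of being silently skipped.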

Paul McGuire · Jun 5, 2006

John Machin said:
Fantastic -- at least for the OP's carefully copied-and-pasted input.
Meanwhile back in the real world, there might be problems with multiple
tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.
In that case a loop approach that validated as it went and was able to
report the position and contents of any invalid input might be better.

Yeah, for that you'd need more like a real parser... hey, wait a minute!
What about pyparsing?!

Here's a pyparsing version. The definition of the parsing patterns takes
little more than the re definition does - the bulk of the rest of the code
is parsing/scanning the input and reporting the results.

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul

stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')
print "Original input string:"
print stuff
print

from pyparsing import *

# define low-level elements for parsing
itemWord = Word(alphas, alphanums+".!?")
itemDesc = OneOrMore(itemWord)
integer = Word(nums)

# add parse action to itemDesc to merge separate words into single string
itemDesc.setParseAction( lambda s,l,t: " ".join(t) )

# define macro element for an entry
entry = itemDesc.setResultsName("item") + integer.setResultsName("qty")

# scan through input string for entries, print out their named fields
print "Results when scanning for entries:"
for t,s,e in entry.scanString(stuff):
    print t.item,t.qty
print

# parse entire string, building ParseResults with dict-like access
results = dictOf( itemDesc, integer ).parseString(stuff)
print "Results when parsing entries as a dict:"
print "Keys:", results.keys()
for item in results.items():
    print item
for k in results.keys():
    print k,"=", results[k]

prints:

Original input string:
Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4

Results when scanning for entries:
Yellow hat 2
Blue shirt 1
White socks 4
Green pants 1
Blue bag 4
Nice perfume 3
Wrist watch 7
Mobile phone 4
Wireless cord! 2
Building tools 3
One for the money 7
Two for the show 4

Results when parsing entries as a dict:
Keys: ['Wireless cord!', 'Green pants', 'Blue shirt', 'White socks', 'Mobile
phone', 'Two for the show', 'One for the money', 'Blue bag', 'Wrist watch',
'Nice perfume', 'Yellow hat', 'Building tools']
('Wireless cord!', '2')
('Green pants', '1')
('Blue shirt', '1')
('White socks', '4')
('Mobile phone', '4')
('Two for the show', '4')
('One for the money', '7')
('Blue bag', '4')
('Wrist watch', '7')
('Nice perfume', '3')
('Yellow hat', '2')
('Building tools', '3')
Wireless cord! = 2
Green pants = 1
Blue shirt = 1
White socks = 4
Mobile phone = 4
Two for the show = 4
One for the money = 7
Blue bag = 4
Wrist watch = 7
Nice perfume = 3
Yellow hat = 2
Building tools = 3

Bruno Desthuilliers · Jun 5, 2006

SuperHik wrote:

hi all,

I'm trying to understand regex for the first time, and it would be very
helpful to get an example. I have an old(er) script with the following
task - takes a string I copy-pasted and which always has the same format:

>>> print stuff
Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4
>>> stuff
'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

I want to put items from stuff into a dict like this:

>>> print mydict
{'Wireless cord!': 2, 'Green pants': 1, 'Blue shirt': 1, 'White socks':
4, 'Mobile phone': 4, 'Two for the show': 4, 'One for the money': 7,
'Blue bag': 4, 'Wrist watch': 7, 'Nice perfume': 3, 'Yellow hat': 2,
'Building tools': 3}

Here's how I did it:

>>> def putindict(items):
...     items = items.replace('\n', '\t')
...     items = items.split('\t')
...     d = {}
...     for x in xrange( len(items) ):
...         if not items[x].isdigit(): d[items[x]] = int(items[x+1])
...     return d
...
>>> mydict = putindict(stuff)

I was wondering is there a better way to do it using re module?
perhaps even avoiding this for loop?

There are better ways. One of them avoids the for loop, and even the re
module:

def to_dict(items):
    items = items.replace('\t', '\n').split('\n')
    return dict(zip(items[::2], map(int, items[1::2])))

HTH

Bruno Desthuilliers · Jun 5, 2006

(e-mail address removed) wrote:

This probably has to be:

strings = islice(data2, 0, len(data2), 2)
numbers = islice(data2, 1, len(data2), 2)

try with islice(data2, 0, None, 2)
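
A small Python 3 sketch of that suggestion, using a shortened `data2` list made up for illustration. Passing `None` as the stop argument tells `islice` to run until the iterable is exhausted, so no length has to be computed at all.

```python
from itertools import islice

data2 = ['Yellow hat', '2', 'Blue shirt', '1']

# stop=None: keep going to the end of data2, taking every second element
strings = islice(data2, 0, None, 2)
numbers = islice(data2, 1, None, 2)

pairs = list(zip(strings, map(int, numbers)))
```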

John Machin · Jun 5, 2006

Yeah, for that you'd need more like a real parser... hey, wait a minute!
What about pyparsing?!

Here's a pyparsing version. The definition of the parsing patterns takes
little more than the re definition does - the bulk of the rest of the code
is parsing/scanning the input and reporting the results.

[big snip]

I didn't see any evidence of error handling in there anywhere.

Paul McGuire · Jun 5, 2006

John Machin said:
Yeah, for that you'd need more like a real parser... hey, wait a minute!
What about pyparsing?!

Here's a pyparsing version. The definition of the parsing patterns takes
little more than the re definition does - the bulk of the rest of the code
is parsing/scanning the input and reporting the results.

[big snip]

I didn't see any evidence of error handling in there anywhere.

Pyparsing has a certain amount of error reporting built in, raising a
ParseException when a mismatch occurs.

This particular "grammar" is actually pretty error-tolerant. To force an
error, I replaced "One for the money" with "1 for the money", and here is
the exception reported by pyparsing, along with a diagnostic method,
markInputline:

stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')
badstuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
            'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
            'Wireless cord!\t2\tBuilding tools\t3\n'
            '1 for the money\t7\tTwo for the show\t4')
pattern = dictOf( itemDesc, integer ) + stringEnd
print pattern.parseString(stuff)
print
try:
    print pattern.parseString(badstuff)
except ParseException, pe:
    print pe
    print pe.markInputline()

Gives:
[['Yellow hat', '2'], ['Blue shirt', '1'], ['White socks', '4'], ['Green
pants', '1'], ['Blue bag', '4'], ['Nice perfume', '3'], ['Wrist watch',
'7'], ['Mobile phone', '4'], ['Wireless cord!', '2'], ['Building tools',
'3'], ['One for the money', '7'], ['Two for the show', '4']]

Expected stringEnd (at char 210), (line:6, col:1)

>!<1 for the money 7 Two for the show 4

-- Paul

Fredrik Lundh · Jun 5, 2006

John said:
Fantastic -- at least for the OP's carefully copied-and-pasted input.
Meanwhile back in the real world, there might be problems with multiple
tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.

yeah, that's probably why the OP stated "which always has the same format".

and the "trying to understand regex for the first time, and it would be
very helpful to get an example" part was obviously mostly irrelevant to
the "smarter than thou" crowd; only one thread contributor was silly
enough to actually provide an RE-based example.

</F>

Fredrik Lundh · Jun 5, 2006

SuperHik said:
I'm trying to understand regex for the first time, and it would be very
helpful to get an example. I have an old(er) script with the following
task - takes a string I copy-pasted and which always has the same format:

Yellow hat 2 Blue shirt 1
White socks 4 Green pants 1
Blue bag 4 Nice perfume 3
Wrist watch 7 Mobile phone 4
Wireless cord! 2 Building tools 3
One for the money 7 Two for the show 4

'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

the first thing you need to do is to figure out exactly what the syntax
is. given your example, the format of the items you are looking for
seems to be "some text" followed by a tab character followed by an integer.

an initial attempt would be "\w+\t\d+" (one or more word characters,
followed by a tab, followed by one or more digits). to try this out,
you can do:

>>> import re
>>> re.findall('\w+\t\d+', stuff)
['hat\t2', 'shirt\t1', 'socks\t4', ...]

as you can see, using \w+ isn't good enough here; the "keys" in this
case may contain whitespace as well, and findall simply skips stuff that
doesn't match the pattern. if we assume that a key consists of words
and spaces, we can replace the single \w with [\w ] (either word
character or space), and get

>>> re.findall('[\w ]+\t\d+', stuff)
['Yellow hat\t2', 'Blue shirt\t1', 'White socks\t4', ...]

which looks a bit better. however, if you check the output carefully,
you'll notice that the "Wireless cord!" entry is missing: the "!" isn't
a letter or a digit. the easiest way to fix this is to look for
"non-tab characters" instead, using "[^\t]" (this matches anything
except a tab):

>>> len(re.findall('[\w ]+\t\d+', stuff))
11
>>> len(re.findall('[^\t]+\t\d+', stuff))
12
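
To compare the candidate patterns side by side, a short Python 3 sketch along these lines can help. The sample string is the one from the original question; the `counts` dictionary is just an illustrative way to collect the number of matches per pattern.

```python
import re

stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')

# Raw strings keep the backslashes intact for the regex engine.
patterns = [r'\w+\t\d+', r'[\w ]+\t\d+', r'[^\t]+\t\d+']

counts = {}
for pattern in patterns:
    matches = re.findall(pattern, stuff)
    counts[pattern] = len(matches)
    # Show the count and the first two matches for a quick visual check.
    print(pattern, len(matches), matches[:2])
```

Only the last pattern finds all twelve entries, because it is the only one that lets the "!" in "Wireless cord!" be part of a key.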

now, to turn this into a dictionary, you could split the returned
strings on a tab character (\t), but RE provides a better mechanism:
capturing groups. by adding () to the pattern string, you can mark the
sections you want returned:

>>> re.findall('([^\t]+)\t(\d+)', stuff)
[('Yellow hat', '2'), ('Blue shirt', '1'), ('White socks', ...]

turning this into a dictionary is trivial:

>>> dict(re.findall('([^\t]+)\t(\d+)', stuff))
{'Green pants': '1', 'Blue shirt': '1', 'White socks': ...}
>>> len(dict(re.findall('([^\t]+)\t(\d+)', stuff)))
12

or, in function terms:

def putindict(items):
    return dict(re.findall('([^\t]+)\t(\d+)', items))
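
Two details are worth checking against real input: the captured quantities are strings rather than the integers the original question asked for, and `[^\t]` also matches a newline, so a key that starts a new row would pick up a leading `\n`. A minimal Python 3 sketch that addresses both, assuming keys never contain tabs or newlines:

```python
import re

def putindict(items):
    # [^\t\n]+ keeps a row's line break out of the next key;
    # int() turns the captured digit strings into numbers.
    pairs = re.findall(r'([^\t\n]+)\t(\d+)', items)
    return {name: int(qty) for name, qty in pairs}

stuff = 'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1'
mydict = putindict(stuff)
```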

hope this helps!

</F>

John Machin · Jun 5, 2006

yeah, that's probably why the OP stated "which always has the same format".

Such statements by users are in the same category as "The cheque is
in the mail" and "Of course I'll still love you in the morning".

Bruno Desthuilliers · Jun 5, 2006

John Machin wrote:

(snip)

In case there are leading/trailing spaces on the keys:

There aren't. Test passes.

(snip)

Fantastic -- at least for the OP's carefully copied-and-pasted input.

That was the spec, and my code passes the test.

Meanwhile back in the real world,

The "real world" is mostly defined by customer's test set (is that the
correct translation for "jeu d'essai" ?). Code passes the test. period.

there might be problems with multiple
tabs used for 'prettiness' instead of 1 tab, non-integer values, etc etc.

Which means that the spec and the customer's test set are wrong. Not my
responsibility. Anyway, I refuse to change anything in the parsing
algorithm before having another test set.

In that case a loop approach that validated as it went and was able to
report the position and contents of any invalid input might be better.

One doesn't know what *will* be better without actual facts. You can be
right (and, from my experience, you probably are !-), *but* you can be
wrong as well. Until you have a correct spec and test data set on which
the code fails, writing any other code is a waste of time. Better to
work on other parts of the system, and come back to this if and when the
need arises.

<ot>
Kind of reminds me of a former employer that paid me 2 full months to
work on a very hairy data migration script (the original data set was so
f... up and incoherent even a human parser could barely make any sense of
it), before discovering that none of the users of the old system were
interested in migrating that part of the data. Talk about a waste of
time and money...
</ot>

Now FWIW, there's actually something else bugging me with this code: it
loads the whole data set in memory. It's ok for a few lines, but
obviously wrong if one is to parse huge files. *That* would be the first
thing I would change - it takes a couple of minutes to do, so no real
waste of time, but it obviously implies rethinking the API, which is
better done now than once client code has been written.
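
One way such an API could look, as a Python 3 sketch under stated assumptions: the generator name `iter_items` is hypothetical, and it takes any iterable of lines (an open file, for instance) instead of one big string, so only a single line is held in memory at a time.

```python
import io

def iter_items(lines):
    """Yield (name, qty) pairs from an iterable of tab-separated lines."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # Walk the row two fields at a time: name, quantity, name, quantity...
        for name, qty in zip(fields[::2], fields[1::2]):
            yield name, int(qty)

# io.StringIO stands in for an open file; any iterator of lines works.
source = io.StringIO('Yellow hat\t2\tBlue shirt\t1\n'
                     'White socks\t4\tGreen pants\t1\n')
mydict = dict(iter_items(source))
```

Client code that only needs to loop over the pairs can consume the generator directly and never build the dictionary.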

My 2 cents....

John Machin · Jun 5, 2006

John Machin wrote:

There aren't. Test passes.

(snip)

That was the spec, and my code passes the test.

The "real world" is mostly defined by customer's test set (is that the
correct translation for "jeu d'essai" ?). Code passes the test. period.

"Jeu d'essai" could be construed as "toss a coin" -- yup, that fits some
user test sets I've seen.

In the real world, you are lucky to get a test set that covers all the
user-expected "good" cases. They have to be driven with whips to think
about the "bad" cases. Never come across a problem caused by "FOO " !=
"FOO"? You *have* led a charmed life, so far.

Which means that the spec and the customer's test set are wrong. Not my
responsibility.

That's what you think. The users, the pointy-haired boss, and the evil
HR director may have other ideas.

Anyway, I refuse to change anything in the parsing
algorithm before having another test set.

One doesn't know what *will* be better without actual facts. You can be
right (and, from my experience, you probably are !-), *but* you can be
wrong as well. Until you have a correct spec and test data set on which
the code fails, writing any other code is a waste of time. Better to
work on other parts of the system, and come back to this if and when the
need arises.

Unfortunately one is likely to be told in a Sunday 03:00 phone call that
the "test data set on which the code fails" is somewhere in the
production database :-(

Cheers,
John

Bruno Desthuilliers · Jun 5, 2006

Fredrik Lundh wrote:

yeah, that's probably why the OP stated "which always has the same format".
Lol.

and the "trying to understand regex for the first time, and it would be
very helpful to get an example" part

Yeps, I missed that part when answering yesterday. My bad.

SuperHik · Jun 5, 2006

WOW!
Thanks for all the answers, even those not related to regular
expressions taught me some stuff I wasn't aware of.
I appreciate it very much.
