Using dictionary to hold regex patterns?

Gilles Ganault

Hello

After downloading a web page, I need to search for several patterns,
and if found, extract information and put them into a database.

To avoid a bunch of "if m", I figured maybe I could use a dictionary
to hold the patterns, and loop through it:

======
pattern = {}
pattern["pattern1"] = ">.+?</td>.+?>(.+?)</td>"
for key,value in pattern.items():
    response = ">whatever</td>.+?>Blababla</td>"

    #AttributeError: 'str' object has no attribute 'search'
    m = key.search(response)
    if m:
        print key + "#" + value
======

Is there a way to use a dictionary this way, or am I stuck with
copy/pasting blocks of "if m:"?

Thank you.
 
Arnaud Delobelle

Gilles Ganault said:
Hello

After downloading a web page, I need to search for several patterns,
and if found, extract information and put them into a database.

To avoid a bunch of "if m", I figured maybe I could use a dictionary
to hold the patterns, and loop through it:

======
pattern = {}
pattern["pattern1"] = ">.+?</td>.+?>(.+?)</td>"

pattern["pattern1"] = re.compile(">.+?</td>.+?>(.+?)</td>")
for key,value in pattern.items():
response = ">whatever</td>.+?>Blababla</td>"

#AttributeError: 'str' object has no attribute 'search'
m = key.search(response)

m = value.search(response)
if m:
print key + "#" + value
======

Is there a way to use a dictionary this way, or am I stuck with
copy/pasting blocks of "if m:"?

But there is no reason why you should use a dictionary; just use a list
of key-value pairs:

patterns = [
    ("pattern1", re.compile(">.+?</td>.+?>(.+?)</td>")),
    ("pattern2", re.compile("something else")),
    ....
]

for name, pattern in patterns:
    ...
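
A minimal runnable sketch of the list-of-pairs approach, written for Python 3 (the thread's own snippets use the Python 2 print statement). The pattern names, the second pattern, and the sample response are made up for illustration.

```python
import re

# Hypothetical patterns and input, chosen only for illustration.
patterns = [
    ("cell", re.compile(r">.+?</td>.+?>(.+?)</td>")),
    ("title", re.compile(r"<title>(.+?)</title>")),
]

response = ">whatever</td><td>Blababla</td>"

# Collect (name, captured text) for every pattern that matches.
found = []
for name, regex in patterns:
    m = regex.search(response)
    if m:
        found.append((name, m.group(1)))

print(found)
```

Each pattern is compiled once when the list is built, so the loop body stays the same no matter how many patterns are added.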
 
Terry Reedy

Gilles said:
Hello

After downloading a web page, I need to search for several patterns,
and if found, extract information and put them into a database.

To avoid a bunch of "if m", I figured maybe I could use a dictionary
to hold the patterns, and loop through it:

Good idea.

import re
pattern = {}
pattern["pattern1"] = ">.+?</td>.+?>(.+?)</td>"

.... = re.compile("...")
for key,value in pattern.items():

for name, regex in ...
response = ">whatever</td>.+?>Blababla</td>"

#AttributeError: 'str' object has no attribute 'search'

Correct: only compiled re patterns have a search method. Better naming
would make the error obvious.
m = key.search(response)

m = regex.search(response)
if m:
print key + "#" + value

print name + '#' + regex
 
Gilles Ganault

But there is no reason why you should use a dictionary; just use a list
of key-value pairs:

patterns = [
("pattern1", re.compile(">.+?</td>.+?>(.+?)</td>"),

Thanks for the tip, but... I thought that lists could only use integer
indexes, while text indexes had to use dictionaries. In which case do
we need dictionaries, then?
 
John Machin

But there is no reason why you should use a dictionary; just use a list
of key-value pairs:
patterns = [
   ("pattern1", re.compile(">.+?</td>.+?>(.+?)</td>"),

Thanks for the tip, but... I thought that lists could only use integer
indexes, while text indexes had to use dictionaries. In which case do
we need dictionaries, then?

You don't have a requirement for indexing -- neither a text index nor
an integer index. Your requirement is met by a sequence of (name,
regex) pairs. Yes, a list is a sequence, and a list has integer
indexes, but this is irrelevant.

General tip: Don't use a data structure that is more complicated than
what you need.
 
John Machin

Gilles said:
After downloading a web page, I need to search for several patterns,
and if found, extract information and put them into a database.
To avoid a bunch of "if m", I figured maybe I could use a dictionary
to hold the patterns, and loop through it:

Good idea.

import re
pattern = {}
pattern["pattern1"] = ">.+?</td>.+?>(.+?)</td>"

... = re.compile("...")
for key,value in pattern.items():

for name, regex in ...
   response = ">whatever</td>.+?>Blababla</td>"
   #AttributeError: 'str' object has no attribute 'search'

Correct, only compiled re patterns have search, better naming would make
error obvious.
   m = key.search(response)

m = regex.search(response)
   if m:
           print key + "#" + value

print name + '#' + regex

Perhaps you meant:
print key + "#" + regex.pattern
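
A small sketch of why that correction is needed, assuming Python 3: a compiled pattern object cannot be concatenated with a string, but its pattern attribute holds the original pattern string. The variable names here are illustrative.

```python
import re

name = "pattern1"
regex = re.compile(r">.+?</td>.+?>(.+?)</td>")

# Concatenating a str with a compiled pattern object raises TypeError.
try:
    line = name + "#" + regex
except TypeError:
    # .pattern is the string the regex was compiled from.
    line = name + "#" + regex.pattern

print(line)
```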
 
Thomas Mlynarczyk

John said:
General tip: Don't use a data structure that is more complicated than
what you need.

Is "[ ( name, regex ), ... ]" really "simpler" than "{ name: regex, ...
}"? Intuitively, I would consider the dictionary to be the simpler
structure.

Greetings,
Thomas
 
André

Gilles Ganault said:
Hello

After downloading a web page, I need to search for several patterns,
and if found, extract information and put them into a database.

To avoid a bunch of "if m", I figured maybe I could use a dictionary
to hold the patterns, and loop through it:

======
pattern = {}
pattern["pattern1"] = ">.+?</td>.+?>(.+?)</td>"
for key,value in pattern.items():
        response = ">whatever</td>.+?>Blababla</td>"

        #AttributeError: 'str' object has no attribute 'search'
        m = key.search(response)
        if m:
                print key + "#" + value
======

Is there a way to use a dictionary this way, or am I stuck with
copy/pasting blocks of "if m:"?

Thank you.

Yes it is possible and you don't need to use pattern.items()...

Here is something I use (straight cut-and-paste):

def parse_single_line(self, line):
    '''Parses a given line to see if it matches a known pattern'''
    for name in self.patterns:
        result = self.patterns[name].match(line)
        if result is not None:
            return name, result.groups()
    return None, line


where self.patterns is something like
self.patterns={
    'pattern1': re.compile(...),
    'pattern2': re.compile(...)
}

The one potential problem with the method as I wrote it is that
sometimes a more generic pattern gets matched first whereas a more
specific pattern may be desired.

André
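
A self-contained sketch of the method above, assuming Python 3. The LineParser class and both patterns are hypothetical; only parse_single_line follows the post. Since Python 3.7 a dict iterates in insertion order, so listing the specific pattern first is one way to keep the generic one from winning. Older interpreters make no such guarantee.

```python
import re


class LineParser:
    """Hypothetical wrapper; only parse_single_line comes from the post."""

    def __init__(self):
        # Specific pattern first, generic pattern second. This relies on
        # dicts preserving insertion order (guaranteed from Python 3.7).
        self.patterns = {
            "key_value": re.compile(r"(\w+)\s*=\s*(\w+)"),
            "word": re.compile(r"(\w+)"),
        }

    def parse_single_line(self, line):
        """Return (name, groups) for the first pattern matching the line."""
        for name in self.patterns:
            result = self.patterns[name].match(line)
            if result is not None:
                return name, result.groups()
        return None, line


parser = LineParser()
print(parser.parse_single_line("colour = red"))
print(parser.parse_single_line("hello"))
print(parser.parse_single_line("!!!"))
```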
 
John Machin

John said:
General tip: Don't use a data structure that is more complicated than
what you need.

Is "[ ( name, regex ), ... ]" really "simpler" than "{ name: regex, ...
}"? Intuitively, I would consider the dictionary to be the simpler
structure.

Hi Thomas,

Rephrasing for clarity: Don't use a data structure that is more
complicated than that indicated by your requirements.

Judging which of two structures is "simpler" should not be independent
of those requirements. I don't see a role for intuition in this
process.

Please see my belated response in your "My first Python program -- a
lexer" thread.

Cheers,
John
 
Gilles Ganault

But there is no reason why you should use a dictionary; just use a list
of key-value pairs:

Thanks for the tip. I didn't know it was possible to use arrays to
hold more than one value. Actually, it's a better solution, as
key/value tuples in a dictionary aren't used in the order in which
they're put in the dictionary, while arrays are.

For those interested:

========
response = ">dummy</td>bla>good stuff</td>"
for name, pattern in patterns:
    m = pattern.search(response)
    if m:
        print m.group(1)
        break
else:
    print "here"
========

Thanks guys.
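
A runnable sketch of that first-match loop for Python 3, assuming the else is meant to pair with the for (the posted snippet's indentation leaves this open). The patterns list and the first_match helper are hypothetical, since the post does not show the list that was used.

```python
import re

# Hypothetical patterns; the thread does not show the list Gilles used.
patterns = [
    ("cell", re.compile(r">.+?</td>.+?>(.+?)</td>")),
    ("link", re.compile(r'href="(.+?)"')),
]


def first_match(response):
    """Return group 1 of the first pattern that matches, else None."""
    for name, pattern in patterns:
        m = pattern.search(response)
        if m:
            result = m.group(1)
            break
    else:
        # Runs only when the loop finished without hitting break.
        result = None
    return result


print(first_match(">dummy</td>bla>good stuff</td>"))
print(first_match("no markup here"))
```

The else clause of a for loop is skipped when break fires, which makes it a natural place for the "nothing matched" case.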
 
MRAB

Gilles said:
Thanks for the tip. I didn't know it was possible to use arrays to
hold more than one value. Actually, it's a better solution, as
key/value tuples in a dictionary aren't used in the order in which
they're put in the dictionary, while arrays are.
[snip]
A list is an ordered collection of items. Each item can be anything: a
string, an integer, a dictionary, a tuple, a list...
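
A short illustration of that point, assuming Python 3. The list contents are arbitrary examples.

```python
import re

# One list, items of different types; each item is still a single object.
items = ["text", 42, {"k": "v"}, ("pattern1", re.compile("abc")), [1, 2]]

kinds = [type(item).__name__ for item in items]
print(kinds)

# A tuple item can be unpacked into separate names.
name, regex = items[3]
print(name, bool(regex.search("xxabcxx")))
```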
 
Gilles Ganault

A list is an ordered collection of items. Each item can be anything: a
string, an integer, a dictionary, a tuple, a list...

Yup, learned something new today. Naively, I thought a list was
index=value, where value=a single piece of data. Works like a charm.
Thanks.
 
Marc 'BlackJack' Rintsch

Yup, learned something new today. Naively, I thought a list was
index=value, where value=a single piece of data.

Your thought was correct, each value is a single piece of data: *one*
tuple.

Ciao,
Marc 'BlackJack' Rintsch
 
Thomas Mlynarczyk

Dennis said:
Is "[ ( name, regex ), ... ]" really "simpler" than "{ name: regex, ...
}"? Intuitively, I would consider the dictionary to be the simpler
structure.
Why, when you aren't /using/ the name to retrieve the expression...

So as soon as I start retrieving a regex by its name, the dict will be
the most suitable structure?

Greetings,
Thomas
 
Thomas Mlynarczyk

John said:
Rephrasing for clarity: Don't use a data structure that is more
complicated than that indicated by your requirements.

Could you please define "complicated" in this context? In terms of
characters to type and reading, the dict is surely simpler. But I
suppose that under the hood, it is "less work" for Python to deal with a
list of tuples than a dict?
Judging which of two structures is "simpler" should not be independent
of those requirements. I don't see a role for intuition in this
process.

Maybe I should have said "upon first sight" / "judging from the outer
appearance" instead of "intuition".
Please see my belated response in your "My first Python program -- a
lexer" thread.

(See my answer there.) I think I should definitely read up a bit on the
implementation details of those data structures in Python. (As it was
suggested earlier in my lexer thread.)

Greetings,
Thomas
 
John Machin

Could you please define "complicated" in this context? In terms of
characters to type and reading, the dict is surely simpler. But I
suppose that under the hood, it is "less work" for Python to deal with a
list of tuples than a dict?

The two extra parentheses per item are a trivial cosmetic factor, and only
when the data is hard-coded; they don't exist if the data is read from
a file. They have nothing to do with "complicated". The amount of work done
by Python under the hood is relevant only to a speed/memory
requirement. No, "complicated" is more related to unused features. In
the case of using an aeroplane to transport 3 passengers 10 km along
the autobahn, you aren't using the radar, wheel-retractability, wings,
pressurised cabin, etc. In your original notion of using a dict in
your lexer, you weren't using the mapping functionality of a dict at
all. In both cases you have perplexed bystanders asking "Why use a
plane/dict when a car/list will do the job?".
Maybe I should have said "upon first sight" / "judging from the outer
appearance" instead of "intuition".

I don't see a role for "upon first sight" or "judging from the outer
appearance" either.
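
One way to see the "unused features" argument in code, as a sketch for Python 3 with made-up pattern names: iterating over every pattern needs only a sequence, while fetching a single pattern by name is the case where a dict's mapping functionality is actually used.

```python
import re

pairs = [
    ("number", re.compile(r"\d+")),
    ("word", re.compile(r"[a-z]+")),
]
text = "abc 123"

# Requirement 1: try every pattern in a fixed order. A sequence is enough;
# nothing is ever looked up by name.
hits = []
for name, regex in pairs:
    m = regex.search(text)
    if m:
        hits.append((name, m.group()))

# Requirement 2: fetch one pattern by its name. This is the mapping
# functionality of a dict being put to use.
by_name = dict(pairs)
number = by_name["number"].search(text).group()

print(hits)
print(number)
```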
 
Steve Holden

John said:
On Nov 25, 4:38 am, Thomas Mlynarczyk [...]
Maybe I should have said "upon first sight" / "judging from the outer
appearance" instead of "intuition".

I don't see a role for "upon first sight" or "judging from the outer
appearance" either.
They are all potentially (inadequate) substitutes for the knowledge and
experience you bring to the problem.

regards
Steve
 
Thomas Mlynarczyk

John said:
No, "complicated" is more related to unused features. In
the case of using an aeroplane to transport 3 passengers 10 km along
the autobahn, you aren't using the radar, wheel-retractability, wings,
pressurised cabin, etc. In your original notion of using a dict in
your lexer, you weren't using the mapping functionality of a dict at
all. In both cases you have perplexed bystanders asking "Why use a
plane/dict when a car/list will do the job?".

Now the matter is getting clearer in my head.

Thanks and greetings,
Thomas
 
Bruno Desthuilliers

André wrote:
(snip)
you don't need to use pattern.items()...

Here is something I use (straight cut-and-paste):

def parse_single_line(self, line):
    '''Parses a given line to see if it match a known pattern'''
    for name in self.patterns:
        result = self.patterns[name].match(line)

FWIW, this is more expensive than iterating over (key, value) tuples
using dict.items(), since you have one extra call to dict.__getitem__
per entry.

        if result is not None:
            return name, result.groups()
    return None, line


where self.patterns is something like
self.patterns={
    'pattern1': re.compile(...),
    'pattern2': re.compile(...)
}

The one potential problem with the method as I wrote it is that
sometimes a more generic pattern gets matched first whereas a more
specific pattern may be desired.

As usual when order matters, the solution is to use a list of (name,
whatever) tuples instead of a dict. You can still build a dict from this
list when needed (the dict constructor accepts a list of (name, object)
pairs as its argument).
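
A sketch of that advice for Python 3, with hypothetical pattern names and a hypothetical classify helper: the ordered list decides which pattern wins, and dict() builds a name lookup from the same pairs when one is needed.

```python
import re

# Order matters, so the patterns live in a list: specific before generic.
ordered = [
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("number", re.compile(r"\d+")),
]


def classify(line):
    """Return the name of the first pattern matching at the start of line."""
    for name, regex in ordered:
        if regex.match(line):
            return name
    return None


# dict() accepts an iterable of (key, value) pairs.
lookup = dict(ordered)

print(classify("2008-11-25"))
print(classify("2008"))
print(sorted(lookup))
```

With "number" listed first, the date string would be classified as a number, which is the generic-before-specific problem described in the quoted post.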
 
