re.compile.match() results in unicode strings - why?

Axel Bock

Hi,

I am doing matches with the re module, and I am experiencing a strange
problem. I match a string with
exp = re.compile(blah)
m = exp.match(string)
a,b,c,d = m.groups()
now a,b,c,d are all string variables, and they all come out as unicode strings
(u"xxx").

now my question is: is that "normal" behaviour, for there really is NO
character possibly triggering any unicode thing - at all. I mean, I don't
really care, for the program works fine, but I _am_ curious a bit :)


thanks in advance & greetings,

axel.
 
Scott David Daniels

Axel said:
.... I match a string with
exp = re.compile(blah)
m = exp.match(string)
a,b,c,d = m.groups()
now a,b,c,d are all string variables,
This shows a misunderstanding. Python does not have typed variables.
a,b,c,d = 'a', 'b', 'c', 'd'
a = 4.5
b = 3+5j
c = u'\N{LATIN CAPITAL LETTER D WITH STROKE}one'
d = repr
is perfectly legal.
... and [a,b,c,d] all come out as unicode strings (u"xxx").

The example is not concrete enough to reproduce. Please give particular
examples for blah and string.

-Scott David Daniels
(e-mail address removed)
 
Stefan Seefeld

Scott said:
This shows a misunderstanding. Python does not have typed variables.

huh ? It is perfectly valid to ask what type a variable has, i.e. python
is 'strongly typed'.
a,b,c,d = 'a', 'b', 'c', 'd'
a = 4.5
b = 3+5j
c = u'\N{LATIN CAPITAL LETTER D WITH STROKE}one'
d = repr
is perfectly legal.

indeed. How's this related to the original question ?

Regards,
Stefan
 
Kent Johnson

Axel said:
Hi,

I am doing matches with the re module, and I am experiencing a strange
problem. I match a string with
exp = re.compile(blah)
m = exp.match(string)
a,b,c,d = m.groups()
now a,b,c,d are all string variables, and they all come out as unicode
strings (u"xxx").

Apparently if the input strings are unicode then the groups will be as well:
(u'ab',)

Are you sure that exp is not a unicode string?
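
A minimal sketch of the behaviour described above, assuming Python 2 semantics where `str` and `unicode` are distinct types. The pattern and input here are made up for illustration, and `text_type` is a helper name introduced only so the snippet also runs unchanged on Python 3 (where every `str` is already unicode, so both calls return text).

```python
import re

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")

# The pattern is a plain (non-u) string literal in both calls.
pattern = r"(\S+) (\S+)"

# Same pattern, two different kinds of subject string.
groups_from_plain = re.match(pattern, "ab cd").groups()
groups_from_text = re.match(pattern, u"ab cd").groups()

# With a unicode subject, every group comes back as unicode,
# because groups are slices of the subject string.
all_text = all(isinstance(g, text_type) for g in groups_from_text)

print(groups_from_plain)
print(groups_from_text)
print(all_text)
```

On Python 2 the second tuple prints with `u'...'` prefixes while the first does not, which suggests looking at the type of the string passed to `match()` rather than at the pattern.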

Kent
 
Axel Bock

Scott said:
This shows a misunderstanding. Python does not have typed variables.
a,b,c,d = 'a', 'b', 'c', 'd'
a = 4.5
b = 3+5j
c = u'\N{LATIN CAPITAL LETTER D WITH STROKE}one'
d = repr
is perfectly legal.

sure, but that was not the question ... :)
The example is not concrete enough to reproduce. Please give particular
examples for blah and string.

if you like ...

** CODE **
import re

string = "1. asdf asdf 327,88"
exp = re.compile(r"(\S+) (\S+) (\S+) (\S+).*")
m = exp.match(string)
print m.groups()
** /CODE **

the question now is (a little bit more precise): m.groups() delivers a list
(ok, tuple) of string types. in my app they are all unicode, but the same code
just typed in the shell produces "normal" strings.

now how does that unicode-string-triggering happen? I definitely do not see
why python feels inclined to return a unicode thingy here ...

as said - it is of no great importance, but I like to know ... :))


ciao & thanks anyways,

axel.
 
Steven Bethard

Stefan Seefeld said:
huh ? It is perfectly valid to ask what type a variable has, i.e. python
is 'strongly typed'.

I think perhaps Axel Bock was getting a little pedantic here. Technically
Python does not have typed variables; any "variable" in Python can hold an
object of any type. Hence examples like:

Probably a better way of saying this is to say that *names* in Python can be
*bound* to objects of any type. Of course, Python is strongly typed, so the
objects 'a' or 4.5 will not change type on you, though you are free to bind any
name you like to these objects:
('a', 'a', 'a')
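
A small sketch of the binding described above. The statements are an illustration with made-up values, not necessarily the ones originally posted.

```python
# Three names bound to the same string object.
a = b = c = 'a'
print((a, b, c))

# Rebinding one name points it at a different object. The string
# object itself is unchanged, and b and c still refer to it.
a = 4.5
same_object = b is c
print(type(a).__name__)
print(type(b).__name__)
print(same_object)
```

The name `a` ends up bound to a float while the string object keeps its type, which is the sense in which names are untyped but objects are not.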

I think Axel Bock would have been happier if the sentence:

"now a,b,c,d are all string variables, and they all come out as unicode strings"

had been expressed as something like:

"now the names a,b,c,d are all bound to unicode objects"

Steve
 
Stefan Seefeld

Steven said:
I think perhaps Axel Bock was getting a little pedantic here. Technically
Python does not have typed variables; any "variable" in Python can hold an
object of any type. Hence examples like:

I don't see this as a contradiction (but then, maybe we'd need to
define what 'variable' means, the reference or the referent). I only
meant to point out that there is a distinction between strong/weak
and static/dynamic typing, and thus, that Axel's question was perfectly
valid.
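
The two axes can be illustrated with a short sketch (the names `x` and `outcome` are made up for this example). Rebinding a name to an object of another type shows dynamic typing, while the refusal to mix `str` and `int` implicitly shows strong typing.

```python
# Dynamic typing: the same name may be bound to objects of different types.
x = "a"
first_type = type(x).__name__
x = 4.5
second_type = type(x).__name__

# Strong typing: objects are not silently converted to make an operation work.
try:
    "a" + 1
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

print(first_type)
print(second_type)
print(outcome)
```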

Regards,
Stefan
 
Steven Bethard

Sorry, I misread the quoting. s/Axel Bock/Scott David Daniels. Sorry!
I don't see this as a contradiction (but then, maybe we'd need to
define what 'variable' means, the reference or the referent). I only
meant to point out that there is a distinction between strong/weak
and static/dynamic typing, and thus, that Axel's question was perfectly
valid.

Yeah, that's why I said the responder was being pedantic. The question is, of
course, a valid one.

I think the phrasing could have been a little confusing though -- if you thought
that the OP believed that "variables" in Python could be of type str (meaning
that they were statically typed), then you would interpret his question as "how
does a str-typed variable get converted into a unicode-typed variable?" instead
of the real question, "why does m.groups() return unicode objects?"

Again, the question was valid -- I just thought it would be helpful to explain
what Scott David Daniels meant when he said "Python does not have typed
variables", and perhaps give a more standard wording for these kinds of
descriptions in Python.

Steve
 
Axel Bock

Kent said:
Apparently if the input strings are unicode then the groups will be as
well:
[...]
Are you sure that exp is not a unicode string?

hm. pretty much - i read the lines from a text file which contains only normal
text. a sample line looks like that:

6. call_noparam 1000 runs 149453,1 ms 149,4531 ms/call

no surprise here, i think ... . Actually I also wrote the program which
produces that file, and I really didn't use unicode then. opening the file
with a text editor also does not show unicode, and I can't believe that
windows does actually manage the unicode stuff transparently to text editors.
and also I have never heard of file-attached codepage information, those would
be the only things i could imagine as a reason.
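
One way to stop guessing is to check the type of each line right where it is read. The sketch below is an assumption about how unicode could enter a program, not a diagnosis of this one: the thread does not show how the file is opened. It uses `io.open` (available from Python 2.6; `codecs.open` behaves similarly on older versions), a temporary file, and the helper name `text_type`, all introduced here for illustration.

```python
import io
import os
import tempfile

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")

# Write the sample line from the post to a temporary file.
sample = u"6. call_noparam 1000 runs 149453,1 ms 149,4531 ms/call\n"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="ascii") as handle:
    handle.write(sample)

# Opening with an encoding decodes on read, so the line is text
# (a unicode object on Python 2) even though it is pure ASCII.
with io.open(path, "r", encoding="ascii") as handle:
    decoded_line = handle.readline()

# A binary read returns undecoded bytes (a plain str on Python 2).
with open(path, "rb") as handle:
    raw_line = handle.readline()

os.remove(path)

decoded_is_text = isinstance(decoded_line, text_type)
raw_is_text = isinstance(raw_line, text_type)
print(decoded_is_text)
print(raw_is_text)
```

If the lines in the application turn out to be unicode objects at this point, the unicode groups follow from the input rather than from the `re` module.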

but interesting though, thanks!


ciao,

axel.
 
Peter Otten

Axel said:
Kent said:
Apparently if the input strings are unicode then the groups will be as
well:
[...]
Are you sure that exp is not a unicode string?

hm. pretty much - i read the lines from a text file which contains only
normal text. a sample line looks like that:

6. call_noparam 1000 runs 149453,1 ms 149,4531 ms/call

no surprise here, i think ... . Actually I also wrote the program which
produces that file, and I really didn't use unicode then. opening the file
with a text editor also does not show unicode, and I can't believe that
windows does actually manage the unicode stuff transparently to text
editors. and also I have never heard of file-attached codepage
information, those would be the only things i could imagine as a reason.

Why do you keep speculating?

[Your code from another post]
** CODE **
import re

string = "1. asdf asdf 327,88"
exp = re.compile(r"(\S+) (\S+) (\S+) (\S+).*")
m = exp.match(string)
print m.groups()
** /CODE **

You could modify that along the lines (untested)

import re

string = "1. asdf asdf 327,88"
pattern = r"(\S+) (\S+) (\S+) (\S+).*"
# make sure that there is no unicode input:
assert not isinstance(string, unicode)
assert not isinstance(pattern, unicode)
exp = re.compile(pattern)
m = exp.match(string)
# make sure at least one group is a unicode string
if m:
    assert [g for g in m.groups() if isinstance(g, unicode)]

If this does not throw an assertion error we can look further, but I still
think this is unlikely.

Peter
 
