re.compile.match() results in unicode strings - why?

Axel Bock · Nov 11, 2004

Hi,

I am doing matches with the re module, and I am experiencing a strange
problem. I match a string with
exp = re.compile(blah)
m = exp.match(string)
a,b,c,d = m.groups()
now a,b,c,d are all string variables, and they all come out as unicode strings
(u"xxx").

now my question is: is that "normal" behaviour? There really is NO
character that could possibly trigger any unicode thing - at all. I mean, I don't
really care, for the program works fine, but I _am_ a bit curious.

thanks in advance & greetings,

axel.

Scott David Daniels · Nov 11, 2004

Axel said:
.... I match a string with
exp = re.compile(blah)
m = exp.match(string)
a,b,c,d = m.groups()
now a,b,c,d are all string variables,

This shows a misunderstanding. Python does not have typed variables.
a,b,c,d = 'a', 'b', 'c', 'd'
a = 4.5
b = 3+5j
c = u'\N{LATIN CAPITAL LETTER D WITH STROKE}one'
d = repr
is perfectly legal.

... and [a,b,c,d] all come out as unicode strings (u"xxx").

The example is not concrete enough to reproduce. Please give particular
examples for blah and string.

-Scott David Daniels

Stefan Seefeld · Nov 11, 2004

Scott said:
This shows a misunderstanding. Python does not have typed variables.

huh ? It is perfectly valid to ask what type a variable has, i.e. python
is 'strongly typed'.

a,b,c,d = 'a', 'b', 'c', 'd'
a = 4.5
b = 3+5j
c = u'\N{LATIN CAPITAL LETTER D WITH STROKE}one'
d = repr
is perfectly legal.

indeed. How's this related to the original question ?

Regards,
Stefan

Kent Johnson · Nov 11, 2004

Axel said:
Hi,

I am doing matches with the re module, and I am experiencing a strange
problem. I match a string with
exp = re.compile(blah)
m = exp.match(string)
a,b,c,d = m.groups()
now a,b,c,d are all string variables, and they all come out as unicode
strings (u"xxx").

Apparently if the input strings are unicode then the groups will be as well:

(u'ab',)

Are you sure that exp is not a unicode string?
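
A minimal sketch of the behaviour described above, written for Python 3 (an assumption; the thread predates it), where `str` plays the role of Python 2's `unicode` and `bytes` the role of the old byte string. The helper name `group_types` is made up for illustration. It suggests that the captured groups have the same type as the string being searched:

```python
import re


def group_types(pattern, subject):
    """Return the type name of each group captured from subject."""
    m = re.match(pattern, subject)
    if m is None:
        return ()
    return tuple(type(g).__name__ for g in m.groups())


# Text pattern on text input: the groups are text.
text_types = group_types(r"(\S+) (\S+)", "1. asdf")
# Bytes pattern on bytes input: the groups are bytes.
bytes_types = group_types(rb"(\S+) (\S+)", b"1. asdf")

print(text_types)
print(bytes_types)
```

Python 3 refuses to mix a text pattern with bytes input, whereas Python 2 let `str` and `unicode` be mixed, so there a unicode subject can produce unicode groups even when every character is plain ASCII.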

Kent

Axel Bock · Nov 11, 2004

Scott said:
This shows a misunderstanding. Python does not have typed variables.
a,b,c,d = 'a', 'b', 'c', 'd'
a = 4.5
b = 3+5j
c = u'\N{LATIN CAPITAL LETTER D WITH STROKE}one'
d = repr
is perfectly legal.

sure, but that was not the question ...

The example is not concrete enough to reproduce. Please give particular
examples for blah and string.

if you like ...

** CODE **
import re

string = "1. asdf asdf 327,88"
exp = re.compile(r"(\S+) (\S+) (\S+) (\S+).*")
m = exp.match(string)
print m.groups()
** /CODE **

the question now is (a little bit more precise): m.groups() delivers a list
(ok, tuple) of string types. in my app they are all unicode, but the same code
just typed in the shell produces "normal" strings.

now how does that unicode-string-triggering happen? I definitely do not see
why python feels inclined to return a unicode thingy here ...

as said - it is of no great importance, but I like to know ...


ciao & thanks anyways,

axel.

Steven Bethard · Nov 11, 2004

Stefan Seefeld said:
huh ? It is perfectly valid to ask what type a variable has, i.e. python
is 'strongly typed'.

I think perhaps Axel Bock was getting a little pedantic here. Technically
Python does not have typed variables; any "variable" in Python can hold an
object of any type. Hence examples like:

Probably a better way of saying this is to say that *names* in Python can be
*bound* to objects of any type. Of course, Python is strongly typed, so the
objects 'a' or 4.5 will not change type on you, though you are free to bind any
name you like to these objects:
('a', 'a', 'a')
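
The binding idea can be sketched in a few lines. This is only an illustration, written for Python 3; the variable names `same_object` and `types_after` are invented here and are not from the thread:

```python
# Three names bound to one and the same string object.
a = b = c = "a"
same_object = a is b is c

# Rebinding the name `a` does not touch the object the other names refer to.
a = 4.5
types_after = (type(a).__name__, type(b).__name__, type(c).__name__)

print(same_object)
print(types_after)
```

The string object keeps its type throughout; only the name `a` now refers to a different object.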

I think Axel Bock would have been happier if the sentence:

"now a,b,c,d are all string variables, and they all come out as unicode strings"

had been expressed as something like:

"now the names a,b,c,d are all bound to unicode objects"

Steve

Stefan Seefeld · Nov 11, 2004

Steven said:
I think perhaps Axel Bock was getting a little pedantic here. Technically
Python does not have typed variables; any "variable" in Python can hold an
object of any type. Hence examples like:

I don't see this as a contradiction (but then, maybe we'd need to
define what 'variable' means, the reference or the referent). I only
meant to point out that there is a distinction between strong/weak
and static/dynamic typing, and thus, that Axel's question was perfectly
valid.
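
A small sketch of that distinction, assuming Python 3 and using made-up names (`value`, `mixed_ok`, `total`): objects keep their type and are not silently coerced (strong typing), while a name can be rebound to an object of another type at any time (dynamic typing).

```python
value = "3"

# Strong typing: the str object is not silently coerced to a number.
try:
    value + 1
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Dynamic typing: the same name may be rebound to an object of another type.
value = 3
total = value + 1

print(mixed_ok, total)
```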

Regards,
Stefan

Steven Bethard · Nov 11, 2004

Sorry, I misread the quoting. s/Axel Bock/Scott David Daniels. Sorry!

I don't see this as a contradiction (but then, maybe we'd need to
define what 'variable' means, the reference or the referent). I only
meant to point out that there is a distinction between strong/weak
and static/dynamic typing, and thus, that Axel's question was perfectly
valid.

Yeah, that's why I said the responder was being pedantic. The question is, of
course, a valid one.

I think the phrasing could have been a little confusing though -- if you thought
that the OP believed that "variables" in Python could be of type str (meaning
that they were statically typed), then you would interpret his question as "how
does a str-typed variable get converted into a unicode-typed variable?" instead
of the real question, "why does m.groups() return unicode objects?"

Again, the question was valid -- I just thought it would be helpful to explain
what Scott David Daniels meant when he said "Python does not have typed
variables", and perhaps give a more standard wording for this kind of
description in Python.

Steve

Axel Bock · Nov 12, 2004

Kent said:
Apparently if the input strings are unicode then the groups will be as
well:
[...]
Are you sure that exp is not a unicode string?

hm. pretty much - i read the lines from a text file which contains only normal
text. a sample line looks like that:

6. call_noparam 1000 runs 149453,1 ms 149,4531 ms/call

no surprise here, i think ... . Actually I also wrote the program which
produces that file, and I really didn't use unicode then. opening the file
with a text editor also does not show unicode, and I can't believe that
windows does actually manage the unicode stuff transparently to text editors.
and also I have never heard of file-attached codepage information, those would
be the only things i could imagine as a reason.
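
Nothing in the thread says how the file is actually read, so the following is only a sketch under assumptions. It uses Python 3, where reading in binary mode yields `bytes` and reading in text mode yields decoded `str`. The rough Python 2 analogue would be plain `open` versus a decoding reader such as `codecs.open` with an encoding. The temporary file is a hypothetical stand-in for the real data file. The point is that the content of the file does not decide the type of the lines; the way the file is opened does:

```python
import os
import tempfile

line = "6. call_noparam 1000 runs 149453,1 ms 149,4531 ms/call\n"

# Write the sample line to a temporary file (stand-in for the real data file).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="ascii") as f:
    f.write(line)

try:
    # Binary mode: the line comes back as undecoded bytes.
    with open(path, "rb") as f:
        raw = f.readline()
    # Text mode with an encoding: the line comes back as decoded text.
    with open(path, "r", encoding="ascii") as f:
        decoded = f.readline()
finally:
    os.remove(path)

print(type(raw).__name__, type(decoded).__name__)
```

Both reads see the same pure-ASCII bytes on disk, yet they return objects of different types.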

but interesting though, thanks!

ciao,

axel.

Peter Otten · Nov 12, 2004

Axel said:
Kent said:

Apparently if the input strings are unicode then the groups will be as
well:
[...]
Are you sure that exp is not a unicode string?


hm. pretty much - i read the lines from a text file which contains only
normal text. a sample line looks like that:

6. call_noparam 1000 runs 149453,1 ms 149,4531 ms/call

no surprise here, i think ... . Actually I also wrote the program which
produces that file, and I really didn't use unicode then. opening the file
with a text editor also does not show unicode, and I can't believe that
windows does actually manage the unicode stuff transparently to text
editors. and also I have never heard of file-attached codepage
information, those would be the only things i could imagine as a reason.

Why do you keep speculating?

[Your code from another post]

** CODE **
import re

string = "1. asdf asdf 327,88"
exp = re.compile(r"(\S+) (\S+) (\S+) (\S+).*")
m = exp.match(string)
print m.groups()
** /CODE **

You could modify that along these lines (untested):

import re

string = "1. asdf asdf 327,88"
pattern = "(\S+) (\S+) (\S+) (\S+).*"
# make sure that there is no unicode input:
assert not isinstance(string, unicode)
assert not isinstance(pattern, unicode)
exp = re.compile(pattern)
m = exp.match(string)
# make sure at least one group is a unicode string
if m:
    assert [g for g in m.groups() if isinstance(g, unicode)]

If this does not throw an assertion error we can look further, but I still
think this is unlikely.
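
As a complement to the assertions, one could simply print the types involved instead of asserting on them. The sketch below is written for Python 3 (an assumption, so `str` stands where Python 2 would show `unicode`), and the helper name `report_types` is hypothetical:

```python
import re


def report_types(pattern, subject):
    """Collect the type names of the pattern, the subject and each group."""
    m = re.match(pattern, subject)
    groups = m.groups() if m else ()
    return {
        "pattern": type(pattern).__name__,
        "subject": type(subject).__name__,
        "groups": [type(g).__name__ for g in groups],
    }


report = report_types(r"(\S+) (\S+) (\S+) (\S+).*", "1. asdf asdf 327,88")
print(report)
```

Calling such a helper at the spot in the application where the match happens would show directly whether the pattern or the input line is already of the unexpected type.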

Peter

