Regex Matching on Readline()


jwwest

Anyone have any trouble pattern matching on lines returned by
readline? Here's an example:

string = "Accounting - General"
pat = r".+\s-"

Should match on "Accounting -". However, if I read that string in from
a file it will not match. In fact, I can't get anything to match
except ".*".

I'm almost certain that it has something to do with the characters
that python returns from readline(). If I have this in a file:

Accounting - General

And do a:

line = f.readline()
print line

I get:

A c c o u n t i n g - G e n e r a l

Not sure why. I'm new to Python, so any help is appreciated. The extra
characters look like spaces to me, but aren't (I've tried matching on
spaces too).


- james
 

John Machin

To find out what the pseudo-spaces are, do this:

print repr(open("the_file", "rb").read()[:100])

and show us (copy/paste) what you get.

Also, tell us what platform you are running Python on, and how the
file was created (by what software, on what platform).
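
A small sketch of that inspection wrapped in a function, written so it should run on Python 3 as well. The helper name `head_repr` is hypothetical, and the default of 100 bytes simply mirrors the one-liner above.

```python
def head_repr(path, n=100):
    """Return repr() of the first n raw bytes of a file.

    Opening in binary mode ("rb") skips newline translation and any
    decoding, so the result shows exactly what is stored on disk.
    """
    with open(path, "rb") as f:
        return repr(f.read(n))
```

Calling `print(head_repr("the_file"))` would then show any bytes hiding between the visible characters.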
 

jwwest

Here's my output:
'A\x00c\x00c\x00o\x00u\x00n\x00t\x00i\x00n\x00g\x00 \x00-\x00 \x00G
\x00e\x00n\x00e\x00r\x00a\x00l\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00'

I'm running Python on Windows. The file was initially created as
output from SQL Management Studio. I've re-saved it using TextPad
which tells me it's Unicode and PC formatted.
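
The dump above suggests why the pattern fails: every character is followed by a \x00 byte, so the space is never directly followed by the hyphen. A sketch reproducing this in Python 3 with a shortened copy of those bytes; the sample value and variable names are assumptions for illustration, and treating the data as two-byte little-endian text is a guess based on the byte layout.

```python
import re

# Shortened stand-in for the bytes shown in the dump above.
raw = b"A\x00c\x00c\x00o\x00u\x00n\x00t\x00i\x00n\x00g\x00 \x00-\x00 \x00G\x00"

# On the raw bytes, the space is followed by \x00 rather than "-",
# so "\s-" can never match and the result is None.
raw_match = re.match(br".+\s-", raw)

# After decoding, the NUL bytes are gone and the pattern behaves as expected.
text = raw.decode("utf-16-le")  # assumption: UTF-16, little-endian, no BOM
text_match = re.match(r".+\s-", text)
```

Here `raw_match` is `None`, while `text_match.group()` gives the "Accounting -" prefix.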
 

John Machin

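In Windows tools such as TextPad, "Unicode" usually means UTF-16, and the
\x00 after each character in your dump points to the little-endian variant.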

Try this:

import codecs
f = codecs.open("the_file", "r", encoding="utf-16-le")
for uline in f:
    # or some other encoding if my guess isn't correct
    line = uline.encode('cp1252')
    # proceed as usual

Cheers,
John
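
For readers on Python 3, a minimal sketch of the same idea: `io.open` (the same function as the built-in `open` there) decodes while reading, so the lines are already text and the re-encoding step to cp1252 is not needed before matching. The helper name `first_matches` is hypothetical, and the default encoding is the same guess as in the reply above.

```python
import io
import re

PAT = re.compile(r".+\s-")

def first_matches(path, encoding="utf-16-le"):
    """Return the text matched by PAT for each matching line of the file."""
    results = []
    # Text mode with an explicit encoding yields decoded str lines.
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:
            m = PAT.match(line)
            if m:
                results.append(m.group())
    return results
```

If the file starts with a byte order mark, passing `encoding="utf-16"` instead should let the codec detect the byte order and drop the mark.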
 
