Strange re problem

TYR

OK, this ought to be simple. I'm parsing a large text file (originally
a database dump) in order to process the contents back into a SQLite3
database. The data looks like this:

'AAA','PF',-17.416666666667,-145.5,'Anaa, French Polynesia','Pacific/
Tahiti','Anaa';'AAB','AU',-26.75,141,'Arrabury, Queensland,
Australia','?','?';'AAC','EG',31.133333333333,33.8,'Al Arish,
Egypt','Africa/Cairo','El Arish International';'AAE','DZ',
36.833333333333,8,'Annaba','Africa/Algiers','Rabah Bitat';

which goes on for another 308 lines. As keen and agile minds will no
doubt spot, the rows are separated by a ; so it should be simple to
parse it using a regex. So, I establish a db connection and cursor,
create the table, and open the source file.

Then we do this:

f = file.readlines()
biglist = re.split(';', f)

and then iterate over the output from re.split(), inserting each set
of values into the db, and finally close the file and commit
transactions. But instead, I get this error:

Traceback (most recent call last):
  File "converter.py", line 12, in <module>
    biglist = re.split(';', f)
  File "/usr/lib/python2.5/re.py", line 157, in split
    return _compile(pattern, 0).split(string, maxsplit)
TypeError: expected string or buffer

Is this because the lat and long values are integers rather than
strings? (If so, any ideas?)
 
Peter Otten

TYR said:
Is this because the lat and long values are integers rather than
strings? (If so, any ideas?)

No, the result of file.readlines() is a list of strings, but re.split()
expects a single string as its second argument.

f = file.read()
biglist = re.split(";", f)

should work if the file fits into memory, but you don't need regular
expressions here:

biglist = file.read().split(";")

is just as good -- or bad, if your data contains any ";" characters.

Peter
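
The difference between the two calls can be checked without the real file. The sketch below is only an illustration: it uses an in-memory io.StringIO object as a stand-in for the opened file and a shortened, made-up sample of the data. It is written for Python 3, where the wording of the TypeError differs slightly from the 2.5 traceback above.

```python
import io
import re

# Shortened, made-up sample in the same shape as the dump (assumption).
sample = "'AAA','PF',-17.4,-145.5,'Anaa';'AAB','AU',-26.75,141,'Arrabury';"

# readlines() returns a list of lines, which re.split() rejects.
lines = io.StringIO(sample).readlines()
try:
    re.split(';', lines)
    split_failed = False
except TypeError:
    split_failed = True

# read() returns one string, which both re.split() and str.split() accept.
text = io.StringIO(sample).read()
rows_re = re.split(';', text)
rows_str = text.split(';')

print(split_failed)
print(rows_str)
```

Because the sample ends with a ";", both splits return an empty string as their last element, so a real loader would probably want to skip empty rows.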
 
Mel

TYR said:
Traceback (most recent call last):
  File "converter.py", line 12, in <module>
    biglist = re.split(';', f)
  File "/usr/lib/python2.5/re.py", line 157, in split
    return _compile(pattern, 0).split(string, maxsplit)
TypeError: expected string or buffer

(untested) Try f=file.read()

readlines gives you a list of lines.

Mel.
 
John Machin

[...] which goes on for another 308 lines.

308 lines or 308 rows? Another way of asking the same question: do you
have line terminators like \n or \r\n or \r in your file? If so, you
will need to do something like this:

rows = open('myfile', 'rb').read().replace('\r', '').replace('\n', '').split(';')

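To see why the line terminators matter, here is a small sketch on a made-up two-row string with a line break in the middle of a field, as in the wrapped sample above. The string and the variable names are assumptions for illustration only.

```python
# Made-up sample: a "\r\n" falls inside a quoted field and after the last ";".
raw = ("'AAA','PF',-17.4,-145.5,'Anaa','Pacific/\r\nTahiti','Anaa';"
       "'AAB','AU',-26.75,141,'Arrabury','?','?';\r\n")

# Splitting directly leaves the terminators inside the rows.
naive = raw.split(';')

# Removing "\r" and "\n" first, then dropping empty pieces, gives clean rows.
cleaned = raw.replace('\r', '').replace('\n', '').split(';')
rows = [r for r in cleaned if r]

print(naive)
print(rows)
```

Note that simply deleting the terminators would also join words that were separated only by a line break, so whether to delete them or replace them with a space depends on how the dump was wrapped.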
As keen and agile minds will no
doubt spot, the rows are separated by a ; so it should be simple to
parse it using a regex. So, I establish a db connection and cursor,
create the table, and open the source file.

Then we do this:

f = file.readlines()
biglist = re.split(';', f)

and then iterate over the output from re.split(), inserting each set
of values into the db,

Where we left off, you had a list of rows. Each row will be a string
like:
'AAB','AU',-26.75,141,'Arrabury, Queensland,
Australia','?','?'

How do you propose to parse that string into a "set of values"? Can
you rely on there being data commas only in the 5th field, or do you
need a general solution? What if (as Peter remarked) there is a ';' in
the data? What if there's a "'" in the data (think O'Hare)?

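One possible answer to the comma question, sketched with the standard csv module rather than a regex: csv.reader can be told that the quote character is "'", so commas inside quoted fields are kept. This is only a sketch. The O'Hare row is made up, and it assumes the dump escapes an embedded quote by doubling it (''), which may not be how the real dump does it.

```python
import csv

row = "'AAB','AU',-26.75,141,'Arrabury, Queensland, Australia','?','?'"

# quotechar="'" makes the reader treat single-quoted fields as one value,
# so the commas inside the 5th field do not split it.
fields = next(csv.reader([row], quotechar="'"))

# All values come back as strings; the two numeric fields need float().
first, second = float(fields[2]), float(fields[3])

# Hypothetical row with a doubled quote inside a field (assumed escaping).
ohare_row = "'ORD','US',41.98,-87.9,'Chicago O''Hare'"
ohare = next(csv.reader([ohare_row], quotechar="'"))

print(fields)
print(first, second)
print(ohare[-1])
```

This handles fields once the rows are already separated; a ';' inside a quoted field would still defeat a plain split(';') on the whole file.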
and finally close the file and commit
transactions. But instead, I get this error:

Traceback (most recent call last):
  File "converter.py", line 12, in <module>
    biglist = re.split(';', f)
  File "/usr/lib/python2.5/re.py", line 157, in split
    return _compile(pattern, 0).split(string, maxsplit)
TypeError: expected string or buffer

Is this because the lat and long values are integers rather than
strings? (If so, any ideas?)

At the stage where it blew up, you didn't even have rows, let alone
fields, let alone worries about converting your lat and long fields
from string to float (not integer!).

HTH,
John
 
TYR

How do you propose to parse that string into a "set of values"? Can
you rely on there being data commas only in the 5th field, or do you
need a general solution? What if (as Peter remarked) there is a ';' in
the data? What if there's a "'" in the data (think O'Hare)?

My plan was to be pointlessly sarcastic.
 
John Machin

My plan was to be pointlessly sarcastic.

You misunderstand. My questions covered several of the problems often
encountered by newbies. I am offering help. If my suspicion that you
would need further help was incorrect, please forgive me.
 
Paul McGuire

[...] which goes on for another 308 lines. As keen and agile minds will no
doubt spot, the rows are separated by a ; so it should be simple to
parse it using a regex. So, I establish a db connection and cursor,
create the table, and open the source file.

Using pyparsing, you can skip all that "what happens if there is a
semicolon or comma inside a quoted string?" noise, and get the data in
a trice. If you add results names (as I've done in the example), then
loading each record into your db should be equally simple.

Here is a pyparsing extractor for you. The parse actions already do
the conversions to floats, and stripping off of quotation marks.

-- Paul

data = """
'AAA','PF',-17.416666666667,-145.5,'Anaa, French Polynesia','Pacific/
Tahiti','Anaa';'AAB','AU',-26.75,141,'Arrabury, Queensland,
Australia','?','?';'AAC','EG',31.133333333333,33.8,'Al Arish,
Egypt','Africa/Cairo','El Arish International';'AAE','DZ',
36.833333333333,8,'Annaba','Africa/Algiers','Rabah Bitat';
""".splitlines()
data = "".join(data)

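from pyparsing import *

# Signed integer or decimal number, converted to float while parsing
num = Regex(r'-?\d+(\.\d+)?')
num.setParseAction(lambda t: float(t[0]))
# Single-quoted string, with the surrounding quotes stripped
qs = sglQuotedString.setParseAction(removeQuotes)
CMA = Suppress(',')
SEMI = Suppress(';')
# One row: seven comma-separated fields ending in a semicolon.
# (In the sample data the first number looks like the latitude,
# e.g. -17.4 for Anaa, so the "long"/"lat" names may need swapping.)
dataRow = qs("field1") + CMA + qs("field2") + CMA + \
    num("long") + CMA + num("lat") + CMA + qs("city") + CMA + \
    qs("tz") + CMA + qs("field7") + SEMI

for dr in dataRow.searchString(data):
    print dr.dump()
    print dr.city,dr.long,dr.lat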

Prints:

['AAA', 'PF', -17.416666666666998, -145.5, 'Anaa, French Polynesia',
'Pacific/ Tahiti', 'Anaa']
- city: Anaa, French Polynesia
- field1: AAA
- field2: PF
- field7: Anaa
- lat: -145.5
- long: -17.4166666667
- tz: Pacific/ Tahiti
Anaa, French Polynesia -17.4166666667 -145.5
['AAB', 'AU', -26.75, 141.0, 'Arrabury, Queensland, Australia', '?',
'?']
- city: Arrabury, Queensland, Australia
- field1: AAB
- field2: AU
- field7: ?
- lat: 141.0
- long: -26.75
- tz: ?
Arrabury, Queensland, Australia -26.75 141.0
['AAC', 'EG', 31.133333333332999, 33.799999999999997, 'Al Arish,
Egypt', 'Africa/Cairo', 'El Arish International']
- city: Al Arish, Egypt
- field1: AAC
- field2: EG
- field7: El Arish International
- lat: 33.8
- long: 31.1333333333
- tz: Africa/Cairo
Al Arish, Egypt 31.1333333333 33.8
['AAE', 'DZ', 36.833333333333002, 8.0, 'Annaba', 'Africa/Algiers',
'Rabah Bitat']
- city: Annaba
- field1: AAE
- field2: DZ
- field7: Rabah Bitat
- lat: 8.0
- long: 36.8333333333
- tz: Africa/Algiers
Annaba 36.8333333333 8.0
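
For the last step, loading the parsed rows into SQLite3, a minimal sketch with the standard sqlite3 module might look like the following. The table name, the column names, and the in-memory database are all assumptions for illustration (the thread does not show the real schema), and the two rows are copied from the sample data.

```python
import sqlite3

# Two already-parsed rows from the sample data, as tuples of Python values.
rows = [
    ('AAA', 'PF', -17.416666666667, -145.5,
     'Anaa, French Polynesia', 'Pacific/Tahiti', 'Anaa'),
    ('AAB', 'AU', -26.75, 141.0,
     'Arrabury, Queensland, Australia', '?', '?'),
]

# Hypothetical schema; ':memory:' keeps the example free of side effects.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE airports "
    "(code TEXT, country TEXT, num1 REAL, num2 REAL, "
    "city TEXT, tz TEXT, name TEXT)"
)

# '?' placeholders let sqlite3 do the quoting, so commas, semicolons and
# apostrophes inside the values need no special handling here.
conn.executemany("INSERT INTO airports VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM airports").fetchone()[0]
city = conn.execute(
    "SELECT city FROM airports WHERE code = ?", ('AAB',)
).fetchone()[0]
print(count, city)
conn.close()
```

Using placeholders instead of building the INSERT statement by string formatting avoids the quoting problems raised earlier in the thread.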
 