split problem when the delimiter is inside the text delimiter


rewonka

Hi,

I have a problem: I would like to process a file into PostgreSQL, but
the delimiter char in the file is ':',
and the same character also appears inside the text fields.
Something like this:
text = 1:23:34:"sample: text":" something"
If I use text.split(':')
it will be ['1', '23', '34', '"sample', ' text"', '" something"'],
but that's no good for me, because the delimiter also appears inside
the text delimiter ("...").

Is there a solution that is fast to process, so I don't need to split
with a "hand-made" function?

Rew
 

Peter Otten

rewonka said:
I have a problem: I would like to process a file into PostgreSQL, but
the delimiter char in the file is ':',
and the same character also appears inside the text fields.
Something like this:
text = 1:23:34:"sample: text":" something"
If I use text.split(':')
it will be ['1', '23', '34', '"sample', ' text"', '" something"'],
but that's no good for me, because the delimiter also appears inside
the text delimiter ("...").

Is there a solution that is fast to process, so I don't need to split
with a "hand-made" function?

Have a look at http://docs.python.org/library/csv.html.

import csv

with open(filename, "rb") as instream:
    rows = csv.reader(instream, delimiter=":")
    # decode, put into db
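
To see how this handles the sample line from the question, here is a
minimal sketch (quotechar='"' is already the default, it is spelled out
only for clarity):

line = '1:23:34:"sample: text":" something"'
row = next(csv.reader([line], delimiter=":", quotechar='"'))
# row == ['1', '23', '34', 'sample: text', ' something']
# the ':' inside the quoted field is no longer treated as a delimiter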

Peter
 

rewonka

rewonka said:
I have a problem: I would like to process a file into PostgreSQL, but
the delimiter char in the file is ':',
and the same character also appears inside the text fields.
Something like this:
text = 1:23:34:"sample: text":" something"
If I use text.split(':')
it will be ['1', '23', '34', '"sample', ' text"', '" something"'],
but that's no good for me, because the delimiter also appears inside
the text delimiter ("...").

Is there a solution that is fast to process, so I don't need to split
with a "hand-made" function?

Have a look at http://docs.python.org/library/csv.html.

import csv

with open(filename, "rb") as instream:
    rows = csv.reader(instream, delimiter=":")
    # decode, put into db

Peter

Thanks for the answers.

I tried that module and it works fine.
Now I'm stuck trying to put the data into the db,
because some cells contain text in some kind of encoding, and I'm
looking for a way to put this into the db (my db is in UTF-8 format).
I'm using the pgdb module, something like this:

# connection is an object from the pgdb class

sql = ''' INSERT INTO table (column1, column2, ...) VALUES ( %s,
%s, ....); '''
rows = csv.reader(open("sample.csv", "rb"), delimiter=":",
                  quotechar='"')
for row in rows:
    connection.cursor.execute(sql % (row[0], row[1], ....))
connection.cursor.commit()

But when a cell contains something binary, pgdb says it is not in UTF-8
format, or something like that.
I know it's a newbie solution :))
Is there a better solution?

Rew
 

Tim Chase

sql = ''' INSERT INTO table (column1, column2, ...) VALUES ( %s,
%s, ....); '''
for row in rows:
    connection.cursor.execute(sql % (row[0], row[1], ....))
connection.cursor.commit()

But when a cell contains something binary, pgdb says it is not in UTF-8
format, or something like that.
I know it's a newbie solution :))
Is there a better solution?

The first step is to use the database's quoting to prevent
problems where miscreant characters (such as a single-quote)
appear in the data:

connection.cursor.execute(sql, (row[0], row[1]))

instead of

connection.cursor.execute(sql % (row[0], row[1]))

(If the columns in your CSV happen to match the order of your
INSERT statement, you can just use

execute(sql, tuple(row))

instead.)

As for the UTF-8 exception, without knowing more about the data
and the table-definitions, it's hard to offer a tangible solution.
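
Putting it together, a minimal sketch of the import loop with parameter
binding (the connection details and column names are assumptions, not
taken from the original post; pgdb is assumed to follow the DB-API with
%s placeholders):

import csv
import pgdb  # PyGreSQL DB-API module

sql = "INSERT INTO table (column1, column2) VALUES (%s, %s)"

connection = pgdb.connect(database="mydb")  # connection details assumed
cursor = connection.cursor()
rows = csv.reader(open("sample.csv", "rb"), delimiter=":", quotechar='"')
for row in rows:
    # let the driver quote/escape the values instead of using string
    # interpolation, so stray quote characters can't break the SQL
    cursor.execute(sql, (row[0], row[1]))
connection.commit()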

-tkc
 

Bruno Desthuilliers

Tim Chase wrote:
sql = ''' INSERT INTO table (column1, column2, ...) VALUES ( %s,
%s, ....); '''
for row in rows:
    connection.cursor.execute(sql % (row[0], row[1], ....))
connection.cursor.commit()
(snip)

The first step is to use the database's quoting to prevent problems
where miscreant characters (such as a single-quote) appear in the data:

connection.cursor.execute(sql, (row[0], row[1]))

instead of

connection.cursor.execute(sql % (row[0], row[1]))

(If the columns in your CSV happen to match the order of your INSERT
statement, you can just use

execute(sql, tuple(row))

instead.)

Or more simply:

cursor.execute(sql, row)
 

Bruno Desthuilliers

rewonka wrote:
(snip)
Now I'm stuck trying to put the data into the db,
because some cells contain text in some kind of encoding,

You mean "encoded in something other than utf-8"?
and I'm
looking for a way to put this into the db (my db is in UTF-8 format).
(snip)

But when a cell contains something binary, pgdb says it is not in UTF-8
format, or something like that.

You have to know the original encoding (I mean, the one used for the csv
file), else there's nothing you can do. Then it's just a matter of
decoding (to unicode) then encoding (to utf8), i.e. (if your source is
in latin1):

utf_string = latin1_string.decode("latin1").encode("utf8")
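
Applied to the csv rows from earlier in the thread, that could look like
this (a sketch only: latin-1 is an assumed source encoding, to_utf8 is a
made-up helper name, and it targets Python 2, where the csv module
yields byte strings):

def to_utf8(row, source_encoding="latin1"):
    # decode each field from the assumed source encoding,
    # then re-encode it as utf-8 for the database
    return [field.decode(source_encoding).encode("utf8") for field in row]

for row in rows:
    cursor.execute(sql, tuple(to_utf8(row)))
connection.commit()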
 

Tim Chase

Bruno said:
Tim Chase wrote:

Or more simply:

cursor.execute(sql, row)

that's always annoyed me with the mxODBC drivers I've
used....they require(d?) a tuple in this position, so I developed
this habit. I think the OP hinted they were using postgres
drivers which may be a bit kinder to the user. (and don't get me
started on mxODBC's failure to determine the data-type for
parameters in subqueries, raising exceptions on perfectly valid
queries</rant>)

-tkc
 

imageguy

You have to know the original encoding (I mean, the one used for the csv
file), else there's nothing you can do. Then it's just a matter of
decoding (to unicode) then encoding (to utf8), ie (if your source is in
latin1):

utf_string = latin1_string.decode("latin1").encode("utf8")

The OP mentioned using 'pgdb', which I assume means he is using
PostgreSQL and the PyGreSQL DB.
If that is the case, then PostgreSQL has an optional parameter called
'client_encoding'. If this is set within the postgres db or as part of
the db transaction, the db will accept the incoming data 'as is'
and do the decoding internally, saving this step and giving a bit of a
performance boost, as the client (Python) application doesn't
need to be concerned about it.
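
For example (a sketch; 'LATIN1' is only an assumed value and must match
the actual encoding of the csv file):

cursor = connection.cursor()
# tell the server what encoding this client sends its data in;
# the server then converts it to the database's UTF-8 encoding
cursor.execute("SET client_encoding TO 'LATIN1'")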

As you so correctly point out, Bruno, you do need to know the original
encoding. My comments above just simplify the db update process.

This part of the manual might be helpful
http://www.postgresql.org/docs/8.1/static/multibyte.html

If 'pgdb' != PostgreSQL then please accept my apologies for this
intrusion in this thread.

g.
 

M.-A. Lemburg

that's always annoyed me with the mxODBC drivers I've used....they
require(d?) a tuple in this position, so I developed this habit. I
think the OP hinted they were using postgres drivers which may be a bit
kinder to the user.

This requirement in mxODBC results from still providing a DB-API 1.0
compatible interface... after all, it's been around since 1997.
Back in the early days, there was only cursor.execute() and the only way
to differentiate between a list of rows and a single row was by type.

DB-API 2.0 has cursor.executemany() to make this differentiation
at the API level. mxODBC will lift this requirement in the next
version, promised :)
(and don't get me started on mxODBC's failure to
determine the data-type for parameters in subqueries, raising exceptions
on perfectly valid queries</rant>)

I'm not sure what you are referring to here. mxODBC can only provide
an interface to the underlying ODBC driver.

If the ODBC driver does not provide the required information or
reports errors in situations where you don't expect them, mxODBC
can try to work around these things (and does for quite a few drivers),
but of course, we cannot cover all possible cases, since there are
far too many ODBC drivers out there, each with their own release
cycles.

That said, we are always open to reports about such issues so
we can work them into mxODBC updates where possible.

--
Marc-Andre Lemburg
eGenix.com

http://www.egenix.com/company/contact/
 

Tim Chase

DB-API 2.0 has cursor.executemany() to make this differentiation
at the API level. mxODBC will lift this requirement in the next
version, promised :)

glad to hear...will executemany() take an arbitrary iterable? My
(albeit somewhat-antiquated) version balked at anything that
wasn't a list/tuple (don't remember off the top of my head which
it was). For a lot of my ETL work, it would be nice to pass a
generator so I don't have to keep huge datasets in memory.

M.-A. Lemburg said:
I'm not sure what you are referring to here. mxODBC can only provide
an interface to the underlying ODBC driver.

Okay...feel free to deflect the minced oaths at MS's SQL Server
ODBC driver then. :) The main problem came with queries like

cursor.execute("""
    SELECT a, b, c
    FROM table_x
    WHERE providerid = ?
      AND a IN (
          SELECT z
          FROM table_y
          WHERE accountid = ?
      )""", (providerid, accountid)
)

The "accountid = ?" in the sub-query triggers some inability in
some layer to determine what datatype it should be converted to,
so it raises a FrustrateTim exception :)

-tkc
 

M.-A. Lemburg

glad to hear...will executemany() take an arbitrary iterable? My
(albeit somewhat-antiquated) version balked at anything that wasn't a
list/tuple (don't remember off the top of my head which it was). For a
lot of my ETL work, it would be nice to pass a generator so I don't have
to keep huge datasets in memory.

cursor.executemany() is designed to work on bulk data. It allows
passing a whole chunk of data to the database in one go. It will
currently accept any sequence, but not an iterator. However, we may
add support for general iterators to the next version.

If you want to use a generator with mxODBC 3.0, you are probably
better off using e.g.

cmd = 'INSERT ... VALUES (?,?,?)'
for row in iterable:
    cursor.execute(cmd, row)

Since you always use the same command object, this will trigger
an optimization in mxODBC whereby the prepare step for the command
is only applied once and the already prepared command then
reused for all subsequent executes.
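
If you do want to drive executemany() from a generator with the current
version, one workaround (plain Python, not mxODBC-specific; batches and
row_generator are made-up names, and the chunk size is arbitrary) is to
materialize the stream in fixed-size chunks:

from itertools import islice

def batches(iterable, size=1000):
    # yield lists of at most `size` rows from any iterable
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            break
        yield chunk

cmd = 'INSERT ... VALUES (?,?,?)'
for chunk in batches(row_generator):
    cursor.executemany(cmd, chunk)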
Okay...feel free to deflect the minced oaths at MS's SQL Server ODBC
driver then. :) The main problem came with queries like

cursor.execute("""
    SELECT a, b, c
    FROM table_x
    WHERE providerid = ?
      AND a IN (
          SELECT z
          FROM table_y
          WHERE accountid = ?
      )""", (providerid, accountid)
)

The "accountid = ?" in the sub-query triggers some inability in some
layer to determine what datatype it should be converted to, so it raises
a FrustrateTim exception :)

Parameter binding markers are not always supported in all contexts.

The error that I get from the SQL Server ODBC driver for the above is:

mx.ODBC.Error.ProgrammingError: ('42000', 0, '[Microsoft][SQL Server Native
Client 10.0]Syntax error, permission violation, or other nonspecific error', 7498)

In such a case, you may have more luck by using mxODBC's
cursor.executedirect(). This uses a different mechanism for preparing
and binding the data to the command which basically moves the
process to the database server side.

--
Marc-Andre Lemburg
eGenix.com

http://www.egenix.com/company/contact/
 
