split problem when the delimiter is inside the text delimiter


rewonka

Hi,

I have a problem: I would like to process a file into PostgreSQL, but
the delimiter char in the file is ':',
and the same character also appears inside the text fields.
Something like this:
text = 1:23:34:"sample: text":" something"
If I use text.split(':')
it will be ['1', '23', '34', '"sample', ' text"', '" something"'],
but that's no good for me, because the delimiter also appears inside
the text delimiter ("...").

Is there a solution that is fast to process, so I don't need to split
with a "hand-made" function?

Rew
 

Peter Otten

rewonka said:
I have a problem: I would like to process a file into PostgreSQL, but
the delimiter char in the file is ':',
and the same character also appears inside the text fields.
Something like this:
text = 1:23:34:"sample: text":" something"
If I use text.split(':')
it will be ['1', '23', '34', '"sample', ' text"', '" something"'],
but that's no good for me, because the delimiter also appears inside
the text delimiter ("...").

Is there a solution that is fast to process, so I don't need to split
with a "hand-made" function?

Have a look at http://docs.python.org/library/csv.html.

import csv

with open(filename, "rb") as instream:
    rows = csv.reader(instream, delimiter=":")
    # decode, put into db
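
To see how this handles the sample line from the question, here is a
minimal sketch (quotechar='"' is already the default, it is spelled out
only for clarity):

line = '1:23:34:"sample: text":" something"'
row = next(csv.reader([line], delimiter=":", quotechar='"'))
# row == ['1', '23', '34', 'sample: text', ' something']
# the ':' inside the quoted field is no longer treated as a delimiter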

Peter
 

rewonka

rewonka said:
I have a problem: I would like to process a file into PostgreSQL, but
the delimiter char in the file is ':',
and the same character also appears inside the text fields.
Something like this:
text = 1:23:34:"sample: text":" something"
If I use text.split(':')
it will be ['1', '23', '34', '"sample', ' text"', '" something"'],
but that's no good for me, because the delimiter also appears inside
the text delimiter ("...").

Is there a solution that is fast to process, so I don't need to split
with a "hand-made" function?

Have a look at http://docs.python.org/library/csv.html.

import csv

with open(filename, "rb") as instream:
    rows = csv.reader(instream, delimiter=":")
    # decode, put into db

Peter

Thanks for the answers.

I tried that module and it works fine.
Now I'm stuck trying to put the data into the db,
because some cells contain text in some kind of encoding, and I'm
looking for a way to put this into the db (my db is in UTF-8 format).
I'm using the pgdb module, something like this:

# connection is an object from the pgdb class

sql = ''' INSERT INTO table (column1, column2, ...) VALUES ( %s,
%s, ....); '''
rows = csv.reader(open("sample.csv", "rb"), delimiter=":",
                  quotechar='"')
for row in rows:
    connection.cursor.execute(sql % (row[0], row[1], ....))
connection.cursor.commit()

But when a cell contains something binary, pgdb says it is not in UTF-8
format, or something like that.
I know it's a newbie solution :))
Is there a better solution?

Rew
 

Tim Chase

sql = ''' INSERT INTO table (column1, column2, ...) VALUES ( %s,
%s, ....); '''
for row in rows:
    connection.cursor.execute(sql % (row[0], row[1], ....))
connection.cursor.commit()

But when a cell contains something binary, pgdb says it is not in UTF-8
format, or something like that.
I know it's a newbie solution :))
Is there a better solution?

The first step is to use the database's quoting to prevent
problems where miscreant characters (such as a single-quote)
appear in the data:

connection.cursor.execute(sql, (row[0], row[1]))

instead of

connection.cursor.execute(sql % (row[0], row[1]))

(If the columns in your CSV happen to match the order of your
INSERT statement, you can just use

execute(sql, tuple(row))

instead.)

As for the UTF-8 exception, without knowing more about the data
and the table-definitions, it's hard to offer a tangible solution.
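
Putting it together, a minimal sketch of the import loop with parameter
binding (the connection details and column names are assumptions, not
taken from the original post; pgdb is assumed to follow the DB-API with
%s placeholders):

import csv
import pgdb  # PyGreSQL DB-API module

sql = "INSERT INTO table (column1, column2) VALUES (%s, %s)"

connection = pgdb.connect(database="mydb")  # connection details assumed
cursor = connection.cursor()
rows = csv.reader(open("sample.csv", "rb"), delimiter=":", quotechar='"')
for row in rows:
    # let the driver quote/escape the values instead of using string
    # interpolation, so stray quote characters can't break the SQL
    cursor.execute(sql, (row[0], row[1]))
connection.commit()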

-tkc
 

Bruno Desthuilliers

Tim Chase wrote:
sql = ''' INSERT INTO table (column1, column2, ...) VALUES ( %s,
%s, ....); '''
for row in rows:
    connection.cursor.execute(sql % (row[0], row[1], ....))
connection.cursor.commit()
(snip)

The first step is to use the database's quoting to prevent problems
where miscreant characters (such as a single-quote) appear in the data:

connection.cursor.execute(sql, (row[0], row[1]))

instead of

connection.cursor.execute(sql % (row[0], row[1]))

(If the columns in your CSV happen to match the order of your INSERT
statement, you can just use

execute(sql, tuple(row))

instead.)

Or more simply:

cursor.execute(sql, row)
 

Bruno Desthuilliers

rewonka wrote:
(snip)
Now I'm stuck trying to put the data into the db,
because some cells contain text in some kind of encoding,

You mean "encoded in something other than utf-8"?
and I'm
looking for a way to put this into the db (my db is in UTF-8 format).
(snip)

But when a cell contains something binary, pgdb says it is not in UTF-8
format, or something like that.

You have to know the original encoding (I mean, the one used for the csv
file), else there's nothing you can do. Then it's just a matter of
decoding (to unicode) then encoding (to utf8), i.e. (if your source is
in latin1):

utf_string = latin1_string.decode("latin1").encode("utf8")
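
Applied to the csv rows from earlier in the thread, that could look like
this (a sketch only: latin-1 is an assumed source encoding, to_utf8 is a
made-up helper name, and it targets Python 2, where the csv module
yields byte strings):

def to_utf8(row, source_encoding="latin1"):
    # decode each field from the assumed source encoding,
    # then re-encode it as utf-8 for the database
    return [field.decode(source_encoding).encode("utf8") for field in row]

for row in rows:
    cursor.execute(sql, tuple(to_utf8(row)))
connection.commit()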
 

Tim Chase

Bruno said:
Tim Chase wrote:

Or more simply:

cursor.execute(sql, row)

that's always annoyed me with the mxODBC drivers I've
used....they require(d?) a tuple in this position, so I developed
this habit. I think the OP hinted they were using postgres
drivers which may be a bit kinder to the user. (and don't get me
started on mxODBC's failure to determine the data-type for
parameters in subqueries, raising exceptions on perfectly valid
queries</rant>)

-tkc
 

imageguy

You have to know the original encoding (I mean, the one used for the csv
file), else there's nothing you can do. Then it's just a matter of
decoding (to unicode) then encoding (to utf8), ie (if your source is in
latin1):

utf_string = latin1_string.decode("latin1").encode("utf8")

The OP mentioned using 'pgdb', which I assume means he is using
PostgreSQL and the PyGreSQL DB.
If that is the case, then PostgreSQL has an optional parameter called
'client_encoding'. If this is set within the postgres db or as part of
the db transaction, the db will accept the incoming data 'as is'
and do the decoding internally, saving this step and giving a bit of a
performance boost, as the client (Python) application doesn't
need to be concerned about it.
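
For example (a sketch; 'LATIN1' is only an assumed value and must match
the actual encoding of the csv file):

cursor = connection.cursor()
# tell the server what encoding this client sends its data in;
# the server then converts it to the database's UTF-8 encoding
cursor.execute("SET client_encoding TO 'LATIN1'")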

As you so correctly point out, Bruno, you do need to know the original
encoding. My comments above just simplify the db update process.

This part of the manual might be helpful
http://www.postgresql.org/docs/8.1/static/multibyte.html

If 'pgdb' != PostgreSQL then please accept my apologies for this
intrusion in this thread.

g.
 

M.-A. Lemburg

that's always annoyed me with the mxODBC drivers I've used....they
require(d?) a tuple in this position, so I developed this habit. I
think the OP hinted they were using postgres drivers which may be a bit
kinder to the user.

This requirement in mxODBC results from still providing a DB-API 1.0
compatible interface... after all, it's been around since 1997.
Back in the early days, there was only cursor.execute() and the only way
to differentiate between a list of rows and a single row was by type.

DB-API 2.0 has cursor.executemany() to make this differentiation
at the API level. mxODBC will lift this requirement in the next
version, promised :)
(and don't get me started on mxODBC's failure to
determine the data-type for parameters in subqueries, raising exceptions
on perfectly valid queries</rant>)

I'm not sure what you are referring to here. mxODBC can only provide
an interface to the underlying ODBC driver.

If the ODBC driver does not provide the required information or
reports errors in situations where you don't expect them, mxODBC
can try to work around these things (and does for quite a few drivers),
but of course, we cannot cover all possible cases, since there are
far too many ODBC drivers out there, each with their own release
cycles.

That said, we are always open to reports about such issues so
we can work them into mxODBC updates where possible.

--
Marc-Andre Lemburg
eGenix.com

http://www.egenix.com/company/contact/
 

Tim Chase

DB-API 2.0 has cursor.executemany() to make this differentiation
at the API level. mxODBC will lift this requirement in the next
version, promised :)

glad to hear...will executemany() take an arbitrary iterable? My
(albeit somewhat-antiquated) version balked at anything that
wasn't a list/tuple (don't remember off the top of my head which
it was). For a lot of my ETL work, it would be nice to pass a
generator so I don't have to keep huge datasets in memory.

M.-A. Lemburg said:
I'm not sure what you are referring to here. mxODBC can only provide
an interface to the underlying ODBC driver.

Okay...feel free to deflect the minced oaths at MS's SQL Server
ODBC driver then. :) The main problem came with queries like

cursor.execute("""
    SELECT a, b, c
    FROM table_x
    WHERE providerid = ?
      AND a IN (
          SELECT z
          FROM table_y
          WHERE accountid = ?
      )""", (providerid, accountid)
)

The "accountid = ?" in the sub-query triggers some inability in
some layer to determine what datatype it should be converted to,
so it raises a FrustrateTim exception :)

-tkc
 

M.-A. Lemburg

glad to hear...will executemany() take an arbitrary iterable? My
(albeit somewhat-antiquated) version balked at anything that wasn't a
list/tuple (don't remember off the top of my head which it was). For a
lot of my ETL work, it would be nice to pass a generator so I don't have
to keep huge datasets in memory.

cursor.executemany() is designed to work on bulk data. It allows
passing a whole chunk of data to the database in one go. It will
currently accept any sequence, but not an iterator. However, we may
add support for general iterators to the next version.

If you want to use a generator with mxODBC 3.0, you are probably
better off using e.g.

cmd = 'INSERT ... VALUES (?,?,?)'
for row in iterable:
    cursor.execute(cmd, row)

Since you always use the same command object, this will trigger
an optimization in mxODBC whereby the prepare step for the command
is only applied once and the already prepared command then
reused for all subsequent executes.
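
If you do want to drive executemany() from a generator with the current
version, one workaround (plain Python, not mxODBC-specific; batches and
row_generator are made-up names, and the chunk size is arbitrary) is to
materialize the stream in fixed-size chunks:

from itertools import islice

def batches(iterable, size=1000):
    # yield lists of at most `size` rows from any iterable
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            break
        yield chunk

cmd = 'INSERT ... VALUES (?,?,?)'
for chunk in batches(row_generator):
    cursor.executemany(cmd, chunk)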
Okay...feel free to deflect the minced oaths at MS's SQL Server ODBC
driver then. :) The main problem came with queries like

cursor.execute("""
    SELECT a, b, c
    FROM table_x
    WHERE providerid = ?
      AND a IN (
          SELECT z
          FROM table_y
          WHERE accountid = ?
      )""", (providerid, accountid)
)

The "accountid = ?" in the sub-query triggers some inability in some
layer to determine what datatype it should be converted to, so it raises
a FrustrateTim exception :)

Parameter binding markers are not always supported in all contexts.

The error that I get from the SQL Server ODBC driver for the above is:

mx.ODBC.Error.ProgrammingError: ('42000', 0, '[Microsoft][SQL Server Native
Client 10.0]Syntax error, permission violation, or other nonspecific error', 7498)

In such a case, you may have more luck by using mxODBC's
cursor.executedirect(). This uses a different mechanism for preparing
and binding the data to the command which basically moves the
process to the database server side.

--
Marc-Andre Lemburg
eGenix.com

http://www.egenix.com/company/contact/
 
