Troubles with CSV file

Vladimir Ignatov

Hello!

I have a big CSV file that I must read and process.
Unfortunately, I can't figure out how to use the standard *csv* module in my
situation. The problem is that some records look like:

""read this, man"", 1

which should be decoded back into:

"read this, man"
1

... which looks pretty "natural" to me. Instead I get:

read this
man""
1

as output. In other words, the csv reader does not understand the use of "" here.
A quick experiment shows that the *csv* module (with the default 'excel' dialect)
expects something like

"""read this, man""", 1

in my situation - the quotes actually must be tripled. I don't understand this
and can't figure out how to proceed with my CSV file. Maybe some
*alternative* CSV parser can help? Any suggestions are welcome.

Vladimir Ignatov
 
Peter Hansen

Vladimir said:
I have a big CSV file that I must read and process.
Unfortunately, I can't figure out how to use the standard *csv* module in my
situation. The problem is that some records look like:

""read this, man"", 1

which should be decoded back into:

"read this, man"
1

Do you have anything that already accepts this particular dialect?
It seems to me that the above could just as easily be interpreted
as three fields (using parentheses as delimiters):

(""read this) ( man"") ( 1)

Is it possible that what you have is not really any standard CSV
format, but just something home-brewed? In that case, you may
well need to massage it before feeding it to the csv module.

Or, if you can define how your example works in terms of delimiters,
quoting and such, maybe there's a way to make the csv module handle
it without complaints.

As far as I can see, you want either the doubled quotation marks to
be treated as single quotation marks, or you want the outer quotation
marks to magically quote the whole string containing the comma even
though it contains the quotation marks already. I don't think CSV
can handle the latter (and it's probably an impossible goal), so you
must really want the former. In that case, unfortunately, you
are also screwed because the doubling of quotation marks must mean
that 'doublequote' is True, but then 'quotechar' must have been '"'
in the first place and that first field would now have triple quotes
around it, like the Excel dialect.

Can you just blindly substitute all double quotes with triple quotes
in the input string first? That might be the easiest approach.

-Peter
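
The blind substitution suggested above could look roughly like the following. This is only a sketch, assuming that every `""` in the file is one of the home-brewed outer quotes; a genuinely empty quoted field (`""`) or an already escaped quote would be mangled by it. The helper name `triple_quotes` and the sample data are made up for illustration, and `skipinitialspace=True` is added so that the space after each comma is dropped.

```python
import csv
import io


def triple_quotes(lines):
    """Yield each line with every doubled quote turned into a tripled one."""
    for line in lines:
        yield line.replace('""', '"""')


# Hypothetical sample standing in for the real file.
sample = io.StringIO('""read this, man"", 1\nplain, 2\n')

# csv.reader accepts any iterable of strings, so the generator can be
# placed between the file object and the reader.
rows = list(csv.reader(triple_quotes(sample), skipinitialspace=True))
for row in rows:
    print(row)
```

Because the generator works line by line, the whole file never has to be held in memory, which matters for a big input.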
 
Fuzzyman

I have written a very simple CSV parser which uses a simple function
'unquote' to unquote quoted elements.
It would be *very* simple to amend unquote to handle double-quoted
elements.

http://www.voidspace.org.uk/atlantibots/pythonutils.html

Regards,

Fuzzy
 
Paul McGuire

Vladimir -

Here is the CSV example that is provided with pyparsing (with some slight
edits). I wrote this for exactly the situation you describe - just
splitting on commas doesn't always do the right thing.

You can download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

==========================
# commasep.py
#
# comma-separated list example, to illustrate the advantages of using
# the pyparsing commaSeparatedList as opposed to string.split(","):
# - leading and trailing whitespace is implicitly trimmed from list elements
# - list elements can be quoted strings, which can safely contain commas
#   without breaking into separate elements

from pyparsing import commaSeparatedList

testData = [
    "a,b,c,100.2,,3",
    "d, e, j k , m ",
    "'Hello, World', f, g , , 5.1,x",
    "John Doe, 123 Main St., Cleveland, Ohio",
    "Jane Doe, 456 St. James St., Los Angeles , California ",
    "",
]

for line in testData:
    # Show the raw input, the naive split, and the pyparsing result side by side.
    print("input:", repr(line))
    print("split:", line.split(","))
    print("parse:", commaSeparatedList.parseString(line))
    print()

==========================
Output:
input: 'a,b,c,100.2,,3'
split: ['a', 'b', 'c', '100.2', '', '3']
parse: ['a', 'b', 'c', '100.2', '', '3']

input: 'd, e, j k , m '
split: ['d', ' e', ' j k ', ' m ']
parse: ['d', 'e', 'j k', 'm']

input: "'Hello, World', f, g , , 5.1,x"
split: ["'Hello", " World'", ' f', ' g ', ' ', ' 5.1', 'x']
parse: ["'Hello, World'", 'f', 'g', '', '5.1', 'x']

input: 'John Doe, 123 Main St., Cleveland, Ohio'
split: ['John Doe', ' 123 Main St.', ' Cleveland', ' Ohio']
parse: ['John Doe', '123 Main St.', 'Cleveland', 'Ohio']

input: 'Jane Doe, 456 St. James St., Los Angeles , California '
split: ['Jane Doe', ' 456 St. James St.', ' Los Angeles ', ' California ']
parse: ['Jane Doe', '456 St. James St.', 'Los Angeles', 'California']

input: ''
split: ['']
parse: ['']
 
Dennis Lee Bieber

output. In other words, the csv reader does not understand the use of "" here.
A quick experiment shows that the *csv* module (with the default 'excel' dialect)
expects something like

"""read this, man""", 1

in my situation - the quotes actually must be tripled. I don't understand this

Which is standard behavior in almost all programming languages.
The first " signals the beginning of a quoted string. Within a quoted
string, double "s flag an escape, being replaced with a single " in the
text. Then a final " ends the quoted string.

"This is a ""quoted"" string"
becomes
This is a "quoted" string
internally.
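
The doubling rule can be checked against the standard *csv* module with a few lines. This is just a sketch using the module's default 'excel' dialect; the variable names are arbitrary.

```python
import csv

line = '"This is a ""quoted"" string"'

# The reader takes any iterable of lines; a one-element list is enough here.
row = next(csv.reader([line]))
print(row[0])
```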

I don't know why you got the "" on the trailing segment of your
text -- maybe a bug in the CSV module, as I'd parse your line like this
(use a fixed-width font):

        ""read this, man"", 1
start---|
end------| ie, an empty quoted string
unquoted--^^^^^^^^^
comma-split--------|
unquoted------------^^^^
start-------------------|
end----------------------| another empty quoted string
comma-split---------------|
unquoted-------------------^^

whereas

        """read this, man""", 1
start---|
end?-----| could be empty string
NO-doubled| no, it's a " inside the string
quoted-----^^^^^^^^^^^^^^
end?---------------------| end of string?
NO-doubled----------------| no, another " inside the string
end------------------------| not doubled so end of string
comma-split-----------------|
unquoted---------------------^^
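
For comparison with the two traces above, here is what the standard *csv* module returns for both lines with its default 'excel' dialect. It is a small sketch with arbitrary variable names. As far as I can tell, the reader treats a quote character as special only when it opens a field, which would explain why the `""` after `man` survives as literal text in the second field of the first line.

```python
import csv

doubled = '""read this, man"", 1'
tripled = '"""read this, man""", 1'

# Parse each line on its own with the default dialect.
doubled_row = next(csv.reader([doubled]))
tripled_row = next(csv.reader([tripled]))

print(doubled_row)  # three fields, trailing quotes kept literally
print(tripled_row)  # two fields, one pair of quotes kept in the first
```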


 
Skip Montanaro

Dennis>         ""read this, man"", 1
Dennis> start---|
Dennis> end------| ie, an empty quoted string
Dennis> unquoted--^^^^^^^^^
Dennis> comma-split--------|
Dennis> unquoted------------^^^^
Dennis> start-------------------|
Dennis> end----------------------| another empty quoted string
Dennis> comma-split---------------|
Dennis> unquoted-------------------^^

I'm not sure what the "correct" interpretation of this should be, since no
separator was placed after the first '""' and before the second. Given that
the input is ill-defined, just about any output could be considered
"valid". ;-)

Skip
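
One way to make the module refuse such ill-defined input instead of guessing is the `strict` dialect parameter. The sketch below assumes a Python version whose *csv* reader accepts `strict=True`; the `rejected` flag is only there for illustration.

```python
import csv

line = '""read this, man"", 1'

try:
    # With strict=True the reader raises csv.Error on malformed quoting
    # rather than falling back to a best-effort interpretation.
    next(csv.reader([line], strict=True))
    rejected = False
except csv.Error as exc:
    rejected = True
    print("rejected:", exc)
```

Running a big file through a strict reader first is a cheap way to find out how many records use the non-standard quoting before deciding how to massage them.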
 
