comma quoted lists question

C

cartercc

I create and deliver CSV files like this:
"12345","John","Q","Public","123 Main Street","Anywhere,
Alaska","800-555-1212"
I create this in my program as a seven element array. Users modify the
files, frequently using Excel, and return the files to me.

Sometimes I get the original format back.

Sometimes, Excel will mangle the file to this:
12345,John,Q,Public,123 Main Street,"Anywhere, Alaska",800-555-1212
My script will read this as an eight element array (splitting
"Anywhere, Alaska" on the comma).

Sometimes Excel will produce a tab delimited file like this:
12345 John Q Public 123 Main Street "Anywhere, Alaska" 800-555-1212
My script will read this as a two element array.

My users like to use Excel and I am not inclined to tell them they
can't, nor am I inclined to 'educate' them on making sure that Excel
doesn't change the file format. They honestly need the filtering and
sorting capabilities that Excel provides, as some of these files are
106 columns wide and thousands of rows deep. I have error checking in
place that rejects improperly formed files, which I define as
producing anything other than an X element array. However, this isn't
satisfactory, because I have the burden of reformatting the file.

Comments? Suggestions? Any module that automagically reformats files?
Will I have to write a reformatting script myself?

Thanks, CC.
 
A

A. Sinan Unur

I create and deliver CSV files like this:
"12345","John","Q","Public","123 Main Street","Anywhere,
Alaska","800-555-1212"
I create this in my program as a seven element array. Users modify the
files, frequently using Excel, and return the files to me.

Sometimes I get the original format back.

Sometimes, Excel will mangle the file to this:
12345,John,Q,Public,123 Main Street,"Anywhere, Alaska",800-555-1212
My script will read this as an eight element array (splitting
"Anywhere, Alaska" on the comma).

Use Text::CSV_XS instead of naively splitting on the comma.
Sometimes Excel will produce a tab delimited file like this:
12345 John Q Public 123 Main Street "Anywhere, Alaska" 800-555-1212
My script will read this as a two element array.

Again, use Text::CSV_XS bu specify the tab character as the separator
instead of naively splitting on the comma.
My users like to use Excel and I am not inclined to tell them they
can't,

Why is this relevant?
nor am I inclined to 'educate' them on making sure that Excel
doesn't change the file format.

Changing from a comma to tab separated file requires explicit user
action in Excel. So, you might want to tell them to save as CSV.

<snip>

Sinan


--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
T

Ted Byers

I create and deliver CSV files like this:
"12345","John","Q","Public","123 Main Street","Anywhere,
Alaska","              800-555-1212       "
I create this in my program as a seven element array. Users modify the
files, frequently using Excel, and return the files to me.

Sometimes I get the original format back.

Sometimes, Excel will mangle the file to this:
12345,John,Q,Public,123 Main Street,"Anywhere, Alaska",              800-555-1212       
My script will read this as an eight element array (splitting
"Anywhere, Alaska" on the comma).

Sometimes Excel will produce a tab delimited file like this:
12345 John Q Public 123 Main Street "Anywhere, Alaska" 800-555-1212
My script will read this as a two element array.

My users like to use Excel and I am not inclined to tell them they
can't, nor am I inclined to 'educate' them on making sure that Excel
doesn't change the file format. They honestly need the filtering and
sorting capabilities that Excel provides, as some of these files are
106 columns wide and thousands of rows deep. I have error checking in
place that rejects improperly formed files, which I define as
producing anything other than an X element array. However, this isn't
satisfactory, because I have the burden of reformatting the file.

Comments? Suggestions? Any module that automagically reformats files?
Will I have to write a reformatting script myself?

Thanks, CC.

That you get multiple formats, such as optionally quoted fields or tab
delimited fields, suggests that user education in the use of Excel is
warranted. Excel will only do what the users tell it to do, so the
problem is what your users are doing, not what Excel actually does.
But if you don't want to provide that education, you have no option
but to test for the format used yourself, or load the files into Excel
yourself, when you get them, and then re-export them in the format you
require. To automagically handle this task in your script, regular
expressions are priceless. You know what delimiters are possible, so
count each in turn, to see what had been used in the file you're
processing. You can use a regular expression to escape delimiters,
such as commas, that appear within quotes. Your options are
limitless. You just have to pick one, and write code to do it.

You don't say what perl packages you use, but there are quite a number
(too many to list here) that make parsing delimited files easier.

HTH

Ted
 
X

xhoster

cartercc said:
I create and deliver CSV files like this:
"12345","John","Q","Public","123 Main Street","Anywhere,
Alaska","800-555-1212"
I create this in my program as a seven element array. Users modify the
files, frequently using Excel, and return the files to me.

Sometimes I get the original format back.

Sometimes, Excel will mangle the file to this:
12345,John,Q,Public,123 Main Street,"Anywhere, Alaska",800-555-1212
My script will read this as an eight element array (splitting
"Anywhere, Alaska" on the comma).

How do you avoid splitting that on that comma in the first case,
where all fields are quoted rather than just that one? If it works
in the first case, it should work in the second case. I'm guessing you
aren't using one of the standard modules for processing CSV?
Sometimes Excel will produce a tab delimited file like this:
12345 John Q Public 123 Main Street "Anywhere, Alaska" 800-555-1212
My script will read this as a two element array.

So then split on tabs instead when you see that the line has the right
number of tabs in it.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
J

Jürgen Exner

cartercc said:
I create and deliver CSV files like this:
"12345","John","Q","Public","123 Main Street","Anywhere,
Alaska","800-555-1212"
I create this in my program as a seven element array. Users modify the
files, frequently using Excel, and return the files to me.
Sometimes, Excel will mangle the file to this:

No, it doesn't.
12345,John,Q,Public,123 Main Street,"Anywhere, Alaska",800-555-1212
My script will read this as an eight element array (splitting
"Anywhere, Alaska" on the comma).

Which is correct and valid CSV format. It may be convenient to blame
Microsoft for everything but in this case it is your program that is at
fault.
Sometimes Excel will produce a tab delimited file like this:
12345 John Q Public 123 Main Street "Anywhere, Alaska" 800-555-1212

This on the other hand is a problem between chair and keyboard caused by
saving the file as "Text (tab delimited)". Maybe this user has this even
set as his preference for saving work sheets, maybe even without
realizing what he is doing.
My script will read this as a two element array.

And correctly so.
My users like to use Excel and I am not inclined to tell them they
can't, nor am I inclined to 'educate' them on making sure that Excel
doesn't change the file format. They honestly need the filtering and

Well, the education would be _NOT_ to select "Save As" -> "Text (tab
delimited)" when saving the file. You wouldn't want to accept any of the
other 24 possible file formats that Excel knows about, either, would
you?
satisfactory, because I have the burden of reformatting the file.

Then just reject files in "Text (tab delimited)" format. If users
deliberately save in a non-compatible format then it is their problem,
neither yours nor Excel's. Don't make it yours.
Comments? Suggestions? Any module that automagically reformats files?

As for fields not enclosed in quotes: that is completely valid CSV and I
am very surprised that Text::CSV doesn't handle it. Could you please
provide a short sample script that demonstrates this deficancy?

jue
 
C

cartercc

To answer the question about 'educating' users, Excel helpfully offers
to keep the Excel compatible features when saving, which users do. The
solution of course is to save the final version (after all
corrections) in the CSV format, but my users are generally not
technical people, and my rejecting their non-CSV input isn't a
politically correct solution.

How do you avoid splitting that on that comma in the first case,
where all fields are quoted rather than just that one? If it works
in the first case, it should work in the second case. I'm guessing you
aren't using one of the standard modules for processing CSV?

As part of error checking, I wrote a simple (simple-minded?) module
that checks for either (1) /\w,\w/ or (2) \/w","\w/ and adapts. And
yes, this sent me to CPAN where I will probably find something that
meets my needs.
So then split on tabs instead when you see that the line has the right
number of tabs in it.

Unfortunately, '123 Main Street' produces three array elements, and
'P. O. Box 123' produces four elements. Most of these files are large
and manually checking and altering them isn't doable.

CC
 
X

xhoster

cartercc said:
Unfortunately, '123 Main Street' produces three array elements, and
'P. O. Box 123' produces four elements.

Are you splitting on tabs, or on whitespace? If you split on tabs, it
should not break up those fields, unless your users are replacing the
spaces inside those fields with tabs. If they are doing that, then they
are not your users, they are enemy forces. Get a different job forthwith
before they graduate from informatics sabotage to lethal weapons.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
J

Jürgen Exner

cartercc said:
To answer the question about 'educating' users, Excel helpfully offers
to keep the Excel compatible features when saving, which users do. The
solution of course is to save the final version (after all
corrections) in the CSV format, but my users are generally not
technical people, and my rejecting their non-CSV input isn't a
politically correct solution.

Well, then good luck with those 26 different file formats that Excel
knows about.
As part of error checking, I wrote a simple (simple-minded?) module
that checks for either (1) /\w,\w/ or (2) \/w","\w/ and adapts. And
yes, this sent me to CPAN where I will probably find something that
meets my needs.


Unfortunately, '123 Main Street' produces three array elements, and
'P. O. Box 123' produces four elements.

That's odd to say the least. Excel doesn't replace space with tab in
data unless someone goes to great length to do so manually.

jue
 
C

cartercc

Which is correct and valid CSV format. It may be convenient to blame
Microsoft for everything but in this case it is your program that is at
fault.

I am aware of this.
This on the other hand is a problem between chair and keyboard caused by
saving the file as "Text (tab delimited)". Maybe this user has this even
set as his preference for saving work sheets, maybe even without
realizing what he is doing.

They don't do this deliberately, but just don't think about file
formats. Nor should they have to, IMO.
And correctly so.
Agreed.


Well, the education would be _NOT_ to select "Save As" -> "Text (tab
delimited)" when saving the file. You wouldn't want to accept any of the
other 24 possible file formats that Excel knows about, either, would
you?

No, the only things that kill be input file are the non-quote
described strings and the whitespace delimited. Never had a problem
with any other kind of delimiters.
Then just reject files in "Text (tab delimited)"  format. If users
deliberately save in a non-compatible format then it is their problem,
neither yours nor Excel's. Don't make it yours.

I can't do this. It's my problem, and I should handle these different
formats gracefully.
As for fields not enclosed in quotes: that is completely valid CSV and I
am very surprised that Text::CSV doesn't handle it. Could you please
provide a short sample script that demonstrates this deficancy?

I didn't use this module, but hand rolled everything. After reading
these comments and exploring CPAN, I am going to try different
options, and I expect to find one that works.

Thanks for your input, CC.
 
J

Jim Gibson

cartercc said:
I didn't use this module, but hand rolled everything. After reading
these comments and exploring CPAN, I am going to try different
options, and I expect to find one that works.

The best solution was already provided: don't ask your users to save
their spreadsheets as CSV, tab-delimited, or any other type of file.
Ask them to send you their Excel documents and extract the data using
the Spreadsheet::parseExcel module. To me, this is a "no-brainer".
 
C

cartercc

That's odd to say the least. Excel doesn't replace space with tab in
data unless someone goes to great length to do so manually.

No, but \s doesn't distinguish between spaces, tabs, carriage returns,
and new lines.

CC
 
C

cartercc

The best solution was already provided: don't ask your users to save
their spreadsheets as CSV, tab-delimited, or any other type of file.
Ask them to send you their Excel documents and extract the data using
the Spreadsheet::parseExcel module. To me, this is a "no-brainer".

And perhaps I should also send them an Excel file to begin with and
thus avoid all the headaches. I'll try that. Thanks.

CC
 
X

xhoster

cartercc said:
No, but \s doesn't distinguish between spaces, tabs, carriage returns,
and new lines.

Of course it doesn't. That is what it is there for. If you want tab,
you use \t (or a literal tab character, but I don't like them in
source-code)

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
J

Jürgen Exner

cartercc said:
No, but \s doesn't distinguish between spaces, tabs, carriage returns,
and new lines.

Well, ok, but, aehmmm, in this case, why are you insisting on shooting
yourself in the foot? If the data is tab delimited, then why on earth
are you using \s as separator?

jue
 
J

Jim Gibson

cartercc said:
And perhaps I should also send them an Excel file to begin with and
thus avoid all the headaches. I'll try that. Thanks.

You can generate the Excel spreadsheets you send them from Perl using
the Spreadsheet::WriteExcel module.

Good luck!
 
C

cartercc

The best solution was already provided: don't ask your users to save
their spreadsheets as CSV, tab-delimited, or any other type of file.
Ask them to send you their Excel documents and extract the data using
the Spreadsheet::parseExcel module. To me, this is a "no-brainer".

Okay, I got both WriteExcel and ParseExcel, and will play with them
next week. Just offhand, do you know whether ParseExcel will work with
Office 2007? Will WriteExcel work with Office 2007? If you know they
will not, it will save me some time.

Thanks, CC.
 
X

xhoster

cartercc said:
Okay, I got both WriteExcel and ParseExcel, and will play with them
next week. Just offhand, do you know whether ParseExcel will work with
Office 2007? Will WriteExcel work with Office 2007? If you know they
will not, it will save me some time.

I don't know, and don't have access to Office 2007 to test it on. If you
find out, please report back as I'm curious to know, too. Thanks,

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
J

Jim Gibson

[QUOTE= said:
Okay, I got both WriteExcel and ParseExcel, and will play with them
next week. Just offhand, do you know whether ParseExcel will work with
Office 2007? Will WriteExcel work with Office 2007? If you know they
will not, it will save me some time.

I don't know, and don't have access to Office 2007 to test it on. If you
find out, please report back as I'm curious to know, too. Thanks,[/QUOTE]

Apparently not:

From <http://www.cpanforum.com/posts/6027>

"Posted on Thu Sep 6 17:56:48 2007 by jmcnamara in response to 6026
(See the whole thread of 2)
Re: Spreadsheet-ParseExcel is unable to parse excel 2007 file
Spreadsheet::parseExcel only handles the Excel binary files format
called Biff that is used in Excel versions prior to 2007. Excel 2007
has a new file format that is comprised of zipped XML files. This is
not handled by Spreadsheet::WriteExcel.
If you unzip the 2007 xls file (change the extension to .zip if
necessary) you can probably parse the information out of the embedded
XML files using one of the Perl XML parsers."
 
C

cartercc

If you want eXcel files, why don't you create them instead of CSV's.

Because it is easy to open a file with a .csv extension and print my
row strings to the file - no special modules or magic required.
There is plenty of support for excel in perl. I produce disk usage
reports using Spreadsheet::WriteExcel and it could not be easier. I even
add a SUM column.

Yes, I am learning. I intend to add this weapon to my arsenal in the
next week or so. I have recently done the same for PDF and have
discovered anew the truth of Clark's law that 'Any sufficiently
advanced technology is indistinguishable from magic.'

CC
 
T

Ted Zlatanov

c> Yes, I am learning. I intend to add this weapon to my arsenal in the
c> next week or so. I have recently done the same for PDF and have
c> discovered anew the truth of Clark's law that 'Any sufficiently
c> advanced technology is indistinguishable from magic.'

If you're running on Windows, you can use the ActiveState distribution
and its OLE support (I don't know if that's available in any other
Windows distributions of Perl) to open any Excel-supported file
(including 2007), read or modify it, then save the results back.

Ted
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,066
Latest member
VytoKetoReviews

Latest Threads

Top