csv read clean up and write out to csv


S

Sacha Rook

Hi

I have a problem with a csv file from a supplier, so they export data to csv however the last column in the record is a description which is marked upwith html.

trying to automate the processing of this csv to upload elsewhere in a useable format. If i open the csv with csved it looks like all the records aren't escaped correctly as after a while i find html tags and text on the nextline/record.

If I 'openwith' excel the description stays on the correct line/record?

I want to use python to read these records in and output a valid csv with the descriptions intact preferably without the html tags so a string of textformatted with newline/CR where appropriate.

So far I have this but don't know where to go from here can someone help me?

import csv

infile = open('c:\data\input.csv', 'rb')
outfile = open('c:\data\output.csv', 'wb')

reader = csv.reader(infile)
writer = csv.writer(outfile)


for line in reader:
print line
writer.writerow(line)


The input.csv is set out as follows;
HEADER ROW 1st
"FileDate","ProductID","Name","StandardPrice","DropshipPrice","SRP","Brand","Xline","InStock","Stock","Barcode","Weight","CategoryID","Category","SmallImage","LargeImage","Description"

A COMPLETE RECORD LOOKS LIKE THIS WITH THE DESCRIPTION FIELD POPULATED SOMERECORDS DON'T HAVE THE DESCRIPTION FIELD POPULATED

"2012-11-01T18:28:45.25+00:00","10198","(Venom) PS2 DVD Remote Control (Black)","3.7800","4.3500","12.9800","Venom","true","In Stock","1","5031300025009","200","1339","PC/Games_Console / Playstation / PS2 / Remote Controls","http://www.atssitecentre.co.uk/images/products/10000/10198.gif","http://www..atssitecentre.co.uk/images/products/10000/10198f.jpg","Never have to unplug your joypad / DVD user friendly / Works up to 30 feet from PS/2 / IR wireless technology."

THIS IS AN EXAMPLE OF THE BAD RECORD CAUSING PROBLEMS, THE DESCRIPTION FIELD STARTS ""features:</p>
AS YOU CAN SEE CONTAINS HTML BUT BECAUSE OF THIS FORMAT SUBSEQUENT HTML TAGS ARE
ADDED AS NEW RECORDS ON SUBSEQUENT LINES.

"2012-11-01T18:28:45.25+00:00","11116","3.5 inch Disk Drive Lock","2.9500","2.9500","9.9500","None","true","In Stock","3","077511994166","131","1332","PC/Games_Console / PC Security / General","http://www.atssitecentre.co.uk/images/products/11000/11116.gif","http://www.atssitecentre.co.uk/images/products/11000/11116f.jpg","features:</p>
<ul>
<li>3 1/2&quot; FDD Lock.</li>
<li>Die casting housing and cylinder chrome plated.</li>
<li>Lock Cover : PBT + GF 15%. (PLASTIC)</li>
<li>2 Keys supplied per lock. <br /></li>
</ul>"

I know I am far from complete but don't know how to proceed :)
As I said I want to reconstruct a clean record either strip out the html tags or
at least escape the records appropriately..

Thanks all
 
Ad

Advertisements

N

Neil Cerutti

Hi

I have a problem with a csv file from a supplier, so they
export data to csv however the last column in the record is a
description which is marked up with html.

trying to automate the processing of this csv to upload
elsewhere in a useable format. If i open the csv with csved it
looks like all the records aren't escaped correctly as after a
while i find html tags and text on the next line/record.

Maybe compose a simple parter to disambiguate the lines from the
file.

Something like (you'll have to write is_html, and my Python 2 is
mighty rusty, you'll have to fix up. Note that infile doesn't
have to be in binary mode with this scheme, but it would fail on
bizarre newlines in the file):

def parse_records(iter):
for line in iter:
if is_html(line):
yield ('html', line)
else:
yield ('csv', csv.reader([line.strip()]).next())

infile = open('c:\data\input.csv')
outfile = open('c:\data\output.csv', 'wb')

writer = csv.writer(outfile)

for tag, rec in parse_record(infile):
if tag == 'html':
print rec
elif tag == 'csv':
writer.writerow(rec)
else:
raise ValueError("Unknown record type %s" % tag)
 
Ad

Advertisements

H

Hans Mulder

I have a problem with a csv file from a supplier, so they export data to csv
however the last column in the record is a description which is marked up
with html.

trying to automate the processing of this csv to upload elsewhere in a
useable format. If i open the csv with csved it looks like all the records
aren't escaped correctly as after a while i find html tags and text on the
next line/record.

The example line you gave was correctly escaped: the description starts
with a double quote, and ends several lines later with another double
quote. Double quotes in the HTML are represented by '&quot;'.

Maybe csved doesn't recognize this escape convention?
If I 'openwith' excel the description stays on the correct line/record?

Excel implements this convention
I want to use python to read these records in and output a valid csv with
the descriptions intact preferably without the html tags so a string of
text formatted with newline/CR where appropriate.

How about this:

import csv

infile = file("input.csv", "rb")
outfile = file("output.csv", "wb")

reader = csv.reader(infile)
writer = csv.writer(outfile)

for line in reader:
line[-1] = line[-1].replace("\n", " ")
print line
writer.writerow(line)

infile.close()
outfile.close()


That will replace the newlines inside the HTML, which your csved
doesn't seem to recognize, by spaces. When viewed as HTML code,
spaces have the same effect as newlines, so this replacement
shouldn't alter the meaning of the HTML text.

Hope this helps,

-- HansM
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top