S
Sacha Rook
Hi
I have a problem with a csv file from a supplier, so they export data to csv however the last column in the record is a description which is marked upwith html.
trying to automate the processing of this csv to upload elsewhere in a useable format. If i open the csv with csved it looks like all the records aren't escaped correctly as after a while i find html tags and text on the nextline/record.
If I 'openwith' excel the description stays on the correct line/record?
I want to use python to read these records in and output a valid csv with the descriptions intact preferably without the html tags so a string of textformatted with newline/CR where appropriate.
So far I have this but don't know where to go from here can someone help me?
import csv
infile = open('c:\data\input.csv', 'rb')
outfile = open('c:\data\output.csv', 'wb')
reader = csv.reader(infile)
writer = csv.writer(outfile)
for line in reader:
print line
writer.writerow(line)
The input.csv is set out as follows;
HEADER ROW 1st
"FileDate","ProductID","Name","StandardPrice","DropshipPrice","SRP","Brand","Xline","InStock","Stock","Barcode","Weight","CategoryID","Category","SmallImage","LargeImage","Description"
A COMPLETE RECORD LOOKS LIKE THIS WITH THE DESCRIPTION FIELD POPULATED SOMERECORDS DON'T HAVE THE DESCRIPTION FIELD POPULATED
"2012-11-01T18:28:45.25+00:00","10198","(Venom) PS2 DVD Remote Control (Black)","3.7800","4.3500","12.9800","Venom","true","In Stock","1","5031300025009","200","1339","PC/Games_Console / Playstation / PS2 / Remote Controls","http://www.atssitecentre.co.uk/images/products/10000/10198.gif","http://www..atssitecentre.co.uk/images/products/10000/10198f.jpg","Never have to unplug your joypad / DVD user friendly / Works up to 30 feet from PS/2 / IR wireless technology."
THIS IS AN EXAMPLE OF THE BAD RECORD CAUSING PROBLEMS, THE DESCRIPTION FIELD STARTS ""features:</p>
AS YOU CAN SEE CONTAINS HTML BUT BECAUSE OF THIS FORMAT SUBSEQUENT HTML TAGS ARE
ADDED AS NEW RECORDS ON SUBSEQUENT LINES.
"2012-11-01T18:28:45.25+00:00","11116","3.5 inch Disk Drive Lock","2.9500","2.9500","9.9500","None","true","In Stock","3","077511994166","131","1332","PC/Games_Console / PC Security / General","http://www.atssitecentre.co.uk/images/products/11000/11116.gif","http://www.atssitecentre.co.uk/images/products/11000/11116f.jpg","features:</p>
<ul>
<li>3 1/2" FDD Lock.</li>
<li>Die casting housing and cylinder chrome plated.</li>
<li>Lock Cover : PBT + GF 15%. (PLASTIC)</li>
<li>2 Keys supplied per lock. <br /></li>
</ul>"
I know I am far from complete but don't know how to proceed
As I said I want to reconstruct a clean record either strip out the html tags or
at least escape the records appropriately..
Thanks all
I have a problem with a csv file from a supplier, so they export data to csv however the last column in the record is a description which is marked upwith html.
trying to automate the processing of this csv to upload elsewhere in a useable format. If i open the csv with csved it looks like all the records aren't escaped correctly as after a while i find html tags and text on the nextline/record.
If I 'openwith' excel the description stays on the correct line/record?
I want to use python to read these records in and output a valid csv with the descriptions intact preferably without the html tags so a string of textformatted with newline/CR where appropriate.
So far I have this but don't know where to go from here can someone help me?
import csv
infile = open('c:\data\input.csv', 'rb')
outfile = open('c:\data\output.csv', 'wb')
reader = csv.reader(infile)
writer = csv.writer(outfile)
for line in reader:
print line
writer.writerow(line)
The input.csv is set out as follows;
HEADER ROW 1st
"FileDate","ProductID","Name","StandardPrice","DropshipPrice","SRP","Brand","Xline","InStock","Stock","Barcode","Weight","CategoryID","Category","SmallImage","LargeImage","Description"
A COMPLETE RECORD LOOKS LIKE THIS WITH THE DESCRIPTION FIELD POPULATED SOMERECORDS DON'T HAVE THE DESCRIPTION FIELD POPULATED
"2012-11-01T18:28:45.25+00:00","10198","(Venom) PS2 DVD Remote Control (Black)","3.7800","4.3500","12.9800","Venom","true","In Stock","1","5031300025009","200","1339","PC/Games_Console / Playstation / PS2 / Remote Controls","http://www.atssitecentre.co.uk/images/products/10000/10198.gif","http://www..atssitecentre.co.uk/images/products/10000/10198f.jpg","Never have to unplug your joypad / DVD user friendly / Works up to 30 feet from PS/2 / IR wireless technology."
THIS IS AN EXAMPLE OF THE BAD RECORD CAUSING PROBLEMS, THE DESCRIPTION FIELD STARTS ""features:</p>
AS YOU CAN SEE CONTAINS HTML BUT BECAUSE OF THIS FORMAT SUBSEQUENT HTML TAGS ARE
ADDED AS NEW RECORDS ON SUBSEQUENT LINES.
"2012-11-01T18:28:45.25+00:00","11116","3.5 inch Disk Drive Lock","2.9500","2.9500","9.9500","None","true","In Stock","3","077511994166","131","1332","PC/Games_Console / PC Security / General","http://www.atssitecentre.co.uk/images/products/11000/11116.gif","http://www.atssitecentre.co.uk/images/products/11000/11116f.jpg","features:</p>
<ul>
<li>3 1/2" FDD Lock.</li>
<li>Die casting housing and cylinder chrome plated.</li>
<li>Lock Cover : PBT + GF 15%. (PLASTIC)</li>
<li>2 Keys supplied per lock. <br /></li>
</ul>"
I know I am far from complete but don't know how to proceed
As I said I want to reconstruct a clean record either strip out the html tags or
at least escape the records appropriately..
Thanks all