csv read clean up and write out to csv

Discussion in 'Python' started by Sacha Rook, Nov 2, 2012.

  1. Sacha Rook

    Sacha Rook Guest

    Hi

    I have a problem with a csv file from a supplier, so they export data to csv however the last column in the record is a description which is marked upwith html.

    trying to automate the processing of this csv to upload elsewhere in a useable format. If i open the csv with csved it looks like all the records aren't escaped correctly as after a while i find html tags and text on the nextline/record.

    If I 'openwith' excel the description stays on the correct line/record?

    I want to use python to read these records in and output a valid csv with the descriptions intact preferably without the html tags so a string of textformatted with newline/CR where appropriate.

    So far I have this but don't know where to go from here can someone help me?

    import csv

    infile = open('c:\data\input.csv', 'rb')
    outfile = open('c:\data\output.csv', 'wb')

    reader = csv.reader(infile)
    writer = csv.writer(outfile)


    for line in reader:
    print line
    writer.writerow(line)


    The input.csv is set out as follows;
    HEADER ROW 1st
    "FileDate","ProductID","Name","StandardPrice","DropshipPrice","SRP","Brand","Xline","InStock","Stock","Barcode","Weight","CategoryID","Category","SmallImage","LargeImage","Description"

    A COMPLETE RECORD LOOKS LIKE THIS WITH THE DESCRIPTION FIELD POPULATED SOMERECORDS DON'T HAVE THE DESCRIPTION FIELD POPULATED

    "2012-11-01T18:28:45.25+00:00","10198","(Venom) PS2 DVD Remote Control (Black)","3.7800","4.3500","12.9800","Venom","true","In Stock","1","5031300025009","200","1339","PC/Games_Console / Playstation / PS2 / Remote Controls","http://www.atssitecentre.co.uk/images/products/10000/10198.gif","http://www..atssitecentre.co.uk/images/products/10000/10198f.jpg","Never have to unplug your joypad / DVD user friendly / Works up to 30 feet from PS/2 / IR wireless technology."

    THIS IS AN EXAMPLE OF THE BAD RECORD CAUSING PROBLEMS, THE DESCRIPTION FIELD STARTS ""features:</p>
    AS YOU CAN SEE CONTAINS HTML BUT BECAUSE OF THIS FORMAT SUBSEQUENT HTML TAGS ARE
    ADDED AS NEW RECORDS ON SUBSEQUENT LINES.

    "2012-11-01T18:28:45.25+00:00","11116","3.5 inch Disk Drive Lock","2.9500","2.9500","9.9500","None","true","In Stock","3","077511994166","131","1332","PC/Games_Console / PC Security / General","http://www.atssitecentre.co.uk/images/products/11000/11116.gif","http://www.atssitecentre.co.uk/images/products/11000/11116f.jpg","features:</p>
    <ul>
    <li>3 1/2&quot; FDD Lock.</li>
    <li>Die casting housing and cylinder chrome plated.</li>
    <li>Lock Cover : PBT + GF 15%. (PLASTIC)</li>
    <li>2 Keys supplied per lock. <br /></li>
    </ul>"

    I know I am far from complete but don't know how to proceed :)
    As I said I want to reconstruct a clean record either strip out the html tags or
    at least escape the records appropriately..

    Thanks all
    Sacha Rook, Nov 2, 2012
    #1
    1. Advertising

  2. Sacha Rook

    Neil Cerutti Guest

    On 2012-11-02, Sacha Rook <> wrote:
    > Hi
    >
    > I have a problem with a csv file from a supplier, so they
    > export data to csv however the last column in the record is a
    > description which is marked up with html.
    >
    > trying to automate the processing of this csv to upload
    > elsewhere in a useable format. If i open the csv with csved it
    > looks like all the records aren't escaped correctly as after a
    > while i find html tags and text on the next line/record.


    Maybe compose a simple parter to disambiguate the lines from the
    file.

    Something like (you'll have to write is_html, and my Python 2 is
    mighty rusty, you'll have to fix up. Note that infile doesn't
    have to be in binary mode with this scheme, but it would fail on
    bizarre newlines in the file):

    def parse_records(iter):
    for line in iter:
    if is_html(line):
    yield ('html', line)
    else:
    yield ('csv', csv.reader([line.strip()]).next())

    infile = open('c:\data\input.csv')
    outfile = open('c:\data\output.csv', 'wb')

    writer = csv.writer(outfile)

    for tag, rec in parse_record(infile):
    if tag == 'html':
    print rec
    elif tag == 'csv':
    writer.writerow(rec)
    else:
    raise ValueError("Unknown record type %s" % tag)

    --
    Neil Cerutti
    Neil Cerutti, Nov 2, 2012
    #2
    1. Advertising

  3. Sacha Rook

    Hans Mulder Guest

    On 2/11/12 18:25:09, Sacha Rook wrote:
    > I have a problem with a csv file from a supplier, so they export data to csv
    > however the last column in the record is a description which is marked up
    > with html.
    >
    > trying to automate the processing of this csv to upload elsewhere in a
    > useable format. If i open the csv with csved it looks like all the records
    > aren't escaped correctly as after a while i find html tags and text on the
    > next line/record.


    The example line you gave was correctly escaped: the description starts
    with a double quote, and ends several lines later with another double
    quote. Double quotes in the HTML are represented by '&quot;'.

    Maybe csved doesn't recognize this escape convention?

    > If I 'openwith' excel the description stays on the correct line/record?


    Excel implements this convention

    > I want to use python to read these records in and output a valid csv with
    > the descriptions intact preferably without the html tags so a string of
    > text formatted with newline/CR where appropriate.


    How about this:

    import csv

    infile = file("input.csv", "rb")
    outfile = file("output.csv", "wb")

    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    for line in reader:
    line[-1] = line[-1].replace("\n", " ")
    print line
    writer.writerow(line)

    infile.close()
    outfile.close()


    That will replace the newlines inside the HTML, which your csved
    doesn't seem to recognize, by spaces. When viewed as HTML code,
    spaces have the same effect as newlines, so this replacement
    shouldn't alter the meaning of the HTML text.

    Hope this helps,

    -- HansM
    Hans Mulder, Nov 2, 2012
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Chris
    Replies:
    3
    Views:
    239
    Steven D'Aprano
    Sep 11, 2007
  2. jliu66
    Replies:
    0
    Views:
    492
    jliu66
    Oct 19, 2007
  3. Replies:
    8
    Views:
    499
  4. Li Chen
    Replies:
    18
    Views:
    643
    Azmi Farih
    Mar 23, 2010
  5. John Mcleod

    using FasterCSV to clean CSV file

    John Mcleod, Nov 17, 2009, in forum: Ruby
    Replies:
    7
    Views:
    157
    James Edward Gray II
    Nov 18, 2009
Loading...

Share This Page