Some help in refining this regex for CSV files

Discussion in 'Python' started by Oltmans, Dec 6, 2012.

  1. Oltmans

    Oltmans Guest

    Hi guys,

    I have to deal with CSVs that look like the following:

    CSV (with one header and 3 legit rows where each legit row has 3 columns)
    ----
    Some info
    Date: 12/6/2012
    Author: Some guy
    Total records: 100

    header1, header2, header3
    one, two, three
    one, "Python is great, so are other languages, isn't ?", three
    one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
    ----

    These CSVs always contain bad lines like the top four (they can appear at the beginning, in the middle, or at the end). The sample above has three legit rows and a header. I want to read those three rows, and here is a regex that I came up with (which clearly isn't working):

    #print line
    pattern = r"([^\t]+\t|,+)"
    matches = re.match(pattern, line)

    Do you have any better ideas, guys? I would really appreciate any help.
     
    Oltmans, Dec 6, 2012
    #1

  2. On 06/12/2012 07:21, Oltmans wrote:
    > Do you've any better ideas guys? I will really appreciate all help.
    >


    I'd simply use the csv module from the standard library to read your
    files, discarding anything that you regard as bad. I'd certainly not
    use a regex for this.
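
    As a quick check of why the pattern in the question fails: re.match only matches at the start of the string, the first alternative needs a tab somewhere in the line, and the second needs the line to begin with a comma. The test strings below are made up for illustration.

```python
import re

pattern = r"([^\t]+\t|,+)"  # the pattern from the question

# No tab anywhere and no leading comma, so neither alternative can match.
no_tab = re.match(pattern, "one, two, three")

# With a tab present, only the text up to and including the first tab is captured.
with_tab = re.match(pattern, "one\ttwo\tthree")

print(no_tab)                   # None
print(repr(with_tab.group(1)))  # 'one\t'
```

    Even on tab-separated input the pattern captures a single field, and it has no notion of quoting at all.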

    --
    Cheers.

    Mark Lawrence.
     
    Mark Lawrence, Dec 6, 2012
    #2

  3. Tim Chase

    Tim Chase Guest

    On 12/06/12 01:21, Oltmans wrote:
    > Do you've any better ideas guys? I will really appreciate all help.


    I agree with Mark that using the "csv" module will likely be your
    easiest way to go. Either consume the lines you don't want before
    passing the file to csv.reader(), or parse every line and discard the
    invalid rows. The first could be done something like

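    import csv

    # Python 3 form of the original Python 2 snippet.
    with open("data.csv", newline="") as f:
        # Skip the preamble, up to and including the first blank line.
        for line in f:
            if not line.rstrip("\r\n"):
                break
        # The reader picks up where the loop stopped.
        r = csv.reader(f)
        for row in r:
            print(repr(row))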

    The latter might be done something like

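    # Python 3; relies on the csv import above.
    with open("data.csv", newline="") as f:
        r = csv.reader(f)
        for row in r:
            # Discard anything that is not a three-column row.
            if len(row) != 3:
                continue
            print(repr(row))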
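
    One wrinkle with the sample in the question: its fields are separated by a comma plus a space, and by default csv.reader only recognises a quote when it is the first character of a field. A minimal sketch of the field-count approach with skipinitialspace=True, which may help here; the SAMPLE string and the read_rows helper are made up for illustration, and the single-quoted row is left out.

```python
import csv
import io

# Illustrative data modelled on the file in the question.
SAMPLE = '''Some info
Date: 12/6/2012
Author: Some guy
Total records: 100

header1, header2, header3
one, two, three
one, "Python is great, so are other languages, isn't ?", three
'''

def read_rows(stream, width=3):
    """Return the rows that have exactly `width` fields, dropping the rest."""
    # skipinitialspace=True ignores the blank after each comma, so a
    # double quote that follows ", " still opens a quoted field.
    reader = csv.reader(stream, skipinitialspace=True)
    return [row for row in reader if len(row) == width]

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

    This keeps the header and the two data rows. A preamble line that happens to contain exactly two commas would slip through, so the field count is only a heuristic.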

    However, I also noticed that your example file doesn't seem to fit a
    true csv file definition, as you seem to switch quoting notations,
    sometimes using single, sometimes using double quotes.
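
    If the mixed quoting cannot be fixed at the source, one possible workaround is to parse each line once per quote character and keep the first result that has the expected number of fields. This is only a sketch: parse_line and the sample lines are hypothetical, and it assumes every record sits on one physical line (the \r\n shown inside the single-quoted field in the question may break that assumption).

```python
import csv

def parse_line(line, width=3, quotechars=('"', "'")):
    """Try each quote character in turn; return the first parse that has
    exactly `width` fields, or None if no attempt fits."""
    for qc in quotechars:
        row = next(csv.reader([line], quotechar=qc, skipinitialspace=True), None)
        if row is not None and len(row) == width:
            return row
    return None

# Illustrative lines, simplified from the sample in the question.
lines = [
    "Total records: 100",
    "one, two, three",
    'one, "Python is great, so are other languages", three',
    "one, two, 'some languages, are really beautiful'",
]
parsed = [parse_line(line) for line in lines]
for row in parsed:
    print(row)
```

    Lines that fit neither quoting style come back as None and can be skipped by the caller.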

    -tkc
     
    Tim Chase, Dec 6, 2012
    #3
