Some help in refining this regex for CSV files

Oltmans · Dec 6, 2012

Hi guys,

I have to deal with CSVs that look like the following:

CSV (with one header and 3 legit rows where each legit row has 3 columns)
----
Some info
Date: 12/6/2012
Author: Some guy
Total records: 100

header1, header2, header3
one, two, three
one, "Python is great, so are other languages, isn't ?", three
one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
----

Inside this CSV there will always be bad lines like the top 4 (they could show up at the beginning, in the middle, or even at the end). The sample above has 3 legit rows and a header. I want to read those three rows, and here is a regex that I came up with (which clearly isn't working):

#print line
pattern = r"([^\t]+\t|,+)"
matches = re.match(pattern, line)

Do you have any better ideas, guys? I would really appreciate any help.

Mark Lawrence · Dec 6, 2012

I'd simply use the csv module from the standard library to read your
files, discarding anything that you regard as bad. I'd certainly not
use a regex for this.

Tim Chase · Dec 6, 2012

I agree with Mark that using the "csv" module will likely be your
easiest way to go. Either consume the lines you don't want before
passing the file to csv.reader(), or parse everything and discard the
invalid rows. The first could be done something like

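import csv

with open("data.csv", newline="") as f:
    # Skip the leading junk, up to and including the first blank line
    for line in f:
        if not line.strip():
            break
    # Hand the rest of the file to the csv module
    for row in csv.reader(f):
        print(repr(row))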

The latter might be done something like

import csv

with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        # Keep only rows that have exactly three columns
        if len(row) != 3:
            continue
        print(repr(row))
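
One caveat with the length check: the sample has a space after every comma, and by default the csv module only treats a quote character as a quote when it is the very first character of a field. Here is a minimal sketch of one way around that, assuming the sample from the question is held in a string. SAMPLE and read_rows are names made up for illustration, not anything from the original post. It passes skipinitialspace=True so the double-quoted field stays in one column:

```python
import csv
import io

# Hypothetical stand-in for the file, trimmed from the sample in the question.
SAMPLE = '''Some info
Date: 12/6/2012
Author: Some guy
Total records: 100

header1, header2, header3
one, two, three
one, "Python is great, so are other languages, isn't ?", three
'''

def read_rows(stream, width=3):
    """Return only the rows that have exactly `width` columns."""
    # skipinitialspace=True drops the blank after each comma, so a quote
    # that follows ", " is still recognised as the start of a quoted field.
    reader = csv.reader(stream, skipinitialspace=True)
    return [row for row in reader if len(row) == width]

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

Without skipinitialspace the quoted row is split at every comma, so the length check would silently throw it away.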

However, I also noticed that your example file doesn't seem to fit a
true csv file definition, as you seem to switch quoting notations,
sometimes using single, sometimes using double quotes.
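
If the mixed quoting can't be fixed at the source, one possible workaround is to parse each line on its own and try both quote characters. This is only a sketch under two assumptions: every record sits on a single physical line, and a legit row always has three columns. parse_line is a hypothetical helper name.

```python
import csv

def parse_line(line, width=3):
    """Try each quote character in turn; return the first parse that
    yields `width` columns, or None if neither does."""
    for quote in ('"', "'"):
        reader = csv.reader([line], quotechar=quote, skipinitialspace=True)
        row = next(reader, [])
        if len(row) == width:
            return row
    return None

lines = [
    "Date: 12/6/2012",
    "one, two, three",
    'one, "Python is great, so are other languages, isn\'t ?", three',
    "one, two, 'some languages, are really beautiful'",
]
for line in lines:
    print(parse_line(line))
```

Double quotes are tried first, so an apostrophe inside a double-quoted field (like "isn't") is left alone. A line that uses both quote styles in different fields would still trip this up.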

-tkc
