Splitting a file based on specific column content


Yigit Turgut

Hi all,

I have a text file approximately 20 MB in size that contains about one
million lines. I was doing some processing on the data, but then the
data rate increased and it now takes a very long time to process. I
import using numpy.loadtxt; here is a fragment of the data:

0.000006 -0.0004
0.000071 0.0028
0.000079 0.0044
0.000086 0.0104
..
..
..

The first column is the timestamp in seconds and the second column is
the data. The file contains 8 seconds of measurement, and I would like
to be able to split the file into 3 parts separated at specific time
locations. For example, I want to divide the file into 3 parts: the
first part containing 3 seconds of data, the second containing 2
seconds of data and the third containing 3 seconds. Splitting based on
file size doesn't work that accurately for this specific data; some
columns end up missing, etc. I need to split depending on the column
content:

1 - read the file until the first character of column 1 is 3 (3 seconds)
2 - save this region to another file
3 - read the file where the first character of column 1 is between 3
and 5 (2 seconds)
4 - save this region to another file
5 - read the file where the first character of column 1 is between 5
and 8 (3 seconds)
6 - save this region to another file

I need to do this exactly because numpy.loadtxt and genfromtxt don't
deal well with missing columns / rows. I even tried the invalid_raise
parameter of genfromtxt, but no luck.

I am sure it's a few lines of code for experienced users and I would
appreciate some guidance.
 

Roy Smith

Yigit Turgut said:
[snip]

I would do this with standard unix tools:

grep '^[012]' input.txt > first-three-seconds.txt
grep '^[34]' input.txt > next-two-seconds.txt
grep '^[567]' input.txt > next-three-seconds.txt

Sure, it makes three passes over the data, but for 20 MB of data, you
could have the whole job done in less time than it took me to type this.

As a sanity check, I would run "wc -l" on each of the files and confirm
that they add up to the original line count.
 

MRAB

[snip]

Here's a solution in Python 3:

input_path = "..."
section_1_path = "..."
section_2_path = "..."
section_3_path = "..."

with open(input_path) as input_file:
    try:
        line = next(input_file)

        # Copy section 1.
        with open(section_1_path, "w") as output_file:
            while line[0] < "3":
                output_file.write(line)
                line = next(input_file)

        # Copy section 2.
        with open(section_2_path, "w") as output_file:
            while line[5] < "5":
                output_file.write(line)
                line = next(input_file)

        # Copy section 3.
        with open(section_3_path, "w") as output_file:
            while True:
                output_file.write(line)
                line = next(input_file)
    except StopIteration:
        pass
 

Arnaud Delobelle

Here's a solution in Python 3:

[snip]

Or more succinctly (but not tested):


sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    lines = iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)
 

MRAB

[snip]

Or more succinctly (but not tested):


sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    lines = iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)
Consider the condition "line >= end".

If it's true, then control will break out of the inner loop and start
the inner loop again, getting the next line.

But what of the line which caused it to break out? It'll be lost.
 

Yigit Turgut

[snip]

I would do this with standard unix tools:

grep '^[012]' input.txt > first-three-seconds.txt
grep '^[34]' input.txt > next-two-seconds.txt
grep '^[567]' input.txt > next-three-seconds.txt

Sure, it makes three passes over the data, but for 20 MB of data, you
could have the whole job done in less time than it took me to type this.

As a sanity check, I would run "wc -l" on each of the files and confirm
that they add up to the original line count.

This works and is very fast but it missed a few hundred lines
unfortunately.

[snip]

Here's a solution in Python 3:

[snip]

With the following correction:

while line[5] < "5":
should be
while line[0] < "5":

This works well.

[snip]

Or more succinctly (but not tested):

sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    lines = iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)

Good idea, especially when dealing with variable numbers of sections.
But somehow I got:

("5", "section_2")
TypeError: 'tuple' object is not callable
 

MRAB

On 22/01/2012 16:17, Yigit Turgut wrote:
[snip]

Good idea, especially when dealing with variable numbers of sections.
But somehow I got:

("5", "section_2")
TypeError: 'tuple' object is not callable
That's due to missing commas:

sections = [
    ("3", "section_1"),
    ("5", "section_2"),
    ("\xFF", "section_3")
]
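
For anyone wondering where that exception comes from: without the
commas, Python parses the two adjacent parenthesised tuples as a call,
roughly as if you had written

    ("3", "section_1")("5", "section_2")

which tries to call the first tuple with the second one as its
argument, hence "'tuple' object is not callable".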
 

Yigit Turgut

[snip]

That's due to missing commas:

sections = [
     ("3", "section_1"),
     ("5", "section_2"),
     ("\xFF", "section_3")
]

Thank you.
 

Roy Smith

I stand humbled.

I would do this with standard unix tools:

grep '^[012]' input.txt > first-three-seconds.txt
grep '^[34]' input.txt > next-two-seconds.txt
grep '^[567]' input.txt > next-three-seconds.txt

Sure, it makes three passes over the data, but for 20 MB of data, you
could have the whole job done in less time than it took me to type this.


If you wanted to do it in one pass using standard unix tools, you can use:

sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w next-two.txt' -e'/^[5-7]/w next-three.txt'

-tkc
 

Tim Chase

If you wanted to do it in one pass using standard unix
tools, you can use:

sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w next-two.txt' -e'/^[5-7]/w next-three.txt'
I stand humbled.

In all likelihood, you stand *younger*, not so much humbled ;-)

-tkc
 

Roy Smith

If you wanted to do it in one pass using standard unix
tools, you can use:

sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w next-two.txt' -e'/^[5-7]/w next-three.txt'
I stand humbled.

In all likelihood, you stand *younger*, not so much humbled ;-)


Oh, yeah? That must explain my grey hair and bifocals. I go back to Unix v6 in 1977. Humbled it is.
 

Arnaud Delobelle

On 22/01/2012 15:39, Arnaud Delobelle wrote: [...]
[snip]
Consider the condition "line >= end".

If it's true, then control will break out of the inner loop and start
the inner loop again, getting the next line.

But what of the line which caused it to break out? It'll be lost.

Of course you're correct - my reply was too rushed. Here's a
hopefully working version (but still untested :).

sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    line, lines = "", iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            output_file.write(line)
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)
 

Yigit Turgut

[snip]

Oh, yeah? That must explain my grey hair and bifocals. I go back to Unix v6 in 1977. Humbled it is.

Those times were much better IMHO :)
 

Eelco

The grep solution is not cross-platform, and not really an answer to a
question about python.

The by-line iteration examples are inefficient and bad practice from a
numpy/vectorization perspective.

I would advise doing it the numpythonic way (untested code):

import numpy as np

breakpoints = [3, 5]   # split at t = 3 s and t = 5 s, giving 3 s / 2 s / 3 s chunks
data = np.loadtxt('data.txt')
time = data[:, 0]
indices = np.searchsorted(time, breakpoints)
chunks = np.split(data, indices, axis=0)
for i, d in enumerate(chunks):
    np.savetxt('data' + str(i) + '.txt', d)

Not sure how it compares to the grep solution in terms of performance,
but that should be quite a non-issue for 20 MB of data, and it's sure
to blow the by-line iteration out of the water. If you want to be more
efficient, you are going to have to cut the text-to-numeric parsing
out of the loop, which is the vast majority of the computational load
here; but whether that's possible at all depends on how structured your
timestamps are. There would have to be a really compelling performance
gain to justify throwing the elegance of the np.split-based solution
out of the window, in my opinion.
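
If it helps to see the searchsorted/split mechanics in isolation, here
is a small toy check (made-up data rather than the original file; it
assumes the timestamps are sorted in ascending order, as in the sample
above):

import numpy as np

# Toy data: 80 samples at 0.1 s spacing covering 0..7.9 s; the second
# column is just a running index standing in for the measurement.
t = np.arange(0, 8, 0.1)
data = np.column_stack([t, np.arange(t.size)])

breakpoints = [3, 5]                      # split at t = 3 s and t = 5 s
indices = np.searchsorted(data[:, 0], breakpoints)
chunks = np.split(data, indices, axis=0)

for i, d in enumerate(chunks):
    # Expect roughly 3 s, 2 s and 3 s worth of samples per chunk.
    print(i, len(d), d[0, 0], d[-1, 0])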
 

MRAB

On 22/01/2012 15:39, Arnaud Delobelle wrote: [...]
[snip]
Of course you're correct - my reply was too rushed. Here's a
hopefully working version (but still untested :).

sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]
[snip]
Missing commas! :)
 
