Filtering content of a text file

Ira.Kovac · Jul 27, 2007

Hello All,

I'd greatly appreciate if you can take a look at the task I need help
with.

It'd be outstanding if someone can provide some sample Python code.

Thanks a lot,

Ira

-------------------------------------------------------------------------------
Problem
-------------------------------------------------------------------------------

I am working with 30K+ record datasets in flat file format (.txt) that
look like this:

//-+alibaba sinage
//-+amra damian//_9
//-+anix anire//_
//-+borom
//-+bokima sun drane
//-+ciren
//-+cop calestieon eded
//-+ciciban
//-+drago kimano sole

The records start with the same string (in the example //-+), which is
followed by another string of characters that changes from record
to record.

I am working on one file at a time, and for each file I need to be
able to do the following:

a) By looping thru the file the program should isolate all records
that have the letter a following the //-+
b) The isolated dataset will contain only records that start with //-+a
c) Save the isolated dataset as a flat text file named a.txt
d) Repeat a), b) and c) for all letters of the English alphabet (a thru z)
and numerical values (0 thru 9)

CP: An inch of time is an inch of gold but you can't buy that inch of
time with an inch of gold.

Amit Khemka · Jul 27, 2007

Well, that should be easy if you take a look at the methods in the "string" module.
A rough sketch would be:

    import string  # provides the ascii_lowercase and digits constants

    # create a list of lowercase letters and digits
    alnums = list(string.ascii_lowercase + string.digits)

    for alnum in alnums:
        outfile = open(alnum + '.txt', 'w')
        for line in open("myrecords.txt"):  # iterate over the records
            if line.startswith("//-+" + alnum):  # check your condition
                # write the matches to a file
                outfile.write(line)
        outfile.close()

However, rather than looping over the file for each alnum, you may just
iterate over the file once and check the character after the prefix
(if len(line) > 4: ch = line[4]); if it is an alnum, process it.
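
A minimal sketch of that single-pass idea follows. It is only one possible reading of the suggestion: the helper names split_records and write_groups are made up for illustration, and it assumes the whole file fits comfortably in memory (30K records should).

```python
import string

PREFIX = "//-+"  # record prefix taken from the original post
ALNUMS = set(string.ascii_lowercase + string.digits)


def split_records(lines, prefix=PREFIX):
    """Group records by the character that follows the prefix, in one pass."""
    groups = {}
    for line in lines:
        # The length check guards against lines that end right after the prefix.
        if line.startswith(prefix) and len(line) > len(prefix):
            ch = line[len(prefix)]
            if ch in ALNUMS:
                groups.setdefault(ch, []).append(line)
    return groups


def write_groups(groups):
    """Write each group to '<char>.txt' (a.txt, b.txt, ...)."""
    for ch, records in groups.items():
        with open(ch + ".txt", "w") as outfile:
            outfile.writelines(records)


sample = [
    "//-+alibaba sinage\n",
    "//-+amra damian//_9\n",
    "//-+borom\n",
    "//-+ciren\n",
    "//-+\n",  # nothing after the prefix but a newline: skipped
]
groups = split_records(sample)
print(sorted(groups))  # ['a', 'b', 'c']
```

For a real file, something like write_groups(split_records(open("myrecords.txt"))) would do the work. Unlike the loop above, this version only creates files for characters that actually occur in the data.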

Cheers,
--
----
Amit Khemka
website: www.onyomo.com
wap-site: www.owap.in
Home Page: www.cse.iitd.ernet.in/~csd00377

Endless the world's turn, endless the sun's Spinning, Endless the quest;
I turn again, back to my own beginning, And here, find rest.

Bruno Desthuilliers · Jul 27, 2007

(e-mail address removed) wrote:

> Hello All,
>
> I'd greatly appreciate if you can take a look at the task I need help
> with.
>
> It'd be outstanding if someone can provide some sample Python code.

No problem. It's 600 euro per day. Do I send you the contract?

This really looks like homework, and asking people to do your homework
for you is a pretty bad idea. On most newsgroups, the answer would stop
here, but c.l.py is a very friendly place, so I'll give you a couple of
starting points:

1/ for char in "abc":
       print("char is %s" % char)
       print("//-+%s" % char)

2/ for line in open('somefile'):
       print(line)

3/ print("//-+alibaba sinage"[4:])

4/ print("//-+alibaba sinage"[4:].startswith('a'))

5/ data = []
   data.append("//-+alibaba sinage\n")
   data.append("//-+amra damian//_9\n")
   print("".join(data))

6/ f = open('someotherfile.txt', 'w')
   f.write("line1\nline2\nline3\n")
   f.close()

This is all you need to know to complete your task.

Marc 'BlackJack' Rintsch · Jul 27, 2007

> 4/ print("//-+alibaba sinage"[4:].startswith('a'))

print("//-+alibaba sinage".startswith('a', 4))

This does not create an extra string from the slicing.
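
A quick side-by-side check of the two spellings, as a small sketch (the variable names are just for illustration):

```python
record = "//-+alibaba sinage"

# Slicing builds a new string first, then tests it.
via_slice = record[4:].startswith("a")

# The optional start argument tests in place, from index 4 onward.
via_offset = record.startswith("a", 4)

print(via_slice, via_offset)  # True True

# A start offset past the end of a short line gives False, not an error.
print("//-".startswith("a", 4))  # False
```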

Ciao,
Marc 'BlackJack' Rintsch

Bruno Desthuilliers · Jul 27, 2007

Marc 'BlackJack' Rintsch wrote:

>> 4/ print("//-+alibaba sinage"[4:].startswith('a'))
>
> print("//-+alibaba sinage".startswith('a', 4))
>
> This does not create an extra string from the slicing.

One learns every day...
Thanks Marc.

Marc 'BlackJack' Rintsch · Jul 27, 2007

> I am working with 30K+ record datasets in flat file format (.txt) that
> look like this:
>
> //-+alibaba sinage
> //-+amra damian//_9
> //-+anix anire//_
> //-+borom
> //-+bokima sun drane
> //-+ciren
> //-+cop calestieon eded
> //-+ciciban
> //-+drago kimano sole

The example seems to be sorted, is this true for the real data too? And
are there records that don't start with a-z or 0-9?

> a) By looping thru the file the program should isolate all records
> that have the letter a following the //-+
> b) The isolated dataset will contain only records that start with //-+a
> c) Save the isolated dataset as a flat text file named a.txt
> d) Repeat a), b) and c) for all letters of the English alphabet (a thru z)
> and numerical values (0 thru 9)

This might be a little bit inefficient because the file gets read 36
times. If the data is already sorted you can use `itertools.groupby()` to
get the groups and write them to several files. Otherwise if the files
can be read into memory completely you can sort in memory and then use
`itertools.groupby()`.
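
A rough sketch of the `itertools.groupby()` approach, assuming every line starts with the //-+ prefix and the input is already sorted by the character after it. The helper names key_char and group_sorted_records are hypothetical, not part of any library.

```python
from itertools import groupby

PREFIX = "//-+"  # record prefix taken from the original post


def key_char(line):
    """Return the character right after the prefix, or '' if there is none."""
    return line[len(PREFIX):len(PREFIX) + 1]


def group_sorted_records(lines):
    """Yield (char, records) pairs; assumes lines are already sorted by key."""
    for ch, records in groupby(lines, key=key_char):
        yield ch, list(records)


sample = [
    "//-+alibaba sinage\n",
    "//-+amra damian//_9\n",
    "//-+borom\n",
    "//-+ciren\n",
    "//-+cop calestieon eded\n",
]
for ch, records in group_sorted_records(sample):
    print(ch, len(records))
# a 2
# b 1
# c 2
```

Each group could then be written to ch + ".txt". If the data is not sorted, passing sorted(lines, key=key_char) instead of lines covers the in-memory case, because groupby() only merges adjacent items with the same key.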

Ciao,
Marc 'BlackJack' Rintsch

Bjoern Schliessmann · Jul 27, 2007

> I'd greatly appreciate if you can take a look at the task I need
> help with.
>
> It'd be outstanding if someone can provide some sample Python
> code.

Sure.

> CP: An inch of time is an inch of gold but you can't buy that inch
> of time with an inch of gold.

So, how much gold will I get for an "inch" of time?

Regards,

Björn

Ira.Kovac · Jul 27, 2007

Thanks all for the input. This is going to be a great basis for
starting. And, yeah - I wish it were homework.

Best,

Ira
