ignore specific data

pkilambi

Hi, I need help. When I read a file with some text content, I would
like to ignore a block of lines and process the rest. So if the block
looks like

"start of block............."
fjesdgsdhfgdlgjklfjdgkd
jhcsdfskdlgjkljgkfdjkgj
"end of block"

I want to ignore it while processing the file. This block could
appear anywhere in the file: at the start, at the end, or in the middle
of the file content.

Hope I'm clear...

Something like

    f = open("file")
    clean_data = ignore_block(f)

Here ignore_block should filter out the block:

    def ignore_block(f):
        ...  # filtering logic goes here
        return data  # maybe a list of the remaining lines
 
skip

pkilambi> I would like to ignore a block of lines and consider the
pkilambi> rest.. so if the block starts with

pkilambi> "start of block............."
pkilambi> fjesdgsdhfgdlgjklfjdgkd
pkilambi> jhcsdfskdlgjkljgkfdjkgj
pkilambi> "end of block"

pkilambi> I want to ignore this while processing the file .This block
pkilambi> could appear anywhere in the file.It could at the start or end
pkilambi> or even middle of file content.

How about (untested):

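    class FilterBlock:
        """Iterate over the lines of f, skipping each start..end block."""

        def __init__(self, f, start, end):
            self.f = iter(f)
            self.start = start
            self.end = end

        def __iter__(self):
            return self

        def __next__(self):  # spelled next() in Python 2
            line = next(self.f)
            # Lines read from a file keep their trailing newline, so strip
            # it before comparing against the markers.
            while line.rstrip("\n") == self.start:
                # Discard everything up to and including the end marker,
                # then carry on with the line that follows it.
                while line.rstrip("\n") != self.end:
                    line = next(self.f)
                line = next(self.f)
            return line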

Then use it like

    filterfile = FilterBlock(open("somefile", "r"),
                             "start of block..........",
                             "end of block")

    for line in filterfile:
        process(line)

I'm not sure what you mean by all the dots in your start of block line. If
"start of block" can be followed by other text, just use

    if line.startswith(self.start):

instead of an exact comparison.
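
To make the difference concrete, here is a small sketch (not part of the original reply) comparing the two tests on lines as they come out of a file-like object. The sample text and the names `sample`, `exact`, and `prefix` are made up for illustration; `io.StringIO` stands in for a real file.

```python
import io

# Stand-in for a real file; iterating over it yields lines that still end in "\n".
sample = io.StringIO("keep me\nstart of block.............\nnoise\n")
lines = list(sample)

start = "start of block"

# Exact comparison never matches here: the line has trailing dots and a newline.
exact = [line == start for line in lines]

# A prefix test tolerates anything that follows the marker on the same line.
prefix = [line.startswith(start) for line in lines]

print(exact)
print(prefix)
```

Note that even a marker with no trailing text would fail the exact comparison unless the newline is stripped first.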

Skip
 
Mike Meyer

The best way depends on how you're going to use the data. For
instance, if you're going to be processing a line at a time, you might
consider writing an iterator:

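    # Untested code:

    def filter_block(rawfile):  # renamed so it doesn't shadow the builtin filter
        lines = iter(rawfile)  # shared iterator: each loop resumes where the last stopped
        # Pass through everything before the block.
        for line in lines:
            if line.rstrip("\n") == "start of block......":
                break
            yield line
        # Skip the block itself, up to and including the end marker.
        for line in lines:
            if line.rstrip("\n") == "end of block":
                break
        # Pass through everything after the block.
        for line in lines:
            yield line

Then you use it like:

    myfile = open(...)
    for line in filter_block(myfile):
        process(line)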

This is a straightforward translation of your description, and avoids
loading the entire file into memory at once. You might be able to cons
up something more efficient from itertools.
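
One possible itertools version, offered as a sketch rather than as what the poster had in mind: `takewhile` passes lines through until the start marker, and `dropwhile` plus `islice` discard everything up to and including the end marker. The function name `filter_block_itertools` and its `start`/`end` parameters are invented here, the markers are matched anywhere in the line, and, like the generator above, it handles a single block only. Whether it is actually faster would need measuring.

```python
from itertools import chain, dropwhile, islice, takewhile


def filter_block_itertools(rawfile, start, end):
    """Lazily yield the lines of rawfile that lie outside one start..end block."""
    lines = iter(rawfile)  # both stages below must share one iterator
    # Everything before the start marker; takewhile consumes the marker line itself.
    before = takewhile(lambda line: start not in line, lines)
    # dropwhile discards the block body and stops at the end-marker line;
    # islice(..., 1, None) then drops that marker line as well.
    after = islice(dropwhile(lambda line: end not in line, lines), 1, None)
    # chain exhausts `before` first, so `after` only starts reading once
    # the start marker has been consumed.
    return chain(before, after)


sample = ["a\n", "start of block......\n", "x\n", "end of block\n", "b\n"]
kept = list(filter_block_itertools(sample, "start of block", "end of block"))
print(kept)
```

Nothing is read ahead of time, so the whole file never has to sit in memory.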

<mike
 
pkilambi

Thanks for that. But this checks for an exact match against
"start of block......" or "end of block". What if the marker can
appear anywhere in the line?
 
Mike Meyer

Then the test is '"start of block....." in line'. You could also use
the line.find or line.index methods, but those don't return booleans,
and so require some extra work to get what you want.
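
A short illustration of the difference, with a made-up sample line; `contains_marker` is a hypothetical helper name, not something from the thread:

```python
def contains_marker(line, marker):
    """Return True if marker occurs anywhere in line."""
    return marker in line


line = "### start of block ### licence text follows\n"
marker = "start of block"

found = contains_marker(line, marker)  # a real boolean
position = line.find(marker)           # an offset into the line, not a boolean
missing = line.find("end of block")    # -1 when absent, which is truthy in an if test

# index() behaves like find() but raises ValueError instead of returning -1.
try:
    line.index("end of block")
    index_raised = False
except ValueError:
    index_raised = True
```

So with find you have to compare against -1 explicitly, because a match at offset 0 is falsy while "not found" is truthy.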

<mike
 
pkilambi

I tried the solutions you provided, but they are not as robust as I
thought they would be. Maybe I should state the problem more clearly.

Here it goes:

I have a bunch of documents, and each document contains a header that
is common to all files. I read each file, process it, and compute the
frequency of words in it. Now I want to ignore the header in each
file. That is easy if the header is always at the top, but apparently
it's not: it can be at the bottom as well. So I want a function that
goes through the file content, ignores the common header, and returns
the remaining text so I can compute the frequencies. Also, the header
is not just one line: it includes licences and other material, and may
run to 50 or 60 lines. This "remove_header" has to be much more
efficient, as the files may be huge. Since this is a very small part
of the whole problem, I don't want it to slow down my entire code.
 
Bengt Richter

Does this "header" have a fixed constant-string beginning and a
similarly fixed end, with possibly variable text in between? And can
there be multiple headers (i.e., header+ instead of header)?

Assuming this is a grammar[1] of your file:

datafile: [leading_string] header+ [trailing_string]
header: header_start header_middle header_end

0) Is this a text file made up of lines, or something else?
1) Is header_start a fixed constant string?
2) Does header_start begin at the first character of a line?
3) Does it end at the end of that same line, or 3a) at the end of a
subsequent line?
4) Does header_end begin at the beginning of a line?
4a) Does it end at the end of that same line (like 3), or
4b) at the end of a subsequent line (like 3a)?
5) Can we assume header_middle never contains header_end in any
form (e.g. in quotes or comments)?
6) Anything else you can think of ;-)


[1] using [x] to mean optional x and some_name to mean a string composed
by some rules given by some_name: ... (or described in prose as here ;-)
and some_name+ to mean one or more some_name. (BTW some_name would mean
exactly one, [some_name] zero or one, some_name* zero or more, and some_name+
one or more.) What's needed is the final resolution to actual constants
or patterns of primitives. Can you define

header_start: "The actual fixed constant character string defining the header"
header_end: "whatever?"
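
Purely as an illustration of where the answers could lead, here is a sketch that assumes the simplest case: the file is line-oriented, header_start and header_end are fixed strings that each begin a line of their own, header_middle never contains header_end, and any number of headers may appear anywhere. The marker strings and the names `strip_headers` and `word_frequencies` are hypothetical; the thread never states the real markers.

```python
from collections import Counter

# Hypothetical markers; substitute the real header_start / header_end strings.
HEADER_START = "*** START OF HEADER ***"
HEADER_END = "*** END OF HEADER ***"


def strip_headers(lines, header_start=HEADER_START, header_end=HEADER_END):
    """Yield the lines that fall outside every header_start..header_end block."""
    in_header = False
    for line in lines:
        if in_header:
            # Inside a header: drop the line, and leave the header once
            # the end marker has been seen.
            if line.startswith(header_end):
                in_header = False
        elif line.startswith(header_start):
            in_header = True
        else:
            yield line


def word_frequencies(lines):
    """Count whitespace-separated words in the non-header lines."""
    counts = Counter()
    for line in strip_headers(lines):
        counts.update(line.split())
    return counts


sample = [
    "a b\n",
    "*** START OF HEADER ***\n",
    "licence text\n",
    "*** END OF HEADER ***\n",
    "b c\n",
]
freqs = word_frequencies(sample)
```

This makes a single pass and holds one line at a time, so file size should not matter much. It would need changes if a marker can start mid-line or if both markers can share a line.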

Regards,
Bengt Richter
 
