ignore specific data

pkilambi · Nov 21, 2005

Hi, I need help. When I read a file with some text content, I would
like to ignore a block of lines and process the rest. Say the block
looks like this:

"start of block............."
fjesdgsdhfgdlgjklfjdgkd
jhcsdfskdlgjkljgkfdjkgj
"end of block"

I want to ignore this block while processing the file. The block could
appear anywhere in the file: at the start, at the end, or in the middle
of the file content.

Hope I'm clear...

Something like

f = open("file")
clean_data = ignore_block(f)

where ignore_block should filter out the block:

def ignore_block(f):
    .............................
    return data # may be an array of remaining lines...

skip · Nov 21, 2005

pkilambi> I would like to ignore a block of lines and consider the
pkilambi> rest.. so if the block starts with

pkilambi> "start of block............."
pkilambi> fjesdgsdhfgdlgjklfjdgkd
pkilambi> jhcsdfskdlgjkljgkfdjkgj
pkilambi> "end of block"

pkilambi> I want to ignore this while processing the file. This block
pkilambi> could appear anywhere in the file. It could be at the start or
pkilambi> end or even middle of file content.

How about (untested):

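class FilterBlock:
    def __init__(self, f, start, end):
        self.f = f
        self.start = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.f)
        # Lines read from a file keep their newline, so strip it
        # before comparing. Loop in case blocks sit back to back.
        while line.rstrip("\n") == self.start:
            # Discard everything up to and including the end marker.
            while line.rstrip("\n") != self.end:
                line = next(self.f)
            line = next(self.f)
        return line

    next = __next__  # Python 2 spelling of the iterator method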

Then use it like

filterfile = FilterBlock(open("somefile", "r"),
                         "start of block..........",
                         "end of block")

for line in filterfile:
    process(line)

I'm not sure what you mean by all the dots in your start of block line. If
"start of block" can be followed by other text, just use

if line.startswith(self.start):

instead of an exact comparison.

Skip

Mike Meyer · Nov 21, 2005

The best way depends on how you're going to use the data. For
instance, if you're going to be processing a line at a time, you might
consider writing an iterator:

# Untested code:

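def filter_block(rawfile):
    # rawfile must be an iterator (such as an open file), so that each
    # loop below resumes where the previous one stopped.
    # Pass lines through until the start marker appears.
    for line in rawfile:
        if line.rstrip("\n") == "start of block......":
            break
        yield line
    # Discard the block body, up to and including the end marker.
    for line in rawfile:
        if line.rstrip("\n") == "end of block":
            break
    # Pass everything after the block through unchanged.
    for line in rawfile:
        yield line

Then you use it like:

myfile = open(...)
for line in filter_block(myfile):
    process(line)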

This is a straightforward translation of your description, and avoids
loading the entire file into memory at once. You might be able to cons
up something more efficient from itertools.
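
One possible itertools version, as a rough sketch only. The name skip_block and passing the markers in as arguments are assumptions for illustration, the markers are matched as substrings, and it handles a single block. Whether it is actually faster has not been measured.

```python
from itertools import chain, dropwhile, islice, takewhile


def skip_block(lines, start, end):
    """Return an iterator over lines, leaving out the first start..end block."""
    # One shared iterator, so each stage resumes where the last one stopped.
    it = iter(lines)
    # Everything before the start marker; takewhile also consumes the marker line.
    before = takewhile(lambda line: start not in line, it)
    # dropwhile discards the block body; islice then drops the end-marker line.
    after = islice(dropwhile(lambda line: end not in line, it), 1, None)
    return chain(before, after)


sample = [
    "keep 1\n",
    "start of block......\n",
    "junk\n",
    "end of block\n",
    "keep 2\n",
]
result = list(skip_block(sample, "start of block", "end of block"))
```

Here result holds only the two "keep" lines. If the start marker never appears, every line passes through.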

<mike

pkilambi · Nov 21, 2005

Thanks for that. But this checks for an exact match against
"start of block......" or "end of block". What if the marker text
can appear anywhere in the line?

Mike Meyer · Nov 21, 2005

Then the test is '"start of block....." in line'. You could also use
the line.find or line.index methods, but those don't return booleans,
and so require some extra work to get what you want.
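
A small illustration of the difference, using a made-up sample line (the variable names are only for this sketch):

```python
line = "xx start of block..... yy\n"
marker = "start of block"

# `in` gives a plain boolean, which is what an `if` test wants.
has_marker = marker in line

# find() returns the position of the match, or -1 when there is none.
# Note that -1 is truthy, so `if line.find(...)` is a trap.
pos = line.find(marker)
missing = line.find("end of block")

# index() behaves like find() but raises ValueError when there is no match.
index_raised = False
try:
    line.index("end of block")
except ValueError:
    index_raised = True
```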

<mike

pkilambi · Nov 21, 2005

I tried the solutions you provided, but they are not as robust as I
thought they would be. Maybe I should state the problem more clearly.

Here it goes:

I have a bunch of documents, and each document has a header that is
common to all files. I read each file, process it, and compute the
frequency of words in it. Now I want to ignore the header in each
file. That is easy if the header is always at the top, but apparently
it's not; it could be at the bottom as well. So I want a function that
goes through the file content, ignores the common header, and returns
the remaining text so I can compute the frequencies. Also, the header
is not just one line: it includes licences and other stuff, and may
run to 50 or 60 lines. This "remove_header" has to be efficient, as
the files may be huge. Since this is a very small part of the whole
problem, I don't want it to slow down my entire code.

Bengt Richter · Nov 22, 2005

Does this "header" have a fixed constant string at the beginning and a
similar fixed end, with possibly variable text in between? And can there
be multiple headers (i.e., header+ instead of header)?

Assuming this is a grammar[1] of your file:

datafile: [leading_string] header+ [trailing_string]
header: header_start header_middle header_end

0) Is this a text file of lines, or something else?
1) Is header_start a fixed constant string?
2) Does header_start begin with the first character of a line?
3) Does it end at the end of the same line, or 3a) of a subsequent line?
4) Does header_end begin at the beginning of a line?
4a) Does it end at the end of the same line (like 3)?
4b) Or of a subsequent line (like 3a)?
5) Can we assume header_middle never contains header_end in any
form (e.g. in quotes or comments)?
6) Anything else you can think of ;-)

[1] using [x] to mean optional x, and some_name to mean a string composed
by the rules given by some_name: ... (or described in prose, as here ;-).
A bare some_name means exactly one, [some_name] zero or one, some_name*
zero or more, and some_name+ one or more. What's needed is the final
resolution to actual constants or patterns of primitives. Can you define

header_start: "The actual fixed constant character string defining the header"
header_end: "whatever?"

Regards,
Bengt Richter
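
Until those questions are answered, here is a minimal sketch under stated assumptions: the file is line-oriented text, both markers are fixed strings that may appear anywhere in a line, header_middle never contains header_end, and headers may occur more than once. The marker strings "BEGIN LICENSE" and "END LICENSE" and the function names strip_headers and word_counts are hypothetical, not taken from the thread. It reads one line at a time, so a huge file is never held in memory.

```python
from collections import Counter


def strip_headers(lines, header_start, header_end):
    """Yield the lines that fall outside every header_start..header_end block."""
    in_header = False
    for line in lines:
        if in_header:
            # Inside a header: drop the line, and leave the header
            # once the end marker has been seen.
            if header_end in line:
                in_header = False
        elif header_start in line:
            # Stay in header mode unless the end marker follows the
            # start marker on this same line.
            rest = line.split(header_start, 1)[1]
            in_header = header_end not in rest
        else:
            yield line


def word_counts(lines):
    """Count whitespace-separated words, ignoring case."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Hypothetical document with the header at the top and again in the middle.
doc = [
    "BEGIN LICENSE\n",
    "licence text\n",
    "END LICENSE\n",
    "spam eggs spam\n",
    "BEGIN LICENSE\n",
    "more licence text\n",
    "END LICENSE\n",
    "eggs\n",
]
counts = word_counts(strip_headers(doc, "BEGIN LICENSE", "END LICENSE"))
```

With this sample, only "spam" and "eggs" are counted. An open file object can be passed in place of the doc list.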
