ignore specific data

Discussion in 'Python' started by pkilambi@gmail.com, Nov 21, 2005.

  1. Guest

    Hi, I need help. I am reading a file with some text content, and I
    would like to ignore a block of lines and process the rest. The block
    looks like this:

    "start of block............."
    fjesdgsdhfgdlgjklfjdgkd
    jhcsdfskdlgjkljgkfdjkgj
    "end of block"

    I want to ignore this block while processing the file. It could
    appear anywhere in the file: at the start, at the end, or in the
    middle of the content.

    I hope that's clear. Something like:

    f = open("file")
    clean_data = ignore_data(f)

    Here ignore_data should filter out the block:

    def ignore_data(f):
        .............................
        return data # maybe an array of the remaining lines...
     
    , Nov 21, 2005
    #1

  2. Guest

    pkilambi> I would like to ignore a block of lines and consider the
    pkilambi> rest.. so if the block starts with

    pkilambi> "start of block............."
    pkilambi> fjesdgsdhfgdlgjklfjdgkd
    pkilambi> jhcsdfskdlgjkljgkfdjkgj
    pkilambi> "end of block"

    pkilambi> I want to ignore this while processing the file .This block
    pkilambi> could appear anywhere in the file.It could at the start or end
    pkilambi> or even middle of file content.

    How about (untested):

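    class FilterBlock:
        """Iterate over the lines of f, leaving out one start..end block."""

        def __init__(self, f, start, end):
            self.f = f
            self.start = start
            self.end = end

        def __iter__(self):
            return self

        def __next__(self):
            line = next(self.f)
            # File iteration keeps the trailing newline, so strip it first.
            if line.rstrip("\n") == self.start:
                # Skip forward to the end marker, then take the line after it.
                while line.rstrip("\n") != self.end:
                    line = next(self.f)
                line = next(self.f)
            return line

        next = __next__  # Python 2 name for the same method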

    Then use it like

    filterfile = FilterBlock(open("somefile", "r"),
                             "start of block..........",
                             "end of block")

    for line in filterfile:
        process(line)

    I'm not sure what you mean by all the dots in your start of block line. If
    "start of block" can be followed by other text, just use

    if line.startswith(self.start):

    instead of an exact comparison.
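
For comparison, here is a minimal generator sketch of the same idea using startswith. It is not Skip's code: the name skip_blocks and the sample lines are made up for illustration, and it assumes each marker sits at the beginning of its line. Unlike the class above, it also copes with more than one block per file.

```python
def skip_blocks(lines, start, end):
    """Yield lines, dropping every block from a start line to its end line."""
    inside = False
    for line in lines:
        if inside:
            # Still inside a block: drop the line, leave on the end marker.
            if line.startswith(end):
                inside = False
        elif line.startswith(start):
            # Start marker found: drop it and everything up to the end marker.
            inside = True
        else:
            yield line


# Hypothetical sample data standing in for a real file object.
sample = [
    "keep 1\n",
    "start of block.............\n",
    "fjesdgsdhfgdlgjklfjdgkd\n",
    "end of block\n",
    "keep 2\n",
]
for line in skip_blocks(sample, "start of block", "end of block"):
    print(line, end="")
```

Because it only needs an iterable of lines, the same function works on an open file, so the file is never loaded into memory in one piece.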

    Skip
     
    , Nov 21, 2005
    #2

  3. Mike Meyer Guest

    writes:
    > Hi I need help. What I want to do is If I read a file with some text
    > content...
    > I would like to ignore a block of lines and consider the rest..
    > so if the block starts with
    >
    > "start of block............."
    > fjesdgsdhfgdlgjklfjdgkd
    > jhcsdfskdlgjkljgkfdjkgj
    > "end of block"
    >
    > I want to ignore this while processing the file .This block could
    > appear anywhere in the file.It could at the start or end or even middle
    > of file content.


    The best way depends on how you're going to use the data. For
    instance, if you're going to be processing a line at a time, you might
    consider writing an iterator:

    # Untested code:

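    def filter(rawfile):  # note: this name shadows the built-in filter
        # Pass through every line before the start marker.
        for line in rawfile:
            # File iteration keeps the trailing newline, so strip it first.
            if line.rstrip("\n") == "start of block......":
                break
            yield line
        # Discard lines up to and including the end marker.
        for line in rawfile:
            if line.rstrip("\n") == "end of block":
                break
        # Pass through everything after the block.
        for line in rawfile:
            yield line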

    Then you use it like:

    myfile = open(...)
    for line in filter(myfile):
        process(line)

    This is a straightforward translation of your description, and avoids
    loading the entire file into memory at once. You might be able to cons
    up something more efficient from itertools.
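
One possible itertools version, as a rough sketch rather than anything from the post: takewhile passes lines through until the start marker, dropwhile discards lines until the end marker, and islice drops the end marker line itself. The name skip_block is hypothetical, the markers are matched as substrings, and only a single block is handled. Whether this is actually faster than the generator has not been measured.

```python
from itertools import chain, dropwhile, islice, takewhile


def skip_block(lines, start, end):
    """Lazily yield lines, leaving out one block from start to end."""
    lines = iter(lines)  # both stages must share a single iterator
    # Everything before the start marker (the marker line is consumed).
    before = takewhile(lambda line: start not in line, lines)
    # Drop up to the end marker, then skip the end marker line itself.
    after = islice(dropwhile(lambda line: end not in line, lines), 1, None)
    return chain(before, after)


# Hypothetical sample data standing in for a real file object.
sample = ["intro\n", "start of block......\n", "junk\n", "end of block\n", "body\n"]
for line in skip_block(sample, "start of block", "end of block"):
    print(line, end="")
```

The order of evaluation matters here: chain exhausts the takewhile stage before it touches the dropwhile stage, which is why both can safely read from the same iterator.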

    <mike
    --
    Mike Meyer http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Nov 21, 2005
    #3
  4. Guest

    Thanks for that. But this checks for an exact match against
    "start of block......" or "end of block". What if that text can appear
    anywhere in the line?
     
    , Nov 21, 2005
    #4
  5. Mike Meyer Guest

    writes:

    > thanks for that. But this will check for the exact content of the
    > "start of block......" or "end of block". How about if the content is
    > anywhere in the line?


    Then the test is '"start of block....." in line'. You could also use
    the line.find or line.index methods, but those don't return booleans,
    and so require some extra work to get what you want.
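
A small illustration of the difference, with a made-up sample line and a hypothetical helper name has_marker:

```python
def has_marker(line, marker):
    """Return True if marker occurs anywhere in line."""
    return marker in line


line = "### start of block ........ ###\n"  # made-up sample line
marker = "start of block"

print(has_marker(line, marker))   # substring test, already a boolean
print(line.find(marker) != -1)    # find returns an offset, or -1 if absent

try:
    line.index("end of block")    # index raises ValueError if absent
except ValueError:
    print("no end marker on this line")
```

The extra work is the comparison against -1 for find and the try/except for index.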

     
    Mike Meyer, Nov 21, 2005
    #5
  6. Guest

    I tried the solutions you provided, but they are not as robust as I
    thought they would be. Maybe I should state the problem more clearly.

    Here it goes.

    I have a bunch of documents, and each document has a header which is
    common to all files. I read each file, process it, and compute the
    frequency of words in it. Now I want to ignore the header in each
    file. That is easy if the header is always at the top, but apparently
    it's not: it could be at the bottom as well. So I want a function
    which goes through the file content, ignores the common header, and
    returns the remaining text for computing the frequencies.

    Also, the header is not just one line. It includes licences and other
    material, and may run to 50 or 60 lines. This "remove_header" has to
    be much more efficient, as the files may be huge. As this is a very
    small part of the whole problem, I don't want it to slow down my
    entire code.
     
    , Nov 21, 2005
    #6
  7. Bengt Richter Guest

    On 21 Nov 2005 13:59:12 -0800, wrote:

    >I tried the solutions you provided..these are not as robust as i
    >thought would be...
    >may be i should put the problem more clearly...
    >
    >here it goes....
    >
    >I have a bunch of documents and each document has a header which is
    >common to all files. I read each file process it and compute the
    >frequency of words in each file. now I want to ignore the header in
    >each file. It is easy if the header is always at the top. but
    >apparently its not. it could be at the bottom as well. So I want a
    >function which goes through the file content and ignores the common
    >header and return the remaining text to compute the frequencies..Also
    >the header is not just one line..it includes licences and all other
    >stuff and may be 50 to 60 lines as well..This "remove_header" has to be
    >much more efficient as the files may be huge. As this is a very small
    >part of the whole problem i dont want this to slow down my entire
    >code...
    >

    Does this "header" have a fixed constant string at its beginning and a
    similar fixed end, with possibly variable text between? And can there
    be multiple headers (i.e., header+ instead of header)?

    Assuming this is a grammar[1] of your file:

    datafile: [leading_string] header+ [trailing_string]
    header: header_start header_middle header_end

    0) is this a text file of lines? or?
    1) is header_start a fixed constant string?
    2) does header_start begin with the first character of a line?
    3) does it end with the end of the same or 3a) subsequent line?
    4) does header_end begin at the beginning of a line?
    4a) like 3
    4b) like 3a
    5) can we ignore header_middle as never containing header_end in any
    form (e.g. in quotes or comments etc)?
    6) Anything else you can think of ;-)


    [1] using [x] to mean optional x and some_name to mean a string composed
    by some rules given by some_name: ... (or described in prose as here ;-)
    and some_name+ to mean one or more some_name. (BTW some_name would mean
    exactly one, [some_name] zero or one, some_name* zero or more, and some_name+
    one or more). What's needed is the final resolution to actual constants
    or patterns of primitives. Can you define

    header_start: "The actual fixed constant character string defining the header"
    header_end: "whatever?"
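
If the answers turned out to be the simple ones, the grammar could be handled with a regular expression. The following is only a sketch under assumptions the thread does not confirm: the file is text, header_start and header_end are fixed strings that each begin a line, header_middle never contains header_end, and the text fits in memory (which may not hold for huge files). The marker values and the name strip_headers are placeholders.

```python
import re

# Placeholder markers; the real header_start and header_end are unknown.
HEADER_START = "start of block"
HEADER_END = "end of block"


def strip_headers(text, start=HEADER_START, end=HEADER_END):
    """Remove every header (header+ in the grammar) from text."""
    pattern = re.compile(
        # Start marker at the beginning of a line, then as little as
        # possible, then the end marker at the beginning of a line and
        # whatever follows it on that line.
        r"^" + re.escape(start) + r".*?^" + re.escape(end) + r"[^\n]*\n?",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.sub("", text)


# Made-up sample text with two headers.
sample = (
    "intro\n"
    "start of block.....\nlicence text\nend of block\n"
    "body\n"
    "start of block\nmore licence text\nend of block\n"
)
print(strip_headers(sample), end="")
```

The non-greedy .*? is what keeps two separate headers from being merged into one match that would swallow the text between them.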

    Regards,
    Bengt Richter
     
    Bengt Richter, Nov 22, 2005
    #7