Discussion on some Code Issues

Discussion in 'Python' started by subhabangalore@gmail.com, Jul 5, 2012.

  1. Guest

    Dear Group,

    I am Sri Subhabrata Banerjee, trying to write from Gurgaon, India, to discuss some coding issues. If anyone in this learned room can shed some light, I would be grateful.

    I got to code a bunch of documents which are combined together.
    Like,

    1) A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.
    2) The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.
    3) A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

    The task is to separate the documents on the fly and to parse each of the documents with a definite set of rules.

    Now, the way I am processing is:
    I am clubbing all the documents together, as,

    A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing. The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection. A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

    But they are separated by a tag set, like,
    A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.$
    The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.$
    A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

    To detect the document boundaries, I am splitting them into a bag of words and using a simple for loop as,

    for i in range(len(bag_words)):
        if bag_words[i] == "$":
            print(bag_words[i], i)

    There is no issue. I am segmenting it nicely. I am using an annotated corpus, so I am applying parse rules.
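    For what it's worth, a self-contained sketch of this kind of boundary detection (the sample text and all names here are made up for illustration):

    ```python
    # Hypothetical sample data: three tiny "documents" joined with "$" tokens.
    text = "first doc . $ second doc . $ third doc ."
    bag_words = text.split()

    # enumerate gives index and word together, replacing range(len(...)).
    boundaries = [i for i, word in enumerate(bag_words) if word == "$"]

    # Slice the token list into one sub-list per document.
    docs, start = [], 0
    for b in boundaries + [len(bag_words)]:
        docs.append(bag_words[start:b])
        start = b + 1

    print(docs)
    ```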

    The confusion comes next,

    As per my problem statement, the size of the file (of the documents combined together) won’t increase on the fly. So, just to support all kinds of combinations, I am appending the “I” values to a list, taking its length, and using slices. It works perfectly. The question is: is there a smarter way to achieve this? And a curious question: if the documents arrive on the fly, with no preprocessed tag like “$”, how may I do it? From a bunch without an EOF, isn’t it a classification problem?

    There is no question on parsing; it seems I am achieving it independent of the length of the document.

    Can anyone in the group suggest how I am dealing with the problem, which portions should be improved, and how?

    Thanking You in Advance,

    Best Regards,
    Subhabrata Banerjee.
     
    , Jul 5, 2012
    #1

  2. On Wed, 04 Jul 2012 16:21:46 -0700, subhabangalore wrote:

    [...]
    > I got to code a bunch of documents which are combined together.

    [...]
    > The task is to separate the documents on the fly and to parse each of
    > the documents with a definite set of rules.
    >
    > Now, the way I am processing is:
    > I am clubbing all the documents together, as,

    [...]
    > But they are separated by a tag set

    [...]
    > To detect the document boundaries,


    Let me see if I understand your problem.

    You have a bunch of documents. You stick them all together into one
    enormous lump. And then you try to detect the boundaries between one file
    and the next within the enormous lump.

    Why not just process each file separately? A simple for loop over the
    list of files, before consolidating them into one giant file, will avoid
    all the difficulty of trying to detect boundaries within files.

    Instead of:

    merge(output_filename, list_of_files)
    for word in parse(output_filename):
        if boundary_detected: do_something()
        process(word)

    Do this instead:

    for filename in list_of_files:
        do_something()
        for word in parse(filename):
            process(word)
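    In concrete Python, that per-file approach might look like the following (the directory, filenames, and the trivial "parse" step are all placeholders invented for this sketch):

    ```python
    import os
    import tempfile

    # Hypothetical setup: three small files standing in for the real documents.
    tmpdir = tempfile.mkdtemp()
    for n, body in enumerate(["doc one text", "doc two text", "doc three text"], 1):
        with open(os.path.join(tmpdir, "doc%d.txt" % n), "w") as f:
            f.write(body)

    # One file per document: no boundary markers, no boundary detection.
    results = []
    for filename in sorted(os.listdir(tmpdir)):
        with open(os.path.join(tmpdir, filename)) as f:
            words = f.read().split()   # stand-in for the real parse step
        results.append(words)

    print(results)
    ```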


    > I am splitting them into a bag of
    > words and using a simple for loop as,
    > for i in range(len(bag_words)):
    >     if bag_words[i] == "$":
    >         print(bag_words[i], i)



    What happens if a file already has a $ in it?


    > There is no issue. I am segmenting it nicely. I am using annotated
    > corpus so applying parse rules.
    >
    > The confusion comes next,
    >
    > As per my problem statement the size of the file (of documents combined
    > together) won’t increase on the fly. So, just to support all kinds of
    > combinations I am appending in a list the “I” values, taking its length,
    > and using slice. Works perfect.


    I don't understand this. What sort of combinations do you think you need
    to support? What are "I" values, and why are they important?



    --
    Steven
     
    Steven D'Aprano, Jul 5, 2012
    #2

  3. Rick Johnson Guest

    On Jul 4, 6:21 pm, wrote:
    > [...]
    > To detect the document boundaries, I am splitting them into a bag
    > of words and using a simple for loop as,
    >
    > for i in range(len(bag_words)):
    >         if bag_words[i] == "$":
    >             print(bag_words[i], i)


    Ignoring that you are attacking the problem incorrectly: that is a very
    poor method of splitting a string, especially since the Python gods
    have given you *power* over string objects. But you are going to have
    an even greater problem if the string contains a "$" char that you DID
    NOT insert :-O. You'd be wise to use a sep that is not likely to be in
    the file data. For example: "<SEP>" or "<SPLIT-HERE>". But even that
    approach is naive! Why not streamline the entire process and pass a
    list of file paths to a custom parser object instead?
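    As a sketch of the separator idea, str.split can carry the whole job (the data and marker below are illustrative only):

    ```python
    # Illustrative data: three documents joined with an unlikely marker.
    data = ("A first document.<SEP>"
            "A second document with a literal $ in it.<SEP>"
            "A third document.")

    # str.split does all the work; a "$" inside a document is now harmless.
    docs = data.split("<SEP>")
    print(len(docs), docs[1])
    ```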
     
    Rick Johnson, Jul 5, 2012
    #3
  4. Guest

    On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
    > [...]



    Hi Steven, it is nice to see your post. Your posts are nice and I have learnt so many things from you. "I" is for the index of the loop.
    Now my clarification: I thought to do "import os" and process the files in a loop, but that is not my problem statement. I have to make one big lump of text and detect one chunk. I am not looping over the line numbers of the file, because then I may not be able to take the slices, and I need them. I thought to give re.findall a try, but that does not give me the slices. The slices spread out here. The power of strings! I would definitely give it a try. Happy Day Ahead. Regards, Subhabrata Banerjee.
     
    , Jul 5, 2012
    #4
  5. Peter Otten Guest

    wrote:

    > On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
    >> [...]
    >
    >
    > Hi Steven, It is nice to see your post. They are nice and I learnt so many
    > things from you. "I" is for index of the loop. Now my clarification I
    > thought to do "import os" and process files in a loop but that is not my
    > problem statement. I have to make a big lump of text and detect one chunk.
    > Looping over the line number of file I am not using because I may not be
    > able to take the slices-this I need. I thought to give re.findall a try
    > but that is not giving me the slices. Slice spreads here. The power issue
    > of string! I would definitely give it a try. Happy Day Ahead Regards,
    > Subhabrata Banerjee.


    Then use re.finditer():

    start = 0
    for match in re.finditer(r"\$", data):
        end = match.start()
        print(start, end)
        print(data[start:end])
        start = match.end()

    This will omit the last text. The simplest fix is to put another "$"
    separator at the end of your data.
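    A quick, self-contained run of that loop with the trailing "$" added (the sample data is invented for the demonstration):

    ```python
    import re

    # Invented sample; note the extra "$" appended so the last text is kept.
    data = "first text.$second text.$third text." + "$"

    pieces = []
    start = 0
    for match in re.finditer(r"\$", data):
        end = match.start()
        pieces.append(data[start:end])   # the slice between two separators
        start = match.end()

    print(pieces)
    ```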
     
    Peter Otten, Jul 5, 2012
    #5
  6. Guest

    Dear Peter,
    That is a nice one. I am thinking I could write a "for line in f" sort of code, which is easy, but then how do I find out the slices? By the way, do you know whether, in any case, I may convert an index position in the file to the list position, provided I am writing the list from the same file we are reading?

    Best Regards,
    Subhabrata.

    On Thursday, July 5, 2012 1:00:12 PM UTC+5:30, Peter Otten wrote:
    > [...]
    >
    > Then use re.finditer():
    >
    > start = 0
    > for match in re.finditer(r"\$", data):
    >     end = match.start()
    >     print(start, end)
    >     print(data[start:end])
    >     start = match.end()
    >
    > This will omit the last text. The simplest fix is to put another "$"
    > separator at the end of your data.
     
    , Jul 5, 2012
    #6
  8. Peter Otten Guest

    wrote:

    [Please don't top-post]

    >> start = 0
    >> for match in re.finditer(r"\$", data):
    >> end = match.start()
    >> print(start, end)
    >> print(data[start:end])
    >> start = match.end()


    > That is a nice one. I am thinking if I can write "for lines in f" sort of
    > code that is easy but then how to find out the slices then,


    You have to keep track both of the offset of the line and the offset within
    the line:

    def offsets(lines, pos=0):
        for line in lines:
            yield pos, line
            pos += len(line)

    start = 0
    for line_start, line in offsets(lines):
        for pos, part in offsets(re.split(r"(\$)", line), line_start):
            if part == "$":
                print(start, pos)
                start = pos + 1

    (untested code, I'm assuming that the file ends with a $)

    > btw do you
    > know in any case may I convert the index position of file to the list
    > position provided I am writing the list for the same file we are reading.


    Use a lookup list with the end positions of the texts and then find the
    relevant text with bisect.

    >>> import bisect
    >>> ends = [10, 20, 50]
    >>> filepos = 15
    >>> bisect.bisect(ends, filepos)
    1    # position 15 belongs to the second text
     
    Peter Otten, Jul 6, 2012
    #8
  9. Guest

    On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
    > [...]


    Thanks Peter but I feel your earlier one was better, I got an interesting one:
    [i for i in range(len(f1)) if f1.startswith("$", i)]

    But I am bit intrigued with another question,

    suppose I say:
    file_open=open("/python32/doc1.txt","r")
    file=a1.read().lower()
    for line in file:
        line_word=line.split()

    This works fine. But if I print it, it gets printed continuously.
    I would like to store it in some variable, so that I may print lines of my choice and manipulate them at my choice.
    Is there any way out of this problem?


    Regards,
    Subhabrata Banerjee
     
    , Jul 7, 2012
    #9
  10. On Sat, 7 Jul 2012 12:54:16 -0700 (PDT),
    declaimed the following in gmane.comp.python.general:

    > But I am bit intrigued with another question,
    >
    > suppose I say:
    > file_open=open("/python32/doc1.txt","r")
    > file=a1.read().lower()
    > for line in file:
    >     line_word=line.split()
    >
    > This works fine. But if I print it would be printed continuously.


    "This works fine" -- Really?

    1) Why are you storing data files in the install directory of your
    Python interpreter?

    2) "a1" is undefined -- you should get an exception on that line which
    makes the following irrelevant; replacing "a1" with "file_open" leads
    to...

    3) "file" is a) a predefined function in Python, which you have just
    shadowed and b) a poor name for a string containing the contents of a
    file

    4) "for line in file", since "file" is a string, will iterate over EACH
    CHARACTER, meaning (since there is nothing to split) that "line_word" is
    also just a single character.

    for line in file.split("\n"):

    will split the STRING into logical lines (assuming a new-line character
    splits the lines) and permit the subsequent split to pull out wordS
    ("line_word" is misleading, as it will contain a LIST of words from the
    line).

    > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
    > Is there any way out to this problem?
    >
    >
    > Regards,
    > Subhabrata Banerjee

    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Jul 7, 2012
    #10
  11. Guest

    On Sunday, July 8, 2012 2:21:14 AM UTC+5:30, Dennis Lee Bieber wrote:
    > On Sat, 7 Jul 2012 12:54:16 -0700 (PDT),
    > declaimed the following in gmane.comp.python.general:
    >
    > > But I am bit intrigued with another question,
    > >
    > > suppose I say:
    > > file_open=open("/python32/doc1.txt","r")
    > > file=a1.read().lower()
    > > for line in file:
    > >     line_word=line.split()
    > >
    > > This works fine. But if I print it would be printed continuously.

    >
    > "This works fine" -- Really?
    >
    > 1) Why are you storing data files in the install directory of your
    > Python interpreter?
    >
    > 2) "a1" is undefined -- you should get an exception on that line which
    > makes the following irrelevant; replacing "a1" with "file_open" leads
    > to...
    >
    > 3) "file" is a) a predefined function in Python, which you have just
    > shadowed and b) a poor name for a string containing the contents of a
    > file
    >
    > 4) "for line in file", since "file" is a string, will iterate over EACH
    > CHARACTER, meaning (since there is nothing to split) that "line_word" is
    > also just a single character.
    >
    > for line in file.split("\n"):
    >
    > will split the STRING into logical lines (assuming a new-line character
    > splits the lines) and permit the subsequent split to pull out wordS
    > ("line_word" is misleading, as to will contain a LIST of words from the
    > line).
    >
    > > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
    > > Is there any way out to this problem?
    > >
    > >
    > > Regards,
    > > Subhabrata Banerjee

    > --
    > Wulfraed Dennis Lee Bieber AF6VN
    > HTTP://wlfraed.home.netcom.com/


    Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,

    file_open=open("/python32/doc1.txt","r")
    for line in file_open:
        line_word=line.split()
        print (line_word)

    To store them, the best way is to assign a blank list and append to it, but is there any alternate
    method? For huge data it becomes tough, as the list becomes huge; is there any way variables may be assigned instead?
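
    A minimal sketch of the append-to-a-blank-list approach just described (the file name and contents are invented so the snippet is self-contained):

    ```python
    # Create a small sample file ("doc1.txt" is a stand-in for the real data):
    with open("doc1.txt", "w") as f:
        f.write("first line here\nsecond line there\n")

    # Append each line's word list to an initially blank list:
    all_words = []
    with open("doc1.txt") as f:
        for line in f:
            all_words.append(line.split())

    print(all_words[1])   # words of the second line: ['second', 'line', 'there']
    ```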

    Regards,
    Subhabrata Banerjee.
     
    , Jul 8, 2012
    #11
  13. On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
    >
    > file_open=open("/python32/doc1.txt","r")
    > for line in file_open:
    >     line_word=line.split()
    >     print (line_word)


    Yep. I'd be inclined to rename file_open to something that says what
    the file _is_, and you may want to look into the 'with' statement to
    guarantee timely closure of the file, but that's a way to do it.

    Also, as has already been mentioned: keeping your data files in the
    Python binaries directory isn't usually a good idea. More common to
    keep them in the same directory as your script, which would mean that
    you don't need a path on it at all.
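
    Put together, that advice might look like this sketch (file name and contents invented for illustration):

    ```python
    # A bare filename is resolved relative to the current working directory,
    # so no /python32/-style path is needed. The 'with' block guarantees the
    # file is closed when the block ends, even if an exception occurs.
    with open("doc1.txt", "w") as f:           # create a tiny sample file
        f.write("A bomb explosion outside a shopping mall\n")

    with open("doc1.txt") as report_file:      # name says what the file is
        for line in report_file:
            words = line.split()

    print(report_file.closed)   # True
    ```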

    ChrisA
     
    Chris Angelico, Jul 8, 2012
    #13
  14. Guest

    On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
    > On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    > > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
    > >
    > > file_open=open("/python32/doc1.txt","r")
    > > for line in file_open:
    > >     line_word=line.split()
    > >     print (line_word)

    >
    > Yep. I'd be inclined to rename file_open to something that says what
    > the file _is_, and you may want to look into the 'with' statement to
    > guarantee timely closure of the file, but that's a way to do it.
    >
    > Also, as has already been mentioned: keeping your data files in the
    > Python binaries directory isn't usually a good idea. More common to
    > keep them in the same directory as your script, which would mean that
    > you don't need a path on it at all.
    >
    > ChrisA


    Dear Chris,
    No file path! Amazing. I did not know that; I would like to see one small example, please.
    Btw, some earlier post said that line.split(), to convert a line into a bag of words, can be done with power(), but I did not find it; can anyone help? I do close files, do not worry. The new style I'd try.

    Regards,
    Subha
     
    , Jul 8, 2012
    #14
  16. On Mon, Jul 9, 2012 at 3:05 AM, <> wrote:
    > On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
    >> On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    >> > file_open=open("/python32/doc1.txt","r")

    >> Also, as has already been mentioned: keeping your data files in the
    >> Python binaries directory isn't usually a good idea. More common to
    >> keep them in the same directory as your script, which would mean that
    >> you don't need a path on it at all.

    > No file path! Amazing. I do not know I like to know one small example please.


    open("doc1.txt","r")

    Python will look for a file called doc1.txt in the directory you run
    the script from (which is often going to be the same directory as your
    .py program).

    > Btw, some earlier post said, line.split() to convert line into bag of words can be done with power(), but I did not find it, if any one can help. I do close files do not worry. New style I'd try.


    I don't know what power() function you're talking about, and can't
    find it in the previous posts; the nearest I can find is a post from
    Ranting Rick which says a lot of guff that you can ignore. (Rick is a
    professional troll. Occasionally he says something useful and
    courteous; more often it's one or the other, or neither.)

    As to the closing of files: There are a few narrow issues that make it
    worth using the 'with' statement, such as exceptions; mostly, it's
    just a good habit to get into. If you ignore it, your file will
    *usually* be closed fairly soon after you stop referencing it, but
    there's no guarantee. (Someone else will doubtless correct me if I'm
    wrong, but I'm pretty sure Python guarantees to properly flush and
    close on exit, but not necessarily before.)

    ChrisA
     
    Chris Angelico, Jul 8, 2012
    #16
  17. Roy Smith Guest

    In article <>,
    Chris Angelico <> wrote:

    > open("doc1.txt","r")
    >
    > Python will look for a file called doc1.txt in the directory you run
    > the script from (which is often going to be the same directory as your
    > .py program).


    Well, to pick a nit, the file will be looked for in the current working
    directory. This may or may not be the directory you ran your script
    from. Your script could have executed chdir() between the time you
    started it and you tried to open the file.

    To pick another nit, it's misleading to say, "Python will look for...".
    This implies that Python somehow gets involved in pathname resolution,
    when it doesn't. Python just passes paths to the operating system as
    opaque strings, and the OS does all the magic of figuring out what that
    string means.
     
    Roy Smith, Jul 8, 2012
    #17
  18. MRAB Guest

    On 08/07/2012 18:17, Chris Angelico wrote:
    > On Mon, Jul 9, 2012 at 3:05 AM, <> wrote:
    >> On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
    >>> On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    >>> > file_open=open("/python32/doc1.txt","r")
    >>> Also, as has already been mentioned: keeping your data files in the
    >>> Python binaries directory isn't usually a good idea. More common to
    >>> keep them in the same directory as your script, which would mean that
    >>> you don't need a path on it at all.

    >> No file path! Amazing. I do not know I like to know one small example please.

    >
    > open("doc1.txt","r")
    >
    > Python will look for a file called doc1.txt in the directory you run
    > the script from (which is often going to be the same directory as your
    > .py program).
    >
    >> Btw, some earlier post said, line.split() to convert line into bag of words can
    >> be done with power(), but I did not find it, if any one can help. I do close
    >> files do not worry. New style I'd try.

    >
    > I don't know what power() function you're talking about, and can't
    > find it in the previous posts; the nearest I can find is a post from
    > Ranting Rick which says a lot of guff that you can ignore. (Rick is a
    > professional troll. Occasionally he says something useful and
    > courteous; more often it's one or the other, or neither.)
    >

    I believe the relevant quote is """especially the Python gods have
    given you *power* over string objects""". If that's the case, he's not
    referring to a method or a function called "power".

    He did give a good warning about the problem that could arise if the
    original string contains "$", the character being used as the separator.
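
    A quick illustration of that pitfall (text invented): if a document itself contains a literal "$", a naive split cuts in the wrong place:

    ```python
    # The first document mentions a price, so it contains the separator char.
    text = "Price rose to $5 today.$Second document.$"
    pieces = text.split('$')
    print(pieces)   # ['Price rose to ', '5 today.', 'Second document.', '']
    ```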

    > As to the closing of files: There are a few narrow issues that make it
    > worth using the 'with' statement, such as exceptions; mostly, it's
    > just a good habit to get into. If you ignore it, your file will
    > *usually* be closed fairly soon after you stop referencing it, but
    > there's no guarantee. (Someone else will doubtless correct me if I'm
    > wrong, but I'm pretty sure Python guarantees to properly flush and
    > close on exit, but not necessarily before.)
    >
     
    MRAB, Jul 8, 2012
    #18
  19. On Sat, 7 Jul 2012 22:42:13 -0700 (PDT),
    declaimed the following in gmane.comp.python.general:

    >
    > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
    >
    > file_open=open("/python32/doc1.txt","r")
    > for line in file_open:
    >     line_word=line.split()
    >     print (line_word)
    >
    > To store them the best way is to assign a blank list and append but is there any alternate
    > method for huge data it becomes tough as the list becomes huge if any way variables may be assigned.
    >

    Well, first to copy from an earlier post (just so I can trim the
    unneeded)...

    > > > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
    > > > Is there any way out to this problem?


    It is still not clear exactly what the task itself is supposed to
    be.

    After all, you are splitting the line into a LIST of words, and then
    here state the goal is to "print line of" choice... The line and not the
    list? There is no hint of what "manipulate them" involves.

    If the files are of any size, I would not even attempt to store them
    internally... I'd be more likely to run a preprocess phase which opens
    the file in binary mode, (maybe reads it in chunks), and builds a list
    of /offsets/ to the start of each line. To process any specific line
    later would use seek() operations to the start of the line, followed by
    a read operation of just the length to the next line.
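
    That offset-index idea might be sketched like this (the sample file is invented so the snippet is self-contained):

    ```python
    # Create the sample file (a stand-in for the real data):
    with open("doc1.txt", "wb") as f:
        f.write(b"first line\nsecond line\nthird line\n")

    # Preprocess: record the byte offset of the start of each line.
    offsets = []
    with open("doc1.txt", "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # Later: seek straight to any line, without holding the file in memory.
    with open("doc1.txt", "rb") as f:
        f.seek(offsets[2])
        print(f.readline())   # b'third line\n'
    ```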

    Doing an mmap() of the file may even speed up the later processing,
    as you wouldn't be using I/O seeks, but just asking for slices from the
    mmap'd file. The OS would be responsible for making sure the file
    contents were in memory.
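
    A minimal mmap sketch along those lines (again with an invented sample file):

    ```python
    import mmap

    # Create the sample file to map:
    with open("doc1.txt", "wb") as f:
        f.write(b"line one\nline two\nline three\n")

    with open("doc1.txt", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = mm[9:17]          # plain slicing, no seek()/read() calls
        mm.close()

    print(chunk)   # b'line two'
    ```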

    This won't work if the manipulation requires making a line longer or
    shorter. In that case, preprocessing would be writing the lines to a
    simple BSD-DB style "database", in which the "line number" is the key;
    any manipulation would work on records fetched by line number, and
    written back.

    If you also store a "process date" in the BSD-DB database, you could
    match it to the last modified time of the source file and skip
    reprocessing if the source has not changed.
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Jul 8, 2012
    #19
  20. On Mon, Jul 9, 2012 at 4:17 AM, Roy Smith <> wrote:
    > In article <>,
    > Chris Angelico <> wrote:
    >
    >> open("doc1.txt","r")
    >>
    >> Python will look for a file called doc1.txt in the directory you run
    >> the script from (which is often going to be the same directory as your
    >> .py program).

    >
    > Well, to pick a nit, the file will be looked for in the current working
    > directory. This may or may not be the directory you ran your script
    > from. Your script could have executed chdir() between the time you
    > started it and you tried to open the file.
    >
    > To pick another nit, it's misleading to say, "Python will look for...".
    > This implies that Python somehow gets involved in pathname resolution,
    > when it doesn't. Python just passes paths to the operating system as
    > opaque strings, and the OS does all the magic of figuring out what that
    > string means.


    Two perfectly accurate nitpicks. And of course, there's a million and
    one other things that could happen in between, too, including
    possibilities of the current directory not even existing and so on. I
    merely oversimplified in the hopes of giving a one-paragraph
    explanation of what it means to not put a path name in your open()
    call :) It's like the difference between reminder text on a Magic: The
    Gathering card and the actual entries in the Comprehensive Rules.
    Perfect example is the "Madness" ability - the reminder text explains
    the ability, but uses language that actually is quite incorrect. It's
    a better explanation, though.

    Am I overanalyzing this? Yeah, probably...

    ChrisA
     
    Chris Angelico, Jul 8, 2012
    #20