Discussion on some Code Issues

Discussion in 'Python' started by subhabangalore@gmail.com, Jul 5, 2012.

  1. Guest

    Dear Group,

    I am Sri Subhabrata Banerjee, trying to write from Gurgaon, India, to discuss some coding issues. If anyone in this learned room can shed some light, I would be grateful.

    I got to code a bunch of documents which are combined together.
    Like,

    1) A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.
    2) The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.
    3) A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

    The task is to separate the documents on the fly and to parse each of the documents with a definite set of rules.

    Now, the way I am processing is:
    I am clubbing all the documents together, as,

    A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing. The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection. A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

    But they are separated by a tag set, like,
    A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.$
    The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.$
    A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

    To detect the document boundaries, I am splitting them into a bag of words and using a simple for loop as,

    for i in range(len(bag_words)):
        if bag_words[i] == "$":
            print(bag_words[i], i)

    There is no issue. I am segmenting it nicely. I am using an annotated corpus, so I am applying parse rules.
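    For what it's worth, a self-contained sketch of this kind of boundary detection (the sample text and all names here are made up for illustration):

    ```python
    # Hypothetical sample data: three tiny "documents" joined with "$" tokens.
    text = "first doc . $ second doc . $ third doc ."
    bag_words = text.split()

    # enumerate gives index and word together, replacing range(len(...)).
    boundaries = [i for i, word in enumerate(bag_words) if word == "$"]

    # Slice the token list into one sub-list per document.
    docs, start = [], 0
    for b in boundaries + [len(bag_words)]:
        docs.append(bag_words[start:b])
        start = b + 1

    print(docs)
    ```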

    The confusion comes next,

    As per my problem statement, the size of the file (of the documents combined together) won’t increase on the fly. So, just to support all kinds of combinations, I am appending the “I” values to a list, taking its length, and using slices. It works perfectly. The question is: is there a smarter way to achieve this? And a curious question: if the documents arrive on the fly, with no preprocessed tag like “$”, how may I do it? From a bunch without an EOF, isn’t it a classification problem?

    There is no question on parsing; it seems I am achieving it independent of the length of the document.

    Can anyone in the group suggest how I am dealing with the problem, which portions should be improved, and how?

    Thanking You in Advance,

    Best Regards,
    Subhabrata Banerjee.
     
    , Jul 5, 2012
    #1

  2. On Wed, 04 Jul 2012 16:21:46 -0700, subhabangalore wrote:

    [...]
    > I got to code a bunch of documents which are combined together.

    [...]
    > The task is to separate the documents on the fly and to parse each of
    > the documents with a definite set of rules.
    >
    > Now, the way I am processing is:
    > I am clubbing all the documents together, as,

    [...]
    > But they are separated by a tag set

    [...]
    > To detect the document boundaries,


    Let me see if I understand your problem.

    You have a bunch of documents. You stick them all together into one
    enormous lump. And then you try to detect the boundaries between one file
    and the next within the enormous lump.

    Why not just process each file separately? A simple for loop over the
    list of files, before consolidating them into one giant file, will avoid
    all the difficulty of trying to detect boundaries within files.

    Instead of:

    merge(output_filename, list_of_files)
    for word in parse(output_filename):
        if boundary_detected: do_something()
        process(word)

    Do this instead:

    for filename in list_of_files:
        do_something()
        for word in parse(filename):
            process(word)
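    In concrete Python, that per-file approach might look like the following (the directory, filenames, and the trivial "parse" step are all placeholders invented for this sketch):

    ```python
    import os
    import tempfile

    # Hypothetical setup: three small files standing in for the real documents.
    tmpdir = tempfile.mkdtemp()
    for n, body in enumerate(["doc one text", "doc two text", "doc three text"], 1):
        with open(os.path.join(tmpdir, "doc%d.txt" % n), "w") as f:
            f.write(body)

    # One file per document: no boundary markers, no boundary detection.
    results = []
    for filename in sorted(os.listdir(tmpdir)):
        with open(os.path.join(tmpdir, filename)) as f:
            words = f.read().split()   # stand-in for the real parse step
        results.append(words)

    print(results)
    ```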


    > I am splitting them into a bag of
    > words and using a simple for loop as,
    > for i in range(len(bag_words)):
    >     if bag_words[i] == "$":
    >         print(bag_words[i], i)



    What happens if a file already has a $ in it?


    > There is no issue. I am segmenting it nicely. I am using annotated
    > corpus so applying parse rules.
    >
    > The confusion comes next,
    >
    > As per my problem statement the size of the file (of documents combined
    > together) won’t increase on the fly. So, just to support all kinds of
    > combinations I am appending in a list the “I” values, taking its length,
    > and using slice. Works perfect.


    I don't understand this. What sort of combinations do you think you need
    to support? What are "I" values, and why are they important?



    --
    Steven
     
    Steven D'Aprano, Jul 5, 2012
    #2

  3. Rick Johnson Guest

    On Jul 4, 6:21 pm, wrote:
    > [...]
    > To detect the document boundaries, I am splitting them into a bag
    > of words and using a simple for loop as,
    >
    > for i in range(len(bag_words)):
    >         if bag_words[i] == "$":
    >             print(bag_words[i], i)


    Ignoring that you are attacking the problem incorrectly: that is a very
    poor method of splitting a string, especially since the Python gods
    have given you *power* over string objects. But you are going to have
    an even greater problem if the string contains a "$" char that you DID
    NOT insert :-O. You'd be wise to use a sep that is not likely to be in
    the file data. For example: "<SEP>" or "<SPLIT-HERE>". But even that
    approach is naive! Why not streamline the entire process and pass a
    list of file paths to a custom parser object instead?
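    As a sketch of the separator idea, str.split can carry the whole job (the data and marker below are illustrative only):

    ```python
    # Illustrative data: three documents joined with an unlikely marker.
    data = ("A first document.<SEP>"
            "A second document with a literal $ in it.<SEP>"
            "A third document.")

    # str.split does all the work; a "$" inside a document is now harmless.
    docs = data.split("<SEP>")
    print(len(docs), docs[1])
    ```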
     
    Rick Johnson, Jul 5, 2012
    #3
  4. Guest

    On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
    > [...]



    Hi Steven, it is nice to see your post. Your posts are nice and I have learnt so many things from you. "I" is for the index of the loop.
    Now my clarification: I thought to do "import os" and process the files in a loop, but that is not my problem statement. I have to make one big lump of text and detect one chunk. I am not looping over the line numbers of the file, because then I may not be able to take the slices, and I need them. I thought to give re.findall a try, but that does not give me the slices. The slices spread out here. The power of strings! I would definitely give it a try. Happy Day Ahead. Regards, Subhabrata Banerjee.
     
    , Jul 5, 2012
    #4
  5. Peter Otten Guest

    wrote:

    > On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
    >> [...]
    >
    >
    > Hi Steven, It is nice to see your post. They are nice and I learnt so many
    > things from you. "I" is for index of the loop. Now my clarification I
    > thought to do "import os" and process files in a loop but that is not my
    > problem statement. I have to make a big lump of text and detect one chunk.
    > Looping over the line number of file I am not using because I may not be
    > able to take the slices-this I need. I thought to give re.findall a try
    > but that is not giving me the slices. Slice spreads here. The power issue
    > of string! I would definitely give it a try. Happy Day Ahead Regards,
    > Subhabrata Banerjee.


    Then use re.finditer():

    start = 0
    for match in re.finditer(r"\$", data):
        end = match.start()
        print(start, end)
        print(data[start:end])
        start = match.end()

    This will omit the last text. The simplest fix is to put another "$"
    separator at the end of your data.
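    A quick, self-contained run of that loop with the trailing "$" added (the sample data is invented for the demonstration):

    ```python
    import re

    # Invented sample; note the extra "$" appended so the last text is kept.
    data = "first text.$second text.$third text." + "$"

    pieces = []
    start = 0
    for match in re.finditer(r"\$", data):
        end = match.start()
        pieces.append(data[start:end])   # the slice between two separators
        start = match.end()

    print(pieces)
    ```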
     
    Peter Otten, Jul 5, 2012
    #5
  6. Guest

    Dear Peter,
    That is a nice one. I am thinking I could write a "for line in f" sort of code, which is easy, but then how do I find out the slices? By the way, do you know whether, in any case, I may convert an index position in the file to the list position, provided I am writing the list from the same file we are reading?

    Best Regards,
    Subhabrata.

    On Thursday, July 5, 2012 1:00:12 PM UTC+5:30, Peter Otten wrote:
    > [...]
    >
    > Then use re.finditer():
    >
    > start = 0
    > for match in re.finditer(r"\$", data):
    >     end = match.start()
    >     print(start, end)
    >     print(data[start:end])
    >     start = match.end()
    >
    > This will omit the last text. The simplest fix is to put another "$"
    > separator at the end of your data.
     
    , Jul 5, 2012
    #6
  8. Peter Otten Guest

    wrote:

    [Please don't top-post]

    >> start = 0
    >> for match in re.finditer(r"\$", data):
    >> end = match.start()
    >> print(start, end)
    >> print(data[start:end])
    >> start = match.end()


    > That is a nice one. I am thinking if I can write "for lines in f" sort of
    > code that is easy but then how to find out the slices then,


    You have to keep track both of the offset of the line and the offset within
    the line:

    def offsets(lines, pos=0):
        for line in lines:
            yield pos, line
            pos += len(line)

    start = 0
    for line_start, line in offsets(lines):
        for pos, part in offsets(re.split(r"(\$)", line), line_start):
            if part == "$":
                print(start, pos)
                start = pos + 1

    (untested code, I'm assuming that the file ends with a $)

    > btw do you
    > know in any case may I convert the index position of file to the list
    > position provided I am writing the list for the same file we are reading.


    Use a lookup list with the end positions of the texts and then find the
    relevant text with bisect.

    >>> import bisect
    >>> ends = [10, 20, 50]
    >>> filepos = 15
    >>> bisect.bisect(ends, filepos)
    1    # position 15 belongs to the second text
     
    Peter Otten, Jul 6, 2012
    #8
  9. Guest

    On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
    > [...]


    Thanks Peter but I feel your earlier one was better, I got an interesting one:
    [i for i in range(len(f1)) if f1.startswith("$", i)]

    But I am bit intrigued with another question,

    suppose I say:
    file_open=open("/python32/doc1.txt","r")
    file=a1.read().lower()
    for line in file:
        line_word=line.split()

    This works fine. But if I print it, it gets printed continuously.
    I would like to store it in some variable, so that I may print lines of my choice and manipulate them at my choice.
    Is there any way out of this problem?


    Regards,
    Subhabrata Banerjee
     
    , Jul 7, 2012
    #9
  10. On Sat, 7 Jul 2012 12:54:16 -0700 (PDT),
    declaimed the following in gmane.comp.python.general:

    > But I am bit intrigued with another question,
    >
    > suppose I say:
    > file_open=open("/python32/doc1.txt","r")
    > file=a1.read().lower()
    > for line in file:
    >     line_word=line.split()
    >
    > This works fine. But if I print it would be printed continuously.


    "This works fine" -- Really?

    1) Why are you storing data files in the install directory of your
    Python interpreter?

    2) "a1" is undefined -- you should get an exception on that line which
    makes the following irrelevant; replacing "a1" with "file_open" leads
    to...

    3) "file" is a) a predefined function in Python, which you have just
    shadowed and b) a poor name for a string containing the contents of a
    file

    4) "for line in file", since "file" is a string, will iterate over EACH
    CHARACTER, meaning (since there is nothing to split) that "line_word" is
    also just a single character.

    for line in file.split("\n"):

    will split the STRING into logical lines (assuming a new-line character
    splits the lines) and permit the subsequent split to pull out wordS
    ("line_word" is misleading, as it will contain a LIST of words from the
    line).

    > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
    > Is there any way out to this problem?
    >
    >
    > Regards,
    > Subhabrata Banerjee

    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Jul 7, 2012
    #10
  11. Guest

    On Sunday, July 8, 2012 2:21:14 AM UTC+5:30, Dennis Lee Bieber wrote:
    > On Sat, 7 Jul 2012 12:54:16 -0700 (PDT),
    > declaimed the following in gmane.comp.python.general:
    >
    > > But I am bit intrigued with another question,
    > >
    > > suppose I say:
    > > file_open=open("/python32/doc1.txt","r")
    > > file=a1.read().lower()
    > > for line in file:
    > >     line_word=line.split()
    > >
    > > This works fine. But if I print it would be printed continuously.

    >
    > "This works fine" -- Really?
    >
    > 1) Why are you storing data files in the install directory of your
    > Python interpreter?
    >
    > 2) "a1" is undefined -- you should get an exception on that line which
    > makes the following irrelevant; replacing "a1" with "file_open" leads
    > to...
    >
    > 3) "file" is a) a predefined function in Python, which you have just
    > shadowed and b) a poor name for a string containing the contents of a
    > file
    >
    > 4) "for line in file", since "file" is a string, will iterate over EACH
    > CHARACTER, meaning (since there is nothing to split) that "line_word" is
    > also just a single character.
    >
    > for line in file.split("\n"):
    >
    > will split the STRING into logical lines (assuming a new-line character
    > splits the lines) and permit the subsequent split to pull out wordS
    > ("line_word" is misleading, as to will contain a LIST of words from the
    > line).
    >
    > > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
    > > Is there any way out to this problem?
    > >
    > >
    > > Regards,
    > > Subhabrata Banerjee

    > --
    > Wulfraed Dennis Lee Bieber AF6VN
    > HTTP://wlfraed.home.netcom.com/


    Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,

    file_open=open("/python32/doc1.txt","r")
    for line in file_open:
        line_word=line.split()
        print (line_word)

    To store them, the best way is to assign a blank list and append to it, but is there any alternate
    method? For huge data it becomes tough, as the list becomes huge; is there any way variables may be assigned instead?
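
    A minimal sketch of the append-to-a-blank-list approach just described (the file name and contents are invented so the snippet is self-contained):

    ```python
    # Create a small sample file ("doc1.txt" is a stand-in for the real data):
    with open("doc1.txt", "w") as f:
        f.write("first line here\nsecond line there\n")

    # Append each line's word list to an initially blank list:
    all_words = []
    with open("doc1.txt") as f:
        for line in f:
            all_words.append(line.split())

    print(all_words[1])   # words of the second line: ['second', 'line', 'there']
    ```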

    Regards,
    Subhabrata Banerjee.
     
    , Jul 8, 2012
    #11
  13. On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
    >
    > file_open=open("/python32/doc1.txt","r")
    > for line in file_open:
    >     line_word=line.split()
    >     print (line_word)


    Yep. I'd be inclined to rename file_open to something that says what
    the file _is_, and you may want to look into the 'with' statement to
    guarantee timely closure of the file, but that's a way to do it.

    Also, as has already been mentioned: keeping your data files in the
    Python binaries directory isn't usually a good idea. More common to
    keep them in the same directory as your script, which would mean that
    you don't need a path on it at all.
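
    Put together, that advice might look like this sketch (file name and contents invented for illustration):

    ```python
    # A bare filename is resolved relative to the current working directory,
    # so no /python32/-style path is needed. The 'with' block guarantees the
    # file is closed when the block ends, even if an exception occurs.
    with open("doc1.txt", "w") as f:           # create a tiny sample file
        f.write("A bomb explosion outside a shopping mall\n")

    with open("doc1.txt") as report_file:      # name says what the file is
        for line in report_file:
            words = line.split()

    print(report_file.closed)   # True
    ```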

    ChrisA
     
    Chris Angelico, Jul 8, 2012
    #13
  14. Guest

    On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
    > On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    > > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
    > >
    > > file_open=open("/python32/doc1.txt","r")
    > > for line in file_open:
    > >     line_word=line.split()
    > >     print (line_word)

    >
    > Yep. I'd be inclined to rename file_open to something that says what
    > the file _is_, and you may want to look into the 'with' statement to
    > guarantee timely closure of the file, but that's a way to do it.
    >
    > Also, as has already been mentioned: keeping your data files in the
    > Python binaries directory isn't usually a good idea. More common to
    > keep them in the same directory as your script, which would mean that
    > you don't need a path on it at all.
    >
    > ChrisA


    Dear Chris,
    No file path! Amazing. I did not know that; I would like to see one small example, please.
    Btw, some earlier post said that line.split(), to convert a line into a bag of words, can be done with power(), but I did not find it; can anyone help? I do close files, do not worry. The new style I'd try.

    Regards,
    Subha
     
    , Jul 8, 2012
    #14
  16. On Mon, Jul 9, 2012 at 3:05 AM, <> wrote:
    > On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
    >> On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    >> > file_open=open("/python32/doc1.txt","r")

    >> Also, as has already been mentioned: keeping your data files in the
    >> Python binaries directory isn't usually a good idea. More common to
    >> keep them in the same directory as your script, which would mean that
    >> you don't need a path on it at all.

    > No file path! Amazing. I do not know I like to know one small example please.


    open("doc1.txt","r")

    Python will look for a file called doc1.txt in the directory you run
    the script from (which is often going to be the same directory as your
    .py program).

    > Btw, some earlier post said, line.split() to convert line into bag of words can be done with power(), but I did not find it, if any one can help. I do close files do not worry. New style I'd try.


    I don't know what power() function you're talking about, and can't
    find it in the previous posts; the nearest I can find is a post from
    Ranting Rick which says a lot of guff that you can ignore. (Rick is a
    professional troll. Occasionally he says something useful and
    courteous; more often it's one or the other, or neither.)

    As to the closing of files: There are a few narrow issues that make it
    worth using the 'with' statement, such as exceptions; mostly, it's
    just a good habit to get into. If you ignore it, your file will
    *usually* be closed fairly soon after you stop referencing it, but
    there's no guarantee. (Someone else will doubtless correct me if I'm
    wrong, but I'm pretty sure Python guarantees to properly flush and
    close on exit, but not necessarily before.)

    ChrisA
     
    Chris Angelico, Jul 8, 2012
    #16
  17. Roy Smith Guest

    In article <>,
    Chris Angelico <> wrote:

    > open("doc1.txt","r")
    >
    > Python will look for a file called doc1.txt in the directory you run
    > the script from (which is often going to be the same directory as your
    > .py program).


    Well, to pick a nit, the file will be looked for in the current working
    directory. This may or may not be the directory you ran your script
    from. Your script could have executed chdir() between the time you
    started it and you tried to open the file.

    To pick another nit, it's misleading to say, "Python will look for...".
    This implies that Python somehow gets involved in pathname resolution,
    when it doesn't. Python just passes paths to the operating system as
    opaque strings, and the OS does all the magic of figuring out what that
    string means.
     
    Roy Smith, Jul 8, 2012
    #17
  18. MRAB Guest

    On 08/07/2012 18:17, Chris Angelico wrote:
    > On Mon, Jul 9, 2012 at 3:05 AM, <> wrote:
    >> On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
    >>> On Sun, Jul 8, 2012 at 3:42 PM, <> wrote:
    >>> > file_open=open("/python32/doc1.txt","r")
    >>> Also, as has already been mentioned: keeping your data files in the
    >>> Python binaries directory isn't usually a good idea. More common to
    >>> keep them in the same directory as your script, which would mean that
    >>> you don't need a path on it at all.

    >> No file path! Amazing. I do not know I like to know one small example please.

    >
    > open("doc1.txt","r")
    >
    > Python will look for a file called doc1.txt in the directory you run
    > the script from (which is often going to be the same directory as your
    > .py program).
    >
    >> Btw, some earlier post said, line.split() to convert line into bag of words can
    >> be done with power(), but I did not find it, if any one can help. I do close
    >> files do not worry. New style I'd try.

    >
    > I don't know what power() function you're talking about, and can't
    > find it in the previous posts; the nearest I can find is a post from
    > Ranting Rick which says a lot of guff that you can ignore. (Rick is a
    > professional troll. Occasionally he says something useful and
    > courteous; more often it's one or the other, or neither.)
    >

    I believe the relevant quote is """especially the Python gods have
    given you *power* over string objects""". If that's the case, he's not
    referring to a method or a function called "power".

    He did give a good warning about the problem that could arise if the
    original string contains "$", the character being used as the separator.
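
    A quick illustration of that pitfall (text invented): if a document itself contains a literal "$", a naive split cuts in the wrong place:

    ```python
    # The first document mentions a price, so it contains the separator char.
    text = "Price rose to $5 today.$Second document.$"
    pieces = text.split('$')
    print(pieces)   # ['Price rose to ', '5 today.', 'Second document.', '']
    ```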

    > As to the closing of files: There are a few narrow issues that make it
    > worth using the 'with' statement, such as exceptions; mostly, it's
    > just a good habit to get into. If you ignore it, your file will
    > *usually* be closed fairly soon after you stop referencing it, but
    > there's no guarantee. (Someone else will doubtless correct me if I'm
    > wrong, but I'm pretty sure Python guarantees to properly flush and
    > close on exit, but not necessarily before.)
    >
     
    MRAB, Jul 8, 2012
    #18
  19. On Sat, 7 Jul 2012 22:42:13 -0700 (PDT),
    declaimed the following in gmane.comp.python.general:

    >
    > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
    >
    > file_open=open("/python32/doc1.txt","r")
    > for line in file_open:
    >     line_word=line.split()
    >     print (line_word)
    >
    > To store them the best way is to assign a blank list and append but is there any alternate
    > method for huge data it becomes tough as the list becomes huge if any way variables may be assigned.
    >

    Well, first to copy from an earlier post (just so I can trim the
    unneeded)...

    > > > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
    > > > Is there any way out to this problem?


    It is still not clear exactly what the task itself is supposed to
    be.

    After all, you are splitting the line into a LIST of words, and then
    here state the goal is to "print line of" choice... The line and not the
    list? There is no hint of what "manipulate them" involves.

    If the files are of any size, I would not even attempt to store them
    internally... I'd be more likely to run a preprocess phase which opens
    the file in binary mode, (maybe reads it in chunks), and builds a list
    of /offsets/ to the start of each line. To process any specific line
    later would use seek() operations to the start of the line, followed by
    a read operation of just the length to the next line.
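
    That offset-index idea might be sketched like this (the sample file is invented so the snippet is self-contained):

    ```python
    # Create the sample file (a stand-in for the real data):
    with open("doc1.txt", "wb") as f:
        f.write(b"first line\nsecond line\nthird line\n")

    # Preprocess: record the byte offset of the start of each line.
    offsets = []
    with open("doc1.txt", "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # Later: seek straight to any line, without holding the file in memory.
    with open("doc1.txt", "rb") as f:
        f.seek(offsets[2])
        print(f.readline())   # b'third line\n'
    ```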

    Doing an mmap() of the file may even speed up the later processing,
    as you wouldn't be using I/O seeks, but just asking for slices from the
    mmap'd file. The OS would be responsible for making sure the file
    contents were in memory.
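
    A minimal mmap sketch along those lines (again with an invented sample file):

    ```python
    import mmap

    # Create the sample file to map:
    with open("doc1.txt", "wb") as f:
        f.write(b"line one\nline two\nline three\n")

    with open("doc1.txt", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = mm[9:17]          # plain slicing, no seek()/read() calls
        mm.close()

    print(chunk)   # b'line two'
    ```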

    This won't work if the manipulation requires making a line longer or
    shorter. In that case, preprocessing would be writing the lines to a
    simple BSD-DB style "database", in which the "line number" is the key;
    any manipulation would work on records fetched by line number, and
    written back.

    If you also store a "process date" in the BSD-DB database, you could
    match it to the last modified time of the source file and skip
    reprocessing if the source has not changed.
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Jul 8, 2012
    #19
  20. On Mon, Jul 9, 2012 at 4:17 AM, Roy Smith <> wrote:
    > In article <>,
    > Chris Angelico <> wrote:
    >
    >> open("doc1.txt","r")
    >>
    >> Python will look for a file called doc1.txt in the directory you run
    >> the script from (which is often going to be the same directory as your
    >> .py program).

    >
    > Well, to pick a nit, the file will be looked for in the current working
    > directory. This may or may not be the directory you ran your script
    > from. Your script could have executed chdir() between the time you
    > started it and you tried to open the file.
    >
    > To pick another nit, it's misleading to say, "Python will look for...".
    > This implies that Python somehow gets involved in pathname resolution,
    > when it doesn't. Python just passes paths to the operating system as
    > opaque strings, and the OS does all the magic of figuring out what that
    > string means.


    Two perfectly accurate nitpicks. And of course, there's a million and
    one other things that could happen in between, too, including
    possibilities of the current directory not even existing and so on. I
    merely oversimplified in the hopes of giving a one-paragraph
    explanation of what it means to not put a path name in your open()
    call :) It's like the difference between reminder text on a Magic: The
    Gathering card and the actual entries in the Comprehensive Rules.
    Perfect example is the "Madness" ability - the reminder text explains
    the ability, but uses language that actually is quite incorrect. It's
    a better explanation, though.

    Am I overanalyzing this? Yeah, probably...

    ChrisA
     
    Chris Angelico, Jul 8, 2012
    #20