Scanning a file character by character

Spacebar265 · Feb 5, 2009

Hi. Does anyone know how to scan a file character by character and
get each character so I can put it into a variable? I am attempting
to make a chatbot and need this to read the saved input, to look for
spelling mistakes and to analyze the user input further.
Thanks
Spacebar265

Bard Aase · Feb 5, 2009

You can read one byte at a time using the read() method on the file
object.
http://docs.python.org/library/stdtypes.html#file.read

e.g.:
f=open("myfile.txt")
byte=f.read(1)
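
To walk through a whole file this way, read(1) can be called in a loop until it returns an empty string. A minimal sketch, assuming Python 3, where a file opened in text mode returns one character per read(1) rather than one byte. The iter_chars helper and the temporary demo file are made up for illustration:

```python
import os
import tempfile


def iter_chars(path, encoding="utf-8"):
    """Yield the contents of a text file one character at a time."""
    with open(path, "r", encoding=encoding) as f:
        while True:
            ch = f.read(1)  # text mode: one character, not one byte
            if ch == "":    # an empty string signals end of file
                break
            yield ch


# Demo on a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("hi\nyo")
    demo_path = tmp.name

chars = list(iter_chars(demo_path))
os.remove(demo_path)
print(chars)  # ['h', 'i', '\n', 'y', 'o']
```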

Gabriel Genellina · Feb 5, 2009

Read the file one line at a time, and process each line one character at a
time:

with open(filename, "r") as f:
    for line in f:
        for c in line:
            process(c)

But probably you want to process one *word* at a time; the easiest way
(perhaps inaccurate) is to just split on whitespace:

        ...
        for word in line.split():
            process(word)
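
The post leaves process() undefined. As one possible reading of it for the spelling use case, here is a rough sketch that flags words missing from a small word set. KNOWN_WORDS and find_unknown_words are hypothetical names, the word set is a toy stand-in for a real dictionary, and io.StringIO stands in for the saved input file:

```python
import io

# Hypothetical, tiny stand-in for a real dictionary of known words.
KNOWN_WORDS = {"hello", "how", "are", "you"}


def find_unknown_words(lines, known=KNOWN_WORDS):
    """Return the words (lowercased, outer punctuation stripped) not in `known`."""
    unknown = []
    for line in lines:
        for word in line.split():
            cleaned = word.strip(".,!?;:\"'").lower()
            if cleaned and cleaned not in known:
                unknown.append(cleaned)
    return unknown


# Any iterable of lines works, including an open file object.
saved_input = io.StringIO("Hello, how are yuo?\nHelo you!\n")
print(find_unknown_words(saved_input))  # ['yuo', 'helo']
```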

Jorgen Grahn · Feb 6, 2009

Hi. Does anyone know how to scan a file character by character and
have each character so I can put it into a variable. I am attempting
to make a chatbot and need this to read the saved input to look for
spelling mistakes and further analysis of user input.

That does not follow. To analyze a text, the worst possible starting
point is one variable for each character (what would you call them --
character_1, character_2, ... character_65802 ?)

/Jorgen
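
To make the objection concrete: a Python string already gives access to every character by position, so separate per-character variables are not needed. A small sketch with a made-up input string:

```python
text = "helo world"  # made-up sample input

print(text[0])    # h
print(len(text))  # 10

# Iterate with positions instead of naming each character.
space_positions = [i for i, ch in enumerate(text) if ch == " "]
print(space_positions)  # [4]

# If a mutable sequence of characters is really wanted, a list will do.
chars = list(text)
print(chars[:4])  # ['h', 'e', 'l', 'o']
```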

Spacebar265 · Feb 9, 2009

How else would you check for spelling mistakes? Because the input is
very unlikely to be lengthy paragraphs, I wouldn't even need very many
variables. If anyone could suggest an alternative method, it would be
much appreciated.

Steve Holden · Feb 9, 2009

I believe most people would read the input a line at a time and split
the lines into words. It does depend whether you are attempting
real-time spelling correction, though. That would be a different case.

regards
Steve

Spacebar265 · Feb 10, 2009

Thanks. How would I separate lines into words without scanning one
character at a time?

Steven D'Aprano · Feb 10, 2009

Scan a line at a time, then split each line into words.

for line in open('myfile.txt'):
    words = line.split()

should work for a particularly simple-minded idea of words.

Hendrik van Rooyen · Feb 10, 2009

Type the following at the interactive prompt and see what happens:

s = "This is a string composed of a few words and a newline\n"
help(s.split)
help(s.rstrip)
help(s.strip)
dir(s)

- Hendrik
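
For anyone reading without a prompt at hand, here is roughly what those methods do to the sample string. The padded string is an extra example added for illustration:

```python
s = "This is a string composed of a few words and a newline\n"

words = s.split()             # split on runs of whitespace; the newline disappears
no_newline = s.rstrip("\n")   # drop only the trailing newline

print(len(words))             # 12
print(words[:3])              # ['This', 'is', 'a']
print(repr(no_newline[-7:]))  # 'newline'

padded = "  hello \n"          # illustrative string with whitespace on both sides
print(repr(padded.strip()))   # 'hello'
print(repr(padded.rstrip()))  # '  hello'
```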

Steven D'Aprano · Feb 10, 2009

Duncan Booth said:
Steven D'Aprano said:
Scan a line at a time, then split each line into words.

for line in open('myfile.txt'):
    words = line.split()

should work for a particularly simple-minded idea of words.

Or for a slightly less simple minded splitting you could try re.split:

re.split("(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

Perhaps I'm missing something, but the above regex does the exact same
thing as line.split() except it is significantly slower and harder to
read.

Neither deals with quoted text, apostrophes, hyphens, punctuation or any
other details of real-world text. That's what I mean by "simple-minded".

Tim Chase · Feb 10, 2009

>>> s = "The quick brown fox jumps, and falls over."
>>> import re
>>> re.split(r"(\w+)", s)[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
>>> s.split()
['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']

Note the difference in "jumps" vs. "jumps," (extra comma in the
string.split() version) and likewise the period after "over".
Thus not quite "the exact same thing as line.split()".

I think an easier-to-read variant would be

>>> re.findall(r"\w+", s)
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

which just finds words. One could also just limit it to letters with

re.findall("[a-zA-Z]+", s)

as "\w" is a little more encompassing (letters, digits, and underscores)
if that's a problem.

-tkc
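
The thread points out that none of these handle apostrophes or hyphens. One possible extension of the findall idea, offered as a sketch rather than a complete tokenizer, is to allow a single apostrophe or hyphen between runs of letters. The pattern and the find_words name are assumptions made for this example:

```python
import re

# Letters, optionally followed by (' or -) plus more letters, repeated.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def find_words(text):
    """Return words, keeping internal apostrophes and hyphens."""
    return WORD_PATTERN.findall(text)


print(find_words("Don't stop the well-known fox, it's over."))
# ["Don't", 'stop', 'the', 'well-known', 'fox', "it's", 'over']
```

This still only covers ASCII letters, and leading or trailing quotes and dashes are dropped rather than attached to a word.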

Rhodri James · Feb 10, 2009

You're missing something. Specifically, the punctuation gets swept
up with the whitespace, and the extended slice skips it. Apostrophes
(and possibly hyphenation) are still a bit moot, though.

Steven D'Aprano · Feb 10, 2009

Tim Chase said:
Note the difference in "jumps" vs. "jumps," (extra comma in the
string.split() version) and likewise the period after "over". Thus not
quite "the exact same thing as line.split()".

Um... yes. I'll just slink away quietly now... nothing to see here...

MRAB · Feb 10, 2009

You could've used str.translate to strip out the unwanted characters.
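
A sketch of the str.translate route, assuming Python 3, where the deletion table is built with str.maketrans (the thread dates from Python 2 days, when the call looked different). STRIP_PUNCTUATION and split_without_punctuation are illustrative names:

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def split_without_punctuation(line):
    """Delete punctuation, then split on whitespace."""
    return line.translate(STRIP_PUNCTUATION).split()


print(split_without_punctuation("The quick brown fox jumps, and falls over."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

# Caveat: apostrophes are punctuation too, so contractions get mangled.
print(split_without_punctuation("Don't stop"))
# ['Dont', 'stop']
```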

Spacebar265 · Feb 13, 2009

Using this code, how would I load each word into a temporary variable?

Rhodri James · Feb 13, 2009

Why on earth would you want to? Just index through the list.
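
In practice "indexing through the list" usually means a for loop, where the loop variable plays the role of the temporary variable. A short sketch using the sentence from earlier in the thread:

```python
import re

words = re.findall(r"\w+", "The quick brown fox jumps, and falls over.")

# The loop variable `word` holds each word in turn.
for word in words:
    print(word)

# enumerate() supplies the position as well, if it matters.
for index, word in enumerate(words):
    print(index, word)

# Direct indexing still works for a specific word.
third_word = words[2]
print(third_word)  # brown
```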

Josh Dukes · Feb 17, 2009

In [401]: import shlex

In [402]: shlex.split("""Joe went to 'the store' where he bought a "box of chocolates" and stuff.""")
Out[402]:
['Joe',
'went',
'to',
'the store',
'where',
'he',
'bought',
'a',
'box of chocolates',
'and',
'stuff.']

how's that work for ya?

http://docs.python.org/library/shlex.html

Tim Chase · Feb 17, 2009

Josh said:
In [401]: import shlex

In [402]: shlex.split("""Joe went to 'the store' where he bought a "box of chocolates" and stuff.""")

how's that work for ya?

It works great if that's the desired behavior. However, the OP
wrote about splitting the lines into separate words, not
"treating quoted items as a single word". (OP: "How would I do
separate lines into words without scanning one character at a time?")

But for pulling out quoted strings as units, shlex is a great
module.

-tkc
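
One more practical difference for chat input: shlex treats a lone apostrophe as an opening quote, so ordinary contractions make shlex.split raise ValueError. A sketch of a fallback to plain whitespace splitting. The safe_shlex_split helper is a made-up name, not part of shlex:

```python
import shlex


def safe_shlex_split(text):
    """Use shlex quoting rules where possible, else fall back to str.split()."""
    try:
        return shlex.split(text)
    except ValueError:
        # Raised for unbalanced quotes, e.g. the apostrophe in "don't".
        return text.split()


print(safe_shlex_split('Joe bought a "box of chocolates" today'))
# ['Joe', 'bought', 'a', 'box of chocolates', 'today']

print(safe_shlex_split("I don't know"))
# ['I', "don't", 'know']
```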

rzed · Feb 22, 2009

On Feb 11, 1:06 am, Duncan Booth wrote: [...]

re.split("(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

Using this code how would it load each word into a temporary
variable.

>>> import re
>>> list_name = re.split(r"(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
>>> list_name[2]
'brown'

You see, temporary variables are set. Their names are spelled
'list_name[x]', where x is an index into the list. If your plan was
instead to have predefined names of variables, what would they be
called? How many would you have? With list variables, you will have
enough, and you will know their names.
