Split text file into words

qwweeeit

The standard split() can use only one delimiter. To split a text file
into words you need multiple delimiters such as blanks, punctuation, math
signs (+-*/), parentheses and so on.

I didn't succeed in using re.split()...
 
Heiko Wundram

qwweeeit said:
The standard split() can use only one delimiter. To split a text file
into words you need multiple delimiters such as blanks, punctuation, math
signs (+-*/), parentheses and so on.

I didn't succeed in using re.split()...

Then try again... ;) No, seriously, re.split() can do what you want. Just
think about what the word delimiters are.

Say, you want to split on all whitespace, and ",", ".", and "?", then you'd
use something like:

heiko@heiko ~ $ python
Python 2.3.5 (#1, Feb 27 2005, 22:40:59)
[GCC 3.4.3 20050110 (Gentoo Linux 3.4.3.20050110, ssp-3.4.3.20050110-0,
pie-8.7 on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> teststr = "Hello qwweeeit, how are you? I am fine, today, actually."
>>> re.split(r"[\s\.,\?]+", teststr)
['Hello', 'qwweeeit', 'how', 'are', 'you', 'I', 'am', 'fine', 'today',
'actually', '']

Extending with other word separators shouldn't be hard... Just have a look at

http://docs.python.org/lib/re-syntax.html
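
As a rough sketch of that extension: the separator set below is an assumption based on the characters listed in the question, and split_words is a hypothetical helper name. The character set is widened to cover math signs and brackets, and the empty strings that re.split() leaves at the edges are filtered out.

```python
import re

# Assumed separator set: whitespace, common punctuation, math signs and
# brackets. The hyphen is placed last so it is not read as a range.
SEPARATORS = re.compile(r"[\s.,;:!?(){}\[\]+*/=-]+")


def split_words(text):
    """Split text on the separators and drop empty strings."""
    return [word for word in SEPARATORS.split(text) if word]


print(split_words("Hello qwweeeit, how are you? x = (a+b)*c/2"))
```

Filtering with "if word" removes the trailing '' seen in the session above, which appears whenever the text starts or ends with a separator.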

HTH!

--
--- Heiko.

 
Duncan Booth

qwweeeit said:
The standard split() can use only one delimiter. To split a text file
into words you need multiple delimiters such as blanks, punctuation, math
signs (+-*/), parentheses and so on.

I didn't succeed in using re.split()...

Would you care to elaborate on how you tried to use re.split and failed? We
aren't mind readers here. An example of your non-working code along with
the expected result and the actual result would be useful.

This is the first example given in the documentation for re.split:

>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']

Does it do what you want? If not what do you want?
 
qwweeeit

Thank you for your help.
I have already used re.split successfully, but in this case...
I didn't explain in more detail because I don't want someone else to do my
homework.

I want to implement a variable and command cross-reference tool.
To do this I must strip all comments and string literals from the Python
source.
From the cleaned source file I must then isolate all the words (keeping
together the words connected by '.').

My faulty code (ignore the line number in the traceback; this is an
extract):

import re

# input text file w/o strings & comments

f=open('file.txt')
lInput=f.readlines()
f.close()

fOut=open('words.txt','w')

for i in lInput:
.. ll=re.split(r"[\s,{}[]()+=-/*]",i)
.. fOut.write(' '.join(ll)+'\n')

fOut.close()

Traceback (most recent call last):
  File "./GetWords.py", line 70, in ?
    ll=re.split(r"[\s,{}[]()+=-/*]",i)
  File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
RuntimeError: maximum recursion limit exceeded


... and if I use:
ll=re.split(r"\s,{}[]()+=-/*",i)

Traceback (most recent call last):
  File "./GetWords.py", line 70, in ?
    ll=re.split(r"\s,{}[]()+=-/*",i)
  File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
  File "/usr/lib/python2.3/sre.py", line 230, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

I thought it was my mistake in the use of re.split...

I am using:
Python 2.3.4 (#2, Aug 19 2004, 15:49:40)
[GCC 3.4.1 (Mandrakelinux (Alpha 3.4.1-3mdk)] on linux2
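
One possible approach to the goal described above (isolating words while keeping names joined by '.' in one piece) is to match the words directly with re.findall() rather than split on separators. This is only a sketch: the definition of a "word" as a dotted identifier and the helper name dotted_words are assumptions, not something stated in the thread.

```python
import re

# Assumed definition of a "word": an identifier, optionally followed by
# further identifiers joined with '.', e.g. "os.path.join".
DOTTED_NAME = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")


def dotted_words(line):
    """Return the identifiers in line, keeping dotted names in one piece."""
    return DOTTED_NAME.findall(line)


print(dotted_words("ll = re.split(pattern, i) + self.items[0]"))
```

Matching instead of splitting avoids having to enumerate every possible separator, but this pattern would also pick up letters embedded in numeric literals such as 1e5, so the input is assumed to be mostly names and operators.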
 
Duncan Booth

qwweeeit said:
ll=re.split(r"[\s,{}[]()+=-/*]",i)

The stack overflow occurs because the ()+ tries to match an empty string as
many times as possible.

This regular expression contains a character set '\s,{}[' followed by the
expression '()+=-/*]'. You can see that the parentheses aren't part of a
character set if you reverse their order, which gives you an error when the
expression is compiled instead of a failure when trying to match:

ll=re.split(r"[\s,{}[])(+=-/*]",i)

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in -toplevel-
    ll=re.split(r"[\s,{}[])(+=-/*]",i)
  File "C:\Python24\Lib\sre.py", line 157, in split
    return _compile(pattern, 0).split(string, maxsplit)
  File "C:\Python24\Lib\sre.py", line 227, in _compile
    raise error, v # invalid expression
error: unbalanced parenthesis
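
I suspect you actually meant the character set to include the other
punctuation characters, in which case you need to escape the closing square
bracket or make it the first character. The '-' also needs to be escaped or
moved to the end of the set; otherwise '=-/' is read as a reversed character
range, which is what caused the 'bad character range' error in your second
attempt.

Try:

ll=re.split(r"[\s,{}[\]()+=/*-]",i)

or:

ll=re.split(r"[]\s,{}[()+=/*-]",i)

instead.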
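
A quick way to check how '-' behaves inside a character set, as a sketch only: the helper name compiles and the sample line are made up for illustration. A '-' between two characters is treated as a range, so '=-/' is rejected because '=' sorts after '/', while a '-' placed last is taken literally.

```python
import re


def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# '=-/' is parsed as a range from '=' to '/', which is reversed.
print(compiles(r"[\s,{}[\]()+=-/*]"))

# Same characters with the hyphen moved to the end of the set.
print(compiles(r"[\s,{}[\]()+=/*-]"))

# Hypothetical sample line; the trailing '+' merges runs of separators.
line = "x = foo(a, b) + bar[i] - c/d\n"
words = [w for w in re.split(r"[\s,{}\[\]()+=/*-]+", line) if w]
print(words)
```

The example uses Python 3 syntax; on the Python 2.3 interpreter from the thread the except clause and print calls would need adjusting.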
 
