Split text file into words

Discussion in 'Python' started by qwweeeit, Mar 8, 2005.

  1. qwweeeit

    qwweeeit Guest

    The standard split() can use only one delimiter. To split a text file
    into words you need multiple delimiters like blanks, punctuation, math
    signs (+-*/), parentheses and so on.

    I didn't succeed in using re.split()...
     
    qwweeeit, Mar 8, 2005
    #1

  2. Then try again... ;) No, seriously, re.split() can do what you want. Just
    think about what the word delimiters are.

    Say, you want to split on all whitespace, and ",", ".", and "?", then you'd
    use something like:

    ~ $ python
    Python 2.3.5 (#1, Feb 27 2005, 22:40:59)
    [GCC 3.4.3 20050110 (Gentoo Linux 3.4.3.20050110, ssp-3.4.3.20050110-0,
    pie-8.7 on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    ['Hello', 'qwweeeit', 'how', 'are', 'you', 'I', 'am', 'fine', 'today',
    'actually', '']
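
    The interpreter input is missing from the session above. As a sketch only:
    one call that yields exactly that list is shown here. The sentence and the
    pattern are both assumptions (written for Python 3), not the original input.

```python
import re

# Hypothetical input sentence; the original one was not preserved.
text = "Hello, qwweeeit, how are you? I am fine, today, actually."

# Split on runs of whitespace, ',', '.' and '?'.
words = re.split(r"[\s,.?]+", text)
print(words)
```

    The trailing '' appears because the text ends with a delimiter ('.').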

    Extending with other word separators shouldn't be hard... Just have a look at

    http://docs.python.org/lib/re-syntax.html

    HTH!

    --
    --- Heiko.

     
    Heiko Wundram, Mar 8, 2005
    #2

  3. Duncan Booth

    Duncan Booth Guest

    Would you care to elaborate on how you tried to use re.split and failed? We
    aren't mind readers here. An example of your non-working code along with
    the expected result and the actual result would be useful.

    This is the first example given in the documentation for re.split:
    ['Words', 'words', 'words', '']

    Does it do what you want? If not what do you want?
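
    For reference, the call in the re.split documentation that produces this
    list splits on runs of non-word characters. Shown here in Python 3 syntax:

```python
import re

# \W+ matches one or more characters that are not letters, digits or '_'.
words = re.split(r"\W+", "Words, words, words.")
print(words)
```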
     
    Duncan Booth, Mar 8, 2005
    #3
  4. qwweeeit

    qwweeeit Guest

    I thank you for your help.
    I already used re.split successfully but in this case...
    I didn't explain more deeply because I don't want someone else to do my
    homework.

    I want to implement a variable & command cross-reference tool.
    For this I must strip every comment and string literal from the Python
    source.
    On the cleaned source file I must isolate all the words (keeping the
    words connected by '.')

    My wrong code (don't consider the line ref. in traceback ... it's an
    extract!):

    import re

    # input text file w/o strings & comments

    f=open('file.txt')
    lInput=f.readlines()
    f.close()

    fOut=open('words.txt','w')

    for i in lInput:
        ll=re.split(r"[\s,{}[]()+=-/*]",i)
        fOut.write(' '.join(ll)+'\n')

    fOut.close()

    Traceback (most recent call last):
    File "./GetWords.py", line 70, in ?
    ll=re.split(r"[\s,{}[]()+=-/*]",i)
    File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
    RuntimeError: maximum recursion limit exceeded


    .... and if I use:
    ll=re.split(r"\s,{}[]()+=-/*",i)

    Traceback (most recent call last):
    File "./GetWords.py", line 70, in ?
    ll=re.split(r"\s,{}[]()+=-/*",i)
    File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
    File "/usr/lib/python2.3/sre.py", line 230, in _compile
    raise error, v # invalid expression
    sre_constants.error: bad character range

    I thought it was my mistake in the use of re.split...

    I am using:
    Python 2.3.4 (#2, Aug 19 2004, 15:49:40)
    [GCC 3.4.1 (Mandrakelinux (Alpha 3.4.1-3mdk)] on linux2
     
    qwweeeit, Mar 9, 2005
    #4
  5. Duncan Booth

    Duncan Booth Guest

    The recursion error comes because ()+ repeats an empty group, so the
    engine tries to match an empty string as many times as possible.

    This regular expression contains a character set '\s,{}[' (the first
    unescaped ']' closes the set) followed by the expression '()+=-/*]'. You
    can see that the parentheses aren't part of the character set if you
    reverse their order: that gives an error when the expression is compiled,
    instead of a failure when it tries to match:
    Traceback (most recent call last):
    File "<pyshell#10>", line 1, in -toplevel-
    ll=re.split(r"[\s,{}[])(+=-/*]",i)
    File "C:\Python24\Lib\sre.py", line 157, in split
    return _compile(pattern, 0).split(string, maxsplit)
    File "C:\Python24\Lib\sre.py", line 227, in _compile
    raise error, v # invalid expression
    error: unbalanced parenthesis
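
    Both compile-time failures from this thread can be reproduced without
    calling split at all. This is a minimal sketch, assuming Python 3, where
    the exception is re.error; compile_error is a hypothetical helper name.

```python
import re

def compile_error(pattern):
    """Return the re.error message for pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Parentheses reversed: the set ended at the first ']', so ')' is outside it
# and has no matching '('.
reversed_parens = compile_error(r"[\s,{}[])(+=-/*]")

# No enclosing brackets: '[]()+=-/*' opens a set in which '=-/' is a range
# whose end ('/') sorts before its start ('=').
no_brackets = compile_error(r"\s,{}[]()+=-/*")

print(reversed_parens)
print(no_brackets)
```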
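    I suspect you actually meant the character set to include the other
    punctuation characters, in which case you need to escape the closing square
    bracket or make it the first character. Once everything is inside the set,
    '=-/' would be read as a range running backwards from '=' to '/' (a "bad
    character range"), so the hyphen must also be escaped or moved to the end.

    Try:

    ll=re.split(r"[\s,{}[\]()+=\-/*]",i)

    or:

    ll=re.split(r"[]\s,{}[()+=/*-]",i)

    instead.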
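
    As a rough sketch of the whole split (Python 3 assumed), here is a set in
    which both ']' and '-' are escaped, so neither closes the set nor forms a
    range such as '=-/'. The names DELIMITERS and split_words and the sample
    line are made up for illustration. The '+' after the set collapses runs of
    delimiters, and empty strings are filtered out.

```python
import re

# Delimiters from the thread: whitespace , { } [ ] ( ) + = - / *
# '\]' and '\-' are escaped so they are literals, not set syntax.
# '.' is deliberately absent so dotted names stay in one piece.
DELIMITERS = re.compile(r"[\s,{}[\]()+=\-/*]+")

def split_words(line):
    """Split one source line into words, dropping empty strings."""
    return [word for word in DELIMITERS.split(line) if word]

sample = "total = os.path.join(a, b) + items[0] / 2\n"
result = split_words(sample)
print(result)
```

    Because '.' is not a delimiter, os.path.join survives as a single word,
    which matches the "keep the words connected by '.'" requirement.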
     
    Duncan Booth, Mar 9, 2005
    #5
