Split text file into words

Discussion in 'Python' started by qwweeeit, Mar 8, 2005.

  1. qwweeeit

    qwweeeit Guest

    The standard split() can use only one delimiter. To split a text file
    into words you need multiple delimiters such as blanks, punctuation, math
    signs (+-*/), parentheses and so on.

    I didn't succeed in using re.split()...
    qwweeeit, Mar 8, 2005

  2. Then try again... ;) No, seriously, re.split() can do what you want. Just
    think about what the word delimiters are.

    Say, you want to split on all whitespace, and ",", ".", and "?", then you'd
    use something like:

    ~ $ python
    Python 2.3.5 (#1, Feb 27 2005, 22:40:59)
    [GCC 3.4.3 20050110 (Gentoo Linux, ssp-,
    pie-8.7 on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    ['Hello', 'qwweeeit', 'how', 'are', 'you', 'I', 'am', 'fine', 'today',
    'actually', '']
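
    The call that produced this output did not survive in the post. A minimal
    sketch that splits on whitespace, ",", "." and "?" and yields the same list
    is shown here; the sample sentence and the name WORD_DELIMITERS are
    assumptions, not the original session:

```python
import re

# Character class listing the chosen delimiters; the trailing "+" collapses
# runs such as ", " or "? " into a single split point.
WORD_DELIMITERS = re.compile(r"[\s,.?]+")

# Assumed sample text, chosen so that it reproduces the list quoted above.
sample = "Hello qwweeeit, how are you? I am fine, today, actually."
words = WORD_DELIMITERS.split(sample)
print(words)
```

    The empty string at the end of the list appears because the text ends with
    a delimiter; filtering out empty pieces afterwards is one way to drop it.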

    Extending with other word separators shouldn't be hard... Just have a look at



    --- Heiko.

    Heiko Wundram, Mar 8, 2005

  3. Duncan Booth

    Duncan Booth Guest

    Would you care to elaborate on how you tried to use re.split and failed? We
    aren't mind readers here. An example of your non-working code along with
    the expected result and the actual result would be useful.

    This is the first example given in the documentation for re.split:
    ['Words', 'words', 'words', '']
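
    The call itself is missing from the post. As far as I can tell, the
    documentation example that gives this result splits on runs of non-word
    characters, roughly as follows (the variable name parts is only for
    illustration):

```python
import re

# \W+ matches one or more characters that are not letters, digits or "_".
parts = re.split(r"\W+", "Words, words, words.")
print(parts)  # ['Words', 'words', 'words', '']
```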

    Does it do what you want? If not what do you want?
    Duncan Booth, Mar 8, 2005
  4. qwweeeit

    qwweeeit Guest

    I thank you for your help.
    I already used re.split successfully but in this case...
    I didn't explain more deeply because I don't want someone else do my

    I want to implement a cross-reference tool for variables and commands.
    To do that I must first strip all comments and string literals from the
    Python source.
    In the cleaned source file I must then isolate all the words (keeping
    together the words connected by '.').
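
    One possible way to express that requirement, sketched here with
    re.findall instead of re.split, is to describe what a word looks like
    rather than what separates words. The pattern and the names DOTTED_NAME
    and extract_names are hypothetical, and the sketch assumes comments and
    strings have already been removed:

```python
import re

# Hypothetical pattern: an identifier, optionally followed by ".identifier"
# any number of times, so dotted names stay in one piece.
DOTTED_NAME = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")

def extract_names(line):
    """Return the (possibly dotted) names found in one line of cleaned source."""
    return DOTTED_NAME.findall(line)

names = extract_names("self.count = os.path.join(a, b) + 1")
print(names)  # ['self.count', 'os.path.join', 'a', 'b']
```

    Note that this also returns Python keywords such as "for" or "import"; a
    cross-reference tool would probably want to filter those separately.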

    My wrong code (don't consider the line ref. in traceback ... it's an

    import re

    # input text file w/o strings & comments



    for i in lInput:
        ll=re.split(r"[\s,{}[]()+=-/*]",i)
        fOut.write(' '.join(ll)+'\n')


    Traceback (most recent call last):
    File "./GetWords.py", line 70, in ?
    File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
    RuntimeError: maximum recursion limit exceeded

    .... and if I use:

    Traceback (most recent call last):
    File "./GetWords.py", line 70, in ?
    File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
    File "/usr/lib/python2.3/sre.py", line 230, in _compile
    raise error, v # invalid expression
    sre_constants.error: bad character range

    I thought it was my mistake in the use of re.split...

    I am using:
    Python 2.3.4 (#2, Aug 19 2004, 15:49:40)
    [GCC 3.4.1 (Mandrakelinux (Alpha 3.4.1-3mdk)] on linux2
    qwweeeit, Mar 9, 2005
  5. Duncan Booth

    Duncan Booth Guest

    The stack overflow occurs because the ()+ tries to match an empty string
    as many times as possible.

    This regular expression contains a character set '\s,{}[' followed by the
    expression '()+=-/*]'. You can see that the parentheses aren't part of a
    character set if you reverse their order: that gives you an error when
    the expression is compiled, instead of a failure when trying to match:
    Traceback (most recent call last):
    File "<pyshell#10>", line 1, in -toplevel-
    File "C:\Python24\Lib\sre.py", line 157, in split
    return _compile(pattern, 0).split(string, maxsplit)
    File "C:\Python24\Lib\sre.py", line 227, in _compile
    raise error, v # invalid expression
    error: unbalanced parenthesis
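
    The statement that triggered this traceback is not preserved. A sketch of
    the same check, with the pattern reconstructed from the description above
    (an assumption), could look like this; current Python 3 versions also
    append the position to the message:

```python
import re

# Same pattern as in the question, but with "(" and ")" swapped. The set
# ends at the first "]", so ")" is parsed as a stray closing parenthesis.
reversed_pattern = r"[\s,{}[])(+=-/*]"

try:
    re.compile(reversed_pattern)
except re.error as exc:
    message = str(exc)
    print("compile failed:", message)
else:
    message = None
    print("compiled without error")
```
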
    I suspect you actually meant the character set to include the other
    punctuation characters, in which case you need to escape the closing
    square bracket or make it the first character:
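
    The corrected patterns are missing from the post. The following is a
    sketch of both variants, not necessarily what was originally written. It
    also assumes the "-" should be a literal: left between "=" and "/" it
    would be read as a reversed character range, which is the likely cause of
    the "bad character range" error. The trailing "+" and the names
    DELIMITERS, DELIMITERS_ALT and split_words are additions for illustration:

```python
import re

# Variant 1: escape "]" so it stays inside the set, and escape "-" so that
# "=-/" is not interpreted as a character range.
DELIMITERS = re.compile(r"[\s,{}\[\]()+=\-/*]+")

# Variant 2: put "]" first and "-" last, where both are taken literally.
DELIMITERS_ALT = re.compile(r"[]\s,{}[()+=/*-]+")

def split_words(line, pattern=DELIMITERS):
    """Return the non-empty pieces of line between runs of delimiters."""
    return [piece for piece in pattern.split(line) if piece]

result = split_words("total = (a[i] + b.count) * 2 / n")
print(result)  # ['total', 'a', 'i', 'b.count', '2', 'n']
```

    Because "." is not in the set, names joined by a dot stay together.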





    Duncan Booth, Mar 9, 2005