RE Help splitting CVS data

Discussion in 'Python' started by Garry, Jan 20, 2013.

  1. Garry

    Garry Guest

    I'm trying to manipulate family tree data using Python.
    I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files
    The data appears in this format:

    Marriage,Husband,Wife,Date,Place,Source,Note0x0a
    Note: the Source field or the Note field can contain quoted data (same as the Place field)

    Actual data:
    [F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
    [F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a

    code snippet follows:

    import os
    import re
    #I'm using the following regex in an attempt to decode the data:
    RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"
    #
    line = "[F0244],[I0690],[I0354],1916-06-08,\"Neely's Landing, Cape Gir. Co, MO\",,"
    #
    (Marriage,Husband,Wife,Date,Place,Source,Note) = re.split(RegExp2,line)
    #
    #However, this does not decode the 7 fields.
    # The following error is displayed:
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ValueError: too many values to unpack
    #
    # When I use xx the fields apparently get unpacked.
    xx = re.split(RegExp2,line)
    #
    >>> print xx[0]


    >>> print xx[1]

    [F0244]
    >>> print xx[5]

    "Neely's Landing, Cape Gir. Co, MO"
    >>> print xx[6]


    >>> print xx[7]


    >>> print xx[8]


    Why is there an extra NULL field before and after my record contents?
    I'm stuck, comments and solutions greatly appreciated.

    Garry
     
    Garry, Jan 20, 2013
    #1

  2. On 01/20/2013 05:04 PM, Garry wrote:
    > I'm trying to manipulate family tree data using Python.
    > I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files
    > The data appears in this format:
    >
    > Marriage,Husband,Wife,Date,Place,Source,Note0x0a
    > Note: the Source field or the Note field can contain quoted data (same as the Place field)
    >
    > Actual data:
    > [F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
    > [F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a
    >
    > code snippet follows:
    >
    > import os
    > import re
    > #I'm using the following regex in an attempt to decode the data:
    > RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"
    > #
    > line = "[F0244],[I0690],[I0354],1916-06-08,\"Neely's Landing, Cape Gir. Co, MO\",,"
    > #
    > (Marriage,Husband,Wife,Date,Place,Source,Note) = re.split(RegExp2,line)
    > #
    > #However, this does not decode the 7 fields.
    > # The following error is displayed:
    > Traceback (most recent call last):
    > File "<stdin>", line 1, in <module>
    > ValueError: too many values to unpack
    > #
    > # When I use xx the fields apparently get unpacked.
    > xx = re.split(RegExp2,line)
    > #
    >>>> print xx[0]
    >>>> print xx[1]

    > [F0244]
    >>>> print xx[5]

    > "Neely's Landing, Cape Gir. Co, MO"
    >>>> print xx[6]
    >>>> print xx[7]
    >>>> print xx[8]

    > Why is there an extra NULL field before and after my record contents?
    > I'm stuck, comments and solutions greatly appreciated.
    >
    > Garry
    >



    Gosh, you really don't want to use regex to split csv lines like that....

    Use the csv module:

    >>> s
    '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,0x0a'
    >>> import csv
    >>> r = csv.reader([s])
    >>> for l in r: print(l)
    ...
    ['[F0244]', '[I0690]', '[I0354]', '1916-06-08', "Neely's Landing, Cape Gir. Co, MO", '', '0x0a']


    The argument to csv.reader can be a file object (or a list of lines).
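
    A minimal sketch of both options, written for Python 3 (the thread itself uses 2.7). It assumes the trailing "0x0a" in the original post is the poster's notation for a newline byte rather than literal text, and it uses io.StringIO as a stand-in for an open file. The variable names are illustrative only.

```python
import csv
import io

# Two records from the original post; the trailing "0x0a" is assumed to be
# the poster's notation for a newline byte, so it is written here as \n.
sample = (
    '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,\n'
    '[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,\n'
)

# Option 1: any file-like object (io.StringIO stands in for an open file).
rows_from_file = list(csv.reader(io.StringIO(sample)))

# Option 2: a plain list of lines.
rows_from_lines = list(csv.reader(sample.splitlines()))

for row in rows_from_file:
    # Each row has exactly seven fields, so tuple unpacking works.
    marriage, husband, wife, date, place, source, note = row
    print(marriage, date, place)
```

    Note that the csv module strips the surrounding quotes from the Place field and keeps the embedded commas, which is the part the regex approach struggles with.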

    - mitya


    --
    Lark's Tongue Guide to Python: http://lightbird.net/larks/
     
    Mitya Sirenef, Jan 20, 2013
    #2

  3. Garry

    Terry Reedy Guest

    On 1/20/2013 5:04 PM, Garry wrote:
    > I'm trying to manipulate family tree data using Python.
    > I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files

    ....
    > I'm stuck, comments and solutions greatly appreciated.


    Why are you not using the cvs module?

    --
    Terry Jan Reedy
     
    Terry Reedy, Jan 20, 2013
    #3
  4. Garry

    Roy Smith Guest

    Garry wrote:

    > Actual data:
    > [F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
    > [F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a
    >
    > code snippet follows:
    >
    > import os
    > import re
    > #I'm using the following regex in an attempt to decode the data:


    First suggestion, don't try to parse CSV data with regex. I'm a huge
    regex fan, but it's just the wrong tool for this job. Use the built-in
    csv module (http://docs.python.org/2/library/csv.html). Or, if you want
    something fancier, read_csv() from pandas (http://tinyurl.com/ajxdxjm).
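
    As a rough illustration of the csv-module route, here is a sketch using csv.DictReader, which looks up fields by the header names given in the original post. It is written for Python 3 and assumes the file starts with that header line; io.StringIO stands in for the real file.

```python
import csv
import io

# Header line plus one record, as described in the original post.
# io.StringIO is a stand-in for open("marriages.csv") (hypothetical name).
data = io.StringIO(
    'Marriage,Husband,Wife,Date,Place,Source,Note\n'
    '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,\n'
)

# DictReader takes the field names from the first line of the input.
records = list(csv.DictReader(data))

for rec in records:
    print(rec['Marriage'], rec['Date'], rec['Place'])
```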

    Second, when you use regexes, *always* use raw strings around the
    pattern:

    RegExp2 = r'....'

    Lastly, take a look at the re.VERBOSE flag. It lets you write monster
    regexes split up into several lines. Between re.VERBOSE and raw
    strings, it can make the difference between line noise like this:

    > RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"


    and something that mere mortals can understand.
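
    For illustration only, here is one way the pattern might look with a raw string and re.VERBOSE (Python 3). This is a sketch, not the original poster's regex: it assumes full YYYY-MM-DD dates and fields with no escaped quotes inside them, and the name RECORD is made up. It also shows where the two extra empty fields in the original question come from: when the pattern matches the whole line, re.split returns the (empty) text before the match, then the seven captured groups, then the (empty) text after the match. Calling match() and groups() returns just the seven groups.

```python
import re

# Raw triple-quoted string + re.VERBOSE: whitespace and "#" comments
# inside the pattern are ignored by the regex engine.
RECORD = re.compile(r'''
    ^(\[[A-Z]\d+\]),        # marriage ID, e.g. [F0244]
    (\[[A-Z]\d+\]),         # husband ID
    (\[[A-Z]\d+\]),         # wife ID
    (\d{4}-\d{2}-\d{2}),    # date (assumed to be a full YYYY-MM-DD)
    ("[^"]*"|[^,]*),        # place: quoted text or a run without commas
    ("[^"]*"|[^,]*),        # source
    ("[^"]*"|[^,]*)$        # note
''', re.VERBOSE)

line = '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,'

# match().groups() returns only the captured groups: seven fields.
fields = RECORD.match(line).groups()
print(len(fields))

# split() also returns the text before and after the match, which is
# empty on both sides here, giving nine items in total.
parts = RECORD.split(line)
print(len(parts))
```

    Unlike the csv module, this keeps the surrounding quotes on the Place field, so the csv module remains the simpler tool for this data.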
     
    Roy Smith, Jan 21, 2013
    #4
  5. Garry

    Tim Chase Guest

    On 01/20/13 16:16, Terry Reedy wrote:
    > On 1/20/2013 5:04 PM, Garry wrote:
    >> I'm trying to manipulate family tree data using Python.
    >> I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files

    > ...
    >> I'm stuck, comments and solutions greatly appreciated.

    >
    > Why are you not using the cvs module?


    that's an easy answer:

    >>> import cvs

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ImportError: No module named cvs


    Now the *csv* module... ;-)

    -tkc
     
    Tim Chase, Jan 21, 2013
    #5
  6. Garry

    Garry Guest

    On Sunday, January 20, 2013 3:04:39 PM UTC-7, Garry wrote:
    > I'm trying to manipulate family tree data using Python.

    ....
    > I'm stuck, comments and solutions greatly appreciated.

    Thanks everyone for your comments. I'm new to Python, but can get around in Perl and regular expressions. I sure was taking the long way trying to get the cvs data parsed.

    Sure hope to teach myself Python. Maybe I need to look into courses offered at the local Jr College!

    Garry
     
    Garry, Jan 21, 2013
    #6
  7. On Mon, Jan 21, 2013 at 11:41 AM, Garry wrote:
    > Thanks everyone for your comments. I'm new to Python, but can get around in Perl and regular expressions. I sure was taking the long way trying to get the cvs data parsed.


    As has been hinted by Tim, you're actually talking about csv data -
    Comma Separated Values. Not to be confused with cvs, an old vcs. (See?
    The v can go anywhere...) Not a big deal, but it's much easier to find
    stuff on PyPI or similar when you have the right keyword to search
    for!

    ChrisA
     
    Chris Angelico, Jan 21, 2013
    #7
  8. Garry

    Neil Cerutti Guest

    On 2013-01-21, Garry wrote:
    > Thanks everyone for your comments. I'm new to Python, but can
    > get around in Perl and regular expressions. I sure was taking
    > the long way trying to get the cvs data parsed.
    >
    > Sure hope to teach myself python. Maybe I need to look into
    > courses offered at the local Jr College!


    There are more than enough free resources online for the
    resourceful Perl programmer to get going. It sounds like you
    might be interested in Text Processing in Python.

    http://gnosis.cx/TPiP/

    Also good for your purposes is Dive Into Python.

    http://www.diveintopython.net/

    --
    Neil Cerutti
     
    Neil Cerutti, Jan 21, 2013
    #8
