Using re to get data from text file

Discussion in 'Python' started by Jocknerd, Sep 10, 2004.

  1. Jocknerd

    Jocknerd Guest

    I'm a Python newbie and I'm having trouble with Regular Expressions when
    reading in a text file. Here is a sample layout of the input file:

    09/04/2004 Virginia 44 Temple 14
    09/04/2004 LSU 22 Oregon State 21
    09/09/2004 Troy State 24 Missouri 14

    As you can see, the text file contains a list of games. Each game has a
    date, a winning team, the winning team's score, the losing team, and the
    losing team's score. If I set up my program to import the data with a
    fixed-length format, it's no problem. But some of my text files have different
    layouts. For instance, some only have one space between a team name and
    their score.

    Here's how I read in the file using fixed length fields:

    import sys

    filename = sys.argv[1]
    file = open(filename, 'r')

    schedule = []  # make a list called schedule

    while True:
        line = file.readline()
        if not line:
            break
        game = {}  # make a dictionary called game
        game['date'] = line[0:10]  # fixed length field
        game['team1'] = line[12:40].strip()
        game['score1'] = line[40:42]
        game['team2'] = line[44:72].strip()
        game['score2'] = line[72:74]
        schedule.append(game)

    file.close()

    Note: I'm stripping whitespace from the team names because I don't want
    the team name to actually be a fixed length.

    How would I set this up to read in the data using Regular expressions?

    I've tried this:

    while True:
        line = file.readline()
        if not line:
            break
        game = {}
        datePattern = re.compile(r'^(\d{2})\D+(\d{2})\D+(\d{4})')

    Here's where I get stuck. What do I do from here? I just don't know how
    to import the text and assign it to the proper fields using the re module.
     
    Jocknerd, Sep 10, 2004
    #1

  2. William Park

    William Park Guest

    Your format is a bit complicated since a team's name can span a variable
    number of words. But I'm assuming that team names don't contain any digits.
    So, use '\d+' to separate the fields. E.g.
        re.split(r'\d+', line)
        re.split(r'(\d+)', line)
        re.split(r'(\d+)', line[10:])
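
    A minimal sketch of how the last re.split call could be used to fill the
    fields (Python 3). It assumes the date always occupies the first 10
    characters and that team names contain no digits; parse_game is a
    hypothetical helper name, not something from the thread.

```python
import re

def parse_game(line):
    """Split one schedule line into fields (hypothetical helper)."""
    date = line[:10]  # assumption: the date is always MM/DD/YYYY at the start
    # The capturing group keeps the scores in the result list:
    # [' Troy State ', '24', ' Missouri ', '14', '']
    parts = re.split(r'(\d+)', line[10:])
    team1, score1, team2, score2 = [p.strip() for p in parts[:4]]
    return {'date': date, 'team1': team1, 'score1': score1,
            'team2': team2, 'score2': score2}

print(parse_game('09/09/2004 Troy State 24 Missouri 14'))
```

    With the capturing group the scores stay in the list; without it they
    would be discarded as separators.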

    --
    William Park <>
    Open Geometry Consulting, Toronto, Canada
     
    William Park, Sep 10, 2004
    #2

  3. Andrew Dalke

    Andrew Dalke Guest

    Jocknerd wrote:
    > How would I set this up to read in the data using Regular expressions?
    >
    > I've tried this:
    >
    > while True:
    >     line = file.readline ()
    >     if not line: break
    >     game = {}
    >     datePattern = re.compile('^(\d{2})\D+(\d{2})\D+(\d{4})')


    Regular expressions are tricky. Luckily, there are plenty
    of resources available to learn from. Here's a suggestion for how
    to read your data.

    The subtle parts are:
    - I'm using re.X so I can document each of the fields in the re
    - The team name must only contain letters
        [a-zA-Z]+ means "set of letters" (that is, a word)
        [a-zA-Z]+(\s+[a-zA-Z]+)* means "one or more words separated
        by spaces"

    I also use the ^ and $ symbols to make sure the match is
    complete across the whole line.

    If you have teams with digits in the name (e.g., "49ers") then
    you'll have to change the definition of 'word' appropriately.
    I made it a strict test to make sure there wasn't an accidental
    confusion with a score.
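
    To illustrate the point about digits in names, here is a rough sketch
    (Python 3) of one possible looser 'word' definition: a word may contain
    digits but must include at least one letter, so a bare score is never
    taken as part of a name. WORD, TEAM, loose_pat and parse are hypothetical
    names, and the 49ers line is made-up sample data.

```python
import re

# A "word" may contain digits but needs at least one letter, so "31" alone
# cannot be a word while "49ers" can.
WORD = r"[0-9]*[a-zA-Z][a-zA-Z0-9]*"
TEAM = r"(%s(?:\s+%s)*)" % (WORD, WORD)  # one or more words separated by spaces

loose_pat = re.compile(
    r"^\s*(\d\d/\d\d/\d\d\d\d)\s+" + TEAM + r"\s+(\d+)\s+" + TEAM + r"\s+(\d+)\s*$"
)

def parse(line):
    """Return (date, team1, score1, team2, score2) or None if no match."""
    m = loose_pat.match(line)
    return m.groups() if m else None

# Made-up line, used only to exercise the pattern.
print(parse("09/12/2004 San Francisco 49ers 31 Dallas 10"))
```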


    import re

    pat = re.compile(r"""
        ^\s*                        # allow spaces at the start
        (\d\d)/(\d\d)/(\d\d\d\d)    # the month, day, and year

        \s+                         # spaces to the first team name
        ([a-zA-Z]+(\s+[a-zA-Z]+)*)  # one or more words separated by spaces
        \s+                         # spaces to the first score
        (\d+)                       # the score

        \s+                         # spaces to the second team name
        ([a-zA-Z]+(\s+[a-zA-Z]+)*)  # one or more words separated by spaces
        \s+                         # spaces to the second score
        (\d+)                       # the score

        \s*$                        # allow spaces, up to the end
        """, re.X)

    tests = [
        "09/04/2004 Virginia 44 Temple 14",
        "09/04/2004 LSU 22 Oregon State 21",
        "09/09/2004 Troy State 24 Missouri 14",
        "01/02/2003 Florida State 103 University of Miami 2",
    ]

    for test in tests:
        m = pat.match(test)
        if not m:
            raise AssertionError("test failure")
        # groups 5 and 8 are the inner "last word" groups, so they are skipped
        print("Match results:")
        print(" month", m.group(1), "day", m.group(2), "year", m.group(3))
        print(" team #1", m.group(4), "score", m.group(6))
        print(" team #2", m.group(7), "score", m.group(9))


    Here's the output:

    Match results:
    month 09 day 04 year 2004
    team #1 Virginia score 44
    team #2 Temple score 14
    Match results:
    month 09 day 04 year 2004
    team #1 LSU score 22
    team #2 Oregon State score 21
    Match results:
    month 09 day 09 year 2004
    team #1 Troy State score 24
    team #2 Missouri score 14
    Match results:
    month 01 day 02 year 2003
    team #1 Florida State score 103
    team #2 University of Miami score 2
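
    To connect this back to the dictionaries in the original question, here
    is a sketch (Python 3) of the same idea using named groups, so that
    groupdict() returns the fields directly. GAME_RE and parse_schedule are
    hypothetical names; the pattern assumes letter-only team names, as above.

```python
import re

GAME_RE = re.compile(r"""
    ^\s*
    (?P<date>\d\d/\d\d/\d\d\d\d)            # month/day/year kept as one field
    \s+
    (?P<team1>[a-zA-Z]+(?:\s+[a-zA-Z]+)*)   # one or more words
    \s+
    (?P<score1>\d+)
    \s+
    (?P<team2>[a-zA-Z]+(?:\s+[a-zA-Z]+)*)   # one or more words
    \s+
    (?P<score2>\d+)
    \s*$
    """, re.X)

def parse_schedule(lines):
    """Return a list of game dictionaries, skipping lines that do not match."""
    schedule = []
    for line in lines:
        m = GAME_RE.match(line)
        if m:
            schedule.append(m.groupdict())
    return schedule

for game in parse_schedule(["09/04/2004 LSU 22 Oregon State 21\n"]):
    print(game["date"], game["team1"], game["score1"], game["team2"], game["score2"])
```

    Each dictionary has the same keys as the game dictionary in the original
    question, so the rest of the program would not need to change.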


    Andrew
     
    Andrew Dalke, Sep 10, 2004
    #3
  4. John Lenton

    John Lenton Guest

    On Fri, Sep 10, 2004 at 10:29:27AM -0400, Jocknerd wrote:
    > [...]
    >
    > I've tried this:
    >
    > while True:
    >     line = file.readline ()
    >     if not line: break
    >     game = {}
    >     datePattern = re.compile('^(\d{2})\D+(\d{2})\D+(\d{4})')
    >
    > Here's where I get stuck. What do I do from here? I just don't know how
    > to import the text and assign it to the proper fields using the re module.


    how about this:

    import re, time, datetime

    class Game(object):
        def __init__(self, d, t1, t2, s1, s2):
            self.date = d
            self.team1 = t1
            self.team2 = t2
            self.score1 = s1
            self.score2 = s2

        def __str__(self):
            return 'On %s, %s beat %s %s-%s' % (self.date,
                                                self.team1,
                                                self.team2,
                                                self.score1,
                                                self.score2)

    class Games(Game):
        _re = re.compile(r'([\d/]+)'
                         + r'\s+(\w[\w\s]+\w)\s+(\d+)' * 2
                         + r'\s*$')

        def __init__(self, filename):
            self.games = []
            for line in open(filename):  # use the filename that was passed in
                match = self._re.search(line)
                if match:
                    d, t1, s1, t2, s2 = match.groups()
                    d = time.strptime(d, '%m/%d/%Y') # m/d/Y! yecch!
                    d = datetime.date(*d[:3])
                    self.games.append(Game(d, t1, t2, s1, s2))

    if __name__ == '__main__':
        import sys
        for i in Games(sys.argv[1]).games:
            print(i)

    woops! looks like I got carried away.
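
    The date handling above can be tried on its own. This is a small sketch
    for Python 3 that uses datetime.datetime.strptime in place of the
    time.strptime plus datetime.date(*d[:3]) pair; parse_us_date is a
    hypothetical helper name.

```python
from datetime import datetime

def parse_us_date(text):
    """Convert an MM/DD/YYYY string into a datetime.date (hypothetical helper)."""
    return datetime.strptime(text, '%m/%d/%Y').date()

print(parse_us_date('09/04/2004'))
```

    A malformed date raises ValueError, which may be preferable to silently
    storing a bad string.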

    --
    John Lenton () -- Random fortune:
    Of course you have a purpose -- to find a purpose.

     
    John Lenton, Sep 10, 2004
    #4
  5. Jocknerd

    Jocknerd Guest

    Re: Using re to get data from text file: SOLVED

    On Fri, 10 Sep 2004 14:53:32 +0000, William Park wrote:

    > Your format is a bit complicated since team's name can be variable
    > words. But, I'm assuming that they don't have any digit as part of
    > their name. So, use '\d+' to separate the fields. Eg.
    > re.split ('\d+', line)
    > re.split ('(\d+)', line)
    > re.split ('(\d+)', line[10:])


    Couldn't figure out re.split. Didn't seem to do what I wanted. Here's
    what did work:

    #!/usr/bin/python

    import re
    import sys

    filename = sys.argv[1]
    file = open(filename, 'r')

    schedule = []

    pattern = re.compile(r'^(.*\D\d+\D\d+)\D(.*)\D(.*\d+)\D(.*)\D(.*\d+)(.*)$')
    while True:
        line = file.readline()
        if not line:
            break
        g = {}
        # parentheses let the assignment target span two lines
        (g['date'], g['team1'], g['score1'], g['team2'],
         g['score2'], g['location']) = pattern.search(line).groups()
        schedule.append(g)
    file.close()

    for game in schedule:
        print(game['date'], game['team1'], game['score1'], game['team2'],
              game['score2'])
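
    One caveat with the loop above: pattern.search(line) returns None for a
    blank or malformed line, so calling .groups() on it raises
    AttributeError. A hedged sketch (Python 3) that keeps the same pattern
    but skips such lines and strips stray whitespace; read_schedule and
    FIELDS are hypothetical names.

```python
import re

# Same pattern as in the post above.
pattern = re.compile(r'^(.*\D\d+\D\d+)\D(.*)\D(.*\d+)\D(.*)\D(.*\d+)(.*)$')
FIELDS = ('date', 'team1', 'score1', 'team2', 'score2', 'location')

def read_schedule(lines):
    """Build a list of game dictionaries from an iterable of lines."""
    schedule = []
    for line in lines:
        m = pattern.search(line)
        if m is None:  # blank or malformed line: skip instead of crashing
            continue
        values = [v.strip() for v in m.groups()]
        schedule.append(dict(zip(FIELDS, values)))
    return schedule

print(read_schedule(["09/04/2004 LSU   22 Oregon State      21\n", "\n"]))
```

    Any iterable of lines works here, including an open file object.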
     
    Jocknerd, Sep 10, 2004
    #5
  6. William Park

    William Park Guest

    Re: Using re to get data from text file: SOLVED

    In the Bash shell, this kind of cutting/slicing is a bit easier. (Note that
    sscanf and match are not stock Bash builtins; see the references below.)

    1.  line='09/09/2004 Troy State 24 Missouri 14'
        sscanf "$line" '%s %[^0-9] %[0-9] %[^0-9] %[0-9]' date team1 score1 team2 score2
        declare -p date team1 score1 team2 score2

    2.  line='09/09/2004 Troy State 24 Missouri 14'
        match "$line" '([0-9/]*) ([^0-9]*) ([0-9]*) ([^0-9]*) ([0-9]*)' a
        date=${a[1]}
        team1=${a[2]} score1=${a[3]}
        team2=${a[4]} score2=${a[5]}
        declare -p date team1 score1 team2 score2

    Ref:
    http://freshmeat.net/projects/bashdiff/
    http://home.eol.ca/~parkw/index.html#bash
    help sscanf
    help match

    --
    William Park <>
    Open Geometry Consulting, Toronto, Canada
     
    William Park, Sep 10, 2004
    #6
  7. Andrew Dalke

    Andrew Dalke Guest

    Re: Using re to get data from text file: SOLVED

    Jocknerd wrote:
    > pattern = re.compile(r'^(.*\D\d+\D\d+)\D(.*)\D(.*\d+)\D(.*)\D(.*\d+)(.*)$')


    Though I think that's esthetically poor. The .* groups
    will cause a lot of backtracking.
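
    One possible way to act on this, sketched for Python 3: replace most of
    the .* groups with explicit classes so the engine has far fewer ways to
    split the line. It assumes an MM/DD/YYYY date and team names without
    digits, and treats anything after the second score as a free-form
    location; tight_pattern, FIELDS and parse_line are hypothetical names.

```python
import re

# \D+? is a lazy run of non-digits: it stops at the first whitespace that is
# followed by a score, so team names come back without trailing spaces.
tight_pattern = re.compile(
    r'^(\d\d/\d\d/\d{4})\s+(\D+?)\s+(\d+)\s+(\D+?)\s+(\d+)\s*(.*)$'
)
FIELDS = ('date', 'team1', 'score1', 'team2', 'score2', 'location')

def parse_line(line):
    """Return a game dictionary, or None if the line does not match."""
    m = tight_pattern.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None

print(parse_line("09/09/2004 Troy State   24 Missouri 14\n"))
```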

    Andrew
     
    Andrew Dalke, Sep 10, 2004
    #7