Reading by positions plain text files

Discussion in 'Python' started by javivd, Nov 30, 2010.

  1. javivd

    javivd Guest

    Hi all,

    Sorry, newbie question:

    I have a database in a plain text file (could be .txt or .dat, it's the
    same) that I need to read in Python in order to do some data
    validation. I have read other files of this kind with the split()
    method, reading line by line. But split() relies on a separator
    character (I think... all I know is that it works OK).

    I now have a case in which another file has been provided (besides the
    database) that tells me in which columns of the file each variable is,
    because there isn't any blank or tab character separating the
    variables; they are stuck together. This second file specifies each
    variable's name and its position:


    VARIABLE NAME      POSITION (COLUMN) IN FILE
    var_name_1                 123-123
    var_name_2                 124-125
    var_name_3                 126-126
    ...
    ...
    var_name_N                 512-513 (last positions)

    How can I read this so that each position in the file is associated with
    its variable name?

    Thanks a lot!!

    Javier
     
    javivd, Nov 30, 2010
    #1

  2. Tim Harig

    Tim Harig Guest

    On 2010-11-30, javivd <> wrote:
    > I now have a case in which another file has been provided (besides the
    > database) that tells me in which columns of the file each variable is,
    > because there isn't any blank or tab character separating the
    > variables; they are stuck together. This second file specifies each
    > variable's name and its position:
    >
    > VARIABLE NAME      POSITION (COLUMN) IN FILE
    > var_name_1                 123-123
    > var_name_2                 124-125
    > var_name_3                 126-126
    > ..
    > ..
    > var_name_N                 512-513 (last positions)


    I am unclear on the format of these positions. They do not look like
    what I would expect from absolute references in the data. For instance,
    123-123 may only contain one byte??? which could change for different
    encodings and how you mark line endings. Frankly, the use of the
    word "columns" in the header suggests that the data *is* separated by
    line endings rather than absolute position, and that the position refers
    to the line number. In which case, you can use splitlines() to break up
    the data and then address the proper line by index. Nevertheless,
    you can use file.seek() to move to an absolute offset in the file,
    if that really is what you are looking for.
     
    Tim Harig, Nov 30, 2010
    #2

  3. MRAB

    MRAB Guest

    On 30/11/2010 21:31, javivd wrote:
    > How can I read this so that each position in the file is associated with
    > its variable name?
    >

    It sounds like a similar problem to this:

    http://groups.google.com/group/comp.../123422d510187dc3?show_docid=123422d510187dc3
     
    MRAB, Nov 30, 2010
    #3
  4. javivd

    javivd Guest

    On Nov 30, 11:43 pm, Tim Harig <> wrote:
    > I am unclear on the format of these positions.  They do not look like
    > what I would expect from absolute references in the data.  For instance,
    > 123-123 may only contain one byte??? which could change for different
    > encodings and how you mark line endings.  Frankly, the use of the
    > word "columns" in the header suggests that the data *is* separated by
    > line endings rather than absolute position, and that the position refers
    > to the line number. In which case, you can use splitlines() to break up
    > the data and then address the proper line by index.  Nevertheless,
    > you can use file.seek() to move to an absolute offset in the file,
    > if that really is what you are looking for.


    I work in a survey research firm. The data I'm talking about has a lot
    of 0-1 variables, meaning yes or no answers to a lot of questions. So only
    one character position (not byte) is needed, which explains the 123-123
    kind of positions for a lot of variables.

    And no, MRAB, it's not a similar problem (at least as far as I understood
    it). I have to associate the positions this file gives me with the
    variable names this file gives me for those positions.

    Thank you both, and sorry for my English!

    J
     
    javivd, Dec 1, 2010
    #4
  5. MRAB

    MRAB Guest

    On 01/12/2010 02:03, javivd wrote:
    > I work in a survey research firm. The data I'm talking about has a lot
    > of 0-1 variables, meaning yes or no answers to a lot of questions. So only
    > one character position (not byte) is needed, which explains the 123-123
    > kind of positions for a lot of variables.
    >
    > And no, MRAB, it's not a similar problem (at least as far as I understood
    > it). I have to associate the positions this file gives me with the
    > variable names this file gives me for those positions.
    >
    > Thank you both, and sorry for my English!
    >

    You just have to parse the second file to build a list (or dict)
    containing the name, start position and end position of each variable:

    variables = [("var_name_1", 123, 123), ...]

    and then work through that list, extracting the data between those
    positions in the first file and putting the values in another list (or
    dict).

    You also need to check whether the positions are 1-based or 0-based
    (Python uses 0-based).
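
    A minimal sketch of those steps, written for Python 3 (an assumption;
    the rest of this thread uses Python 2). The helper names parse_layout
    and extract, the tiny sample layout, and the default of treating the
    positions as 1-based and inclusive are all illustrative guesses, not
    something the layout file confirms:

```python
def parse_layout(lines):
    """Build [(name, start, end), ...] from 'name start-end' rows."""
    variables = []
    for row in lines:
        parts = row.split()
        # Skip the header, blank rows, and '...' filler rows.
        if len(parts) < 2 or "-" not in parts[1]:
            continue
        start, _, end = parts[1].partition("-")
        if not (start.isdigit() and end.isdigit()):
            continue
        variables.append((parts[0], int(start), int(end)))
    return variables


def extract(record, variables, one_based=True):
    """Return {name: text} for one fixed-width record."""
    shift = 1 if one_based else 0
    # Layout positions are inclusive; Python slices exclude the end index.
    return {name: record[start - shift:end - shift + 1]
            for name, start, end in variables}


# Hypothetical layout, much shorter than the real one.
layout = """VARIABLE NAME POSITION (COLUMN) IN FILE
var_a 1-1
var_b 2-3
var_c 4-4
""".splitlines()

variables = parse_layout(layout)
print(variables)
print(extract("1072", variables))
```

    If the positions turn out to be 0-based, passing one_based=False shifts
    every slice by one.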
     
    MRAB, Dec 1, 2010
    #5
  6. Tim Chase

    Tim Chase Guest

    On 11/30/2010 08:03 PM, javivd wrote:
    >>> VARIABLE NAME      POSITION (COLUMN) IN FILE
    >>> var_name_1                 123-123
    >>> var_name_2                 124-125
    >>> var_name_3                 126-126
    >>> ..
    >>> ..
    >>> var_name_N                 512-513 (last positions)

    > And no, MRAB, it's not a similar problem (at least as far as I understood
    > it). I have to associate the positions this file gives me with the
    > variable names this file gives me for those positions.


    MRAB may be referring to my reply in that thread where you can do
    something like

    # Python 2 (uses file() and the print statement)
    OFFSETS = 'offsets.txt'
    offsets = {}
    f = file(OFFSETS)
    f.next() # throw away the headers
    for row in f:
        varname, rest = row.split()[:2]
        # sanity check
        if varname in offsets:
            print "[%s] in %s twice?!" % (varname, OFFSETS)
        if '-' not in rest: continue
        start, stop = map(int, rest.split('-'))
        offsets[varname] = slice(start, stop+1) # 0-based offsets
        #offsets[varname] = slice(start-1, stop) # 1-based offsets
    f.close()

    def do_something_with(data):
        # your real code goes here
        print data['var_name_2']

    for row in file('data.txt'):
        # map each variable name to its slice of the current line
        data = dict((name, row[offsets[name]]) for name in offsets)
        do_something_with(data)

    There are additional robustness checks I'd include if your
    offsets file isn't controlled by you (people send me daft data).

    -tkc
     
    Tim Chase, Dec 1, 2010
    #6
  7. Tim Harig

    Tim Harig Guest

    On 2010-12-01, javivd <> wrote:
    > I work in a survey research firm. The data I'm talking about has a lot
    > of 0-1 variables, meaning yes or no answers to a lot of questions. So only
    > one character position (not byte) is needed, which explains the 123-123
    > kind of positions for a lot of variables.


    Then file.seek() is what you are looking for; but, you need to be aware of
    line endings and encodings as indicated. Make sure that you open the file
    using whatever encoding was used when it was generated or you could have
    problems with multibyte characters affecting the offsets.
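
    A small Python 3 sketch of why the encoding matters (the sample text and
    the temporary file are made up for illustration). In binary mode seek()
    counts bytes, so a multibyte character earlier in the file shifts every
    later offset. In Python 3 text mode, seek() offsets are only meant to be
    values previously returned by tell(), so the sketch decodes the text and
    indexes it instead:

```python
import os
import tempfile

text = "ñ0123\n"  # 'ñ' is one character but two bytes in UTF-8
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)

# Binary mode: seek() counts bytes, so offset 3 lands after the 2-byte 'ñ'.
with open(path, "rb") as f:
    f.seek(3)
    by_byte = f.read(1).decode("utf-8")

# Decoding first and then indexing counts characters instead.
with open(path, encoding="utf-8", newline="") as f:
    by_char = f.read()[3]

print(by_byte, by_char)
```

    The two reads disagree by one position, which is exactly the kind of
    drift to watch for if the layout file counts characters.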
     
    Tim Harig, Dec 1, 2010
    #7
  8. javivd

    javivd Guest

    On Dec 1, 3:15 am, Tim Harig <> wrote:
    > Then file.seek() is what you are looking for; but, you need to be aware of
    > line endings and encodings as indicated.  Make sure that you open the file
    > using whatever encoding was used when it was generated or you could have
    > problems with multibyte characters affecting the offsets.


    Ok, I will try it and let you know. Thanks all!!
     
    javivd, Dec 3, 2010
    #8
  9. javivd

    javivd Guest

    On Dec 1, 7:15 am, Tim Harig <> wrote:
    > Then file.seek() is what you are looking for; but, you need to be aware of
    > line endings and encodings as indicated.  Make sure that you open the file
    > using whatever encoding was used when it was generated or you could have
    > problems with multibyte characters affecting the offsets.


    I've tried your advice and something is wrong. Here is my code,



    f = open(r'c:c:\somefile.txt', 'w')

    f.write('0123456789\n0123456789\n0123456789')

    f.close()

    f = open(r'c:\somefile.txt', 'r')


    for line in f:
        f.seek(3,0)
        print f.read(1) #just to know if its printing the rigth column

    I used .seek() in this manner, but it is not working.

    Let me put the problem another way. I have a .txt file with NO
    headers and NO blanks between any columns. But I know that columns
    13 to 15, say, hold variable VARNAME_1 (a three-digit variable, of
    course). How can I extract that column into a list called VARNAME_1?

    Obviously, this should extend to all the positions and variables I
    have to extract from the file.

    Thanks!

    J
     
    javivd, Dec 12, 2010
    #9
  10. Tim Harig

    Tim Harig Guest

    On 2010-12-12, javivd <> wrote:
    > On Dec 1, 7:15 am, Tim Harig <> wrote:
    >> On 2010-12-01, javivd <> wrote:
    >> > On Nov 30, 11:43 pm, Tim Harig <> wrote:
    >> >> encodings and how you mark line endings.  Frankly, the use of the
    >> >> word "columns" in the header suggests that the data *is* separated by
    >> >> line endings rather than absolute position, and that the position refers
    >> >> to the line number. In which case, you can use splitlines() to break up
    >> >> the data and then address the proper line by index.  Nevertheless,


    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    Note that I specifically questioned the use of absolute file position vs.
    position within a column. These are two different things. You use
    different methods to extract each.

    >> > I work in a survey research firm. The data I'm talking about has a lot
    >> > of 0-1 variables, meaning yes or no answers to a lot of questions. So only
    >> > one character position (not byte) is needed, which explains the 123-123
    >> > kind of positions for a lot of variables.

    >>
    >> Then file.seek() is what you are looking for; but, you need to be aware of
    >> line endings and encodings as indicated.  Make sure that you open the file
    >> using whatever encoding was used when it was generated or you could have
    >> problems with multibyte characters affecting the offsets.

    >
    > f = open(r'c:c:\somefile.txt', 'w')


    I suspect you don't need to use the c: twice.

    > f.write('0123456789\n0123456789\n0123456789')


    Note that the file you are writing contains three lines. Is the data that
    you are looking for located at an absolute position in the file or at a
    position within an individual line? If the latter, note that line endings
    may be composed of more than a single character.
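
    A short Python 3 sketch of how the line-ending length feeds into an
    absolute offset, assuming every record has the same width. The helper
    name field_at, the sample rows, and the temporary file names are
    invented for illustration:

```python
import os
import tempfile


def field_at(path, record, column, width, newline):
    """Read one byte at a 0-based column of a 0-based record number."""
    with open(path, "rb") as f:
        # Each record occupies its data width plus the line-ending bytes.
        f.seek(record * (width + len(newline)) + column)
        return f.read(1)


rows = [b"0123456789", b"abcdefghij", b"ABCDEFGHIJ"]
tmpdir = tempfile.mkdtemp()
results = {}
for label, newline in (("unix", b"\n"), ("windows", b"\r\n")):
    path = os.path.join(tmpdir, label + ".txt")  # hypothetical file names
    with open(path, "wb") as f:
        f.write(newline.join(rows))
    results[label] = field_at(path, 2, 3, 10, newline)

print(results)
```

    Passing the wrong line ending for a file (for example b"\n" for the
    two-byte "\r\n" file) lands on a different character, and the error
    grows with the record number.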

    > f.write('0123456789\n0123456789\n0123456789')

                  ^ position 3 using seek()

    > for line in f:


    Perhaps you meant:
        for character in f.read():
    or
        for line in f.read().splitlines():

    > f.seek(3,0)


    This will always take you back to the exact fourth position in the file
    (indicated above).

    > I used .seek() in this manner, but is not working.


    It is working the way it is supposed to.

    If you want the absolute position 3 in a file then:

    f = open('somefile.txt', 'r')
    f.seek(3)
    variable = f.read(1)

    If you want the absolute position in a column:
    f = open('somefile.txt', 'r').read().splitlines()
    for column in f:
        variable = column[3]
     
    Tim Harig, Dec 12, 2010
    #10
  11. Tim Harig

    Tim Harig Guest

    On 2010-12-12, Tim Harig <> wrote:
    >> I used .seek() in this manner, but is not working.

    >
    > It is working the way it is supposed to.
    > If you want the absolute position in a column:
    >
    > f = open('somefile.txt', 'r').read().splitlines()
    > for column in f:
    >     variable = column[3]


    or:
    f = open('somefile.txt', 'r')
    for column in f.readlines():
        variable = column[3]
     
    Tim Harig, Dec 12, 2010
    #11
  12. On Sun, 12 Dec 2010 07:02:13 -0800 (PST), javivd
    <> declaimed the following in
    gmane.comp.python.general:

    >
    > f = open(r'c:c:\somefile.txt', 'w')
    >
    > f.write('0123456789\n0123456789\n0123456789')
    >

    Not the most explanatory sample data... It would be better if the
    records had different contents.

    > f.close()
    >
    > f = open(r'c:\somefile.txt', 'r')
    >
    >
    > for line in f:


    Here you extract one "line" from the file

    > f.seek(3,0)
    > print f.read(1) #just to know if its printing the rigth column
    >

    And here you ignored the entire line you read, seeking to the fourth
    byte from the beginning of the file, and reading just one byte from it.

    I have no idea of how seek()/read() behaves relative to line
    iteration in the for loop... Given the small size of the test data set
    it is quite likely that the first "for line in f" resulted in the entire
    file being read into a buffer, and that buffer scanned to find the line
    ending and return the data preceding it; then the buffer position is set
    to after that line ending so the next "for line" continues from that
    point.

    But in a situation with a large data set, or an unbuffered I/O
    system, the seek()/read() could easily result in resetting the file
    position used by the "for line", so that the second call returns
    "456789\n"... And all subsequent calls too, resulting in an infinite
    loop.
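
    One way to sidestep the question entirely is to never call seek() or
    read() inside the loop and to index into the line the loop has already
    returned. A minimal Python 3 sketch, using an in-memory io.StringIO as a
    stand-in for the open file and made-up records that differ from each
    other:

```python
import io

# Stand-in for an open data file; the records differ so the output is readable.
f = io.StringIO("0123456789\nabcdefghij\nABCDEFGHIJ\n")

fourth_chars = []
for line in f:
    # Index into the line the loop already read; the file position is untouched.
    fourth_chars.append(line[3])

print(fourth_chars)
```

    Each iteration then yields the fourth character of its own record
    rather than the fourth byte of the file.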


    Presuming the assignment requires pulling multiple selected fields
    from individual records, where each record is of the same
    format/spacing, AND that the field selection can not be preprogrammed...

    Sample data file (use fixed width font to view):
    -=-=-=-=-=-
    Wulfraed       09Ranger  1915
    Bask Euren     13Cleric  1511
    Aethelwulf     07Mage    0908
    Cwiculf        08Mage    1008
    -=-=-=-=-=-

    Sample format definition file:
    -=-=-=-=-=-
    Name    0-14
    Level   15-16
    Class   17-24
    THAC0   25-26
    Armor   27-28
    -=-=-=-=-=-

    Code to process (Python 2.5, with minimal error handling):
    -=-=-=-=-=-

    class Extractor(object):
        def __init__(self, formatFile):
            ff = open(formatFile, "r")
            self._format = {}
            self._length = 0
            for line in ff:
                form = line.split("\t") #file must be tab separated
                if len(form) != 2:
                    print "Invalid file format definition: %s" % line
                    continue
                name = form[0]
                columns = form[1].split("-")
                if len(columns) == 1:   #single column definition
                    start = int(columns[0])
                    end = start
                elif len(columns) == 2:
                    start = int(columns[0])
                    end = int(columns[1])
                else:
                    print "Invalid column definition: %s" % form[1]
                    continue
                self._format[name] = (start, end)
                self._length = max(self._length, end)
            ff.close()

        def __call__(self, line):
            data = {}
            if len(line) < self._length:
                print "Data line is too short for required format: ignored"
            else:
                for (name, (start, end)) in self._format.items():
                    data[name] = line[start:end+1]
            return data


    if __name__ == "__main__":
        FORMATFILE = "SampleFormat.tsv"
        DATAFILE = "SampleData.txt"

        characterExtractor = Extractor(FORMATFILE)

        df = open(DATAFILE, "r")
        for line in df:
            fields = characterExtractor(line)
            for (name, value) in fields.items():
                print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
            print

        df.close()
    -=-=-=-=-=-

    Output from running above code:
    -=-=-=-=-=-
    Field name: 'Armor'             value: '15'
    Field name: 'THAC0'             value: '19'
    Field name: 'Level'             value: '09'
    Field name: 'Class'             value: 'Ranger  '
    Field name: 'Name'              value: 'Wulfraed       '

    Field name: 'Armor'             value: '11'
    Field name: 'THAC0'             value: '15'
    Field name: 'Level'             value: '13'
    Field name: 'Class'             value: 'Cleric  '
    Field name: 'Name'              value: 'Bask Euren     '

    Field name: 'Armor'             value: '08'
    Field name: 'THAC0'             value: '09'
    Field name: 'Level'             value: '07'
    Field name: 'Class'             value: 'Mage    '
    Field name: 'Name'              value: 'Aethelwulf     '

    Field name: 'Armor'             value: '08'
    Field name: 'THAC0'             value: '10'
    Field name: 'Level'             value: '08'
    Field name: 'Class'             value: 'Mage    '
    Field name: 'Name'              value: 'Cwiculf        '
    -=-=-=-=-=-

    Note that string fields have not been trimmed, also numeric fields
    are still in text format... The format definition file would need to be
    expanded to include a "string", "integer", "float" (and "Boolean"?) code
    in order for the extractor to do proper type conversions.
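
    A rough Python 3 sketch of what such an expansion might look like. The
    type codes, the CONVERTERS table, and the extract_typed helper are
    hypothetical; the layout is written inline here rather than read from a
    definition file, using the same 0-based inclusive columns as the sample:

```python
# Hypothetical type codes mapped to conversion functions.
CONVERTERS = {"string": str.strip, "integer": int, "float": float}

# (name, start, end, type code); 0-based inclusive columns, as in the sample.
FORMAT = [
    ("Name", 0, 14, "string"),
    ("Level", 15, 16, "integer"),
    ("Class", 17, 24, "string"),
    ("THAC0", 25, 26, "integer"),
    ("Armor", 27, 28, "integer"),
]


def extract_typed(line, fmt=FORMAT):
    """Slice each field out of a fixed-width line and convert it."""
    return {name: CONVERTERS[kind](line[start:end + 1])
            for name, start, end, kind in fmt}


record = extract_typed("Wulfraed       09Ranger  1915")
print(record)
```

    String fields come back trimmed and numeric fields as int, so '09'
    becomes 9. A conversion failure raises ValueError, which a real
    validator would probably want to catch and report per field.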



    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Dec 12, 2010
    #12
  13. On Sun, 12 Dec 2010 14:21:18 -0800, Dennis Lee Bieber
    <> declaimed the following in
    gmane.comp.python.general:


    > Sample data file (use fixed width font to view):
    > -=-=-=-=-=-
    > Wulfraed       09Ranger  1915
    > Bask Euren     13Cleric  1511
    > Aethelwulf     07Mage    0908
    > Cwiculf        08Mage    1008
    > -=-=-=-=-=-
    >
    > Sample format definition file:
    > -=-=-=-=-=-
    > Name    0-14
    > Level   15-16
    > Class   17-24
    > THAC0   25-26
    > Armor   27-28
    > -=-=-=-=-=-
    >

    If it isn't clear from the code -- the DATA file is SPACE FILLED,
    but the DEFINITION file uses a TAB to separate the columns, not spaces.
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Dec 12, 2010
    #13
  14. javivd

    javivd Guest

    Clearly it's working. However, this code is beyond my Python knowledge
    (I don't get along with classes; maybe it's a good moment to learn
    about them...) but I'll dig into it.

    Thanks a lot! It really helps...

    J
     
    javivd, Dec 13, 2010
    #14