Simple Text Processing Help

Discussion in 'Python' started by patrick.waldo@gmail.com, Oct 14, 2007.

  1. Guest

    Hi all,

    I started Python just a little while ago and I am stuck on something
    that is really simple, but I just can't figure it out.

    Essentially I need to take a text document with some chemical
    information in Czech and organize it into another text file. The
    information is always EINECS number, CAS, chemical name, and formula
    in tables. I need to organize them into lines with | in between. So
    it goes from:

    200-763-1 71-73-8
    nátrium-tiopentál C11H18N2O2S.Na

    to:

    200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

    but if I have a chemical like: kyselina močová

    I get:
    200-720-7|69-93-2|kyselina|močová
    |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

    and then it is all off.

    How can I get Python to realize that a chemical name may have a space
    in it?

    Thank you,
    Patrick

    So far I have:

    #take tables in one text file and organize them into lines in another

    import codecs

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r','utf8')
    output = codecs.open(path2, 'w', 'utf8')

    #read and enter into a list
    chem_file = []
    chem_file.append(input.read())

    #split words and store them in a list
    for word in chem_file:
        words = word.split()

    #starting values in list
    e=0 #EINECS
    c=1 #CAS
    ch=2 #chemical name
    f=3 #formula

    n=0
    loop=1
    x=len(words) #counts how many words there are in the file

    print '-'*100
    while loop==1:
        if n<x and f<=x:
            print words[e], '|', words[c], '|', words[ch], '|', words[f], '\n'
            output.write(words[e])
            output.write('|')
            output.write(words[c])
            output.write('|')
            output.write(words[ch])
            output.write('|')
            output.write(words[f])
            output.write('\r\n')
            #increase variables by 4 to get next set
            e = e + 4
            c = c + 4
            ch = ch + 4
            f = f + 4
            # increase by 1 to repeat
            n=n+1
        else:
            loop=0

    input.close()
    output.close()
     
    , Oct 14, 2007
    #1

  2. On Sun, 14 Oct 2007 13:48:51 +0000, patrick.waldo wrote:

    > Essentially I need to take a text document with some chemical
    > information in Czech and organize it into another text file. The
    > information is always EINECS number, CAS, chemical name, and formula
    > in tables. I need to organize them into lines with | in between. So
    > it goes from:
    >
    > 200-763-1 71-73-8
    > nátrium-tiopentál C11H18N2O2S.Na to:


    Is that in *one* line in the input file or two lines like shown here?

    > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
    >
    > but if I have a chemical like: kyselina močová
    >
    > I get:
    > 200-720-7|69-93-2|kyselina|močová
    > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
    >
    > and then it is all off.
    >
    > How can I get Python to realize that a chemical name may have a space
    > in it?


    If the two elements before the name and the one element after it can't
    contain spaces, it is easy: take the first two and the last as they are,
    and for the name take everything from the third up to the next-to-last
    element and join it with spaces.

    In [202]: parts = '123 456 a name with spaces 789'.split()

    In [203]: parts[0]
    Out[203]: '123'

    In [204]: parts[1]
    Out[204]: '456'

    In [205]: ' '.join(parts[2:-1])
    Out[205]: 'a name with spaces'

    In [206]: parts[-1]
    Out[206]: '789'

    This works too if the name doesn't have a space in it:

    In [207]: parts = '123 456 name 789'.split()

    In [208]: parts[0]
    Out[208]: '123'

    In [209]: parts[1]
    Out[209]: '456'

    In [210]: ' '.join(parts[2:-1])
    Out[210]: 'name'

    In [211]: parts[-1]
    Out[211]: '789'

    > #read and enter into a list
    > chem_file = []
    > chem_file.append(input.read())


    This reads the whole file and puts it into a list. This list will
    *always* just contain *one* element. So why a list at all!?

    > #split words and store them in a list
    > for word in chem_file:
    > words = word.split()


    *If* the list contained more than one element, all of them would be
    processed, but only the last result would stay bound to `words`. You could
    leave out `chem_file` and the loop and simply do:

    words = input.read().split()

    Same effect but less chatty. ;-)

    The rest of the source seems to indicate that you don't really want to read
    in the whole input file at once but process it line by line, i.e. chemical
    element by chemical element.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Oct 14, 2007
    #2

  3. Paul Hankin Guest

    On Oct 14, 2:48 pm, patrick.waldo wrote:
    > Hi all,
    >
    > I started Python just a little while ago and I am stuck on something
    > that is really simple, but I just can't figure out.
    >
    > Essentially I need to take a text document with some chemical
    > information in Czech and organize it into another text file. The
    > information is always EINECS number, CAS, chemical name, and formula
    > in tables. I need to organize them into lines with | in between. So
    > it goes from:
    >
    > 200-763-1 71-73-8
    > nátrium-tiopentál C11H18N2O2S.Na to:
    >
    > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
    >
    > but if I have a chemical like: kyselina močová
    >
    > I get:
    > 200-720-7|69-93-2|kyselina|močová
    > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
    >
    > and then it is all off.
    >
    > How can I get Python to realize that a chemical name may have a space
    > in it?


    In the original file, is every chemical on a line of its own? I assume
    it is here.

    You might use a regexp (look at the re module), but I think here you
    can use the fact that only chemical names have spaces in them. You
    can split each line on whitespace (like you're doing), and then join
    back together all the words from the 3rd (i.e. index 2) up to, but not
    including, the last (i.e. index -1) using
    tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses the somewhat
    unusual Python syntax for replacing a section of a list with another
    list.
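
    The slice assignment can be tried on its own. A minimal sketch in
    Python 3 (the thread itself uses Python 2); the sample line is an
    assumption assembled from the data in the question, with all four
    fields on one line:

```python
# Split a record on whitespace, then collapse the name words back into one item.
line = "200-720-7 69-93-2 kyselina močová C5H4N4O3"
tokens = line.split()
# tokens is now ['200-720-7', '69-93-2', 'kyselina', 'močová', 'C5H4N4O3']

# Replace the slice from index 2 up to (not including) the last item
# with a one-element list holding the re-joined name.
tokens[2:-1] = [" ".join(tokens[2:-1])]

record = "|".join(tokens)
print(record)
```

    A name that is already a single word goes through unchanged, because
    the slice then holds just that one word.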

    The approach you took involves reading the whole file, and building a
    list of all the chemicals which you don't seem to use: I've changed it
    to a per-line version and removed the big lists.

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r','utf8')
    output = codecs.open(path2, 'w', 'utf8')

    for line in input:
        tokens = line.strip().split()
        tokens[2:-1] = [u' '.join(tokens[2:-1])]
        chemical = u'|'.join(tokens)
        print chemical + u'\n'
        output.write(chemical + u'\r\n')

    input.close()
    output.close()

    Obviously, this isn't tested because I don't have your chem_1_utf8.txt
    file.

    --
    Paul Hankin
     
    Paul Hankin, Oct 14, 2007
    #3
  4. Guest

    Thank you both for helping me out. I am still rather new to Python
    and so I'm probably trying to reinvent the wheel here.

    When I try to do Paul's response, I get
    >>>tokens = line.strip().split()

    []

    So I am not quite sure how to read line by line.

    tokens = input.read().split() gets me all the information from the
    file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
    in the example; however, how can I loop this for the entire document?
    Also, when I try output.write(tokens), I get "TypeError: coercing to
    Unicode: need string or buffer, list found".

    Any ideas?
     
    , Oct 14, 2007
    #4
  5. On Sun, 14 Oct 2007 16:57:06 +0000, patrick.waldo wrote:

    > Thank you both for helping me out. I am still rather new to Python
    > and so I'm probably trying to reinvent the wheel here.
    >
    > When I try to do Paul's response, I get
    > >>>tokens = line.strip().split()
    >
    > []


    What is in `line`? Paul wrote this in the body of the ``for`` loop over
    all the lines in the file.

    > So I am not quite sure how to read line by line.


    That's what the ``for`` loop over a file or file-like object is doing.
    Maybe you should develop your script in smaller steps and do some printing
    to see what you get at each step. For example after opening the input
    file:

    for line in input:
        print line # prints the whole line.
        tokens = line.split()
        print tokens # prints a list with the split line.

    > tokens = input.read().split() gets me all the information from the
    > file.


    Right it reads *all* of the file, not just one line.

    > tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
    > in the example; however, how can I loop this for the entire document?


    Don't read the whole file but line by line, just like Paul showed you.

    > Also, when I try output.write(tokens), I get "TypeError: coercing to
    > Unicode: need string or buffer, list found".


    `tokens` is a list but you need to write a unicode string. So you have to
    reassemble the parts with '|' characters in between. Also shown by Paul.
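
    A small sketch of that last point in Python 3 (the thread uses
    Python 2, where the error message is worded differently). The
    in-memory StringIO object is an assumption standing in for the opened
    output file:

```python
import io

tokens = ["200-001-8", "50-00-0", "formaldehyd", "CH2O"]

# Stand-in for the opened output file, so the sketch runs without touching disk.
output = io.StringIO()

# Passing the list itself is rejected: write() wants a string.
try:
    output.write(tokens)
except TypeError as exc:
    error = exc

# Joining the fields first produces the string that write() expects.
output.write("|".join(tokens) + "\r\n")

print(repr(output.getvalue()))
```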

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Oct 14, 2007
    #5
  6. John Machin Guest

    On Oct 14, 11:48 pm, patrick.waldo wrote:
    > Hi all,
    >
    > I started Python just a little while ago and I am stuck on something
    > that is really simple, but I just can't figure out.
    >
    > Essentially I need to take a text document with some chemical
    > information in Czech and organize it into another text file. The
    > information is always EINECS number, CAS, chemical name, and formula
    > in tables. I need to organize them into lines with | in between. So
    > it goes from:
    >
    > 200-763-1 71-73-8
    > nátrium-tiopentál C11H18N2O2S.Na to:
    >
    > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
    >
    > but if I have a chemical like: kyselina močová
    >
    > I get:
    > 200-720-7|69-93-2|kyselina|močová
    > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
    >
    > and then it is all off.
    >
    > How can I get Python to realize that a chemical name may have a space
    > in it?
    >


    Your input file could be in one of THREE formats:
    (1) fields are separated by TAB characters (represented in Python by
    the escape sequence '\t', and equivalent to '\x09')
    (2) fields are fixed width and padded with spaces
    (3) fields are separated by a random number of whitespace characters
    (and can contain spaces).

    What makes you sure that you have format 3? You might like to try
    something like
    lines = open('your_file.txt').readlines()[:4]
    print lines
    print map(len, lines)
    This will print a *precise* representation of what is in the first
    four lines, plus their lengths. Please show us the output.
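
    The same inspection, sketched for Python 3, where the file has to be
    opened in binary mode to see the raw bytes. The temporary file and its
    contents are assumptions made up for illustration; they are not the
    poster's data:

```python
import os
import tempfile

# Write a small stand-in file. It starts with a UTF-8 byte order mark and
# uses spaces on the first line and a TAB on the second.
sample = "\ufeff200-720-7    69-93-2\n200-001-8\t50-00-0\n".encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

# Read in binary mode so nothing is decoded or translated on the way in.
with open(path, "rb") as handle:
    raw_lines = handle.readlines()[:4]
os.remove(path)

for raw in raw_lines:
    # repr() shows TABs as \t and non-ASCII bytes as \x.. escapes.
    print(len(raw), repr(raw))

has_bom = raw_lines[0].startswith(b"\xef\xbb\xbf")
uses_tabs = [b"\t" in raw for raw in raw_lines]
```

    Checking each line separately matters, since one file can mix
    separators from line to line.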
     
    John Machin, Oct 14, 2007
    #6
  8. Guest

    > lines = open('your_file.txt').readlines()[:4]
    > print lines
    > print map(len, lines)


    gave me:
    ['\xef\xbb\xbf200-720-7 69-93-2\n', 'kyselina mo\xc4\x8dov\xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
    [28, 32, 1, 18]

    I think it means that I'm still at option 3. I got the line by line
    part. My code is a lot cleaner now:

    import codecs

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r','utf8')
    output = codecs.open(path2, 'w', 'utf8')

    for line in input:
        tokens = line.strip().split()
        tokens[2:-1] = [u' '.join(tokens[2:-1])] #this doesn't seem to combine the files correctly
        file = u'|'.join(tokens) #this does put '|' in between
        print file + u'\n'
        output.write(file + u'\r\n')

    input.close()
    output.close()

    My sample input file looks like this (not organized, as you can see):
    200-720-7 69-93-2
    kyselina mocová C5H4N4O3

    200-001-8 50-00-0
    formaldehyd CH2O

    200-002-3
    50-01-1
    guanidínium-chlorid CH5N3.ClH

    etc...

    and after the program I get:

    200-720-7|69-93-2|
    kyselina|mocová||C5H4N4O3

    200-001-8|50-00-0|
    formaldehyd|CH2O|

    200-002-3|
    50-01-1|
    guanidínium-chlorid|CH5N3.ClH|

    etc...
    So, I am sort of back at the start again.

    If I add:

    tokens = line.strip().split()
    for token in tokens:
        print token

    I get all the single tokens, which I thought I could then put
    together, except when I did:

    for token in tokens:
        s = u'|'.join(token)
        print s

    I got ?|2|0|0|-|7|2|0|-|7, etc...

    How can I join these together into nice neat little lines? When I try
    to store the tokens in a list, the tokens double and I don't know
    why. I can work on getting the chemical names together after...baby
    steps, or maybe I am just missing something obvious. The first two
    numbers will always follow the same pattern: three digits-three
    digits-one digit and then two digits-two digits-one digit. This seems
    to be the only pattern.

    My intuition tells me that I need to add an if statement that says: if
    the first two numbers follow the pattern, then continue; if they don't
    (i.e. a chemical name was accidentally split apart), then the third
    entry needs to be put together. Something like

    if tokens[1] and tokens[2] startswith('pattern') == true
        tokens[2] = join(tokens[2]:tokens[3])
        token[3] = token[4]
        del token[4]

    but the code isn't right...any ideas?

    Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have
    a couple O'Reilly books, but they don't seem to have a straightforward
    example for this kind of text manipulation.

    Patrick

     
    , Oct 15, 2007
    #8
  9. On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:

    > my sample input file looks like this( not organized,as you see it):
    > 200-720-7 69-93-2
    > kyselina mocová C5H4N4O3
    >
    > 200-001-8 50-00-0
    > formaldehyd CH2O
    >
    > 200-002-3
    > 50-01-1
    > guanidínium-chlorid CH5N3.ClH
    >
    > etc...


    That's quite irregular so it is not that straightforward. One way is to
    split everything into words, start a record by taking the first two
    elements and then look for the start of the next record that looks like
    three numbers concatenated by '-' characters. Quick and dirty hack:

    import codecs
    import re

    NR_RE = re.compile(r'^\d+-\d+-\d+$')

    def iter_elements(tokens):
        tokens = iter(tokens)
        try:
            nr_a = tokens.next()
            while True:
                nr_b = tokens.next()
                items = list()
                for item in tokens:
                    if NR_RE.match(item):
                        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                        nr_a = item
                        break
                    else:
                        items.append(item)
        except StopIteration:
            yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])


    def main():
        in_file = codecs.open('test.txt', 'r', 'utf-8')
        tokens = in_file.read().split()
        in_file.close()
        for element in iter_elements(tokens):
            print '|'.join(element)

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Oct 15, 2007
    #9
  10. Paul Hankin Guest


    Maybe this is a bit more readable?

    def iter_elements(tokens):
        chem = []
        for tok in tokens:
            if NR_RE.match(tok) and len(chem) >= 4:
                chem[2:-1] = [' '.join(chem[2:-1])]
                yield chem
                chem = []
            chem.append(tok)
        # join the name of the last record as well before yielding it
        chem[2:-1] = [' '.join(chem[2:-1])]
        yield chem
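
    A self-contained Python 3 sketch of the same grouping idea, run on the
    sample input from this thread. The name group_records and the inline
    sample string are illustrative assumptions. The final record is joined
    before it is yielded, so a multi-word name at the end of the file is
    handled too:

```python
import re

# Same pattern as NR_RE: digits-digits-digits, e.g. an EINECS or CAS number.
NR_RE = re.compile(r"^\d+-\d+-\d+$")

def group_records(tokens):
    """Yield [einecs, cas, name, formula] lists from a flat token list."""
    chem = []
    for tok in tokens:
        # A number-like token starts a new record only once the current
        # record already holds at least four tokens.
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [" ".join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    if chem:
        # Flush the final record, joining its name words as well.
        chem[2:-1] = [" ".join(chem[2:-1])]
        yield chem

sample = """200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

records = list(group_records(sample.split()))
for rec in records:
    print("|".join(rec))
```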

    --
    Paul Hankin
     
    Paul Hankin, Oct 15, 2007
    #10
  11. Peter Otten Guest

    patrick.waldo wrote:

    > my sample input file looks like this( not organized,as you see it):
    > 200-720-7 69-93-2
    > kyselina mocová C5H4N4O3
    >
    > 200-001-8 50-00-0
    > formaldehyd CH2O
    >
    > 200-002-3
    > 50-01-1
    > guanidínium-chlorid CH5N3.ClH


    Assuming that the records are always separated by blank lines and only the
    third field in a record may contain spaces the following might work:

    import codecs
    from itertools import groupby

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"

    def fields(s):
        parts = s.split()
        return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

    def records(instream):
        for key, group in groupby(instream, unicode.isspace):
            if not key:
                yield "".join(group)

    if __name__ == "__main__":
        outstream = codecs.open(path2, 'w', 'utf8')
        for record in records(codecs.open(path, "r", "utf8")):
            outstream.write("|".join(fields(record)))
            outstream.write("\n")
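
    The same blank-line grouping sketched for Python 3, where
    unicode.isspace becomes str.isspace. The in-memory StringIO input is
    an assumption standing in for the opened file:

```python
import io
from itertools import groupby

def fields(record):
    parts = record.split()
    # First two and last field are kept; everything in between is the name.
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    # groupby() collects consecutive lines that share the same key; here the
    # key is True for whitespace-only lines and False for lines with data.
    for is_blank, group in groupby(instream, str.isspace):
        if not is_blank:
            yield "".join(group)

# Stand-in for the opened input file (an assumption for illustration).
instream = io.StringIO(
    "200-720-7 69-93-2\n"
    "kyselina mocová C5H4N4O3\n"
    "\n"
    "200-002-3\n"
    "50-01-1\n"
    "guanidínium-chlorid CH5N3.ClH\n"
)

lines = ["|".join(fields(record)) for record in records(instream)]
print("\n".join(lines))
```

    With a real file, open(path, encoding="utf-8-sig") would also drop
    the byte order mark that shows up in the raw dump earlier in the
    thread.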

    Peter
     
    Peter Otten, Oct 15, 2007
    #11
  12. Guest

    Wow, thank you all. All three work. To output correctly I needed to
    add:

    output.write("\r\n")

    This is really a great help!!

    Because of my limited Python knowledge, I will need to try to figure
    out exactly how they work for future text manipulation and for my own
    knowledge. Could you recommend some resources for this kind of text
    manipulation? Also, I conceptually get it, but would you mind walking
    me through

    > for tok in tokens:
    >     if NR_RE.match(tok) and len(chem) >= 4:
    >         chem[2:-1] = [' '.join(chem[2:-1])]
    >         yield chem
    >         chem = []
    >     chem.append(tok)


    and

    > for key, group in groupby(instream, unicode.isspace):
    >     if not key:
    >         yield "".join(group)



    Thanks again,
    Patrick



     
    , Oct 15, 2007
    #12
  13. Paul Hankin Guest

    On Oct 15, 10:08 pm, patrick.waldo wrote:
    > Because of my limited Python knowledge, I will need to try to figure
    > out exactly how they work for future text manipulation and for my own
    > knowledge. Could you recommend some resources for this kind of text
    > manipulation? Also, I conceptually get it, but would you mind walking
    > me through
    >
    > > for tok in tokens:
    > >     if NR_RE.match(tok) and len(chem) >= 4:
    > >         chem[2:-1] = [' '.join(chem[2:-1])]
    > >         yield chem
    > >         chem = []
    > >     chem.append(tok)


    Sure: 'chem' is a list of all the data associated with one chemical.
    When a token (tok) arrives that is matched by NR_RE (i.e. 3 groups of
    digits separated by hyphens), it's assumed that this is the start of a
    new chemical if we've already got 4 pieces of data. Then we join the
    name back up (as was explained in earlier posts), 'yield chem'
    yields up the chemical so far, and a new chemical is started (by
    emptying the list). Whatever tok is, it's added to the end of the
    current chemical data. Add some print statements to watch it work
    if you can't follow it.

    This code uses exactly the same algorithm as Marc's code - it's just a
    bit clearer (or at least, I thought so). Oh, and it returns a list
    rather than a tuple, but that makes no difference.
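
    To see why the length check is needed, here is a short Python 3
    sketch of what NR_RE accepts; the token list is an assumption drawn
    from the sample data in this thread:

```python
import re

NR_RE = re.compile(r"^\d+-\d+-\d+$")

tokens = ["200-720-7", "69-93-2", "kyselina", "C5H4N4O3", "CH5N3.ClH"]

# Keep only the tokens that look like digits-digits-digits.
number_like = [tok for tok in tokens if NR_RE.match(tok)]
print(number_like)
```

    Both the EINECS and the CAS number match, so a match alone cannot
    mark the start of a record. The CAS number arrives while chem holds
    only one item, and the len(chem) >= 4 test keeps it from starting a
    new chemical.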

    --
    Paul Hankin
     
    Paul Hankin, Oct 15, 2007
    #13
  14. Paul McGuire Guest

    On Oct 14, 8:48 am, patrick.waldo wrote:
    > Hi all,
    >
    > I started Python just a little while ago and I am stuck on something
    > that is really simple, but I just can't figure out.
    >
    > Essentially I need to take a text document with some chemical
    > information in Czech and organize it into another text file.  The
    > information is always EINECS number, CAS, chemical name, and formula
    > in tables.  I need to organize them into lines with | in between.  So
    > it goes from:
    >
    > 200-763-1                     71-73-8
    > nátrium-tiopentál           C11H18N2O2S.Na           to:
    >
    > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
    >
    > but if I have a chemical like: kyselina močová
    >
    > I get:
    > 200-720-7|69-93-2|kyselina|močová
    > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
    >
    > and then it is all off.


    Pyparsing might be overkill for this example, but it is a good sample
    for a demo. If you end up doing lots of data extraction like this,
    pyparsing is a useful tool. In pyparsing, you define expressions
    using pyparsing classes and built-in strings, then use the constructed
    pyparsing expression to parse the data (using parseString, scanString,
    searchString, or transformString). In this example, searchString is
    the easiest to use. After the parsing is done, the parsed fields are
    returned in a ParseResults object, which has some list and some dict
    style behavior. I've given each field a name based on your post, so
    that you can read the tokens right out of the results as if they were
    attributes of an object. This example emits your '|' delimited data,
    but the commented lines show how you could access the individually
    parsed fields, too.

    Learn more about pyparsing at http://pyparsing.wikispaces.com/ .

    -- Paul


    # -*- coding: iso-8859-15 -*-

    data = """200-720-7 69-93-2
    kyselina mocová C5H4N4O3


    200-001-8 50-00-0
    formaldehyd CH2O


    200-002-3
    50-01-1
    guanidínium-chlorid CH5N3.ClH

    """

    from pyparsing import Word, nums, OneOrMore, alphas, alphas8bit

    # define expressions for each part in the input data

    # a numeric id starts with a number, and is followed by
    # any number of numbers or '-'s
    numericId = Word(nums, nums+"-")

    # a chemical name is one or more words, each made up of
    # alphas (including 8-bit alphas) or '-'s
    chemName = OneOrMore(Word(alphas.lower()+alphas8bit.lower()+"-"))

    # when returning the chemical name, rejoin the separate
    # words into a single string, with spaces
    chemName.setParseAction(lambda t:" ".join(t))

    # a chemical formula is a 'word' starting with an uppercase
    # alpha, followed by alphas, numbers, or '.' (so that salts
    # such as CH5N3.ClH or C11H18N2O2S.Na are kept whole)
    chemFormula = Word(alphas.upper(), alphas+nums+".")

    # put all expressions into overall form, and attach field names
    entry = numericId("EINECS") + \
            numericId("CAS") + \
            chemName("name") + \
            chemFormula("formula")

    # search through input data, and print out retrieved data
    for chemData in entry.searchString(data):
        print "%(EINECS)s|%(CAS)s|%(name)s|%(formula)s" % chemData
        # or print each field by itself
        # print chemData.EINECS
        # print chemData.CAS
        # print chemData.name
        # print chemData.formula
        # print


    prints:
    200-720-7|69-93-2|kyselina mocová|C5H4N4O3
    200-001-8|50-00-0|formaldehyd|CH2O
    200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH
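
    For comparison, here is a rough standard-library sketch of the same
    record grammar using re instead of pyparsing, written in Python 3
    syntax. It is not the pyparsing code above: the pattern, the sample
    data, and the names ENTRY and lines are assumptions made for
    illustration, and it assumes the name and formula share one line with
    the formula last.

```python
import re

# Hypothetical sample in the layout described in the original post.
data = """200-720-7 69-93-2
kyselina močová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

# Two dash-separated numeric ids, then a name (matched lazily, so it may
# contain spaces), then a formula that starts with an uppercase letter
# and closes the line.
ENTRY = re.compile(
    r"(?P<EINECS>\d+-\d+-\d+)\s+"
    r"(?P<CAS>\d+-\d+-\d+)\s+"
    r"(?P<name>.+?)\s+"
    r"(?P<formula>[A-Z][A-Za-z0-9.]*)\s*$",
    re.MULTILINE,
)

lines = [
    "|".join(m.group("EINECS", "CAS", "name", "formula"))
    for m in ENTRY.finditer(data)
]
for line in lines:
    print(line)
```

    The lazy name group is what lets a multi-word name such as
    "kyselina močová" stay in one field: it keeps growing until what
    follows looks like a formula at the end of the line.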
     
    Paul McGuire, Oct 16, 2007
    #14
  15. Peter Otten Guest

    patrick.waldo wrote:

    > manipulation? Also, I conceptually get it, but would you mind walking
    > me through


    >> for key, group in groupby(instream, unicode.isspace):
    >>     if not key:
    >>         yield "".join(group)


    itertools.groupby() splits a sequence into groups with the same key; e. g.
    to group names by their first letter you'd do the following:

    >>> def first_letter(s): return s[:1]
    ...
    >>> for key, group in groupby(["Anne", "Andrew", "Bill", "Brett", "Alex"], first_letter):
    ...     print "--- %s ---" % key
    ...     for item in group:
    ...         print item
    ...
    --- A ---
    Anne
    Andrew
    --- B ---
    Bill
    Brett
    --- A ---
    Alex

    Note that there are two groups with the same initial; groupby() considers
    only consecutive items in the sequence for the same group.
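
    If you did want one group per initial, a common approach is to sort by
    the same key first. This is a small sketch in Python 3 syntax; the
    names list and the grouped dict are only illustrative, and sorting is
    not needed for the blank-line splitting discussed here.

```python
from itertools import groupby

def first_letter(s):
    return s[:1]

names = ["Anne", "Andrew", "Bill", "Brett", "Alex"]

# Sorting by the grouping key makes equal keys adjacent, so groupby()
# sees each initial exactly once. sorted() is stable, so the original
# order within each initial is kept.
grouped = {
    key: list(group)
    for key, group in groupby(sorted(names, key=first_letter), first_letter)
}
print(grouped)
```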

    In your case the sequence is the lines in the file, converted to unicode
    strings -- the key is a boolean indicating whether the line consists
    entirely of whitespace or not,

    >>> u"\n".isspace()
    True
    >>> u"alpha\n".isspace()
    False

    but I call it slightly differently, as an unbound method:

    >>> unicode.isspace(u"alpha\n")
    False

    This is only possible because all items in the sequence are known to be
    unicode instances. So far we have, using a list instead of a file:

    >>> instream = [u"alpha\n", u"beta\n", u"\n", u"gamma\n", u"\n", u"\n", u"delta\n"]
    >>> for key, group in groupby(instream, unicode.isspace):
    ...     print "--- %s ---" % key
    ...     for item in group:
    ...         print repr(item)
    ...
    --- False ---
    u'alpha\n'
    u'beta\n'
    --- True ---
    u'\n'
    --- False ---
    u'gamma\n'
    --- True ---
    u'\n'
    u'\n'
    --- False ---
    u'delta\n'

    As you see, groups with real data alternate with groups that contain only
    blank lines, and the key for the latter is True, so we can skip them with

    if not key: # it's not a separator group
        yield group

    As the final refinement we join all lines of the group into a single
    string

    >>> "".join(group)
    u'alpha\nbeta\n'

    and that's it.
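
    Put together, the steps above can be sketched as one small generator.
    This version uses Python 3 syntax, where unicode is simply str; the
    function name iter_records and the sample list are made up for the
    demonstration.

```python
from itertools import groupby

def iter_records(lines):
    """Yield each run of consecutive non-blank lines as one string."""
    for is_blank, group in groupby(lines, str.isspace):
        if not is_blank:  # skip the separator groups of blank lines
            yield "".join(group)

sample = ["alpha\n", "beta\n", "\n", "gamma\n", "\n", "\n", "delta\n"]
records = list(iter_records(sample))
print(records)
```

    Any iterable of lines works in place of the list, including an open
    text file.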

    Peter
     
    Peter Otten, Oct 16, 2007
    #15
  16. Guest

    And now for something completely different...

    I see a lot of COM stuff with Python for excel...and I quickly made
    the same program output to excel. What if the input file were a Word
    document? Where is there information about manipulating word
    documents, or what could I add to make the same program work for word?

    Again thanks a lot. I'll start hitting some books about this sort of
    text manipulation.

    The Excel add on:

    import codecs
    import re
    from win32com.client import Dispatch

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r','utf8')
    output = codecs.open(path2, 'w', 'utf8')

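    NR_RE = re.compile(r'^\d+-\d+-\d+$') #pattern for EINECS number

    tokens = input.read().split()
    def iter_elements(tokens):
        product = []
        for tok in tokens:
            # a number after a complete record starts the next record
            if NR_RE.match(tok) and len(product) >= 4:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            product.append(tok)
        # the last record needs its name words joined as well
        if len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product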

    xlApp = Dispatch("Excel.Application")
    xlApp.Visible = 1
    xlApp.Workbooks.Add()
    c = 1

    for element in iter_elements(tokens):
        xlApp.ActiveSheet.Cells(c,1).Value = element[0]
        xlApp.ActiveSheet.Cells(c,2).Value = element[1]
        xlApp.ActiveSheet.Cells(c,3).Value = element[2]
        xlApp.ActiveSheet.Cells(c,4).Value = element[3]
        c = c + 1

    xlApp.ActiveWorkbook.Close(SaveChanges=1)
    xlApp.Quit()
    xlApp.Visible = 0
    del xlApp

    input.close()
    output.close()
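
    The record-splitting part can be tried without Excel installed. The
    sketch below is in Python 3 syntax, feeds the generator a made-up
    two-record sample instead of the file, and joins the name words of
    the final record too; the sample text and the rows variable are
    assumptions for illustration only.

```python
import re

# Matches both EINECS and CAS numbers, since they share the same shape.
NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    """Group whitespace-split tokens into [EINECS, CAS, name, formula]."""
    product = []
    for tok in tokens:
        # A number token after a complete record starts the next record.
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    # Emit the final record, joining its name words as well.
    if len(product) >= 4:
        product[2:-1] = [' '.join(product[2:-1])]
        yield product

text = """200-720-7 69-93-2
kyselina močová C5H4N4O3
200-763-1 71-73-8
nátrium-tiopentál C11H18N2O2S.Na
"""
rows = list(iter_elements(text.split()))
for row in rows:
    print('|'.join(row))
```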
     
    , Oct 16, 2007
    #16
  17. Guest

    And now for something completely different...

    I've been reading up a bit about Python and Excel and I quickly told
    the program to output to Excel quite easily. However, what if the
    input file were a Word document? I can't seem to find much
    information about parsing Word files. What could I add to make the
    same program work for a Word file?

    Again thanks a lot.

    And the Excel Add on...

    import codecs
    import re
    from win32com.client import Dispatch

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r','utf8')
    output = codecs.open(path2, 'w', 'utf8')

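    NR_RE = re.compile(r'^\d+-\d+-\d+$') #pattern for EINECS number

    tokens = input.read().split()
    def iter_elements(tokens):
        product = []
        for tok in tokens:
            # a number after a complete record starts the next record
            if NR_RE.match(tok) and len(product) >= 4:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            product.append(tok)
        # the last record needs its name words joined as well
        if len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product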

    xlApp = Dispatch("Excel.Application")
    xlApp.Visible = 1
    xlApp.Workbooks.Add()
    c = 1

    for element in iter_elements(tokens):
        xlApp.ActiveSheet.Cells(c,1).Value = element[0]
        xlApp.ActiveSheet.Cells(c,2).Value = element[1]
        xlApp.ActiveSheet.Cells(c,3).Value = element[2]
        xlApp.ActiveSheet.Cells(c,4).Value = element[3]
        c = c + 1

    xlApp.ActiveWorkbook.Close(SaveChanges=1)
    xlApp.Quit()
    xlApp.Visible = 0
    del xlApp

    input.close()
    output.close()
     
    , Oct 16, 2007
    #17
  18. Tim Roberts Guest

    wrote:
    >
    >And now for something completely different...
    >
    >I've been reading up a bit about Python and Excel and I quickly told
    >the program to output to Excel quite easily. However, what if the
    >input file were a Word document? I can't seem to find much
    >information about parsing Word files. What could I add to make the
    >same program work for a Word file?


    Word files are not human-readable. You parse them using
    Dispatch("Word.Application"), just the way you wrote the Excel file.

    I believe there are some third-party modules that will read a Word file a
    little more directly.
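
    As one illustration of reading a file more directly: the newer .docx
    format (not the legacy binary .doc) is a zip archive whose main text
    lives in word/document.xml, so the standard library can pull out the
    paragraph text. This is only a sketch under that assumption, in
    Python 3 syntax; docx_paragraphs is a hypothetical helper, and the
    in-memory archive is a stripped-down stand-in, not a complete Word
    document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(fileobj):
    """Return the text of each paragraph in a .docx file or file object."""
    with zipfile.ZipFile(fileobj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        paragraphs.append(text)
    return paragraphs

# Build a tiny stand-in archive in memory so the sketch runs without Word.
# It holds only the one part this function reads; a real .docx has more.
body = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>200-001-8 50-00-0</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>formaldehyd CH2O</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", body)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
print(paragraphs)
```

    The resulting strings could then be split into tokens and handed to
    the same record-grouping code used for the text file.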
    --
    Tim Roberts,
    Providenza & Boekelheide, Inc.
     
    Tim Roberts, Oct 18, 2007
    #18