# Simple Text Processing Help

Discussion in 'Python' started by patrick.waldo@gmail.com, Oct 14, 2007.

1. ### Guest

Hi all,

I started Python just a little while ago and I am stuck on something
that is really simple, but I just can't figure out.

Essentially I need to take a text document with some chemical
information in Czech and organize it into another text file. The
information is always EINECS number, CAS, chemical name, and formula
in tables. I need to organize them into lines with | in between. So
it goes from:

200-763-1 71-73-8
nátrium-tiopentál C11H18N2O2S.Na to:

200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like: kyselina močová

I get:
200-720-7|69-93-2|kyselina|močová
|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off.

How can I get Python to realize that a chemical name may have a space
in it?

Thank you,
Patrick

So far I have:

#take tables in one text file and organize them into lines in another

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

#read and enter into a list
chem_file = []
chem_file.append(input.read())

#split words and store them in a list
for word in chem_file:
    words = word.split()

#starting values in list
e=0 #EINECS
c=1 #CAS
ch=2 #chemical name
f=3 #formula

n=0
loop=1
x=len(words) #counts how many words there are in the file

print '-'*100
while loop==1:
    if n<x and f<=x:
        print words[e], '|', words[c], '|', words[ch], '|', words[f], '\n'
        output.write(words[e])
        output.write('|')
        output.write(words[c])
        output.write('|')
        output.write(words[ch])
        output.write('|')
        output.write(words[f])
        output.write('\r\n')
        #increase variables by 4 to get next set
        e = e + 4
        c = c + 4
        ch = ch + 4
        f = f + 4
        # increase by 1 to repeat
        n=n+1
    else:
        loop=0

input.close()
output.close()
, Oct 14, 2007

2. ### Marc 'BlackJack' Rintsch (Guest)

On Sun, 14 Oct 2007 13:48:51 +0000, patrick.waldo wrote:

> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula
> in tables. I need to organize them into lines with | in between. So
> it goes from:
>
> 200-763-1 71-73-8
> nátrium-tiopentál C11H18N2O2S.Na to:

Is that in *one* line in the input file or two lines like shown here?

> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.
>
> How can I get Python to realize that a chemical name may have a space
> in it?

If the two elements before and the one element after the name can't
contain spaces it is easy: take the first two and the last as it is and
for the name take from the third to the next to last element = the name
and join them with a space.

In [202]: parts = '123 456 a name with spaces 789'.split()

In [203]: parts[0]
Out[203]: '123'

In [204]: parts[1]
Out[204]: '456'

In [205]: ' '.join(parts[2:-1])
Out[205]: 'a name with spaces'

In [206]: parts[-1]
Out[206]: '789'

This works too if the name doesn't have a space in it:

In [207]: parts = '123 456 name 789'.split()

In [208]: parts[0]
Out[208]: '123'

In [209]: parts[1]
Out[209]: '456'

In [210]: ' '.join(parts[2:-1])
Out[210]: 'name'

In [211]: parts[-1]
Out[211]: '789'
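
The two sessions above can be folded into one small helper. This is only a sketch: it assumes Python 3 and assumes that the name is the only field that can contain spaces. The function name `split_record` is made up for illustration and does not come from the original post.

```python
def split_record(line):
    """Split one record into (first, second, name, last).

    Assumes only the name (the third field) may contain spaces.
    """
    parts = line.split()
    if len(parts) < 4:
        raise ValueError("expected at least 4 parts, got %r" % (parts,))
    # The first two parts and the last part are taken as they are;
    # everything in between is glued back together as the name.
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]


with_spaces = split_record("123 456 a name with spaces 789")
without_spaces = split_record("123 456 name 789")
print(with_spaces)
print(without_spaces)
```

The length check is an addition: with fewer than four parts the slices would still succeed silently and give a record with an empty name.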

> #read and enter into a list
> chem_file = []
> chem_file.append(input.read())

This reads the whole file and puts it into a list. This list will
*always* just contain *one* element. So why a list at all!?

> #split words and store them in a list
> for word in chem_file:
>     words = word.split()

*If* the list would contain more than one element all would be processed
but only the last is bound to words. You could leave out chem_file and
the loop and simply do:

words = input.read().split()

Same effect but less chatty. ;-)

The rest of the source seems to indicate that you don't really want to read
in the whole input file at once but process it line by line, i.e. chemical
element by chemical element.

Ciao,
Marc 'BlackJack' Rintsch
Marc 'BlackJack' Rintsch, Oct 14, 2007

3. ### Paul Hankin (Guest)

On Oct 14, 2:48 pm, wrote:
> Hi all,
>
> I started Python just a little while ago and I am stuck on something
> that is really simple, but I just can't figure out.
>
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula
> in tables. I need to organize them into lines with | in between. So
> it goes from:
>
> 200-763-1 71-73-8
> nátrium-tiopentál C11H18N2O2S.Na to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.
>
> How can I get Python to realize that a chemical name may have a space
> in it?

In the original file, is every chemical on a line of its own? I assume
it is here.

You might use a regexp (look at the re module), or I think here you
can use the fact that only chemicals have spaces in them. Then, you
can split each line on whitespace (like you're doing), and join back
together all the words between the 3rd (ie index 2) and the last (ie
index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses
the somewhat unusual python syntax for replacing a section of a list
with another list.
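
To see the slice assignment in isolation, here is a minimal sketch. It assumes Python 3 and one complete record per line; the sample line is put together from the data quoted in this thread.

```python
tokens = "200-720-7 69-93-2 kyselina močová C5H4N4O3".split()
pieces_before = len(tokens)  # the two-word name was split into two tokens

# Replace the slice holding the name pieces with a one-item list that
# contains the rejoined name; the list shrinks in place.
tokens[2:-1] = [" ".join(tokens[2:-1])]

record = "|".join(tokens)
print(pieces_before, len(tokens))
print(record)
```

The right-hand side has to be a list (or another iterable of items). Assigning the bare string would splice in its individual characters instead.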

The approach you took involves reading the whole file, and building a
list of all the chemicals which you don't seem to use: I've changed it
to a per-line version and removed the big lists.

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    chemical = u'|'.join(tokens)
    print chemical + u'\n'
    output.write(chemical + u'\r\n')

input.close()
output.close()

Obviously, this isn't tested because I don't have your chem_1_utf8.txt
file.

--
Paul Hankin
Paul Hankin, Oct 14, 2007
4. ### Guest

Thank you both for helping me out. I am still rather new to Python
and so I'm probably trying to reinvent the wheel here.

When I try to do Paul's response, I get
>>>tokens = line.strip().split()

[]

So I am not quite sure how to read line by line.

tokens = input.read().split() gets me all the information from the
file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
in the example; however, how can I loop this for the entire document?
Also, when I try output.write(tokens), I get "TypeError: coercing to
Unicode: need string or buffer, list found".

Any ideas?

, Oct 14, 2007
5. ### Marc 'BlackJack' Rintsch (Guest)

On Sun, 14 Oct 2007 16:57:06 +0000, patrick.waldo wrote:

> Thank you both for helping me out. I am still rather new to Python
> and so I'm probably trying to reinvent the wheel here.
>
> When I try to do Paul's response, I get
> >>> tokens = line.strip().split()
> []

What is in line? Paul wrote this in the body of the for loop over
all the lines in the file.

> So I am not quite sure how to read line by line.

That's what the for loop over a file or file-like object is doing.
Maybe you should develop your script in smaller steps and do some printing
to see what you get at each step. For example after opening the input
file:

for line in input:
    print line # prints the whole line.
    tokens = line.split()
    print tokens # prints a list with the split line.

> tokens = input.read().split() gets me all the information from the
> file.

Right it reads *all* of the file, not just one line.

> tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
> in the example; however, how can I loop this for the entire document?

Don't read the whole file but line by line, just like Paul showed you.

> Also, when I try output.write(tokens), I get "TypeError: coercing to
> Unicode: need string or buffer, list found".

tokens is a list but you need to write a unicode string. So you have to
reassemble the parts with '|' characters in between. Also shown by Paul.
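
A short sketch of that last point, assuming Python 3. An in-memory `io.StringIO` stands in for the real output file, and the token list is taken from the sample data in this thread.

```python
import io

tokens = ["200-001-8", "50-00-0", "formaldehyd", "CH2O"]
output = io.StringIO()  # stand-in for the real output file

try:
    output.write(tokens)      # a list is not a string
except TypeError as exc:
    print("write() rejected the list:", exc)

line = "|".join(tokens)       # build one string per record first
output.write(line + "\r\n")
written = output.getvalue()
print(repr(written))
```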

Ciao,
Marc 'BlackJack' Rintsch
Marc 'BlackJack' Rintsch, Oct 14, 2007
6. ### John Machin (Guest)

On Oct 14, 11:48 pm, wrote:
> Hi all,
>
> I started Python just a little while ago and I am stuck on something
> that is really simple, but I just can't figure out.
>
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula
> in tables. I need to organize them into lines with | in between. So
> it goes from:
>
> 200-763-1 71-73-8
> nátrium-tiopentál C11H18N2O2S.Na to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.
>
> How can I get Python to realize that a chemical name may have a space
> in it?
>

Your input file could be in one of THREE formats:
(1) fields are separated by TAB characters (represented in Python by
the escape sequence '\t', and equivalent to '\x09')
(2) fields are fixed width and padded with spaces
(3) fields are separated by a random number of whitespace characters
(and can contain spaces).

What makes you sure that you have format 3? You might like to try
something like
print lines
print map(len, lines)
This will print a *precise* representation of what is in the first
four lines, plus their lengths. Please show us the output.
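
One way to run that kind of inspection, as a sketch only: it assumes Python 3 and reads raw bytes so that nothing is decoded or hidden. The `io.BytesIO` object stands in for `open(path, "rb")`, and its contents are invented to resemble the data in this thread, not copied from the real file.

```python
import io

# Stand-in for open(path, "rb"); these bytes are illustrative only.
raw = io.BytesIO(
    b"\xef\xbb\xbf200-720-7\t69-93-2\n"
    b"kyselina mo\xc4\x8dov\xc3\xa1\tC5H4N4O3\n"
    b"\n"
    b"200-001-8\t50-00-0\n"
)

lines = raw.readlines()[:4]  # only the first four lines
for line in lines:
    # repr() shows tabs, a byte order mark and newlines as escapes.
    print(len(line), repr(line))

# A constant, non-zero tab count per data line would point to format (1).
tab_counts = [line.count(b"\t") for line in lines]
print(tab_counts)

# The "utf-8-sig" codec drops a leading byte order mark while decoding.
decoded = lines[0].decode("utf-8-sig")
print(repr(decoded))
```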
John Machin, Oct 14, 2007
8. ### Guest

> print lines
> print map(len, lines)

gave me:
['\xef\xbb\xbf200-720-7 69-93-2\n', 'kyselina mo\xc4\x8dov\xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
[28, 32, 1, 18]

I think it means that I'm still at option 3. I got the line by line
part. My code is a lot cleaner now:

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])] #this doesn't seem to combine the files correctly
    file = u'|'.join(tokens) #this does put '|' in between
    print file + u'\n'
    output.write(file + u'\r\n')

input.close()
output.close()

my sample input file looks like this (not organized, as you see it):
200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH

etc...

and after the program I get:

200-720-7|69-93-2|
kyselina|mocová||C5H4N4O3

200-001-8|50-00-0|
formaldehyd|CH2O|

200-002-3|
50-01-1|
guanidínium-chlorid|CH5N3.ClH|

etc...
So, I am sort of back at the start again.

tokens = line.strip().split()
for token in tokens:
    print token

I get all the single tokens, which I thought I could then put
together, except when I did:

for token in tokens:
    s = u'|'.join(token)
    print s

I got ?|2|0|0|-|7|2|0|-|7, etc...

How can I join these together into nice neat little lines? When I try
to store the tokens in a list, the tokens double and I don't know
why. I can work on getting the chemical names together after...baby
steps, or maybe I am just missing something obvious. The first two
numbers will always have the same form: three digits-three digits-one
digit and then two digits-two digits-one digit. This seems to be the
only pattern.

My intuition tells me that I need to add an if statement that says, if
the first two numbers follow the pattern, then continue; if they don't
(i.e. a chemical name was accidentally split apart) then the third entry
needs to be put together. Something like

if tokens[1] and tokens[2] startswith('pattern') == true
tokens[2] = join(tokens[2]:tokens[3])
token[3] = token[4]
del token[4]

but the code isn't right...any ideas?

Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have
a couple O'Reilly books, but they don't seem to have a straightforward
example for this kind of text manipulation.

Patrick

, Oct 15, 2007
9. ### Marc 'BlackJack' Rintsch (Guest)

On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:

> my sample input file looks like this( not organized,as you see it):
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH
>
> etc...

That's quite irregular so it is not that straightforward. One way is to
split everything into words, start a record by taking the first two
elements and then look for the start of the next record that looks like
three numbers concatenated by '-' characters. Quick and dirty hack:

import codecs
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    tokens = iter(tokens)
    try:
        nr_a = tokens.next()
        while True:
            nr_b = tokens.next()
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        print '|'.join(element)

Ciao,
Marc 'BlackJack' Rintsch
Marc 'BlackJack' Rintsch, Oct 15, 2007
10. ### Paul Hankin (Guest)

On Oct 15, 12:20 pm, Marc 'BlackJack' Rintsch <> wrote:
> On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:
> > my sample input file looks like this( not organized,as you see it):
> > 200-720-7 69-93-2
> > kyselina mocová C5H4N4O3
> >
> > 200-001-8 50-00-0
> > formaldehyd CH2O
> >
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid CH5N3.ClH
> >
> > etc...
>
> That's quite irregular so it is not that straightforward. One way is to
> split everything into words, start a record by taking the first two
> elements and then look for the start of the next record that looks like
> three numbers concatenated by '-' characters. Quick and dirty hack:
>
> import codecs
> import re
>
> NR_RE = re.compile(r'^\d+-\d+-\d+$')
>
> def iter_elements(tokens):
>     tokens = iter(tokens)
>     try:
>         nr_a = tokens.next()
>         while True:
>             nr_b = tokens.next()
>             items = list()
>             for item in tokens:
>                 if NR_RE.match(item):
>                     yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
>                     nr_a = item
>                     break
>                 else:
>                     items.append(item)
>     except StopIteration:
>         yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

Maybe this is a bit more readable?

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    yield chem

--
Paul Hankin
Paul Hankin, Oct 15, 2007
11. ### Peter Otten (Guest)

patrick.waldo wrote:

> my sample input file looks like this( not organized,as you see it):
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH

Assuming that the records are always separated by blank lines and only the
third field in a record may contain spaces the following might work:

import codecs
from itertools import groupby

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"

def fields(s):
    parts = s.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

if __name__ == "__main__":
    outstream = codecs.open(path2, 'w', 'utf8')
    for record in records(codecs.open(path, "r", "utf8")):
        outstream.write("|".join(fields(record)))
        outstream.write("\n")

Peter
Peter Otten, Oct 15, 2007
12. ### Guest

Wow, thank you all. All three work. To output correctly I needed to use

output.write("\r\n")

This is really a great help!!

Because of my limited Python knowledge, I will need to try to figure
out exactly how they work for future text manipulation and for my own
knowledge. Could you recommend some resources for this kind of text
manipulation? Also, I conceptually get it, but would you mind walking
me through

> for tok in tokens:
>     if NR_RE.match(tok) and len(chem) >= 4:
>         chem[2:-1] = [' '.join(chem[2:-1])]
>         yield chem
>         chem = []
>     chem.append(tok)

and

> for key, group in groupby(instream, unicode.isspace):
>     if not key:
>         yield "".join(group)

Thanks again,
Patrick

, Oct 15, 2007
13. ### Paul Hankin (Guest)

On Oct 15, 10:08 pm, wrote:
> Because of my limited Python knowledge, I will need to try to figure
> out exactly how they work for future text manipulation and for my own
> knowledge. Could you recommend some resources for this kind of text
> manipulation? Also, I conceptually get it, but would you mind walking
> me through
>
> > for tok in tokens:
> >     if NR_RE.match(tok) and len(chem) >= 4:
> >         chem[2:-1] = [' '.join(chem[2:-1])]
> >         yield chem
> >         chem = []
> >     chem.append(tok)

Sure: 'chem' is a list of all the data associated with one chemical.
When a token (tok) arrives that is matched by NR_RE (i.e. 3 lots of
digits separated by hyphens), it's assumed that this is the start of a
new chemical if we've already got 4 pieces of data. Then, we join the
name back up (as was explained in earlier posts), and 'yield chem'
yields up the chemical so far; and a new chemical is started (by
emptying the list). Whatever tok is, it's added to the end of the
current chemical data. Add some print statements in to watch it work
if you can't get it.

This code uses exactly the same algorithm as Marc's code - it's just a
bit clearer (or at least, I thought so). Oh, and it returns a list
rather than a tuple, but that makes no difference.
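
For anyone who wants to watch the generator run, here is a self-contained sketch. It assumes Python 3 and uses the sample data from this thread as an in-memory string. It also departs from the version quoted above in one respect, which is an editorial change: the name of the last record is rejoined before the final yield.

```python
import re

NR_RE = re.compile(r"^\d+-\d+-\d+$")  # matches e.g. 200-720-7 or 69-93-2


def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # A number-like token starts a new record once the current one
        # already holds at least 4 pieces (two numbers, name, formula).
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [" ".join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    if chem:
        # Difference from the quoted version: rejoin the last name too.
        chem[2:-1] = [" ".join(chem[2:-1])]
        yield chem


text = """200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

records = list(iter_elements(text.split()))
for rec in records:
    print("|".join(rec))
```

The sketch still assumes well-formed input: every record needs both numbers, a name and a formula, and no name or formula may itself look like digits-digits-digits.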

--
Paul Hankin
Paul Hankin, Oct 15, 2007
14. ### Paul McGuire (Guest)

On Oct 14, 8:48 am, wrote:
> Hi all,
>
> I started Python just a little while ago and I am stuck on something
> that is really simple, but I just can't figure out.
>
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula
> in tables. I need to organize them into lines with | in between. So
> it goes from:
>
> 200-763-1 71-73-8
> nátrium-tiopentál C11H18N2O2S.Na to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.

Pyparsing might be overkill for this example, but it is a good sample
for a demo. If you end up doing lots of data extraction like this,
pyparsing is a useful tool. In pyparsing, you define expressions
using pyparsing classes and built-in strings, then use the constructed
pyparsing expression to parse the data (using parseString, scanString,
searchString, or transformString). In this example, searchString is
the easiest to use. After the parsing is done, the parsed fields are
returned in a ParseResults object, which has some list and some dict
style behavior. I've given each field a name based on your post, so
that you can read the tokens right out of the results as if they were
attributes of an object. This example emits your '|' delimited data,
but the commented lines show how you could access the individually
parsed fields, too.

-- Paul

# -*- coding: iso-8859-15 -*-

data = """200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH

"""

from pyparsing import Word, nums,OneOrMore,alphas,alphas8bit

# define expressions for each part in the input data

# a numeric id starts with a number, and is followed by
# any number of numbers or '-'s
numericId = Word(nums, nums+"-")

# a chemical name is one or more words, each made up of
# alphas (including 8-bit alphas) or '-'s
chemName = OneOrMore(Word(alphas.lower()+alphas8bit.lower()+"-"))

# when returning the chemical name, rejoin the separate
# words into a single string, with spaces
chemName.setParseAction(lambda t:" ".join(t))

# a chemical formula is a 'word' starting with an uppercase
# alpha, followed by uppercase alphas or numbers
chemFormula = Word(alphas.upper(), alphas.upper()+nums)

# put all expressions into overall form, and attach field names
entry = numericId("EINECS") + \
        numericId("CAS") + \
        chemName("name") + \
        chemFormula("formula")

# search through input data, and print out retrieved data
for chemData in entry.searchString(data):
    print "%(EINECS)s|%(CAS)s|%(name)s|%(formula)s" % chemData
    # or print each field by itself
    # print chemData.EINECS
    # print chemData.CAS
    # print chemData.name
    # print chemData.formula
    # print

prints:
200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3
Paul McGuire, Oct 16, 2007
15. ### Peter Otten (Guest)

patrick.waldo wrote:

> manipulation? Also, I conceptually get it, but would you mind walking
> me through

>> for key, group in groupby(instream, unicode.isspace):
>> if not key:
>> yield "".join(group)

itertools.groupby() splits a sequence into groups with the same key; e.g.
to group names by their first letter you'd do the following:

>>> def first_letter(s): return s[:1]
...
>>> for key, group in groupby(["Anne", "Andrew", "Bill", "Brett", "Alex"], first_letter):
...     print "--- %s ---" % key
...     for item in group:
...         print item
...
--- A ---
Anne
Andrew
--- B ---
Bill
Brett
--- A ---
Alex

Note that there are two groups with the same initial; groupby() considers
only consecutive items in the sequence for the same group.

In your case the sequence is the lines in the file, converted to unicode
strings -- the key is a boolean indicating whether the line consists
entirely of whitespace or not,

>>> u"\n".isspace()
True
>>> u"alpha\n".isspace()
False

but I call it slightly differently, as an unbound method:

>>> unicode.isspace(u"alpha\n")
False

This is only possible because all items in the sequence are known to be
unicode instances. So far we have, using a list instead of a file:

>>> instream = [u"alpha\n", u"beta\n", u"\n", u"gamma\n", u"\n", u"\n", u"delta\n"]
>>> for key, group in groupby(instream, unicode.isspace):
...     print "--- %s ---" % key
...     for item in group:
...         print repr(item)
...
--- False ---
u'alpha\n'
u'beta\n'
--- True ---
u'\n'
--- False ---
u'gamma\n'
--- True ---
u'\n'
u'\n'
--- False ---
u'delta\n'

As you see, groups with real data alternate with groups that contain only
blank lines, and the key for the latter is True, so we can skip them with

if not key: # it's not a separator group
    yield group

As the final refinement we join all lines of the group into a single
string

>>> "".join(group)
u'alpha\nbeta\n'

and that's it.
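
Put together, the whole approach can be run on a small in-memory sample. This is a sketch under a few assumptions: Python 3 (so `str.isspace` takes the place of `unicode.isspace`), records separated by blank lines, and an `io.StringIO` object standing in for the opened input file.

```python
import io
from itertools import groupby


def fields(record):
    parts = record.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]


def records(instream):
    # Consecutive lines with the same key form one group:
    # True for blank lines, False for lines carrying data.
    for is_blank, group in groupby(instream, str.isspace):
        if not is_blank:
            yield "".join(group)


sample = io.StringIO(
    "200-720-7 69-93-2\n"
    "kyselina mocová C5H4N4O3\n"
    "\n"
    "200-002-3\n"
    "50-01-1\n"
    "guanidínium-chlorid CH5N3.ClH\n"
)

rows = ["|".join(fields(rec)) for rec in records(sample)]
for row in rows:
    print(row)
```

Each group is joined inside the loop on purpose: groupby() hands out groups lazily, and a group can no longer be read once the loop has moved on to the next key.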

Peter
Peter Otten, Oct 16, 2007
16. ### Guest

And now for something completely different...

I see a lot of COM stuff with Python for Excel...and I quickly made
the same program output to Excel. What if the input file were a Word
document? Where is there information about manipulating Word
documents, or what could I add to make the same program work for Word?

import codecs
import re
from win32com.client import Dispatch

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

NR_RE = re.compile(r'^\d+-\d+-\d+$') #pattern for EINECS number

tokens = input.read().split()

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlApp.Workbooks.Add()
c = 1

for element in iter_elements(tokens):
    xlApp.ActiveSheet.Cells(c,1).Value = element[0]
    xlApp.ActiveSheet.Cells(c,2).Value = element[1]
    xlApp.ActiveSheet.Cells(c,3).Value = element[2]
    xlApp.ActiveSheet.Cells(c,4).Value = element[3]
    c = c + 1

xlApp.ActiveWorkbook.Close(SaveChanges=1)
xlApp.Quit()
xlApp.Visible = 0
del xlApp

input.close()
output.close()
, Oct 16, 2007
17. ### Guest

And now for something completely different...

I've been reading up a bit about Python and Excel and I quickly told
the program to output to Excel quite easily. However, what if the
input file were a Word document? I can't seem to find much
information about parsing Word files. What could I add to make the
same program work for a Word file?

Again thanks a lot. And the Excel Add on...

import codecs
import re
from win32com.client import Dispatch

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

NR_RE = re.compile(r'^\d+-\d+-\d+$') #pattern for EINECS number

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
c = 1

for element in iter_elements(tokens):
    xlApp.ActiveSheet.Cells(c,1).Value = element[0]
    xlApp.ActiveSheet.Cells(c,2).Value = element[1]
    xlApp.ActiveSheet.Cells(c,3).Value = element[2]
    xlApp.ActiveSheet.Cells(c,4).Value = element[3]
    c = c + 1

xlApp.ActiveWorkbook.Close(SaveChanges=1)
xlApp.Quit()
xlApp.Visible = 0
del xlApp

input.close()
output.close()
, Oct 16, 2007
18. ### Tim Roberts (Guest)

wrote:
>
>And now for something completely different...
>
>I've been reading up a bit about Python and Excel and I quickly told
>the program to output to Excel quite easily. However, what if the
>input file were a Word document? I can't seem to find much
>information about parsing Word files. What could I add to make the
>same program work for a Word file?

Word files are not human-readable. You parse them using
Dispatch("Word.Application"), just the way you wrote the Excel file.

I believe there are some third-party modules that will read a Word file a
little more directly.
--
Tim Roberts,
Providenza & Boekelheide, Inc.
Tim Roberts, Oct 18, 2007