String manipulation

marco.minerva · Apr 4, 2007

Hi all!

I have a file containing some expressions such as "kindest
regard" and "yours sincerely". I must create a Python script that
checks whether a text contains one or more of these expressions and,
if so, replaces the spaces in the expression with the character
"_". For example, the text

Yours sincerely, Marco.

must be transformed into:

Yours_sincerely, Marco.

Now I have written this code:

filemw = codecs.open(sys.argv[1], "r", "iso-8859-1").readlines()
filein = codecs.open(sys.argv[2], "r", "iso-8859-1").readlines()

mw = ""
for line in filemw:
    mw = mw + line.strip() + "|"

mwfind_re = re.compile(r"^(" + mw + ")",re.IGNORECASE|re.VERBOSE)
mwfind_subst = r"_"

for line in filein:
    line = line.strip()
    if (line != ""):
        line = mwfind_re.sub(mwfind_subst, line)
        print line

It correctly identifies the expressions, but doesn't replace the
character in the right way. How can I do what I want?

Thanks in advance.

Alexander Schmolck · Apr 4, 2007

All the code is untested, but should give you the idea.


marco.minerva said:
mw = ""
for line in filemw:
    mw = mw + line.strip() + "|"

One "|" too many. Generally, use join instead of many individual string +s.

mwfind_re_string = "(%s)" % "|".join(line.strip() for line in filemw)

marco.minerva said:
mwfind_re = re.compile(r"^(" + mw + ")",re.IGNORECASE|re.VERBOSE)

mwfind_re = re.compile(mwfind_re_string, re.IGNORECASE)

marco.minerva said:
mwfind_subst = r"_"

for line in filein:

That doesn't work. What about "kindest\nregard"? I think you're best off
reading the whole file in (don't forget to close the files, BTW).

marco.minerva said:
    line = line.strip()
    if (line != ""):
        line = mwfind_re.sub(mwfind_subst, line)
        print line

It correctly identifies the expressions, but doesn't replace the
character in the right way. How can I do what I want?

Use the fact that you can also use a function as a substitution.

print mwfind_re.sub(lambda match: match.group().replace(' ','_'),
                    "".join(line.strip() for line in filein))

'as
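
For reference, here is a minimal self-contained sketch of the function-as-substitution idea from this reply, written for Python 3 rather than the Python 2 used in the thread. The helper name `underscore_phrases` and the inline phrase list are assumptions for illustration. The calls to `re.escape` and the longest-first ordering are additions that are not in the original code; they are there so that phrases are matched literally and a longer phrase is preferred over a shorter one that is its prefix.

```python
import re

def underscore_phrases(text, phrases):
    """Replace spaces inside each listed phrase with underscores (case-insensitive)."""
    # Drop empty entries; an empty alternative would match everywhere.
    ordered = sorted((p.strip() for p in phrases if p.strip()), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)
    # The callable receives each match object and returns the replacement text.
    return pattern.sub(lambda m: m.group().replace(" ", "_"), text)

phrases = ["kindest regard", "yours sincerely"]
print(underscore_phrases("Yours sincerely, Marco.", phrases))
# prints: Yours_sincerely, Marco.
```

Because the replacement is computed from the matched text itself, the original capitalisation ("Yours") is preserved even though matching ignores case.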

Alexander Schmolck · Apr 4, 2007

Alexander Schmolck said:
That doesn't work. What about "kindest\nregard"? I think you're best off
reading the whole file in (don't forget to close the files, BTW).

I should have written "that may not always work, depending on whether the set
phrases you're interested in can also span lines". If in doubt, it's better
to assume they can.

'as

marco.minerva · Apr 4, 2007


Hi Alexander!

Thank you very much, your code works perfectly!

Alexander Schmolck · Apr 4, 2007

marco.minerva said:
Thank you very much, your code works perfectly!

One thing I forgot: you might want to make the whitespace handling a bit more
robust/general e.g. by using something along the lines of

set_phrase.replace(' ', r'\w+')

'as

marco.minerva · Apr 5, 2007

Alexander Schmolck said:
One thing I forgot: you might want to make the whitespace handling a bit more
robust/general e.g. by using something along the lines of

set_phrase.replace(' ', r'\w+')

Hi!

Thanks again... But where must I insert this instruction?

Alexander Schmolck · Apr 5, 2007

Oops, sorry I meant r'\s+'.

marco.minerva said:
Hi!
Thanks again... But where must I insert this instruction?

If you're sure the code already does what you want, you can forget about my
remark; I was thinking of transforming individual patterns like so: 'kindest
regard' -> r'kindest\s+regard', but it really depends on the details of your
spec, which I'm not familiar with.
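
The pattern transformation described above could be sketched as follows (Python 3). The helper name `phrase_to_pattern` is hypothetical, and splitting the phrase on whitespace before joining with `\s+` is one possible reading of the suggestion, not necessarily the exact code intended.

```python
import re

def phrase_to_pattern(set_phrase):
    """Turn 'kindest regard' into a regex accepting any whitespace run between words."""
    words = set_phrase.split()
    # Escape each word so that punctuation in a phrase is matched literally.
    return r"\s+".join(re.escape(word) for word in words)

pattern = re.compile(phrase_to_pattern("kindest regard"), re.IGNORECASE)
text = "my kindest\nregard to all"
# Inside a match, collapse every whitespace run (space, tab, newline) to "_".
result = pattern.sub(lambda m: re.sub(r"\s+", "_", m.group()), text)
print(result)
# prints: my kindest_regard to all
```

With this variant the phrase is found even when it is broken across a line, so the input does not have to be normalized first.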

For example you clearly want to do some amount of whitespace normalization
(because you use ``.strip()``), but how much? The most extreme you could go is

input = " ".join(file.read().split()) # all newlines, tabs, multiple spaces -> " "

In which case you don't need to worry about modifying the patterns to take
care of possible whitespace variations. Another possibility is that you
specify the patterns you want to replace as regexps in the file e.g.

\bkind(?:est)?\b\s+regard(?:s)?\b
\byours,\b
...
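
Both options can be combined in a small sketch (Python 3). The function names and the two example patterns are assumptions for illustration; they are simplified variants, not the exact patterns listed above.

```python
import re

def normalize_whitespace(text):
    """Collapse newlines, tabs and runs of spaces into single spaces."""
    return " ".join(text.split())

def compile_pattern_lines(lines):
    """Combine one regex per non-empty line into a single alternation."""
    # Wrap each pattern in a non-capturing group so "|" inside it stays local.
    parts = ["(?:%s)" % line.strip() for line in lines if line.strip()]
    return re.compile("|".join(parts), re.IGNORECASE)

# Hypothetical contents of the pattern file, one regex per line.
pattern_lines = [r"\bkind(?:est)?\s+regards?\b", r"\byours\s+sincerely\b"]
matcher = compile_pattern_lines(pattern_lines)

text = normalize_whitespace("With kindest\n\tregards,   yours sincerely")
result = matcher.sub(lambda m: m.group().replace(" ", "_"), text)
print(result)
# prints: With kindest_regards, yours_sincerely
```

After normalization every gap is a single space, so a plain `replace(" ", "_")` inside the match is sufficient.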

In any case I'd suggest the following: think about what possible edge cases
your input can contain and how you'd like to handle them; then write them up
as unit tests (use doctest or unittest and StringIO) and finally modify your
code until it passes all the tests. Here are some examples of possible test
patterns:

- """kindest regard,"""
- """kindest regard"""
- """kindest\tregard"""
- """kind regards"""
- """mankind regards other species as inferior"""
- """... and please send your wife my kindest
regards,"""
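
A few of these cases written up with `unittest` might look like the sketch below (Python 3). The function `mark_phrases`, its single pattern, and the expected outputs are all assumptions: the real expectations depend on the spec, which the thread does not give.

```python
import re
import unittest

# Hypothetical implementation under test; the real spec may differ.
PATTERN = re.compile(r"\bkind(?:est)?\s+regards?\b", re.IGNORECASE)

def mark_phrases(text):
    """Join the words of each matched phrase with underscores."""
    return PATTERN.sub(lambda m: "_".join(m.group().split()), text)

class MarkPhrasesTest(unittest.TestCase):
    def test_trailing_comma(self):
        self.assertEqual(mark_phrases("kindest regard,"), "kindest_regard,")

    def test_tab_separator(self):
        self.assertEqual(mark_phrases("kindest\tregard"), "kindest_regard")

    def test_phrase_spanning_lines(self):
        self.assertEqual(
            mark_phrases("send your wife my kindest\nregards,"),
            "send your wife my kindest_regards,",
        )

    def test_no_match_inside_longer_word(self):
        text = "mankind regards other species as inferior"
        self.assertEqual(mark_phrases(text), text)

# Run the tests explicitly so the snippet also works when pasted into a session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MarkPhrasesTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

The "mankind" case is the reason for the leading `\b`: without it, "kind regards" inside "mankind regards" would be rewritten too.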

Finally, if you're looking for a programming exercise you could try the
following: rather than working on strings and using regexps, work on a
"stream" of words (i.e. ["kindest", "regards", ...]) and write your own code
to match sequences of words.
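
One possible shape for that exercise, as a rough sketch in Python 3: the function name `join_set_phrases` and the representation of phrases as tuples of lowercase words are assumptions, and punctuation handling is deliberately left out.

```python
def join_set_phrases(words, phrases):
    """Scan a list of words and merge any listed word sequence with underscores."""
    # Each phrase is a tuple of lowercase words; try longer phrases first
    # and skip empty tuples, which would otherwise never advance the scan.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    result = []
    i = 0
    while i < len(words):
        for phrase in ordered:
            window = words[i:i + len(phrase)]
            if tuple(w.lower() for w in window) == phrase:
                result.append("_".join(window))
                i += len(phrase)
                break
        else:
            # No phrase starts at this position: keep the word and move on.
            result.append(words[i])
            i += 1
    return result

phrases = [("kindest", "regards"), ("yours", "sincerely")]
words = "Please accept my kindest regards".split()
print(" ".join(join_set_phrases(words, phrases)))
# prints: Please accept my kindest_regards
```

Since `split()` already discards newlines and tabs, phrases that span lines are handled for free; a token such as "regards," with attached punctuation would need a proper tokenizer to match.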

'as

p.s. BTW, I overlooked the ``.readlines()`` before, but you don't need it --
files are iterable and you also want to hang on to the opened file object so
that you can close it when you're done.
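
In current Python the usual way to get both (iterate over the file directly and have it closed reliably) is a `with` block. The sketch below assumes Python 3; the helper name `read_phrases` is hypothetical, and a temporary file stands in for the phrase file passed as `sys.argv[1]` in the original script.

```python
import os
import tempfile

def read_phrases(path, encoding="iso-8859-1"):
    """Return the non-empty, stripped lines of a phrase file."""
    # The with block closes the file even if an error occurs while reading.
    with open(path, "r", encoding=encoding) as handle:
        # Iterating over the file object yields one line at a time.
        return [line.strip() for line in handle if line.strip()]

# Small demonstration with a temporary file in place of a real phrase file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="iso-8859-1") as tmp:
    tmp.write("kindest regard\n\nyours sincerely\n")

loaded = read_phrases(tmp.name)
print(loaded)
# prints: ['kindest regard', 'yours sincerely']
os.remove(tmp.name)
```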
