Replace and inserting strings within .txt files with the use of regex

Νίκος · Aug 9, 2010

Two problems here:

str.replace doesn't use regular expressions. You'll have to use the re
module to use regexps. (the re.sub function to be precise)

'.' matches a single character. Any character, but only one.
'.*' matches as many characters as possible. This is not what you want,
since it will match everything between the *first* <? and the *last* ?>.
You want non-greedy matching.

'.*?' is the same thing, without the greed.
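
The difference between greedy and non-greedy matching can be seen in a minimal sketch. The sample text is made up for illustration, and the question marks of the tags are escaped with a backslash because '?' is itself special in a regex:

```python
import re

# Hypothetical sample text with two short tags on one line.
text = "a <?one?> b <?two?> c"

# Greedy: '.*' runs from the first '<?' to the last '?>'.
greedy = re.findall(r"<\?.*\?>", text)

# Non-greedy: '.*?' stops at the nearest '?>'.
lazy = re.findall(r"<\?.*?\?>", text)

print(greedy)  # ['<?one?> b <?two?>']
print(lazy)    # ['<?one?>', '<?two?>']
```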

Thank you.

So I guess this needs to be written as:

src_data = re.sub( '<?(.*?)?>', '', src_data )

The 'r' prefix doesn't need to be inserted before the regex here,
since the regex doesn't contain any backslashes.

You will have to find the </body> tag before inserting the string.
str.find should help -- or you could use str.replace and replace the
</body> tag with your counter line, plus a new </body>.

Ah yes! Damn, why didn't I think of it... str.replace should do the
trick. I was stuck trying to figure out regexes.

So, I guess this should work:

src_data = src_data.replace('</body>', '<br><br><h4><font color=green> Αριθμός Επισκεπτών: %(counter)d </body>' )

No it's not. You're just giving up too soon.

Yes, you are right. Your hints keep me going, and thank you for that.
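
The str.replace idea discussed above can be sketched roughly as follows. The page content, the counter markup and the way the %(counter)d placeholder is filled in later are all assumptions made for illustration, not the poster's actual template:

```python
# Hypothetical page content; only the closing </body> tag matters here.
src_data = "<html><body><p>Hello</p></body></html>"

# Hypothetical counter markup; '%(counter)d' stays a template placeholder.
counter_line = "<br><br><h4>Visitors: %(counter)d</h4>"

# Replace the closing tag with the new line followed by the same tag.
src_data = src_data.replace("</body>", counter_line + "</body>")

# Later, the placeholder could be filled with a mapping.
filled = src_data % {"counter": 42}
print(filled)
```

Note that %-formatting the whole page only works if the page contains no other literal '%' characters, so a real template would need to escape those first.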

Νίκος · Aug 9, 2010

Now the code looks as follows:

=============================
#!/usr/bin/python

import re, os, sys

id = 0 # unique page_id

for currdir, files, dirs in os.walk('test'):

    for f in files:

        if f.endswith('php'):

            # get abs path to filename
            src_f = join(currdir, f)

            # open php src file
            print ( 'reading from %s' % src_f )
            f = open(src_f, 'r')
            src_data = f.read() # read contents of PHP file
            f.close()

            # replace tags
            print ( 'replacing php tags and contents within' )
            src_data = re.sub( '<?(.*?)?>', '', src_data )

            # add ID
            print ( 'adding unique page_id' )
            src_data = ( '' % id ) + src_data
            id += 1

            # add template variables
            print ( 'adding counter template variable' )
            src_data = src_data.replace('</body>', '<br><br><center><h4><font color=green> Αριθμός Επισκεπτών: %(counter)d </body>' )

            # rename old php file to new with .html extension
            src_file = src_file.replace('.php', '.html')

            # open newly created html file for inserting data
            print ( 'writing to %s' % dest_f )
            dest_f = open(src_f, 'w')
            dest_f.write(src_data) # write contents
            dest_f.close()
=============================

I just tried to test it. I created a folder named 'test' on my 'd:\'
drive.
Then I put two .php files from the original project inside it, to check
that the script works for those two files before running it on the whole
copy and afterwards on the original project.

So I opened a command prompt on my Win7 machine and tried

D:\>convert.py

D:\>

It just printed an empty line and nothing else. Why didn't it even try to
open the folder and the files within?
Syntactically it doesn't give me an error!
Something with the os.walk() method perhaps?

Peter Otten · Aug 9, 2010

If there is a folder D:\test and it does contain some PHP files
(double-check!) the extension could be upper-case. Try

if f.lower().endswith("php"): ...

or

php_files = fnmatch.filter(files, "*.php")
for f in php_files: ...

Peter
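
Both suggestions can be tried on a plain list of names without touching the disk. This is only a sketch: the helper names and the sample list are made up, and the pattern uses '.php' with the dot so that a name merely ending in the letters "php" is not picked up:

```python
import fnmatch

def php_names(names):
    """Return the names ending in '.php', ignoring case."""
    return [n for n in names if n.lower().endswith(".php")]

def php_names_fnmatch(names):
    """Same idea with fnmatch, applied to lower-cased names."""
    return [n for n in names if fnmatch.fnmatchcase(n.lower(), "*.php")]

# Hypothetical directory listing.
sample = ["index.php", "ABOUT.PHP", "notes.txt", "backupphp"]
print(php_names(sample))
print(php_names_fnmatch(sample))
```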

Νίκος · Aug 9, 2010

The extension is in lower case. The folder is there, the php files are
there; I don't know why it doesn't want to go into d:\test to find
them.

That's one problem.

The other one is:

I made the code simpler by specifying the filename myself.

=========================
# get abs path to filename
src_f = 'd:\\test\\index.php'

# open php src file
print ( 'reading from %s' % src_f )
f = open(src_f, 'r')
src_data = f.read() # read contents of PHP file
f.close()
=========================

But although it now finds the file, I get this error in the command prompt:

D:\>aconvert.py
reading from d:\test\index.php
Traceback (most recent call last):
  File "D:\aconvert.py", line 16, in <module>
    src_data = f.read() # read contents of PHP file
  File "C:\Python32\lib\encodings\cp1253.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9f in position 321: character maps to <undefined>

Something with the damn encodings again!!

Peter Otten · Aug 9, 2010

Hmm, at one point in this thread you switched from Python 2.x to Python 3.2.
There are a lot of subtle and not so subtle differences between 2.x and 3.x,
and I recommend that you stick to one while you are still in newbie mode.

If you want to continue to use 3.x I recommend that you at least use the
stable 3.1 version.

Now one change from Python 2 to 3 is that open(filename, "r") gives you a
beast that is unicode-aware and assumes that the file uses the platform's
default encoding (utf-8 on many systems, the locale's code page on Windows)
unless you tell it otherwise with open(..., encoding=whatever). So what is
the charset used for your index.php?

Peter
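
To make the encoding question concrete, here is a minimal Python 3 sketch. The helper name read_text, the temporary file and the sample Greek text are assumptions for illustration; it writes a small UTF-8 file, prints the encoding that open() typically falls back to, and reads the file with an explicit encoding:

```python
import locale
import os
import tempfile

def read_text(path, encoding):
    """Read a whole text file, decoding it with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()

# Hypothetical sample: Greek text saved as UTF-8.
sample = "Ολα καλά"
fd, path = tempfile.mkstemp(suffix=".php")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# The encoding open() typically falls back to when none is given.
print("default:", locale.getpreferredencoding(False))

utf8_text = read_text(path, "utf-8")      # decodes cleanly
try:
    read_text(path, "cp1253")             # wrong codec for these bytes
    cp1253_failed = False
except UnicodeDecodeError:
    cp1253_failed = True

os.remove(path)
print(utf8_text == sample, cp1253_failed)
```

On Python 2.x the built-in open has no encoding argument; io.open accepts one in the same way.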

Νίκος · Aug 9, 2010

Yes, yesterday I switched to Python 3.2, Peter.

When I open index.php in Notepad++ it says it's in utf-8 without
BOM, and besides English characters it also contains Greek characters
for printing.

The file was made by my client in Dreamweaver.

So since it's utf-8, what is the problem with opening it?

Peter Otten · Aug 9, 2010

Python says it's not, and I tend to believe it. You can open the file with

open(..., errors="replace")

but you will lose data (which is already garbled, anyway).

Again: in the unlikely case that Python is causing your problem -- you do
understand what an alpha version is?

Peter
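
What errors="replace" does can be seen on a couple of bytes without opening a file, since open() applies the same codec error handling. A small sketch with a made-up sample:

```python
# UTF-8 bytes of a Greek capital omicron (hypothetical sample): b'\xce\x9f'
raw = "Ο".encode("utf-8")

# Decoded as cp1253 with errors="replace", the byte 0x9f that the codec
# cannot map becomes U+FFFD, the replacement character.
replaced = raw.decode("cp1253", errors="replace")
print(ascii(replaced))
```

The original byte cannot be recovered from the replacement character, which is the data loss mentioned above.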

Νίκος · Aug 9, 2010

Python says it's not, and I tend to believe it.

You are right!

I tried the exact same open call in the IDLE environment and I got
the encoding of the file from there!
<_io.TextIOWrapper name='d:\\test\\index.php' encoding='cp1253'>

That's why the error in my previous post said
File "C:\Python32\lib\encodings\cp1253.py", line 23, in decode
It tried to use the cp1253 encoding.

But now, since Python as we see can understand the nature of the
encoding, what is causing it not to open the file?

Peter Otten · Aug 9, 2010

It doesn't. You have to tell it. *If* the file uses cp1253 you can open it with

open(..., encoding="cp1253")

Note that if the file is not in cp1253 python will still happily open it as
long as it doesn't contain the following bytes:

>>> for i in range(256):
...     try: chr(i).decode("cp1253") and None
...     except: print i
...
129
136
138
140
141
142
143
144
152
154
156
157
158
159
170
210
255

Peter
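
The interactive session above is Python 2. A rough Python 3 equivalent, written as a small function (the name undecodable_bytes is made up for this sketch), could look like this:

```python
def undecodable_bytes(encoding):
    """Return the byte values (0-255) the codec cannot decode on their own."""
    bad = []
    for i in range(256):
        try:
            bytes([i]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(i)
    return bad

print(undecodable_bytes("cp1253"))
```

Byte 159 is 0x9f, the value named in the traceback earlier in the thread.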

MRAB · Aug 9, 2010

Νίκος said:
Thank you.

So I guess this needs to be written as:

src_data = re.sub( '<?(.*?)?>', '', src_data )

In a regex '?' is a special character, so if you want a literal '?' you
need to escape it. Therefore:

src_data = re.sub(r'<\?(.*?)\?>', '', src_data)

Νίκος · Aug 9, 2010

It doesn't. You have to tell.

Why doesn't it? The IDLE response indicates that it knows the file
encoding is "cp1253", which means it can identify it.

*If* the file uses cp1253 you can open it with

open(..., encoding="cp1253")

Note that if the file is not in cp1253 python will still happily open it as
long as it doesn't contain the following bytes:

>>> for i in range(256):
...     try: chr(i).decode("cp1253") and None
...     except: print i
...
129
136
138
140
141
142
143
144
152
154
156
157
158
159
170
210
255

Peter

I'm afraid it does, because when I tried:

f = open(src_f, 'r', encoding="cp1253" )

I got the same error again. What are those characters? Don't they
belong to the same weird 'cp1253' encoding too? Why can't Python open
them?

Νίκος · Aug 9, 2010

In a regex '?' is a special character, so if you want a literal '?' you
need to escape it. Therefore:

src_data = re.sub(r'<\?(.*?)\?>', '', src_data)

I see, or perhaps even this:

src_data = re.sub(r'<?(.*?)?>', '', src_data)

Maybe it works here as well.

MRAB · Aug 9, 2010

Νίκος said:
I see, or perhaps even this:

src_data = re.sub(r'<?(.*?)?>', '', src_data)

Maybe it works here as well.

No. That regex means that it should match:

<?      # optional '<'
(.*?)?  # optional group of any number of any characters
>       # '>'
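
The effect of leaving the question marks unescaped can be checked on a short string. This is a minimal sketch with a made-up sample; the unescaped pattern removes everything up to each '>', while the escaped one removes only the PHP tag:

```python
import re

# Hypothetical sample: one PHP tag followed by an ordinary HTML tag.
text = "a<?php x ?>b<p>c"

# Unescaped: optional '<', optional lazy group, then '>'.
unescaped = re.sub(r"<?(.*?)?>", "", text)

# Escaped: a literal '<?', lazy content, a literal '?>'.
escaped = re.sub(r"<\?(.*?)\?>", "", text)

print(repr(unescaped))  # 'c'
print(repr(escaped))    # 'ab<p>c'
```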

Νίκος · Aug 9, 2010

Please tell me that no matter what weird chars the files have inside, I can
still open those files and make the necessary replacements.

Peter Otten · Aug 9, 2010

Νίκος said:
Please tell me that no matter what weird chars the files have inside, I can
still open those files and make the necessary replacements.

Go back to 2.6 for the moment and defer learning about unicode until you're
done with the conversion job.

Νίκος · Aug 9, 2010

Go back to 2.6 for the moment and defer learning about unicode until you're
done with the conversion job.

You are correct again! 3.2 caused the problem. I switched to 2.7 and
now I don't have that problem anymore. The file is opening okay!

It ALMOST converts correctly!

# replace tags
print ( 'replacing php tags and contents within' )
src_data = re.sub( '<\?(.*?)\?>', '', src_data )

It only converts the first instance of the php tags and not the rest.
But why?

Thomas Jollans · Aug 9, 2010

http://docs.python.org/library/re.html#re.S

You probably need to pass the re.DOTALL flag.

Νίκος · Aug 9, 2010

When replacing text in an HTML document with re.sub, you want to use
the re.S (singleline) option; otherwise your pattern won't match when
the opening tag is on one line and the closing is on another.

That's exactly the problem I am facing now with this statement.

src_data = re.sub( '<\?(.*?)\?>', '', src_data )

You mean I have to change it like this?

src_data = re.S ( '<\?(.*?)\?>', '', src_data ) ?

Νίκος · Aug 9, 2010

http://docs.python.org/library/re.html#re.S

You probably need to pass the re.DOTALL flag.

src_data = re.sub( '<\?(.*?)\?>', '', src_data, re.DOTALL )

like this?
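
One detail in the call above is worth checking: the fourth positional argument of re.sub is count, not flags, so passing re.DOTALL there limits the number of replacements instead of changing how '.' matches. A minimal sketch with a made-up sample, assuming Python 2.7 or later, where re.sub accepts a flags keyword:

```python
import re

# Hypothetical sample: the first tag spans several lines, the second does not.
text = "x<?php\n echo 1;\n?>y<?php echo 2; ?>z"
pattern = r"<\?(.*?)\?>"

# Without DOTALL, '.' does not match a newline, so the multi-line tag survives.
without_flag = re.sub(pattern, "", text)

# Passing the flag by keyword keeps it out of the count parameter.
with_flag = re.sub(pattern, "", text, flags=re.DOTALL)

# A compiled pattern carries its flags and avoids the argument-order trap.
compiled = re.compile(pattern, re.DOTALL)
also_with_flag = compiled.sub("", text)

print(repr(without_flag))
print(repr(with_flag))
```

On versions older than 2.7, where re.sub has no flags argument, the compiled-pattern form is the one to use.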

Νίκος · Aug 9, 2010

Can you help with this too, please?

Now I am able to convert just a single file, 'd:\test\index.php'.

But this needs to be done for ALL the php files in every subfolder.

for currdir, files, dirs in os.walk('test'):

    for f in files:

        if f.endswith('php'):

Should the above lines enter the folders and find the php files in each
folder so that they can be edited?
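
One likely explanation for the empty output: os.walk yields (dirpath, dirnames, filenames) in that order, so the loop as posted puts the subfolder names into the variable called files. In a folder without subfolders that list is empty and the inner loop never runs. The posted code also calls join without importing it from os.path. A minimal sketch of the traversal under those assumptions (the function name find_php_files is made up):

```python
import os

def find_php_files(top):
    """Return the paths of all .php files under top, including subfolders."""
    found = []
    # os.walk yields (dirpath, dirnames, filenames), in that order.
    for currdir, dirs, files in os.walk(top):
        for name in files:
            if name.lower().endswith(".php"):
                found.append(os.path.join(currdir, name))
    return found

# 'test' is resolved relative to the current working directory.
for path in find_php_files("test"):
    print(path)
```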
