Replace and inserting strings within .txt files with the use of regex

Íßêïò · Aug 7, 2010

Hello dear Pythoneers,

I have over 500 .php web pages in various subfolders under 'data'
folder that i have to rename to .html and and ditch the '<?' and '?>'
tages from within and also insert a very first line of 
where id must be an identification unique number of every page for
counter tracking purposes. ONly pure html code must be left.

I before find otu Python used php and now iam switching to templates +
python solution so i ahve to change each and every page.

I don't know how to handle such a big data replacing problem and
cannot play with fire because those 500 pages are my cleints pages and
data of those filesjust cannot be messes up.

Can you provide to me a script please that is able of performing an
automatic way of such a page content replacing?

Thanks a million!

rantingrick · Aug 7, 2010

Hello dear Pythoneers,

I prefer Pythonista, but anywho..

I have over 500 .php web pages in various subfolders under 'data'
folder that i have to rename to .html

import os
os.rename(old, new)

and and ditch the '<?' and '?>' tages from within

path = 'some/valid/path'
f = open(path, 'r')
data = f.read()
f.close()
data.replace('<?', '')
data.replace('?>', '')

and also insert a very first line of 
where id must be an identification unique number of every page for
counter tracking purposes.

ONly pure html code must be left.

Well then don't F up! However judging from the amount of typos in this
post i would suggest you do some major testing!

I don't know how to handle such a big data replacing problem and
cannot play with fire because those 500 pages are my cleints pages and
data of those files just cannot be messes up.

Better do some serous testing first, or (if you have enough disc
space ) create copies instead!

Can you provide to me a script please that is able of performing an
automatic way of such a page content replacing?

This is very basic stuff and the fine manual is free you know. But how
much are you willing to pay?

MRAB · Aug 7, 2010

rantingrick said:
I prefer Pythonista, but anywho..

import os
os.rename(old, new)

path = 'some/valid/path'
f = open(path, 'r')
data = f.read()
f.close()
data.replace('<?', '')
data.replace('?>', '')

That should be:

data = data.replace('<?', '')
data = data.replace('?>', '')

comment = ""%(idnum)
data.insert(idx, comment)

Strings don't have an 'insert' method!

Íßêïò · Aug 7, 2010

# rename ALL php files to html in every subfolder of the folder 'data'
os.rename('*.php', '*.html') # how to tell python to
rename ALL php files to html to ALL subfolder under 'data' ?

# current path of the file to be processed
path = './data' # this must be somehow in a loop i feel
that read every file of every subfolder

# open an html file for reading
f = open(path, 'rw')
# read the contents of the whole file
data = f.read()

# replace all php tags with empty string
data = data.replace('<?', '')
data = data.replace('?>', '')

# write replaced data to file
data = f.write()

# insert an increasing unique integer number at the very first line
of every html file processing
comment = ""%(idnum) # how will the number
change here an increased by one file after file?
f = f.close()

Please help i'm new to python an apart from syntx its a logic problem
as well and needs experience.

John S · Aug 7, 2010

Hello dear Pythoneers,

I have over 500 .php web pages in various subfolders under 'data'
folder that i have to rename to .html and and ditch the '<?' and '?>'
tages from within and also insert a very first line of 
where id must be an identification unique number of every page for
counter tracking purposes. ONly pure html code must be left.

I before find otu Python used php and now iam switching to templates +
python solution so i ahve to change each and every page.

I don't know how to handle such a big data replacing problem and
cannot play with fire because those 500 pages are my cleints pages and
data of those filesjust cannot be messes up.

Can you provide to me a script please that is able of performing an
automatic way of such a page content replacing?

Thanks a million!

If the 500 web pages are PHP only in the sense that there is only one
pair of <? ?> tags in each file, surrounding the entire content, then
what you ask for is doable.

from os.path import join
import os

id = 1 # id number
for currdir,files,dirs in os.walk('data'):
for f in files:
if f.endswith('php'):
source_file_name = join(currdir,f) # get abs path to
filename
source_file = open(source_file_name)
source_contents = source_file.read() # read contents of
PHP file
source_file.close()

# replace tags
source_contents = source_contents.replace('<%','')
source_contents = source_contents.replace('%>','')

# add ID
source_contents = ( '' % id ) + source_contents
id += 1

# create new file with .html extension
source_file_name =
source_file_name.replace('.php','.html')
dest_file = open(source_file_name,'w')
dest_file.write(source_contents) # write contents
dest_file.close()

Note: error checking left out for clarity.

On the other hand, if your 500 web pages contain embedded PHP
variables or logic, you have a big job ahead. Django templates and PHP
are two different languages for embedding data and logic in web pages.
Converting a project from PHP to Django involves more than renaming
the template files and deleting "<?" and friends.

For example, here is a snippet of PHP which checks which browser is
viewing the page:

<?php
if (strpos($_SERVER['HTTP_USER_AGENT'], 'MSIE') !== FALSE) {
echo 'You are using Internet Explorer.<br />';
}
?>

In Django, you would typically put this logic in a Django *view*
(which btw is not what is called a 'view' in MVC term), which is the
code that prepares data for the template. The logic would not live
with the HTML. The template uses "template variables" that the view
has associated with a Python variable or function. You might create a
template variable (created via a Context object) named 'browser' that
contains a value that identifies the browser.

Thus, your Python template (HTML file) might look like this:

{% if browser == 'IE' %}You are using Internet Explorer{% endif %}

PHP tends to combine the presentation with the business logic, or in
MVC terms, combines the view with the controller. Django separates
them out, which many people find to be a better way. The person who
writes the HTML doesn't have to speak Python, but only know the names
of template variables and a little bit of template logic. In PHP, the
HTML code and all the business logic lives in the same files. Even
here, it would probably make sense to calculate the browser ID in the
header of the HTML file, then access it via a variable in the body.

If you have 500 static web pages that are part of the same
application, but that do not contain any logic, your application might
need to be redesigned.

Also, you are doing your changes on a COPY of the application on a non-
public server, aren't you? If not, then you really are playing with
fire.

HTH,
John

rantingrick · Aug 7, 2010

That should be:

data = data.replace('<?', '')
data = data.replace('?>', '')

Yes, Thanks MRAB. I did forget that important detail.

Strings don't have an 'insert' method!

*facepalm*! I really must stop Usenet-ing whilst consuming large
volumes of alcoholic beverages.

John S · Aug 7, 2010

Even though I just replied above, in reading over the OP's message, I
think the OP might be asking:

"How can I use RE string replacement to find PHP tags and convert them
to Django template tags?"

Instead of saying

source_contents = source_contents.replace(...)

say this instead:

import re

def replace_php_tags(m):
''' PHP tag replacer
This function is called for each PHP tag. It gets a Match object as
its parameter, so you can get the contents of the old tag, and
should
return the new (Django) tag.
'''

# m is the match object from the current match
php_guts = m.group(1) # the contents of the PHP tag

# now put the replacement logic here

# and return whatever should go in place of the PHP tag,
# which could be '{{ python_template_var }}'
# or '{% template logic ... %}
# or some combination

source_contents = re.sub('<?\s*(.*?)\s*?

Íßêïò · Aug 7, 2010

If the 500 web pages are PHP only in the sense that there is only one
pair of <? ?> tags in each file, surrounding the entire content, then
what you ask for is doable.

First of all, thank you very much John for your BIG effort to help
me(i'm still readign your posts)!

I have to tell you here that those php files contain several instances
of php opening and closing tags(like 3 each php file). The rest is
pure html data. That happened because those files were in the
beginning html only files that later needed conversion to php due to
some dynamic code that had to be used to address some issues.

Please tell me that the code you provided can be adjusted to several
instances as well!

Íßêïò · Aug 7, 2010

"How can I use RE string replacement to find PHP tags and convert them
to Django template tags?"

No, not at all John, at least not yet!

I have only 1 week that i'm learnign python(changing from php & perl)
so i'm very fresh at this beautifull and straighforwrd language.

When i have a good understnading of Python then i will proceed to
Django templates.
Until then my Python templates would be only 'simple html files' that
the only thign they contain apart form the html data would be the
special string formatting identifies '%s'

Steven D'Aprano · Aug 8, 2010

I don't know how to handle such a big data replacing problem and cannot
play with fire because those 500 pages are my cleints pages and data of
those filesjust cannot be messes up.

Take a backup copy of the files, and only edit the copies. Don't replace
the originals until you know they're correct.

ÎÎ¯ÎºÎ¿Ï‚ · Aug 8, 2010

Take a backup copy of the files, and only edit the copies. Don't replace
the originals until you know they're correct.

Yes of course, but the code that John S provided need soem
modification in order to be able to change various instances of php
tags and not only one set.

ÎÎ¯ÎºÎ¿Ï‚ · Aug 8, 2010

Script so far:

#!/usr/bin/python

import cgitb; cgitb.enable()
import cgi, re, os

print ( "Content-type: text/html; charset=UTF-8 \n" )

id = 0 # unique page_id

for currdir, files, dirs in os.walk('data'):

for f in files:

if f.endswith('php'):

# get abs path to filename
src_f = join(currdir,f)

# open php src file
f = open(src_f, 'r')
src_data = f.read() # read contents of PHP file
f.close()
print 'reading from %s' % src_f

# replace tags
src_data = src_data.replace('<%', '')
src_data = src_data.replace('%>', '')
print 'replacing php tags'

# add ID
src_data = ( '' % id ) + src_data
id += 1
print 'adding unique page_id'

# create new file with .html extension
src_file = src_file.replace('.php', '.html')

# open newly created html file for insertid data
dest_f = open(src_f, 'w')
dest_f.write(src_data) # write contents
dest_f.close()
print 'writing to %s' % dest_f

Please help me adjust it, if need extra modification for more php tags
replacing.

Thomas Jollans · Aug 8, 2010

*facepalm*! I really must stop Usenet-ing whilst consuming large
volumes of alcoholic beverages.

THAT explains a lot.

Cheers

Thomas Jollans · Aug 8, 2010

Please help me adjust it, if need extra modification for more php tags
replacing.

Have you tried it ? I haven't, but I see no immediate reason why it
wouldn't work with multiple PHP blocks.

#!/usr/bin/python

import cgitb; cgitb.enable()
import cgi, re, os

print ( "Content-type: text/html; charset=UTF-8 \n" )

id = 0 # unique page_id

for currdir, files, dirs in os.walk('data'):

for f in files:

if f.endswith('php'):

# get abs path to filename
src_f = join(currdir,f)

# open php src file
f = open(src_f, 'r')
src_data = f.read() # read contents of PHP file
f.close()
print 'reading from %s' % src_f

# replace tags
src_data = src_data.replace('<%', '')
src_data = src_data.replace('%>', '')

Did you read the script before posting? ;-)
Here, you remove ASP-style tags. Which is fine, PHP supports them if you
configure it that way, but you probably didn't. Change this to the start
and end tags you actually use, and, if you use multiple forms (such as
<?php vs <?), then add another line or two.

ÎÎ¯ÎºÎ¿Ï‚ · Aug 8, 2010

Have you tried it ? I haven't, but I see no immediate reason why it
wouldn't work with multiple PHP blocks.

Did you read the script before posting? ;-)
Here, you remove ASP-style tags. Which is fine, PHP supports them if you
configure it that way, but you probably didn't. Change this to the start
and end tags you actually use, and, if you use multiple forms (such as
<?php vs <?), then add another line or two.

Yes i have read the code very well and by mistake i wrote '<%>'
instead of '<?'

I was so dizzy and confused yesterday that i forgot to metnion that
not only i need removal of php openign and closing tags but whaevers
data lurks inside those tags as well ebcause now with the 'counter.py'
script i wrote the html fiels would open ftm there and substitute the
tempalte variabels like %(counter)d

Also before the

</body>
</html>

of every html file afetr removing the tags this line must be
inserted(this holds the template variable) that 'counter.py' uses to
produce data

<br><br><center><h4><font color=green> Î‘ÏÎ¹Î¸Î¼ÏŒÏ‚ Î•Ï€Î¹ÏƒÎºÎµÏ€Ï„ÏŽÎ½: %(counter)d
</h4>

After making this modifications then i can trst the script to a COPY
of the original data in my pc.

*In my pc i run Windows 7 while remote web hosting setup uses Linux
Servers.
*That wont be a problem right?

Thomas Jollans · Aug 8, 2010

I was so dizzy and confused yesterday that i forgot to metnion that
not only i need removal of php openign and closing tags but whaevers
data lurks inside those tags as well ebcause now with the 'counter.py'
script i wrote the html fiels would open ftm there and substitute the
tempalte variabels like %(counter)d

I could just hand you a solution, but I'll be a bit of a bastard and
just give you some hints.

You could use regular expressions. If you know regular expressions, it's
relatively trivial - but I doubt you know regexp.

You could also repeatedly find the next occurrence of first a start tag,
then an end tag, using either str.find or str.split, and build up a
version of the file without PHP yourself.

Also before the

</body>
</html>

of every html file afetr removing the tags this line must be
inserted(this holds the template variable) that 'counter.py' uses to
produce data

<br><br><center><h4><font color=green> Î‘ÏÎ¹Î¸Î¼ÏŒÏ‚ Î•Ï€Î¹ÏƒÎºÎµÏ€Ï„ÏŽÎ½: %(counter)d
</h4>

This problem is truly trivial. I know you can do it yourself, or at
least give it a good shot, and ask again when you hit a serious roadblock.

If I may comment on your HTML: you forgot to close your <center> and
<font> tags. Close them! Also, both (CENTER and FONT) have been
deprecated since HTML 4.0 -- you should consider using CSS for these
tasks instead. Also, this line does not look like a heading, so H4 is
hardly fitting.

After making this modifications then i can trst the script to a COPY
of the original data in my pc.

It would be nice if you re-read your posts before sending and tried to
iron out some of more careless spelling mistakes. Maybe you are doing
your best to post in good English -- it isn't bad and I realize this is
neither your native language nor alphabet, in which case I apologize.
The fact of the matter is: I originally interpreter "trst" as "trust",
which made no sense whatsoever.

*In my pc i run Windows 7 while remote web hosting setup uses Linux
Servers.
*That wont be a problem right?

Nah.

ÎÎ¯ÎºÎ¿Ï‚ · Aug 8, 2010

I could just hand you a solution, but I'll be a bit of a bastard and
just give you some hints.

You could use regular expressions. If you know regular expressions, it's
relatively trivial - but I doubt you know regexp.

Here is the code with some try-and-fail modification i made, still non-
working based on your hints:
==========================================================

id = 0 # unique page_id

for currdir, files, dirs in os.walk('varsa'):

for f in files:

if f.endswith('php'):

# get abs path to filename
src_f = join(currdir, f)

# open php src file
print 'reading from %s' % src_f
f = open(src_f, 'r')
src_data = f.read() # read contents of PHP file
f.close()

# replace tags
print 'replacing php tags and contents within'
src_data = src_data.replace(r'<?.?>', '') #
the dot matches any character i hope! no matter how many of them?!?

# add ID
print 'adding unique page_id'
src_data = ( '' % id ) + src_data
id += 1

# add template variables
print 'adding counter template variable'
src_data = src_data + ''' <h4><font color=green> Î‘ÏÎ¹Î¸Î¼ÏŒÏ‚
Î•Ï€Î¹ÏƒÎºÎµÏ€Ï„ÏŽÎ½: %(counter)d </font></h4> '''
# i can think of this but the above line must be above </
body></html> NOT after but how to right that?!?

# rename old php file to new with .html extension
src_file = src_file.replace('.php', '.html')

# open newly created html file for inserting data
print 'writing to %s' % dest_f
dest_f = open(src_f, 'w')
dest_f.write(src_data) # write contents
dest_f.close()

This is the best i can do. Sorry for any typos i might made.

Please shed some LIGHT!

Thomas Jollans · Aug 8, 2010

Here is the code with some try-and-fail modification i made, still non-
working based on your hints:
==========================================================

id = 0 # unique page_id

for currdir, files, dirs in os.walk('varsa'):

for f in files:

if f.endswith('php'):

# get abs path to filename
src_f = join(currdir, f)

# open php src file
print 'reading from %s' % src_f
f = open(src_f, 'r')
src_data = f.read() # read contents of PHP file
f.close()

# replace tags
print 'replacing php tags and contents within'
src_data = src_data.replace(r'<?.?>', '') #
the dot matches any character i hope! no matter how many of them?!?

Two problems here:

str.replace doesn't use regular expressions. You'll have to use the re
module to use regexps. (the re.sub function to be precise)

'.' matches a single character. Any character, but only one.
'.*' matches as many characters as possible. This is not what you want,
since it will match everything between the *first* <? and the *last* ?>.
You want non-greedy matching.

'.*?' is the same thing, without the greed.

# add ID
print 'adding unique page_id'
src_data = ( '' % id ) + src_data
id += 1

# add template variables
print 'adding counter template variable'
src_data = src_data + ''' <h4><font color=green> Î‘ÏÎ¹Î¸Î¼ÏŒÏ‚
Î•Ï€Î¹ÏƒÎºÎµÏ€Ï„ÏŽÎ½: %(counter)d </font></h4> '''
# i can think of this but the above line must be above </
body></html> NOT after but how to right that?!?

You will have to find the </body> tag before inserting the string.
str.find should help -- or you could use str.replace and replace the

# rename old php file to new with .html extension
src_file = src_file.replace('.php', '.html')

# open newly created html file for inserting data
print 'writing to %s' % dest_f
dest_f = open(src_f, 'w')
dest_f.write(src_data) # write contents
dest_f.close()

This is the best i can do.

No it's not. You're just giving up too soon.

John S · Aug 8, 2010

Two problems here:

str.replace doesn't use regular expressions. You'll have to use the re
module to use regexps. (the re.sub function to be precise)

'.' Â matches a single character. Any character, but only one.
'.*' matches as many characters as possible. This is not what you want,
since it will match everything between the *first* <? and the *last* ?>.
You want non-greedy matching.

'.*?' is the same thing, without the greed.

You will have to find the </body> tag before inserting the string.
str.find should help -- or you could use str.replace and replace the
</body> tag with you counter line, plus a new </body>.

No it's not. You're just giving up too soon.

When replacing text in an HTML document with re.sub, you want to use
the re.S (singleline) option; otherwise your pattern won't match when
the opening tag is on one line and the closing is on another.

Joel Goldstick · Aug 8, 2010

ï¿½ said:
Hello dear Pythoneers,

I have over 500 .php web pages in various subfolders under 'data'
folder that i have to rename to .html and and ditch the '<?' and '?>'
tages from within and also insert a very first line of 
where id must be an identification unique number of every page for
counter tracking purposes. ONly pure html code must be left.

I before find otu Python used php and now iam switching to templates +
python solution so i ahve to change each and every page.

I don't know how to handle such a big data replacing problem and
cannot play with fire because those 500 pages are my cleints pages and
data of those filesjust cannot be messes up.

Can you provide to me a script please that is able of performing an
automatic way of such a page content replacing?

Thanks a million!

This is quite a vague description of the file contents. But, for a
completely different approach, how about using a browser and doing view
source, then saving the html that was generated. This will contain no
php code, but it will contain the results of whatever the php was doing.

If you don't have time to do this manually, look into wget or curl,
which will do the job in a program environment.

The discussion so far has dealt with stripping php, and leaving the
html. But the html must have embeded <?php some code to print something
?> in it. Or, there could be long fragments of html which are
constructed by php and then echo'ed.

Joel Goldstick

Problem with a login script, SESSION user rights and put this together so it works with the other pages and MySQL. Code examples.	2	May 5, 2023
Working on mobile css menu with plenty of frustration!	2	Dec 29, 2022
The devolution of English language and slothful c.l.p behaviors exposed!	50	Jan 24, 2012
Can I get a little help with my program? (string searching and regex)	0	Jan 8, 2009
Using a RegEx as a "variable" WITHIN an array?	4	May 15, 2005
Use of CSS and Master Pages	4	May 2, 2007
Archos 70 Android tablet with HTML pages for control and data display	3	Sep 14, 2011
CFP with extended deadline of Mar. 31, 2011: The 2011 InternationalConference on Modeling, Simulati	0	Mar 20, 2011

Replace and inserting strings within .txt files with the use of regex

Íßêïò

rantingrick

MRAB

Íßêïò

John S

rantingrick

John S

Íßêïò

Íßêïò

Steven D'Aprano

ÎÎ¯ÎºÎ¿Ï‚

ÎÎ¯ÎºÎ¿Ï‚

Thomas Jollans

Thomas Jollans

ÎÎ¯ÎºÎ¿Ï‚

Thomas Jollans

ÎÎ¯ÎºÎ¿Ï‚

Thomas Jollans

John S

Joel Goldstick

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads