Replacing and inserting strings within .txt files with the use of regex

Ν

Νίκος

Hello dear Pythoneers,

I have over 500 .php web pages in various subfolders under the 'data'
folder that I have to rename to .html, ditch the '<?' and '?>'
tags from within, and also insert a very first line of <!-- id -->
where id must be a unique identification number for every page, for
counter tracking purposes. Only pure html code must be left.

Before I found out about Python I used php, and now I am switching to a
templates + python solution, so I have to change each and every page.

I don't know how to handle such a big data replacing problem and
cannot play with fire, because those 500 pages are my clients' pages and
the data in those files just cannot be messed up.

Can you please provide me with a script that is able to perform
such page content replacing automatically?

Thanks a million!
 
R

rantingrick

Hello dear Pythoneers,

I prefer Pythonista, but anywho..

I have over 500 .php web pages in various subfolders under the 'data'
folder that I have to rename to .html

import os
os.rename(old, new)

and ditch the '<?' and '?>' tags from within

path = 'some/valid/path'
f = open(path, 'r')
data = f.read()
f.close()
data.replace('<?', '')
data.replace('?>', '')

and also insert a very first line of <!-- id -->
where id must be a unique identification number for every page, for
counter tracking purposes.

Only pure html code must be left.

Well then don't F up! However, judging from the number of typos in this
post I would suggest you do some major testing!

I don't know how to handle such a big data replacing problem and
cannot play with fire, because those 500 pages are my clients' pages and
the data in those files just cannot be messed up.

Better do some serious testing first, or (if you have enough disk
space) create copies instead!

Can you please provide me with a script that is able to perform
such page content replacing automatically?

This is very basic stuff and the fine manual is free, you know. But how
much are you willing to pay?
 
M

MRAB

rantingrick said:
I prefer Pythonista, but anywho..


import os
os.rename(old, new)


path = 'some/valid/path'
f = open(path, 'r')
data = f.read()
f.close()
data.replace('<?', '')
data.replace('?>', '')

That should be:

data = data.replace('<?', '')
data = data.replace('?>', '')

comment = "<!-- %s -->"%(idnum)
data.insert(idx, comment)

Strings don't have an 'insert' method!
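
Because Python strings are immutable, "inserting" into one means building a new string from slices. A minimal sketch follows; the helper name insert_at and the sample values are made up for illustration and are not from the thread.

```python
def insert_at(text, index, piece):
    """Return a new string with piece placed before position index."""
    return text[:index] + piece + text[index:]

idnum = 7  # hypothetical page id
comment = "<!-- %s -->" % idnum

data = "<html><body>hello</body></html>"
# Put the comment (plus a newline) at the very start of the page
data = insert_at(data, 0, comment + "\n")
print(data)
```

For the special case of index 0 this is the same as plain concatenation, comment + "\n" + data.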
 
Ν

Νίκος

# rename ALL php files to html in every subfolder of the folder 'data'
os.rename('*.php', '*.html') # how to tell python to rename ALL php files to html in ALL subfolders under 'data'?

# current path of the file to be processed
path = './data' # this must be somehow in a loop, I feel, that reads every file of every subfolder

# open an html file for reading
f = open(path, 'rw')
# read the contents of the whole file
data = f.read()

# replace all php tags with empty string
data = data.replace('<?', '')
data = data.replace('?>', '')

# write replaced data to file
data = f.write()

# insert an increasing unique integer number at the very first line of every html file processed
comment = "<!-- %s -->"%(idnum) # how will the number change here and increase by one, file after file?
f = f.close()

Please help, I'm new to python and, apart from syntax, it's a logic problem
as well and needs experience.
 
J

John S

If the 500 web pages are PHP only in the sense that there is only one
pair of <? ?> tags in each file, surrounding the entire content, then
what you ask for is doable.

from os.path import join
import os

id = 1 # id number
# os.walk yields (current directory, subdirectory names, file names)
for currdir, dirs, files in os.walk('data'):
    for f in files:
        if f.endswith('.php'):
            source_file_name = join(currdir, f) # path to the file, relative to where the script runs
            source_file = open(source_file_name)
            source_contents = source_file.read() # read contents of PHP file
            source_file.close()

            # replace tags
            source_contents = source_contents.replace('<?', '')
            source_contents = source_contents.replace('?>', '')

            # add ID
            source_contents = ( '<!-- %d -->' % id ) + source_contents
            id += 1

            # create new file with .html extension
            source_file_name = source_file_name.replace('.php', '.html')
            dest_file = open(source_file_name, 'w')
            dest_file.write(source_contents) # write contents
            dest_file.close()

Note: error checking left out for clarity.

On the other hand, if your 500 web pages contain embedded PHP
variables or logic, you have a big job ahead. Django templates and PHP
are two different languages for embedding data and logic in web pages.
Converting a project from PHP to Django involves more than renaming
the template files and deleting "<?" and friends.

For example, here is a snippet of PHP which checks which browser is
viewing the page:

<?php
if (strpos($_SERVER['HTTP_USER_AGENT'], 'MSIE') !== FALSE) {
echo 'You are using Internet Explorer.<br />';
}
?>

In Django, you would typically put this logic in a Django *view*
(which, by the way, is not what is called a 'view' in MVC terms), which is
the code that prepares data for the template. The logic would not live
with the HTML. The template uses "template variables" that the view
has associated with a Python variable or function. You might create a
template variable (created via a Context object) named 'browser' that
contains a value that identifies the browser.

Thus, your Python template (HTML file) might look like this:

{% if browser == 'IE' %}You are using Internet Explorer{% endif %}
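
To make the split concrete, here is a rough sketch of the Python side that would compute the 'browser' value for the template line above. It is deliberately plain Python with no Django import; the function name browser_context and the 'other' label are assumptions for illustration, and in a real Django view the resulting dictionary would be handed to the template as its context.

```python
def browser_context(user_agent):
    """Build the dictionary of template variables for one request."""
    # Same test as the PHP snippet: look for 'MSIE' in the User-Agent header
    if 'MSIE' in user_agent:
        browser = 'IE'
    else:
        browser = 'other'  # hypothetical catch-all label
    return {'browser': browser}

context = browser_context('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)')
print(context)
```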

PHP tends to combine the presentation with the business logic, or in
MVC terms, combines the view with the controller. Django separates
them out, which many people find to be a better way. The person who
writes the HTML doesn't have to speak Python, but only know the names
of template variables and a little bit of template logic. In PHP, the
HTML code and all the business logic lives in the same files. Even
here, it would probably make sense to calculate the browser ID in the
header of the HTML file, then access it via a variable in the body.

If you have 500 static web pages that are part of the same
application, but that do not contain any logic, your application might
need to be redesigned.

Also, you are doing your changes on a COPY of the application on a non-
public server, aren't you? If not, then you really are playing with
fire.


HTH,
John
 
R

rantingrick

That should be:

   data = data.replace('<?', '')
   data = data.replace('?>', '')

Yes, thanks MRAB. I did forget that important detail.

Strings don't have an 'insert' method!

*facepalm*! I really must stop Usenet-ing whilst consuming large
volumes of alcoholic beverages.
 
J

John S

Even though I just replied above, in reading over the OP's message, I
think the OP might be asking:

"How can I use RE string replacement to find PHP tags and convert them
to Django template tags?"

Instead of saying

source_contents = source_contents.replace(...)

say this instead:

import re


def replace_php_tags(m):
    ''' PHP tag replacer
    This function is called for each PHP tag. It gets a Match object as
    its parameter, so you can get the contents of the old tag, and should
    return the new (Django) tag.
    '''

    # m is the match object from the current match
    php_guts = m.group(1) # the contents of the PHP tag

    # now put the replacement logic here

    # and return whatever should go in place of the PHP tag,
    # which could be '{{ python_template_var }}'
    # or '{% template logic ... %}'
    # or some combination

source_contents = re.sub('<?\s*(.*?)\s*?
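
The call above is cut off in the archived post and is left as it is. Separately, as a minimal sketch of the same idea (a function passed to re.sub as the replacement), the following is self-contained. The pattern and the "echo $name;" rule are assumptions made up for this illustration; they are not John S's original code.

```python
import re

# Assumed pattern: '<?' or '<?php', the tag body (captured), then '?>'.
# The question marks are escaped because '?' is a regex quantifier.
PHP_TAG = re.compile(r'<\?(?:php)?\s*(.*?)\s*\?>', re.S)

def replace_php_tags(m):
    php_guts = m.group(1)  # the contents of the PHP tag
    # Hypothetical rule: 'echo $name;' becomes '{{ name }}'
    echo = re.match(r'echo\s+\$(\w+)\s*;?$', php_guts)
    if echo:
        return '{{ %s }}' % echo.group(1)
    # Anything else is dropped; the function must always return a string
    return ''

source_contents = '<p>Hi <?php echo $user; ?>!</p><? include("x.php"); ?>'
source_contents = PHP_TAG.sub(replace_php_tags, source_contents)
print(source_contents)
```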
 
Ν

Νίκος

If the 500 web pages are PHP only in the sense that there is only one
pair of <? ?> tags in each file, surrounding the entire content, then
what you ask for is doable.

First of all, thank you very much John for your BIG effort to help
me (I'm still reading your posts)!

I have to tell you here that those php files contain several instances
of php opening and closing tags (about 3 in each php file). The rest is
pure html data. That happened because those files were in the
beginning html-only files that later needed conversion to php due to
some dynamic code that had to be used to address some issues.

Please tell me that the code you provided can be adjusted to several
instances as well!
 
Ν

Νίκος

"How can I use RE string replacement to find PHP tags and convert them
to Django template tags?"

No, not at all John, at least not yet!

I have been learning python for only 1 week (changing from php & perl),
so I'm very fresh at this beautiful and straightforward language.

When I have a good understanding of Python I will proceed to
Django templates.
Until then my Python templates will be only 'simple html files', and
the only thing they contain apart from the html data will be the
special string formatting identifiers like '%s' :)
 
S

Steven D'Aprano

I don't know how to handle such a big data replacing problem and cannot
play with fire, because those 500 pages are my clients' pages and the data
in those files just cannot be messed up.

Take a backup copy of the files, and only edit the copies. Don't replace
the originals until you know they're correct.
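
One way to follow this advice from Python itself is shutil.copytree, which copies a whole directory tree. The sketch below runs on a throwaway temporary directory; the helper name make_working_copy and the directory names are made up for the example.

```python
import os
import shutil
import tempfile

def make_working_copy(src_dir, dest_dir):
    """Copy the whole tree so that edits never touch the originals."""
    # copytree creates dest_dir itself and raises an error if it already exists
    shutil.copytree(src_dir, dest_dir)
    return dest_dir

# Demonstration on a throwaway directory (all names are made up)
root = tempfile.mkdtemp()
original = os.path.join(root, 'data')
os.makedirs(os.path.join(original, 'sub'))
with open(os.path.join(original, 'sub', 'page.php'), 'w') as fh:
    fh.write('<? echo 1; ?><p>hello</p>')

copy = make_working_copy(original, os.path.join(root, 'data_copy'))
copied_page = os.path.join(copy, 'sub', 'page.php')
print(copied_page)
```

The conversion script would then be pointed at the copy rather than at 'data'.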
 
Ν

Νίκος

Take a backup copy of the files, and only edit the copies. Don't replace
the originals until you know they're correct.

Yes of course, but the code that John S provided needs some
modification in order to be able to change various instances of php
tags and not only one set.
 
Ν

Νίκος

Script so far:

#!/usr/bin/python

import cgitb; cgitb.enable()
import cgi, re, os

print ( "Content-type: text/html; charset=UTF-8 \n" )


id = 0 # unique page_id

for currdir, files, dirs in os.walk('data'):

    for f in files:

        if f.endswith('php'):

            # get abs path to filename
            src_f = join(currdir,f)

            # open php src file
            f = open(src_f, 'r')
            src_data = f.read() # read contents of PHP file
            f.close()
            print 'reading from %s' % src_f

            # replace tags
            src_data = src_data.replace('<%', '')
            src_data = src_data.replace('%>', '')
            print 'replacing php tags'

            # add ID
            src_data = ( '<!-- %d -->' % id ) + src_data
            id += 1
            print 'adding unique page_id'

            # create new file with .html extension
            src_file = src_file.replace('.php', '.html')

            # open newly created html file for inserting data
            dest_f = open(src_f, 'w')
            dest_f.write(src_data) # write contents
            dest_f.close()
            print 'writing to %s' % dest_f

Please help me adjust it, if it needs extra modification to replace more
php tags.
 
T

Thomas Jollans

Please help me adjust it, if it needs extra modification to replace more
php tags.

Have you tried it? I haven't, but I see no immediate reason why it
wouldn't work with multiple PHP blocks.

#!/usr/bin/python

import cgitb; cgitb.enable()
import cgi, re, os

print ( "Content-type: text/html; charset=UTF-8 \n" )


id = 0 # unique page_id

for currdir, files, dirs in os.walk('data'):

    for f in files:

        if f.endswith('php'):

            # get abs path to filename
            src_f = join(currdir,f)

            # open php src file
            f = open(src_f, 'r')
            src_data = f.read() # read contents of PHP file
            f.close()
            print 'reading from %s' % src_f

            # replace tags
            src_data = src_data.replace('<%', '')
            src_data = src_data.replace('%>', '')

Did you read the script before posting? ;-)
Here, you remove ASP-style tags. Which is fine, PHP supports them if you
configure it that way, but you probably didn't. Change this to the start
and end tags you actually use, and, if you use multiple forms (such as
<?php vs <?), then add another line or two.
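
A small sketch of what "another line or two" can look like when both <?php and <? appear. The order matters: if '<?' were removed first, every '<?php' would leave a stray 'php' behind. The names PHP_MARKERS and strip_php_markers are made up for this example, and note that it removes only the markers, not the code between them.

```python
# Longer opening form first, so '<?php' is not half-eaten by '<?'
PHP_MARKERS = ('<?php', '<?', '?>')

def strip_php_markers(text):
    """Remove the PHP open/close markers but keep everything else."""
    for marker in PHP_MARKERS:
        text = text.replace(marker, '')
    return text

sample = '<?php echo 1; ?><p>x</p><? echo 2; ?>'
print(strip_php_markers(sample))
```

A page that starts with an '<?xml ... ?>' declaration would be affected too, so such files need separate handling.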
 
Ν

Νίκος

Have you tried it ? I haven't, but I see no immediate reason why it
wouldn't work with multiple PHP blocks.

Did you read the script before posting? ;-)
Here, you remove ASP-style tags. Which is fine, PHP supports them if you
configure it that way, but you probably didn't. Change this to the start
and end tags you actually use, and, if you use multiple forms (such as
<?php vs <?), then add another line or two.

Yes, I have read the code very well, and by mistake I wrote '<%>'
instead of '<?'

I was so dizzy and confused yesterday that I forgot to mention that
I need not only the removal of the php opening and closing tags but of
whatever data lurks inside those tags as well, because now, with the
'counter.py' script I wrote, the html files would be opened from there
and the template variables like %(counter)d substituted.

Also, before the

</body>
</html>

of every html file, after removing the tags, this line must be
inserted (it holds the template variable that 'counter.py' uses to
produce data):

<br><br><center><h4><font color=green> Αριθμός Επισκεπτών: %(counter)d
</h4>

After making these modifications I can trst the script on a COPY
of the original data on my pc.

*On my pc I run Windows 7, while the remote web hosting setup uses Linux
servers.
*That won't be a problem, right?
 
T

Thomas Jollans

I was so dizzy and confused yesterday that I forgot to mention that
I need not only the removal of the php opening and closing tags but of
whatever data lurks inside those tags as well, because now, with the
'counter.py' script I wrote, the html files would be opened from there
and the template variables like %(counter)d substituted.

I could just hand you a solution, but I'll be a bit of a bastard and
just give you some hints.

You could use regular expressions. If you know regular expressions, it's
relatively trivial - but I doubt you know regexp.

You could also repeatedly find the next occurrence of first a start tag,
then an end tag, using either str.find or str.split, and build up a
version of the file without PHP yourself.
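
A minimal sketch of the str.find approach hinted at here: walk through the text, copy everything outside the tags, and skip everything from each start tag to its end tag. The function name strip_php_blocks and its defaults are assumptions for illustration.

```python
def strip_php_blocks(text, start_tag='<?', end_tag='?>'):
    """Return text with every start_tag ... end_tag block removed."""
    pieces = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            pieces.append(text[pos:])  # no more PHP: keep the rest
            break
        pieces.append(text[pos:start])  # keep the HTML before the block
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            # PHP allows the final '?>' to be omitted, so an unterminated
            # block runs to the end of the file and is dropped entirely
            break
        pos = end + len(end_tag)
    return ''.join(pieces)

page = '<p>a</p><? one ?><p>b</p><?php two ?>end'
print(strip_php_blocks(page))
```

Since '<?php' begins with '<?', the default start tag covers both forms.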

Also, before the

</body>
</html>

of every html file, after removing the tags, this line must be
inserted (it holds the template variable that 'counter.py' uses to
produce data):

<br><br><center><h4><font color=green> Αριθμός Επισκεπτών: %(counter)d
</h4>

This problem is truly trivial. I know you can do it yourself, or at
least give it a good shot, and ask again when you hit a serious roadblock.

If I may comment on your HTML: you forgot to close your <center> and
<font> tags. Close them! Also, both (CENTER and FONT) have been
deprecated since HTML 4.0 -- you should consider using CSS for these
tasks instead. Also, this line does not look like a heading, so H4 is
hardly fitting.

After making these modifications I can trst the script on a COPY
of the original data on my pc.

It would be nice if you re-read your posts before sending and tried to
iron out some of the more careless spelling mistakes. Maybe you are doing
your best to post in good English -- it isn't bad and I realize this is
neither your native language nor alphabet, in which case I apologize.
The fact of the matter is: I originally interpreted "trst" as "trust",
which made no sense whatsoever.

*On my pc I run Windows 7, while the remote web hosting setup uses Linux
servers.
*That won't be a problem, right?

Nah.
 
Ν

Νίκος

I could just hand you a solution, but I'll be a bit of a bastard and
just give you some hints.

You could use regular expressions. If you know regular expressions, it's
relatively trivial - but I doubt you know regexp.

Here is the code with some try-and-fail modifications I made, still
non-working, based on your hints:
==========================================================

id = 0 # unique page_id

for currdir, files, dirs in os.walk('varsa'):

    for f in files:

        if f.endswith('php'):

            # get abs path to filename
            src_f = join(currdir, f)

            # open php src file
            print 'reading from %s' % src_f
            f = open(src_f, 'r')
            src_data = f.read() # read contents of PHP file
            f.close()

            # replace tags
            print 'replacing php tags and contents within'
            src_data = src_data.replace(r'<?.?>', '') # the dot matches any character, I hope! no matter how many of them?!?

            # add ID
            print 'adding unique page_id'
            src_data = ( '<!-- %d -->' % id ) + src_data
            id += 1

            # add template variables
            print 'adding counter template variable'
            src_data = src_data + ''' <h4><font color=green> Αριθμός Επισκεπτών: %(counter)d </font></h4> '''
            # I can think of this, but the above line must go above </body></html>, NOT after; but how to write that?!?

            # rename old php file to new with .html extension
            src_file = src_file.replace('.php', '.html')

            # open newly created html file for inserting data
            print 'writing to %s' % dest_f
            dest_f = open(src_f, 'w')
            dest_f.write(src_data) # write contents
            dest_f.close()

This is the best I can do. Sorry for any typos I might have made.

Please shed some LIGHT!
 
T

Thomas Jollans

Here is the code with some try-and-fail modifications I made, still
non-working, based on your hints:
==========================================================

id = 0 # unique page_id

for currdir, files, dirs in os.walk('varsa'):

    for f in files:

        if f.endswith('php'):

            # get abs path to filename
            src_f = join(currdir, f)

            # open php src file
            print 'reading from %s' % src_f
            f = open(src_f, 'r')
            src_data = f.read() # read contents of PHP file
            f.close()

            # replace tags
            print 'replacing php tags and contents within'
            src_data = src_data.replace(r'<?.?>', '') # the dot matches any character, I hope! no matter how many of them?!?

Two problems here:

str.replace doesn't use regular expressions. You'll have to use the re
module to use regexps. (the re.sub function to be precise)

'.' matches a single character. Any character, but only one.
'.*' matches as many characters as possible. This is not what you want,
since it will match everything between the *first* <? and the *last* ?>.
You want non-greedy matching.

'.*?' is the same thing, without the greed.
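
The difference between the greedy and non-greedy forms is easy to see on a short made-up string with two PHP blocks. Note that the question marks of the tags themselves have to be escaped as \? in the pattern, because an unescaped ? is a regex quantifier. This is only an illustration; the sample text is invented.

```python
import re

page = '<? a ?><p>keep</p><? b ?>'

# Greedy: '.*' runs from the first '<?' all the way to the last '?>'
greedy = re.sub(r'<\?.*\?>', '', page)

# Non-greedy: '.*?' stops at the nearest '?>', so each block goes separately
lazy = re.sub(r'<\?.*?\?>', '', page)

print(repr(greedy))
print(repr(lazy))
```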

            # add ID
            print 'adding unique page_id'
            src_data = ( '<!-- %d -->' % id ) + src_data
            id += 1

            # add template variables
            print 'adding counter template variable'
            src_data = src_data + ''' <h4><font color=green> Αριθμός Επισκεπτών: %(counter)d </font></h4> '''
            # I can think of this, but the above line must go above </body></html>, NOT after; but how to write that?!?

You will have to find the </body> tag before inserting the string.
str.find should help -- or you could use str.replace and replace the
</body> tag with your counter line, plus a new </body>.

            # rename old php file to new with .html extension
            src_file = src_file.replace('.php', '.html')

            # open newly created html file for inserting data
            print 'writing to %s' % dest_f
            dest_f = open(src_f, 'w')
            dest_f.write(src_data) # write contents
            dest_f.close()

This is the best i can do.

No it's not. You're just giving up too soon.
 
J

John S

Two problems here:

str.replace doesn't use regular expressions. You'll have to use the re
module to use regexps. (the re.sub function to be precise)

'.'  matches a single character. Any character, but only one.
'.*' matches as many characters as possible. This is not what you want,
since it will match everything between the *first* <? and the *last* ?>.
You want non-greedy matching.

'.*?' is the same thing, without the greed.

You will have to find the </body> tag before inserting the string.
str.find should help -- or you could use str.replace and replace the
</body> tag with your counter line, plus a new </body>.
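
The str.replace variant of this hint can be sketched as follows. The counter line here is a simplified placeholder, not the poster's actual markup, and the name add_counter_line is made up for the example.

```python
# Placeholder counter line; %(counter)d is filled in later by the counter script
COUNTER_LINE = '<h4>Visitors: %(counter)d</h4>\n'

def add_counter_line(html, line=COUNTER_LINE):
    """Put line directly in front of the first </body> tag."""
    # The third argument limits the replacement to one occurrence
    return html.replace('</body>', line + '</body>', 1)

page = '<body>x</body></html>'
print(add_counter_line(page))
```

Two caveats: the match is case-sensitive, so '</BODY>' is left alone, and if the page is later filled in with the % operator, any other literal % in the HTML has to be written as %%.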

No it's not. You're just giving up too soon.

When replacing text in an HTML document with re.sub, you want to use
the re.S (singleline) option; otherwise your pattern won't match when
the opening tag is on one line and the closing is on another.
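
A short demonstration of that point on an invented page, where the PHP block spans several lines. re.S is the short name of re.DOTALL, which lets the dot match newlines as well.

```python
import re

page = '<p>top</p>\n<?php\necho "hi";\n?>\n<p>bottom</p>'

# Without re.S the dot stops at the first newline, so nothing matches
without_flag = re.sub(r'<\?.*?\?>', '', page)

# With re.S the dot also matches newlines and the whole block is removed
with_flag = re.sub(r'<\?.*?\?>', '', page, flags=re.S)

print(repr(without_flag))
print(repr(with_flag))
```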
 
J

Joel Goldstick

This is quite a vague description of the file contents. But, for a
completely different approach, how about using a browser and doing view
source, then saving the html that was generated. This will contain no
php code, but it will contain the results of whatever the php was doing.

If you don't have time to do this manually, look into wget or curl,
which will do the job in a program environment.
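
The same idea can also be sketched in Python 3 with urllib.request instead of wget or curl: ask the web server for each page, so the PHP runs, and save what comes back. The helper names, the base URL and the paths below are placeholders, not details from the thread, and the sketch assumes the pages are reachable on a test server.

```python
import os
from urllib.request import urlopen

def html_name(php_path):
    """Map 'news/index.php' to 'news/index.html'."""
    root, _ext = os.path.splitext(php_path)
    return root + '.html'

def save_rendered(base_url, php_path, out_dir):
    """Fetch one page through the web server and store the HTML it produced."""
    with urlopen(base_url + php_path) as response:
        body = response.read()  # bytes, exactly as the server sent them
    out_path = os.path.join(out_dir, html_name(php_path))
    parent = os.path.dirname(out_path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    with open(out_path, 'wb') as out:
        out.write(body)
    return out_path

# Example call (placeholder URL, not executed here):
# save_rendered('http://localhost/data/', 'news/index.php', 'rendered')
```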

The discussion so far has dealt with stripping the php and leaving the
html. But the html must have embedded <?php some code to print something
?> blocks in it. Or, there could be long fragments of html which are
constructed by php and then echoed.

Joel Goldstick
 
