Replacing and inserting strings within .txt files with the use of regex

Discussion in 'Python' started by Νίκος, Aug 8, 2010.

  1. Νίκος

    Νίκος Guest

    Hello dear Pythoneers,

    I have over 500 .php web pages in various subfolders under the 'data'
    folder that I have to rename to .html, ditch the '<?' and '?>'
    tags from within, and also insert a very first line of <!-- id -->
    where id must be a unique identification number for every page, for
    counter tracking purposes. Only pure html code must be left.

    Before I found out about Python I used php, and now I am switching to a
    templates + python solution, so I have to change each and every page.

    I don't know how to handle such a big data replacing problem and
    cannot play with fire, because those 500 pages are my clients' pages and
    the data in those files just cannot be messed up.

    Can you please provide me with a script that is able to perform such
    a page content replacement automatically?

    Thanks a million!
    Νίκος, Aug 8, 2010

  2. Νίκος

    rantingrick Guest

    I prefer Pythonista, but anywho..
    import os
    os.rename(old, new)
    path = 'some/valid/path'
    f = open(path, 'r')
    data = f.read()
    data.replace('<?', '')
    data.replace('?>', '')
    Well then don't F up! However, judging from the amount of typos in this
    post I would suggest you do some major testing!
    Better do some serious testing first, or (if you have enough disk
    space) create copies instead!
    This is very basic stuff and the fine manual is free, you know. But how
    much are you willing to pay?
    rantingrick, Aug 8, 2010

  3. Νίκος

    MRAB Guest

    That should be:

    data = data.replace('<?', '')
    data = data.replace('?>', '')
    Strings don't have an 'insert' method!
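
The correction above follows from Python strings being immutable: `str.replace` returns a new string and leaves the original untouched, and there is no `insert` method to call. A minimal sketch of both points; `insert_at` is a hypothetical helper written for illustration, not a built-in:

```python
# str.replace returns a new string; the original is never modified in place.
data = "<? echo 1; ?><p>hello</p>"
data.replace("<?", "")            # result is discarded, data is unchanged
unchanged = data

cleaned = data.replace("<?", "").replace("?>", "")  # keep the result


def insert_at(text, index, piece):
    """Hypothetical helper: build a new string with piece inserted at index."""
    return text[:index] + piece + text[index:]


with_id = insert_at(cleaned, 0, "<!-- 1 -->")
```

Slicing plus concatenation is the usual stand-in for an "insert" on strings.
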
    MRAB, Aug 8, 2010
  4. Νίκος

    Νίκος Guest

    # rename ALL php files to html in every subfolder of the folder 'data'
    os.rename('*.php', '*.html')  # how to tell python to rename ALL php
                                  # files to html in ALL subfolders under 'data'?

    # current path of the file to be processed
    path = './data'  # this must somehow be in a loop, I feel, that reads
                     # every file of every subfolder

    # open an html file for reading
    f = open(path, 'rw')
    # read the contents of the whole file
    data = f.read()

    # replace all php tags with empty string
    data = data.replace('<?', '')
    data = data.replace('?>', '')

    # write replaced data to file
    data = f.write()

    # insert an increasing unique integer number at the very first line
    # of every html file processed
    comment = "<!-- %s -->" % (idnum)  # how will the number change here,
                                       # increasing by one file after file?
    f = f.close()

    Please help, I'm new to python, and apart from syntax it's a logic problem
    as well and needs experience.
    Νίκος, Aug 8, 2010
  5. Νίκος

    John S Guest

    If the 500 web pages are PHP only in the sense that there is only one
    pair of <? ?> tags in each file, surrounding the entire content, then
    what you ask for is doable.

    from os.path import join
    import os

    id = 1  # id number
    for currdir, dirs, files in os.walk('data'):
        for f in files:
            if f.endswith('php'):
                source_file_name = join(currdir, f)  # get path to file
                source_file = open(source_file_name)
                source_contents = source_file.read()  # read contents of PHP file
                source_file.close()

                # replace tags
                source_contents = source_contents.replace('<?', '')
                source_contents = source_contents.replace('?>', '')

                # add ID
                source_contents = ('<!-- %d -->' % id) + source_contents
                id += 1

                # create new file with .html extension
                source_file_name = source_file_name.replace('.php', '.html')
                dest_file = open(source_file_name, 'w')
                dest_file.write(source_contents)  # write contents
                dest_file.close()

    Note: error checking left out for clarity.
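
For reference, the loop above can be sketched as a self-contained Python 3 function. This is only an illustration under stated assumptions: the name `convert_tree` is hypothetical, the files are assumed to be UTF-8, a newline is added after the id comment so that it sits on its own first line, and the original .php files are left in place rather than renamed.

```python
import os


def convert_tree(root, start_id=1):
    """Write an .html copy of every .php file under root; return the count.

    Only the '<?' and '?>' markers are removed, so any PHP code between
    them stays in the output. The original .php files are left in place.
    """
    page_id = start_id
    # os.walk yields (dirpath, dirnames, filenames), in that order
    for currdir, dirs, files in os.walk(root):
        dirs.sort()  # make the numbering order repeatable
        for name in sorted(files):
            if not name.endswith(".php"):
                continue
            src_path = os.path.join(currdir, name)
            with open(src_path, encoding="utf-8") as src:
                contents = src.read()

            contents = contents.replace("<?", "").replace("?>", "")
            contents = "<!-- %d -->\n" % page_id + contents
            page_id += 1

            dest_path = src_path[:-len(".php")] + ".html"
            with open(dest_path, "w", encoding="utf-8") as dest:
                dest.write(contents)
    return page_id - start_id
```

Writing a new .html file next to each .php file, instead of overwriting, keeps the run repeatable while the output is being checked.
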

    On the other hand, if your 500 web pages contain embedded PHP
    variables or logic, you have a big job ahead. Django templates and PHP
    are two different languages for embedding data and logic in web pages.
    Converting a project from PHP to Django involves more than renaming
    the template files and deleting "<?" and friends.

    For example, here is a snippet of PHP which checks which browser is
    viewing the page:

    if (strpos($_SERVER['HTTP_USER_AGENT'], 'MSIE') !== FALSE) {
        echo 'You are using Internet Explorer.<br />';
    }

    In Django, you would typically put this logic in a Django *view*
    (which, btw, is not what is called a 'view' in MVC terms), which is the
    code that prepares data for the template. The logic would not live
    with the HTML. The template uses "template variables" that the view
    has associated with a Python variable or function. You might create a
    template variable (created via a Context object) named 'browser' that
    contains a value that identifies the browser.

    Thus, your Python template (HTML file) might look like this:

    {% if browser == 'IE' %}You are using Internet Explorer{% endif %}
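
The view-side half of that split might look roughly like the following. It is plain Python with no Django import, so it only illustrates where the browser check would live; `detect_browser` and `build_context` are hypothetical names, and in a real Django view the returned dict would be handed to the template engine.

```python
def detect_browser(user_agent):
    """Hypothetical helper: map a User-Agent string to a short browser id."""
    if "MSIE" in user_agent:
        return "IE"
    return "other"


def build_context(environ):
    """Return the dict of template variables a view would pass to the template."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    return {"browser": detect_browser(user_agent)}


context = build_context({"HTTP_USER_AGENT": "Mozilla/4.0 (compatible; MSIE 8.0)"})
```

The template then only tests the `browser` variable and never sees the User-Agent string itself.
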

    PHP tends to combine the presentation with the business logic, or in
    MVC terms, combines the view with the controller. Django separates
    them out, which many people find to be a better way. The person who
    writes the HTML doesn't have to speak Python, but only know the names
    of template variables and a little bit of template logic. In PHP, the
    HTML code and all the business logic lives in the same files. Even
    here, it would probably make sense to calculate the browser ID in the
    header of the HTML file, then access it via a variable in the body.

    If you have 500 static web pages that are part of the same
    application, but that do not contain any logic, your application might
    need to be redesigned.

    Also, you are doing your changes on a COPY of the application on a
    non-public server, aren't you? If not, then you really are playing with
    fire.

    John S, Aug 8, 2010
  6. Νίκος

    rantingrick Guest

    Yes, Thanks MRAB. I did forget that important detail.
    *facepalm*! I really must stop Usenet-ing whilst consuming large
    volumes of alcoholic beverages.
    rantingrick, Aug 8, 2010
  7. Νίκος

    John S Guest

    Even though I just replied above, in reading over the OP's message, I
    think the OP might be asking:

    "How can I use RE string replacement to find PHP tags and convert them
    to Django template tags?"

    Instead of saying

    source_contents = source_contents.replace(...)

    say this instead:

    import re

    def replace_php_tags(m):
        ''' PHP tag replacer
        This function is called for each PHP tag. It gets a Match object as
        its parameter, so you can get the contents of the old tag, and
        return the new (Django) tag.
        '''

        # m is the match object from the current match
        php_guts = m.group(1)  # the contents of the PHP tag

        # now put the replacement logic here

        # and return whatever should go in place of the PHP tag,
        # which could be '{{ python_template_var }}'
        # or '{% template logic ... %}'
        # or some combination
    source_contents = re.sub('<?\s*(.*?)\s*?
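
A complete, runnable version of the callback idea above could look like this. It is a sketch under assumptions: the pattern also accepts `<?php`, the `re.S` flag is used so blocks may span lines, and the "echo becomes a template variable" rule is purely illustrative, not something from the original post.

```python
import re

# Non-greedy match of one <? ... ?> block; re.S lets '.' cross line breaks.
PHP_TAG = re.compile(r"<\?(?:php)?\s*(.*?)\s*\?>", re.S)


def replace_php_tag(m):
    """Return the text that should replace one PHP block."""
    php_guts = m.group(1)  # the code between the opening and closing tag
    # Illustrative rule only: turn "echo $name;" into "{{ name }}", drop the rest
    echo = re.match(r"echo\s+\$(\w+)\s*;?$", php_guts)
    if echo:
        return "{{ %s }}" % echo.group(1)
    return ""


source_contents = "<h1><? echo $title; ?></h1>\n<?php\n$x = 1;\n?>\n<p>text</p>"
converted = PHP_TAG.sub(replace_php_tag, source_contents)
```

Because `re.sub` calls the function once per match, each PHP block can be replaced by something different, which plain `str.replace` cannot do.
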
    John S, Aug 8, 2010
  8. Νίκος

    Νίκος Guest

    First of all, thank you very much John for your BIG effort to help
    me (I'm still reading your posts)!

    I have to tell you here that those php files contain several instances
    of php opening and closing tags (like 3 in each php file). The rest is
    pure html data. That happened because those files were in the
    beginning html-only files that later needed conversion to php due to
    some dynamic code that had to be used to address some issues.

    Please tell me that the code you provided can be adjusted to several
    instances as well!
    Νίκος, Aug 8, 2010
  9. Νίκος

    Νίκος Guest

    No, not at all John, at least not yet!

    I have only been learning python for 1 week (changing from php & perl),
    so I'm very fresh at this beautiful and straightforward language.

    When I have a good understanding of Python then I will proceed to
    Django templates.
    Until then my Python templates would be only 'simple html files' where
    the only thing they contain apart from the html data would be the
    special string formatting identifier '%s' :)
    Νίκος, Aug 8, 2010
  10. Take a backup copy of the files, and only edit the copies. Don't replace
    the originals until you know they're correct.
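
Taking that backup can itself be scripted. A minimal sketch using the standard library's `shutil.copytree`; the function name `backup_tree` and the folder names in the usage note are assumptions for illustration:

```python
import shutil


def backup_tree(src_dir, backup_dir):
    """Copy src_dir (with all subfolders) to backup_dir, which must not exist yet."""
    # shutil.copytree raises FileExistsError if backup_dir already exists,
    # so an earlier backup is never silently overwritten
    return shutil.copytree(src_dir, backup_dir)
```

For example, `backup_tree('data', 'data_backup')` would be run once before any conversion script touches the files.
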
    Steven D'Aprano, Aug 8, 2010
  11. Yes of course, but the code that John S provided needs some
    modification in order to be able to change various instances of php
    tags and not only one set.
    Νίκος, Aug 8, 2010
  12. Script so far:


    import cgitb; cgitb.enable()
    import cgi, re, os
    from os.path import join

    print("Content-type: text/html; charset=UTF-8 \n")

    id = 0  # unique page_id

    for currdir, dirs, files in os.walk('data'):

        for f in files:

            if f.endswith('php'):

                # get abs path to filename
                src_f = join(currdir, f)

                # open php src file
                f = open(src_f, 'r')
                src_data = f.read()  # read contents of PHP file
                print('reading from %s' % src_f)

                # replace tags
                src_data = src_data.replace('<%', '')
                src_data = src_data.replace('%>', '')
                print('replacing php tags')

                # add ID
                src_data = ('<!-- %d -->' % id) + src_data
                id += 1
                print('adding unique page_id')

                # create new file with .html extension
                src_f = src_f.replace('.php', '.html')

                # open newly created html file for inserting data
                dest_f = open(src_f, 'w')
                dest_f.write(src_data)  # write contents
                print('writing to %s' % src_f)

    Please help me adjust it, if it needs extra modification for more php tags
    Νίκος, Aug 8, 2010
  13. THAT explains a lot.

    Thomas Jollans, Aug 8, 2010
  14. Have you tried it? I haven't, but I see no immediate reason why it
    wouldn't work with multiple PHP blocks.
    Did you read the script before posting? ;-)
    Here, you remove ASP-style tags. Which is fine, PHP supports them if you
    configure it that way, but you probably didn't. Change this to the start
    and end tags you actually use, and, if you use multiple forms (such as
    <?php vs <?), then add another line or two.
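
When both `<?php` and `<?` occur, the order of the extra replace lines matters. A small sketch, with the hypothetical helper name `strip_php_markers`; note that it only removes the markers and keeps whatever code sits between them:

```python
def strip_php_markers(text):
    """Remove PHP open/close markers but keep whatever is between them."""
    # Longest marker first: removing '<?' first would leave a stray 'php'
    for marker in ("<?php", "<?", "?>"):
        text = text.replace(marker, "")
    return text


sample = "<?php echo 1; ?><p>hi</p><? echo 2; ?>"
stripped = strip_php_markers(sample)
```
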
    Thomas Jollans, Aug 8, 2010
  15. Yes, I have read the code very well, and by mistake I wrote '<%>'
    instead of '<?'

    I was so dizzy and confused yesterday that I forgot to mention that
    I need removal not only of the php opening and closing tags but of whatever
    data lurks inside those tags as well, because now with the ''
    script I wrote the html files would open from there and substitute the
    template variables like %(counter)d

    Also before the </body></html>
    of every html file, after removing the tags, this line must be
    inserted (this holds the template variable) that '' uses to
    produce data

    <br><br><center><h4><font color=green> Αριθμός Επισκεπτών: %(counter)d

    After making these modifications I can then trst the script on a COPY
    of the original data in my pc.

    *In my pc I run Windows 7 while the remote web hosting setup uses Linux
    *That won't be a problem, right?
    Νίκος, Aug 8, 2010
  16. I could just hand you a solution, but I'll be a bit of a bastard and
    just give you some hints.

    You could use regular expressions. If you know regular expressions, it's
    relatively trivial - but I doubt you know regexp.

    You could also repeatedly find the next occurrence of first a start tag,
    then an end tag, using either str.find or str.split, and build up a
    version of the file without PHP yourself.
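
The `str.find` variant of that hint can be sketched as a loop that copies everything outside the tags. The function name `strip_php_blocks` is hypothetical, and treating an unclosed block as running to the end of the file is an assumption of this sketch:

```python
def strip_php_blocks(text, start_tag="<?", end_tag="?>"):
    """Return text with every start_tag ... end_tag block removed."""
    pieces = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            pieces.append(text[pos:])      # no more blocks: keep the rest
            break
        pieces.append(text[pos:start])     # keep the HTML before the block
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            break                          # unclosed block: drop to end of file
        pos = end + len(end_tag)           # continue after the closing tag
    return "".join(pieces)


page = "<html><? $a = 1; ?><body>\n<?php\necho $a;\n?>\n</body></html>"
html_only = strip_php_blocks(page)
```

Since `<?php` starts with `<?`, the default `start_tag` covers both spellings.
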

    This problem is truly trivial. I know you can do it yourself, or at
    least give it a good shot, and ask again when you hit a serious roadblock.

    If I may comment on your HTML: you forgot to close your <center> and
    <font> tags. Close them! Also, both (CENTER and FONT) have been
    deprecated since HTML 4.0 -- you should consider using CSS for these
    tasks instead. Also, this line does not look like a heading, so H4 is
    hardly fitting.
    It would be nice if you re-read your posts before sending and tried to
    iron out some of the more careless spelling mistakes. Maybe you are doing
    your best to post in good English -- it isn't bad and I realize this is
    neither your native language nor alphabet, in which case I apologize.
    The fact of the matter is: I originally interpreted "trst" as "trust",
    which made no sense whatsoever.
    Thomas Jollans, Aug 8, 2010
  17. Here is the code with some try-and-fail modifications I made, still
    non-working, based on your hints:

    id = 0  # unique page_id

    for currdir, dirs, files in os.walk('varsa'):

        for f in files:

            if f.endswith('php'):

                # get abs path to filename
                src_f = join(currdir, f)

                # open php src file
                print('reading from %s' % src_f)
                f = open(src_f, 'r')
                src_data = f.read()  # read contents of PHP file

                # replace tags
                print('replacing php tags and contents within')
                # the dot matches any character i hope! no matter how many of them?!?
                src_data = src_data.replace(r'<?.?>', '')

                # add ID
                print('adding unique page_id')
                src_data = ('<!-- %d -->' % id) + src_data
                id += 1

                # add template variables
                print('adding counter template variable')
                src_data = src_data + ''' <h4><font color=green> Αριθμός Επισκεπτών: %(counter)d </font></h4> '''
                # i can think of this but the above line must be above
                # </body></html> NOT after but how to write that?!?

                # rename old php file to new with .html extension
                src_f = src_f.replace('.php', '.html')

                # open newly created html file for inserting data
                print('writing to %s' % src_f)
                dest_f = open(src_f, 'w')
                dest_f.write(src_data)  # write contents

    This is the best I can do. Sorry for any typos I might have made.

    Please shed some LIGHT!
    Νίκος, Aug 8, 2010
  18. Two problems here:

    str.replace doesn't use regular expressions. You'll have to use the re
    module to use regexps. (the re.sub function to be precise)

    '.' matches a single character. Any character, but only one.
    '.*' matches as many characters as possible. This is not what you want,
    since it will match everything between the *first* <? and the *last* ?>.
    You want non-greedy matching.

    '.*?' is the same thing, without the greed.
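
The difference between the greedy and non-greedy forms is easy to see on a short string. In this sketch the literal `?` characters of the tags are escaped as `\?`, since an unescaped `?` is itself a regex operator:

```python
import re

text = "<p>one</p><? a ?><p>two</p><? b ?><p>three</p>"

# Greedy: '.*' runs from the first '<?' to the last '?>' and eats "<p>two</p>"
greedy = re.sub(r"<\?.*\?>", "", text)

# Non-greedy: '.*?' stops at the nearest '?>', so each block goes separately
non_greedy = re.sub(r"<\?.*?\?>", "", text)
```
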
    You will have to find the </body> tag before inserting the string.
    str.find should help -- or you could use str.replace and replace the
    </body> tag with your string followed by </body>.
    No it's not. You're just giving up too soon.
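
Placing a line just before the </body> tag can be sketched with slicing. The helper name `insert_before_body_end` is hypothetical; it uses `str.rfind`, the right-to-left sibling of the `str.find` mentioned above, and assumes the tag is written in lowercase exactly as `</body>`:

```python
def insert_before_body_end(html, snippet):
    """Return html with snippet placed just before the last </body> tag."""
    pos = html.rfind("</body>")
    if pos == -1:
        return html + snippet    # no </body> found: append at the end instead
    return html[:pos] + snippet + html[pos:]


page = "<html><body><p>hi</p></body></html>"
counter_line = "<h4>%(counter)d</h4>"
result = insert_before_body_end(page, counter_line)
```
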
    Thomas Jollans, Aug 8, 2010
  19. Νίκος

    John S Guest

    When replacing text in an HTML document with re.sub, you want to use
    the re.S (singleline) option; otherwise your pattern won't match when
    the opening tag is on one line and the closing is on another.
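
A short check of what the flag changes, using the same non-greedy pattern with and without `re.S` (also spelled `re.DOTALL`):

```python
import re

text = "<p>top</p>\n<?php\necho 'x';\n?>\n<p>bottom</p>"
pattern = r"<\?.*?\?>"

without_flag = re.sub(pattern, "", text)             # '.' stops at '\n'
with_flag = re.sub(pattern, "", text, flags=re.S)    # '.' also matches '\n'
```

Without the flag the multi-line block is left untouched, because `.` never matches the line breaks inside it.
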
    John S, Aug 8, 2010
  20. This is quite a vague description of the file contents. But, for a
    completely different approach, how about using a browser and doing view
    source, then saving the html that was generated. This will contain no
    php code, but it will contain the results of whatever the php was doing.

    If you don't have time to do this manually, look into wget or curl,
    which will do the job in a program environment.
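
The same idea can be sketched in Python with the standard library instead of wget or curl. The function name `save_rendered_page` is hypothetical, and the URL of each page on the running PHP site would have to be supplied by the caller:

```python
import urllib.request


def save_rendered_page(url, dest_path):
    """Fetch url and save the response body (the generated HTML) to dest_path."""
    with urllib.request.urlopen(url) as response:
        body = response.read()        # bytes, exactly as the server sent them
    with open(dest_path, "wb") as dest:
        dest.write(body)
    return len(body)
```

The bytes are written unchanged, so no guess about the page's character encoding is needed.
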

    The discussion so far has dealt with stripping php, and leaving the
    html. But the html must have embedded <?php some code to print something
    ?> in it. Or, there could be long fragments of html which are
    constructed by php and then echo'ed.

    Joel Goldstick
    Joel Goldstick, Aug 8, 2010