Why isn't my re.sub replacing the contents of my MS Word file?

Discussion in 'Python' started by scottcabit, May 9, 2014.

  1. scottcabit

    scottcabit Guest

    Hi,

    Here is a snippet of code that opens a file (fn contains the path\name) and first tries to replace all en-dash, em-dash etc. characters with simple dash characters, before doing a search.
    But the replacements are not having any effect. Obviously a syntax problem... what silly thing am I doing wrong?

    Thanks!

    fn = 'z:\Documentation\Software'
    def processdoc(fn,outfile):
        fStr = open(fn, 'rb').read()
        re.sub(b'&#x2012','-',fStr)
        re.sub(b'&#x2013','-',fStr)
        re.sub(b'&#x2014','-',fStr)
        re.sub(b'&#x2015','-',fStr)
        re.sub(b'&#x2E3A','-',fStr)
        re.sub(b'&#x2E3B','-',fStr)
        re.sub(b'&#x002D','-',fStr)
        re.sub(b'&#x00AD','-',fStr)
     
    scottcabit, May 9, 2014
    #1

  2. scottcabit

    MRAB Guest

    re.sub _returns_ its result (strings are immutable).
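
    A minimal sketch of the point above: `re.sub` leaves its input untouched and hands back a new object, so the result has to be bound to a name. The sample bytes are made up for illustration and are not real Word data.

```python
import re

data = b"pages 10&#x2013 20"  # hypothetical sample, not real Word data

# Calling re.sub and ignoring the return value changes nothing.
re.sub(b"&#x2013", b"-", data)
unchanged = data

# Binding the return value keeps the substituted copy.
replaced = re.sub(b"&#x2013", b"-", data)

print(unchanged)
print(replaced)
```

    Note that the replacement is also a bytes object here; mixing a bytes pattern with a str replacement fails as soon as something actually matches.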
     
    MRAB, May 9, 2014
    #2

  3. I can see several things that might be wrong, but it's hard to say
    what *is* wrong without trying it.

    1) Is the file close enough to text that you can even do this sort of
    parsing? You say it's an MS Word file; that, unfortunately, could mean
    a lot of things. Some of the newer formats are basically zipped XML,
    so translations like this won't work. Other forms of Word document may
    be closer to text, but you majorly risk corrupting the binary content.

    2) How are characters represented? Are they actually stored in the
    file with ampersands, hashes, etc? Your source strings are all seven
    bytes long, and will look for exactly those bytes. There must be some
    form of character encoding used; possibly, instead of the &#x
    notation, you need to UTF-8 or UTF-16LE encode the characters to look
    for.

    3) You're doing simple string replacements using regular expressions.
    I don't think any of your symbols here is a metacharacter, but I might
    be wrong. If you're simply replacing one stream of bytes with another,
    don't use regex at all, just use string replacement.

    4) There's nothing in your current code to actually write the contents
    anywhere. You do all the changes and then do nothing with it. Or is
    this just part of the code?

    5) Similarly, there's nothing in this fragment that actually calls
    processdoc(). Did you elide that? The fragment you wrote will do a
    whole lot of nothing, on its own.

    6) There's no file extension on your input file name; be sure you
    really have the file you want, and not (for instance) a directory. Or
    if you need to iterate over all the files in a directory, you'll need
    to do that explicitly.

    7) This one isn't technically a problem, but it's a risk. The string
    'z:\Documentation\Software' has two backslash escapes \D and \S, which
    the parser fails to recognize, and therefore passes through literally.
    So it works, currently. However, if you were to change the path to,
    say, 'z:\Documentation\backups', then it would suddenly fail. There
    are several solutions to this:
    7a) fn = r'z:\Documentation\Software'
    7b) fn = 'z:\\Documentation\\Software'
    7c) fn = 'z:/Documentation/Software'
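
    A small sketch of the risk described in point 7, assuming the hypothetical 'backups' path from the example. The risky string is built from two pieces so that only the recognised `\b` escape is involved.

```python
# '\b' is a recognised escape (backspace), so this path is silently corrupted.
risky = 'z:\\Documentation' + '\backups'

raw = r'z:\Documentation\backups'        # raw string: backslashes kept as-is
doubled = 'z:\\Documentation\\backups'   # each backslash escaped explicitly
forward = 'z:/Documentation/backups'     # forward slashes instead of backslashes

print(repr(risky))     # shows \x08 where the backslash and 'b' were expected
print(raw == doubled)  # the two safe spellings give the same string
```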

    Hope that helps some, at least! A more full program would be easier to
    work with.

    ChrisA
     
    Chris Angelico, May 9, 2014
    #3
  4. scottcabit

    Tim Chase Guest

    A Word doc (as your subject mentions) is a binary format. There's
    the older .doc and the newer .docx (which is actually a .zip file
    with a particular content-structure renamed to .docx).

    Your example doesn't show the extension, so it's hard to tell whether
    you're working with the old format or the new format.

    That said, a simple replacement *certainly* won't work for a .docx
    file, as you'd have to uncompress the contents, open up the various
    files inside, perform the replacements, then zip everything back up,
    and save the result back out.
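
    The unzip, replace, re-zip steps for a .docx could be sketched roughly as follows. This is an illustration under assumptions, not a tested Word tool: it only touches the main part `word/document.xml`, assumes that part is UTF-8, and the function name and the choice of dash characters are made up for the example. Headers, footnotes and other parts live in separate files inside the archive and are copied unchanged.

```python
import io
import zipfile

# Dash-like characters to normalise; the selection is an assumption.
DASHES = "\u2012\u2013\u2014\u2015"


def replace_dashes_in_docx(docx_bytes):
    """Return a new .docx (as bytes) with dashes in word/document.xml replaced."""
    table = str.maketrans({ch: "-" for ch in DASHES})
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            payload = src.read(item.filename)
            if item.filename == "word/document.xml":
                # The main document part is XML, assumed here to be UTF-8.
                text = payload.decode("utf-8").translate(table)
                payload = text.encode("utf-8")
            # Every other member is copied across byte for byte.
            dst.writestr(item, payload)
    return out.getvalue()
```

    Working on bytes in memory keeps the sketch independent of any particular file path; for a real file the input would come from `open(path, 'rb').read()`.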

    For the older .doc file, it's a binary format, so even if you can
    successfully find & swap out sequences of 7 chars for a single char,
    it might screw up the internal offsets, breaking your file.
    Additionally, I vaguely remember sparring with them using some 16-bit
    wide characters in .doc files so you might have to search for
    atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
    character being prefixed with "\x00").

    -tkc
     
    Tim Chase, May 9, 2014
    #4
  5. scottcabit

    scottcabit Guest

    Ahh....so I tried this for each re.sub

    fStr = re.sub(b'&#x2012','-',fStr)

    No errors running it, but it still does nothing.....
     
    scottcabit, May 9, 2014
    #5
  6. scottcabit

    scottcabit Guest

    I am using .doc files only......
    I do not save the file out again, only try to change all en-dash and em-dash to dashes, then search and print things to another file, closing the searched file without writing it.
    Hmmm..thought that was what I was doing. Can anyone figure out why the syntax is wrong for Word 2007 document binary file data?
     
    scottcabit, May 9, 2014
    #6
  7. You're making the substitution, then throwing the result away.

    And you're using a nuclear-powered bulldozer to crack a peanut. This is
    not a job for regexes, this is a job for normal string replacement.
    Good:

    fStr = re.sub(b'&#x2012', b'-', fStr)

    Better:

    fStr = fStr.replace(b'&#x2012', b'-')


    But having said that, you actually can make use of the nuclear-powered
    bulldozer, and do all the replacements in one go:

    Best:

    # Untested
    fStr = re.sub(b'&#x(?:201[2-5]|2E3[AB]|00[2A]D)', b'-', fStr)


    If you're going to unload the power of regexes, unload them on something
    that makes it worthwhile. Replacing a constant, fixed string with another
    constant, fixed string does not require a regex.
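
    A small comparison of the two approaches, on made-up sample bytes. The regex uses a non-capturing group, `(?:...)`, so that the `&#x` prefix applies to every alternative; without the group, a bare `2E3A` elsewhere in the data would also be replaced.

```python
import re

# Hypothetical sample containing literal character references plus a bare "2E3A".
data = b"a&#x2012b&#x2E3Ac&#x00ADd 2E3A"

# Chained fixed-string replacement, one call per code.
codes = [b"2012", b"2013", b"2014", b"2015", b"2E3A", b"2E3B", b"002D", b"00AD"]
chained = data
for code in codes:
    chained = chained.replace(b"&#x" + code, b"-")

# One regex pass over the same data.
pattern = re.compile(b"&#x(?:201[2-5]|2E3[AB]|00[2A]D)")
single_pass = pattern.sub(b"-", data)

print(chained)
print(single_pass)
```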
     
    Steven D'Aprano, May 10, 2014
    #7
  8. Ah, my previous email missed the fact that you are operating on Word docs.
    You are searching for the literal "&#x2012", in other words:

    ampersand hash x two zero one two

    *not* a FIGURE DASH. Compare:


    py> import re
    py> source = b'aaaa&#x2012aaaa'
    py> print(source)
    b'aaaa&#x2012aaaa'
    py> re.sub(b'&#x2012', b'Z', source)
    b'aaaaZaaaa'

    But if the source contains an *actual* FIGURE DASH:

    py> source = u'aaaa\u2012aaaa'.encode('utf-8')
    py> print(source)
    b'aaaa\xe2\x80\x92aaaa'
    py> re.sub(b'&#x2012', b'Z', source)
    b'aaaa\xe2\x80\x92aaaa'


    You're dealing with a binary file format, and a complex one (Microsoft
    documents it as [MS-DOC]). Without a parser, you don't know which parts of the file
    represent text, metadata, formatting and layout information, or images.
    Even if you identify which parts are text, you don't know what encoding
    is used internally:

    py> u'aaaa\u2012aaaa'.encode('utf-8')
    b'aaaa\xe2\x80\x92aaaa'
    py> u'aaaa\u2012aaaa'.encode('utf-16be')
    b'\x00a\x00a\x00a\x00a \x12\x00a\x00a\x00a\x00a'
    py> u'aaaa\u2012aaaa'.encode('utf-16le')
    b'a\x00a\x00a\x00a\x00\x12 a\x00a\x00a\x00a\x00'

    or something else.

    You're on *extremely* thin ice here.

    If you *must* do this, then you'll need to identify how Word stores
    various dashes in the file. If you're lucky, the textual parts of the doc
    file will be obvious to the eye, so open a few sample files using a hex
    editor and you might be able to identify what Word is using to store the
    various forms of dash.
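
    As a complement to eyeballing a hex dump, a rough sketch like the one below counts how often each dash would appear under a few candidate encodings. The list of encodings is a guess, the function name is made up, and hits in binary data can be coincidences, so treat the counts as hints only.

```python
# Candidate encodings are guesses; Word may use something else entirely.
CANDIDATE_ENCODINGS = ("cp1252", "utf-8", "utf-16-le")
DASHES = {"EN DASH": "\u2013", "EM DASH": "\u2014"}


def count_dash_encodings(data):
    """Count how often each dash appears in `data` under each candidate encoding."""
    counts = {}
    for name, char in DASHES.items():
        for encoding in CANDIDATE_ENCODINGS:
            needle = char.encode(encoding)
            counts[(name, encoding)] = data.count(needle)
    return counts


# Made-up bytes standing in for the contents of a .doc file.
sample = b"\x00\x01Part\x96A and Part\x97B\x00\xff"
for key, hits in count_dash_encodings(sample).items():
    print(key, hits)
```

    For a real document, `sample` would be replaced by `open(path, 'rb').read()`.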
     
    Steven D'Aprano, May 10, 2014
    #8
  9. scottcabit

    Rustom Mody Guest

    If you are using MS-Word use that, not python.

    Yeah it is possible to script MS with something like this
    http://timgolden.me.uk/pywin32-docs/
    [no experience myself!]
    but it's probably not worth the headache for such a simple job.

    The VBA (or whatever is the modern equivalent) will be about as short and simple
    as your attempted python and making it work will be far easier.

    The way I used to do it with Windows-98 Word:
    Start a macro
    Do a simple single search and replace by hand
    Close the macro
    Edit the macro (VBA version)
    Replace the single search-n-replace with all the many you require
     
    Rustom Mody, May 10, 2014
    #9
  10. scottcabit

    wxjmfauth Guest

    On Saturday, May 10, 2014 06:22:00 UTC+2, Rustom Mody wrote:
    =========

    That's a wise recommendation.

    Anyway, as Python may fail as soon as one uses an
    EN DASH or an EM DASH, I think it's not worth the
    effort to spend too much time with it.

    LibreOffice could be a solution.

    jmf
     
    wxjmfauth, May 10, 2014
    #10
  11. scottcabit

    Tim Golden Guest

    Nope -- seems all right to me. (Hopefully helping the OP out as well as
    rebutting a rather foolish assertion).

    <code>
    #!python3.4
    import win32com.client
    import unicodedata

    word = win32com.client.gencache.EnsureDispatch("Word.Application")
    try:
        doc1 = word.Documents.Add()
        doc1.Range().Text += "Hello \u2014 World"
        doc1.SaveAs(r"c:\temp\em_dash.docx")
        doc1.Close()

        doc2 = win32com.client.GetObject(r"c:\temp\em_dash.docx")
        for uchar in doc2.Range().Text.strip():
            print(unicodedata.name(uchar))

    finally:
        word.Quit()

    </code>

    TJG
     
    Tim Golden, May 10, 2014
    #11
  12. scottcabit

    scottcabit Guest

    Doesn't work...the document has been verified to contain endash and emdash characters, but this does NOT replace them.
    Still doesn't work.

    Guess whatever the code is for endash and mdash are not the ones I am using....
     
    scottcabit, May 12, 2014
    #12
  13. scottcabit

    Dave Angel Guest

    More likely, your MSWord document isn't a simple text file. Some
    encodings don't resemble ASCII or Unicode in the least.
     
    Dave Angel, May 12, 2014
    #13
  14. scottcabit

    Rustom Mody Guest

    What happens if you divide two strings?
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for /: 'str' and 'str'

    Or multiply 2 lists?
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    TypeError: can't multiply sequence by non-int of type 'list'

    Trying to do a text operation like re.sub on a NON-text object like a doc-file
    is the same.

    Yes, Python may not be intelligent enough to give you such useful error messages
    outside its territory, i.e. on the contents of random files; logically, however, it's the
    same: an impossible operation.


    The options you have:
    1. Use doc-specific tools, e.g. MS/Libre Office, to work on doc files, i.e. don't use Python
    2. Follow Tim Golden's suggestion, ie use win32com which is a doc-talking
    python API [BTW Thanks Tim for showing how easy it is]
    3. Get out of the doc format to txt (export as plain txt) and then try what you
    are trying on the txt
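
    For option 3, once the document has been exported as plain text and decoded to a str, the dashes can be normalised with a translation table. A minimal sketch: the code points are the ones listed in the original question, while the function name and the sample line are invented for the example.

```python
# Code points taken from the list in the original question.
DASH_CODE_POINTS = [0x2012, 0x2013, 0x2014, 0x2015, 0x2E3A, 0x2E3B, 0x00AD]
DASH_TABLE = {code_point: "-" for code_point in DASH_CODE_POINTS}


def normalise_dashes(text):
    """Replace every dash-like character in `text` with an ASCII hyphen."""
    return text.translate(DASH_TABLE)


# Hypothetical line from a document exported as plain text.
line = "Part\u2013A, Part\u2014B, Part-C"
print(normalise_dashes(line))
```

    When reading the exported file, the encoding passed to `open()` has to match whatever the export used; which encoding that is depends on the export settings.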
     
    Rustom Mody, May 13, 2014
    #14
  15. You may have missed my follow up post, where I said I had not noticed you
    were operating on a binary .doc file.

    The text content of your doc file might look like:

    This – is an n-dash.


    when viewed in Microsoft Word, but that is not the contents on disk.
    Word .doc files are a proprietary binary format. Apart from the
    rest of the document structure and metadata, the text itself could be
    stored any old way, and you can't tell how from what Word displays.
    Microsoft has published a specification for the format ([MS-DOC]), but
    it is long and intricate. A few open source projects like
    OpenOffice, LibreOffice and Abiword have reverse-engineered the file
    format. Taking a wild guess, I think it could be something like:

    This \xe2\x80\x93 is an n-dash.

    or possibly:

    \x00T\x00h\x00i\x00s\x00  \x13\x00 \x00i\x00s\x00 \x00a
    \x00n\x00 \x00n\x00-\x00d\x00a\x00s\x00h\x00.

    or:

    This {EN DASH} is an n-dash.

    or:

    x\x9c\x0b\xc9\xc8,V\xa8v\xf5Spq\x0c\xf6\xa8U\x00r\x12
    \xf3\x14\xf2tS\x12\x8b3\xf4\x00\x82^\x08\xf8


    (that last one is the text passed through the zlib compressor), but
    really I'm just making up vaguely conceivable possibilities.

    If you're not willing or able to use a full-blown doc parser, say by
    controlling Word or LibreOffice, the other alternative is to do something
    quick and dirty that might work most of the time. Open a doc file, or
    multiple doc files, in a hex editor and *hopefully* you will be able to
    see chunks of human-readable text where you can identify how en-dashes
    and similar are stored.
     
    Steven D'Aprano, May 13, 2014
    #15
  16. I had to decompress that just to see what "text" you passed through
    zlib, given that zlib is a *byte* compressor :) Turns out it's the
    braced notation given above, encoded as ASCII/UTF-8.

    ChrisA
     
    Chris Angelico, May 13, 2014
    #16
  17. scottcabit

    scottcabit Guest

    I created a .doc file and opened it with UltraEdit in binary (Hex) mode. What I see is that there are two characters, one for ndash and one for mdash, each a single byte long. 0x96 and 0x97.
    So I tried this: fStr = re.sub(b'\0x96',b'-',fStr)

    that did nothing in my file. So I tried this: fStr = re.sub(b'0x97',b'-',fStr)

    which also did nothing.
    So, for fun I also tried to just put these wildcards in my re.findall so I added |Part \0x96|Part \0x97 to no avail.

    Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall or re.sub as hex byte values of 96 and 97 hexadecimal using my current syntax.

    So here's my question...if I want to replace all ndash or mdash values with regular '-' symbols using re.sub, what is the proper syntax to do so?

    Thanks!
     
    scottcabit, May 13, 2014
    #17
  18. scottcabit

    MRAB Guest

    0x96 is a hexadecimal literal for an int. Within a string you need \x96
    (it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).
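
    A short sketch of the difference, using made-up sample bytes with the single-byte dashes reported from the hex editor. Note that `\u` and `\U` escapes only apply to str literals; in a bytes literal only `\x` is available.

```python
import re

# b'\0x96' is NOT one byte: it is a NUL byte followed by the characters 'x96'.
wrong = b"\0x96"
right = b"\x96"
print(len(wrong), len(right))

# Made-up sample using the single-byte dashes seen in the hex editor.
data = b"Part \x96 A, Part \x97 B"

# A character class matches either byte, so both are replaced in one pass.
fixed = re.sub(b"[\x96\x97]", b"-", data)
print(fixed)

# In the cp1252 code page these two bytes are the en dash and the em dash.
print(b"\x96\x97".decode("cp1252") == "\u2013\u2014")
```

    Since these are fixed single bytes, `data.replace(b"\x96", b"-").replace(b"\x97", b"-")` would do the same job without a regex.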
     
    MRAB, May 13, 2014
    #18
  19. scottcabit

    wxjmfauth Guest

    On Tuesday, May 13, 2014 22:26:51 UTC+2, MRAB wrote:
    False


    - Python and the coding of characters is an unbelievable
    mess.
    - Unicode is a joke.
    - I can make Python fail with any valid sequence of
    chars I wish.
    - There is a difference between "look, my code works with
    my chars" and "this code works safely with any chars".

    jmf
     
    wxjmfauth, May 14, 2014
    #19
  20. scottcabit

    scottcabit Guest

    Yes, that was my problem. Figured it out just after posting my last message. Using \x96 works correctly. Thanks!
     
    scottcabit, May 14, 2014
    #20
