Problem Converting Word to UTF8 Text File

Discussion in 'Python' started by patrick.waldo@gmail.com, Oct 21, 2007.

  1. Guest

    Hi all,

    I'm trying to copy a bunch of Microsoft Word documents that have
    Unicode characters into UTF-8 text files. Everything works fine at
    the beginning. The Word documents get converted and new UTF-8 text
    files with the same name get created. And then I try to copy the data
    and I keep on getting "TypeError: coercing to Unicode: need string or
    buffer, instance found". I'm probably copying the word document
    wrong. What can I do?

    Thanks,
    Patrick


    import os, codecs, glob, shutil, win32com.client
    from win32com.client import Dispatch

    input = 'C:\\text_samples\\source\\*.doc'
    output_dir = 'C:\\text_samples\\source\\output'
    FileFormat=win32com.client.constants.wdFormatText

    for doc in glob.glob(input):
        doc_copy = shutil.copy(doc,output_dir)
        WordApp = Dispatch("Word.Application")
        WordApp.Visible = 1
        WordApp.Documents.Open(doc)
        WordApp.ActiveDocument.SaveAs(doc, FileFormat)
        WordApp.ActiveDocument.Close()
        WordApp.Quit()


    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc = codecs.open(txt_doc,'w','utf-8')
        shutil.copyfile(doc,txt_doc)
    , Oct 21, 2007
    #1

  2. On Sun, 21 Oct 2007 13:35:43 -0300, <> wrote:

    > Hi all,
    >
    > I'm trying to copy a bunch of microsoft word documents that have
    > unicode characters into utf-8 text files. Everything works fine at
    > the beginning. The word documents get converted and new utf-8 text
    > files with the same name get created. And then I try to copy the data
    > and I keep on getting "TypeError: coercing to Unicode: need string or
    > buffer, instance found". I'm probably copying the word document
    > wrong. What can I do?


    Always remember to provide the full traceback.
    Where do you get the error? In the last line: shutil.copyfile?
    If the file already contains the text in utf-8, and you just want to make
    a copy, use shutil.copy as before.
    (or, why not tell Word to save the file using the .txt extension in the
    first place?)

    > for doc in glob.glob(input):
    >     txt_split = os.path.splitext(doc)
    >     txt_doc = txt_split[0] + '.txt'
    >     txt_doc = codecs.open(txt_doc,'w','utf-8')
    >     shutil.copyfile(doc,txt_doc)


    copyfile expects path names as arguments, not a file-like object
    such as the one codecs.open returns.
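
To illustrate the point about path names, here is a minimal sketch written for Python 3. The helper name `copy_as_txt` and the sample file names are hypothetical, not from the original post. It builds the destination as a plain string and hands both strings to `shutil.copyfile`, never opening the destination first.

```python
import os
import shutil
import tempfile


def copy_as_txt(src_path, dst_dir):
    """Copy src_path into dst_dir, swapping its extension for .txt.

    Both arguments given to shutil.copyfile are plain path strings;
    no file object is opened here. (Hypothetical helper for illustration.)
    """
    base = os.path.splitext(os.path.basename(src_path))[0]
    dst_path = os.path.join(dst_dir, base + ".txt")
    shutil.copyfile(src_path, dst_path)
    return dst_path


if __name__ == "__main__":
    # Throwaway files so the sketch can run anywhere.
    work_dir = tempfile.mkdtemp()
    src = os.path.join(work_dir, "sample.doc")
    with open(src, "wb") as f:
        f.write(b"text that Word already saved")
    out_dir = os.path.join(work_dir, "output")
    os.mkdir(out_dir)
    print(copy_as_txt(src, out_dir))
```

Note that this only duplicates bytes. It does not change the encoding of the file, so it helps only if the source already holds the text in the encoding you want.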

    --
    Gabriel Genellina
    Gabriel Genellina, Oct 21, 2007
    #2

  3. Guest

    Indeed, the shutil.copyfile(doc,txt_doc) was causing the problem for
    the reason you stated. So, I changed it to this:

    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc_dir = os.path.join(input_dir,txt_doc)
        doc_dir = os.path.join(input_dir,doc)
        shutil.copy(doc_dir,txt_doc_dir)


    However, I still cannot read the unicode from the Word file. If I take
    out the first for-statement, I get a bunch of garbled text, which
    isn't helpful. I would save them all manually, but I want to figure
    out how to do it in Python, since I'm just beginning.

    My intuition says the problem is with

    FileFormat=win32com.client.constants.wdFormatText

    because it converts fine to a text file, just not a utf-8 text file.
    How can I modify this or is there another way to code this type of
    file conversion from *.doc to *.txt with unicode characters?

    Thanks

    , Oct 21, 2007
    #3
  4. On Sun, 21 Oct 2007 15:32:57 -0300, <> wrote:

    > However, I still cannot read the unicode from the Word file. If I take
    > out the first for-statement, I get a bunch of garbled text, which
    > isn't helpful. I would save them all manually, but I want to figure
    > out how to do it in Python, since I'm just beginning.
    >
    > My intuition says the problem is with
    >
    > FileFormat=win32com.client.constants.wdFormatText
    >
    > because it converts fine to a text file, just not a utf-8 text file.
    > How can I modify this or is there another way to code this type of
    > file conversion from *.doc to *.txt with unicode characters?


    Ah! I thought you were getting the right file format.
    I can't test it now, but this KB document
    http://support.microsoft.com/kb/209186/en-us
    suggests you should use wdFormatUnicodeText when saving the document.
    What the MS docs call "Unicode" when dealing with files is, in general,
    UTF-16.
    In this case, if you want to convert to utf8, the sequence would be:

    f = open(original_filename, "rb")
    udata = f.read().decode("utf16")
    f.close()
    f = open(new_filename, "wb")
    f.write(udata.encode("utf8"))
    f.close()
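
The two steps above (save from Word as "Unicode text", then re-encode) can be sketched together as follows, for Python 3. This is an untested outline under stated assumptions. The function names are hypothetical. The value 7 is hard-coded for `wdFormatUnicodeText` because `win32com.client.constants` is only populated when generated type-library support is available. The Word call needs Windows, Word, and pywin32, so its import is kept inside the function.

```python
import io

# Assumption: numeric value of Word's wdFormatUnicodeText constant.
WD_FORMAT_UNICODE_TEXT = 7


def save_doc_as_unicode_text(doc_path, txt_path):
    """Ask Word to save doc_path as 'Unicode text' (UTF-16). Windows only."""
    from win32com.client import Dispatch  # pywin32, imported lazily

    word = Dispatch("Word.Application")
    try:
        document = word.Documents.Open(doc_path)
        document.SaveAs(txt_path, WD_FORMAT_UNICODE_TEXT)
        document.Close()
    finally:
        word.Quit()  # always release the Word instance


def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8, keeping line endings as-is."""
    # newline="" switches off newline translation in both directions.
    with io.open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Starting Word once and reusing it for every document, rather than once per file as in the original loop, would likely be faster, but that is a design choice rather than a requirement.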

    --
    Gabriel Genellina
    Gabriel Genellina, Oct 22, 2007
    #4
  5. Guest

    That KB document was really helpful, but the problem still isn't
    solved. What's weird now is that some Unicode characters
    become è in some odd conversion. However, I noticed when I try to
    open the Word documents after I run the first for statement that Word
    gives me a window that says File Conversion and asks me how I want to
    encode it. None of the Unicode options retain the characters. Then I
    looked some more and found it has a Central European option, both ISO
    and Windows, which works perfectly since the documents I am looking at
    are in Czech. Then I try to save the document in Word and it says if
    I try to save it as a text file I will lose the formatting! So I guess
    I'm back at the start.
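
One possible reading of the è symptom, offered as an assumption rather than a diagnosis: if the text files are in a Central European code page (Windows-1250 or ISO 8859-2, matching the options that worked in the File Conversion dialog), then the byte 0xE8 is the Czech letter č, and reading it as Western European (Windows-1252) shows è instead. The sketch below, for Python 3, shows that one byte under both readings and a hypothetical helper, `recode_to_utf8`, that decodes with a named code page before writing UTF-8.

```python
# One byte as it might appear in a Central European plain-text export.
SAMPLE_BYTE = b"\xe8"

as_western = SAMPLE_BYTE.decode("cp1252")  # Western European reading
as_central = SAMPLE_BYTE.decode("cp1250")  # Central European reading


def recode_to_utf8(src_path, dst_path, src_encoding="cp1250"):
    """Decode src_path using src_encoding and write it back out as UTF-8.

    The default "cp1250" is an assumption based on the documents being
    in Czech; pass a different codec name if the files use another one.
    """
    with open(src_path, "rb") as src:
        text = src.read().decode(src_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))


if __name__ == "__main__":
    print(repr(as_western), repr(as_central))
```

If the files really are in such a code page, the UTF-16 decoding step from the earlier reply would not apply to them; which codec is right depends on the format Word was told to save in.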

    Judging from some internet searches, I'm not the only one having this
    problem. For some reason Word can only save as .doc even though .txt
    can support the utf8 format with all these characters.

    Any ideas?



    , Oct 22, 2007
    #5