Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

Discussion in 'Perl Misc' started by Barry Millman, Nov 27, 2005.

  1. Hi:

    I am using Perl 5 (I believe both machines are using ActivePerl 5) on
    two machines with the same data files. One machine is Win 2000, the
    other is Win XP. The files are MS Word 2000 documents e-mailed
    (manually) from the Win 2000 machine to the XP machine.

    The program searches the MS Word Files (both created with MS Word 2000)
    for the word HYPERLINK. The format for the HYPERLINK that I am
    searching for in the document is:

    HYPERLINK "mydoc.doc"

    (I checked this on the XP machine in Notepad and it is OK.)

    PROBLEM: The program works on the Windows 2000 machine, but does not
    find the files on the Win XP machine.

    The code that is not finding the text on the Win XP machine (same as
    the Win 2000 machine, which does find the text) is:

    ----------- start actual code segment --------------------
    while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple matches

    {
    $fndxx = $1;

    $fndxx =~ s/\"//; # remove leading quote
    $fndxx =~ s/\s+//; # remove leading spaces
    $dir="C:\\IGINproducts\\UserDocuments\\";

    $fullname = ($dir . $fndxx);
    $date_string = "Cannot Find";
    if (-e $fullname) { $date_string = ctime(stat($dir .
    $fndxx)->mtime); } #last update date of that file
    print(OUTFILE $fndxx,",",$date_string,", in: ",basename($file),
    "\n") ;
    $matches += 1; # count matches

    } #end while HYPERLINK
    ----------- end actual code segment --------------------

    The output for a found HYPERLINK should look like this (it does on the
    Win 2000 machine):

    mydoc.doc,(date of last update), in: otherdoc.doc

    On Win XP, the program cannot even find the word HYPERLINK (if I modify
    the code to just search for that). The directories are valid, I can
    have the program print a list of all files as it processes them.

    If I try this with a test program (the string to test is in the program
    itself ) it works fine on the XP machine.

    There are no encryption issues, nor any file or directory problems.

    I would really appreciate any comments or suggestions about what I am
    doing wrong.

    Thanks,

    Barry Millman
    Barry Millman, Nov 27, 2005
    #1

  2. Just some added info:

    The search works fine if I save the MS Word files as RTF.

    Also I wanted to mention that I have this around the hyperlink search code:
    #open the file
    open(INFILE,"< $file") or die "Couldn't open file ",$file;


    while(<INFILE>)
    {
    # the hyperlink code I posted earlier
    } # end while infile

    Barry



    Barry Millman, Nov 27, 2005
    #2

  3. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    Barry Millman <> wrote:

    > The format for the HYPERLINK that I am
    > searching for in the document is:
    >
    > HYPERLINK "mydoc.doc"


    > PROBLEM: The program works on the Windows 2000 machine, but does not
    > find the files on the Win Xp machine.



    I don't think I can help with that part, but the code is too hokey
    to just let it pass...


    > ----------- start actual code segment --------------------
    > while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple
    > matches



    The //m does not do anything, so why is it there?

    It changes the meaning of ^ and $, but you don't use those
    anchors in your pattern, so you don't need //m.

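    .{1,80}?

    is the non-greedy form of

    .{1,80}

    (the trailing ? makes the quantifier lazy, it does not make it
    optional), so it still requires at least one character.

    Do you really want to match '  .doc' ?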


    We can't help you analyse why the match is failing because we
    need two things to do that: the pattern and the string that
    the pattern is to be matched against.

    We have only one of those two things...


    >
    > {
    > $fndxx = $1;
    >
    > $fndxx =~ s/\"//; # remove leading quote
    > $fndxx =~ s/\s+//; # remove leading spaces



    Why capture them only to strip them out of the captured string?

    Why not just leave them out of the capture in the first place?


    while (/HYPERLINK\s+"(.{1,78}\.doc)"/gi)

    or, probably better:

    while (/HYPERLINK\s+"([^"]{1,78}\.doc)"/gi)


    > $dir="C:\\IGINproducts\\UserDocuments\\";
    >



    Use single quotes unless you want to make use of one of the two
    extra things that double quotes give you (interpolation
    and backslash escapes).

    Use forward slashes instead of silly slashes unless the path
    is going to be fed to the "command interpreter".


    $dir='C:/IGINproducts/UserDocuments/';


    > print(OUTFILE $fndxx,",",$date_string,", in: ",basename($file),
    > "\n") ;



    Gak!

    Use double quoted strings to concatenate your output string:

    print(OUTFILE "$fndxx,$date_string, in: ", basename($file), "\n") ;


    > If I try this with a test program (the string to test is in the program
    > itself ) it works fine on the XP machine.



    If you had shown us your complete test program, then we could
    have helped you debug it.

    But you didn't, so we can't. (hint)


    > I would really appreciate any comments or suggestions about what I am
    > doing wrong.



    Not posting a short and complete program that we can run that
    illustrates your problem.

    Have you seen the Posting Guidelines that are posted here frequently?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Nov 27, 2005
    #3
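
A minimal sketch of the "leave them out of the capture" idea from the post above, assuming the link text is stored as plain single-byte characters. The sample string and the sub names are invented for illustration. The closing quote sits outside the parentheses so that it does not end up in $1, and the second sub checks that {1,80}? is a lazy quantifier that still needs at least one character.

```perl
use strict;
use warnings;

# Hypothetical sample text; real input would be the contents of a .doc file.
my $text = qq{HYPERLINK "mydoc.doc" and HYPERLINK  "other file.doc"};

# Capture only the file name: whitespace and quotes stay outside the parentheses.
sub linked_docs {
    my ($data) = @_;
    my @names;
    while ($data =~ /HYPERLINK\s+"([^"]{1,78}\.doc)"/gi) {
        push @names, $1;
    }
    return @names;
}

# {1,80}? is the lazy form of {1,80}: it matches 'x.doc' but not a bare '.doc'.
sub lazy_needs_one_char {
    return 'x.doc' =~ /^.{1,80}?\.doc$/ && '.doc' !~ /^.{1,80}?\.doc$/;
}

print "$_\n" for linked_docs($text);
```

With the capture written this way, no follow-up s/// clean-up of $1 is needed.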
  4. Hi:

    I tried your suggestions, but no luck. I did move that directory
    assignment outside the loop. Stupid of me!

    There is something really odd in MS Word storage in Win XP. If I save
    the document to RTF it finds the stuff in the RTF file.

    I looked at both the MS Word and RTF files with the XVI32 Hex editor.
    They both showed the same hex values for the string HYPERLINK.

    Barry




    Purl Gurl wrote:

    > Barry Millman wrote:
    >
    > (snipped)
    >
    >
    >>The code that is not finding the text on the Win XP machine (same as
    >>the Win 2000 machine which does find the test)is:

    >
    >
    > (snipped)
    >
    > Move this line above and outside your while loop:
    >
    >
    >> $dir="C:\\IGINproducts\\UserDocuments\\";

    >
    >
    > The reason for moving that line above and outside your while loop
    > is you are creating a new value for that variable with each loop
    > iteration. That is inefficient because that variable has a "fixed"
    > value; set the value above and outside your while loop.
    >
    > You do not need to use double left hand slashes for your
    > file path but doing so causes no harm. You can use single
    > right hand slashes for your path, for an open(FILE) syntax
    > as shown below.
    >
    > However, despite claims of one of the "experts" in this group,
    > you must use double lefthand slashes for some syntax,
    > certainly for some system command syntax for Win32.
    >
    > For a file open, you do not need double slashes but it
    > is perfectly ok to use them.
    >
    > Uppercase letters in a file path are not needed for Win32
    > but are ok to use; no problem.
    >
    > Your code produces this directory / file name path:
    >
    > C:\IGINproducts\UserDocuments\mydoc.doc
    >
    > That "appears" to be a valid path. Check to be sure it is valid.
    > Double check to be sure there are not spaces in a directory
    > name, such as, User Documents which is typical.
    >
    > You do not show your syntax for your OUTFILE open for write.
    > Be sure to use error checking to verify that file opens for write.
    >
    > Run this test code,
    >
    > #!perl
    >
    > open (TEST, "c:/iginproducts/userdocuments/mydoc.doc") || die "File Open Failed: $!";
    >
    > while (<TEST>)
    > {
    > if (index ($_, "HYPERLINK") > -1)
    > { print "HYPERLINK found at line $.\n"; }
    > }
    >
    > close (TEST) || die "File Close Failed $!";
    >
    >
    > Clearly I cannot test that code not having your file to test.
    > However, my syntax is ok,
    >
    > C:\APACHE\USERS\TEST>perl -c test.pl
    > test.pl syntax OK
    >
    > Running that test code will determine if your file path and file name
    > are valid, and will determine if HYPERLINK is actually in your file.
    >
    > Be cautious. If your HYPERLINK word spans lines, index will not
    > find that specific instance.
    >
    > Often, reducing your code to most simple version possible will find
    > errors for you, quickly.
    >
    > Purl Gurl
    Barry Millman, Nov 27, 2005
    #4
  5. OK. Sorry about the bad code. However, let's reduce this to the
    minimum, removing the search for the text. All we will do is read
    chunks of data, with this program:

    -------------------- start of program --------------------------
    open (TEST, "c:\\PERL\\Barry\\Starthere.rtf") || die "File Open Failed: $!";

    while (<TEST>)
    {

    print( "Chunk length: ", length($_),"\n");
    $chunks += 1;
    }

    close (TEST) || die "File Close Failed $!";

    print( $chunks, " Chunks\n");
    -------------------- end of program --------------------------

    Now, if I run this using Starthere.rtf, I get 1544 Chunks and they have
    all sorts of different lengths. Some of the first chunks are of length:
    103, 218, 250,1,230,63, 255.

    However, if I run this using Starthere.doc, I get only ONE chunk, and it
    is of length 6 bytes.

    If I examine the MS Word file using a Hex editor, I get the following
    values for bytes 5 through 7 (calling the first byte as zero):
    B1 1A E1

    The 1A is the seventh byte of the file.

    The PERL program (above) seems to stop at this character.

    So forgetting about the search, does this yield any clues?

    Thank you,

    Barry




    Barry Millman, Nov 27, 2005
    #5
  6. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    Purl Gurl <> wrote:
    > Tad McClellan wrote:
    >
    > (snipped)
    >
    >> I don't think I can help with that part, but the code is too hokey
    >> to just let it pass...

    >
    > Have you helped the author resolve his problem?



    Have you?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Nov 27, 2005
    #6
  7. Barry Millman wrote:

    > Hi:
    >
    > I am using Perl 5 (I believe both machines are using ActivePERL 5)
    > on two machines with the same data files. One machine is Win 2000 the
    > other is Win XP. The files are MS Word 2000 documents e-mailed
    > (manually) from the Win 2000 machine to the XP machine.
    >
    > The program searches the MS Word Files (both created with MS Word
    > 2000) for the word HYPERLINK. The format for the HYPERLINK that I am
    > searching for in the document is:
    >
    > HYPERLINK "mydoc.doc"
    >
    > (I checked this on the XP machine in Notepad and it is OK.)
    >


    Note that MS Word documents are stored in a proprietary binary
    gibberish format. To assume that a given word in a document will
    actually always be stored in an ASCII string in the .doc file is
    assuming too much. For example, perhaps it is stored in Unicode?
    And maybe newer Notepad versions understand enough to present
    Unicode strings? Try looking at your files with an editor that
    you *know* won't munge the contents. I suggest VIM.

    It is a mystery why a document would get changed while emailing
    it from one system to another. Or did you perhaps open the
    document with Word after emailing it, and then save it? You
    don't say. Is it the same version of Word? And what email
    system are you using on each of the computers? Does the same
    thing happen if you zip the file, email the zipped version, and
    unzip it on the other system?

    > PROBLEM: The program works on the Windows 2000 machine, but does not
    > find the files on the Win Xp machine.
    >
    > The code that is not finding the text on the Win XP machine (same as
    > the Win 2000 machine which does find the test)is:
    >
    > ----------- start actual code segment --------------------
    > while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple
    > matches


    As others have mentioned, the /m modifier does nothing, and the
    .{1,80}? would be better as .{0,80}.

    >
    > {
    > $fndxx = $1;
    >
    > $fndxx =~ s/\"//; # remove leading quote


    Your comment doesn't match the regex -- it will remove the first
    quote, not a leading quote.

    > $fndxx =~ s/\s+//; # remove leading spaces


    Again, this will remove the first run of whitespace from the
    string, not leading whitespace.

    > $dir="C:\\IGINproducts\\UserDocuments\\";
    >
    > $fullname = ($dir . $fndxx);
    > $date_string = "Cannot Find";
    > if (-e $fullname) { $date_string = ctime(stat($dir .
    > $fndxx)->mtime); } #last update date of that file
    > print(OUTFILE $fndxx,",",$date_string,", in:
    > ",basename($file), "\n") ;
    > $matches += 1; # count matches
    >
    > } #end while HYPERLINK
    > ----------- end actual code segment --------------------
    >
    > The output for a found HYPERLINK should look like this (it does on the
    > Win 2000 machine):
    >
    > mydoc.doc,(date of last update), in: otherdoc.doc
    >
    > On Win XP, the program cannot even find the word HYPERLINK (if I modify
    > the code to just search for that). The directories are valid, I can
    > have the program print a list of all files as it processes them.
    >
    > If I try this with a test program (the string to test is in the program
    > itself ) it works fine on the XP machine.
    >
    > There are no encryption issues, nor any file or directory problems.


    How exactly do you know this? Using a piece of garbage like
    Notepad won't definitively tell you this. I would trust Perl
    much further than Notepad.
    ...
    > Barry Millman

    --
    Bob Walton
    Email: http://bwalton.com/cgi-bin/emailbob.pl
    Bob Walton, Nov 27, 2005
    #7
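
If the Unicode guess in the post above applies, the word would sit in the file as UTF-16LE bytes (each letter followed by a zero byte) and a plain /HYPERLINK/ would miss it. The following is only a sketch under that assumption: the sub name and the sample bytes are invented, and it says nothing about how a particular document is actually stored.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Report which encodings of $word occur in the raw bytes $data.
sub encodings_found {
    my ($data, $word) = @_;
    my %needle = (
        'single-byte' => $word,
        'UTF-16LE'    => encode('UTF-16LE', $word),
    );
    return grep { index($data, $needle{$_}) >= 0 } sort keys %needle;
}

# Invented sample bytes standing in for a slice of a .doc file.
my $sample = "\x00\x01" . encode('UTF-16LE', 'HYPERLINK "mydoc.doc"');
print join(', ', encodings_found($sample, 'HYPERLINK')), "\n";
```

Running this kind of check on the bytes of the real file (read with binmode) would show which form, if either, is present.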
  8. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    Purl Gurl wrote:
    > Purl Gurl wrote:


    Isn't talking to yourself the first sign?


    >
    > I have looked over Word Perfect and MS Word but not RTF formats, on a
    > 9.x machine, a 2K machine and an XP machine.


    Somewhat irrelevant because the OP wrote " The files are MS Word 2000
    documents e-mailed (manually) from the Win 2000 machine to the XP
    machine."


    <half-baked story about WordPerfect deleted>


    > A hex editor will display plaintext format, if in a binary file. I use
    > Hex Workshop v. 2.2x for this. Very old program but works with
    > excellence. You could simply open your Word document with a
    > hex editor, then search for http: from there.


    Pay attention Kira, the OP already wrote "I looked at both the MS Word
    and RTF files with the XVI32 Hex editor. They both showed the same hex
    values for the string HYPERLINK."


    It's so sad to see an old rusty V8 that's only running on three
    cylinders.
    foo bar baz qux, Nov 27, 2005
    #8
  9. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    Purl Gurl wrote:
    > Tad McClellan wrote:
    >
    > > Purl Gurl wrote:
    > > > Tad McClellan wrote:

    >
    > (snipped)
    >
    > > >> I don't think I can help with that part, but the code is too hokey
    > > >> to just let it pass...

    >
    > > > Have you helped the author resolve his problem?

    >
    > > Have you?

    >
    > I have. You have not.
    >


    The OP wrote about MS Word and you entertained him with a pointless and
    inconclusive story about an unrelated product: WordPerfect. After he
    wrote about using a hex editor you advised him to use a hex editor.
    foo bar baz qux, Nov 27, 2005
    #9
  10. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    Purl Gurl wrote:
    > Barry Millman wrote:
    >
    > (snipped)
    >
    > > If I examine the MS Word file using a Hex editor, I get the following
    > > values for bytes 5 through 7 (calling the first byte as zero):
    > > B1 1A E1

    >
    > > The 1A is the seventh byte of the file.

    >
    > > The PERL program (above) seems to stop at this character.

    >
    > Possible false end of file (eof) signal


    "Possible"? Don't be such an unassertive wimp Kira, it is well known
    that control-Z (hex 1A) *is* the end of file marker for text files on
    MS-DOS and hence (for compatibility reasons) on Win32.

    Perl uses the OS for file I/O and it is inevitable that Windows stops
    reading your binary file prematurely unless you tell it to use binary
    mode.
    foo bar baz qux, Nov 27, 2005
    #10
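
The effect described above can be checked with a throwaway file. This sketch writes eight bytes ending in the B1 1A E1 from the hex dump (the first five bytes are the usual compound-file signature, assumed here for illustration) plus some invented text, then reads the file back through a handle in binary mode. The sub name is hypothetical. With binmode the byte count matches on any platform; on Windows, dropping the binmode line is what would reproduce the short read.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Eight header bytes containing 0x1A (Ctrl-Z), followed by invented text.
my $bytes = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" . qq{HYPERLINK "mydoc.doc"\n};

my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} $bytes;
close($tmp_fh) or die "File Close Failed: $!";

# Count the bytes readable from $path when the handle is in binary mode.
sub bytes_read_binary {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "File Open Failed: $!";
    binmode($fh);    # no Ctrl-Z or CR/LF special handling
    my $total = 0;
    while (my $chunk = <$fh>) {
        $total += length($chunk);
    }
    close($fh) or die "File Close Failed: $!";
    return $total;
}

print bytes_read_binary($tmp_name), " of ", length($bytes), " bytes read\n";
```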
  11. Well Purl Gurl you are the BEST!!!!!

    The binmode solved the problem.

    Thank you all for your help. Please don't fight!

    It still seems strange that the same file, created by the same word
    processor (Word 2000) would behave differently on two different versions
    of the same OS.

    Thanks to Bill Gates and his team for a wonderful morning.

    All the best,

    Barry



    Purl Gurl wrote:

    > Barry Millman wrote:
    >
    > (snipped)
    >
    >
    >>If I examine the MS Word file using a Hex editor, I get the following
    >>values for bytes 5 through 7 (calling the first byte as zero):
    >>B1 1A E1

    >
    >
    >>The 1A is the seventh byte of the file.

    >
    >
    >>The PERL program (above) seems to stop at this character.

    >
    >
    > Possible false end of file (eof) signal or a general collapse
    > of the read filehandle function because of illegal characters
    > for the specific read mode, ASCII for what you show.
    >
    > Give binmode a try.
    >
    > binmode (STDOUT);
    >
    > open (TEST ....
    >
    > binmode (TEST);
    >
    > My sincere suggestion is you pursue your binary files for fun, only.
    >
    > Should you need to accomplish your task, soon, use your RTF format
    > or convert your Word documents to plaintext.
    >
    > Working with binary files via Perl, is very challenging. Perl core is simply
    > not designed to handle binary data. Perl core is designed to open filehandles
    > for various functions, tell a system to read or write in a specific mode, but
    > perl core is not involved in the actual transfer of data, ASCII or binary. Perl
    > is designed to manipulate "plaintext" data, not binary.
    >
    > You can be successful in reading and writing binary data, but most likely will
    > not be successful using Perl to manipulate binary data, such as substr,
    > index, regex and other functions; Perl is not binary capable.
    >
    > I have not looked at CPAN for binary handling modules. Have a look. You might
    > find a module which can be adapted for your needs.
    >
    > If not, I suggest you stop mucking around with binary data and get your task done. =)
    >
    > Purl Gurl
    Barry Millman, Nov 27, 2005
    #11
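
For reference, a sketch of what the search might look like with binmode applied. It is not the poster's final program, which was never shown. It assumes the link text is stored as single-byte characters; the sub name and the test bytes are invented. The file is slurped in one read so that a match cannot be split across "lines", and Perl's regex engine works on the raw bytes without complaint.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the file names referenced after HYPERLINK in the file at $path.
sub hyperlinks_in {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Couldn't open file $path: $!";
    binmode($in);                          # read raw bytes, Ctrl-Z included
    my $data = do { local $/; <$in> };     # slurp the whole file at once
    close($in) or die "File Close Failed: $!";
    my @found;
    while ($data =~ /HYPERLINK\s+"([^"]{1,78}\.doc)"/gi) {
        push @found, $1;
    }
    return @found;
}

# Usage with a throwaway file of invented bytes, including two 0x1A bytes.
my ($fh, $name) = tempfile(UNLINK => 1);
binmode($fh);
print {$fh} "\xB1\x1A\xE1 HYPERLINK \"mydoc.doc\" \x1A HYPERLINK\n\"other.doc\"";
close($fh) or die "File Close Failed: $!";
print "$_\n" for hyperlinks_in($name);
```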
  12. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    Barry Millman <> wrote:

    > OK. Sorry about the bad code.



    Please do not send stealth Cc's.

    That is considered a rude practice, so I'm moving on to
    someone else's post...


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Nov 27, 2005
    #12
  13. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    Tad McClellan <> wrote:

    > Barry Millman <> wrote:
    >
    >> OK. Sorry about the bad code.

    >
    >
    > Please do not send stealth Cc's.
    >
    > That is considered a rude practice, so I'm moving on to
    > someone else's post...


    Well, he seems to have found a good match (see elsethread) ;-)

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Nov 27, 2005
    #13
  14. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    On Sun, 27 Nov 2005 18:22:05 GMT, Bob Walton
    <> wrote:

    >Barry Millman wrote:
    >
    >> Hi:
    >>
    >> I am using Perl 5 (I believe both machines are using ActivePERL 5)
    >> on two machines with the same data files. One machine is Win 2000 the
    >> other is Win XP. The files are MS Word 2000 documents e-mailed
    >> (manually) from the Win 2000 machine to the XP machine.
    >>
    >> The program searches the MS Word Files (both created with MS Word
    >> 2000) for the word HYPERLINK. The format for the HYPERLINK that I am
    >> searching for in the document is:
    >>
    >> HYPERLINK "mydoc.doc"
    > >
    > > (I checked this on the XP machine in Notepad and it is OK.)
    > >

    >
    >Note that MS Word documents are stored in a proprietary binary
    >gibberish format. To assume that a given word in a document will
    >actually always be stored in an ASCII string in the .doc file is
    >assuming too much. For example, perhaps it is stored in Unicode?
    > And maybe newer Notepad versions understand enough to present
    >Unicode strings? Try looking at your files with an editor that
    >you *know* won't munge the contents. I suggest VIM.
    >
    >It is a mystery why a document would get changed while emailing
    >it from one system to another. Or did you perhaps open the
    >document with Word after emailing it, and then save it? You
    >don't say. Is it the same version of Word? And what email
    >system are you using on each of the computers? Does the same
    >thing happen if you zip the file, email the zipped version, and
    >unzip it on the other system?
    >

    [--snip--]

    Yeah, "proprietary binary", that's a phrase you don't hear much.
    Comparing md5's or even checksums should resolve transmission
    or open/save issues between versions/machines. Email? Maybe the AV
    firewall did some elective stripping somewhere en route.
    You wasted your time on this, you should have tried to code
    to discern the "difference" between saves. In reality
    that's what you are trying to do. Just because you can "see" some
    discernible text sometimes doesn't mean it's a text stream.
    You can type out a ".exe" file too. What are the odds it reads
    everything to the eof sequence? Pretty good. What are the odds
    it's got thousands of them in the file? Pretty good. Why?
    You can't reliably code for strings in a binary stream unless
    you already know the format and read the entire thing into
    waiting structures. By that time you're past stream processing.
    Why do you think xml was invented, or yenc or uucp? Control
    codes munge up stream processing. The binary file data are
    sometimes control codes when read by consoles, editors and the like.

    There is no solution to the OP's problem, there is none.
    The approach is wrong. He made what engineers call a "conceptual error".
    "It worked once" is not proof of concept! Given binary structured data
    files, it is absolutely, positively, impossible to treat it as
    streaming text in ANY search capacity, unless controls can be
    discerned from data at the search core api routines, and that's
    not what it does. You can't monitor or change fast enough the api
    concept of control codes. The attempt is a bridge to nowhere.
    It's a good bridge but the traffic drives off the end.

    >>
    >> The output for a found HYPERLINK should look like this (it does on the
    >> Win 2000 machine):
    >>
    >> mydoc.doc,(date of last update), in: otherdoc.doc
    >>
    >> On Win XP, the program cannot even find the word HYPERLINK (if I modify
    >> the code to just search for that). The directories are valid, I can
    >> have the program print a list of all files as it processes them.
    >>
    >> If I try this with a test program (the string to test is in the program
    >> itself ) it works fine on the XP machine.
    >>
    >> There are no encryption issues, nor any file or directory problems.

    >
    >How exactly do you know this? Using a piece of garbage like
    >Notepad won't definitively tell you this. I would trust Perl
    >much further than Notepad.
    >...
    >> Barry Millman
    robic0, Nov 28, 2005
    #14
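
The checksum comparison suggested above could be done along these lines, using the core Digest::MD5 module. This is a sketch: the sub name is hypothetical. Running it on both machines and comparing the digests would show whether the e-mailed file arrived byte-for-byte identical.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 digest of a file, read in binary mode so no byte is skipped or translated.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Couldn't open $path: $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh) or die "File Close Failed: $!";
    return $digest;
}

# Pass one or more file names on the command line.
print md5_of_file($_), "  $_\n" for @ARGV;
```

Equal digests on both machines would point away from the transfer and toward how the file is being read.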
  15. Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

    I'd add that MS Office files are compound files, that's "file system in
    a file" objects. You need an advanced hex editor like FlexHex to see
    the structure of a compound file:
    http://www.flexhex.com/docs/help/objects/compound_files.phtml

    I doubt that Perl supports structured storage, so there may be a
    problem locating the main data stream.
    Grod, Nov 28, 2005
    #15
  16. Barry Millman wrote:
    > Well Purl Gurl you are the BEST!!!!!
    >
    > The binmode solved the problem.
    >


    Even a stopped clock tells the right time twice a day.
    Mark Clements, Nov 28, 2005
    #16
  17. Mark Clements wrote:

    >
    > Even a stopped clock tells the right time twice a day.


    And even tells us when it does so : exactly at the time where it is stopped.
    fda, Nov 30, 2005
    #17