Detecting non-printing characters(?)

Discussion in 'Perl Misc' started by Peter Jamieson, Aug 20, 2009.

  1. Each day I use the LWP module to retrieve web pages then parse out useful
    information from a single site, all OK except that occasionally a page does
    not parse as expected. I examined the errant page but could see no
    visual difference. I examined the source HTML line by line comparing the
    errant page with a normal page but no visible difference at all.

    Is it possible that there are non-printing characters present in the errant
    pages that are causing my parser script to fail?
    If so how can I detect and remove them?

    Thanx for any assistance! Cheers, Peter
    Peter Jamieson, Aug 20, 2009
    #1

  2. "Peter Jamieson" <> wrote:
    >Each day I use the LWP module to retrieve web pages then parse out useful
    >information from a single site, all OK except that occasionally a page does
    >not parse as expected. I examined the errant page but could see no
    >visual difference. I examined the source HTML line by line comparing the
    >errant page with a normal page but no visible difference at all.


    Did you try a diff between the working and the errant page?
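    If no diff tool is at hand, the same comparison can be sketched in Perl. This is only an illustration: first_diff is a hypothetical helper name, and the two sample strings stand in for the working and the errant page.

```perl
use strict;
use warnings;

# Return the offset of the first differing character, or -1 if the
# strings are identical. (first_diff is a hypothetical helper name.)
sub first_diff {
    my ($good, $bad) = @_;
    my $min = length($good) < length($bad) ? length($good) : length($bad);
    for my $i (0 .. $min - 1) {
        return $i if substr($good, $i, 1) ne substr($bad, $i, 1);
    }
    # One string is a prefix of the other, or they are equal.
    return length($good) == length($bad) ? -1 : $min;
}

# Sample data standing in for the two pages; \xA0 is a no-break space,
# which looks just like an ordinary space on screen.
my $good = "<td>Price: 10</td>";
my $bad  = "<td>Price:\xA010</td>";

my $pos = first_diff($good, $bad);
if ($pos >= 0) {
    printf "first difference at offset %d: 0x%02x vs 0x%02x\n",
        $pos, ord(substr($good, $pos, 1)), ord(substr($bad, $pos, 1));
}
```

    Printing the ordinals rather than the characters themselves is the point here, since the characters may render identically.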

    >Is it possible that there are non-printing characters present in the errant
    >pages that are causing my parser script to fail?


    Don't know, maybe. However, IMO it's more likely that either the page is
    not valid HTML (did you check it with an HTML validator?) and therefore
    the parser chokes, or there is an error in your parser.

    >If so how can I detect and remove them?


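    Perl's regular expressions support the POSIX [:print:] character class.
    It has to be used inside a bracketed class, e.g. /[^[:print:]]/ matches
    a single non-printing character.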
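    A minimal sketch of detecting and removing such characters, assuming the page text is in a scalar ($line and its contents are made up for illustration). Whitespace is excluded from the match so that tabs and newlines survive. What counts as printable outside the ASCII range depends on whether the string has been decoded and which regex modifiers are in effect.

```perl
use strict;
use warnings;

# Sample line standing in for fetched HTML; it hides a NUL and a BEL.
my $line = "<p>Hello\x00 wor\x07ld</p>\n";

# Report each character that is neither printable nor whitespace.
while ($line =~ /([^[:print:]\s])/g) {
    printf "offset %d: 0x%02x\n", pos($line) - 1, ord $1;
}

# Remove them, keeping tabs, newlines and other whitespace intact.
(my $clean = $line) =~ s/[^[:print:]\s]//g;
print $clean;
```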

    jue
    Jürgen Exner, Aug 20, 2009
    #2

  3. Peter Jamieson

    Guest

    On Wed, 19 Aug 2009 23:14:33 GMT, "Peter Jamieson" <> wrote:

    >Each day I use the LWP module to retrieve web pages then parse out useful
    >information from a single site, all OK except that occasionally a page does
    >not parse as expected. I examined the errant page but could see no
    >visual difference. I examined the source HTML line by line comparing the
    >errant page with a normal page but no visible difference at all.
    >
    >Is it possible that there are non-printing characters present in the errant
    >pages that are causing my parser script to fail?
    >If so how can I detect and remove them?
    >
    >Thanx for any assistance! Cheers, Peter
    >


    What kind of error? Even wrong encodings should parse. I mean, it's
    not die'ing, is it?

    How do you parse it: write it to a file and then pass the filehandle to the
    parser, or just pass the buffer to it? Is the buffer bytes, or a string
    promoted to utf8 with embedded characters?
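    The bytes-versus-characters distinction can be checked directly. A small sketch, assuming UTF-8 input; the sample bytes are made up. With LWP, $response->content returns the raw bytes, while $response->decoded_content attempts the decoding for you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample raw bytes standing in for an undecoded response body:
# "caf" followed by e-acute encoded as UTF-8 (the two bytes C3 A9).
my $bytes = "caf\xC3\xA9";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? "on" : "off";
printf "chars: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? "on" : "off";
```

    If a parser expects one form and receives the other, multi-byte characters can end up looking like runs of unrelated single characters.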

    Have you examined the buffer with something like this?
    # Print the ordinal of every character in $line as hex
    for my $ord (map { ord $_ } split //, $line) {
        printf "%x ", $ord;
    }

    How does an errant page equal a normal page? Are they supposed to
    be the same all the time?

    -sln
    , Aug 20, 2009
    #3
  4. Thx Ben, Jürgen and sln for your kind assistance!
    I will further investigate your suggestions....cheers, Peter
    Peter Jamieson, Aug 21, 2009
    #4
