Detecting non-printing characters(?)

Peter Jamieson · Aug 20, 2009

Each day I use the LWP module to retrieve web pages then parse out useful
information from a single site, all OK except that occasionally a page does
not parse as expected. I examined the errant page but could see no
visual difference. I examined the source HTML line by line comparing the
errant page with a normal page but no visible difference at all.

Is it possible that there are non-printing characters present in the errant
pages that are causing my parser script to fail?
If so how can I detect and remove them?

Thanx for any assistance! Cheers, Peter

Jürgen Exner · Aug 20, 2009

Peter Jamieson said:
Each day I use the LWP module to retrieve web pages then parse out useful
information from a single site, all OK except that occasionally a page does
not parse as expected. I examined the errant page but could see no
visual difference. I examined the source HTML line by line comparing the
errant page with a normal page but no visible difference at all.

Did you try a diff between the working and the errant page?

Is it possible that there are non-printing characters present in the errant
pages that are causing my parser script to fail?

Don't know, maybe. However, IMO it's more likely that either the page is
not correct HTML (did you check with an HTML validator?) and therefore
the parser chokes, or there is an error in your parser.

If so how can I detect and remove them?

Perl's regular expressions support the POSIX [:print:] character class.

jue
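
A minimal sketch of how the [:print:] class suggested above could be used to both report and strip non-printing characters. The helper names find_nonprinting and strip_nonprinting and the sample string are made up for illustration, and treating tab, CR and LF as harmless is an assumption. Note that whether characters above 0x7F count as printable can depend on whether the page has been decoded to characters or is still raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: list each non-printing character with its offset.
# Tab, CR and LF are treated as harmless here; that is an assumption.
sub find_nonprinting {
    my ($text) = @_;
    my @hits;
    while ($text =~ /([^[:print:]\t\r\n])/g) {
        # pos() is the offset just past the one-character match
        push @hits, sprintf "offset %d: U+%04X", pos($text) - 1, ord $1;
    }
    return @hits;
}

# Hypothetical helper: return a copy with those characters removed.
sub strip_nonprinting {
    my ($text) = @_;
    (my $clean = $text) =~ s/[^[:print:]\t\r\n]//g;
    return $clean;
}

# Made-up sample line containing a NUL and a BEL character
my $sample = "<td>Price\x{0}</td>\x{7}\n";
print "$_\n" for find_nonprinting($sample);
print strip_nonprinting($sample);
```

Reporting before stripping is probably the more useful order: once the offending code points are known, it is easier to decide whether removing them is the right fix or whether the parser should handle them.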

sln · Aug 20, 2009

Peter Jamieson said:
Each day I use the LWP module to retrieve web pages then parse out useful
information from a single site, all OK except that occasionally a page does
not parse as expected. I examined the errant page but could see no
visual difference. I examined the source HTML line by line comparing the
errant page with a normal page but no visible difference at all.

Is it possible that there are non-printing characters present in the errant
pages that are causing my parser script to fail?
If so how can I detect and remove them?

Thanx for any assistance! Cheers, Peter

What kind of error? Even wrong encodings should parse. I mean, it's
not die'ing, is it?

How do you parse it: write it to a file and then pass the file handle to
the parser, or just pass the buffer to it? Is the buffer bytes, or
utf8-promoted with embedded chars?

Have you examined the buffer with something like this?

    # Print the code point of every character in $line as hex
    for my $code (map { ord } split //, $line) {
        printf "%x ", $code;
    }

How does an errant page equal a normal page? Are they supposed to
be the same all the time?

-sln
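
A possible variation on the hex dump above, as a sketch only: instead of printing every code point, it leaves printable ASCII alone and escapes everything else, so an odd character stands out in context. The helper name show_hidden and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep printable ASCII as-is and show every other
# character as \x{..}, so odd characters stand out in the output.
sub show_hidden {
    my ($line) = @_;
    return join '', map {
        my $code = ord;    # code point of the current character ($_)
        ($code >= 0x20 && $code <= 0x7E) ? $_ : sprintf('\x{%X}', $code);
    } split //, $line;
}

# Made-up sample: a non-breaking space (0xA0) and a CRLF line ending
print show_hidden("Total:\x{A0}42\r\n"), "\n";
# prints: Total:\x{A0}42\x{D}\x{A}
```

Running both the errant page and a normal page through something like this and then comparing the results should make differences visible that an ordinary viewer hides.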

Peter Jamieson · Aug 21, 2009

Thx Ben, Jürgen and sln for your kind assistance!
I will further investigate your suggestions....cheers, Peter
