Detecting non-printing characters(?)

Peter Jamieson

Each day I use the LWP module to retrieve web pages from a single site and
parse out useful information. This all works, except that occasionally a page
does not parse as expected. I examined the errant page but could see no
visual difference. I also compared its source HTML line by line with that of
a normal page, but found no visible difference at all.

Is it possible that there are non-printing characters present in the errant
pages that are causing my parser script to fail?
If so, how can I detect and remove them?

Thanx for any assistance! Cheers, Peter
 
Jürgen Exner

Peter Jamieson said:
> Each day I use the LWP module to retrieve web pages from a single site and
> parse out useful information. This all works, except that occasionally a page
> does not parse as expected. I examined the errant page but could see no
> visual difference. I also compared its source HTML line by line with that of
> a normal page, but found no visible difference at all.

Did you try a diff between the working and the errant page?

> Is it possible that there are non-printing characters present in the errant
> pages that are causing my parser script to fail?

Don't know, maybe. However, IMO it's more likely that either the page is
not correct HTML (did you check with an HTML validator?) and therefore
the parser chokes, or there is an error in your parser.

> If so, how can I detect and remove them?

Perl's regular expressions support the POSIX [:print:] character class
(written [[:print:]] inside a pattern).
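
To illustrate the [[:print:]] suggestion, here is a minimal sketch that reports and then strips characters that are neither printable nor ordinary whitespace. The sample line and the helper names find_nonprinting and strip_nonprinting are made up for illustration and do not come from the original parser. What [[:print:]] accepts beyond ASCII depends on how the string was decoded, so the sketch sticks to ASCII control characters.

```perl
use strict;
use warnings;

# Hypothetical sample line containing a NUL and a BEL control character.
my $line = "<td>Price:\x00 42\x07</td>";

# Report every character that is neither printable nor ordinary whitespace.
sub find_nonprinting {
    my ($text) = @_;
    my @hits;
    while ($text =~ /([^[:print:]\s])/g) {
        # pos() points just past the match, so subtract 1 for the offset.
        push @hits, sprintf("offset %d: U+%04X", pos($text) - 1, ord($1));
    }
    return @hits;
}

# Return a copy of the text with those characters removed.
sub strip_nonprinting {
    my ($text) = @_;
    (my $clean = $text) =~ s/[^[:print:]\s]//g;
    return $clean;
}

print "$_\n" for find_nonprinting($line);
print strip_nonprinting($line), "\n";
```

Keeping \s in the negated class means tabs and line endings survive the cleanup; drop it if those should go too.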

jue
 
sln

What kind of error? Even wrong encodings should parse. I mean, it's
not dying, is it?

How do you parse it: do you write it to a file and then pass the filehandle
to the parser, or just pass the buffer to it? Is the buffer bytes, or is it
UTF-8 promoted with embedded characters?
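
The bytes-versus-characters question can be made concrete with a small sketch. The sample octets below are invented; with LWP, the raw bytes would typically come from $response->content and the decoded text from $response->decoded_content. The point is that a decoded page can contain a no-break space (U+00A0), which looks like a blank but does not match a literal space in a pattern.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw octets as they might arrive over the wire:
# "caf", then U+00E9 and U+00A0 (no-break space), both encoded as UTF-8.
my $bytes = "caf\xC3\xA9\xC2\xA0";

# Decode the octets into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# The same data has different lengths as bytes and as characters.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Check for the invisible no-break space in the decoded text.
my $has_nbsp = ($chars =~ /\x{A0}/) ? 1 : 0;
print "contains U+00A0 no-break space\n" if $has_nbsp;
```

If the parser's patterns expect a plain space, replacing U+00A0 with " " before parsing is one possible fix, assuming that character turns out to be the culprit.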

Have you examined the buffer with something like this?
    # Print the code point of every character in $line as hex.
    for my $code (map { ord } split //, $line) {
        printf "%02x ", $code;
    }
    print "\n";
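
A variation on the hex dump above, as a sketch only: rather than printing every code point, it leaves printable ASCII alone and turns everything else into a visible <U+XXXX> tag, so an errant line can be compared by eye with a good one. The helper name make_visible and both sample lines are hypothetical.

```perl
use strict;
use warnings;

# Replace every character outside printable ASCII with a visible <U+XXXX> tag.
sub make_visible {
    my ($text) = @_;
    (my $shown = $text) =~ s/([^\x20-\x7E])/sprintf("<U+%04X>", ord($1))/ge;
    return $shown;
}

# Hypothetical pair of lines that look the same in an editor or browser.
my $normal = "<td>Total: 42</td>";
my $errant = "<td>Total:\x{A0}42</td>\r";

print make_visible($normal), "\n";
print make_visible($errant), "\n";
```

Running both versions of a page through such a filter before diffing them makes stray carriage returns, no-break spaces and control characters show up as ordinary text differences.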

How does an errant page compare with a normal page? Are they supposed to
be the same all the time?

-sln
 
Peter Jamieson

Thx Ben, Jürgen and sln for your kind assistance!
I will further investigate your suggestions....cheers, Peter
 
