RSS feeds and HTML special characters

eurosnob · Jul 11, 2006

If this isn't the right place to post this, please point me in the
right direction?

I'm a relatively casual Perl programmer trying to implement an RSS feed
into my personal site. I've got it working using a slightly modified
example from the O'Reilly book (content syndication with RSS), but for
one annoying caveat...

If I load the feed in, say, Firefox, a title might look like this:

<title>This Artist is Good - Frank D'Armata</title>

When I use "View Source," the title string is actually:

<title>This Artist is Good - Frank D&#8217;Armata</title>

However, when I go to use the string from within Perl, I get a warning,
"Wide character in print", and gibberish printed where the special
character sits:

This Artist is Good - Frank Dâ€™Armata

(That's a lowercase 'a' with an accent, the Euro symbol, and the
trademark symbol, between D and Armata.)

I'm sure there's a relatively simple fix, but I'm kind of lost at this
point... Help?!

Thanks!

Ben Morrow · Jul 11, 2006

Firstly, you need to be using perl 5.8.

Next, we need to know how you are getting hold of these strings. Please
post a minimal complete program that shows what you are doing.

Basically, your data is coming in (from wherever you're getting it from)
in the UTF8 encoding, and you haven't told Perl that. Perl assumes data
is in ISO8859-1 unless you tell it otherwise (for hysterical raisins),
so you're getting gibberish. If you show us how you're reading your data
we can tell you how to tell Perl it's in UTF8.

Ben
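
A minimal sketch of the point above, assuming the incoming bytes really are UTF-8. The byte string is a hand-written stand-in for fetched feed data, not the poster's actual feed. It uses the core Encode module to turn bytes into characters, and shows that reading the same three bytes as Windows-1252 gives the a-circumflex, Euro and trademark signs described in the question:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for raw bytes from the feed:
# "D", the UTF-8 encoding of U+2019 (E2 80 99), then "Armata".
my $bytes = "D\xE2\x80\x99Armata";

# Tell Perl the bytes are UTF-8: one character, U+2019, sits after the D.
my $text = decode('UTF-8', $bytes);

# Misreading the same bytes as Windows-1252 yields three characters there.
my $misread = decode('cp1252', $bytes);

printf "decoded: %d characters, misread: %d characters\n",
    length $text, length $misread;
```

The only difference between the two results is which encoding Perl was told to assume for the input.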

John Bokma · Jul 11, 2006

You might want to study:
http://ahinea.com/en/tech/perl-unicode-struggle.html

Mumia W. · Jul 12, 2006

[...]
<title>This Artist is Good - Frank D&#8217;Armata</title>

However, when I go to use the string from within Perl, I get a warning,
"Wide character in print", and gibberish printed where the special
character sits:

This Artist is Good - Frank Dâ€™Armata
[...]

You probably need "use encoding 'utf8';" at the top of
your script. You need to output utf8 data (0x8217 is a
unicode character), but STDOUT wasn't warned to look
out for unicode (multi-byte, wide) characters.

eurosnob · Jul 12, 2006

Ben said:
Firstly, you need to be using perl 5.8.

"This is perl, v5.8.3 built for i386-linux-thread-multi"

Next, we need to know how you are getting hold of these strings. Please
post a minimal complete program that shows what you are doing.

use strict;
use warnings;
use LWP::Simple;
use XML::Simple;

my $feed = get("http://the/url/of/the/feed");
defined $feed or die "Could not fetch the feed\n";

# At this point, $feed contains:
# ... <title>This Artist is Good - Frank D&#8217;Armata</title> ...

my $parser = XML::Simple->new();
my $rss = $parser->XMLin($feed);

# At this point, $rss->{'channel'}->{'item'}->[x]->{'title'} contains:
# This Artist is Good - Frank D<gibberish>Armata

So it looks like the XML::Simple routine(s) are decoding the HTML
entity from the feed. Since the output is an HTML page, I'd like
to leave the text HTML-encoded (in this example, &#8217;), as that's
what it should properly be for output.

John Bokma · Jul 12, 2006

How did you find this out? By printing? Notice that the printing step
might do the <gibberish> thing (which probably is the case).

eurosnob said:
So it looks like the XML::Simple routine(s) are decoding the HTML
entity from the feed. Since the output is an HTML page, I'd like
to leave the text HTML-encoded (in this example, &#8217;), as that's
what it should properly be for output.

Easiest solution is to output your HTML as utf8.

If you use A. Sinan Unur's solution, as far as I know you still have to
specify somewhere that the HTML document should be rendered as having
utf8, since &#8217; is just a fancy way of writing an utf character.

Ben Morrow · Jul 12, 2006

Quoth John Bokma said:
How did you find this out? By printing? Notice that the printing step
might do the <gibberish> thing (which probably is the case).

It will, unless you tell Perl what encoding you want for output. This is
what the 'wide character in print' warning means: you need to specify an
encoding on the output filehandle. (This warning could be a little
clearer, IMHO; though it's difficult to reconcile all of Perl's forward-
and backward-compatibility goals.)
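
To make the warning concrete, here is a small sketch of specifying an encoding on an output filehandle. It writes to an in-memory filehandle (an assumption made only so the bytes can be inspected); in a real script the same layer would normally be put on STDOUT with binmode. The title string is a made-up example:

```perl
use strict;
use warnings;

# Example string with one character above 0xFF (U+2019).
my $title = "Frank D\x{2019}Armata";

# Open an in-memory filehandle with an explicit UTF-8 encoding layer.
# Because the layer says how to encode, no "Wide character" warning is raised.
my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die "open: $!";
print {$fh} $title;
close $fh;

printf "%d characters in, %d bytes out\n", length $title, length $out;
```

For STDOUT the equivalent would be binmode STDOUT, ':encoding(UTF-8)'; before the first print.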

Easiest solution is to output your HTML as utf8.

You can just print things and hope, but it's much better (safer, more
flexible and you don't get warnings) to do it right.

1. Decide what encoding you want to use. I generally use us-ascii, 'cos
I *know* it's safe; you may want to stick to ISO8859-1 as that's the
default so it's probably what you've been using.

2. Tell the browser what you've chosen. The right answer is to set the
charset in the HTTP Content-type header; there are other ways if that's
difficult (there have been threads on this recently here).

3. Tell Perl what you want, and tell it to use HTML entities for
characters that don't exist in your chosen encoding:

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(iso8859-1)';

Substitute the appropriate filehandle and encoding in the binmode call.
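
The three steps above can also be sketched without touching the filehandle, by calling Encode's encode() directly with the same fallback. This is only an illustration under assumptions: the title is a hard-coded example and the header line assumes a CGI-style script.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Example title containing U+2019, which does not exist in ISO-8859-1.
my $title = "This Artist is Good - Frank D\x{2019}Armata";

# encode() with a CHECK argument may modify its input, so work on a copy.
my $copy = $title;

# Characters outside ISO-8859-1 come out as decimal numeric references.
my $html = encode('iso-8859-1', $copy, FB_HTMLCREF);

# Step 2: declare the same encoding to the browser.
print "Content-Type: text/html; charset=ISO-8859-1\n\n";
print "<title>$html</title>\n";
```

The result keeps the apostrophe as a numeric reference, which is what the original question asked for.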

If you use A. Sinan Unur's solution, as far as I know you still have to
specify somewhere that the HTML document should be rendered as having
utf8, since &#8217; is just a fancy way of writing an utf character.

The encoding of the HTML document doesn't affect how numeric entities
are interpreted. They always refer to Unicode characters (note: not UTF8
bytes. &#xC3;&#xA9; does not mean é, even though those bytes
represent e-acute in UTF8).

Ben
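
A short sketch of that last point. The helper expand_ncr is hypothetical, written here only for illustration (a real script would more likely use an HTML or XML parser); it expands numeric character references into the Unicode characters they name:

```perl
use strict;
use warnings;

# Hypothetical helper: expand hexadecimal and decimal numeric character
# references into the characters with those Unicode numbers.
sub expand_ncr {
    my ($s) = @_;
    $s =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;
    $s =~ s/&#([0-9]+);/chr($1)/ge;
    return $s;
}

my $e_acute   = expand_ncr('&#xE9;');        # one character, U+00E9
my $two_chars = expand_ncr('&#xC3;&#xA9;');  # two characters, U+00C3 and U+00A9

printf "%d character vs %d characters\n", length $e_acute, length $two_chars;
```

Writing the UTF-8 byte values as two references gives two separate characters, not e-acute.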

Alan J. Flavell · Jul 12, 2006

If you use A. Sinan Unur's solution, as far as I know you still have
to specify somewhere that the HTML document should be rendered as
having utf8, since &#8217; is just a fancy way of writing an utf
character.

This is precisely the conceptual error which existed in all versions
of Netscape 4.*. Fortunately, I think we can forget that old thing
for most practical purposes today.

It's an important principle (see RFC2070 for the original
codification, or the HTML/4.01 specification for a more current version
of the story) that the "document character set" of HTML is always, in
principle, Unicode, no matter what character encoding it uses (e.g. the
numerical values that appear in &#number; notation don't change, not
even if you re-code the document in EBCDIC). And of course the more
common situations, where the document is encoded in one of the
iso-8859-something codes, or even in us-ascii, are precisely the times
when you *do* need &#number; notation to represent the occasional
character which is outside of the chosen character encoding.

Don't confuse "character set" with "character encoding" - not even if
the old MIME terminology uses the attribute name "charset" to specify
the character encoding - they are two fundamentally different things,
and already were in SGML, although it's only more recently that the
distinction has become so widely significant - it's a shame that the
MIME specifications introduced this confusion, but too late to repair
it now I guess.

So much for principles. In practical terms, if you are involved in
handling forms submissions then it's mostly a good idea to choose
utf-8 encoding anyway, and this will incidentally be helpful for any
old NN4.* versions which are still shambling around. But, aside from
those practical issues, the theory is clear, as above, the display
(if the data is correct!) will be correct in any halfways decent
current browser - but you might need to be a bit more resourceful in
handling user input via forms, as commented in the Perl encoding
documentation:

|| it is beyond the power of words to describe the
|| way HTML browsers encode non-ASCII form data.

Hope this clears things up a bit.

Alan J. Flavell · Jul 12, 2006

(0x8217 is a unicode character),

So it would be, but it's not &#8217;

The values in &#number; notation are decimal (8217 = 0x2019, "right
single quotation mark").

To use hexadecimal numerical character references, code e.g. &#x2019;

but STDOUT wasn't warned to look
out for unicode (multi-byte, wide) characters.

Don't confuse the real characters with their numerical character
references. In HTML or any XML-based language they are different
ways of representing the same thing, but &-notation can be used even
when the character encoding is us-ascii, whereas the "real" character
itself obviously needs a correct character encoding to be in effect,
and then your remark about STDOUT is pertinent.

h t h
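
The decimal and hexadecimal forms can be checked with a couple of lines of Perl. This is just an illustrative sketch; the variable names are made up:

```perl
use strict;
use warnings;

my $char = "\x{2019}";    # RIGHT SINGLE QUOTATION MARK

# The same character number, written as a decimal and as a hexadecimal
# numeric character reference.
my $dec = sprintf '&#%d;',  ord $char;
my $hex = sprintf '&#x%X;', ord $char;

# 0x8217, by contrast, is a different number altogether.
my $other = 0x8217;

print "$dec and $hex name the same character; 0x8217 is $other in decimal\n";
```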

Dr.Ruud · Jul 12, 2006

Ben Morrow schreef:

Firstly, you need to be using perl 5.8.

Please make that 5.8.1+

John Bokma · Jul 12, 2006

Ben Morrow said:
Quoth John Bokma:
[..]

Easiest solution is to output your HTML as utf8.

You can just print things and hope, but it's much better (safer, more
flexible and you don't get warnings) to do it right.

1. Decide what encoding you want to use. I generally use us-ascii,
'cos I *know* it's safe; you may want to stick to ISO8859-1 as that's
the default so it's probably what you've been using.

The problem is: what is &#8217; in us-ascii or ISO8859-1? What happens
when you tell a browser: this is ISO8859-1 and next you ask it to decode
&#8217;? Does the browser always render in UTF mode?

John Bokma · Jul 12, 2006

Alan J. Flavell said:
Hope this clears things up a bit.

Thanks it does (i hope). So it's safe to assume that browsers handle HTML
internally as utf8 no matter how it's offered by the webserver? And hence
using &#number; with number outside the actual encoding of the HTML file
itself is perfectly legal?

Alan J. Flavell · Jul 12, 2006

The problem is: what is &#8217; in us-ascii or ISO8859-1?

No, that isn't the problem. The problem here, I'm afraid, is that you
/still/ show evidence that you don't understand this aspect of
character representation in HTML.

The characters ampersand, hash, 8 2 1 7 semi-colon *are*, after all,
all us-ascii characters, each and every one of them. Why do you
think that's a problem?

What happens when you tell a browser: this is ISO8859-1 and next you
ask it to decode &#8217;

RFC2070 was published in what - 1997? It tells you what to do.

Does the browser always render in UTF mode?

No. The browser is required to *behave as if* it understands Unicode.
How it does that internally is entirely its own affair (black box
model). Internally, it might work in EBCDIC DBCS, for all that the
web specifications care.

Or to take another in-principle shot, it *could* perfectly well look
up &#8217; in its tables and find that (amongst other things) in
Windows-1252 encoding it's 0x92.

h t h
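
That table lookup can be reproduced with Perl's core Encode module, as a quick sketch (nothing here says how any browser actually does it):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode U+2019 (decimal 8217) into Windows-1252 and look at the byte.
my $byte = encode('cp1252', "\x{2019}");

printf "U+2019 in Windows-1252 is byte 0x%02X\n", ord $byte;
```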

Alan J. Flavell · Jul 12, 2006

Thanks it does (i hope).

Sorry, we seem to have overlapped in posting.

So it's safe to assume that browsers handle HTML
internally as utf8 no matter how it's offered by the webserver?

It's safe (now that NN4 is practically out of the way) to assume that
they will *understand* all three[1] representations of characters, no
matter what "character encoding scheme" (that's the current technical
term for it) is used for transmitting the data.

As I said in the other f'up - it's entirely up to the browser
developer just how they implement that, inside of their black box, as
long as it works as intended when viewed from the outside. In
practice, most browsers will work internally in some representation of
Unicode, but it's not a requirement.

[1] Those three representations being:

1. an encoded character itself, if the character in question can be
represented in the encoding scheme that's in use;

2. a numerical character reference (&#number; or &#xhex;)
referring to the *Unicode* character number (irrespective of what
character encoding scheme is in use);

3. a character entity (&name;), if one is defined in HTML/4.

And hence using &#number; with number outside the actual encoding of
the HTML file itself is perfectly legal?

Yes, absolutely, and it has been since at least RFC2070.

Unfortunately, the authors of NN4 don't seem to have understood
RFC2070. Good riddance to NN4.

What would be the point of defining &#number; notation if it did
nothing more than to duplicate the characters which could be encoded
in the encoding scheme used? There'd be little sense in that!

p.s As you may have noticed, this is one of my "special subjects". I
hope I haven't said anything to offend. But if it helps to shock some
readers out of a confidently- but mistakenly-held belief, it may have
done some good.

all the best.

Ben Morrow · Jul 12, 2006

Quoth John Bokma said:
Thanks it does (i hope). So it's safe to assume that browsers handle HTML
internally as utf8 no matter how it's offered by the webserver?

^^^^
*Unicode*, not UTF8. They are different: and when I say 'Unicode', I
don't mean UCS2 or UTF16 or whatever it is Java and Microsoft mean when
they say it.

Unicode is a big old list of characters, with a number for each one.

UTF8 is a means of representing a sequence of Unicode characters as a
sequence of 8-bit bytes, with certain desirable properties for those
characters with Unicode indices less than 128.

The difference is crucial: for instance, {SG,HT,X}ML use Unicode
character numbers directly in their escape mechanism, whereas for URIs
you have to first encode your Unicode characters into UTF8 bytes and
then escape those.
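
A small sketch of that difference, using U+2019 as the example character. The escaping below is hand-rolled for illustration only; a real script would more likely use a module for URI escaping:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{2019}";

# Markup: escape the Unicode character number directly.
my $markup = sprintf '&#%d;', ord $char;

# URI: first encode the character to UTF-8 bytes, then percent-escape
# each of those bytes.
my $utf8_bytes = encode('UTF-8', $char);
my $uri = join '', map { sprintf '%%%02X', ord($_) } split //, $utf8_bytes;

print "markup: $markup  uri: $uri\n";
```

One character gives one numeric reference in markup but three escaped bytes in a URI.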

And hence
using &#number; with number outside the actual encoding of the HTML file
itself is perfectly legal?

Yes. That, after all, is the whole point: to represent characters you
can't put directly in the document.

Ben

Eric R. Meyers · Jul 12, 2006

Hi Euro,

Re: HTML special characters in RSS feed. I answered you over in
perl.beginners, but I'm bringing this over here to where your action is.

You need to find the XML::Simple OPTIONS section called "NumericEscape"
which discusses an XMLout ability to output the high characters as "numeric
entities." I think that you need to use this "NumericEscape" option before
printing.

XML::Simple also mentions a "STRICT MODE" to automatically catch common
errors.

You might also want to read 'perldoc perluniintro' to learn how perl handles
characters internally.

I hope this helps.

Eric

John Bokma · Jul 12, 2006

Alan J. Flavell said:
p.s As you may have noticed, this is one of my "special subjects". I
hope I haven't said anything to offend. But if it helps to shock some
readers out of a confidently- but mistakenly-held belief, it may have
done some good.

It worked for me

Thanks.

Alan J. Flavell · Jul 12, 2006

^^^^
*Unicode*, not UTF8.

Good call; but also, I stand by what I said before, that the internal
workings are at the discretion of the implementer, as long as the
behaviour as seen from outside is correct.

They are different:

Indeed they are conceptually of different categories.

and when I say 'Unicode', I don't mean UCS2 or UTF16 or whatever it
is Java and Microsoft mean when they say it.

It's very annoying that MS can present, on one and the same menu, one
entry that says "utf-8" and another that says "Unicode". It's as
logical as a menu that asks you to choose between oranges and fruit.

To understand this aspect of Unicode better, it's useful to read
chapter 2 of the Unicode specification,
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

In particular, 2.5 Code Points, 2.6 Character Encoding Forms, and
2.7 Character Encoding Schemes.

Unicode is a big old list of characters, with a number for each one.

Yes - what in the jargon is called a "coded character set". See the
Unicode glossary http://www.unicode.org/versions/Unicode4.0.0/b1.pdf
for details.

UTF8 is a means of representing a sequence of Unicode characters as
a sequence of 8-bit bytes,

Indeed, and it is one of the "character encoding schemes" of Unicode,
of which there are currently officially seven (some of which have
obsoleted the encodings previously called ucs2 and ucs4).

The difference is crucial: for instance, {SG,HT,X}ML use Unicode
character numbers directly in their escape mechanism, whereas for
URIs you have to first encode your Unicode characters into UTF8
bytes and then escape those.

Good point.

And just to hammer this in once again: that unfortunately-named
MIME parameter called "charset" specifies what we are now meant to
call a "character encoding scheme".

Yes. That, after all, is the whole point: to represent characters
you can't put directly in the document.

Good stuff.

I was meaning to include a URL for further reading on this topic as it
relates to HTML (and, for the most part, also for XML-based markups).
In fact, the HTML/4.01 specification has quite a useful piece on this.
So I give you: http://www.w3.org/TR/REC-html40/charset.html

Hoping the audience haven't all gone to sleep yet,

best regards

John Bokma · Jul 13, 2006

Ben Morrow said:
^^^^
*Unicode*, not UTF8

Yup, my mistake.

John Bokma · Jul 13, 2006

Alan J. Flavell said:
Hoping the audience haven't all gone to sleep yet,

Still awake, and thanks Alan and Ben (and others I might have missed).
