Replacing utf-8 characters

Mike · Oct 5, 2005

Hi, I am using Python to scrape web pages and I do not have a problem
unless I run into a site that is utf-8. It seems & is changed to &amp;
when the site is utf-8.

If I try to replace it with .replace('&amp;','&') it for some reason
does not replace it.

For example: http://today.reuters.co.uk/news/default.aspx

The url in the page looks like this

http://today.reuters.co.uk/news/New...423599_RTRUKOC_0_UK-BRITAIN-CONSERVATIVES.xml

However when I pull it into python the URL ends up looking like this
(notice the &amp; instead of just & in the URL)

http://today.reuters.co.uk/news/new...11_RTRUKOC_0_UK-CONSTRUCTION-BPB-STGOBAIN.xml

Any ideas?

Richard Brodie · Oct 5, 2005

Mike said:
However when I pull it into python the URL ends up looking like this
(notice the &amp; instead of just & in the URL)

Any ideas?

Some code would be helpful: the "&amp;" is in the page source to start
with (which is as it ought to be). What are you using to parse the HTML?

Mike · Oct 5, 2005

For example this is what I am trying to do that is not working.

The contents of link is the reuters web page, containing

"/news/newsArticle.aspx?type=businessNews&amp;storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml"

link = link.replace('&amp;','&')

But if I now view the contents link it shows it the same as when it
was assigned.

Steve Holden · Oct 5, 2005

Mike said:
For example this is what I am trying to do that is not working.

The contents of link is the reuters web page, containing

"/news/newsArticle.aspx?type=businessNews&amp;storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml"

link = link.replace('&amp;','&')

But if I now view the contents link it shows it the same as when it
was assigned.

You must be doing *something* wrong:

regards
Steve

Mike · Oct 5, 2005

Steve said:
You must be doing *something* wrong:

"/news/newsArticle.aspx?type=businessNews&amp;storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml"

'/news/newsArticle.aspx?type=businessNews&storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml'

regards
Steve

What you and I typed was ascii. The value of link came from importing
that utf-8 web page into that variable. That is why I think it is not
working. But not sure what the solution is.

Klaus Alexander Seistrup · Oct 5, 2005

Mike said:
Hi, I am using Python to scrape web pages and I do not have problem
unless I run into a site that is utf-8. It seems & is changed to
&amp; when the site is utf-8.

[...]

Any ideas?

How about using the universal feedparser from feedparser.org to fetch
and parse the RSS from Reuters? That's what I do and it works like a
charm.

#v+
.... print rss.entries[0][what]
.... print
....
http://today.reuters.com/news/newsa...Z_01_DIT561620_RTRUKOC_0_US-COURT-SUICIDE.xml

Top court seems closely divided on suicide law

During arguments, the justices sharply questioned both sides on whether then-Attorney General John Ashcroft had the power under federal law in 2001 to bar distribution of controlled drugs to assist suicides, regardless of state law.
#v-

Cheers,

Mike · Oct 5, 2005

In playing with this I found link.replace does work but when I use

link.replace('&amp;','&')

it replaces it with &amp; instead of just &. link.replace is working
for me since if I changed the second option from & to something else I
see the change.

So it seems link.replace() function reads whether the first option is
utf-8 and converts the second option automatically to utf-8? How do I
prevent that?

Thanks again.

David Bolen · Oct 5, 2005

Mike said:
What you and I typed was ascii. The value of link came from importing
that utf-8 web page into that variable. That is why I think it is not
working. But not sure what the solution is.

Are you sure you're asking what you think you are asking? Both the
ampersand character (&) and the characters within the ampersand entity
character reference (&amp;) are ASCII. As it turns out they are also
legal UTF-8, but I would not call a web page UTF-8 just because I saw
the sequence of characters "&amp;" within the stream. (That's not to
say it isn't UTF-8 encoded, just that I don't think that's the issue.)

I'm just guessing, but you do realize that legal HTML should quote all
uses of the ampersand character with an entity reference, since the
ampersand itself is reserved for use in such references? This
includes URL references, whether inside attributes or in the body of
the text.

So when you see something in a browser in a web page that shows a URL
that includes "&" such as for separating parameters, internally that
page is (or should be) stored with "&amp;" for that character. Thus
if you retrieve the page in code, that's what you'll find. It's the
browser processing that entity reference that turns it back into the
"&" for presentation.

Note that whether or not the page in question is encoded as UTF-8 is a
completely distinct question - whatever encoding the page is in would
be used to encode the characters in the entity reference (namely
"&amp;").

I'm assuming that in scraping the page you want to reverse the process
(e.g., perform the interpretation of the entity references much as a
browser would) before using that URL for other purposes. If so, the
string replacement you tried should handle the replacement just fine,
at least within the value of the URL as managed by your code.

You then mention it being the same when you view the contents of the
link, which isn't quite clear to me, but if that means retrieving
another copy of the link as embedded in an HTML page then yes, it'll
get quoted again since, as in the original page, you have to quote an
ampersand as an entity reference within HTML.

What did you mean by "view the contents link"?

-- David
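
To illustrate the round trip described above, here is a minimal sketch that decodes the entity reference the way a browser would and then quotes it again for embedding in HTML. It assumes a modern Python 3 and its standard `html` module, which is not what the thread's participants were using in 2005; the variable names are illustrative only.

```python
import html

# The link as it appears in the HTML source, with the ampersand entity-quoted.
raw_href = "/news/newsArticle.aspx?type=businessNews&amp;storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml"

# Decode entity references before using the URL for a request.
url = html.unescape(raw_href)
print(url)

# Embedding the URL in HTML again quotes the ampersand once more,
# which is why a re-rendered page shows "&amp;" in its source.
requoted = html.escape(url, quote=True)
print(requoted == raw_href)
```

Unlike a single `.replace('&amp;', '&')`, `html.unescape` also handles other named and numeric character references, which may matter for links that contain more than ampersands.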

Martin v. Löwis · Oct 5, 2005

Mike said:
So it seems link.replace() function reads whether the first option is
utf-8 and converts the second option automatically to utf-8? How do I
prevent that?

Not sure what an option is... if you are talking about parameters,
rest assured that <string>.replace does not know or care whether any
of its parameters is encoded in UTF-8. Also not sure where you got
the impression UTF-8 could have anything to do with this.

Regards,
Martin
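
A small sketch of that point, assuming Python 3 and made-up sample data rather than the actual Reuters page: the replacement behaves the same whether it is applied to text decoded from UTF-8 or to the raw bytes, because "&amp;" consists only of ASCII characters.

```python
# Sample bytes as they might arrive from a UTF-8 encoded page (not fetched from anywhere).
page_bytes = "caf\u00e9?type=businessNews&amp;storyID=1".encode("utf-8")

# Decode to text; this is the only step where the page encoding matters.
text = page_bytes.decode("utf-8")
fixed = text.replace("&amp;", "&")
print(fixed)

# The same replacement on the undecoded bytes gives the same result,
# since ASCII characters keep their byte values in UTF-8.
fixed_bytes = page_bytes.replace(b"&amp;", b"&")
print(fixed_bytes.decode("utf-8") == fixed)
```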
