Re: Html character entity conversion

Discussion in 'Python' started by Anthra Norell, Aug 1, 2006.

  1. Pak (or Andrei, whichever is your first name),

    My proposal below:


    ----- Original Message -----
    From: <>
    Newsgroups: comp.lang.python
    To: <>
    Sent: Sunday, July 30, 2006 8:52 PM
    Subject: Re: Html character entity conversion


    > danielx wrote:
    > > wrote:
    > > > Here is my script:
    > > >
    > > > from mechanize import *
    > > > from BeautifulSoup import *
    > > > import StringIO
    > > > b = Browser()
    > > > f = b.open("http://www.translate.ru/text.asp?lang=ru")
    > > > b.select_form(nr=0)
    > > > b["source"] = "hello python"
    > > > html = b.submit().get_data()
    > > > soup = BeautifulSoup(html)
    > > > print soup.find("span", id = "r_text").string
    > > >
    > > > OUTPUT:
    > > > &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
    > > > &#1087;&#1080;&#1090;&#1086;&#1085;
    > > > ----------
    > > > In russian it looks like:
    > > > "привет питон"
    > > >
    > > > How can I translate this using standard Python libraries??
    > > >
    > > > --
    > > > Pak Andrei, http://paxoblog.blogspot.com, icq://97449800

    > >


    I've been proposing solutions of late using a stream editor I recently wrote, realizing each time how well it works in a variety of
    different situations. I can only hope I am not beginning to get on people's nerves (Here he comes again with his damn thing!).
    I base the following on proposals others have made so far, because I haven't used Unicode and know little about it. If
    nothing else, I do think this is a rather elegant way to translate the ampersand escapes to Unicode strings. Having to read them
    through an 'eval', though, doesn't seem to be the ultimate solution. I couldn't assign a unicode string to a variable so that it
    would print text as Claudio proposed.


    Here is my htm example:

    >>> htm = StringIO.StringIO ('''

    <htm>
    <!-- Examen -->
    <head><title>Deuxi&egrave;me question</title></head>
    <body bgcolor="#beb4a0" text="#000082" etc. >
    <b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;<br>
    </body>
    </htm> ''')

    And here is my SE hack:

    >>> import SE # Available at the Cheese Shop
    >>> Ampersand_Filter = SE.SE (' <EAT> "~&#[0-9]+;~==(10)" ')
    >>> for line in htm:
            line = line.rstrip ('\n')
            ampersand_codes = Ampersand_Filter (line)
            # A list of the ampersand codes found in the current line
            if ampersand_codes:
                # From it we build the substitution definitions for the current line
                substitutions = ''
                for code in ampersand_codes.split ('\n')[:-1]:
                    substitutions = '%s%s=\\u%04x\n' % (substitutions, code, int (code [2:-1]))
                # And make a custom Editor just for the current line
                Line_Unicoder = SE.SE (substitutions)
                unicode_line = Line_Unicoder (line)
                # eval turns the \uXXXX escapes into a unicode object
                print eval ('u"%s"' % unicode_line)
            else:
                print line

    <htm>
    <!-- Examen -->
    <head><title>Deuxi&egrave;me question</title></head>
    <body bgcolor="#beb4a0" text="#000082" etc. >
    <b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;привет питон<br>
    </body>
    </htm>

    This is a textbook example of dynamic substitutions. Typically SE compiles static substitution lists. But with 2**16 (?) Unicode
    code points, building a static list would be absurd, if possible at all. So we dynamically make custom substitutions for each line after
    extracting the ampersand escapes that may be there.
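
Since the original question asked for standard libraries, here is a minimal sketch of the same per-match idea using only `re`. It assumes Python 3 (where `chr` covers all of Unicode) rather than the Python 2 of the session above, and the names `NUMERIC_REF` and `decode_numeric_refs` are made up for illustration. A callback passed to `re.sub` computes each replacement on the fly, so no substitution list and no `eval` are needed.

```python
import re

# Matches decimal numeric character references such as &#1087;
NUMERIC_REF = re.compile(r"&#([0-9]+);")

def decode_numeric_refs(text):
    """Replace every &#NNNN; with the character whose code point is NNNN."""
    # The lambda is called once per match, so each replacement is
    # computed from the digits that were actually found.
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)

line = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
        "&#1087;&#1080;&#1090;&#1086;&#1085;<br>")
decoded = decode_numeric_refs(line)
print(decoded)
```

Named escapes such as `&egrave;` are left alone by this sketch, and a number outside the Unicode range would raise `ValueError`.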

    Next we would like to fix the regular ascii ampersand escapes and also strip the tags. That is a simple question of preprocessing
    the file.

    >>> Legibilizer = SE.SE ('htm2iso.se "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')


    'htm2iso.se' is a substitutions definition file that defines the standard ascii ampersands to characters. It is included in the SE
    package. You can name as many definition files as you want. In a definition string the name of a file is equivalent to its contents.

    >>> htm.seek (0)
    >>> htm_no_tags = Legibilizer (htm.read ())
    >>> for line in htm_no_tags.split ('\n'):
            if line.strip () == '': continue
            ampersand_codes = Ampersand_Filter (line)
            ... (same as above)

    Deuxième question
    L'élève doit lire et traduire: привет питон
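
For comparison, a rough standard-library sketch of the whole preprocessing step, assuming Python 3.4 or later, where `html.unescape` decodes both named and numeric escapes. The regular expressions are a simplification and not a real HTML parser, and `legible_text` is a hypothetical helper name. One visible difference: the output above shows a plain apostrophe for `&acute;`, whereas `html.unescape` yields the acute accent character U+00B4.

```python
import html
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # comments may span lines
TAG = re.compile(r"<[^>]*>")                    # anything between < and >

def legible_text(markup):
    """Strip comments and tags, decode escapes, return the non-empty lines."""
    no_comments = COMMENT.sub("", markup)       # comments first, then tags
    no_tags = TAG.sub("", no_comments)
    text = html.unescape(no_tags)               # &egrave; and &#1087; alike
    return [ln.strip() for ln in text.splitlines() if ln.strip()]

sample = """<htm>
<!-- Examen -->
<head><title>Deuxi&egrave;me question</title></head>
<body>
<b>L&acute;&eacute;l&egrave;ve doit lire:</b>&nbsp;&#1087;&#1080;&#1090;&#1086;&#1085;<br>
</body>
</htm>"""

lines = legible_text(sample)
for ln in lines:
    print(ln)
```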


    Whether this serves your purpose I don't really know. How you can use it other than read it in the IDLE window, I don't know
    either. I tried to copy it out, but it doesn't survive the operation and the paste has question marks or squares in the place of the
    Russian letters.
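
One possible way around the copy-and-paste problem, offered as an assumption about its cause rather than a diagnosis, is to write the decoded text to a file with an explicit encoding instead of copying it from the window. The sketch below assumes Python 3; the file name is hypothetical and a temporary directory is used only to keep the example self-contained.

```python
import io
import os
import tempfile

text = "привет питон"

# Hypothetical output location, chosen only for this example
path = os.path.join(tempfile.mkdtemp(), "translation.txt")

# Naming the encoding explicitly avoids depending on the platform default
with io.open(path, "w", encoding="utf-8") as out:
    out.write(text + "\n")

# Reading back with the same encoding returns the original characters
with io.open(path, "r", encoding="utf-8") as src:
    round_trip = src.read().rstrip("\n")

print(round_trip == text)
```

Any editor or browser that is told the file is UTF-8 should then show the Russian letters, provided it has a font that contains them.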

    Regards

    Frederic
    Anthra Norell, Aug 1, 2006
    #1

  2. Anthra Norell wrote:

    > >>> import SE # Available at the Cheese Shop


    My point is that the OP asked:
    'How can I translate this using standard Python libraries??'

    so this is simply off topic.

    Claudio Grondi
    Claudio Grondi, Aug 1, 2006
    #2

  3. ----- Original Message -----
    From: "Claudio Grondi" <>
    Newsgroups: comp.lang.python
    To: <>
    Sent: Tuesday, August 01, 2006 2:42 PM
    Subject: Re: Html character entity conversion


    > Anthra Norell wrote:
    >
    > > >>> import SE # Available at the Cheese Shop

    >
    > My point is that the OP asked:
    > 'How can I translate this using standard Python libraries??'
    >
    > so this is simply off topic.
    >
    > Claudio Grondi
    > --
    > http://mail.python.org/mailman/listinfo/python-list


    Claudio,

    I was hoping to do the OP a service. Are you also hoping to do him a service?

    Frederic
    Anthra Norell, Aug 1, 2006
    #3
