multiple codepages

Discussion in 'Perl Misc' started by George Mpouras, Oct 3, 2013.

  1. I receive files containing text in multiple codepages (within the same
    file). You cannot know the codepage of each line in advance.
    Any idea how to convert it to valid UTF-8?
    George Mpouras, Oct 3, 2013
    #1

  2. * George Mpouras wrote in comp.lang.perl.misc:
    >I receive files containing text of multiple codepages (at the same file)
    >. You can not know the codepage of every line from before.
    >Any idea to convert it to valid utf8 ?


    In order to properly convert to UTF-8 you have to know the encoding the
    bytes are in prior to the conversion. Switching between encodings inside
    a single file should be no problem so long as you can isolate the bytes
    around the positions where the encoding changes. If you cannot do that,
    or cannot know the encoding of the bytes through any means at all, then
    you have a problem. Perhaps you can elaborate on your problem?
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Oct 3, 2013
    #2

  3. On 3/10/2013 23:47, Bjoern Hoehrmann wrote:
    > * George Mpouras wrote in comp.lang.perl.misc:
    >> I receive files containing text of multiple codepages (at the same file)
    >> . You can not know the codepage of every line from before.
    >> Any idea to convert it to valid utf8 ?

    >
    > In order to properly convert to UTF-8 you have to know the encoding the
    > bytes are in prior to the conversion. Switching between encodings inside
    > a single file should be no problem so long as you can isolate the bytes
    > around the positions where the encoding changes. If you cannot do that,
    > or cannot know the encoding of the bytes through any means at all, then
    > you have a problem. Perhaps you can elaborate on your problem?
    >


    There is no way to know it; they are email headers in a big log.
    George Mpouras, Oct 3, 2013
    #3
  4. On 2013-10-03 21:29, George Mpouras <> wrote:
    > On 3/10/2013 23:47, Bjoern Hoehrmann wrote:
    >> * George Mpouras wrote in comp.lang.perl.misc:
    >>> I receive files containing text of multiple codepages (at the same file)
    >>> . You can not know the codepage of every line from before.
    >>> Any idea to convert it to valid utf8 ?

    >>
    >> In order to properly convert to UTF-8 you have to know the encoding the
    >> bytes are in prior to the conversion. Switching between encodings inside
    >> a single file should be no problem so long as you can isolate the bytes
    >> around the positions where the encoding changes. If you cannot do that,
    >> or cannot know the encoding of the bytes through any means at all, then
    >> you have a problem. Perhaps you can elaborate on your problem?
    >>

    >
    > there is no way to know it , they are email headers on big log


    Email headers use RFC 2047 encoding.

    hp


    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Oct 3, 2013
    #4
  5. On 4/10/2013 00:31, Peter J. Holzer wrote:
    > On 2013-10-03 21:29, George Mpouras <> wrote:
    >> On 3/10/2013 23:47, Bjoern Hoehrmann wrote:
    >>> * George Mpouras wrote in comp.lang.perl.misc:
    >>>> I receive files containing text of multiple codepages (at the same file)
    >>>> . You can not know the codepage of every line from before.
    >>>> Any idea to convert it to valid utf8 ?
    >>>
    >>> In order to properly convert to UTF-8 you have to know the encoding the
    >>> bytes are in prior to the conversion. Switching between encodings inside
    >>> a single file should be no problem so long as you can isolate the bytes
    >>> around the positions where the encoding changes. If you cannot do that,
    >>> or cannot know the encoding of the bytes through any means at all, then
    >>> you have a problem. Perhaps you can elaborate on your problem?
    >>>

    >>
    >> there is no way to know it , they are email headers on big log

    >
    > Email headers use RFC 2047 encoding.
    >
    > hp
    >
    >


    Maybe, but there are Cyrillic, French, whatever ... in the "username".
    George Mpouras, Oct 3, 2013
    #5
  6. On 3/10/2013 23:47, Bjoern Hoehrmann wrote:
    > * George Mpouras wrote in comp.lang.perl.misc:
    >> I receive files containing text of multiple codepages (at the same file)
    >> . You can not know the codepage of every line from before.
    >> Any idea to convert it to valid utf8 ?

    >
    > In order to properly convert to UTF-8 you have to know the encoding the
    > bytes are in prior to the conversion. Switching between encodings inside
    > a single file should be no problem so long as you can isolate the bytes
    > around the positions where the encoding changes. If you cannot do that,
    > or cannot know the encoding of the bytes through any means at all, then
    > you have a problem. Perhaps you can elaborate on your problem?
    >


    I remember a module called Encode::Guess ... maybe this will work.
    George Mpouras, Oct 3, 2013
    #6
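A minimal sketch of what Encode::Guess can and cannot do, assuming the input is a raw byte string. With no extra suspects it only considers ASCII, UTF-8 and BOM-marked UTF-16/32. When two single-byte codepages both accept the bytes, it returns an error string rather than an encoding object. The sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# UTF-8 input: the default suspects are enough to identify it.
my $utf8_guess = guess_encoding("caf\xC3\xA9");
print ref $utf8_guess ? $utf8_guess->name : $utf8_guess, "\n";

# One byte that is valid in two single-byte codepages: the result is a
# plain string naming the candidates, not an encoding object.
my $ambiguous = guess_encoding("\xE9", qw(iso-8859-1 iso-8859-7));
print ref $ambiguous ? $ambiguous->name : $ambiguous, "\n";
```

So guessing may help separate UTF-8 from "something else", but it is unlikely to tell one 8-bit codepage from another without further hints.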
  7. On 2013-10-03 21:38, George Mpouras <> wrote:
    > On 4/10/2013 00:31, Peter J. Holzer wrote:
    >> On 2013-10-03 21:29, George Mpouras <> wrote:
    >>> On 3/10/2013 23:47, Bjoern Hoehrmann wrote:
    >>>> * George Mpouras wrote in comp.lang.perl.misc:
    >>>>> I receive files containing text of multiple codepages (at the same file)
    >>>>> . You can not know the codepage of every line from before.
    >>>>> Any idea to convert it to valid utf8 ?
    >>>>
    >>>> In order to properly convert to UTF-8 you have to know the encoding the
    >>>> bytes are in prior to the conversion. Switching between encodings inside
    >>>> a single file should be no problem so long as you can isolate the bytes
    >>>> around the positions where the encoding changes. If you cannot do that,
    >>>> or cannot know the encoding of the bytes through any means at all, then
    >>>> you have a problem. Perhaps you can elaborate on your problem?
    >>>>
    >>>
    >>> there is no way to know it , they are email headers on big log

    >>
    >> Email headers use RFC 2047 encoding.

    >
    > maybe but there are Cyrillic, France , whatever .. at "username"


    RFC 2047 encoding includes the encoding. So there is a way to know it
    (otherwise non-ascii characters in subjects, from or to headers etc.
    would be impossible).

    hp


    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Oct 3, 2013
    #7
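To illustrate the point about RFC 2047: an encoded-word has the form =?charset?encoding?text?=, so each header fragment names its own charset. A minimal sketch, assuming the log lines hold raw header values of that kind; the header below is invented for illustration, and the decoding uses the MIME-Header encoding that ships with the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical header value: one encoded-word that declares ISO-8859-7
# (Greek) in Q encoding, followed by plain ASCII.
my $raw = '=?ISO-8859-7?Q?=E3=E5=E9=E1?= <user@example.com>';

# decode() returns a Perl character string. The charset is read from the
# encoded-word itself, so nothing has to be guessed.
my $text = decode('MIME-Header', $raw);

# Re-encode as UTF-8 bytes for output.
my $utf8 = encode('UTF-8', $text);
print $utf8, "\n";
```

Headers that carry raw 8-bit bytes without an encoded-word wrapper are not covered by this; for those the charset really is unknown.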
  8. George Mpouras <> wrote:
    >I receive files containing text of multiple codepages (at the same file)
    >. You can not know the codepage of every line from before.
    >Any idea to convert it to valid utf8 ?


    Given your statement that you do not know the codepage for each line,
    no, that is not possible.
    The simple text 'abcd' would be exactly the same byte sequence (0x61
    0x62 0x63 0x64) in ASCII, Latin-1, Latin-9 (ISO-8859-15), Windows-1252, UTF-8, and
    several dozen other encodings. Without additional external information
    it is not possible to determine which one is the right one.

    jue
    Jürgen Exner, Oct 4, 2013
    #8
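The ambiguity described above can be checked directly. This is only a sketch using the core Encode module; the list of encodings is an arbitrary sample. The first part shows identical bytes for ASCII-only text. The second shows the reverse problem: one high byte decodes without error under several codepages, each time to a different character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my @encodings = qw(ascii iso-8859-1 iso-8859-15 cp1252 UTF-8);

# ASCII-only text yields the same byte sequence in every encoding listed.
my %bytes_for = map { $_ => encode($_, 'abcd') } @encodings;
printf "%-12s %s\n", $_, unpack('H*', $bytes_for{$_}) for @encodings;

# A single non-ASCII byte is valid in many single-byte codepages and
# maps to a different code point in each of them.
my $byte = "\xE9";
my %codepoint_for;
for my $enc (qw(iso-8859-1 iso-8859-7 cp1251)) {
    $codepoint_for{$enc} = ord decode($enc, $byte);
    printf "%-12s U+%04X\n", $enc, $codepoint_for{$enc};
}
```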
  9. On Thu, 3 Oct 2013, George Mpouras wrote:

    > I receive files containing text of multiple codepages (at the same file) . You
    > can not know the codepage of every line from before.
    > Any idea to convert it to valid utf8 ?


    I have a tool that translates a mixture of UTF-8 and *one* codepage into
    pure UTF-8 (under the assumption that a valid UTF-8 byte sequence is
    indeed meant as a UTF-8 character). But if more than one 8-bit code is
    involved, you have to do some hand massage before or after.

    If you are interested, I'll make it available somehow.

    --
    Helmut Richter
    Helmut Richter, Oct 4, 2013
    #9
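The idea described above (valid UTF-8 sequences are taken as UTF-8, everything else as one fixed codepage) can be sketched as follows. This is not the code of the tool mentioned in the post; the helper name mixed_to_utf8 and the default fallback cp1252 are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert bytes that mix UTF-8 with ONE legacy
# codepage into pure UTF-8.
sub mixed_to_utf8 {
    my ($bytes, $fallback) = @_;
    $fallback //= 'cp1252';          # assumed default, adjust as needed
    my $out = '';
    while (length $bytes) {
        # FB_QUIET decodes the longest valid UTF-8 prefix and leaves
        # the unprocessed remainder in $bytes.
        $out .= decode('UTF-8', $bytes, Encode::FB_QUIET());
        last unless length $bytes;
        # The next byte is not part of a valid UTF-8 sequence, so it is
        # interpreted in the fallback codepage instead.
        $out .= decode($fallback, substr($bytes, 0, 1, ''));
    }
    return encode('UTF-8', $out);
}

# "é" arrives as UTF-8 (C3 A9), "ï" as a single Latin-1/cp1252 byte (EF).
print mixed_to_utf8("caf\xC3\xA9 na\xEFve"), "\n";
```

The sketch shares the limitation stated in the post: legacy bytes that happen to form a valid UTF-8 sequence are taken as UTF-8, and only one fallback codepage is handled per call.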
  10. On 4/10/2013 11:02, Helmut Richter wrote:
    > On Thu, 3 Oct 2013, George Mpouras wrote:
    >
    >> I receive files containing text of multiple codepages (at the same file) . You
    >> can not know the codepage of every line from before.
    >> Any idea to convert it to valid utf8 ?

    >
    > I have a tool that translates a mixture of UTF-8 and *one* codepage into
    > pure UTF-8 (under the assumption that a valid UTF-8 byte sequence is
    > indeed meant as an UTF-8 character). But if more than one 8-bit code is
    > involved, you have to do some hand massage before or after.
    >
    > If you are interested, I'll make it available somehow.
    >


    I would love to have a look if you can
    George Mpouras, Oct 5, 2013
    #10
  11. On Sat, 5 Oct 2013, George Mpouras wrote:

    > On 4/10/2013 11:02, Helmut Richter wrote:


    > > I have a tool that translates a mixture of UTF-8 and *one* codepage into
    > > pure UTF-8 (under the assumption that a valid UTF-8 byte sequence is
    > > indeed meant as an UTF-8 character). But if more than one 8-bit code is
    > > involved, you have to do some hand massage before or after.


    > I would love to have a look if you can


    I have made it available as http://hhr-m.userweb.mwn.de/tmp/repcode.txt

    When you call it with option -h, it displays a long help text explaining
    all details.

    (As I learnt perl decades ago, the coding style may be a bit old-fashioned
    but I tried to make it clean.)

    --
    Helmut Richter
    Helmut Richter, Oct 5, 2013
    #11
  12. Helmut Richter <> writes:
    > On Sat, 5 Oct 2013, George Mpouras wrote:
    >
    >> On 4/10/2013 11:02, Helmut Richter wrote:

    >
    >> > I have a tool that translates a mixture of UTF-8 and *one* codepage into
    >> > pure UTF-8 (under the assumption that a valid UTF-8 byte sequence is
    >> > indeed meant as an UTF-8 character). But if more than one 8-bit code is
    >> > involved, you have to do some hand massage before or after.

    >
    >> I would love to have a look if you can

    >
    > I have made it available as http://hhr-m.userweb.mwn.de/tmp/repcode.txt
    >
    > When you call it with option -h, it displays a long help text explaining
    > all detail.
    >
    > (As I learnt perl decades ago, the coding style may be a bit old-fashioned
    > but I tried to make it clean.)


    The entity_value subroutine uses a my variable named %cache for storing
    translations. Unless I'm missing something, this cannot possibly
    accomplish anything useful because the hash only exists while the
    subroutine is executed. This should probably be moved to an outer scope
    or use a 'state' variable.
    Rainer Weikusat, Oct 6, 2013
    #12
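The difference can be made visible with a small counter. This is a sketch only: the subroutine names and the trivial "expensive" computation are invented for illustration and are not taken from the program under discussion. It assumes Perl 5.10 or later for the 'state' feature.

```perl
use strict;
use warnings;
use feature 'state';   # available since Perl 5.10

my ($my_lookups, $state_lookups) = (0, 0);

# A 'my' hash is created empty on every call, so nothing is ever reused.
sub length_with_my {
    my ($key) = @_;
    my %cache;
    return $cache{$key} //= do { $my_lookups++; length $key };
}

# A 'state' hash is created once and keeps its contents between calls.
sub length_with_state {
    my ($key) = @_;
    state %cache;
    return $cache{$key} //= do { $state_lookups++; length $key };
}

length_with_my('amp')    for 1 .. 3;
length_with_state('amp') for 1 .. 3;
print "computed with my: $my_lookups, with state: $state_lookups\n";
```

Both versions return the same results; only the number of recomputations differs.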
  13. On Sun, 6 Oct 2013, Rainer Weikusat wrote:

    > The entity_value subroutine uses a my variable named %cache for storing
    > translations. Unless I'm missing something, this cannot possibly
    > accomplish anything useful because the hash only exists while the
    > subroutine is executed. This should probably be moved to an outer scope
    > or use a 'state' variable.


    I am afraid you are right. The program is old enough that "state" may then
    not have existed, and at that time I may have misunderstood "my" to have
    only a lexical effect: the variable is inaccessible outside its scope but
    continues to exist. Now, I reread the perldoc: it does not say much about
    the fate of the variable upon exit from its scope but the mere existence
    of a "state" keyword allows one to construe that it must serve a purpose.

    As it is just a cache, the function of the program is not affected, only
    its efficiency. Well, I should correct it.

    --
    Helmut Richter
    Helmut Richter, Oct 6, 2013
    #13
  14. Helmut Richter <> writes:
    > On Sun, 6 Oct 2013, Rainer Weikusat wrote:
    >> The entity_value subroutine uses a my variable named %cache for storing
    >> translations. Unless I'm missing something, this cannot possibly
    >> accomplish anything useful because the hash only exists while the
    >> subroutine is executed. This should probably be moved to an outer scope
    >> or use a 'state' variable.

    >
    > I am afraid you are right. The program is old enough that "state" may then
    > not have existed,


    [...]

    The traditional way to create static, stateful subroutines would be by using
    code similar to this:

    -----------
    {
        my $accu;

        sub add_something
        {
            return $accu += $_[0];
        }
    }

    print(add_something(4), "\n");
    print(add_something(12), "\n");
    -----------

    Because nothing except the add_something subroutine exists in the
    lexical scope of $accu, it's the only thing which can access it and
    because Perl supports closures, it will retain a reference to the $accu
    object after the scope which established it has ended.
    Rainer Weikusat, Oct 7, 2013
    #14
  15. Ben Morrow <> writes:

    [variable lifetimes]

    > The nearest equivalent to C's 'extern' is 'our' (or
    > fully-qualified globals), in that these are the only variables that are
    > visible across files. (C has particular rules about symbols needing to
    > be declared 'extern' in most places but defined without 'extern' in one
    > place only; Perl's 'our' variables are more like Unix C's common
    > variables, in that they can be created in many places and the
    > definitions will be merged.)


    This isn't really a good analogy because our doesn't create objects, it
    just declares them to be 'intentionally used identifiers in the
    namespace of the current package' so that strict 'vars' doesn't complain
    about them.
    Rainer Weikusat, Oct 7, 2013
    #15
  16. On 10/7/2013 9:15 AM, Rainer Weikusat wrote:
    > Ben Morrow <> writes:
    >
    > [variable lifetimes]
    >
    >> The nearest equivalent to C's 'extern' is 'our' (or
    >> fully-qualified globals), in that these are the only variables that are
    >> visible across files. (C has particular rules about symbols needing to
    >> be declared 'extern' in most places but defined without 'extern' in one
    >> place only; Perl's 'our' variables are more like Unix C's common
    >> variables, in that they can be created in many places and the
    >> definitions will be merged.)

    >
    > This isn't really a good analogy because our doesn't create objects, it
    > just declares them to be 'intentionally used identifiers in the
    > namespace of the current package' so that strict 'vars' doesn't complain
    > about them.
    >


    Wouldn't something like this effectively encapsulate $foo and act as a
    closure though:

    package main; ...
    package MyFoo;sub add_something { our $foo; return $foo += $_[0]; };
    package main; ...


    --
    Charles DeRykus
    Charles DeRykus, Oct 7, 2013
    #16
  17. Charles DeRykus <> writes:
    > On 10/7/2013 9:15 AM, Rainer Weikusat wrote:
    >> Ben Morrow <> writes:
    >>
    >> [variable lifetimes]
    >>
    >>> The nearest equivalent to C's 'extern' is 'our' (or
    >>> fully-qualified globals), in that these are the only variables that are
    >>> visible across files. (C has particular rules about symbols needing to
    >>> be declared 'extern' in most places but defined without 'extern' in one
    >>> place only; Perl's 'our' variables are more like Unix C's common
    >>> variables, in that they can be created in many places and the
    >>> definitions will be merged.)

    >>
    >> This isn't really a good analogy because our doesn't create objects, it
    >> just declares them to be 'intentionally used identifiers in the
    >> namespace of the current package' so that strict 'vars' doesn't complain
    >> about them.
    >>

    >
    > Wouldn't something like this effectively encapsulate $foo and act as a
    > closure though:
    >
    > package main; ...
    > package MyFoo;sub add_something { our $foo; return $foo += $_[0]; };
    > package main; ...


    package MyFoo;sub add_something { our $foo; return $foo += $_[0]; };
    package main;

    $MyFoo::foo = -15;
    print MyFoo::add_something(3), "\n";
    Rainer Weikusat, Oct 7, 2013
    #17
  18. On 10/7/2013 3:09 PM, Rainer Weikusat wrote:
    > Charles DeRykus <> writes:
    >> On 10/7/2013 9:15 AM, Rainer Weikusat wrote:
    >>> Ben Morrow <> writes:
    >>>
    >>> [variable lifetimes]
    >>>

    ....
    >>
    >> Wouldn't something like this effectively encapsulate $foo and act as a
    >> closure though:
    >>
    >> package main; ...
    >> package MyFoo;sub add_something { our $foo; return $foo += $_[0]; };
    >> package main; ...

    >
    > package MyFoo;sub add_something { our $foo; return $foo += $_[0]; };
    > package main;
    >
    > $MyFoo::foo = -15;
    > print MyFoo::add_something(3), "\n";
    >


    I overreached with 'effectively' perhaps but didn't intend that it was
    unassailable. A $_foo might have helped. But it's a bit of a stretch to
    break it accidentally too.

    --
    Charles DeRykus
    Charles DeRykus, Oct 8, 2013
    #18
  19. Charles DeRykus <> writes:
    > On 10/7/2013 3:09 PM, Rainer Weikusat wrote:
    >> Charles DeRykus <> writes:
    >>> On 10/7/2013 9:15 AM, Rainer Weikusat wrote:
    >>>> Ben Morrow <> writes:
    >>>>
    >>>> [variable lifetimes]
    >>>>

    > ...
    >>>
    >>> Wouldn't something like this effectively encapsulate $foo and act as a
    >>> closure though:
    >>>
    >>> package main; ...
    >>> package MyFoo;sub add_something { our $foo; return $foo += $_[0]; };
    >>> package main; ...

    >>
    >> package MyFoo;sub add_something { our $foo; return $foo += $_[0]; };
    >> package main;
    >>
    >> $MyFoo::foo = -15;
    >> print MyFoo::add_something(3), "\n";

    >
    > I overreached with 'effectively' perhaps but didn't intend that it was
    > unassailable. A $_foo might have helped. But it's a bit of a stretch
    > to break it accidentally too.


    The point was supposed to be that our is genuinely different from my
    because it doesn't create a perl-level object (it may do so
    accidentally, but that's an implementation detail which can be ignored)
    but a short (that is, not fully-qualified) name referring to an object
    associated with the symbol-table of the package the our resides in (in
    order to prevent "use strict 'vars'" from complaining about such an
    object being used without a fully-qualified name): With your
    add_something, not the scope of the object referred to by $foo is
    restricted but the scope of the short name $foo. This code

    -------
    use strict;

    package MyFoo;

    sub add_something { our $foo; return $foo += $_[0]; };

    $foo = 5;

    package main;

    print MyFoo::add_something(3), "\n";
    --------

    won't compile because the $foo name is used outside of the scope of the
    our declaration but this code

    --------
    use strict;

    package MyFoo;

    sub add_something { our $foo; return $foo += $_[0]; };

    our $foo = 5;

    package main;

    print MyFoo::add_something(3), "\n";
    --------

    will because a name referring to the same object is declared in both
    scopes.

    In contrast to this, both the 'state' feature and the trick with
    creating a my variable in a surrounding scope end up creating an object
    which is private to the subroutine in question and will keep its value
    between invocations of that.
    Rainer Weikusat, Oct 8, 2013
    #19
  20. On Tuesday, October 8, 2013 4:59:11 AM UTC-7, Rainer Weikusat wrote:
    > Charles DeRykus <> writes:
    >
    > ...
    >
    > The point was supposed to be that our is genuinely different from my
    > because it doesn't create a perl-level object (it may do so
    > accidentally, but that's an implementation detail which can be ignored)
    > but a short (that is, not fully-qualified) name referring to an object
    > associated with the symbol-table of the package the our resides in (in
    > order to prevent "use strict 'vars'" from complaining about such an
    > object being used without a fully-qualified name): With your
    > add_something, not the scope of the object referred to by $foo is
    > restricted but the scope of the short name $foo. This code
    >
    >
    > use strict;
    > package MyFoo;
    > sub add_something { our $foo; return $foo += $_[0]; };
    > $foo = 5;
    > package main;
    > print MyFoo::add_something(3), "\n";
    >
    > won't compile because the $foo name is used outside of the scope of the our declaration but this code
    > --------
    > use strict;
    > package MyFoo;
    > sub add_something { our $foo; return $foo += $_[0]; };
    > our $foo = 5;
    > package main;
    > print MyFoo::add_something(3), "\n";
    >
    > will because a name referring to the same object is declared in both scopes.
    >
    > In contrast to this, both the 'state' feature and the trick with
    > creating a my-variables in a surrounding scope end up creating an object
    > which is private to the subroutine in question and will keep its value
    > between invocations of that.


    Yes, thanks I realize that. My point was although it was a loose "encapsulation" you can come close to faking what 'state' and 'my-variables in a surrounding scope' can do.

    In fact, although it's nothing more than a curiosity,
    (maybe "curiouser and curiouser" Alice would say) you could
    even tighten it a bit more with an eval:

    package MyFoo;
    our( $foo, $tmp );
    sub add_something {
        local $tmp = $foo;
        $tmp += $_[0];
        *foo = eval "\\$tmp"; die $@ if $@ ;
        return $foo;
    }


    Now an injection of $MyFoo::foo = -17 won't be possible.

    --
    Charles DeRykus

    [sorry for any Google spacing injection, the thread disappeared from my regular newsreader]
    C.DeRykus, Oct 12, 2013
    #20