multiple codepages

George Mpouras

I receive files containing text in multiple codepages (within the same file).
You cannot know the codepage of any given line in advance.
Any idea how to convert such a file to valid UTF-8?
 
Bjoern Hoehrmann

In order to properly convert to UTF-8 you have to know the encoding the
bytes are in prior to the conversion. Switching between encodings inside
a single file should be no problem so long as you can isolate the bytes
around the positions where the encoding changes. If you cannot do that,
or cannot know the encoding of the bytes through any means at all, then
you have a problem. Perhaps you can elaborate on your problem?
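
When the encoding of each stretch of bytes is known, the conversion itself is mechanical. A minimal sketch, assuming the core Encode module; the segment list and the name segments_to_utf8 are made up for illustration and say nothing about how the real log is structured:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: pairs of (known encoding, raw bytes).
my @segments = (
    [ 'iso-8859-1', "caf\xE9" ],                 # "cafe" with e-acute, Latin-1
    [ 'iso-8859-7', "\xC1\xE8\xDE\xED\xE1" ],    # Greek word for Athens, ISO 8859-7
);

# Decode each segment with its own encoding, then encode the result as UTF-8.
sub segments_to_utf8 {
    return map { encode('UTF-8', decode($_->[0], $_->[1])) } @_;
}

my @utf8 = segments_to_utf8(@segments);
binmode STDOUT;              # the strings are already UTF-8 byte strings
print "$_\n" for @utf8;
```

The hard part of the original question is obtaining the first element of each pair, not the conversion.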
 
George Mpouras

There is no way to know it. They are email headers in a big log.
 
Peter J. Holzer

Email headers use RFC 2047 encoding.

hp
 
George Mpouras

Maybe, but there is Cyrillic, French, whatever in the "username".
 
George Mpouras

I remember a module called Encode::Guess ... maybe this will work.
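
The module meant here is presumably Encode::Guess from the Encode distribution. A small sketch of what it can and cannot do, assuming its exported guess_encoding() function; the helper guess_name and the sample byte strings are invented for illustration. As far as its documentation goes, it is of little help for telling single-byte codepages apart, which is the situation in this thread:

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Returns the name of the guessed encoding, or undef if the guess failed.
# On failure guess_encoding() returns an error string, not an object.
sub guess_name {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    return ref($guess) ? $guess->name : undef;
}

my $utf8_bytes   = "caf\xC3\xA9";   # valid UTF-8
my $latin1_bytes = "caf\xE9";       # valid in Latin-1 and in ISO 8859-7 alike

my $clear     = guess_name($utf8_bytes);
my $ambiguous = guess_name($latin1_bytes, qw(iso-8859-1 iso-8859-7));

print defined $clear     ? "guessed: $clear\n"     : "no unique guess\n";
print defined $ambiguous ? "guessed: $ambiguous\n" : "no unique guess\n";
```

The second call cannot succeed because every byte value that is valid in one of the two suspects is also valid in the other.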
 
Peter J. Holzer

George Mpouras said:
Maybe, but there is Cyrillic, French, whatever in the "username".

RFC 2047 encoding includes the name of the character set. So there is a way
to know it (otherwise non-ASCII characters in Subject, From or To headers etc.
would be impossible).
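
An RFC 2047 encoded-word has the form =?charset?encoding?encoded-text?=, so each one carries its own charset name. A minimal sketch of decoding such headers, assuming the 'MIME-Header' encoding that ships with Encode; the two sample header lines are made up:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample headers; each encoded-word names its own charset.
my @raw = (
    'Subject: =?ISO-8859-1?Q?caf=E9?=',
    'From: =?ISO-8859-7?Q?=C1=E8=DE=ED=E1?= <user@example.org>',
);

# 'MIME-Header' decodes every encoded-word using the charset it declares.
my @decoded = map { decode('MIME-Header', $_) } @raw;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @decoded;
```

This only helps for headers that were actually encoded this way. Raw 8-bit bytes in a header carry no charset label and remain ambiguous.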

hp
 
Jürgen Exner

Given your statement that you do not know the codepage for each line,
no, that is not possible.
The simple text 'abcd' would be exactly the same byte sequence (0x61
0x62 0x63 0x64) in ASCII, Latin-1, Latin-9 (ISO 8859-15), Windows-1252, UTF-8, and
several dozen other encodings. Without additional external information
it is not possible to determine which one is the right one.
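
The point can be checked directly. The following sketch, assuming the core Encode module and a handful of encodings picked for illustration, shows that 'abcd' yields identical bytes everywhere, while a single byte above 0x7F decodes to a different character in each codepage:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my @encodings = qw(ascii iso-8859-1 iso-8859-15 cp1252 UTF-8);

# 'abcd' is the same four bytes in every one of these encodings.
my %bytes_for = map { $_ => encode($_, 'abcd') } @encodings;
printf "%-12s %s\n", $_, unpack('H*', $bytes_for{$_}) for @encodings;

# The byte 0xE9 is a valid character in each codepage, but a different one.
my %char_for = map { $_ => decode($_, "\xE9") } qw(iso-8859-1 iso-8859-7 koi8-r);
printf "%-12s U+%04X\n", $_, ord $char_for{$_} for sort keys %char_for;
```

Nothing in the byte 0xE9 itself says whether it was meant as a Latin, Greek or Cyrillic letter.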

jue
 
Helmut Richter

I have a tool that translates a mixture of UTF-8 and *one* codepage into
pure UTF-8 (under the assumption that a valid UTF-8 byte sequence is
indeed meant as a UTF-8 character). But if more than one 8-bit code is
involved, you have to do some manual massaging before or after.

If you are interested, I'll make it available somehow.
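
The same assumption can be sketched in a few lines. This is not the tool mentioned above, only a simplified per-line illustration using the core Encode module; the name line_to_utf8 and the choice of iso-8859-1 as the fallback codepage are assumptions:

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Convert one line to UTF-8: keep it if it already is valid UTF-8,
# otherwise treat it as text in the single fallback codepage.
sub line_to_utf8 {
    my ($bytes, $fallback) = @_;
    my $copy  = $bytes;    # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    $chars = decode($fallback, $bytes) unless defined $chars;
    return encode('UTF-8', $chars);
}

binmode STDOUT;
print line_to_utf8($_, 'iso-8859-1'), "\n" for "caf\xC3\xA9", "caf\xE9";
```

Both input lines come out as the same UTF-8 bytes. A line that mixes UTF-8 and codepage bytes would be treated entirely as codepage text by this sketch, so a real tool has to work on smaller units than lines.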
 
George Mpouras

I would love to have a look, if you can.
 
Rainer Weikusat

Helmut Richter said:
I have made it available as http://hhr-m.userweb.mwn.de/tmp/repcode.txt

When you call it with option -h, it displays a long help text explaining
all the details.

(As I learnt Perl decades ago, the coding style may be a bit old-fashioned,
but I tried to make it clean.)

The entity_value subroutine uses a my variable named %cache for storing
translations. Unless I'm missing something, this cannot possibly
accomplish anything useful because the hash only exists while the
subroutine is executed. This should probably be moved to an outer scope
or use a 'state' variable.
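
The effect is easy to demonstrate. The sketch below is not the code from repcode.txt; translate, lookup_with_my and lookup_with_state are hypothetical stand-ins that only count how often the expensive work is repeated:

```perl
use strict;
use warnings;
use feature 'state';    # needs Perl 5.10 or later

my $computed = 0;       # counts how often the "expensive" work really runs

# Hypothetical stand-in for an expensive translation.
sub translate { $computed++; return uc $_[0] }

sub lookup_with_my {
    my %cache;          # a new, empty hash on every call: the cache never hits
    return $cache{ $_[0] } //= translate($_[0]);
}

sub lookup_with_state {
    state %cache;       # created once, kept between calls
    return $cache{ $_[0] } //= translate($_[0]);
}

lookup_with_my('amp') for 1 .. 3;
my $runs_with_my = $computed;

$computed = 0;
lookup_with_state('amp') for 1 .. 3;
my $runs_with_state = $computed;

print "my: $runs_with_my computations, state: $runs_with_state\n";
```

With 'my' the translation runs on all three calls, with 'state' only on the first.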
 
Helmut Richter

I am afraid you are right. The program is old enough that "state" may not
have existed back then, and at that time I may have misunderstood "my" to have
only a lexical effect: the variable is inaccessible outside its scope but
continues to exist. Now I have reread the perldoc: it does not say much about
the fate of the variable upon exit from its scope, but the mere existence
of a "state" keyword allows one to construe that it must serve a purpose.

As it is just a cache, the function of the program is not affected, only
its efficiency. Well, I should correct it.
 
Rainer Weikusat

The traditional way to create static, stateful subroutines would be by using
code similar to this:

-----------
{
    my $accu;

    sub add_something
    {
        return $accu += $_[0];
    }
}

print(add_something(4), "\n");
print(add_something(12), "\n");
-----------

Because nothing except the add_something subroutine exists in the
lexical scope of $accu, it's the only thing which can access it and
because Perl supports closures, it will retain a reference to the $accu
object after the scope which established it has ended.
 
Rainer Weikusat

[variable lifetimes]
The nearest equivalent to C's 'extern' is 'our' (or
fully-qualified globals), in that these are the only variables that are
visible across files. (C has particular rules about symbols needing to
be declared 'extern' in most places but defined without 'extern' in one
place only; Perl's 'our' variables are more like Unix C's common
variables, in that they can be created in many places and the
definitions will be merged.)

This isn't really a good analogy because 'our' doesn't create objects; it
just declares them to be 'intentionally used identifiers in the
namespace of the current package' so that strict 'vars' doesn't complain
about them.
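
A short sketch of that distinction, using a made-up package Counter: two separate 'our' declarations give two scopes a short name for the same package variable, and the variable itself stays reachable from anywhere under its fully-qualified name.

```perl
use strict;
use warnings;

package Counter;

# Each 'our' only makes the short name $count usable inside its own block.
sub bump      { our $count; return ++$count }
sub count_ref { our $count; return \$count }

package main;

# The package variable is reachable from outside via its full name.
$Counter::count = 10;
my $after_bump = Counter::bump();

# Both short names and the full name refer to one and the same scalar.
my $same_variable = Counter::count_ref() == \$Counter::count;

print "after bump: $after_bump, same variable: ",
      ($same_variable ? "yes" : "no"), "\n";
```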
 
Charles DeRykus

Wouldn't something like this effectively encapsulate $foo and act as a
closure though:

package main; ...
package MyFoo; sub add_something { our $foo; return $foo += $_[0]; }
package main; ...
 
Rainer Weikusat

package MyFoo; sub add_something { our $foo; return $foo += $_[0]; }
package main;

# $foo is a package variable, so it can be set from outside: this prints -12
$MyFoo::foo = -15;
print MyFoo::add_something(3), "\n";
 
Charles DeRykus

Perhaps I overreached with 'effectively', but I didn't mean to imply that it
was unassailable. A $_foo might have helped. But it's a bit of a stretch to
break it accidentally, too.
 
Rainer Weikusat

The point was supposed to be that 'our' is genuinely different from 'my'.
It doesn't create a Perl-level object (it may do so accidentally, but
that's an implementation detail which can be ignored). Instead, it creates a
short (that is, not fully-qualified) name referring to an object
associated with the symbol table of the package the 'our' resides in, so
that "use strict 'vars'" doesn't complain about such an object being
used without a fully-qualified name. With your add_something, it is not
the scope of the object referred to by $foo that is restricted but the
scope of the short name $foo. This code

-------
use strict;

package MyFoo;

sub add_something { our $foo; return $foo += $_[0]; };

$foo = 5;

package main;

print MyFoo::add_something(3), "\n";
--------

won't compile because the $foo name is used outside of the scope of the
our declaration but this code

--------
use strict;

package MyFoo;

sub add_something { our $foo; return $foo += $_[0]; };

our $foo = 5;

package main;

print MyFoo::add_something(3), "\n";
--------

will because a name referring to the same object is declared in both
scopes.

In contrast to this, both the 'state' feature and the trick of
creating a my-variable in a surrounding scope end up creating an object
which is private to the subroutine in question and will keep its value
between invocations of it.
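
For comparison, a sketch of the 'state' variant of the accumulator from earlier in the thread. The name add_with_state is made up to avoid clashing with add_something:

```perl
use strict;
use warnings;
use feature 'state';    # needs Perl 5.10 or later

sub add_with_state {
    state $accu = 0;    # initialised once, on the first call only
    return $accu += $_[0];
}

# $accu is a lexical: no symbol-table entry exists that outside code
# could assign to, unlike $MyFoo::foo in the 'our' version.
my @totals = (add_with_state(4), add_with_state(12));
print "@totals\n";
```

The two calls return 4 and 16, the same running totals as the closure version.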
 
C.DeRykus

Yes, thanks, I realize that. My point was that, although it is only a loose
"encapsulation", you can come close to faking what 'state' and 'my-variables
in a surrounding scope' can do.

In fact, although it's nothing more than a curiosity
(maybe "curiouser and curiouser", Alice would say), you could
even tighten it a bit more with an eval:

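package MyFoo;
our( $foo, $tmp );
sub add_something {
    local $tmp = $foo;
    $tmp += $_[0];
    # eval "\\$tmp" compiles a reference to a literal constant; the glob
    # assignment then makes $foo an alias for that read-only value
    *foo = eval "\\$tmp"; die $@ if $@;
    return $foo;
}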


Now an injection of $MyFoo::foo = -17 won't be possible.
 
