replace delimiter in unicode encoded file

ciapecki

Is there a way in ruby to:
- open a file encoded in ucs-2le,
- replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
- and save it back in ucs-2le, without losing any content?

thanks
chris
 
Ross Bamford

Is there a way in ruby to:
- open a file encoded in ucs-2le,
- replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
- and save it back in ucs-2le, without losing any content?

Well, you _could_ do it with iconv:

$ irb -riconv

data = File.read('test')
# => "a\000b\000c\000\t\000\273\006\t\0001\000"

str = Iconv.iconv('utf-8', 'ucs-2le', data).first
# => "abc\t\332\273\t1"

newstr = str.tr("\t", ',')
# => "abc,\332\273,1"

newdata = Iconv.iconv('ucs-2le', 'utf-8', newstr).first
# => "a\000b\000c\000,\000\273\006,\0001\000"

But that strikes me as unnecessary when you could just do:

newdata = File.read('test').tr("\t", ',')
# => "a\000b\000c\000,\000\273\006,\0001\000"

;)

Hope that helps,
 
David Vallner


Ross said:
But that strikes me as unnecessary when you could just do:

newdata = File.read('test').tr("\t", ',')
# => "a\000b\000c\000,\000\273\006,\0001\000"


Um. Other way around. *Old* data is in UCS-2LE, not in UTF-8, so it's
not ASCII-transparent. Your iconv approach could work if you swapped
around the encoding names, except you'd probably also have to involve a
$KCODE = 'u' and require 'jcode' to avoid clobbering the possible cases
where in UTF8, 0x09 and 0x2c are part of a multibyte sequence.

David Vallner


 
ciapecki

David said:
Um. Other way around. *Old* data is in UCS-2LE, not in UTF-8, so it's
not ASCII-transparent. Your iconv approach could work if you swapped
around the encoding names, except you'd probably also have to involve a
$KCODE = 'u' and require 'jcode' to avoid clobbering the possible cases
where in UTF8, 0x09 and 0x2c are part of a multibyte sequence.

David Vallner



Thanks Ross for the try, but it is not working; I tried it on:

"\377\376B\001\363\000|\001k\000o\000\t\000k\000s\000i\000\005\001|\001k\000a\000\t\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000\t\000\t\000|\001d\000z\001b\000B\001o\000\r\000\n\000"
which is:

łóżko książka człowiek
łąka żdźbło

-> (the same :))

the conversion should be:
łóżko,książka,człowiek
łąka,,żdźbło

but with the Iconv try:
łóżko,książka,człowiek
à¨äˆ€Ôæ¬æ„€â°€â°€ç°€æç¨€æˆäˆ€æ¼à´€à´€

after swapping utf-8 and ucs-2le in both iconv conversions, I get an
error message:
`iconv': "\377\376B\001¾ |☺k\000o\000\t\000k\000"...
(Iconv::IllegalSequence)


Any other suggestions highly appreciated.

Thanks
chris
 
ciapecki

Ross said:
I think David is confusing the order of the 'from' and 'to' arguments to
Iconv.iconv - they go: (to, from, data). My short example was
ill-conceived, though - this might be safer:

$ irb -riconv

s = <the string you show above>

s.gsub(/\t\000(?!\000)/, ",\000")
# =>
"\377\376B\001\363\000|\001k\000o\000,\000k\000s\000i\000\005\001|\001k\000a\000,\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000,\000,\000|\001d\000z\001b\000B\001o\000\r\000\n\000"

(This is:

łóżko,książka,człowiek
łąka,,żdźbło
)

But I'm not totally sure, so you might be better with iconv anyway:

Iconv.iconv('ucs-2le', 'utf-8', Iconv.iconv('utf-8','ucs-2le',
s).first.gsub(/\t/u, ',')).first
# =>
"\377\376B\001\363\000|\001k\000o\000,\000k\000s\000i\000\005\001|\001k\000a\000,\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000,\000,\000|\001d\000z\001b\000B\001o\000\r\000\n\000"

(This too is:

łóżko,książka,człowiek
łąka,,żdźbło
)

Unless I missed something, this seems to work fine here. Does it work for
you?
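[Editor's note: Iconv was deprecated in Ruby 1.9 and removed in 2.0, so on a current Ruby the same round trip can be sketched with String#encode instead. Treating UCS-2LE as UTF-16LE is an assumption that holds for BMP-only data; the sample bytes below are made up, not chris's file:]

```ruby
# Sample data: "a<TAB>b" in UCS-2LE (made-up bytes for illustration).
raw = "a\x00\t\x00b\x00".b

fixed = raw.dup.force_encoding('UTF-16LE')  # relabel the raw bytes
            .encode('UTF-8')                # transcode to UTF-8
            .tr("\t", ',')                  # plain substitution is now safe
            .encode('UTF-16LE')             # and back to the original encoding

fixed.b  # => "a\x00,\x00b\x00" -- tab replaced, all other bytes intact
```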

Thanks Ross,

I was that stupid and forgot to open the writable file as binary "wb"
(before I had "w" only)

Thanks again for your help
chris
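[Editor's note: put together, the working recipe - Ross's byte-level gsub plus binary-mode I/O on both ends, the detail chris had missed - can be sketched like this; the sample bytes and temp files are made up for illustration:]

```ruby
require 'tempfile'

# Sample input: "a<TAB>b" in UCS-2LE, i.e. TAB is the byte pair "\t\x00".
src = Tempfile.new('ucs2le')
src.binmode
src.write("a\x00\t\x00b\x00".b)
src.close

# Read in binary mode, apply the byte-level substitution, then write the
# result back in binary mode (the "wb" chris mentions) so the null bytes
# and line endings survive untouched.
data = File.open(src.path, 'rb') { |f| f.read }
data = data.gsub(/\t\000(?!\000)/n, ",\000".b)

dst = Tempfile.new('out')
dst.binmode # equivalent to opening with "wb"
dst.write(data)
dst.close

result = File.open(dst.path, 'rb') { |f| f.read }
# result == "a\x00,\x00b\x00": tab replaced, every other byte intact
```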
 
ciapecki

Another question following up.
Is there a way to find out what encoding a file uses (is it
ucs-2le or utf-8)?
When I open a file in VIM I can check it with :set fileencoding,
so there must be some way to recognize the file's encoding.

Thanks
chris
 
David Kastrup

Paul Lutus said:
ciapecki wrote:

/ ...


Don't kick yourself too hard, the error lies with Microsoft trying
to golf its way out of a thicket of its own making. There never
should have been two standard line endings (actually three if you
include the Mac), and there never should have been two path
delimiters either, both of which cause endless headaches for
cross-platform coders.

The reason these variations exist is so someone can say, "my
software is different, unique, patentable, now you have to pay me
for it." Even if the differences convey no benefit to the users.

No, the reason is that CP/M had no tty concept, and consequently no
automatic LF->CRLF translation (and CRLF is required on printers).
Also forward slashes were used in CP/M as option lead-ins (CP/M, not
having named directories, did not need to use forward slashes for
those).

This legacy is from long before POSIX, in fact, from long before C.
 
David Vallner


Ross said:
I think David is confusing the order of the 'from' and 'to' arguments to
Iconv.iconv - they go: (to, from, data).

/me puts on dunce hat.

Sorry! I recall always using the command-line iconv specifying them in
from,to order, and apparently that burned deeper into my brain pathways
than it should have.

David Vallner


 
David Vallner


David said:

No, the reason is that CP/M had no tty concept, and consequently no
automatic LF->CRLF translation (and CRLF is required on printers).
Also forward slashes were used in CP/M as option lead-ins (CP/M, not
having named directories, did not need to use forward slashes for
those).

This legacy is from long before POSIX, in fact, from long before C.


Hrm, and I also recall once knowing why the different text / binary
file handling was around. Something to do with some DOS programming
environment and line-oriented text processing that was efficient by a
measure that could only have mattered enough back then to warrant a
design wart on the hardware.

I don't think there's any distinction between the file modes on the OS
level anymore, but programming language runtimes interpret the absence
of the 'b' flag as "translate newlines" to only have to internally
support one convention and avoid having to have every text manipulation
routine handle the difference gracefully.

The blurb about preserving the idiosyncrasies as a business strategy is
hilarious. Also patent nonsense and FUD ;)

David Vallner




 
ciapecki

Paul said:
The fact that you can choose a particular encoding doesn't mean that
encoding is innate to the file. In the case of a unicode text file without
an identifying header, strictly speaking it is not possible to determine
the encoding -- I mean, apart from a human being using common sense and
text recognition.

Hi Paul,

in VIM, :set fileencoding does not only set the fileencoding; run as I
wrote, it also shows the current one. So when I open a utf-8 file and
enter :set fileencoding I get utf-8, and when I open a ucs-2le file I
get ucs-2le. I do not know how it recognizes them,
but the same thing happens (but not always) in Microsoft Notepad. When
you open a file which is in UTF-8, Notepad marks UTF-8 as the encoding;
when the file is ucs-2le, it marks Unicode as the encoding.
So there must be something characteristic in those files.

chris
 
David Vallner


Paul said:
David Vallner wrote:

/ ...

That was the original reason, yes, but it is hard to justify it this far
down the road, with what are ostensibly sophisticated operating systems,
unless the idea is to enshrine a handful of bad choices forever.


Backwards binary compatibility for decades is one of the MS hallmarks.
It's entrenched rather than enshrined - first it was necessary as a
business objective to support and interoperate with arbitrary DOS
software on WinNT. I think -everyone- saw an office worker sit at a text
mode FoxPro app long after Win98 was ubiquitous. Keeping the legacy
behaviour as the default was an easier path for vendors developing new
applications that would work with textual output from older versions
(letting them reuse old code), instead of making them handle differences
gracefully. Interoperability with other operating systems was just
unimportant either as a goal to achieve or to avoid, *nix occupied a
share of the market that wasn't important when the NT architecture came
to be.

I recall the line ending management issue came up because C (and, later,
C++) used Unix line endings internally, therefore they converted any text
files on the fly as they were read or written. But, because some files were
binary, not text, it became necessary to tell the file reader/writer
routines whether or not this behavior was desired.


Which was hilarious fun with buggy web browsers interpreting compressed
archives as text, thoroughly thrashing them.

But only on Windows. The presence or absence of the "b" flag has no effect
on other platforms, which don't convert line endings. Except possibly the
Mac -- I don't know how Macintoshes deal with this, IIRC they have \r as a
line ending.


I'd expect the Mac side to be worse. IIRC, Mac OS Classic had a CR as a
line ending, and the OS X Classic subsystem apps still have. OS X uses
the (POSIX-specified, I think) LF. Then there's also the PPC -> Intel
switch, so together you should have three combinations of expected line
endings and byte endianness to handle at some level. Someone with more
detailed knowledge about Macsen might know how that's handled.

One can't help thinking this is the real motive behind a lot of this stuff.


Well, my phrasing there was wrong, it was indeed a business strategy.
The motivation however was compatibility; Windows aims to achieve vendor
lock-in through the range of exclusively available software, not through
low-level technical idiosyncrasies (unless you count emergent
consequences). This includes both providing attractive tools (like .NET
being installed via Automatic Updates and being an essential Vista
component - a brilliant move from Redmond, actually), and reassuring
vendors that existing software keeps working (read: potentially making
more money for the vendor without having to be updated).

David Vallner


 
David Vallner

Paul said:
ciapecki wrote:

Hi Paul,

in VIM, :set fileencoding does not only set the fileencoding; run as I
wrote, it also shows the current one. So when I open a utf-8 file and
enter :set fileencoding I get utf-8, and when I open a ucs-2le file I
get ucs-2le. I do not know how it recognizes them,
but the same thing happens (but not always) in Microsoft Notepad. When
you open a file which is in UTF-8, Notepad marks UTF-8 as the encoding;
when the file is ucs-2le, it marks Unicode as the encoding.
So there must be something characteristic in those files.

chris


Byte order marks[1]? They're a hack of sorts that you can abuse to
indicate "This file is in Unicode encoding $FOO" in a text-file context.
However, they're a form of in-band signalling, and therefore a potential
Bad Thing depending on what the data will be passing through.

David Vallner

[1]: http://unicode.org/unicode/faq/utf_bom.html#BOM


 
John W. Kennedy

Paul said:
ciapecki wrote:

/ ...


Don't kick yourself too hard, the error lies with Microsoft trying to golf
its way out of a thicket of its own making.

Not fair to MS, in this case; they simply copied DR, who had copied DEC.
(And the CRLF ending is, arguably, the most faithful to the ASCII
design.) It was only as of MS-DOS 2.0 that MS started the long uphill
road to kinda-sorta Unix compatibility, and by then it was too late to
change, just as it was too late to use "/" as a directory separator.

(To an IBM mainframe programmer, after all, all three line-ending
methods look stupid. On mainframes, files are made up of discrete records
-- like rows in an SQL database -- and aren't terminated by any byte
value at all.)
 
