Unicode strings, struct, and files

Tom Plunket · Oct 9, 2006

I am building a file with the help of the struct module.

I would like to be able to put Unicode strings into this file, but I'm
not sure how to do it.

The format I'm trying to write is basically this C structure:

struct MyFile
{
    int magic;
    int flags;
    short otherFlags;
    char pad[22];

    wchar_t line1[32];
    wchar_t line2[32];

    // ... other data which is easy.
};

(I'm writing data on a PC to be read on a big-endian machine.)

So I can write the four leading members with the output of
struct.pack('>IIH22x', magic, flags, otherFlags). Unfortunately I
can't figure out how to write the unicode strings, since:

message = unicode('Hello, world')
myFile.write(message)

results in 'message' being converted back to a string before being
written. Is the way to do this to do something hideous like this:

for c in message:
    myFile.write(struct.pack('>H', ord(unicode(c))))

?

Thanks from a unicode n00b,
-tom!

John Machin · Oct 9, 2006

Tom said:
I am building a file with the help of the struct module.

I would like to be able to put Unicode strings into this file, but I'm
not sure how to do it.

The format I'm trying to write is basically this C structure:

struct MyFile
{
    int magic;
    int flags;
    short otherFlags;
    char pad[22];

    wchar_t line1[32];
    wchar_t line2[32];

    // ... other data which is easy.
};

(I'm writing data on a PC to be read on a big-endian machine.)

So I can write the four leading members with the output of
struct.pack('>IIH22x', magic, flags, otherFlags). Unfortunately I
can't figure out how to write the unicode strings, since:

message = unicode('Hello, world')
myFile.write(message)

results in 'message' being converted back to a string before being
written. Is the way to do this to do something hideous like this:

for c in message:
    myFile.write(struct.pack('>H', ord(unicode(c))))

?

I'd suggest encoding it to a byte string, using the UTF encoding that
matches whatever wchar_t means on the target machine. For example,
assuming big-endian and sizeof(wchar_t) == 2:

utf_line1 = unicode_line1.encode('utf_16_be')
etc.
struct.pack(">.........64s64s", ......, utf_line1, utf_line2)

This presumes that (1) you have already checked that each "line" has no
more than 32 characters (64 bytes once encoded; struct silently truncates
anything longer), and (2) padding with unichr(0) is acceptable.

HTH,
John
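
A minimal sketch of the approach John describes, filling in the parts the thread leaves elided. Several things here are assumptions rather than anything stated above: the code is written for Python 3 (where str is already Unicode, so the thread's unicode() and unichr() are not needed), the helper names encode_line and pack_header are hypothetical, the magic value is made up, and the "other data" that follows the two lines is left out. It also assumes a 2-byte wchar_t on the target, as John does.

```python
import struct

# wchar_t line[32] with an assumed 2-byte wchar_t gives a 64-byte field.
LINE_BYTES = 64

def encode_line(text, size=LINE_BYTES):
    """Encode text as big-endian UTF-16 and check it fits the fixed field."""
    data = text.encode('utf_16_be')
    if len(data) > size:
        # struct's 's' format would silently truncate, so fail loudly instead.
        raise ValueError('line needs %d bytes, field holds %d' % (len(data), size))
    return data

def pack_header(magic, flags, other_flags, line1, line2):
    """Pack the leading members of MyFile (hypothetical helper)."""
    # '>' selects big-endian with no alignment padding; '22x' is char pad[22].
    # '64s' pads short values with null bytes, i.e. U+0000 code units.
    return struct.pack('>IIH22x64s64s', magic, flags, other_flags,
                       encode_line(line1), encode_line(line2))

# 0x4D594649 is a placeholder magic number, not taken from the thread.
header = pack_header(0x4D594649, 0, 1, 'Hello, world', 'second line')
```

The length check is done on the encoded bytes rather than on the character count because a character outside the Basic Multilingual Plane takes two 16-bit code units in UTF-16, so 32 characters do not always fit in 64 bytes.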

Tom Plunket · Oct 9, 2006

John said:
I'd suggest UTF-encoding it as a string, using the encoding that
matches whatever wchar means on the target machine, for example
assuming bigendian and sizeof(wchar) == 2:

Ahh, this is the info that my trawling through the documentation
didn't let me find!

Thanks a bunch.

utf_line1 = unicode_line1.encode('utf_16_be')
etc
struct.pack(">.........64s64s", ......, utf_line1, utf_line2)

Presumes (1) you have already checked that you don't have more than 32
characters in each "line" (2) padding with unichr(0) is acceptable.

This works frighteningly well.

-tom!
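
For checking a file written this way, the reverse direction can be sketched as follows. As before this is Python 3 and rests on assumptions not stated in the thread: the names decode_line and unpack_header are hypothetical, the format string covers only the members shown in the C struct, and null padding is taken to mark the end of each line.

```python
import struct

# Layout of the leading members of MyFile, big-endian, 2-byte wchar_t assumed.
HEADER_FORMAT = '>IIH22x64s64s'
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

def decode_line(raw):
    """Decode a fixed-width UTF-16-BE field and drop the null padding."""
    text = raw.decode('utf_16_be')
    # Everything from the first U+0000 onward is padding.
    return text.split('\x00', 1)[0]

def unpack_header(data):
    """Unpack the leading members from the start of data (hypothetical helper)."""
    magic, flags, other_flags, raw1, raw2 = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE])
    return magic, flags, other_flags, decode_line(raw1), decode_line(raw2)

# Round trip on sample values chosen only for illustration.
sample = struct.pack(HEADER_FORMAT, 1, 2, 3,
                     'Hello, world'.encode('utf_16_be'),
                     'caf\u00e9'.encode('utf_16_be'))
fields = unpack_header(sample)
```

Decoding the whole 64-byte field before splitting keeps the code units aligned; stripping null bytes from the raw data first could cut a character such as 'H' (bytes 00 48) in half.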
