Unicode strings, struct, and files

Tom Plunket

I am building a file with the help of the struct module.

I would like to be able to put Unicode strings into this file, but I'm
not sure how to do it.

The format I'm trying to write is basically this C structure:

struct MyFile
{
    int magic;
    int flags;
    short otherFlags;
    char pad[22];

    wchar_t line1[32];
    wchar_t line2[32];

    // ... other data which is easy. :)
};

(I'm writing data on a PC to be read on a big-endian machine.)

So I can write the four leading members with the output of
struct.pack('>IIH22x', magic, flags, otherFlags). Unfortunately I
can't figure out how to write the unicode strings, since:

    message = unicode('Hello, world')
    myFile.write(message)

results in 'message' being converted back to a string before being
written. Is the way to do this to do something hideous like this:

    for c in message:
        myFile.write(struct.pack('>H', ord(unicode(c))))

?

Thanks from a unicode n00b,
-tom!
 
John Machin

I'd suggest UTF-encoding it as a string, using the encoding that
matches whatever wchar means on the target machine. For example,
assuming big-endian and sizeof(wchar) == 2:

    utf_line1 = unicode_line1.encode('utf_16_be')
    # etc.
    struct.pack(">.........64s64s", ......, utf_line1, utf_line2)

This presumes (1) you have already checked that you don't have more than 32
characters in each "line", and (2) padding with unichr(0) is acceptable.

HTH,
John
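
For anyone landing on this thread later, here is a minimal sketch of the approach John describes, written so that it should run on Python 2 or Python 3. The helper names (encode_line, pack_record), the LINE_BYTES constant, and the sample magic value are made up for illustration, and only the members shown in the struct are packed, since the rest of the layout was not posted. It assumes wchar_t is 2 bytes on the target. The length check is done on the encoded bytes rather than on the character count, because characters outside the Basic Multilingual Plane take 4 bytes in UTF-16.

```python
import struct

# Only the members shown in the thread; the "other data" is not known.
RECORD_FORMAT = '>IIH22x64s64s'
LINE_BYTES = 64  # 32 wchar_t of 2 bytes each (assumed size on the target)


def encode_line(text):
    """Encode text as big-endian UTF-16 and check that it fits the field."""
    data = text.encode('utf_16_be')
    if len(data) > LINE_BYTES:
        raise ValueError('line needs %d bytes, limit is %d'
                         % (len(data), LINE_BYTES))
    return data


def pack_record(magic, flags, other_flags, line1, line2):
    """Pack the header and both text lines into one byte string."""
    # The 's' format code pads short byte strings with null bytes,
    # which is the unichr(0) padding mentioned above.
    return struct.pack(RECORD_FORMAT, magic, flags, other_flags,
                       encode_line(line1), encode_line(line2))


# Hypothetical sample values, not taken from the thread.
record = pack_record(0x12345678, 0, 1, u'Hello, world', u'second line')
```

The resulting byte string can be passed to myFile.write() as long as the file was opened in binary mode.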
 
Tom Plunket

John said:
I'd suggest UTF-encoding it as a string, using the encoding that
matches whatever wchar means on the target machine. For example,
assuming big-endian and sizeof(wchar) == 2:

Ahh, this is the info that my trawling through the documentation
didn't let me find!

Thanks a bunch.

John said:
    utf_line1 = unicode_line1.encode('utf_16_be')
    # etc.
    struct.pack(">.........64s64s", ......, utf_line1, utf_line2)

This presumes (1) you have already checked that you don't have more than 32
characters in each "line", and (2) padding with unichr(0) is acceptable.

This works frighteningly well. ;)


-tom!
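
One way to check the result on the writing side is to unpack the bytes again and decode the two text fields. This is only a sketch under the same assumptions as the thread (big-endian, 2-byte wchar_t, and just the members that were shown); unpack_record and the sample values are hypothetical names chosen for illustration.

```python
import struct

# Same partial layout as discussed in the thread.
RECORD_FORMAT = '>IIH22x64s64s'


def decode_line(raw):
    """Decode a fixed-size UTF-16-BE field, stopping at the first null."""
    text = raw.decode('utf_16_be')
    return text.split(u'\x00', 1)[0]


def unpack_record(data):
    """Return (magic, flags, other_flags, line1, line2) from packed bytes."""
    magic, flags, other_flags, raw1, raw2 = struct.unpack(RECORD_FORMAT, data)
    return magic, flags, other_flags, decode_line(raw1), decode_line(raw2)


# Hypothetical sample record built the way John suggested.
sample = struct.pack(RECORD_FORMAT, 1, 2, 3,
                     u'Hello, world'.encode('utf_16_be'),
                     u'tom'.encode('utf_16_be'))
result = unpack_record(sample)
```

Splitting at the first null mirrors how C code on the target would usually treat a null-terminated wchar_t array.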
 
