seek/tell in presence of multibyte characters

Robert Dodier

Hello,

I would like to call seek and tell on files which contain multibyte
characters (utf8).
perldoc -f seek says that seek only considers byte offsets, not
character offsets.
How can I implement a seek-like function which takes a character
offset?
(Or does such a thing already exist in some library?)
Likewise, I need a tell function which reports character offset instead
of byte offset.

It is OK if these character seek/tell functions are slower than the
built-in byte seek/tell.

Thanks for any light you can shed on this problem.

Robert Dodier
 
xhoster

Robert Dodier said:
Hello,

I would like to call seek and tell on files which contain multibyte
characters (utf8).
perldoc -f seek says that seek only considers byte offsets, not
character offsets.
How can I implement a seek-like function which takes a character
offset?

Seek to the beginning of the file, then count characters until you
get where you want. To speed it up, save a table of occasional waymarks
giving the byte offset of certain character offsets, then you can seek back
to the last waymark which is less than the desired one and count from
there.
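
A minimal sketch of the waymark idea, assuming the handle is opened in raw byte mode and the file is valid UTF-8. The names build_waymarks and char_seek and the $step parameter are made up for illustration. It relies on the fact that every byte of a UTF-8 character except the first has the bit pattern 10xxxxxx.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the file once and record the byte offset of
# every $step-th character (characters 0, $step, 2 * $step, ...).
sub build_waymarks {
    my ($fh, $step) = @_;
    seek($fh, 0, 0) or die "seek failed: $!";
    my @marks;
    my ($chars, $pos) = (0, 0);
    while (read($fh, my $buf, 65536)) {
        for my $byte (unpack 'C*', $buf) {
            if (($byte & 0xC0) != 0x80) {       # not a continuation byte
                push @marks, $pos if $chars % $step == 0;
                $chars++;
            }
            $pos++;
        }
    }
    @marks = (0) unless @marks;                 # empty file
    return \@marks;
}

# Hypothetical helper: seek to the last waymark at or before character
# $target, then count characters forward. Returns the byte offset reached.
# A target past the end of the file silently stops at end of file.
sub char_seek {
    my ($fh, $marks, $step, $target) = @_;
    my $i = int($target / $step);
    $i = $#$marks if $i > $#$marks;
    my $pos  = $marks->[$i];                    # byte offset of the waymark
    my $skip = $target - $i * $step;            # characters still to pass
    seek($fh, $pos, 0) or die "seek failed: $!";
    while ($skip >= 0 && read($fh, my $byte, 1)) {
        $skip-- if (ord($byte) & 0xC0) != 0x80; # start of a character
        $pos++ if $skip >= 0;                   # target not reached yet
    }
    seek($fh, $pos, 0) or die "seek failed: $!";
    return $pos;
}
```

A smaller $step means a bigger table and less counting per seek; build the table once and reuse it for as long as the file does not change.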

Xho
 
Mumia W. (reading news)

Robert Dodier said:
Hello,

I would like to call seek and tell on files which contain multibyte
characters (utf8).
perldoc -f seek says that seek only considers byte offsets, not
character offsets.
How can I implement a seek-like function which takes a character
offset?
(Or does such a thing already exist in some library?)
Likewise, I need a tell function which reports character offset instead
of byte offset.

It is OK if these character seek/tell functions are slower than the
built-in byte seek/tell.

Thanks for any light you can shed on this problem.

Robert Dodier

If I had control over the file format, I would use a 16-bit unicode
version. That would provide predictable character sizes.

I have no idea how you can seek through a utf-8 encoded file.

If you can, use GNU "recode" or a similar utility (such as Perl :) ) to
convert the file from its original encoding to a 16-bit unicode file.
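
For the Perl route, a rough sketch using the core Encode module might look like the following. The helper name utf8_to_utf16be is made up for illustration, and it converts a string held in memory rather than a whole file. Only characters up to U+FFFF come out as exactly two bytes each.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a string of UTF-8 bytes as big-endian
# UTF-16 (no byte order mark is written for the UTF-16BE encoding).
sub utf8_to_utf16be {
    my ($utf8_bytes) = @_;
    my $text = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    return encode('UTF-16BE', $text);          # characters -> UTF-16 bytes
}
```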
 
Brian McCauley

Robert said:
I would like to call seek and tell on files which contain multibyte
characters (utf8).
perldoc -f seek says that seek only considers byte offsets, not
character offsets.
How can I implement a seek-like function which takes a character
offset?

Why? No, seriously, why?

What meaning does a character offset have?

If you just want to get back to a position you've visited before then
byte offsets (or even opaque constructs) are perfectly adequate.
 
Peter J. Holzer

Robert Dodier said:
I would like to call seek and tell on files which contain multibyte
characters (utf8).
perldoc -f seek says that seek only considers byte offsets, not
character offsets.
How can I implement a seek-like function which takes a character
offset?
(Or does such a thing already exist in some library?)
Likewise, I need a tell function which reports character offset instead
of byte offset.
[...]

Mumia W. said:
If I had control over the file format, I would use a 16-bit unicode
version. That would provide predictable character sizes.

The Unicode character set these days contains characters beyond U+FFFF.
You may not need them now but somebody will want to use them in the
future (Murphy isn't sleeping) - so "a 16-bit unicode version" means
UTF-16, not UCS-2, and you don't have a predictable character size
anymore: One character can be 2 or 4 bytes.
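
A quick way to check this, sketched with the core Encode module (the helper name utf16_width is made up for illustration): encode single characters as UTF-16 and count the bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes one character occupies in
# big-endian UTF-16.
sub utf16_width {
    my ($char) = @_;
    return length encode('UTF-16BE', $char);
}

printf "U+0041  takes %d bytes\n", utf16_width("\x{41}");     # 2
printf "U+20AC  takes %d bytes\n", utf16_width("\x{20AC}");   # 2
printf "U+1D11E takes %d bytes\n", utf16_width("\x{1D11E}");  # 4 (surrogate pair)
```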

hp
 
Robert Dodier

Brian said:
Why? No, seriously, why?

To find a location in the file specified as the number of characters
from the beginning.

Brian said:
What meaning does a character offset have?

The number of characters from the beginning of the file.
It is like counting variable-length records.

Brian said:
If you just want to get back to a position you've visited before

You may wish to consider that I want what I asked for.

FWIW
Robert Dodier
 
Robert Dodier

Robert said:
I would like to call seek and tell on files which contain multibyte
characters (utf8).
perldoc -f seek says that seek only considers byte offsets, not
character offsets.
How can I implement a seek-like function which takes a character
offset?

I ended up emulating seek by calling read FH, $stuff, $N; where $N is
the number of characters to seek, after open FH, "<:utf8", $filename;
(and then ignoring $stuff). For tell, I call length on a string to get
the character count.

It turns out that Perl's open & read functions are plenty fast
enough to make this work. I could wish for a more elegant solution
but for the moment I'm just happy it works.
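
Roughly, the emulation could be sketched like this. It is not the exact code in use; the names char_open, char_read, char_seek and char_tell are made up for illustration, and the character position is kept as a running count in a small hash next to the filehandle.

```perl
use strict;
use warnings;

# Hypothetical wrapper: pair a :utf8 handle with a running character count.
sub char_open {
    my ($filename) = @_;
    open my $fh, '<:utf8', $filename or die "cannot open $filename: $!";
    return { fh => $fh, pos => 0 };
}

# Read up to $count characters and advance the character position.
sub char_read {
    my ($file, $count) = @_;
    my $got = read($file->{fh}, my $text, $count);
    die "read failed: $!" unless defined $got;
    $file->{pos} += length $text;    # length counts characters here
    return $text;
}

# Emulated seek: rewind to the start, then read and discard $offset
# characters. Returns the character position actually reached.
sub char_seek {
    my ($file, $offset) = @_;
    seek($file->{fh}, 0, 0) or die "seek failed: $!";
    $file->{pos} = 0;
    char_read($file, $offset) if $offset > 0;
    return $file->{pos};
}

# Emulated tell: number of characters consumed so far.
sub char_tell {
    my ($file) = @_;
    return $file->{pos};
}
```

All reads have to go through char_read for the count to stay right. The :utf8 layer does not validate its input; :encoding(UTF-8) does, at some cost in speed.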

Thanks to everyone who responded.

Robert Dodier
 
Brian McCauley

Robert said:
To find a location in the file specified as the number of characters
from the beginning.

I asked why, not what.

Robert said:
You may wish to consider that I want what I asked for.

I am happy to consider all possibilities.

Are you prepared to consider that this may be X-Y?
 
