replace extended characters

Joshua Cranmer

I decompiled a number of switches a while back, and was disappointed
to see that the compiler nearly always used lookupswitch, even where I
would have used tableswitch. This suggests that, if you want to
guarantee the efficient version, you should design your keys to be
dense ints starting at 0. You can often replace a switch with an array
or Map lookup, or an array lookup of a delegate.
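
To make that concrete: with dense keys starting at 0, a switch over
actions can become a plain array lookup of delegates. A rough sketch
(the Handler interface, the handlers, and the opcode values here are
invented for illustration):

public class Dispatcher {
    // Sketch only: Handler and the handlers below are placeholders.
    interface Handler {
        void handle();
    }

    // Keys are dense ints starting at 0, so a plain array works and
    // dispatch is a constant-time index - much like a tableswitch.
    private static final Handler[] HANDLERS = {
        () -> System.out.println("opcode 0"),
        () -> System.out.println("opcode 1"),
        () -> System.out.println("opcode 2"),
    };

    static void dispatch(int opcode) {
        HANDLERS[opcode].handle();
    }

    public static void main(String[] args) {
        dispatch(1);   // prints "opcode 1"
    }
}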

From OpenJDK:
long table_space_cost = 4 + ((long) hi - lo + 1); // words
long table_time_cost = 3; // comparisons
long lookup_space_cost = 3 + 2 * (long) nlabels;
long lookup_time_cost = nlabels;
int opcode =
    nlabels > 0 &&
    table_space_cost + 3 * table_time_cost <=
    lookup_space_cost + 3 * lookup_time_cost
    ? tableswitch : lookupswitch;

In other words, about 1/5 of the range must be filled with labels for
the compiler to use a tableswitch (more precisely, nlabels must be at
least 1/5 of the range plus 2). This meshes well with what I know from
playing around with a custom-built bytecode reader: it took
surprisingly few case statements in a large table (Java's bytecode
uses around 200 different opcode values) before it switched to
tableswitch - by the time I had enabled the if* opcodes.
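
Plugging some made-up numbers into that cost model shows the
break-even point; for a range of 100 keys, 25 labels is enough to get
a tableswitch:

public class SwitchCost {
    public static void main(String[] args) {
        // Hypothetical switch: keys 0..99 (range 100), 25 labels.
        long lo = 0, hi = 99, nlabels = 25;
        long table_space_cost = 4 + (hi - lo + 1);   // 104 words
        long table_time_cost = 3;                    // 3 comparisons
        long lookup_space_cost = 3 + 2 * nlabels;    // 53 words
        long lookup_time_cost = nlabels;             // 25 comparisons
        // table:  104 + 3*3  = 113
        // lookup:  53 + 3*25 = 128  -> 113 <= 128, so tableswitch wins.
        boolean useTableswitch = nlabels > 0
            && table_space_cost + 3 * table_time_cost
               <= lookup_space_cost + 3 * lookup_time_cost;
        System.out.println(useTableswitch ? "tableswitch" : "lookupswitch");
    }
}

With only 20 labels over the same range, the lookup side comes to
43 + 60 = 103 and lookupswitch wins instead, which matches the
one-fifth-plus-two threshold.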
 
Owen Jacobson

Hi,

I'm trying to create a Java utility that will read in a file that may
or may not contain extended ASCII characters and replace these
characters with a predetermined character, e.g. replace é with e, and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

This process already has a name: "normalization". The Java standard
library includes tools (java.text.Normalizer and friends) for applying
the standard Unicode normalizations. A rough sketch for the program
would be:

1. Load your input text under the correct encoding. You will likely
want to leave the choice of encodings up to the user; "extended ASCII"
is not an encoding but encompasses several possible options, and since
there are only statistical and not deterministically correct ways to
detect encodings, it's better not to guess.

2. Normalize the text under NFD. This will replace a character like 'ü'
with a base character - 'u' in this case - followed by one or more
combining marks - a combining diaeresis in this case. (Alternately, use
NFKD instead of NFD. NFKD is more liberal about changing the meaning of
the normalized text, but permits things like detaching ligatures, which
NFD does not do. Examples are in the normalization spec - see figure 6.)

3. Output the resulting normalized text under the target encoding
(presumably US-ASCII). You'll want to do this the "hard way", via
java.nio.charset.Charset and CharsetEncoder, so that you can set
onUnmappableCharacter to CodingErrorAction.IGNORE and strip
unencodable characters. A sketch of all three steps follows below.
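
Putting those three steps together, a minimal sketch - assuming UTF-8
input and US-ASCII output, both of which you'd want to make
configurable:

import java.io.*;
import java.nio.charset.*;
import java.nio.file.*;
import java.text.Normalizer;

public class StripAccents {
    public static void main(String[] args) throws IOException {
        // 1. Load the input under an explicit encoding (UTF-8 assumed here).
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                                 StandardCharsets.UTF_8);

        // 2. Decompose: 'é' becomes 'e' plus a combining acute accent.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);

        // 3. Re-encode as US-ASCII, silently dropping anything unmappable
        //    (the combining marks, and any character with no ASCII form).
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.IGNORE)
                .onMalformedInput(CodingErrorAction.IGNORE);
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(args[1]), encoder)) {
            out.write(decomposed);
        }
    }
}

This reads the whole file at once for brevity; for very large files
you would process it in chunks instead, taking care not to split a
base character from its trailing combining marks at a chunk boundary.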

You'll want to read up on normalization and the Unicode normalization
specs[0] before proceeding. This topic is fraught with non-obvious edge
cases.

The suggestion that you use iconv or other existing Unicode-aware
encoding conversion tools is a good one, incidentally. This isn't
really a problem you need to solve yourself, unless you're completely
convinced that none of the usual normalization rules is right for your
use case.

-o

[0] <http://unicode.org/reports/tr15/>
 
Lew

Roedy said:
There are 2^16 = 65536 possible 16-bit unicode [sic] chars. Which chars do
you transform?

You're off by more than an order of magnitude, though, for the entire
Unicode character set, since Unicode is not limited to 16 bits.

--
Lew
Ceci n'est pas une fenêtre.
..___________.
|###] | [###|
|##/ | *\##|
|#/ * | \#|
|#----|----#|
|| | * ||
|o * | o|
|_____|_____|
|===========|
 
Lew

Cygwin works for me.

For me, too, but that doesn't guarantee that the OP is allowed to use it.

--
Lew
Ceci n'est pas une fenêtre.
..___________.
|###] | [###|
|##/ | *\##|
|#/ * | \#|
|#----|----#|
|| | * ||
|o * | o|
|_____|_____|
|===========|
 
Arne Vajhøj

My version reads the entire file into RAM in one I/O.
By making whacking huge buffers, you can ensure the bottleneck is the
CPU. I use a big switch statement.

That is a very good example of how optimization can indeed be
evil.

The speed increase from using larger buffers diminishes quickly.

And doing it this way ensures that the program cannot handle
huge files.

Very bad design.
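
For contrast, a fixed-size buffer keeps memory bounded no matter how
big the input is. A rough sketch (the 64 KB buffer size and the
charsets are arbitrary choices):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class ChunkedCopy {
    public static void main(String[] args) throws IOException {
        char[] buffer = new char[64 * 1024];
        try (Reader in = new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream(args[1]), StandardCharsets.US_ASCII)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Transform buffer[0..n) here, then write the chunk out.
                out.write(buffer, 0, n);
            }
        }
    }
}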

Arne
 
Joshua Cranmer

Roedy said:
There are 2^16 = 65536 possible 16-bit unicode [sic] chars. Which chars
do you transform?

You're off by more than an order of magnitude, though, for the entire
Unicode character set, since Unicode is not limited to 16 bits.

A Java char is 16 bits, which covers everything in the BMP. Since the
OP's goal is basically to remove all accents, I think ignoring non-BMP
characters is safe: everything outside the BMP looks to be extended
unified CJK, symbols, historic scripts, or modified variations of
alphanumerics (e.g., blackboard math characters).
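
If non-BMP characters ever did matter, the safe approach is to iterate
by code point rather than by char. A small sketch, using a
blackboard-bold 'A' as an example supplementary character:

public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A) needs a
        // surrogate pair in Java's UTF-16 chars.
        String s = "A\uD835\uDD38";
        System.out.println(s.length());                       // 3 chars
        System.out.println(s.codePointCount(0, s.length()));  // 2 code points
        s.codePoints().forEach(cp ->
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    }
}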
 
Paul Cager

It's too early to ask that question.  It suffices to come up with the
reasonably performant algorithm as suggested upthread.

I think we are in agreement, then. My point was that attempting to
guess where the bottleneck will be is futile. Unfortunately there was
no way for you to see my Gallic Shrug indicating the futility of
guessing. Or maybe it just looked like a bad case of indigestion.
 
Lew

Paul Cager said:
I think we are in agreement, then. My point was that attempting to
guess where the bottleneck will be is futile.

I disagree. You seem to be warning against something in which people
are not engaging. If in fact you are warning against premature
optimization proactively (not to say prematurely), then I guess we are
in agreement, but I failed to see why you added yet another warning
against it to a thread that didn't need the warning, at least not yet,
and in which that warning had already been retired.

And my point was that no one is doing that.

Paul Cager said:
Unfortunately there was no way for you to see my Gallic Shrug
indicating the futility of guessing. Or maybe it just looked like a
bad case of indigestion.

I wouldn't have found it relevant even had I seen it. No one was
engaging in "guessing".

At least, not yet. Again, if your intent was only to warn yet again
one more time against such activity, then I guess we are in agreement
after all.

And given that, I wonder why everyone seems so hepped up to warn
against it here.

Statements like, "Premature optimization is the root of all evil" are
not supposed to prevent critical thinking and turn us into cargo-cult
slogan monkeys. You still have to distinguish between what is
premature (or micro-) optimization and what is just darn good sense.
I see far, far too much code-by-superstition in the field.
 
