replace extended characters

Joshua Cranmer

I decompiled a number of switches a while back, and was disappointed
to see that the compiler nearly always used lookupswitch, even where I
would have used tableswitch. This suggests that, if you want to
guarantee the efficient version, you should design your keys to be
dense ints starting at 0. You can often replace a switch with an array
or Map lookup, or an array lookup of a delegate.
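
To make that concrete: with dense keys starting at 0, a switch over
actions can become a plain array lookup of delegates. A rough sketch
(the Handler interface, the handlers, and the opcode values here are
invented for illustration):

public class Dispatcher {
    // Sketch only: Handler and the handlers below are placeholders.
    interface Handler {
        void handle();
    }

    // Keys are dense ints starting at 0, so a plain array works and
    // dispatch is a constant-time index - much like a tableswitch.
    private static final Handler[] HANDLERS = {
        () -> System.out.println("opcode 0"),
        () -> System.out.println("opcode 1"),
        () -> System.out.println("opcode 2"),
    };

    static void dispatch(int opcode) {
        HANDLERS[opcode].handle();
    }

    public static void main(String[] args) {
        dispatch(1);   // prints "opcode 1"
    }
}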

From OpenJDK:
long table_space_cost = 4 + ((long) hi - lo + 1); // words
long table_time_cost = 3; // comparisons
long lookup_space_cost = 3 + 2 * (long) nlabels;
long lookup_time_cost = nlabels;
int opcode =
    nlabels > 0 &&
    table_space_cost + 3 * table_time_cost <=
    lookup_space_cost + 3 * lookup_time_cost
    ? tableswitch : lookupswitch;

In other words, about 1/5 of the range must be filled with labels for
the compiler to use a tableswitch (more precisely, nlabels must be at
least 1/5 of the range plus 2). This meshes well with what I know from
playing around with a custom-built bytecode reader: it took
surprisingly few case statements in a large table (Java's bytecode
uses around 200 different opcode values) before it switched to
tableswitch - by the time I had enabled the if* opcodes.
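
Plugging some made-up numbers into that cost model shows the
break-even point; for a range of 100 keys, 25 labels is enough to get
a tableswitch:

public class SwitchCost {
    public static void main(String[] args) {
        // Hypothetical switch: keys 0..99 (range 100), 25 labels.
        long lo = 0, hi = 99, nlabels = 25;
        long table_space_cost = 4 + (hi - lo + 1);   // 104 words
        long table_time_cost = 3;                    // 3 comparisons
        long lookup_space_cost = 3 + 2 * nlabels;    // 53 words
        long lookup_time_cost = nlabels;             // 25 comparisons
        // table:  104 + 3*3  = 113
        // lookup:  53 + 3*25 = 128  -> 113 <= 128, so tableswitch wins.
        boolean useTableswitch = nlabels > 0
            && table_space_cost + 3 * table_time_cost
               <= lookup_space_cost + 3 * lookup_time_cost;
        System.out.println(useTableswitch ? "tableswitch" : "lookupswitch");
    }
}

With only 20 labels over the same range, the lookup side comes to
43 + 60 = 103 and lookupswitch wins instead, which matches the
one-fifth-plus-two threshold.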
 
Owen Jacobson

Hi,

I'm trying to create a Java utility that will read in a file that may
or may not contain extended ASCII characters and replace these
characters with a predetermined character, e.g. replace é with e, and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

This process already has a name: "normalization". The Java standard
library includes tools (java.text.Normalizer and friends) for applying
the standard Unicode normalizations. A rough sketch for the program
would be:

1. Load your input text under the correct encoding. You will likely
want to leave the choice of encodings up to the user; "extended ASCII"
is not an encoding but encompasses several possible options, and since
there are only statistical and not deterministically correct ways to
detect encodings, it's better not to guess.

2. Normalize the text under NFD. This will replace a character like 'ü'
with a base character - 'u' in this case - followed by one or more
combining marks - a combining diaeresis in this case. (Alternately, use
NFKD instead of NFD. NFKD is more liberal about changing the meaning of
the normalized text, but permits things like detaching ligatures, which
NFD does not do. Examples are in the normalization spec - see figure 6.)

3. Output the resulting normalized text under the target encoding
(presumably US-ASCII). You'll want to do this the "hard way", via
java.nio.charset.Charset and CharsetEncoder, so that you can set
onUnmappableCharacter to CodingErrorAction.IGNORE and strip
unencodable characters. A sketch of all three steps follows below.
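
Putting those three steps together, a minimal sketch - assuming UTF-8
input and US-ASCII output, both of which you'd want to make
configurable:

import java.io.*;
import java.nio.charset.*;
import java.nio.file.*;
import java.text.Normalizer;

public class StripAccents {
    public static void main(String[] args) throws IOException {
        // 1. Load the input under an explicit encoding (UTF-8 assumed here).
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                                 StandardCharsets.UTF_8);

        // 2. Decompose: 'é' becomes 'e' plus a combining acute accent.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);

        // 3. Re-encode as US-ASCII, silently dropping anything unmappable
        //    (the combining marks, and any character with no ASCII form).
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.IGNORE)
                .onMalformedInput(CodingErrorAction.IGNORE);
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(args[1]), encoder)) {
            out.write(decomposed);
        }
    }
}

This reads the whole file at once for brevity; for very large files
you would process it in chunks instead, taking care not to split a
base character from its trailing combining marks at a chunk boundary.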

You'll want to read up on normalization and the Unicode normalization
specs[0] before proceeding. This topic is fraught with non-obvious edge
cases.

The suggestion that you use iconv or other existing Unicode-aware
encoding conversion tools is a good one, incidentally. This isn't
really a problem you need to solve yourself, unless you're completely
convinced that none of the usual normalization rules is right for your
use case.

-o

[0] <http://unicode.org/reports/tr15/>
 
Lew

Roedy said:
There are 2^16 = 65536 possible 16-bit unicode [sic] chars. Which chars do
you transform?

You're off by more than an order of magnitude, though, for the entire
Unicode character set, since Unicode is not limited to 16 bits.

--
Lew
Ceci n'est pas une fenêtre.
..___________.
|###] | [###|
|##/ | *\##|
|#/ * | \#|
|#----|----#|
|| | * ||
|o * | o|
|_____|_____|
|===========|
 
Lew

Cygwin works for me.

For me, too, but that doesn't guarantee that the OP is allowed to use it.

--
Lew
Ceci n'est pas une fenêtre.
..___________.
|###] | [###|
|##/ | *\##|
|#/ * | \#|
|#----|----#|
|| | * ||
|o * | o|
|_____|_____|
|===========|
 
Arne Vajhøj

My version reads the entire file into RAM in one I/O.
By making whacking huge buffers, you can ensure the bottleneck is the
CPU. I use a big switch statement.

That is a very good example of how optimization can indeed be
evil.

The speed increase from using larger buffers diminishes quickly.

And doing it this way ensures that the program cannot handle
huge files.

Very bad design.
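
For contrast, a fixed-size buffer keeps memory bounded no matter how
big the input is. A rough sketch (the 64 KB buffer size and the
charsets are arbitrary choices):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class ChunkedCopy {
    public static void main(String[] args) throws IOException {
        char[] buffer = new char[64 * 1024];
        try (Reader in = new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream(args[1]), StandardCharsets.US_ASCII)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Transform buffer[0..n) here, then write the chunk out.
                out.write(buffer, 0, n);
            }
        }
    }
}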

Arne
 
Joshua Cranmer

Roedy said:
There are 2^16 = 65536 possible 16-bit unicode [sic] chars. Which chars
do you transform?

You're off by more than an order of magnitude, though, for the entire
Unicode character set, since Unicode is not limited to 16 bits.

A Java char is 16 bits, which covers everything in the BMP. Since the
OP's goal is basically to remove all accents, I think ignoring non-BMP
characters is safe: everything outside the BMP looks to be extended
unified CJK, symbols, historic scripts, or modified variations of
alphanumerics (e.g., blackboard math characters).
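
If non-BMP characters ever did matter, the safe approach is to iterate
by code point rather than by char. A small sketch, using a
blackboard-bold 'A' as an example supplementary character:

public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A) needs a
        // surrogate pair in Java's UTF-16 chars.
        String s = "A\uD835\uDD38";
        System.out.println(s.length());                       // 3 chars
        System.out.println(s.codePointCount(0, s.length()));  // 2 code points
        s.codePoints().forEach(cp ->
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    }
}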
 
Paul Cager

It's too early to ask that question.  It suffices to come up with the
reasonably performant algorithm as suggested upthread.

I think we are in agreement, then. My point was that attempting to
guess where the bottleneck will be is futile. Unfortunately there was
no way for you to see my Gallic Shrug indicating the futility of
guessing. Or maybe it just looked like a bad case of indigestion.
 
Lew

Paul Cager said:
I think we are in agreement, then. My point was that attempting to
guess where the bottleneck will be is futile.

I disagree. You seem to be warning against something in which people
are not engaging. If in fact you are warning against premature
optimization proactively (not to say prematurely), then I guess we are
in agreement, but I failed to see why you added yet another warning
against it to a thread that didn't need the warning, at least not yet,
and in which that warning had already been retired.

And my point was that no one is doing that.

Paul Cager said:
Unfortunately there was no way for you to see my Gallic Shrug
indicating the futility of guessing. Or maybe it just looked like a
bad case of indigestion.

I wouldn't have found it relevant even had I seen it. No one was
engaging in "guessing".

At least, not yet. Again, if your intent was only to warn yet again
one more time against such activity, then I guess we are in agreement
after all.

And given that, I wonder why everyone seems so hepped up to warn
against it here.

Statements like, "Premature optimization is the root of all evil" are
not supposed to prevent critical thinking and turn us into cargo-cult
slogan monkeys. You still have to distinguish between what is
premature (or micro-) optimization and what is just darn good sense.
I see far, far too much code-by-superstition in the field.
 
