replace extended characters


VIDEO MAN

Hi,

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.
 

Lawrence D'Oliveiro

VIDEO said:
How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Correctness comes before efficiency. Get it working first.
 

Lew

VIDEO said:
I'm trying to create a java [sic] utility that will read in a file that may
or may not contain extended ascii [sic] characters and replace these
characters with a predetermined character [sic] e.g. [sic] replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

Read from a BufferedReader. Write to a BufferedWriter. Process one character
at a time. It won't be efficient unless you are guaranteed a limited
character-set input. The Unicode character space contains over a million code
points (up to U+10FFFF). "Extended ASCII" is a very tiny subset of that, and
its meaning also depends on the character encoding.

If you are certain that the set of possible input characters is small, and
those you wish to substitute even smaller, you can use a lookup table. Use a
'Map<Character,Character>' (will choke on supplementary code points) for
those, and only those, you wish to substitute. If the key is absent, pass the
source character through unchanged. If present, replace with the associated
value.
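
A minimal sketch of that table-driven, character-at-a-time approach (the
encoding, file names, and particular mappings here are illustrative
assumptions, not requirements):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    public class CharReplacer {
        public static void main(String[] args) throws IOException {
            // Map only the characters to substitute; all others pass through.
            Map<Character, Character> subs = new HashMap<Character, Character>();
            subs.put('é', 'e');
            subs.put('è', 'e');
            subs.put('à', 'a');

            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), "ISO-8859-1"));
                 BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), "ISO-8859-1"))) {
                int c;
                while ((c = in.read()) != -1) {
                    // Absent key: pass the source character through unchanged.
                    Character r = subs.get((char) c);
                    out.write(r != null ? r : (char) c);
                }
            }
        }
    }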

--
Lew
Ceci n'est pas une fenêtre.
..___________.
|###] | [###|
|##/ | *\##|
|#/ * | \#|
|#----|----#|
|| | * ||
|o * | o|
|_____|_____|
|===========|
 

Joshua Cranmer

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

What charset is the input/output file in? If (AND ONLY IF) you can
guarantee that it is a fixed-width charset like ISO 8859-1, then you can
use binary input and use a lookup table to do transformations, probably
on blocks of 1-4K. If you can't guarantee that, then
character-by-character reading is the best way to go.
How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

How large is "pretty large"? If you're only talking about a few MB,
probably any method you come up with is going to be fast. If you're
talking on the order of hundreds of MB, you'll probably need to go with
a lookup table approach.
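
A rough sketch of the binary, block-at-a-time variant, assuming the input
really is ISO 8859-1 (the byte values and file names are illustrative):

    import java.io.*;

    public class BlockReplacer {
        public static void main(String[] args) throws IOException {
            // 256-entry table: identity by default, overridden for the
            // bytes to replace.
            byte[] table = new byte[256];
            for (int i = 0; i < 256; i++) table[i] = (byte) i;
            table[0xE9] = (byte) 'e';  // é in ISO 8859-1
            table[0xE8] = (byte) 'e';  // è

            byte[] buf = new byte[4096];
            try (InputStream in = new FileInputStream(args[0]);
                 OutputStream out = new FileOutputStream(args[1])) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    for (int i = 0; i < n; i++) {
                        buf[i] = table[buf[i] & 0xFF];
                    }
                    out.write(buf, 0, n);
                }
            }
        }
    }

This only works because every ISO 8859-1 character is exactly one byte; with
a variable-width encoding such as UTF-8, indexing the table byte by byte
would corrupt multi-byte sequences.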
 

Arved Sandstrom

Lawrence said:
Correctness comes before efficiency. Get it working first.

That's actually not true. When you're designing your solution - which is
the point that the OP is at - you should most definitely be considering
efficiency. More often than not you cannot make an inefficient solution
efficient without changing it fairly radically, so why not design it
properly first?

It's much more correct to say, get an efficient maintainable solution
working correctly first, then do your last little tweaks if necessary.

AHS
 

Lew

Arved said:
That's actually not true. When you're designing your solution - which is the
point that the OP is at - you should most definitely be considering
efficiency. More often than not you cannot make an inefficient solution
efficient without changing it fairly radically, so why not design it properly
first?

It's much more correct to say, get an efficient maintainable solution working
correctly first, then do your last little tweaks if necessary.

The OP correctly seeks to consider algorithms prior to delving into
implementation. Arved wisely shows how correctness and algorithmic efficiency
can share precedence. Joshua reminds us that problem scale affects
performance analysis and design.

Optimizations discussed here so far have avoided descent into
micro-optimization, and we have heard a helpful caution against such error.
The OP's goal lets us clarify the difference.

Synthesizing the advice of our gurus, the OP now knows that a couple of
algorithms - character-at-a-time and block-oriented - will work, and at what
problem scales they're likely to help. (Predictions prior to measurement are
always at best hypothetical, but you have to assess likelihood while
planning.) Signposts guard the fringes - avoid multi-char code points and
large character sets. Those are the optimizations and the correctness.
Performance analysis is algorithmic and doesn't seem premature.

Micro-optimization would be to start right in with custom char arrays and
manual buffering through idiosyncratic I/O streams. Whatever the standard API
may lack, it at least provides decent infrastructure for that sort of thing.
So the initial implementation will leave that stuff bog standard, and I'll bet
dollars to doughnuts that in the OP's scenario that'll do. Changing that
layer from the get-go would be tweaking, and constitutes premature
optimization, a.k.a. micro-optimization.

 

Lawrence D'Oliveiro

That's actually not true. When you're designing your solution - which is
the point that the OP is at - you should most definitely be considering
efficiency.

It was either Donald Knuth or Tony Hoare who said “premature optimization is
the root of all evil”. You can’t worry about efficiency until you know where
your inefficiencies are.
 

Arne Vajhøj

It was either Donald Knuth or Tony Hoare who said “premature optimization is
the root of all evil”. You can’t worry about efficiency until you know where
your inefficiencies are.

Good design is not premature optimization.

Arne
 

Arne Vajhøj

That's actually not true. When you're designing your solution - which is
the point that the OP is at - you should most definitely be considering
efficiency. More often than not you cannot make an inefficient solution
efficient without changing it fairly radically, so why not design it
properly first?

It's much more correct to say, get an efficient maintainable solution
working correctly first, then do your last little tweaks if necessary.

Yep.

Clean code with good big O characteristics and then measure
and evaluate whether further optimization is necessary.

Arne
 

Arne Vajhøj

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

Most likely there are not so many choices.

You need to decide on I/O classes. You should pick either a class that
buffers for you and not buffer yourself, or a class without buffering
and do your own buffering.

If it is a single-byte charset then you can process either bytes or
chars; otherwise you will need to process chars.

You can replace either with if/switch or array lookup.

I guess that turned out to be 2 x 2 x 2 = 8
possible designs and in reality they are pretty
similar.

Arne
 

Volker Borchert

VIDEO said:
Hi,

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

Don't reinvent the wheel, use tr
 

RedGrittyBrick

Don't reinvent the wheel, use tr

Or at least consider iconv or recode. At a minimum I'd see how they
handle the mapping (if any).

The term "Extended ASCII" covers several handfuls of different 8-bit
character sets/encodings, many of which include é. If the file "may or
may not" contain "extended" characters you will at a minimum have to
assume a specific encoding. If this assumption is wrong you may
translate Ú to e by mistake.
 

Paul Cager

Good design is not premature optimization.

That's true, but in this case I'd like to quote Rob Pike: "Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you have proven that's where the bottleneck is."

Where would you guess the program will be spending most of its time -
manipulating bits in the CPU or waiting for disk IO?
 

Lew

Paul said:
That's true, but in this case I'd like to quote Rob Pike: "Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you have proven that's where the bottleneck is."

Where would you guess the program will be spending most of its time -
manipulating bits in the CPU or waiting for disk IO?

It's too early to ask that question. It suffices to come up with a
reasonably performant algorithm, as suggested upthread. No one so far has
engaged in premature optimization here.

 

Roedy Green

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Have a look at http://mindprod.com/products1.html#ENTITIES

It includes a program called Entify that finds awkward chars and
replaces them with &xxxx; entities in a set of files. There is also a
program that does the reverse, DeEntify.

You could use the code almost as is and simply modify the table of
entities with your unaccented versions of the chars.


--
Roedy Green Canadian Mind Products
http://mindprod.com
Refactor early. If you procrastinate, you will have
even more code to adjust based on the faulty design.
 

Roedy Green

Have a look at http://mindprod.com/products1.html#ENTITIES

It includes a program called Entify that finds awkward chars and
replaces them with &xxxx; entities in a set of files. There is also a
program that does the reverse, DeEntify.

You could use the code almost as is and simply modify the table of
entities with your unaccented versions of the chars.

My version reads the entire file into RAM in one I/O. You could
modify it to read one line of the file at a time. For the code to do
that, see http://mindprod.com/applet/fileio.html

By making whacking huge buffers, you can ensure the bottleneck is the
CPU. I use a big switch statement. For extra speed you could use an
array lookup of the replacement string for each char. The
compiler/JVM is not all that clever about generating code for switch
statements.
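
The switch-based replacement might look something like this (the particular
mappings are illustrative):

    public class SwitchReplace {
        // Returns the replacement for one char; unmatched chars pass through.
        static char replace(char c) {
            switch (c) {
                case 'é': case 'è': case 'ê': case 'ë': return 'e';
                case 'à': case 'á': case 'â':           return 'a';
                case 'ç':                               return 'c';
                default:                                return c;
            }
        }

        public static void main(String[] args) {
            System.out.println(replace('é'));  // prints: e
        }
    }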
 

Joshua Cranmer

My version reads the entire file into RAM in one I/O. You could
modify it to read one line of the file at a time. For the code to do
that, see http://mindprod.com/applet/fileio.html

By making whacking huge buffers, you can ensure the bottleneck is the
CPU. I use a big switch statement. For extra speed you could use an
array lookup of the replacement string for each char. The
compiler/JVM is not all that clever about generating code for switch
statements.

If the file is very large (on the order of 1 GB or larger), you may have
paging bottlenecks, plus the mere bottleneck of having to wait for every
block to be loaded into memory before doing any work, so you spend a
large amount of time in I/O wait unable to do anything at all. I suspect
that using an mmap-like API (i.e., java.nio.MappedByteBuffer) would be
more efficient, especially if the kernel decides to be nice and preloads
pages before they are faulted in.
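
A sketch of the memory-mapped variant, assuming a single-byte encoding and a
file small enough for one mapping (a single MappedByteBuffer is limited to
2 GB, so a real tool would map larger files in windows). Note that this edits
the file in place:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedReplace {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw");
                 FileChannel ch = raf.getChannel()) {
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_WRITE, 0, ch.size());
                for (int i = 0; i < map.limit(); i++) {
                    if ((map.get(i) & 0xFF) == 0xE9) {  // é in ISO 8859-1
                        map.put(i, (byte) 'e');
                    }
                }
                map.force();  // flush the changes back to disk
            }
        }
    }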

The Java compiler will make switch statements into a straight jump table
if it's dense enough, but jump tables can wreak havoc on caches and
branch predicting, so the JIT may unroll jump tables into better
constructs at runtime.
 

Roedy Green

If you are certain that the set of possible input characters is small, and
those you wish to substitute even smaller, you can use a lookup table. Use a
'Map<Character,Character>' (will choke on supplementary code points) for
those, and only those, you wish to substitute. If the key is absent, pass the
source character through unchanged. If present, replace with the associated
value.

If you are just handling 256 or 4096 possible chars, then an ordinary
array will be both easy to code and very fast. The highest named
entity is 9830 &diams; 0x2666 black diamond suit.
There are 2^16 = 65536 possible 16-bit Unicode chars. Which chars do
you transform?
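
For instance, a 64K identity table covering every 16-bit char, with just the
accented entries overridden (the mappings shown are illustrative):

    public class TableLookup {
        private static final char[] TABLE = new char[65536];
        static {
            for (int i = 0; i < TABLE.length; i++) TABLE[i] = (char) i;
            TABLE['é'] = 'e';
            TABLE['è'] = 'e';
            TABLE['ü'] = 'u';
        }

        // One array index per character, no branching.
        static char replace(char c) {
            return TABLE[c];
        }

        public static void main(String[] args) {
            System.out.println(replace('é'));  // prints: e
        }
    }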
 

Roedy Green

The Java compiler will make switch statements into a straight jump table
if it's dense enough, but jump tables can wreak havoc on caches and
branch predicting, so the JIT may unroll jump tables into better
constructs at runtime.

There are two constructs in byte code to support switch:
tableswitch, for dense keys, which does a jump table, and lookupswitch,
for sparse keys, which could be implemented with binary search,
linear search, or any number of other clever means the JVM could
concoct.

I decompiled a number of switches a while back, and was disappointed
to see it nearly always use lookupswitch, even where I would have used
tableswitch. This suggests, if you want to guarantee the efficient
version, design your keys to be dense ints starting at 0. You can
often replace a switch with an array or Map data lookup, or array
lookup of a delegate.

Switch is a queer bit of syntax, with its default fall-through. There is
nothing else like it in Java.
 
