replace extended characters


VIDEO MAN

Hi,

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.
 

Lawrence D'Oliveiro

VIDEO said:
How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Correctness comes before efficiency. Get it working first.
 

Lew

VIDEO said:
I'm trying to create a java [sic] utility that will read in a file that may
or may not contain extended ascii [sic] characters and replace these
characters with a predetermined character [sic] e.g. [sic] replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

Read from a BufferedReader. Write to a BufferedWriter. Process one character
at a time. It won't be efficient unless you are guaranteed a limited
character-set input. The Unicode character space contains over a million code
points (up to U+10FFFF). "Extended ASCII" is a very tiny subset of that, and
its meaning also depends on the character encoding.

If you are certain that the set of possible input characters is small, and
those you wish to substitute even smaller, you can use a lookup table. Use a
'Map<Character,Character>' (will choke on supplementary code points) for
those, and only those, you wish to substitute. If the key is absent, pass the
source character through unchanged. If present, replace with the associated
value.
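
A minimal sketch of that table-driven, character-at-a-time approach (the
encoding, file names, and particular mappings here are illustrative
assumptions, not requirements):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    public class CharReplacer {
        public static void main(String[] args) throws IOException {
            // Map only the characters to substitute; all others pass through.
            Map<Character, Character> subs = new HashMap<Character, Character>();
            subs.put('é', 'e');
            subs.put('è', 'e');
            subs.put('à', 'a');

            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), "ISO-8859-1"));
                 BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), "ISO-8859-1"))) {
                int c;
                while ((c = in.read()) != -1) {
                    // Absent key: pass the source character through unchanged.
                    Character r = subs.get((char) c);
                    out.write(r != null ? r : (char) c);
                }
            }
        }
    }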

--
Lew
Ceci n'est pas une fenêtre.
..___________.
|###] | [###|
|##/ | *\##|
|#/ * | \#|
|#----|----#|
|| | * ||
|o * | o|
|_____|_____|
|===========|
 

Joshua Cranmer

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

What charset is the input/output file in? If (AND ONLY IF) you can
guarantee that it is a fixed-width charset like ISO 8859-1, then you can
use binary input and use a lookup table to do transformations, probably
on blocks of 1-4K. If you can't guarantee that, then
character-by-character reading is the best way to go.
How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

How large is "pretty large"? If you're only talking about a few MB,
probably any method you come up with is going to be fast. If you're
talking on the order of hundreds of MB, you'll probably need to go with
a lookup table approach.
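
A rough sketch of the binary, block-at-a-time variant, assuming the input
really is ISO 8859-1 (the byte values and file names are illustrative):

    import java.io.*;

    public class BlockReplacer {
        public static void main(String[] args) throws IOException {
            // 256-entry table: identity by default, overridden for the
            // bytes to replace.
            byte[] table = new byte[256];
            for (int i = 0; i < 256; i++) table[i] = (byte) i;
            table[0xE9] = (byte) 'e';  // é in ISO 8859-1
            table[0xE8] = (byte) 'e';  // è

            byte[] buf = new byte[4096];
            try (InputStream in = new FileInputStream(args[0]);
                 OutputStream out = new FileOutputStream(args[1])) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    for (int i = 0; i < n; i++) {
                        buf[i] = table[buf[i] & 0xFF];
                    }
                    out.write(buf, 0, n);
                }
            }
        }
    }

This only works because every ISO 8859-1 character is exactly one byte; with
a variable-width encoding such as UTF-8, indexing the table byte by byte
would corrupt multi-byte sequences.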
 

Arved Sandstrom

Lawrence said:
Correctness comes before efficiency. Get it working first.

That's actually not true. When you're designing your solution - which is
the point that the OP is at - you should most definitely be considering
efficiency. More often than not you cannot make an inefficient solution
efficient without changing it fairly radically, so why not design it
properly first?

It's much more correct to say, get an efficient maintainable solution
working correctly first, then do your last little tweaks if necessary.

AHS
 

Lew

Arved said:
That's actually not true. When you're designing your solution - which is the
point that the OP is at - you should most definitely be considering
efficiency. More often than not you cannot make an inefficient solution
efficient without changing it fairly radically, so why not design it properly
first?

It's much more correct to say, get an efficient maintainable solution working
correctly first, then do your last little tweaks if necessary.

The OP correctly seeks to consider algorithms prior to delving into
implementation. Arved wisely shows how correctness and algorithmic efficiency
can share precedence. Joshua reminds us that problem scale affects
performance analysis and design.

Optimizations discussed here so far have avoided descent into
micro-optimization, and we have heard a helpful caution against such error.
The OP's goal lets us clarify the difference.

Synthesizing the advice of our gurus, the OP now knows that a couple of
algorithms - character-at-a-time and block-oriented - will work, and at what
problem scales they're likely to help. (Predictions prior to measurement are
always at best hypothetical, but you have to assess likelihood while
planning.) Signposts guard the fringes - avoid multi-char code points and
large character sets. Those are the optimizations and the correctness.
Performance analysis is algorithmic and doesn't seem premature.

Micro-optimization would be to start right in with custom char arrays and
manual buffering through idiosyncratic I/O streams. Whatever the standard API
may lack, it at least provides decent infrastructure for that sort of thing.
So the initial implementation will leave that stuff bog standard, and I'll bet
dollars to doughnuts that in the OP's scenario that'll do. Changing that
layer from the get-go would be tweaking, and constitutes premature
optimization, a.k.a. micro-optimization.

 

Lawrence D'Oliveiro

That's actually not true. When you're designing your solution - which is
the point that the OP is at - you should most definitely be considering
efficiency.

It was either Donald Knuth or Tony Hoare who said “premature optimization is
the root of all evil”. You can’t worry about efficiency until you know where
your inefficiencies are.
 

Arne Vajhøj

It was either Donald Knuth or Tony Hoare who said “premature optimization is
the root of all evil”. You can’t worry about efficiency until you know where
your inefficiencies are.

Good design is not premature optimization.

Arne
 

Arne Vajhøj

That's actually not true. When you're designing your solution - which is
the point that the OP is at - you should most definitely be considering
efficiency. More often than not you cannot make an inefficient solution
efficient without changing it fairly radically, so why not design it
properly first?

It's much more correct to say, get an efficient maintainable solution
working correctly first, then do your last little tweaks if necessary.

Yep.

Clean code with good big O characteristics and then measure
and evaluate whether further optimization is necessary.

Arne
 

Arne Vajhøj

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

Most likely there are not so many choices.

You need to decide on I/O classes. You should pick either a class that
buffers for you and not buffer yourself, or a class without buffering
and do your own buffering.

If it is a single-byte charset then you can process either bytes or
chars; otherwise you will need to process chars.

You can replace either with if/switch or array lookup.

I guess that turned out to be 2 x 2 x 2 = 8
possible designs and in reality they are pretty
similar.

Arne
 

Volker Borchert

VIDEO said:
Hi,

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Any guidance appreciated.

Don't reinvent the wheel, use tr
 

RedGrittyBrick

Don't reinvent the wheel, use tr

Or at least consider iconv or recode. At a minimum I'd see how they
handle the mapping (if any).

The term "Extended ASCII" covers several handfuls of different 8-bit
character sets/encodings, many of which include é. If the file "may or
may not" contain "extended" characters you will at a minimum have to
assume a specific encoding. If this assumption is wrong you may
translate Ú to e by mistake.
 

Paul Cager

Good design is not premature optimization.

That's true, but in this case I'd like to quote Rob Pike: "Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you have proven that's where the bottleneck is."

Where would you guess the program will be spending most of its time -
manipulating bits in the CPU or waiting for disk IO?
 

Lew

Paul said:
That's true, but in this case I'd like to quote Rob Pike: "Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you have proven that's where the bottleneck is."

Where would you guess the program will be spending most of its time -
manipulating bits in the CPU or waiting for disk IO?

It's too early to ask that question. It suffices to come up with a
reasonably performant algorithm, as suggested upthread. No one so far has
engaged in premature optimization here.

 

Roedy Green

I'm trying to create a java utility that will read in a file that may
or may not contain extended ascii characters and replace these
characters with a predetermined character e.g. replace é with e and
then write the amended file out.

How would people suggest I approach this from an efficiency point of
view given that the input files could be pretty large?

Have a look at http://mindprod.com/products1.html#ENTITIES

It includes a program called Entify that finds awkward chars and
replaces them with &xxxx; entities in a set of files. There is also a
program that does the reverse, DeEntify.

You could use the code almost as is and simply modify the table of
entities with your unaccented versions of the chars.


--
Roedy Green Canadian Mind Products
http://mindprod.com
Refactor early. If you procrastinate, you will have
even more code to adjust based on the faulty design.
 

Roedy Green

Have a look at http://mindprod.com/products1.html#ENTITIES

It includes a program called Entify that finds awkward chars and
replaces them with &xxxx; entities in a set of files. There is also a
program that does the reverse, DeEntify.

You could use the code almost as is and simply modify the table of
entities with your unaccented versions of the chars.

My version reads the entire file into RAM in one I/O. You could
modify it to read one line of the file at a time. For the code to do
that, see http://mindprod.com/applet/fileio.html

By making whacking huge buffers, you can ensure the bottleneck is the
CPU. I use a big switch statement. For extra speed you could use an
array lookup of the replacement string for each char. The
compiler/JVM is not all that clever about generating code for switch
statements.
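
The switch-based replacement might look something like this (the particular
mappings are illustrative):

    public class SwitchReplace {
        // Returns the replacement for one char; unmatched chars pass through.
        static char replace(char c) {
            switch (c) {
                case 'é': case 'è': case 'ê': case 'ë': return 'e';
                case 'à': case 'á': case 'â':           return 'a';
                case 'ç':                               return 'c';
                default:                                return c;
            }
        }

        public static void main(String[] args) {
            System.out.println(replace('é'));  // prints: e
        }
    }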
 

Joshua Cranmer

My version reads the entire file into RAM in one I/O. You could
modify it to read one line of the file at a time. For the code to do
that, see http://mindprod.com/applet/fileio.html

By making whacking huge buffers, you can ensure the bottleneck is the
CPU. I use a big switch statement. For extra speed you could use an
array lookup of the replacement string for each char. The
compiler/JVM is not all that clever about generating code for switch
statements.

If the file is very large (on the order of 1 GB or larger), you may have
paging bottlenecks, plus the mere bottleneck of having to wait for every
block to be loaded into memory before doing any work, so you spend a
large amount of time in I/O wait unable to do anything at all. I suspect
that using an mmap-like API (i.e., java.nio.MappedByteBuffer) would be
more efficient, especially if the kernel decides to be nice and preloads
pages before they are faulted in.
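
A sketch of the memory-mapped variant, assuming a single-byte encoding and a
file small enough for one mapping (a single MappedByteBuffer is limited to
2 GB, so a real tool would map larger files in windows). Note that this edits
the file in place:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedReplace {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw");
                 FileChannel ch = raf.getChannel()) {
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_WRITE, 0, ch.size());
                for (int i = 0; i < map.limit(); i++) {
                    if ((map.get(i) & 0xFF) == 0xE9) {  // é in ISO 8859-1
                        map.put(i, (byte) 'e');
                    }
                }
                map.force();  // flush the changes back to disk
            }
        }
    }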

The Java compiler will make switch statements into a straight jump table
if it's dense enough, but jump tables can wreak havoc on caches and
branch predicting, so the JIT may unroll jump tables into better
constructs at runtime.
 

Roedy Green

If you are certain that the set of possible input characters is small, and
those you wish to substitute even smaller, you can use a lookup table. Use a
'Map<Character,Character>' (will choke on supplementary code points) for
those, and only those, you wish to substitute. If the key is absent, pass the
source character through unchanged. If present, replace with the associated
value.

If you are just handling 256 or 4096 possible chars, then an ordinary
array will be both easy to code and very fast. The highest named
entity is 9830 &diams; 0x2666 black diamond suit.
There are 2^16 = 65536 possible 16-bit Unicode chars. Which chars do
you transform?
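
For instance, a 64K identity table covering every 16-bit char, with just the
accented entries overridden (the mappings shown are illustrative):

    public class TableLookup {
        private static final char[] TABLE = new char[65536];
        static {
            for (int i = 0; i < TABLE.length; i++) TABLE[i] = (char) i;
            TABLE['é'] = 'e';
            TABLE['è'] = 'e';
            TABLE['ü'] = 'u';
        }

        // One array index per character, no branching.
        static char replace(char c) {
            return TABLE[c];
        }

        public static void main(String[] args) {
            System.out.println(replace('é'));  // prints: e
        }
    }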
 

Roedy Green

The Java compiler will make switch statements into a straight jump table
if it's dense enough, but jump tables can wreak havoc on caches and
branch predicting, so the JIT may unroll jump tables into better
constructs at runtime.

There are two constructs in byte code to support switch:
tableswitch, for dense keys, which does a jump table, and lookupswitch,
for sparse keys, which could be implemented with binary search,
linear search, or any number of other clever means the JVM could
concoct.

I decompiled a number of switches a while back, and was disappointed
to see it nearly always use lookupswitch, even where I would have used
tableswitch. This suggests, if you want to guarantee the efficient
version, design your keys to be dense ints starting at 0. You can
often replace a switch with an array or Map data lookup, or array
lookup of a delegate.

Switch is a queer bit of syntax, with its default fall-through. There is
nothing else like it in Java.
 
