replace extended characters

Discussion in 'Java' started by VIDEO MAN, Feb 10, 2011.

  1. VIDEO MAN

    VIDEO MAN Guest

    Hi,

    I'm trying to create a java utility that will read in a file that may
    or may not contain extended ascii characters and replace these
    characters with a predetermined character e.g. replace é with e and
    then write the amended file out.

    How would people suggest I approach this from an efficiency point of
    view given that the input files could be pretty large?

    Any guidance appreciated.
    VIDEO MAN, Feb 10, 2011
    #1

  2. In message
    <>, VIDEO
    MAN wrote:

    > How would people suggest I approach this from an efficiency point of
    > view given that the input files could be pretty large?


    Correctness comes before efficiency. Get it working first.
    Lawrence D'Oliveiro, Feb 11, 2011
    #2

  3. VIDEO MAN

    Lew Guest

    VIDEO MAN wrote:
    > I'm trying to create a java [sic] utility that will read in a file that may
    > or may not contain extended ascii [sic] characters and replace these
    > characters with a predetermined character [sic] e.g. [sic] replace é with e and
    > then write the amended file out.
    >
    > How would people suggest I approach this from an efficiency point of
    > view given that the input files could be pretty large?
    >
    > Any guidance appreciated.


    Read from a BufferedReader. Write to a BufferedWriter. Process one character
    at a time. It won't be efficient unless you are guaranteed a limited
    character-set input. The Unicode character space is on the order of 2^20
    code points. "Extended ASCII" is a very tiny subset of that, and also
    depends on the character encoding.

    If you are certain that the set of possible input characters is small, and
    those you wish to substitute even smaller, you can use a lookup table. Use a
    'Map<Character,Character>' (will choke on supplementary code points) for
    those, and only those, you wish to substitute. If the key is absent, pass the
    source character through unchanged. If present, replace with the associated
    value.
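    A minimal sketch of that approach (the table entries here are purely
    illustrative, and this shares the stated Map limitation of choking on
    supplementary code points):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

public class CharReplacer {
    // Illustrative substitution table; the real mappings depend on your needs.
    // Note: Map<Character, Character> cannot represent supplementary code points.
    static final Map<Character, Character> SUBS = new HashMap<>();
    static {
        SUBS.put('é', 'e');
        SUBS.put('è', 'e');
        SUBS.put('ü', 'u');
    }

    // Read one character at a time; pass unmapped characters through unchanged.
    // In real use, wrap file streams in BufferedReader/BufferedWriter.
    static void replace(Reader in, Writer out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            out.write(SUBS.getOrDefault(ch, ch));
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        replace(new StringReader("café crème"), sw);
        System.out.println(sw); // prints "cafe creme"
    }
}
```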

    --
    Lew
    Ceci n'est pas une fenêtre.
    ..___________.
    |###] | [###|
    |##/ | *\##|
    |#/ * | \#|
    |#----|----#|
    || | * ||
    |o * | o|
    |_____|_____|
    |===========|
    Lew, Feb 11, 2011
    #3
  4. On 02/10/2011 06:33 PM, VIDEO MAN wrote:
    > I'm trying to create a java utility that will read in a file that may
    > or may not contain extended ascii characters and replace these
    > characters with a predetermined character e.g. replace é with e and
    > then write the amended file out.


    What charset is the input/output file in? If (AND ONLY IF) you can
    guarantee that it is a fixed-width charset like ISO 8859-1, then you can
    use binary input and use a lookup table to do transformations, probably
    on blocks of 1-4K. If you can't guarantee that, then
    character-by-character reading is the best way to go.

    > How would people suggest I approach this from an efficiency point of
    > view given that the input files could be pretty large?


    How large is "pretty large"? If you're only talking about a few MB,
    probably any method you come up with is going to be fast. If you're
    talking on the order of hundreds of MB, you'll probably need to go with
    a lookup table approach.
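    A minimal sketch of that block-oriented lookup-table idea, valid only under
    the single-byte-charset assumption (ISO 8859-1 here; the two table entries
    are just examples):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ByteTableReplacer {
    // 256-entry table; only correct if the file uses a single-byte charset
    // (ISO 8859-1 assumed). Identity mapping except for our substitutions.
    static final byte[] TABLE = new byte[256];
    static {
        for (int i = 0; i < 256; i++) TABLE[i] = (byte) i;
        TABLE[0xE9] = (byte) 'e'; // é -> e in ISO 8859-1
        TABLE[0xE8] = (byte) 'e'; // è -> e
    }

    // Translate the stream block by block through the lookup table.
    static void replace(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                buf[i] = TABLE[buf[i] & 0xFF];
            }
            out.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // "café" in ISO 8859-1
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        replace(new ByteArrayInputStream(latin1), out);
        System.out.println(out.toString("ISO-8859-1")); // prints "cafe"
    }
}
```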

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Feb 11, 2011
    #4
  5. On 11-02-10 08:06 PM, Lawrence D'Oliveiro wrote:
    > In message
    > <>, VIDEO
    > MAN wrote:
    >
    >> How would people suggest I approach this from an efficiency point of
    >> view given that the input files could be pretty large?

    >
    > Correctness comes before efficiency. Get it working first.


    That's actually not true. When you're designing your solution - which is
    the point that the OP is at - you should most definitely be considering
    efficiency. More often than not you cannot make an inefficient solution
    efficient without changing it fairly radically, so why not design it
    properly first?

    It's much more correct to say, get an efficient maintainable solution
    working correctly first, then do your last little tweaks if necessary.

    AHS

    --
    We must recognize the chief characteristic of the modern era - a
    permanent state of what I call violent peace.
    -- James D. Watkins
    Arved Sandstrom, Feb 11, 2011
    #5
  6. VIDEO MAN

    Lew Guest

    Arved Sandstrom wrote:
    > That's actually not true. When you're designing your solution - which is the
    > point that the OP is at - you should most definitely be considering
    > efficiency. More often than not you cannot make an inefficient solution
    > efficient without changing it fairly radically, so why not design it properly
    > first?
    >
    > It's much more correct to say, get an efficient maintainable solution working
    > correctly first, then do your last little tweaks if necessary.


    The OP correctly seeks to consider algorithms prior to delving into
    implementation. Arved wisely shows how correctness and algorithmic efficiency
    can share precedence. Joshua reminds us that problem scale affects
    performance analysis and design.

    Optimizations discussed here so far have avoided descent into
    micro-optimization, and we have heard a helpful caution against such error.
    The OP's goal lets us clarify the difference.

    Synthesizing the advice of our gurus, the OP now knows that a couple of
    algorithms - character-at-a-time and block-oriented - will work, and at what
    problem scales they're likely to help. (Predictions prior to measurement are
    always at best hypothetical, but you have to assess likelihood while
    planning.) Signposts guard the fringes - avoid multi-char code points and
    large character sets. Those are the optimizations and the correctness.
    Performance analysis is algorithmic and doesn't seem premature.

    Micro-optimization would be to start in right away with custom char arrays and
    manual buffering through idiosyncratic I/O streams. Whatever the standard API
    may lack, it at least provides decent infrastructure for that sort of thing.
    So initial implementation will leave that stuff bog standard, and I'll bet
    dollars to doughnuts that in the OP's scenario that'll do. Changing that
    layer from the get-go would be tweaking and constitutes premature
    optimization, a.k.a. micro-optimization.

    --
    Lew
    Lew, Feb 11, 2011
    #6
  7. In message <pg05p.441$>, Arved Sandstrom wrote:

    > On 11-02-10 08:06 PM, Lawrence D'Oliveiro wrote:
    >
    >> In message
    >> <>, VIDEO
    >> MAN wrote:
    >>
    >>> How would people suggest I approach this from an efficiency point of
    >>> view given that the input files could be pretty large?

    >>
    >> Correctness comes before efficiency. Get it working first.

    >
    > That's actually not true. When you're designing your solution - which is
    > the point that the OP is at - you should most definitely be considering
    > efficiency.


    It was either Donald Knuth or Tony Hoare who said “premature optimization is
    the root of all evil”. You can’t worry about efficiency until you know where
    your inefficiencies are.
    Lawrence D'Oliveiro, Feb 11, 2011
    #7
  8. On 10-02-2011 21:35, Lawrence D'Oliveiro wrote:
    > In message<pg05p.441$>, Arved Sandstrom wrote:
    >> On 11-02-10 08:06 PM, Lawrence D'Oliveiro wrote:
    >>> In message
    >>> <>, VIDEO
    >>> MAN wrote:
    >>>> How would people suggest I approach this from an efficiency point of
    >>>> view given that the input files could be pretty large?
    >>>
    >>> Correctness comes before efficiency. Get it working first.

    >>
    >> That's actually not true. When you're designing your solution - which is
    >> the point that the OP is at - you should most definitely be considering
    >> efficiency.

    >
    > It was either Donald Knuth or Tony Hoare who said “premature optimization is
    > the root of all evil”. You can’t worry about efficiency until you know where
    > your inefficiencies are.


    Good design is not premature optimization.

    Arne
    Arne Vajhøj, Feb 11, 2011
    #8
  9. VIDEO MAN

    Arne Vajhøj Guest

    On 10-02-2011 20:27, Arved Sandstrom wrote:
    > On 11-02-10 08:06 PM, Lawrence D'Oliveiro wrote:
    >> In message
    >> <>, VIDEO
    >> MAN wrote:
    >>> How would people suggest I approach this from an efficiency point of
    >>> view given that the input files could be pretty large?

    >>
    >> Correctness comes before efficiency. Get it working first.

    >
    > That's actually not true. When you're designing your solution - which is
    > the point that the OP is at - you should most definitely be considering
    > efficiency. More often than not you cannot make an inefficient solution
    > efficient without changing it fairly radically, so why not design it
    > properly first?
    >
    > It's much more correct to say, get an efficient maintainable solution
    > working correctly first, then do your last little tweaks if necessary.


    Yep.

    Clean code with good big O characteristics and then measure
    and evaluate whether further optimization is necessary.

    Arne
    Arne Vajhøj, Feb 11, 2011
    #9
  10. VIDEO MAN

    Arne Vajhøj Guest

    On 10-02-2011 18:33, VIDEO MAN wrote:
    > I'm trying to create a java utility that will read in a file that may
    > or may not contain extended ascii characters and replace these
    > characters with a predetermined character e.g. replace é with e and
    > then write the amended file out.
    >
    > How would people suggest I approach this from an efficiency point of
    > view given that the input files could be pretty large?
    >
    > Any guidance appreciated.


    Most likely there are not so many choices.

    You need to decide on I/O classes. Either pick a class that buffers
    and don't buffer yourself, or pick a class that doesn't buffer and do
    the buffering yourself.

    If it is a single-byte charset then you can process either bytes or
    chars; otherwise you will need to process chars.

    You can replace either with if/switch or array lookup.

    I guess that turned out to be 2 x 2 x 2 = 8
    possible designs and in reality they are pretty
    similar.

    Arne
    Arne Vajhøj, Feb 11, 2011
    #10
  11. VIDEO MAN wrote:
    > Hi,
    >
    > I'm trying to create a java utility that will read in a file that may
    > or may not contain extended ascii characters and replace these
    > characters with a predetermined character e.g. replace é with e and
    > then write the amended file out.
    >
    > How would people suggest I approach this from an efficiency point of
    > view given that the input files could be pretty large?
    >
    > Any guidance appreciated.


    Don't reinvent the wheel, use tr

    --

    "I'm a doctor, not a mechanic." Dr Leonard McCoy <>
    "I'm a mechanic, not a doctor." Volker Borchert <>
    Volker Borchert, Feb 11, 2011
    #11
  12. On 11/02/2011 03:24, Volker Borchert wrote:
    > VIDEO MAN wrote:
    >> Hi,
    >>
    >> I'm trying to create a java utility that will read in a file that may
    >> or may not contain extended ascii characters and replace these
    >> characters with a predetermined character e.g. replace é with e and
    >> then write the amended file out.
    >>
    >> How would people suggest I approach this from an efficiency point of
    >> view given that the input files could be pretty large?
    >>
    >> Any guidance appreciated.

    >
    > Don't reinvent the wheel, use tr
    >


    Or at least consider iconv or recode. At a minimum I'd see how they
    handle the mapping (if any).

    The term "Extended ASCII" covers several handfuls of different 8-bit
    character sets/encodings, many of which include é. If the file "may or
    may not" contain "extended" characters you will at a minimum have to
    assume a specific encoding. If this assumption is wrong you may
    translate Ú to e by mistake.
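    To illustrate why the encoding assumption matters: the very same byte
    decodes to different results under different charsets (a small demo, not
    tied to any particular file):

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE9}; // é in ISO 8859-1, but invalid UTF-8 on its own
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String utf8 = new String(raw, StandardCharsets.UTF_8);
        System.out.println(latin1.equals("é"));         // true
        System.out.println(utf8.charAt(0) == '\uFFFD'); // true: the replacement character
    }
}
```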

    --
    RGB
    RedGrittyBrick, Feb 11, 2011
    #12
  13. VIDEO MAN

    Paul Cager Guest

    On Feb 11, 2:42 am, Arne Vajhøj <> wrote:
    > On 10-02-2011 21:35, Lawrence D'Oliveiro wrote:
    > > In message<pg05p.441$>, Arved Sandstrom wrote:
    > >> On 11-02-10 08:06 PM, Lawrence D'Oliveiro wrote:
    > >>> In message
    > >>> <>, VIDEO
    > >>> MAN wrote:
    > >>>> How would people suggest I approach this from an efficiency  point of
    > >>>> view given that the input files could be pretty large?

    >
    > >>> Correctness comes before efficiency. Get it working first.

    >
    > >> That's actually not true. When you're designing your solution - which is
    > >> the point that the OP is at - you should most definitely be considering
    > >> efficiency.

    >
    > > It was either Donald Knuth or Tony Hoare who said “premature optimization is
    > > the root of all evil”. You can’t worry about efficiency until you know where
    > > your inefficiencies are.

    >
    > Good design is not premature optimization.


    That's true, but in this case I'd like to quote Rob Pike: "Bottlenecks
    occur in surprising places, so don't try to second guess and put in a
    speed hack until you have proven that's where the bottleneck is."

    Where would you guess the program will be spending most of its time -
    manipulating bits in the CPU or waiting for disk IO?
    Paul Cager, Feb 11, 2011
    #13
  14. VIDEO MAN

    Lew Guest

    Arne Vajhøj wrote:
    >> Good design is not premature optimization.


    Paul Cager wrote:
    > That's true, but in this case I'd like to quote Rob Pike: "Bottlenecks
    > occur in surprising places, so don't try to second guess and put in a
    > speed hack until you have proven that's where the bottleneck is."
    >
    > Where would you guess the program will be spending most of its time -
    > manipulating bits in the CPU or waiting for disk IO?


    It's too early to ask that question. It suffices to come up with the
    reasonably performant algorithm as suggested upthread. No one so far has
    engaged in premature optimization here.

    --
    Lew
    Lew, Feb 11, 2011
    #14
  15. VIDEO MAN

    Roedy Green Guest

    On Thu, 10 Feb 2011 15:33:39 -0800 (PST), VIDEO MAN
    <> wrote, quoted or indirectly quoted someone
    who said :

    >I'm trying to create a java utility that will read in a file that may
    >or may not contain extended ascii characters and replace these
    >characters with a predetermined character e.g. replace é with e and
    >then write the amended file out.
    >
    >How would people suggest I approach this from an efficiency point of
    >view given that the input files could be pretty large?


    Have at look at http://mindprod.com/products1.html#ENTITIES

    It includes a program called Entify that finds awkward chars and
    replaces them with &xxxx; entities in a set of files. There is also a
    program that does the reverse, DeEntify.

    You could use the code almost as is and simply modify the table of
    entities with your unaccented versions of the chars.


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Refactor early. If you procrastinate, you will have
    even more code to adjust based on the faulty design.
    ..
    Roedy Green, Feb 11, 2011
    #15
  16. VIDEO MAN

    Roedy Green Guest

    On Fri, 11 Feb 2011 15:07:22 -0800, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >
    >Have at look at http://mindprod.com/products1.html#ENTITIES
    >
    >It includes a program called Entify that finds awkward chars and
    >replaces them with &xxxx; entities in a set of files. There is also a
    >program that does the reverse, DeEntify.
    >
    >You could use the code almost as is and simply modify the table of
    >entities with your unaccented versions of the chars.


    My version reads the entire file into RAM in one I/O. You could
    modify it to read one line of the file at a time. For the code to do
    that see http://mindprod.com/applet/fileio.html

    By making whacking huge buffers, you can ensure the bottleneck is the
    CPU. I use a big switch statement. For extra speed you could use an
    array lookup of the replacement string for each char. The
    compiler/JVM is not all that clever about generating code for switch
    statements.
    Roedy Green, Feb 11, 2011
    #16
  17. On 02/11/2011 06:11 PM, Roedy Green wrote:
    > My version reads the entire file into RAM in one I/O. You could
    > modify it to read one line of a file at it a time. For the code to do
    > that talk to http://mindprod.com/applet/fileio.html
    >
    > By making whacking huge buffers, you can ensure the bottleneck is the
    > CPU. I use a big switch statement. For extra speed you could use an
    > array lookup of the replacement string for each char. The
    > compiler/JVM is not all that clever about generating code for switch
    > statements.


    If the file is very large (on the order of 1GB or larger), you may have
    paging bottlenecks and the mere bottleneck of having to wait for every
    block to be loaded into memory before doing any work, so you spend a
    large amount of time in I/O wait unable to do anything at all. I suspect
    that using an mmap-like API (i.e., java.nio.MappedByteBuffer) would be
    more efficient, especially if the kernel decides to be nice and preload
    pages without being in a page fault.

    The Java compiler will make switch statements into a straight jump table
    if it's dense enough, but jump tables can wreak havoc on caches and
    branch predicting, so the JIT may unroll jump tables into better
    constructs at runtime.
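    A rough sketch of the MappedByteBuffer idea, under the assumption of a
    single-byte charset so bytes can be rewritten in place (the é-to-e mapping
    is just an example, and a single mapping tops out around 2GB, so bigger
    files would need to be mapped in chunks):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedReplacer {
    // Transform the file's bytes in place via a memory mapping.
    // Assumes a single-byte charset (ISO 8859-1 here) so a byte-for-byte
    // substitution cannot corrupt multi-byte sequences.
    static void replaceInPlace(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, ch.size());
            for (int i = 0; i < buf.limit(); i++) {
                if ((buf.get(i) & 0xFF) == 0xE9) { // é in ISO 8859-1
                    buf.put(i, (byte) 'e');
                }
            }
            buf.force(); // push the changes back to disk
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, new byte[]{'c', 'a', 'f', (byte) 0xE9}); // "café" in ISO 8859-1
        replaceInPlace(tmp);
        System.out.println(new String(Files.readAllBytes(tmp),
                StandardCharsets.ISO_8859_1)); // prints "cafe"
    }
}
```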

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Feb 11, 2011
    #17
  18. VIDEO MAN

    Roedy Green Guest

    On Thu, 10 Feb 2011 19:18:36 -0500, Lew <> wrote,
    quoted or indirectly quoted someone who said :

    >If you are certain that the set of possible input characters is small, and
    >those you wish to substitute even smaller, you can use a lookup table. Use a
    >'Map<Character,Character>' (will choke on supplementary code points) for
    >those, and only those, you wish to substitute. If the key is absent, pass the
    >source character through unchanged. If present, replace with the associated
    >value.


    If you are just handling 256 or 4096 possible chars, then an ordinary
    array will be both easy to code and very fast. The highest named
    entity is 9830 &diams; 0x2666 black diamond suit.
    There are 2^16 = 65536 possible 16-bit Unicode chars. Which chars do
    you transform?
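    Such an array lookup might look like this (the substitutions shown are
    purely illustrative; a full 64K-entry char table costs only 128KB):

```java
public class ArrayLookup {
    // One slot per 16-bit char; identity mapping except for the few
    // substitutions we care about.
    static final char[] TABLE = new char[65536];
    static {
        for (int i = 0; i < TABLE.length; i++) TABLE[i] = (char) i;
        TABLE['é'] = 'e';
        TABLE['è'] = 'e';
    }

    // Replace every char through the table; unmapped chars pass through.
    static String replace(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) out[i] = TABLE[out[i]];
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(replace("café crème")); // prints "cafe creme"
    }
}
```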
    Roedy Green, Feb 12, 2011
    #18
  19. VIDEO MAN

    Roedy Green Guest

    On 11 Feb 2011 03:24:11 GMT, (Volker
    Borchert) wrote, quoted or indirectly quoted someone who said :

    >Don't reinvent the wheel, use tr


    Does that not presume Unix? Or is there a decent Java/Windows
    implementation?

    There is also native2ascii see
    http://mindprod.com/jgloss/native2ascii.html

    Roedy Green, Feb 12, 2011
    #19
  20. VIDEO MAN

    Roedy Green Guest

    On Fri, 11 Feb 2011 18:40:50 -0500, Joshua Cranmer
    <> wrote, quoted or indirectly quoted someone
    who said :

    >The Java compiler will make switch statements into a straight jump table
    >if it's dense enough, but jump tables can wreak havoc on caches and
    >branch predicting, so the JIT may unroll jump tables into better
    >constructs at runtime.


    There are two constructs in byte code to support switch:
    tableswitch for dense keys, which does a jump table, and lookupswitch
    for sparse keys, which could be implemented with binary search, or
    linear search, or any number of other clever means the JVM could
    concoct.

    I decompiled a number of switches a while back, and was disappointed
    to see it nearly always use lookupswitch, even where I would have used
    tableswitch. This suggests, if you want to guarantee the efficient
    version, design your keys to be dense ints starting at 0. You can
    often replace a switch with an array or Map data lookup, or array
    lookup of a delegate.

    Switch is a queer bit of syntax, with its default fallthru. There is
    nothing else like it in Java.
    Roedy Green, Feb 12, 2011
    #20
