Transforming German characters

Discussion in 'Perl Misc' started by steve_f, Aug 6, 2004.

  1. steve_f

    steve_f Guest

    I want to transform special German characters to obtain the following
    variations:

    groß bräu
    gross bräu
    gross braeu

    there are two sets -

    set one:
    ß = ss = \xDF

    set two:
    Ä = Ae = \xC4
    Ö = Oe = \xD6
    Ü = Ue = \xDC
    ä = ae = \xE4
    ö = oe = \xF6
    ü = ue = \xFC

    Basically, the rules are: transform ß independently; the characters
    in set two are either all transformed or all left alone together.

    I wrote the following, which works well, but looks
    pretty bad I think. So again this is a style question...
    can anyone suggest a cleaner approach? TIA

    sub transform_characters {
        my @input = @_;
        my @output;
        for my $string (@input) {
            push @output, $string;
            if ($string =~ /\xDF/) {
                $string =~ s/\xDF/ss/g;
                push @output, $string;
                if (test_for_character($string)) {
                    $string = swap_all($string);
                    push @output, $string;
                }
                next;
            }
            if (test_for_character($string)) {
                $string = swap_all($string);
                push @output, $string;
            }
        }
        return @output;
    }

    sub test_for_character {
        my $string = shift;
        if ($string =~ /\xC4/ ||
            $string =~ /\xD6/ ||
            $string =~ /\xDC/ ||
            $string =~ /\xE4/ ||
            $string =~ /\xF6/ ||
            $string =~ /\xFC/) {
            return 1;
        } else {
            return 0;
        }
    }

    sub swap_all {
        my $string = shift;
        $string =~ s/\xC4/Ae/g;
        $string =~ s/\xD6/Oe/g;
        $string =~ s/\xDC/Ue/g;
        $string =~ s/\xE4/ae/g;
        $string =~ s/\xF6/oe/g;
        $string =~ s/\xFC/ue/g;
        return $string;
    }
    steve_f, Aug 6, 2004
    #1

  2. steve_f wrote:
    > I want to transform special German characters to obtain the following
    > variations:
    >
    > groß bräu
    > gross bräu
    > gross braeu
    >
    > there are two sets -
    >
    > set one:
    > ß = ss = \xDF
    >
    > set two:
    > Ä = Ae = \xC4
    > Ö = Oe = \xD6
    > Ü = Ue = \xDC
    > ä = ae = \xE4
    > ö = oe = \xF6
    > ü = ue = \xFC
    >
    > basically, the rules are transform ß independently
    > and with set two, they are either all on or off together.
    >
    > I wrote the following, which works well, but looks
    > pretty bad I think.


    It doesn't look too bad, I've seen worse. :)


    > so again this is a style question...
    > can anyone suggest a cleaner approach? TIA


    The usual idiom is to use a hash for the search and replace tables.


    > sub transform_characters {
    > my @input = @_;
    > my @output;
    > for my $string (@input) {
    > push @output, $string;
    > if ($string =~ /\xDF/) {
    > $string =~ s/\xDF/ss/g;


    Using a match followed by a substitution is a common beginner mistake.
    You only need the substitution: s/// returns the number of replacements
    it made, which is false when nothing matched.

    if ( $string =~ s/\xDF/ss/g ) {
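A minimal sketch of why the separate match is redundant. The two sample strings are made up for illustration; the point is only that the return value of s///g can be used as the condition directly.

```perl
use strict;
use warnings;

# Illustrative inputs (not from the original post): one with \xDF, one without.
my $with    = "gro\xDF";
my $without = "gross";

# s///g returns the number of replacements made, or a false value if there
# were none, so no preliminary match is needed.
my $changed_with    = ( $with    =~ s/\xDF/ss/g ) ? 1 : 0;
my $changed_without = ( $without =~ s/\xDF/ss/g ) ? 1 : 0;

print "with: $with (changed=$changed_with)\n";
print "without: $without (changed=$changed_without)\n";
```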


    > push @output, $string;
    > if (test_for_character($string)) {
    > $string = swap_all($string);
    > push @output, $string;
    > }
    > next;
    > }
    > if (test_for_character($string)) {
    > $string = swap_all($string);
    > push @output, $string;
    > }
    > }
    > return @output;
    > }
    >
    > [snip code]


    Using a hash you could write that as:

    my %set1 = (
        "\xDF" => 'ss',
    );
    # Use a character class because all keys are single characters
    # If keys are multiple characters use alternation instead
    my $key1 = '[' . join( '', keys %set1 ) . ']';

    my %set2 = (
        "\xC4" => 'Ae',
        "\xD6" => 'Oe',
        "\xDC" => 'Ue',
        "\xE4" => 'ae',
        "\xF6" => 'oe',
        "\xFC" => 'ue',
    );
    my $key2 = '[' . join( '', keys %set2 ) . ']';

    sub transform_characters {
        my @input = @_;
        my @output;
        for my $string ( @input ) {
            push @output, $string;
            if ( $string =~ s/($key1)/$set1{$1}/og ) {
                push @output, $string;
                if ( $string =~ s/($key2)/$set2{$1}/og ) {
                    push @output, $string;
                }
                next;
            }
            if ( $string =~ s/($key2)/$set2{$1}/og ) {
                push @output, $string;
            }
        }
        return @output;
    }
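Since both branches above end with the same set-two substitution, the function can perhaps be reduced to a loop over an ordered list of stages. The following is only a sketch under that assumption; the names @stages and transform_variants are hypothetical and do not appear in the thread.

```perl
use strict;
use warnings;

my %set1 = ( "\xDF" => 'ss' );
my %set2 = (
    "\xC4" => 'Ae', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xE4" => 'ae', "\xF6" => 'oe', "\xFC" => 'ue',
);

# Each stage is [ compiled pattern, replacement table ]; stages run in order.
my @stages = map {
    my $class = '[' . join( '', keys %$_ ) . ']';
    [ qr/($class)/, $_ ]
} \%set1, \%set2;

# transform_variants is a hypothetical name used for this sketch only.
sub transform_variants {
    my @input = @_;    # copy, so the caller's strings are not modified
    my @output;
    for my $string (@input) {
        push @output, $string;
        for my $stage (@stages) {
            my ( $re, $table ) = @$stage;
            # Record a new variant only when this stage changed something.
            push @output, $string if $string =~ s/$re/$table->{$1}/g;
        }
    }
    return @output;
}

my @variants = transform_variants("gro\xDF br\xE4u");
print scalar(@variants), " variants\n";    # prints "3 variants"
```

Adding a further set would then mean adding one more hash reference to the map list rather than another nested if.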



    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Aug 6, 2004
    #2

  3. steve_f wrote:
    > I want to transform special German characters to obtain the
    > following variations:
    >
    > groß bräu
    > gross bräu
    > gross braeu
    >
    > there are two sets -
    >
    > set one:
    > ß = ss = \xDF
    >
    > set two:
    > Ä = Ae = \xC4
    > Ö = Oe = \xD6
    > Ü = Ue = \xDC
    > ä = ae = \xE4
    > ö = oe = \xF6
    > ü = ue = \xFC
    >
    > basically, the rules are transform ß independently
    > and with set two, they are either all on or off together.


    As John said, there is no reason to look for the characters with
    separate regexes, and accordingly there is no reason to distinguish
    between two sets.

    > for my $string (@input) {
    > push @output, $string;


    Here you copy the whole original text to @output ...

    > if ($string =~ /\xDF/) {
    > $string =~ s/\xDF/ss/g;
    > push @output, $string;


    ... and here you *add* the converted string. In the suggestion below,
    I'm assuming that was a mistake.

    sub transform_characters {
        my @text = @_;

        my %replace = (
            "\xDF" => 'ss',
            "\xC4" => 'Ae',
            "\xD6" => 'Oe',
            "\xDC" => 'Ue',
            "\xE4" => 'ae',
            "\xF6" => 'oe',
            "\xFC" => 'ue',
        );

        for (@text) {
            s/(\xDF|\xC4|\xD6|\xDC|\xE4|\xF6|\xFC)/$replace{$1}/g;
        }

        @text
    }

    my @output = transform_characters(@input);

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Aug 6, 2004
    #3
  4. steve_f

    steve_f Guest

    Thanks Gunnar, some great stuff here....I can use simple
    statements to just brute force things, but I know there is
    a more elegant way.

    On Fri, 06 Aug 2004 21:35:18 +0200, Gunnar Hjalmarsson wrote:

    >steve_f wrote:
    >> I want to transform special German characters to obtain the
    >> following variations:
    >>
    >> groß bräu
    >> gross bräu
    >> gross braeu
    >>
    >> there are two sets -
    >>
    >> set one:
    >> ß = ss = \xDF
    >>
    >> set two:
    >> Ä = Ae = \xC4
    >> Ö = Oe = \xD6
    >> Ü = Ue = \xDC
    >> ä = ae = \xE4
    >> ö = oe = \xF6
    >> ü = ue = \xFC
    >>
    >> basically, the rules are transform ß independently
    >> and with set two, they are either all on or off together.

    >
    >As John said, there is no reason to look for the characters with
    >separate regexes, and accordingly there is no reason to distinguish
    >between two sets.


    The ß can either be on or off independent of the others so
    you can get:

    groß bräu
    gross bräu
    gross braeu

    I should have stated the problem more directly:

    if set one      - set one on  & set two on
                    - set one off & set two on
                    - set one off & set two off

    if only set two - set two all on
                    - set two all off

    >> for my $string (@input) {
    >> push @output, $string;

    >
    >Here you copy the whole original text to @output ...
    >
    >> if ($string =~ /\xDF/) {
    >> $string =~ s/\xDF/ss/g;
    >> push @output, $string;

    >
    >... and here you *add* the converted string. In the suggestion below,
    >I'm assuming that was a mistake.
    >
    > sub transform_characters {
    > my @text = @_;
    >
    > my %replace = (
    > "\xDF" => 'ss',
    > "\xC4" => 'Ae',
    > "\xD6" => 'Oe',
    > "\xDC" => 'Ue',
    > "\xE4" => 'ae',
    > "\xF6" => 'oe',
    > "\xFC" => 'ue',
    > );
    >

    I really like the idea of the hash. Yes, I have heard you are not
    thinking in Perl if you are not using hashes.

    > for (@text) {
    > s/(\xDF|\xC4|\xD6|\xDC|\xE4|\xF6|\xFC)/$replace{$1}/g;
    > }

    this is super! thanks
    >
    > @text
    > }
    >
    > my @output = transform_characters(@input);
    steve_f, Aug 7, 2004
    #4
  5. steve_f

    steve_f Guest

    Thank you John, this is really useful. To start with, I must always remind
    myself to generalize whenever I am doing the same thing too many times.

    >John W. Krahn wrote:


    [ snip - my statement of problem ]

    >>
    >> I wrote the following, which works well, but looks
    >> pretty bad I think.

    >
    >It doesn't look too bad, I've seen worse. :)
    >

    I was able to brute force my way through it ;-)
    >
    >> so again this is a style question...
    >> can anyone suggest a cleaner approach? TIA

    >
    >The usual idiom is to use a hash for the search and replace tables.
    >


    yes, I see and it is very good...changes the whole approach

    >
    >> sub transform_characters {
    >> my @input = @_;
    >> my @output;
    >> for my $string (@input) {
    >> push @output, $string;
    >> if ($string =~ /\xDF/) {
    >> $string =~ s/\xDF/ss/g;

    >
    >Using a match followed by a substitution is a usual beginner mistake.
    >You only need the substitution.
    >
    > if ( $string =~ s/\xDF/ss/g ) {
    >


    ahh...ok...that's good to learn

    [ snip code ]

    >
    >Using a hash you could write that as:
    >
    >my %set1 = (
    > "\xDF" => 'ss',
    > );
    ># Use a character class because all keys are single characters
    ># If keys are multiple characters use alternation instead


    can you explain this a bit further? I'm not quite sure what you mean
    by alternation, but I really only looked up the escaped values for
    this particular problem.

    >my $key1 = '[' . join( '', keys %set1 ) . ']';


    also here I start to get really lost....ok, you are loading into a scalar
    the keys as one long string...joining them with no space between...
    with two brackets so

    $key1 = [\xDF]
    $key2 = [\xC4\xD6\xDC\xE4\xF6\xFC]
    correct?

    I see you use it down below in this substitution but it is a bit hard
    for me to understand:

    if ( $string =~ s/($key1)/$set1{$1}/og )

    well, if you have the time please give me a bit more clarification
    on this because I haven't seen it before.

    >
    >my %set2 = (
    > "\xC4" => 'Ae',
    > "\xD6" => 'Oe',
    > "\xDC" => 'Ue',
    > "\xE4" => 'ae',
    > "\xF6" => 'oe',
    > "\xFC" => 'ue',
    > );
    >my $key2 = '[' . join( '', keys %set2 ) . ']';
    >
    >sub transform_characters {
    > my @input = @_;
    > my @output;
    > for my $string ( @input ) {
    > push @output, $string;
    > if ( $string =~ s/($key1)/$set1{$1}/og ) {
    > push @output, $string;
    > if ( $string =~ s/($key2)/$set2{$1}/og ) {
    > push @output, $string;
    > }
    > next;
    > }
    > if ( $string =~ s/($key2)/$set2{$1}/og ) {
    > push @output, $string;
    > }
    > }
    > return @output;
    > }
    >
    >
    >
    >John


    Thanks again John.

    Steve
    steve_f, Aug 7, 2004
    #5
  6. Joe Smith

    Joe Smith Guest

    Gunnar Hjalmarsson wrote:

    > steve_f wrote:
    >
    >> I want to transform special German characters to obtain the
    >> following variations:
    >>
    >> groß bräu
    >> gross bräu
    >> gross braeu


    >> for my $string (@input) {
    >> push @output, $string;

    >
    > Here you copy the whole original text to @output ...
    >
    >> if ($string =~ /\xDF/) {
    >> $string =~ s/\xDF/ss/g;
    >> push @output, $string;

    >
    >
    > ... and here you *add* the converted string. In the suggestion below,
    > I'm assuming that was a mistake.


    As I read it, steve_f wants to output three separate lines for each
    line of input that has both sets of characters.
    line 1 = original string.
    line 2 = string after doing just the ss substitution
    line 3 = string after doing ss and all the other substitutions.
    If so, adding the converted string with a second and third push is correct.
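The three-line reading above could also be sketched without modifying the string in place. This assumes Perl 5.14 or later for the /r (non-destructive) flag; variants() is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;
use 5.014;    # the /r substitution flag needs Perl 5.14 or later

my %set2 = (
    "\xC4" => 'Ae', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xE4" => 'ae', "\xF6" => 'oe', "\xFC" => 'ue',
);

# variants() is a hypothetical helper used for this sketch only.
sub variants {
    my ($string) = @_;

    # line 2 = string after doing just the ss substitution
    my $line2 = $string =~ s/\xDF/ss/gr;
    # line 3 = string after doing ss and all the other substitutions
    my $line3 = $line2 =~ s/([\xC4\xD6\xDC\xE4\xF6\xFC])/$set2{$1}/gr;

    # Keep a line only if it differs from the previous one, so a stage
    # that changed nothing adds no duplicate.
    my @out = ($string);
    for my $candidate ( $line2, $line3 ) {
        push @out, $candidate if $candidate ne $out[-1];
    }
    return @out;
}

print scalar( variants("gro\xDF br\xE4u") ), "\n";    # prints "3"
```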
    -Joe
    Joe Smith, Aug 7, 2004
    #6
  7. steve_f wrote:
    >
    >>John W. Krahn wrote:
    >>
    >>Using a hash you could write that as:
    >>
    >>my %set1 = (
    >> "\xDF" => 'ss',
    >> );
    >># Use a character class because all keys are single characters
    >># If keys are multiple characters use alternation instead

    >
    > can you explain this a bit further? I'm not quite sure what you mean
    > by alternation, but I really only looked up the escaped values for
    > this particular problem.


    Gunnar's example uses alternation.


    >>my $key1 = '[' . join( '', keys %set1 ) . ']';


    Changing this to use alternation would look something like:

    my $key1 = '(?:' . join( '|', keys %set1 ) . ')';
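To illustrate the multi-character case, here is a sketch with a made-up table whose keys are HTML entities (the %entity table is an assumption for illustration, not part of the original problem). Sorting the keys longest first and passing them through quotemeta are precautions: Perl tries alternatives left to right, and keys may contain regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical table with multi-character keys, for illustration only.
my %entity = (
    '&szlig;' => 'ss',
    '&auml;'  => 'ae',
    '&ouml;'  => 'oe',
    '&uuml;'  => 'ue',
);

# Longest keys first, so a key that is a prefix of another cannot match early;
# quotemeta escapes any characters that are special in a regex.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %entity;
my $re = qr/($alt)/;

my $text = 'gro&szlig; br&auml;u';
$text =~ s/$re/$entity{$1}/g;
print "$text\n";    # prints "gross braeu"
```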


    > also here I start to get really lost....ok, you are loading into a scalar
    > the keys as one long string...joining them with no space between...
    > with two brackets so
    >
    > $key1 = [\xDF]
    > $key2 = [\xC4\xD6\xDC\xE4\xF6\xFC]
    > correct?


    Yes.


    > I see you use it down below in this substitution but it is a bit hard
    > for me to understand:
    >
    > if ( $string =~ s/($key1)/$set1{$1}/og )
    >
    > well, if you have the time please give me a bit more clarification
    > on this because I haven't seen it before.


    The substitution and match operators interpolate variables like
    double-quoted strings, so after interpolation the substitution operator
    effectively sees:

    if ( $string =~ s/([\xDF])/ss/g )
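A small sketch of the same mechanism with set two, where it matters that the replacement side is looked up again for every match rather than fixed once. The sample phrase is made up for illustration, and the /o flag is left out because it affects only when the pattern is compiled, not the result here.

```perl
use strict;
use warnings;

my %set2 = (
    "\xC4" => 'Ae', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xE4" => 'ae', "\xF6" => 'oe', "\xFC" => 'ue',
);
my $key2 = '[' . join( '', keys %set2 ) . ']';

# Illustrative phrase containing three different set-two characters.
my $string = "\xD6l f\xFCr B\xE4r";

# The pattern interpolates to a character class; $set2{$1} is evaluated
# for each match, so every character gets its own replacement.
my $count = ( $string =~ s/($key2)/$set2{$1}/g );
print "$string ($count replacements)\n";    # prints "Oel fuer Baer (3 replacements)"
```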


    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Aug 9, 2004
    #7
  8. steve_f

    steve_f Guest

    On Sun, 08 Aug 2004 23:41:42 GMT, "John W. Krahn" wrote:

    >steve_f wrote:
    >>
    >>>John W. Krahn wrote:
    >>>
    >>>Using a hash you could write that as:
    >>>
    >>>my %set1 = (
    >>> "\xDF" => 'ss',
    >>> );
    >>># Use a character class because all keys are single characters
    >>># If keys are multiple characters use alternation instead

    >>
    >> can you explain this a bit further? I'm not quite sure what you mean
    >> by alternation, but I really only looked up the escaped values for
    >> this particular problem.

    >
    >Gunnar's example uses alternation.
    >
    >
    >>>my $key1 = '[' . join( '', keys %set1 ) . ']';

    >
    >Changing this to use alternation would look something like:
    >
    >my $key1 = '(?:' . join( '|', keys %set1 ) . ')';
    >
    >
    >> also here I start to get really lost....ok, you are loading into a scalar
    >> the keys as one long string...joining them with no space between...
    >> with two brackets so
    >>
    >> $key1 = [\xDF]
    >> $key2 = [\xC4\xD6\xDC\xE4\xF6\xFC]
    >> correct?

    >
    >Yes.
    >
    >
    >> I see you use it down below in this substitution but it is a bit hard
    >> for me to understand:
    >>
    >> if ( $string =~ s/($key1)/$set1{$1}/og )
    >>
    >> well, if you have the time please give me a bit more clarification
    >> on this because I haven't seen it before.

    >
    >The substitution and match operators interpolate variables like double
    >quoted strings so after interpolation the substitution operator sees:
    >


    ahhhhhhhhhh...all very fancy stuff, but I got it! thanks for
    showing me this ;-)

    >if ( $string =~ s/([\xDF])/ss/g )
    >
    >
    >John
    steve_f, Aug 9, 2004
    #8
