regexp for removing {} around latin1 characters

Discussion in 'Perl Misc' started by Michael Friendly, Nov 27, 2009.

  1. I have BibTeX files containing accented characters in forms like
    {Johann Peter S{\"u}ssmilch}
    {Johann Peter S\"ussmilch}
    where, in BibTeX, the {} are optional.

    To export these to, e.g., EndNote, I have to translate these latex
    encodings to latin1, which I can largely do with the unix recode tool.
    However, recode cheerfully copies the {}s which mess up things when
    I import them.

    % echo '{Johann Peter S{\"u}ssmilch},' | recode latex..latin1
    {Johann Peter S{ü}ssmilch},

    So, I'm looking to complete the process by finding a regexp to remove
    the braces around single accented latin1 characters.

    recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"


    --
    Michael Friendly Email:
    Professor, Psychology Dept.
    York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
    4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html
    Toronto, ONT M3J 1P3 CANADA
     
    Michael Friendly, Nov 27, 2009
    #1

  2. Peter J. Holzer wrote:
    > On 2009-11-27 17:39, Glenn Jackman <> wrote:
    >> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
    >>> I have BibTeX files containing accented characters in forms like
    >>> {Johann Peter S{\"u}ssmilch}
    >>> {Johann Peter S\"ussmilch}
    >>> where, in BibTex, the {} are optional.

    >> [...]
    >>> So, I'm looking to complete the process by finding a regexp to remove
    >>> the braces around single accented latin1 characters.
    >>>
    >>> recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"

    >> Maybe:
    >>
    >> s#{(\\.(?:{.+?}|.+?))}#$1#g

    >
    > more likely:
    >
    > perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
    >
    >
    > I think you are trying to replace the recode, too, but for that you need
    > a lookup table with all the accented characters.
    >
    > hp
    >


    No, all I want to do is to strip the {} around the accented characters;
    recode does the conversion well. With the small test bib file below,
    here's what I get using only recode, vs. recode + perl


    % recode latex..latin1 < timeref.bib | grep ssmilch
    @BOOK{Sussmilch:1741,
    author = {Johann Peter S{ü}ssmilch},

    % recode latex..latin1 < timeref.bib | perl -pe "s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
    @BOOK{Sussmilch:1741,
    author = {Johann Peter Sssmilch},

    Note that the ü just disappears.

    ---begin timeref.bib ---
    @ARTICLE{Buache:1752,
    author = {Buache, Phillippe},
    title = {Essai De G{\'e}ographie Physique},
    journal = {M{\'e}moires de L'Acad{\'e}mie Royale des Sciences},
    year = {1752},
    pages = {399--416},
    note = {\Loc{BNF: Ge.FF-8816-8822}},
    annote = {Contour map},
    oldnum = {2}
    }

    @BOOK{Crome:1785,
    title = {{\"U}ber die Gr{\"o}sse and Bev{\"o}lkerung der
    S{\"a}mtlichen Europ{\"a}schen
    Staaten},
    publisher = {Weygand},
    year = {1785},
    author = {Crome, August F. W.},
    address = {Leipzig},
    annote = {Superimposed squares to compare areas (of European states)},
    oldnum = {5}
    }

    @BOOK{Sussmilch:1741,
    title = {Die g{\"o}ttliche Ordnung in den Ver\"anderungen des
    menschlichen Geschlechts,
    aus der Geburt, Tod, und Fortpflantzung},
    publisher = {n.p.},
    year = {1741},
    author = {Johann Peter S{\"u}ssmilch},
    address = {Germany},
    note = {(published in French translation as \emph{L'ordre divin. dans les
    changements de l'esp\`ece humaine, d{\'e}montr{\'e} par la naissance,
    la mort et la propagation de celle-ci}, trans: Jean-Marc Rohrbasser,
    Paris: INED, 1998, ISBN 2-7332-1019-X)},
    url = {http://www.ined.fr/publicat/collections/classiques/Ordivin.htm}
    }
    --- end timeref.bib -----



    Michael Friendly, Nov 27, 2009
    #2

  3. On Fri, 27 Nov 2009 15:59:59 -0500, Michael Friendly <> wrote:

    >Peter J. Holzer wrote:
    >> On 2009-11-27 17:39, Glenn Jackman <> wrote:
    >>> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
    >>>> I have BibTeX files containing accented characters in forms like
    >>>> {Johann Peter S{\"u}ssmilch}
    >>>> {Johann Peter S\"ussmilch}
    >>>> where, in BibTex, the {} are optional.
    >>> [...]
    >>>> So, I'm looking to complete the process by finding a regexp to remove
    >>>> the braces around single accented latin1 characters.
    >>>>
    >>>> recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
    >>> Maybe:
    >>>
    >>> s#{(\\.(?:{.+?}|.+?))}#$1#g

    >>
    >> more likely:
    >>
    >> perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
    >>
    >>
    >> I think you are trying to replace the recode, too, but for that you need
    >> a lookup table with all the accented characters.
    >>
    >> hp
    >>

    >
    >No, all I want to do is to strip the {} around the accented characters;
    >recode does the conversion well. With the small test bib file below,
    >here's what I get using only recode, vs. recode + perl
    >
    >
    > % recode latex..latin1 < timeref.bib | grep ssmilch
    >@BOOK{Sussmilch:1741,
    > author = {Johann Peter S{ü}ssmilch},
    >
    > % recode latex..latin1 < timeref.bib | perl -pe
    >"s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
    >@BOOK{Sussmilch:1741,
    > author = {Johann Peter Sssmilch},
    >
    >Note that the ü just disappears.
    >

    I didn't have a problem with the substitution when it was run as
    a stand-alone Perl program. The one-liner may need its STDOUT
    adjusted with binmode().

    ----------------------
    perl gg.pl > itt.txt

    itt.txt (from word):
    252
    unix crlf encoding(iso-8859-1) utf8
    {Johann Peter Süssmilch}
    ::
    However, I don't need to set the STDOUT
    encoding. The default does the same thing,
    probably because the string stayed a byte
    string throughout the regex: the Latin-1
    characters 0-255 have the same code points
    in Unicode.

    -sln
    --------------------

    use strict;
    use warnings;

    print ord('ü'),"\n";

    my $str = "{Johann Peter S{\xFC}ssmilch}";

    $str =~ s/\{([\xC0-\xFF])\}/$1/g;

    # try one of these:
    binmode (STDOUT, ":encoding(latin-1)");
    #binmode (STDOUT);
    #binmode (STDOUT, ":raw");

    print "@{[PerlIO::get_layers(STDOUT)]}\n";

    print "$str\n";
     
    , Nov 27, 2009
    #3
  4. [Please don't cc usenet postings]

    On 2009-11-27 20:59, Michael Friendly <> wrote:
    > Peter J. Holzer wrote:
    >> On 2009-11-27 17:39, Glenn Jackman <> wrote:
    >>> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
    >>>> I have BibTeX files containing accented characters in forms like
    >>>> {Johann Peter S{\"u}ssmilch}
    >>>> {Johann Peter S\"ussmilch}
    >>>> where, in BibTex, the {} are optional.
    >>> [...]
    >>>> So, I'm looking to complete the process by finding a regexp to remove
    >>>> the braces around single accented latin1 characters.
    >>>>
    >>>> recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
    >>> Maybe:
    >>>
    >>> s#{(\\.(?:{.+?}|.+?))}#$1#g

    >>
    >> more likely:
    >>
    >> perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
    >>
    >>
    >> I think you are trying to replace the recode, too, but for that you need
    >> a lookup table with all the accented characters.

    >
    > No, all I want to do is to strip the {} around the accented characters;


    Yes, I was following up to Glenn here, so the "you" was referring to him.
    If you look at his regexp, you will see that it matches, for example,
    {\"{u}} or {\"u}. That doesn't work after the recode, because the \" has
    already been replaced.


    > recode does the conversion well. With the small test bib file below,
    > here's what I get using only recode, vs. recode + perl
    >
    >
    > % recode latex..latin1 < timeref.bib | grep ssmilch
    > @BOOK{Sussmilch:1741,
    > author = {Johann Peter S{ü}ssmilch},
    >
    > % recode latex..latin1 < timeref.bib | perl -pe
    > "s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
    > @BOOK{Sussmilch:1741,
    > author = {Johann Peter Sssmilch},
    >
    > Note that the ü just disappears.


    You are using a unixish system? The shell replaces $1 inside the double
    quotes with the current value of the shell variable $1 (in your case
    probably nothing), so that the code that perl sees is:

    s|\{([\xA0-\xFF])\}||g

    On unixish systems you should always use single quotes to enclose perl
    code unless you want the shell to substitute part of your code. Since
    you were using double quotes I was assuming you are on Windows.

    In general you should only use one-liners if you are familiar with the
    shell you are using.

    hp
     
    Peter J. Holzer, Nov 28, 2009
    #4