regexp for removing {} around latin1 characters

Michael Friendly

I have BibTeX files containing accented characters in forms like
{Johann Peter S{\"u}ssmilch}
{Johann Peter S\"ussmilch}
where, in BibTeX, the {} are optional.

To export these to, e.g., EndNote, I have to translate these LaTeX
encodings to Latin-1, which I can largely do with the Unix recode tool.
However, recode cheerfully copies the {}s, which mess things up when
I import them.

% echo '{Johann Peter S{\"u}ssmilch},' | recode latex..latin1
{Johann Peter S{ü}ssmilch},

So, I'm looking to complete the process by finding a regexp to remove
the braces around single accented latin1 characters.

recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"


--
Michael Friendly Email: (e-mail address removed)
Professor, Psychology Dept.
York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT M3J 1P3 CANADA
 
Michael Friendly

Peter said:
At 2009-11-27 12:05PM, "Michael Friendly" wrote:
I have BibTeX files containing accented characters in forms like
{Johann Peter S{\"u}ssmilch}
{Johann Peter S\"ussmilch}
where, in BibTex, the {} are optional. [...]
So, I'm looking to complete the process by finding a regexp to remove
the braces around single accented latin1 characters.

recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
Maybe:

s#{(\\.(?:{.+?}|.+?))}#$1#g

more likely:

perl -pe "s|\{([\xA0-\xFF])\}|$1|g"


I think you are trying to replace the recode, too, but for that you need
a lookup table with all the accented characters.

hp

No, all I want to do is to strip the {} around the accented characters;
recode does the conversion well. With the small test bib file below,
here's what I get using only recode, vs. recode + perl


% recode latex..latin1 < timeref.bib | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter S{ü}ssmilch},

% recode latex..latin1 < timeref.bib | perl -pe "s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter Sssmilch},

Note that the ü just disappears.

---begin timeref.bib ---
@ARTICLE{Buache:1752,
author = {Buache, Phillippe},
title = {Essai De G{\'e}ographie Physique},
journal = {M{\'e}moires de L'Acad{\'e}mie Royale des Sciences},
year = {1752},
pages = {399--416},
note = {\Loc{BNF: Ge.FF-8816-8822}},
annote = {Contour map},
oldnum = {2}
}

@BOOK{Crome:1785,
title = {{\"U}ber die Gr{\"o}sse and Bev{\"o}lkerung der
S{\"a}mtlichen Europ{\"a}schen
Staaten},
publisher = {Weygand},
year = {1785},
author = {Crome, August F. W.},
address = {Leipzig},
annote = {Superimposed squares to compare areas (of European states)},
oldnum = {5}
}

@BOOK{Sussmilch:1741,
title = {Die g{\"o}ttliche Ordnung in den Ver\"anderungen des
menschlichen Geschlechts,
aus der Geburt, Tod, und Fortpflantzung},
publisher = {n.p.},
year = {1741},
author = {Johann Peter S{\"u}ssmilch},
address = {Germany},
note = {(published in French translation as \emph{L'ordre divin. dans les
changements de l'esp\`ece humaine, d{\'e}montr{\'e} par la naissance,
la mort et la propagation de celle-ci}, trans: Jean-Marc Rohrbasser,
Paris: INED, 1998, ISBN 2-7332-1019-X)},
url = {http://www.ined.fr/publicat/collections/classiques/Ordivin.htm}
}
--- end timeref.bib -----



 
sln

Michael Friendly said:
No, all I want to do is to strip the {} around the accented characters;
recode does the conversion well. With the small test bib file below,
here's what I get using only recode, vs. recode + perl


% recode latex..latin1 < timeref.bib | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter S{ü}ssmilch},

% recode latex..latin1 < timeref.bib | perl -pe "s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter Sssmilch},

Note that the ü just disappears.
I didn't have a problem with the substitution when it was run as
a stand-alone Perl program. The one-liner may need its STDOUT
adjusted with binmode().

----------------------
perl gg.pl > itt.txt

itt.txt (from word):
252
unix crlf encoding(iso-8859-1) utf8
{Johann Peter Süssmilch}
::
However, I don't need to set the STDOUT
encoding. The default does the same thing,
probably because internally the data remained
byte strings during the regex, and Latin-1
characters 0-255 have the same code points in Unicode.

-sln
--------------------

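use strict;
use warnings;

# Prints 252 only if this source file is saved as Latin-1 (no "use utf8").
print ord('ü'),"\n";

# \xFC is u-umlaut in Latin-1, wrapped in braces as recode leaves it.
my $str = "{Johann Peter S{\xFC}ssmilch}";

# Strip the braces around a single character in the range 0xC0-0xFF.
$str =~ s/\{([\xC0-\xFF])\}/$1/g;

# try one of these:
binmode (STDOUT, ":encoding(latin-1)");
#binmode (STDOUT);
#binmode (STDOUT, ":raw");

# Show which I/O layers are active on STDOUT.
print "@{[PerlIO::get_layers(STDOUT)]}\n";

print "$str\n";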
 
Peter J. Holzer

[Please don't cc usenet postings]

Glenn said:
Maybe:

s#{(\\.(?:{.+?}|.+?))}#$1#g

Peter said:
more likely:

perl -pe "s|\{([\xA0-\xFF])\}|$1|g"


I think you are trying to replace the recode, too, but for that you need
a lookup table with all the accented characters.

Michael Friendly said:
No, all I want to do is to strip the {} around the accented characters;

Yes, I was following up to Glenn here, so the "you" was referring to him.
If you look at his regexp, you will see that it matches, for example,
{\"{u}} or {\"u}. That doesn't work after the recode, because the \" has
already been replaced.

Michael Friendly said:
recode does the conversion well. With the small test bib file below,
here's what I get using only recode, vs. recode + perl


% recode latex..latin1 < timeref.bib | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter S{ü}ssmilch},

% recode latex..latin1 < timeref.bib | perl -pe "s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter Sssmilch},

Note that the ü just disappears.

You are using a unixish system? The shell replaces $1 inside the double
quotes with the current value of the shell variable $1 (in your case
probably nothing), so that the code that perl sees is:

s|\{([\xA0-\xFF])\}||g

On unixish systems you should always use single quotes to enclose perl
code unless you want the shell to substitute part of your code. Since
you were using double quotes I was assuming you are on Windows.

In general you should only use one-liners if you are familiar with the
shell you are using.

hp
 
