Help finding a regular expression to parse a PO file

gialloporpora

Hi all,
I would like to extract the strings from a PO file. To do this I have written
a small Python function that parses the file and extracts them:

import re
regex = re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
m = regex.findall(s)

where s is the content of a PO file like this:

msgctxt "write ubiquity commands.description"
msgid "Takes you to the Ubiquity <a
href=\"chrome://ubiquity/content/editor.html\">command editor</a> page."
msgstr "Apre l'<a href=\"chrome://ubiquity/content/editor.html\">editor
dei comandi</a> di Ubiquity."


#. list ubiquity commands command:
#. use | to separate multiple name values:
msgctxt "list ubiquity commands.names"
msgid "list ubiquity commands"
msgstr "elenco comandi disponibili"

msgctxt "list ubiquity commands.description"
msgid "Opens <a href=\"chrome://ubiquity/content/cmdlist.html\">the
list</a>\n"
" of all Ubiquity commands available and what they all do."
msgstr "Apre una <a
href=\"chrome://ubiquity/content/cmdlist.html\">pagina</a>\n"
" in cui sono elencati tutti i comandi disponibili e per ognuno
viene spiegato in breve a cosa serve."



#. change ubiquity settings command:
#. use | to separate multiple name values:
msgctxt "change ubiquity settings.names"
msgid "change ubiquity settings|change ubiquity preferences|change
ubiquity skin"
msgstr "modifica impostazioni di ubiquity|modifica preferenze di
ubiquity|modifica tema di ubiquity"

msgctxt "change ubiquity settings.description"
msgid "Takes you to the <a
href=\"chrome://ubiquity/content/settings.html\">settings</a> page,\n"
" where you can change your skin, key combinations, etc."
msgstr "Apre la pagina <a
href=\"chrome://ubiquity/content/settings.html\">delle impostazioni</a>
di Ubiquity,\n"
" dalla quale è possibile modificare la combinazione da tastiera
utilizzata per richiamare Ubiquity, il tema, ecc."



But, obviously, with the code above the last string is not matched. If
I use re.DOTALL so that "." also matches newline characters, it does not
work because it matches the entire file. I would like the match to stop
when "msgstr" is found.

regex = re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n\\n", re.DOTALL)

Is this possible or not?
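
The behaviour described above comes from greedy matching: with re.DOTALL, ".*" also crosses newlines and runs on to the last "msgstr" in the file. The sketch below contrasts it with the non-greedy ".*?". The two-entry sample and the names greedy, lazy, greedy_matches and lazy_matches are made up for illustration, and the sketch assumes every entry, including the last, is followed by a blank line.

```python
import re

# Hypothetical two-entry sample; a real PO file carries more fields.
sample = (
    'msgid "first"\n'
    'msgstr "primo"\n'
    '\n'
    'msgid "second"\n'
    '" continued"\n'
    'msgstr "secondo"\n'
    '" continuato"\n'
    '\n'
)

# Greedy: with DOTALL, .* also matches newlines and extends to the LAST "msgstr".
greedy = re.compile(r'msgid (.*)\nmsgstr (.*)\n\n', re.DOTALL)

# Non-greedy: .*? stops at the first position where the rest of the pattern fits.
lazy = re.compile(r'msgid (.*?)\nmsgstr (.*?)\n\n', re.DOTALL)

greedy_matches = greedy.findall(sample)  # one match spanning both entries
lazy_matches = lazy.findall(sample)      # one match per entry
```

The captured groups still contain the surrounding quotes and the line breaks between continuation strings, so they need tidying afterwards.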
 
Hallvard B Furuseth

gialloporpora said:
I would like to extract the strings from a PO file. To do this I have written
a small Python function that parses the file and extracts them:

import re
regex = re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
m = regex.findall(s)

I don't know the syntax of a po file, but this works for the
snippet you posted:

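import re
# One double-quoted string, allowing backslash escapes inside it.
arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
# One or more such strings separated by whitespace (continuation lines).
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)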

However, can \ quote a newline? If so, replace \\. with \\[\s\S] or
something.
Can there be other keywords between msgid and msgstr? If so,
add something like (?:\w+\s+<arg_re>\s*\n)*? between them.
Can msgstr come before msgid? If so, forget using a single regexp.
Anything else to the syntax to look out for? Single quotes, maybe?

Is it a problem if the regexp isn't quite right and doesn't match all
cases, yet doesn't report an error when that happens?

All in all, it may be a bad idea to squeeze this into a single regexp.
It gets ugly real fast. It might be better to parse the file in a more
regular way, maybe using regexps just to extract each (keyword, "value")
pair.
 
MRAB

gialloporpora said:

But, obviously, with the code above the last string is not matched. If
I use re.DOTALL so that "." also matches newline characters, it does not
work because it matches the entire file. I would like the match to stop
when "msgstr" is found.

regex = re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n\\n", re.DOTALL)

Is this possible or not?

You could try:

# re.MULTILINE lets $ match at every line end, not only at the end of the file.
regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

and then, if necessary, tidy what you get.
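
One way the tidying step might look, as a sketch only. The pattern is the one suggested above, written as a single-quoted raw string with re.MULTILINE so that it compiles and "$" matches at each line end. The names ENTRY_RE, CHUNK_RE, tidy and extract_pairs are hypothetical.

```python
import re

ENTRY_RE = re.compile(
    r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

# One quoted chunk, allowing backslash escapes inside it.
CHUNK_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

def tidy(raw):
    """Join the quoted chunks of one captured field into a single string."""
    return "".join(CHUNK_RE.findall(raw))

def extract_pairs(text):
    """Return (msgid, msgstr) pairs with quotes and line breaks removed."""
    return [(tidy(a), tidy(b)) for a, b in ENTRY_RE.findall(text)]
```

Escape sequences such as \n and \" are left as they appear in the file; decoding them would be a separate step.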
 
gialloporpora

In reply to Hallvard B Furuseth's message:

Thank you very much, Hallvard. It seems to work; there is a strange
match in the file header, but I can skip the first match.
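
A sketch of skipping that first match, assuming the strange match is the header entry that gettext tools place at the top of a PO file: an empty msgid whose msgstr carries metadata such as Content-Type. The pattern is Hallvard's, rebuilt here so the block runs on its own; find_translations is a hypothetical name.

```python
import re

# Hallvard's pattern: one or more quoted strings after each keyword.
arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)

def find_translations(text):
    """Return (msgid, msgstr) pairs, dropping entries whose msgid is empty."""
    return [(msgid, msgstr)
            for msgid, msgstr in find_re.findall(text)
            if msgid != '""']  # the header entry has msgid ""
```

Filtering on the empty msgid rather than dropping the first match by position also works for files that have no header.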


PO files have this structure (see http://bit.ly/18qbVc):

msgid "string to translate"
" second string to match"
" n-th string to match"
msgstr "translated string"
" second translated string"
" n-th translated string"

One or more blank lines separate each group from the next.

In the past I wrote a Python script to parse PO files where msgid
and msgstr are on two consecutive lines, for example:

msgid "string to translate"
msgstr "translated string"

Now the problem is how to also match the (optional) continuation strings
that follow msgid and msgstr.

Sandro
 
gialloporpora

In reply to MRAB's message:
You could try:

regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

and then, if necessary, tidy what you get.


MRAB, thank you for your help. I tried the code posted by Hallvard
first because I saw it earlier, and it works. Now I'll also check your
suggestion.
Sandro
 
