Help to find a regular expression to parse po file

gialloporpora · Jul 6, 2009

Hi all,
I would like to extract strings from a PO file. To do this I have written
a small Python snippet that parses the file and extracts the strings:

import re
regex = re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
m = regex.findall(s)

where s is a po file like this:

msgctxt "write ubiquity commands.description"
msgid "Takes you to the Ubiquity <a
href=\"chrome://ubiquity/content/editor.html\">command editor</a> page."
msgstr "Apre l'<a href=\"chrome://ubiquity/content/editor.html\">editor
dei comandi</a> di Ubiquity."

#. list ubiquity commands command:
#. use | to separate multiple name values:
msgctxt "list ubiquity commands.names"
msgid "list ubiquity commands"
msgstr "elenco comandi disponibili"

msgctxt "list ubiquity commands.description"
msgid "Opens <a href=\"chrome://ubiquity/content/cmdlist.html\">the
list</a>\n"
" of all Ubiquity commands available and what they all do."
msgstr "Apre una <a
href=\"chrome://ubiquity/content/cmdlist.html\">pagina</a>\n"
" in cui sono elencati tutti i comandi disponibili e per ognuno
viene spiegato in breve a cosa serve."

#. change ubiquity settings command:
#. use | to separate multiple name values:
msgctxt "change ubiquity settings.names"
msgid "change ubiquity settings|change ubiquity preferences|change
ubiquity skin"
msgstr "modifica impostazioni di ubiquity|modifica preferenze di
ubiquity|modifica tema di ubiquity"

msgctxt "change ubiquity settings.description"
msgid "Takes you to the <a
href=\"chrome://ubiquity/content/settings.html\">settings</a> page,\n"
" where you can change your skin, key combinations, etc."
msgstr "Apre la pagina <a
href=\"chrome://ubiquity/content/settings.html\">delle impostazioni</a>
di Ubiquity,\n"
" dalla quale è possibile modificare la combinazione da tastiera
utilizzata per richiamare Ubiquity, il tema, ecc."

but, obviously, with the code above the last string is not matched. If
I use re.DOTALL so that . also matches newline characters, it does not
work because the pattern then matches the entire file. I would like the
match to stop when "msgstr" is found.

regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n\\n",re.DOTALL)

Is this possible?
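
A minimal sketch of why the re.DOTALL attempt swallows the file: .* is greedy, so once it can cross newlines it runs on to the last "msgstr" it can find. A non-greedy .*? plus a lookahead for a blank line (or the end of the input) stops each match at the end of its own entry. The sample text and the names sample, greedy and lazy are made up for illustration, and the lazy pattern would still be fooled by a blank line inside a quoted string.

```python
import re

# Hypothetical two-entry sample; the second entry uses continuation lines.
sample = (
    'msgid "one"\n'
    'msgstr "uno"\n'
    '\n'
    'msgid "two "\n'
    '"lines"\n'
    'msgstr "due "\n'
    '"righe"\n'
)

# Greedy: with DOTALL the first (.*) runs on to the LAST "\nmsgstr ".
greedy = re.compile(r'msgid (.*)\nmsgstr (.*)', re.DOTALL)

# Non-greedy: each group stops at the first place where the rest can match.
# The lookahead ends an entry at a blank line or at the end of the input.
lazy = re.compile(r'msgid (.*?)\nmsgstr (.*?)(?=\n\n|\n?\Z)', re.DOTALL)

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

print(len(greedy_matches))  # a single match that spans both entries
for msgid, msgstr in lazy_matches:
    print(repr(msgid), repr(msgstr))
```

The captured groups still contain the surrounding quotes and the newlines between continuation strings, so they need some cleanup afterwards.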

Hallvard B Furuseth · Jul 6, 2009

gialloporpora said:
I would like to extract strings from a PO file. To do this I have written
a small Python snippet that parses the file and extracts the strings:

import re
regex = re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
m = regex.findall(s)

I don't know the syntax of a po file, but this works for the
snippet you posted:

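import re
# One double-quoted string; a backslash escapes the character after it.
arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
# One or more such strings separated by whitespace.
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)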

However, can \ quote a newline? If so, replace \\. with \\[\s\S] or
something.
Can there be other keywords between msgid and msgstr? If so,
add something like (?:\w+\s+<arg_re>\s*\n)*? between them.
Can msgstr come before msgid? If so, forget using a single regexp.
Anything else to the syntax to look out for? Single quotes, maybe?

Is it a problem if the regexp isn't quite right and doesn't match all
cases, yet doesn't report an error when that happens?

All in all, it may be a bad idea to squeeze this into a single regexp.
It gets ugly real fast. It might be better to parse the file in a more
regular way, maybe using regexps just to extract each (keyword, "value")
pair.
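
The line-by-line approach suggested above could look roughly like this. It is a sketch, not a full PO parser: it assumes every quoted string sits on one physical line (so it would not cope with the mail-wrapped sample in this thread), and it treats blank lines, comments and anything else it does not recognise as the end of an entry. The names LINE_RE and parse_po are invented here.

```python
import re

# An optional keyword followed by one double-quoted string on the same line.
LINE_RE = re.compile(r'^(?:(msgctxt|msgid|msgstr)\s+)?("(?:[^"\\]|\\.)*")\s*$')

def parse_po(text):
    """Return a list of dicts mapping keyword -> raw quoted text, one per entry."""
    entries, current, keyword = [], {}, None
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            # Blank lines, comments and unrecognised lines close the entry.
            if current:
                entries.append(current)
            current, keyword = {}, None
            continue
        if m.group(1):
            keyword = m.group(1)
            current[keyword] = m.group(2)
        elif keyword:
            # Continuation line: append to the value of the last keyword seen.
            current[keyword] += '\n' + m.group(2)
    if current:
        entries.append(current)
    return entries

# Made-up sample in the same shape as the file discussed in this thread.
sample = (
    '#. a comment\n'
    'msgctxt "ctx"\n'
    'msgid "Opens the list\\n"\n'
    '" of all commands."\n'
    'msgstr "Apre la pagina\\n"\n'
    '" dei comandi."\n'
    '\n'
    'msgid "one"\n'
    'msgstr "uno"\n'
)
entries = parse_po(sample)
for entry in entries:
    print(entry)
```

Because unrecognised lines are handled in one place, that branch is also where an error could be raised instead of silently moving on, which addresses the worry about a pattern that quietly misses cases.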

MRAB · Jul 6, 2009

You could try:

regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

and then, if necessary, tidy what you get.
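
One possible way to do the tidying step, as a sketch: pull the contents out of each quoted piece, join them, and undo the backslash escapes. This assumes the strings use C-style escapes and only handles \n, \t, \" and \\; the function name tidy and the sample string are made up.

```python
import re

def tidy(raw):
    """Join the quoted pieces of one captured group and undo common escapes."""
    # Extract the contents of every "..." piece, honouring backslash escapes.
    pieces = re.findall(r'"((?:[^"\\]|\\.)*)"', raw)
    joined = ''.join(pieces)
    # Only \n, \t, \" and \\ are translated; other escapes are left untouched.
    escapes = {'n': '\n', 't': '\t', '"': '"', '\\': '\\'}
    return re.sub(r'\\(.)', lambda m: escapes.get(m.group(1), m.group(0)), joined)

# A captured group as it might come out of findall: two quoted pieces
# separated by a real newline, with escaped quotes and an escaped \n inside.
raw_group = '"Opens <a href=\\"x.html\\">the list</a>\\n"\n" of all commands."'
print(tidy(raw_group))
```

Doing the unescaping in a single re.sub pass matters: replacing \\ and \n in two separate steps would turn an escaped backslash followed by n into a newline.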

gialloporpora · Jul 6, 2009

In reply to Hallvard B Furuseth's message:

Thank you very much, Hallvard, it seems to work. There is a strange
match in the file header, but I can skip the first match.
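
The odd first match is probably the PO header entry, which has an empty msgid and carries the file metadata in its msgstr. Rather than dropping the first match by position, a sketch like the following filters on the empty msgid. It reuses Hallvard's pattern; the sample text and the names pairs and translations are made up.

```python
import re

# Hallvard's pattern, written with raw strings throughout.
arg_re = r'"[^\\"]*(?:\\.[^\\"]*)*"'
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)

# Made-up sample: a header entry followed by one real entry.
sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Project-Id-Version: demo\\n"\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    '\n'
    'msgid "list ubiquity commands"\n'
    'msgstr "elenco comandi disponibili"\n'
)

pairs = find_re.findall(sample)
# Keep only the entries whose msgid is not the empty string "".
translations = [(msgid, msgstr) for msgid, msgstr in pairs if msgid != '""']
print(translations)
```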

The po files have this structure:
http://bit.ly/18qbVc

msgid "string to translate"
" second string to match"
" n string to match"
msgstr "translated string"
" second translated string"
" n translated string"
followed by one or more blank lines before the next group.

In the past I wrote a Python script to parse PO files where msgid
and msgstr are on two consecutive lines, for example:

msgid "string to translate"
msgstr "translated string"

Now the problem is how to also match the (optional) strings between
msgid and msgstr.

Sandro

gialloporpora · Jul 6, 2009

In reply to MRAB's message:

You could try:

regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

and then, if necessary, tidy what you get.

MRAB, thank you for your help. I tried the code posted by Hallvard first
because I saw it first, and it works. Now I'll also check your
suggestion.
Sandro
