Help finding a regular expression to parse a PO file

Discussion in 'Python' started by gialloporpora, Jul 6, 2009.

  1. Hi all,
    I would like to extract strings from a PO file. To do this I have written
    a little Python function that parses the PO file and extracts the strings:

    import re
    regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
    m=regex.findall(s)

    where s is the content of a PO file like this:

    msgctxt "write ubiquity commands.description"
    msgid "Takes you to the Ubiquity <a
    href=\"chrome://ubiquity/content/editor.html\">command editor</a> page."
    msgstr "Apre l'<a href=\"chrome://ubiquity/content/editor.html\">editor
    dei comandi</a> di Ubiquity."


    #. list ubiquity commands command:
    #. use | to separate multiple name values:
    msgctxt "list ubiquity commands.names"
    msgid "list ubiquity commands"
    msgstr "elenco comandi disponibili"

    msgctxt "list ubiquity commands.description"
    msgid "Opens <a href=\"chrome://ubiquity/content/cmdlist.html\">the
    list</a>\n"
    " of all Ubiquity commands available and what they all do."
    msgstr "Apre una <a
    href=\"chrome://ubiquity/content/cmdlist.html\">pagina</a>\n"
    " in cui sono elencati tutti i comandi disponibili e per ognuno
    viene spiegato in breve a cosa serve."



    #. change ubiquity settings command:
    #. use | to separate multiple name values:
    msgctxt "change ubiquity settings.names"
    msgid "change ubiquity settings|change ubiquity preferences|change
    ubiquity skin"
    msgstr "modifica impostazioni di ubiquity|modifica preferenze di
    ubiquity|modifica tema di ubiquity"

    msgctxt "change ubiquity settings.description"
    msgid "Takes you to the <a
    href=\"chrome://ubiquity/content/settings.html\">settings</a> page,\n"
    " where you can change your skin, key combinations, etc."
    msgstr "Apre la pagina <a
    href=\"chrome://ubiquity/content/settings.html\">delle impostazioni</a>
    di Ubiquity,\n"
    " dalla quale è possibile modificare la combinazione da tastiera
    utilizzata per richiamare Ubiquity, il tema, ecc."



    but, obviously, with the code above the last string is not matched. If
    I use re.DOTALL so that "." also matches newline characters, it does not
    work because it matches the entire file. I would like the match to stop
    when "msgstr" is found.

    regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n\\n",re.DOTALL)

    Is it possible or not?
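
    To illustrate the behaviour described above: with re.DOTALL a greedy (.*) runs on to the last "msgstr" in the text, while a non-greedy (.*?) stops at the first one. This is only a minimal sketch. The sample string po_text is made up for the demonstration, and the non-greedy pattern still requires a blank line after every msgstr, so it is not a complete PO parser.

```python
import re

# Hypothetical two-entry sample, not taken from a real PO file.
po_text = (
    'msgid "one"\n'
    'msgstr "uno"\n'
    '\n'
    'msgid "two"\n'
    '" and a half"\n'
    'msgstr "due"\n'
    '" e mezzo"\n'
    '\n'
)

# Greedy: each (.*) grabs as much as it can, so both entries end up in one match.
greedy = re.compile(r'msgid (.*)\nmsgstr (.*)\n\n', re.DOTALL)
# Non-greedy: each (.*?) stops at the first place where the rest can match.
lazy = re.compile(r'msgid (.*?)\nmsgstr (.*?)\n\n', re.DOTALL)

greedy_matches = greedy.findall(po_text)
lazy_matches = lazy.findall(po_text)

print(len(greedy_matches), len(lazy_matches))
```

    The captured groups still contain the surrounding double quotes and the embedded line breaks.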
     
    gialloporpora, Jul 6, 2009
    #1

  2. gialloporpora writes:
    > I would like to extract strings from a PO file. To do this I have written
    > a little Python function that parses the PO file and extracts the strings:
    >
    > import re
    > regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
    > m=regex.findall(s)


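    I don't know the syntax of a PO file, but this works for the
    snippet you posted:

    import re
    # One double-quoted string; backslash escapes such as \" are allowed inside
    arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
    # One or more such strings separated by whitespace (continuation lines)
    arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
    find_re = re.compile(
        r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)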

    However, can \ quote a newline? If so, replace \\. with \\[\s\S] or
    something similar.
    Can there be other keywords between msgid and msgstr? If so,
    add something like (?:\w+\s+<arg_re>\s*\n)*? between them.
    Can msgstr come before msgid? If so, forget about using a single regexp.
    Is there anything else in the syntax to look out for? Single quotes, maybe?

    Is it a problem if the regexp isn't quite right and doesn't match all
    cases, yet doesn't report an error when that happens?

    All in all, it may be a bad idea to squeeze this into a single regexp.
    It gets ugly real fast. It might be better to parse the file in a more
    regular way, maybe using regexps just to extract each (keyword, "value")
    pair.
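
    The line-by-line approach suggested above could look roughly like the sketch below. It assumes every physical line holds one complete double-quoted string (as in a real PO file, not in the mail-wrapped snippet), and it knows only the keywords msgctxt, msgid and msgstr. The names LINE_RE and parse_po are made up for this example, and unrecognised lines silently end the current entry instead of raising an error.

```python
import re

# Optional keyword, then one double-quoted string filling the rest of the line.
LINE_RE = re.compile(r'^(?:(msgctxt|msgid|msgstr)\s+)?"(.*)"\s*$')


def parse_po(text):
    """Return one dict per entry, e.g. {'msgid': ..., 'msgstr': ...}."""
    entries, entry, keyword = [], {}, None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            # Blank lines, comments and anything unrecognised end the entry.
            if entry:
                entries.append(entry)
            entry, keyword = {}, None
        elif match.group(1):
            # A keyword line starts a new value for that keyword.
            keyword = match.group(1)
            entry[keyword] = match.group(2)
        elif keyword is not None:
            # Bare quoted line: continuation of the most recent keyword.
            entry[keyword] += match.group(2)
    if entry:
        entries.append(entry)
    return entries


# Hypothetical sample input for the demonstration.
sample = (
    'msgctxt "ctx"\n'
    'msgid "a"\n'
    '" b"\n'
    'msgstr "c"\n'
    '" d"\n'
    '\n'
    '#. a comment\n'
    'msgid "e"\n'
    'msgstr "f"\n'
)
entries = parse_po(sample)
print(entries)
```

    The values keep their backslash escapes (\n, \") exactly as written in the file; unescaping them would be a separate step.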

    --
    Hallvard
     
    Hallvard B Furuseth, Jul 6, 2009
    #2

  3. MRAB


    You could try:

    regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

    and then, if necessary, tidy what you get.
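
    One possible way to do the tidying, as a minimal sketch: pull the quoted pieces out of each captured group, join them, and undo the backslash escapes. The pattern is the one above written with single quotes and re.MULTILINE (an adjustment assumed here so that "$" can match at the end of each entry). The helper names QUOTED_RE, ESCAPES and tidy are made up for this example, and only the escapes \n, \t, \" and \\ are handled.

```python
import re

# The pattern from the post, in single quotes and with re.MULTILINE (assumed).
regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

# One double-quoted piece; escaped characters such as \" may appear inside it.
QUOTED_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')
ESCAPES = {'n': '\n', 't': '\t', '"': '"', '\\': '\\'}


def tidy(raw):
    """Join the quoted pieces of one captured group and undo simple escapes."""
    joined = ''.join(QUOTED_RE.findall(raw))
    return re.sub(r'\\(.)', lambda m: ESCAPES.get(m.group(1), m.group(0)), joined)


# Hypothetical sample entry for the demonstration.
po_text = (
    'msgid "Opens the list\\n"\n'
    '" of commands."\n'
    'msgstr "Apre la \\"pagina\\"\\n"\n'
    '" dei comandi."\n'
)

pairs = [(tidy(msgid), tidy(msgstr)) for msgid, msgstr in regex.findall(po_text)]
print(pairs)
```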
     
    MRAB, Jul 6, 2009
    #3
  4. Reply to Hallvard B Furuseth's message:



    Thank you very much, Hallvard, it seems to work. There is a strange
    match in the file header, but I can skip the first match.


    PO files have this structure:
    http://bit.ly/18qbVc

    msgid "string to translate"
    " second string to match"
    " n string to match"
    msgstr "translated string"
    " second translated string"
    " n translated string"
    One or more blank lines precede the next group.

    In the past I wrote a Python script to parse PO files in which msgid
    and msgstr are on two consecutive lines, for example:

    msgid "string to translate"
    msgstr "translated string"

    Now the problem is how to also match the (optional) strings between
    msgid and msgstr.
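
    For the structure described above, a pattern with explicit continuation lines might look like the sketch below. It assumes msgstr directly follows the msgid block, that every string sits on its own line, and that the file header is the entry whose msgid is the empty string, which is how the "strange match" can be skipped. The names STRING, BLOCK and ENTRY_RE and the simplified header in po_text are made up for this example.

```python
import re

# One double-quoted string on a single line, allowing escapes such as \" or \n.
STRING = r'"(?:[^"\\\n]|\\.)*"'
# A first string followed by any number of continuation lines.
BLOCK = STRING + r'(?:\n' + STRING + r')*'
ENTRY_RE = re.compile(
    r'^msgid (' + BLOCK + r')\nmsgstr (' + BLOCK + r')$', re.MULTILINE)

# Hypothetical sample: a simplified header entry followed by one real entry.
po_text = '''msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\\n"

msgid "string to translate"
" second string to match"
msgstr "translated string"
" second translated string"
'''

entries = [
    (msgid, msgstr)
    for msgid, msgstr in ENTRY_RE.findall(po_text)
    if msgid != '""'  # skip the header entry, whose msgid is empty
]
print(entries)
```

    Each captured block still contains the quotes and line breaks, so the pieces have to be joined afterwards.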

    Sandro
     
    gialloporpora, Jul 6, 2009
    #4
  5. Reply to MRAB's message:


    > You could try:
    >
    > regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)
    >
    > and then, if necessary, tidy what you get.



    MRAB, thank you for your help. I tried the code posted by Hallvard
    first because I saw it first, and it works. Now I'll also check your
    suggestion.
    Sandro

     
    gialloporpora, Jul 6, 2009
    #5
