Yet another Java regex problem

Discussion in 'Java' started by bauer@b3s.de, May 24, 2005.

  1. Guest

    Hi,
    there's a DocBook XML file which I want to modify. The file contains
    something like
    ....
    <mediaobject>
    <imageobject>
    <imagedata fileref="PathToImage" format="ImgFormat"/>
    </imageobject>
    </mediaobject>
    ....
    I just want to match the whole <mediaobject> thingy and prepend one
    line which contains the PathToImage as an XML comment, just like
    <!-- PathToImage -->

    My input to the matcher is the whole file as is. First I tried to get a
    regex to match the whole thing

    content = content.replaceFirst(
    "<mediaobject>" +
    "\\s*<imageobject>" +
    "\\s*<imagedata fileref=\".*\".*/>" +
    "\\s*</imageobject>" +
    "\\s*</mediaobject>",
    "<!-- Test -->"
    );

    But when I use a backref (like \0 for the whole match or \1 if I use
    parentheses for the filename) in the replacement string like this:
    "<!-- Test -->\0"
    I just get
    <!-- Test --> followed by a square character which cannot be displayed here.

    The strange thing is that when I use exactly the same pattern with
    Pattern.compile(regex).matcher(str).replaceAll(repl)
    nothing matches (contrary to what the Java API documentation says for
    String.replaceAll()).

    I tried Pattern.MULTILINE and Pattern.DOTALL in every combination. I
    tried to use .* instead of \\s and even used \r?\n? for the line
    endings ... nothing works.

    Please can anyone help me?

    _

    Tom
     
    , May 24, 2005
    #1

  2. wrote:
    > Hi,
    > there's a DocBook XML file which I want to modify. The file contains
    > something like
    > ...
    > <mediaobject>
    > <imageobject>
    > <imagedata fileref="PathToImage" format="ImgFormat"/>
    > </imageobject>
    > </mediaobject>
    > ...
    > I just want to match the whole <mediaobject> thingy and prepend one
    > line which contains the PathToImage as a XML comment just like
    > <!-- PathToImage -->
    >
    > My input to the matcher is the whole file as is. First I tried to get a
    > regex to match the whole thing
    >
    > content = content.replaceFirst(
    > "<mediaobject>" +
    > "\\s*<imageobject>" +
    > "\\s*<imagedata fileref=\".*\".*/>" +
    > "\\s*</imageobject>" +
    > "\\s*</mediaobject>",
    > "<!-- Test -->"
    > );
    >
    > But when I use a backref (like \0 for the whole match or \1 if I use
    > parentheses for the filename) in the replacement string like this:
    > "<!-- Test -->\0"
    > I just get
    > <!-- Test --> + this square char which cannot display here
    >
    > The strange thing is that when I use exactly the same pattern with
    > Pattern.compile(regex).matcher(str).replaceAll(repl)
    > nothing matches (opposed to the Java API statment for
    > String.replaceAll()).
    >
    > I tried Pattern.MULTILINE and Pattern.DOTALL in any combination. I
    > tried to use .* instead of \\s and even used \r?\n? for the line
    > endings ... nothing works.
    >
    > Please can anyone help me?
    >
    > _
    >
    > Tom
    >


    Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)"? You
    can then use a replacement along the lines of
    "<!-- PathToImage -->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL
    when building the pattern.

    Hope that helps.

    Pan
    ======================================================================
    TechBookReport Java http://www.techbookreport.com/JavaIndex.html
     
    TechBookReport, May 24, 2005
    #2

  3. Guest

    TechBookReport wrote:
    > Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)"? You
    > can then use a replacement along the lines of
    > "<!-- PathToImage -->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL
    > when building the pattern.
    >
    > Hope that helps.


    Not really ... this results in the same problem I already described.
    Instead of \1\2\3 being substituted with the matching groups, I get only
    this special char (it looks like a square and cannot be displayed here).
    By the way, I noticed that you used $1$2$3. This is Perl, right? In Java
    it would be \1\2\3, or am I wrong?

    You can try it yourself. Save the following content to a file:
    <chapter>
    <title>Chapter 1</title>
    <sect1>
    <title>Section 1</title>
    <para>
    Test Test Test Test Test Test Test Test Test
    </para>
    <mediaobject>
    <imageobject>
    <imagedata fileref="image.svg" format="SVG"/>
    </imageobject>
    </mediaobject>
    <para>
    Test Test Test Test Test Test Test Test Test
    </para>
    </sect1>
    </chapter>

    Read this file with
    public String readPlain( File file ) throws Exception
    {
        String content = new String();
        String line = new String();
        BufferedReader brd = new BufferedReader( new FileReader( file ) );
        while ( ( line = brd.readLine() ) != null )
            content += line + "\r\n";
        brd.close();
        return content;
    }

    and then apply a
    content = Pattern.compile( "(<mediaobject)(.*)(</mediaobject>)",
    Pattern.MULTILINE|Pattern.DOTALL).matcher(
    content).replaceAll("<!-- Test -->\1\2\3");

    _

    Tom
     
    , May 24, 2005
    #3
  4. Guest

    Damn Java regex !!! It is $1$2$3. That was the point. I used the wrong
    syntax for backrefs. But the Java API 1.4.2 documentation for
    java.util.regex.Pattern says

    Back references
    \n Whatever the nth capturing group matched

    So what ... ?!?
     
    , May 24, 2005
    #4
  5. wrote:
    > Damn Java regex !!! It is $1$2$3. That was the point. I used the wrong
    > syntax for backrefs. But the Java API 1.4.2 documentation for
    > java.util.regex.Pattern says
    >
    > Back references
    > \n Whatever the nth capturing group matched
    >
    > So what ... ?!?
    >

    Did you escape the backslashes? Also, the funny square character is
    probably the \r\n you are using. Try
    System.getProperty("line.separator") instead.

    Pan

    ======================================================================
    TechBookReport Java http://www.techbookreport.com/JavaIndex.html
     
    TechBookReport, May 24, 2005
    #5
  6. Guest

    TechBookReport wrote:
    > wrote:
    > > Damn Java regex !!! It is $1$2$3. That was the point. I used the wrong
    > > syntax for backrefs. But the Java API 1.4.2 documentation for
    > > java.util.regex.Pattern says
    > >
    > > Back references
    > > \n Whatever the nth capturing group matched
    > >
    > > So what ... ?!?
    > >

    > Did you escape the backslashes? Also, the funny square character is
    > probably the \r\n you are using. Try
    > System.getProperty("line.separator") instead.
    >

    No, the funny square char is not the \r\n, because if it were, it would
    be on every line independent of the regex code. I'm on Windows and the
    app runs only on this system, but you are right, I had better use
    getProperty("line.separator").
    I guess the funny square is some Unicode character (\1=0x01?) if I use
    \1 without escaping the backslash.
    But that doesn't matter anymore, my problem is solved. Thanks for your
    help.
     
    , May 24, 2005
    #6
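
The guess above about the square character can be checked with a small sketch. It assumes a standard Java compiler, where "\1" inside a string literal is an octal escape for the single character with code 1. The class name EscapeDemo, the helper replaceWith and the sample input "abba" are made up for illustration and are not from the thread.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // In Java source, "\1" is an octal escape: one char with code 1,
    // not a backslash followed by the digit 1.
    static final String OCTAL = "\1";

    // Hypothetical helper: wraps each run of 'b' in "abba" using the
    // given replacement text between angle brackets.
    static String replaceWith(String replacement) {
        return Pattern.compile("(b+)").matcher("abba")
                      .replaceAll("<" + replacement + ">");
    }

    public static void main(String[] args) {
        System.out.println(OCTAL.length());          // length of the literal
        System.out.println((int) OCTAL.charAt(0));   // numeric code of its only char
        System.out.println(replaceWith("$1"));       // $1 refers to group 1
        System.out.println(replaceWith("\\1"));      // escaped backslash: literal "1"
    }
}
```

Note that escaping the backslash does not help either: in a replacement string a backslash only makes the next character literal, so "\\1" inserts the digit 1 rather than the group.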
  7. Alan Moore Guest

    On Tue, 24 May 2005 15:20:13 +0100, TechBookReport <>
    wrote:

    >Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)"? You
    >can then use a replacement along the lines of
    >"<!-- PathToImage -->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL
    >when building the pattern.


    If there can be more than one mediaobject element in a document, you
    need to use a reluctant dot-star:

    "<mediaobject.*?</mediaobject>"

    Otherwise, it will match everything from the first opening tag to the
    last closing tag. Even if there's only one such element, it will
    probably be more efficient this way.
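
To see the difference between the greedy and the reluctant dot-star, here is a minimal sketch. The class name ReluctantDemo, the helper mark and the sample string are assumptions for illustration; Pattern.DOTALL is used so the dot can cross the line breaks in the sample.

```java
import java.util.regex.Pattern;

class ReluctantDemo {
    // Made-up sample: two mediaobject elements with a paragraph between them.
    static final String DOC =
        "<mediaobject>A</mediaobject>\n<para>x</para>\n<mediaobject>B</mediaobject>";

    // Hypothetical helper: puts square brackets around every match of regex.
    static String mark(String regex) {
        return Pattern.compile(regex, Pattern.DOTALL)
                      .matcher(DOC)
                      .replaceAll("[$0]");
    }

    public static void main(String[] args) {
        // Greedy: one match running from the first opening tag to the last closing tag.
        System.out.println(mark("<mediaobject.*</mediaobject>"));
        System.out.println("----");
        // Reluctant: one match per element.
        System.out.println(mark("<mediaobject.*?</mediaobject>"));
    }
}
```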

    You don't really need to use capturing parentheses, since you're
    re-inserting the whole match; just use $0:

    str = str.replaceAll("<mediaobject.*?</mediaobject>",
    "<!-- PathToImage -->$0");


    The JDK regex package uses the same syntax as Perl WRT
    backreferences--"\n" within the regex and "$n" in the replacement
    string--except that it uses $0 instead of $& for the whole match, and
    doesn't emulate the other dollar-plus-punctuation variables: $`, $',
    and $+.
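
A short sketch of the two syntaxes side by side, under the assumption that we want to collapse doubled words: the backreference inside the pattern is \1 (written "\\1" in Java source), while the replacement string uses $1. The class name BackrefDemo and the method collapse are invented for this example.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // Inside the pattern, \1 must match the same text that group 1 matched.
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+) \\1\\b");

    // Inside the replacement, $1 inserts what group 1 matched.
    static String collapse(String text) {
        return DOUBLED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Test Test and more more text"));
    }
}
```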
     
    Alan Moore, May 24, 2005
    #7
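
Putting the pieces of the thread together, the original task (prepend a comment containing the fileref value) could be sketched roughly as follows. This is only one possible approach and rests on assumptions: every mediaobject contains an imagedata element with a double-quoted fileref attribute, and the names MediaObjectComment and addComments are hypothetical. The embedded flag (?s) plays the role of Pattern.DOTALL, which matters because the element spans several lines.

```java
import java.util.regex.Pattern;

class MediaObjectComment {
    // (?s) lets '.' cross line breaks; group 1 captures the fileref value.
    // Assumes each <mediaobject> contains an <imagedata fileref="..."> element.
    static final Pattern MEDIA = Pattern.compile(
        "(?s)<mediaobject>.*?<imagedata\\s+fileref=\"([^\"]*)\".*?</mediaobject>");

    static String addComments(String xml) {
        // $1 is the captured path, $0 is the whole matched element.
        return MEDIA.matcher(xml).replaceAll("<!-- $1 -->\n$0");
    }

    public static void main(String[] args) {
        String xml = "<para>Test</para>\n"
            + "<mediaobject>\n<imageobject>\n"
            + "<imagedata fileref=\"image.svg\" format=\"SVG\"/>\n"
            + "</imageobject>\n</mediaobject>\n";
        System.out.println(addComments(xml));
    }
}
```

For anything beyond a simple, regular DocBook file, an XML parser would be more robust than a regex; the sketch only mirrors the approach discussed in the thread.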