Correct use of Unicode in RegExp

Discussion in 'Perl' started by mike blamires, Apr 22, 2004.

  1. I am having great difficulty using Unicode characters in a regular
    expression. I am trying to match extended Unicode characters.

    I want to split a large dump file (containing only JPEGs). I have used
    a hex editor to manually extract one file to show it can be done, so I
    know the input is intact.

    Each JPEG starts with the Unicode characters \u00FF \u00D8 \u00FF \u00E1
    and there are plenty of these to be found within the file.

    open(DUMPFILE, "/pathtodumpfile");
    my $line;
    while(<DUMPFILE>) {
        $line = $line.$_;
    }
    @files = split(/\x{00FF}\x{00D8}\x{00FF}\x{00E1}/, $line);

    (As you may see from the above style, I am relatively inexperienced with
    the Perl side of programming ;)

    I have tried writing the Unicode characters in various ways (\xFF, \x{FF},
    etc.), but the pattern is never found. I am at a loss as to whether the
    problem is my regexp, my use of Unicode characters, or my use of extended
    Unicode characters.

    Many thanks for your help.

    cheers
    Mike
    mike blamires, Apr 22, 2004
    #1
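
One thing worth ruling out, sketched here under the assumption that the dump should be treated as raw bytes rather than text: a filehandle opened without binmode can, on some platforms or Perl configurations, alter bytes while reading (line-ending translation on Windows is one example). Note also that split discards the delimiter it matches, so each piece would lose its leading FF D8 FF E1 marker. The temporary file and its contents below are hypothetical stand-ins for the real dump file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the real dump: two fake "JPEGs" back to back.
my $sig = "\xFF\xD8\xFF\xE1";
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} $sig, "first\r\n\x1Adata", $sig, "second";
close($tmp_fh) or die "close failed: $!";

# Read the whole file as raw bytes, with no newline or encoding translation.
open(my $dump, '<:raw', $path) or die "Cannot open $path: $!";
my $data = do { local $/; <$dump> };
close($dump);

# Split just before each signature using a zero-width lookahead, so every
# piece keeps its own leading FF D8 FF E1 marker.
my @files = grep { length } split /(?=\xFF\xD8\xFF\xE1)/, $data;

printf "%d pieces\n", scalar @files;
```

If the real dump has bytes before the first marker, the first element of @files would hold that leading data rather than a JPEG, so it may need to be skipped.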

  2. On Thu, 22 Apr 2004 22:36:44 +0100, mike blamires scribbled furiously:

    > [snip: full quote of post #1]

    Apologies, incorrect newsgroup first time round. Please see above.
    cheers
    Mike
    mike blamires, Apr 23, 2004
    #2

  3. mike blamires <> wrote in message news:<>...
    > [snip: full quote of post #1]

    First of all, I've never worked with Unicode characters.

    I see you've tried \xFF and \x{FF} without success. Have you tried
    \\xFF and \\x\{FF\} instead (notice the '\' before all characters
    that aren't alphabetic or numeric)?

    Good luck,
    DNA
    Daniel N. Andersen, Apr 23, 2004
    #3
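
A quick way to check what the doubled backslash does inside a Perl pattern is sketched below. The variable names are illustrative only. In a regex, \xFF and \x{FF} both stand for the single character with code 255, whereas \\xFF is an escaped backslash followed by the literal text "xFF", so it matches those four characters and not the byte itself.

```perl
use strict;
use warnings;

my $byte = "\xFF";   # one character, code point 255
my $text = '\xFF';   # four characters: backslash, x, F, F (single quotes)

# Each test records 1 for a match and 0 for no match.
my $plain_hits_byte   = $byte =~ /\xFF/   ? 1 : 0;
my $braced_hits_byte  = $byte =~ /\x{FF}/ ? 1 : 0;
my $doubled_hits_byte = $byte =~ /\\xFF/  ? 1 : 0;
my $doubled_hits_text = $text =~ /\\xFF/  ? 1 : 0;

print "plain on byte:   $plain_hits_byte\n";
print "braced on byte:  $braced_hits_byte\n";
print "doubled on byte: $doubled_hits_byte\n";
print "doubled on text: $doubled_hits_text\n";
```

The first two forms match the 0xFF character; the doubled form matches only the literal text, so it would not find a JPEG marker in binary data.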