Regular expression for BOM required

Discussion in 'Perl Misc' started by Peter Gordon, Jan 12, 2013.

  1. Peter Gordon

    Peter Gordon Guest

    #!/cygdrive/c/cygwin/bin/perl
    use strict;
    use warnings;
    use 5.14.0;
    open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
    while( <$fh> ) {
        say "Found regular expression" if /\xFE\xFF/;
        # say "Found it!" if s/\A.*nm=//;
        print;
    }

    # I'm trying to match a byte order mark (BOM) in a file. Below is
    # the start of an octal dump of the file.
    # 0000000 177377 000156 000155 000075 000142 000157 000164 000164
    # The line:
    # say "Found it!" if s/\A.*nm=//;
    # works correctly, but I can't write a regular expression which matches
    # octal 0000000 177377 at the start of a line. Help with the
    # regular expression would be appreciated.
    # If it matters, I'm working on Windows 7.
     
    Peter Gordon, Jan 12, 2013
    #1

  2. You want to match the single character U+FEFF BOM here, not a sequence
    of two characters, U+00FE LATIN SMALL LETTER THORN and U+00FF LATIN
    SMALL LETTER Y WITH DIAERESIS.

    So you have to write

    say "Found regular expression" if /\x{FEFF}/;
                                         ^^^^^^
    The default output format of od (little endian 16 bit values in octal)
    is confusing. Yes, 0xFEFF is 0177377 in octal, but 177377 looks too much
    like 7FFF for me to do the bitshift intuitively in my head.

    Better to use "od -tx1" or "od -tx2".
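    To make the difference concrete, here is a minimal sketch. It assumes the
    file starts with the text from the octal dump above (BOM followed by
    "nm=bott") and uses Encode's encode/decode functions on an in-memory
    string instead of a real file; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Bytes a UTF-16LE file with a BOM would start with: FF FE 6E 00 6D 00 ...
my $bytes = encode('UTF-16LE', "\x{FEFF}nm=bott\n");

# After decoding there is one character U+FEFF, not the two bytes FE FF.
my $text = decode('UTF-16LE', $bytes);

my $matches_char = $text =~ /\x{FEFF}/ ? 1 : 0;   # single code point U+FEFF
my $matches_pair = $text =~ /\xFE\xFF/ ? 1 : 0;   # U+00FE followed by U+00FF

# od's default output shows the first two bytes as one little-endian
# 16-bit word in octal.
my $od_word = sprintf '%06o', unpack('v', $bytes);

print "char: $matches_char, pair: $matches_pair, od word: $od_word\n";
# prints "char: 1, pair: 0, od word: 177377"
```

    The same /\x{FEFF}/ pattern should apply to lines read through an
    :encoding(utf16le) layer, since the layer hands decoded characters to
    the regex.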

    hp
     
    Peter J. Holzer, Jan 12, 2013
    #2

  3. Peter Gordon

    Peter Gordon Guest

    Thanks Peter,
    It was the curly braces which I was missing.
     
    Peter Gordon, Jan 12, 2013
    #3
  4. No. After decoding there is no byte order any more, just characters, and
    the character you want to match is \x{FEFF}.

    If you try to open a big-endian file with :encoding(utf16le), the script
    will die trying to read the first line.

    (If you open it with :encoding(utf16), the BOM will be used to determine
    endianness and *not* passed through - this seems a little inconsistent
    to me)
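    The difference in BOM handling can be seen in a small sketch. It uses
    Encode's decode function rather than a PerlIO layer, on the assumption
    that the :encoding layers behave the same way because they are built on
    Encode; the byte string is a made-up stand-in for the start of the file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# FF FE is the little-endian BOM; "n\0m\0=\0" is "nm=" in UTF-16LE.
my $bytes = "\xFF\xFE" . "n\0m\0=\0";

# Explicit byte order: the BOM is decoded like any other character.
my $explicit = decode('UTF-16LE', $bytes);

# Generic UTF-16: the BOM selects the byte order and is then dropped.
my $generic = decode('UTF-16', $bytes);

printf "UTF-16LE: %d chars, starts with U+%04X\n",
    length($explicit), ord($explicit);
printf "UTF-16:   %d chars, starts with U+%04X\n",
    length($generic), ord($generic);
# prints:
# UTF-16LE: 4 chars, starts with U+FEFF
# UTF-16:   3 chars, starts with U+006E
```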

    hp
     
    Peter J. Holzer, Jan 14, 2013
    #4
  5. Peter Gordon

    Peter Gordon Guest

    The files I'm editing are playlists for Zoomplayer, an Israeli media
    player, so they are consistent in their Unicode encoding and format.
    Is there a way to get Unicode working with the combination of the
    diamond operator and in-place editing?
    The code below runs fine when run as a program, e.g.: $insertTT.pl aa.zpl
    but crashes when I try to run it with the -i command line option, e.g.:
    $perl -i insertTT.pl aa.zpl

    #!/cygdrive/c/cygwin/bin/perl
    # Used to insert a "tt=NUMBER: " line in a new .df files.
    use strict;
    use warnings;
    use 5.14.0;
    use Encode qw(encode decode);
    use open qw(:std IN :encoding(utf16-le));

    # $^I = ".bak";
    my $first = 1;
    while( <> ) {
        my $line = $_;
        if ( $first == 1 ) {
            $line =~ s/\x{FEFF}nm=(.*)/nm=$1/;
            $first = 0;
        }
        $line = decode("utf8", $line);
        print $line;
        if ( $line =~ /nm=/ ) {
            my $num = $line;
            chomp($num);
            $num =~ s/nm=.*?(\d+).*/$1/;
            print "tt=$num: \n";
        }
    }
     
    Peter Gordon, Jan 14, 2013
    #5
  6. If perl crashes you should file a bug report.

    hp
     
    Peter J. Holzer, Jan 15, 2013
    #6
  7. If that was the intent of the OP, opening the file in one byte order and
    checking for a reversed BOM wouldn't work: The diamond operator dies
    when it encounters the wrong BOM (of course you could catch the
    exception and then try the other endianness).

    I think there are two good ways to open UTF-16 files with unknown byte
    order:

    1) The carefree method: Just use :encoding(utf16), and it will
    automatically determine the endianness from the BOM, and you don't
    have to care whether the file is little or big endian. Plus, the BOM
    is automatically filtered out so you don't have to. On the flipside,
    you lose the information about the endianness and the BOM, so if you
    need that, this isn't for you.

    2) Open the file in binary mode and read the first few bytes. Determine
    the correct encoding from those, rewind and set the encoding layer.
    This is more work, but a lot more flexible: You can detect any
    encoding you want.
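    Method 2 might look roughly like the sketch below. The helper names
    guess_encoding_from_bom and open_by_bom are hypothetical, the list of
    BOMs is limited to the common Unicode ones, and a file without a BOM is
    simply treated as an error here.

```perl
use strict;
use warnings;

# Hypothetical helper: map the leading bytes of a file to an encoding name.
# The UTF-32 patterns are checked first because the UTF-32LE BOM begins
# with the UTF-16LE BOM (a UTF-16LE file whose first character is U+0000
# would be misdetected, which is assumed not to happen here).
sub guess_encoding_from_bom {
    my ($head) = @_;
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;    # no BOM recognised
}

# Hypothetical helper: open in binary mode, sniff the BOM, rewind, and
# then push the matching encoding layer.
sub open_by_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    defined read($fh, my $head, 4) or die "Cannot read $path: $!\n";
    my $enc = guess_encoding_from_bom($head)
        or die "No BOM found in $path\n";
    seek $fh, 0, 0 or die "Cannot rewind $path: $!\n";
    binmode $fh, ":encoding($enc)" or die "Cannot set layer on $path: $!\n";
    return ($fh, $enc);
}

# Usage (the file name is taken from the first post):
#   my ($fh, $enc) = open_by_bom('00Tst.zpl');
#   my $first = <$fh>;
#   $first =~ s/\A\x{FEFF}//;    # the BOM is still the first character
```

    Because the layer names an explicit byte order, the BOM is passed
    through as U+FEFF, so the endianness information is kept in $enc and
    the caller decides whether to strip the character.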

    As always, there are probably more ways to do it.

    hp
     
    Peter J. Holzer, Jan 17, 2013
    #7
