Regular expression for BOM required

Peter Gordon · Jan 12, 2013

#!/cygdrive/c/cygwin/bin/perl
use strict;
use warnings;
use 5.14.0;
open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
while( <$fh> ) {
    say "Found regular expression" if /\xFE\xFF/;
    # say "Found it!" if s/\A.*nm=//;
    print;
}

# I'm trying to match a byte order mark in a file. Below is
# the start of an octal dump of the file.
# 0000000 177377 000156 000155 000075 000142 000157 000164 000164
# The line:
# say "Found it!" if s/\A.*nm=//;
# works correctly, but I can't write a regular expression which matches
# octal 0000000 177377 at the start of a line. Help with the
# regular expression would be appreciated.
# If it matters, I'm working on Windows 7.

Peter J. Holzer · Jan 12, 2013

#!/cygdrive/c/cygwin/bin/perl
use strict;
use warnings;
use 5.14.0;
open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
while( <$fh> ) {
say "Found regular expression" if /\xFE\xFF/;

You want to match the single character U+FEFF BOM here, not a sequence
of two characters, U+00FE LATIN SMALL LETTER THORN and U+00FF LATIN
SMALL LETTER Y WITH DIAERESIS.

So you have to write

say "Found regular expression" if /\x{FEFF}/;

print;
}
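
The difference between the two patterns can be checked without the file. This is only a minimal sketch: it assumes that Encode's decode() treats the data the same way the :encoding(utf16le) layer does, and the byte string is made up to resemble the start of the dump (BOM followed by "nm=").

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would sit in a UTF-16LE file: BOM (FF FE), then "nm=".
# (Made-up sample data, not the real playlist.)
my $bytes = "\xFF\xFE" . "n\x00m\x00=\x00";

# With an explicit byte order the BOM is decoded to one character, U+FEFF.
my $text = decode('UTF-16LE', $bytes);

# One character with code point 0xFEFF: matches.
my $one_char = ($text =~ /\A\x{FEFF}/) ? 1 : 0;

# Two characters, U+00FE followed by U+00FF: does not match.
my $two_chars = ($text =~ /\xFE\xFF/) ? 1 : 0;

printf "length=%d one_char=%d two_chars=%d\n",
    length($text), $one_char, $two_chars;
```

The decoded string is four characters long (BOM plus "nm="), which is why only the braced form finds anything.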

# I'm trying to match a byte order mark in a file. Below is
# the start of an octal dump of the file.
# 0000000 177377 000156 000155 000075 000142 000157 000164 000164

^^^^^^
The default output format of od (little endian 16 bit values in octal)
is confusing. Yes, 0xFEFF is 0177377 in octal, but 177377 looks too much
like 7FFF for me to do the bitshift intuitively in my head.

Better to use "od -tx1" or "od -tx2".
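
If only the default od output is at hand, the conversion can be left to perl instead of doing it in your head. A small sketch, using the first word of the dump quoted above; the variable names are just for illustration.

```perl
use strict;
use warnings;

# First 16-bit word of the od output, given as an octal string.
my $word = oct('0177377');
my $hex  = sprintf('%04X', $word);
print "octal 177377 = hex $hex\n";

# od printed little-endian 16-bit words, so pack the value as a
# little-endian short to see the byte order on disk.
my $bytes = pack('v', $word);
my $dump  = join(' ', map { sprintf('%02X', ord($_)) } split(//, $bytes));
print "bytes on disk: $dump\n";
```

This prints FEFF for the word and FF FE for the bytes, which is the little-endian BOM that "od -tx1" would show directly.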

hp

Peter Gordon · Jan 12, 2013

You want to match the single character U+FEFF BOM here, not a sequence
of two characters, U+00FE LATIN SMALL LETTER THORN and U+00FF LATIN
SMALL LETTER Y WITH DIAERESIS.

So you have to write

say "Found regular expression" if /\x{FEFF}/;

print;
}

Thanks Peter,
It was the curly braces which I was missing.

Peter J. Holzer · Jan 14, 2013

Presumably you also have to check for the "other order" ?

No. After decoding there is no byte order any more, just characters, and
the character you want to match is \x{FEFF}.

If you try to open a big-endian file with :encoding(utf16le), the script
will die trying to read the first line.

(If you open it with :encoding(utf16), the BOM will be used to determine
endianness and *not* passed through - this seems a little inconsistent
to me)
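
The difference in BOM handling can be seen with Encode directly. This is a sketch under the assumption that the :encoding layers behave like the corresponding decode() calls; the sample bytes are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "nm=" with a BOM, once little-endian and once big-endian (sample data).
my $le = "\xFF\xFE" . "n\x00m\x00=\x00";
my $be = "\xFE\xFF" . "\x00n\x00m\x00=";

# Explicit byte order: the BOM is passed through as U+FEFF.
my $kept = decode('UTF-16LE', $le);

# Generic UTF-16: the BOM selects the byte order and is removed.
my $from_le = decode('UTF-16', $le);
my $from_be = decode('UTF-16', $be);

printf "UTF-16LE: %d characters, BOM kept: %s\n",
    length($kept), ($kept =~ /\A\x{FEFF}/ ? 'yes' : 'no');
printf "UTF-16:   '%s' and '%s'\n", $from_le, $from_be;
```

With UTF-16 both inputs decode to the same three characters, so after decoding there is nothing left that would tell you which byte order the file had.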

hp

Peter Gordon · Jan 14, 2013

Presumably you also have to check for the "other order" ?

BugBear

The files I'm editing are playlists from Zoomplayer, an Israeli media
player, so their Unicode encoding and format are consistent. Is there a
way to get Unicode to work with the combination of the diamond operator
and in-place editing?
The code below runs fine when run as a program, e.g.
$ insertTT.pl aa.zpl
but crashes when I try to run it with the -i command-line option, e.g.
$ perl -i insertTT.pl aa.zpl

#!/cygdrive/c/cygwin/bin/perl
# Used to insert a "tt=NUMBER: " line in a new .df files.
use strict;
use warnings;
use 5.14.0;
use Encode qw(encode decode);
use open qw(:std IN :encoding(utf16-le));

# $^I = ".bak";
my $first = 1;
while( <> ) {
    my $line = $_;
    if ( $first == 1 ) {
        $line =~ s/\x{FEFF}nm=(.*)/nm=$1/;
        $first = 0;
    }
    $line = decode("utf8", $line);
    print $line;
    if ( $line =~ /nm=/ ) {
        my $num = $line;
        chomp($num);
        $num =~ s/nm=.*?(\d+).*/$1/;
        print "tt=$num: \n";
    }
}

Peter J. Holzer · Jan 15, 2013

The code below runs fine when run as a program eg: $insertTT.pl aa.zpl
but crashes when I try to run it with the -i command line option. eg:

If perl crashes you should file a bug report.

hp

Peter J. Holzer · Jan 17, 2013

Peter Gordon wrote:
[$_ was read from a file opened with ":encoding(utf16le)"]
say "Found regular expression" if /\x{FEFF}/; [...]
Presumably you also have to check for the "other order" ?

No. After decoding there is no byte order any more, just characters, and
the character you want to match is \x{FEFF}.

If you try to open a big-endian file with :encoding(utf16le), the script
will die trying to read the first line.

(If you open it with :encoding(utf16), the BOM will be used to determine
endianness and *not* passed through - this seems a little inconsistent
to me)

I had (perhaps wrongly) assumed that the OP's true intent (or need)
was to read the BOM and use it to decide *which* byte order
was being used, and hence to use the correct decoder.

If that was the intent of the OP, opening the file in one byte order and
checking for a reversed BOM wouldn't work: The diamond operator dies
when it encounters the wrong BOM (of course you could catch the
exception and then try the other endianness).

I think there are two good ways to open UTF-16 files with unknown byte
order:

1) The carefree method: Just use :encoding(utf16), and it will
automatically determine the endianness from the BOM, and you don't
have to care whether the file is little or big endian. Plus, the BOM
is automatically filtered out so you don't have to. On the flipside,
you lose the information about the endianness and the BOM, so if you
need that, this isn't for you.

2) Open the file in binary mode and read the first few bytes. Determine
the correct encoding from those, rewind and set the encoding layer.
This is more work, but a lot more flexible: You can detect any
encoding you want.
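
Both methods can be sketched roughly as follows. The helper open_by_bom and the test data are made up for illustration, it only knows the two UTF-16 BOMs, and it has not been tried on Cygwin or Windows, where the default :crlf layer may need extra care.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Method 2 as a hypothetical helper: read the first two bytes in binary
# mode, then push the matching encoding layer. Returns the handle and
# the encoding name, so the caller still knows the byte order.
sub open_by_bom {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    binmode $fh;
    my $head = '';
    read $fh, $head, 2;
    my $enc = $head eq "\xFF\xFE" ? 'UTF-16LE'
            : $head eq "\xFE\xFF" ? 'UTF-16BE'
            :                       undef;
    die "$path: no UTF-16 BOM found\n" unless defined $enc;
    # The handle is now positioned just past the BOM, so the decoded
    # text will not start with U+FEFF.
    binmode $fh, ":encoding($enc)";
    return ($fh, $enc);
}

# Build a small big-endian test file: BOM followed by "nm=test\n".
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFE\xFF", encode('UTF-16BE', "nm=test\n");
close $tmp or die "Cannot close $path: $!\n";

# Method 1: let :encoding(UTF-16) read and discard the BOM.
open my $auto, '<:encoding(UTF-16)', $path or die "Cannot open $path: $!\n";
my $line1 = <$auto>;
close $auto;

# Method 2: detect the byte order ourselves.
my ($fh, $enc) = open_by_bom($path);
my $line2 = <$fh>;
close $fh;

print "method 1: $line1";
print "method 2 ($enc): $line2";
```

Unlike the description above, the sketch does not rewind after sniffing: it simply continues after the BOM, so the BOM never shows up in the data. If you want it back as U+FEFF, seek to the start before setting the encoding layer.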

As always, there are probably more ways to do it.

hp
