Regular expression for BOM required

Discussion in 'Perl Misc' started by Peter Gordon, Jan 12, 2013.

  1. Peter Gordon

    Peter Gordon Guest

    #!/cygdrive/c/cygwin/bin/perl
    use strict;
    use warnings;
    use 5.14.0;
    open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
    while( <$fh> ) {
        say "Found regular expression" if /\xFE\xFF/;
        # say "Found it!" if s/\A.*nm=//;
        print;
    }

    # I'm trying to match a byte order mark in a file. Below is
    # the start of an octal dump of the file.
    # 0000000 177377 000156 000155 000075 000142 000157 000164 000164
    # The line:
    # say "Found it!" if s/\A.*nm=//;
    # works correctly, but I can't write a regular expression which matches
    # octal 0000000 177377 at the start of a line. Help with the
    # regular expression would be appreciated.
    # If it matters, I'm working on Windows 7.
     
    Peter Gordon, Jan 12, 2013
    #1

  2. On 2013-01-12 11:54, Peter Gordon <petergoATnetspace.net.au> wrote:
    > #!/cygdrive/c/cygwin/bin/perl
    > use strict;
    > use warnings;
    > use 5.14.0;
    > open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
    > while( <$fh> ) {
    > say "Found regular expression" if /\xFE\xFF/;


    You want to match the single character U+FEFF BOM here, not a sequence
    of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
    LETTER Y WITH DIAERESIS.

    So you have to write

    say "Found regular expression" if /\x{FEFF}/;
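
    A minimal sketch of the difference, assuming the data has already been
    decoded from UTF-16LE. The byte string and the variable names below are
    made up for illustration; the first pattern looks for two separate
    characters, the second for the single character U+FEFF.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might sit on disk: a UTF-16LE BOM (FF FE), then "nm=".
my $bytes = "\xFF\xFE" . "n\0m\0=\0";

# Decoding with an explicit byte order turns the two BOM bytes into the
# single character U+FEFF at the start of the string.
my $text = decode('UTF-16LE', $bytes);

# /\xFE\xFF/ asks for U+00FE followed by U+00FF: two characters.
my $two_chars = $text =~ /\xFE\xFF/ ? 1 : 0;

# /\x{FEFF}/ asks for the one character U+FEFF.
my $one_char = $text =~ /\x{FEFF}/ ? 1 : 0;

printf "length=%d two_chars=%d one_char=%d\n",
    length($text), $two_chars, $one_char;
```

    Only the braced form matches here, because after decoding the string
    holds characters, not the original bytes.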

    > print;
    > }
    >
    > # I'm trying to match a byte order mark in a file. Below is
    > # the start of an octal dump of the file.
    > # 0000000 177377 000156 000155 000075 000142 000157 000164 000164

    ^^^^^^
    The default output format of od (little endian 16 bit values in octal)
    is confusing. Yes, 0xFEFF is 0177377 in octal, but 177377 looks too much
    like 7FFF for me to do the bitshift intuitively in my head.

    Better to use "od -tx1" or "od -tx2".
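
    If od is not at hand, the same check can be sketched in Perl. The
    sample bytes below are rebuilt from the first four words of the octal
    dump quoted above and are only an illustration, not the real file.

```perl
use strict;
use warnings;

# The first word of the dump, converted from octal.
my $first_word = oct('0177377');
printf "first word: 0x%04X\n", $first_word;

# Rebuild the first four 16-bit little-endian words of the dump:
# 177377 000156 000155 000075 (octal).
my $bytes = pack 'v*', oct('177377'), oct('156'), oct('155'), oct('75');

# Byte-wise hex view, similar in spirit to "od -tx1".
my $hex = join ' ', unpack '(H2)*', $bytes;
print "$hex\n";
```

    Read byte by byte, the file starts with ff fe, which is the UTF-16LE
    encoding of U+FEFF, followed by "nm=".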

    hp



    --
    Peter J. Holzer | Sysadmin WSR | http://www.hjp.at/
    The curse of electronic word processing: you keep polishing your
    text until the parts of the sentence no longer fits together.
    -- Ralph Babel
     
    Peter J. Holzer, Jan 12, 2013
    #2

  3. Peter Gordon

    Peter Gordon Guest

    "Peter J. Holzer" <> wrote in news:slrnkf30s7.kis.hjp-
    :

    > You want to match the single character U+FEFF BOM here, not a sequence
    > of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
    > LETTER Y WITH DIAERESIS.
    >
    > So you have to write
    >
    > say "Found regular expression" if /\x{FEFF}/;
    >
    > print;
    > }
    >

    Thanks Peter,
    It was the curly braces which I was missing.
     
    Peter Gordon, Jan 12, 2013
    #3
  4. On 2013-01-14 10:12, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > Peter Gordon wrote:
    >> "Peter J. Holzer" <> wrote in news:slrnkf30s7.kis.hjp-
    >> :
    >>
    >>> You want to match the single character U+FEFF BOM here, not a sequence
    >>> of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
    >>> LETTER Y WITH DIAERESIS.
    >>>
    >>> So you have to write
    >>>
    >>> say "Found regular expression" if /\x{FEFF}/;
    >>>
    >>> print;
    >>> }
    >>>

    >> Thanks Peter,
    >> It was the curly braces which I was missing.
    >>

    >
    > Presumably you also have to check for the "other order" ?


    No. After decoding there is no byte order any more, just characters, and
    the character you want to match is \x{FEFF}.

    If you try to open a big-endian file with :encoding(utf16le), the script
    will die trying to read the first line.

    (If you open it with :encoding(utf16), the BOM will be used to determine
    endianness and *not* passed through - this seems a little inconsistent
    to me)
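
    The difference in BOM handling can be sketched with Encode directly
    instead of a file handle. The byte string is made up for illustration,
    and the sketch only shows the little-endian case.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A UTF-16LE BOM (FF FE) followed by "nm=" in little-endian order.
my $le_bytes = "\xFF\xFE" . "n\0m\0=\0";

# Explicit byte order: the BOM is decoded like any other character
# and stays at the front of the string as U+FEFF.
my $explicit = decode('UTF-16LE', $le_bytes);

# Byte order taken from the BOM: the BOM is consumed while decoding.
my $auto = decode('UTF-16', $le_bytes);

printf "UTF-16LE: %d characters, first is U+%04X\n",
    length($explicit), ord($explicit);
printf "UTF-16:   %d characters, first is U+%04X\n",
    length($auto), ord($auto);
```

    With the explicit encoding the string is one character longer, which
    is why the BOM has to be matched or stripped by hand in that case.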

    hp


     
    Peter J. Holzer, Jan 14, 2013
    #4
  5. Peter Gordon

    Peter Gordon Guest

    bugbear <bugbear@trim_papermule.co.uk_trim> wrote in
    news:D:

    > Peter Gordon wrote:
    >> "Peter J. Holzer" <> wrote in news:slrnkf30s7.kis.hjp-
    >> :
    >>
    >>> You want to match the single character U+FEFF BOM here, not a sequence
    >>> of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
    >>> LETTER Y WITH DIAERESIS.
    >>>
    >>> So you have to write
    >>>
    >>> say "Found regular expression" if /\x{FEFF}/;
    >>>
    >>> print;
    >>> }
    >>>

    >> Thanks Peter,
    >> It was the curly braces which I was missing.
    >>

    >
    > Presumably you also have to check for the "other order" ?
    >
    > BugBear

    The files I'm editing are playlists from Zoomplayer, an Israeli media
    player, so their Unicode encoding and format are consistent. Is there
    a way to get Unicode to work with the combination of the diamond
    operator and in-place editing?
    The code below runs fine when run as a program eg: $insertTT.pl aa.zpl
    but crashes when I try to run it with the -i command line option. eg:
    $perl -i insertTT.pl aa.zpl

    #!/cygdrive/c/cygwin/bin/perl
    # Used to insert a "tt=NUMBER: " line in new .df files.
    use strict;
    use warnings;
    use 5.14.0;
    use Encode qw(encode decode);
    use open qw(:std IN :encoding(utf16-le));

    # $^I = ".bak";
    my $first = 1;
    while( <> ) {
        my $line = $_;
        if ( $first == 1 ) {
            $line =~ s/\x{FEFF}nm=(.*)/nm=$1/;
            $first = 0;
        }
        $line = decode("utf8", $line);
        print $line;
        if ( $line =~ /nm=/ ) {
            my $num = $line;
            chomp($num);
            $num =~ s/nm=.*?(\d+).*/$1/;
            print "tt=$num: \n";
        }
    }
     
    Peter Gordon, Jan 14, 2013
    #5
  6. On 2013-01-14 21:04, Peter Gordon <petergoATnetspace.net.au> wrote:
    > The code below runs fine when run as a program eg: $insertTT.pl aa.zpl
    > but crashes when I try to run it with the -i command line option. eg:


    If perl crashes you should file a bug report.

    hp


     
    Peter J. Holzer, Jan 15, 2013
    #6
  7. On 2013-01-17 12:16, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > Peter J. Holzer wrote:
    >> On 2013-01-14 10:12, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >>> Peter Gordon wrote:
    >>>> "Peter J. Holzer" <> wrote in news:slrnkf30s7.kis.hjp-
    >>>> :

    [$_ was read from a file opened with ":encoding(utf16le)"]
    >>>>> say "Found regular expression" if /\x{FEFF}/;

    [...]
    >>> Presumably you also have to check for the "other order" ?

    >>
    >> No. After decoding there is no byte order any more, just characters, and
    >> the character you want to match is \x{FEFF}.
    >>
    >> If you try to open a big-endian file with :encoding(utf16le), the script
    >> will die trying to read the first line.
    >>
    >> (If you open it with :encoding(utf16), the BOM will be used to determine
    >> endianness and *not* passed through - this seems a little inconsistent
    >> to me)

    >
    > I had (perhaps wrongly) assumed that the OP's true intent (or need)
    > was to read the BOM and use it to decide *which* byte order
    > was being used, and hence to use the correct decoder.


    If that was the intent of the OP, opening the file in one byte order and
    checking for a reversed BOM wouldn't work: The diamond operator dies
    when it encounters the wrong BOM (of course you could catch the
    exception and then try the other endianness).

    I think there are two good ways to open UTF-16 files with unknown byte
    order:

    1) The carefree method: Just use :encoding(utf16), and it will
    automatically determine the endianness from the BOM, and you don't
    have to care whether the file is little or big endian. Plus, the BOM
    is automatically filtered out so you don't have to. On the flipside,
    you lose the information about the endianness and the BOM, so if you
    need that, this isn't for you.

    2) Open the file in binary mode and read the first few bytes. Determine
    the correct encoding from those, rewind and set the encoding layer.
    This is more work, but a lot more flexible: You can detect any
    encoding you want.
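
    A rough sketch of method 2, under some assumptions: the helper names
    encoding_from_bom and open_by_bom are made up here, only three BOMs
    are recognised, and the handle is positioned just past the BOM rather
    than at offset 0 so that the caller never sees U+FEFF. Seek to 0
    instead if the BOM should be kept.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from the first bytes of a file.
# Returns (encoding name, BOM length in bytes), or an empty list if no
# known BOM is found. UTF-32LE, which also starts with FF FE, is
# deliberately not handled in this sketch.
sub encoding_from_bom {
    my ($head) = @_;
    return ('UTF-8',    3) if $head =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16LE', 2) if $head =~ /\A\xFF\xFE/;
    return ('UTF-16BE', 2) if $head =~ /\A\xFE\xFF/;
    return;
}

# Hypothetical helper: open $path for reading with the detected encoding,
# falling back to $default when the file has no recognised BOM.
sub open_by_bom {
    my ($path, $default) = @_;

    # Binary mode first, so the leading bytes arrive untouched.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    my $got = read $fh, my $head, 3;
    die "Cannot read $path: $!\n" unless defined $got;
    $head = '' unless defined $head;

    my ($enc, $bom_len) = encoding_from_bom($head);
    ($enc, $bom_len) = ($default, 0) unless defined $enc;

    # Reposition (here: just past the BOM), then set the encoding layer.
    seek $fh, $bom_len, 0 or die "Cannot seek in $path: $!\n";
    binmode $fh, ":encoding($enc)" or die "Cannot set layer on $path: $!\n";
    return ($fh, $enc);
}

# Usage: perl thisscript.pl somefile
if (@ARGV) {
    my ($fh, $enc) = open_by_bom($ARGV[0], 'UTF-8');
    print STDERR "Detected $enc\n";
    binmode STDOUT, ':encoding(UTF-8)';
    print while <$fh>;
}
```

    Unlike method 1, the detected encoding name is returned to the caller,
    so it is still available when the file has to be written back in the
    same byte order.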

    As always, there are probably more ways to do it.

    hp

     
    Peter J. Holzer, Jan 17, 2013
    #7