Parsing delimiter-separated data.

Adam

I'm writing a hash (string keys, string values) to a text file (by STDOUT)
for reading later, and I decided on the following format:

key|value|
another key|another value|

to make the file clearly human-readable (the values and keys can contain
spaces). I've also provided for escaping ``|'' and ``\'' in the data with
``\|'' and ``\\'' respectively.

Here's the output routine

foreach $key (keys(%table) ) {
    $value = $table{$key} ;
    $key =~ s/\\/\\\\/g ;
    $key =~ s/\|/\\\|/g ;
    $value =~ s/\\/\\\\/g ;
    $value =~ s/\|/\\\|/g ;
    print($key . "|" . $value . "|\n") ;
}
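
The two escaping substitutions per field can be folded into one pass with a character class. A minimal sketch, assuming a helper named escape_field and a sample %table, neither of which is in the original post:

```perl
use strict;
use warnings;

# Hypothetical helper: put a backslash in front of every "\" and "|".
sub escape_field {
    my ($text) = @_;
    $text =~ s/([\\|])/\\$1/g;
    return $text;
}

# Sample data invented for illustration: the key contains a "|",
# the value contains a single backslash.
my %table = ('a|b' => 'c\\d');

for my $key (sort keys %table) {
    print escape_field($key), '|', escape_field($table{$key}), "|\n";
}
```

Because both characters are handled in the same substitution, the order in which they are escaped no longer matters.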

and here's the input routine

while ($line = <>) {
    chomp($line) ;
    $line =~ /^(.*([^\\]|\\\\))\|(.*)\|$/ ;
    $key = $1 ;
    $value = $3 ;
    $key =~ s/\\\|/\|/g ;
    $key =~ s/\\\\/\\/g ;
    $value =~ s/\\\|/\|/g ;
    $value =~ s/\\\\/\\/g ;
    $table{$key} = $value ;
}


They seem to work, but I'm not sure how efficient they are (in particular
I have doubts about the regexp), so I'd appreciate any suggestions for
improvement.

I've also just noticed that the input routine would not correctly handle a
line like this:

blah\\|blahblah|

What's the best way to reverse the escapes?
 
Adam

// I'm writing a hash (string keys, string values) to a text file (by STDOUT)
// for reading later, and I decided on the following format:
//
// key|value|
// another key|another value|
//
// to make the file clearly human-readable (the values and keys can contain
// spaces). I've also provided for escaping ``|'' and ``\'' in the data with
// ``\|'' and ``\\'' respectively. ...
// What's the best way to reverse the escapes?

What is the best way? Here is *a* way of dealing with it: ...
while (<DATA>) {
    chomp;
    my ($key, $value) = /^([^\\|]*(?:\\.[^\\|]*)*)\|([^\\|]*(?:\\.[^\\|]*)*)\|$/
        or next;
    map {s/\\(.)/$1/g} $key, $value;
    print "[$key] [$value]\n";
}
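
To try that pattern on awkward lines, it can be wrapped in a small sub. This is only a sketch: the name parse_record, the $field variable and the second and third sample lines are made up here, and the pattern itself is the one quoted above, split into a reusable piece.

```perl
use strict;
use warnings;

# One escaped field: runs of ordinary characters, interleaved with
# backslash-plus-any-character pairs. Same pattern as above.
my $field = qr/[^\\|]*(?:\\.[^\\|]*)*/;

# Hypothetical helper: split "key|value|" and undo the escapes.
# Returns an empty list when the line does not fit the format.
sub parse_record {
    my ($line) = @_;
    my ($key, $value) = $line =~ /^($field)\|($field)\|$/
        or return;
    # A backslash followed by any character stands for that character.
    s/\\(.)/$1/g for $key, $value;
    return ($key, $value);
}

# A quoted here-document keeps every backslash literally.
my @lines = split /\n/, <<'END';
blah\\|blahblah|
k|\\\|x|
no separator here
END

for my $line (@lines) {
    my ($key, $value) = parse_record($line) or next;
    print "[$key] [$value]\n";
}
```

Both escapes are undone in a single left-to-right substitution, so an escaped backslash that sits directly before a "|" is consumed as a pair before the "|" is looked at.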

Thanks -- that's much better. I wonder if it would be more effective
just to work from left to right by characters instead of using a
regexp.
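
A character-by-character scan might look roughly like this. It is a sketch under stated assumptions, not a benchmarked alternative: the name split_escaped is invented, and whether it beats the regexp would have to be measured.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line into unescaped fields by walking it
# left to right. A backslash makes the next character literal; an
# unescaped "|" ends the current field.
sub split_escaped {
    my ($line) = @_;
    my @fields;
    my $current = '';
    my $escaped = 0;
    for my $char (split //, $line) {
        if ($escaped) {
            $current .= $char;
            $escaped = 0;
        }
        elsif ($char eq '\\') {
            $escaped = 1;
        }
        elsif ($char eq '|') {
            push @fields, $current;
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    # Leftover text means the closing "|" was missing; a pending escape
    # means the line ended in a lone backslash. Reject both.
    return if $escaped or length $current;
    return @fields;
}

my ($key, $value) = split_escaped('another key|another value|');
print "[$key] [$value]\n";
```

Unlike the two-field regexp, this returns however many fields the line holds, so the caller has to check the count.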

Since this is a standard, traditional, Unix file format, isn't there a
"canonical" way to analyse it?

-- Adam
 
Anno Siegel

Adam said:
// I'm writing a hash (string keys, string values) to a text file (by STDOUT)
// for reading later, and I decided on the following format:
//
// key|value|
// another key|another value|
//
// to make the file clearly human-readable (the values and keys can contain
// spaces). I've also provided for escaping ``|'' and ``\'' in the data with
// ``\|'' and ``\\'' respectively.
[...]

// Since this is a standard, traditional, Unix file format, isn't there a
// "canonical" way to analyse it?

The file format is usually called CSV (comma separated values), even if
the separator can be something else. Do a CPAN search for CSV.

Anno
 
Adam

Anno Siegel said:
// The file format is usually called CSV (comma separated values), even if
// the separator can be something else. Do a CPAN search for CSV.

There is a Text::CSV module, but it only handles commas as separators and
it uses the "Windows-like" format, e.g.

Fred, Smith, "Smith, Fred", (e-mail address removed)

whereas I'm trying to use the correct "escaped" format, analogous to this:

Fred, Smith, Smith\, Fred, (e-mail address removed)

as recommended by Eric Raymond.

http://catb.org/~esr/writings/taoup/html/ch05s02.html#id2901882

I'll try looking for a canonical approach to parsing this in C.

-- Adam
 
ko

Adam said:
// I'm writing a hash (string keys, string values) to a text file (by STDOUT)
// for reading later, and I decided on the following format:
//
// key|value|
// another key|another value|
//
// to make the file clearly human-readable (the values and keys can contain
// spaces). I've also provided for escaping ``|'' and ``\'' in the data with
// ``\|'' and ``\\'' respectively.
..

// What's the best way to reverse the escapes?
[snip]

// Since this is a standard, traditional, Unix file format, isn't there a
// "canonical" way to analyse it?
//
// -- Adam

You can try this method:

#!/usr/bin/perl -w
use strict;
use Text::ParseWords;

while ( my $line = <DATA> ) {
    # Split on unescaped "|"; with keep = 0 the backslash escapes are removed
    my ($key, $value) = quotewords('\|', 0, $line);
    print "'$key' => '$value'\n";
}

__DATA__
blah|blahblah|
bl ah|bla hblah|
bla\|h|blah\\\\blah|
\\blah|blahblah|

Text::ParseWords is a standard module, and the documentation is short
and straightforward.

HTH - keith
 
