Parsing delimiter-separated data.

Adam · Nov 14, 2003

I'm writing a hash (string keys, string values) to a text file (by STDOUT)
for reading later, and I decided on the following format:

key|value|
another key|another value|

to make the file clearly human-readable (the values and keys can contain
spaces). I've also provided for escaping ``|'' and ``\'' in the data with
``\|'' and ``\\'' respectively.

Here's the output routine

foreach $key (keys(%table) ) {
    $value = $table{$key} ;
    $key =~ s/\\/\\\\/g ;
    $key =~ s/\|/\\\|/g ;
    $value =~ s/\\/\\\\/g ;
    $value =~ s/\|/\\\|/g ;
    print($key . "|" . $value . "|\n") ;
}
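
For comparison, the four substitutions in the output routine can probably be folded into one character-class substitution. This is only a sketch: the helper names escape_field and format_record and the sample %table are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every backslash and pipe in a single pass
# by putting a backslash in front of it.
sub escape_field {
    my ($text) = @_;
    $text =~ s/([\\|])/\\$1/g;
    return $text;
}

# Hypothetical helper: build one "key|value|" record, newline included.
sub format_record {
    my ($key, $value) = @_;
    return escape_field($key) . '|' . escape_field($value) . "|\n";
}

# Sample data (assumed): the key is a|b and the value is c\d.
my %table = ('a|b' => 'c\\d');

for my $key (sort keys %table) {
    print format_record($key, $table{$key});   # prints a\|b|c\\d|
}
```

Because both special characters are handled in the same pass, the order of the substitutions stops mattering.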

and here's the input routine

while ($line = <>) {
    chomp($line) ;
    $line =~ /^(.*([^\\]|\\\\))\|(.*)\|$/ ;
    $key = $1 ;
    $value = $3 ;
    $key =~ s/\\\|/\|/g ;
    $key =~ s/\\\\/\\/g ;
    $value =~ s/\\\|/\|/g ;
    $value =~ s/\\\\/\\/g ;
    $table{$key} = $value ;
}

They seem to work, but I'm not sure how efficient they are (in particular
I have doubts about the regexp), so I'd appreciate any suggestions for
improvement.

I've also just noticed that the input routine would not correctly handle a
line like this:

blah\\|blahblah|

What's the best way to reverse the escapes?

Adam · Dec 12, 2003

// I'm writing a hash (string keys, string values) to a text file (by STDOUT)
// for reading later, and I decided on the following format:
//
// key|value|
// another key|another value|
//
// to make the file clearly human-readable (the values and keys can contain
// spaces). I've also provided for escaping ``|'' and ``\'' in the data with
// ``\|'' and ``\\'' respectively. ...
// What's the best way to reverse the escapes?

What is the best way? Here is *a* way of dealing with it: ...
while (<DATA>) {
    chomp;
    my ($key, $value) = /^([^\\|]*(?:\\.[^\\|]*)*)\|([^\\|]*(?:\\.[^\\|]*)*)\|$/
        or next;
    map {s/\\(.)/$1/g} $key, $value;
    print "[$key] [$value]\n";
}

Thanks -- that's much better. I wonder if it would be more effective
just to work from left to right by characters instead of using a
regexp.
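
A rough sketch of that left-to-right idea, for anyone who wants to compare it with the regexp version. The name parse_record and the two sample lines are assumptions for illustration; no timing was done, so this says nothing about which approach is faster.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into unescaped fields by walking
# the characters from left to right. Returns an empty list on bad input.
sub parse_record {
    my ($line) = @_;
    my @fields;
    my $field   = '';
    my $escaped = 0;    # true when the previous character was a backslash

    for my $char (split //, $line) {
        if ($escaped) {
            $field .= $char;        # take the escaped character literally
            $escaped = 0;
        }
        elsif ($char eq '\\') {
            $escaped = 1;           # remember the backslash, emit nothing yet
        }
        elsif ($char eq '|') {
            push @fields, $field;   # an unescaped pipe ends the field
            $field = '';
        }
        else {
            $field .= $char;
        }
    }

    # A dangling backslash or text after the last pipe is treated as malformed.
    return () if $escaped or length $field;
    return @fields;
}

# Sample lines, read literally (a quoted heredoc does no backslash processing).
my @lines = split /\n/, <<'END';
blah\\|blahblah|
bla\|h|blah\\\\blah|
END

for my $line (@lines) {
    my @fields = parse_record($line);
    next unless @fields == 2;
    print "[$fields[0]] [$fields[1]]\n";
}
# prints:
# [blah\] [blahblah]
# [bla|h] [blah\\blah]
```

Splitting and unescaping happen in the same pass here, so an escaped backslash directly in front of a separator needs no special case.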

Since this is a standard, traditional, Unix file format, isn't there a
"canonical" way to analyse it?

-- Adam

Anno Siegel · Dec 12, 2003

Adam said:
// I'm writing a hash (string keys, string values) to a text file (by STDOUT)
// for reading later, and I decided on the following format:
//
// key|value|
// another key|another value|
//
// to make the file clearly human-readable (the values and keys can contain
// spaces). I've also provided for escaping ``|'' and ``\'' in the data with
// ``\|'' and ``\\'' respectively.


[...]

// Since this is a standard, traditional, Unix file format, isn't there a
// "canonical" way to analyse it?

The file format is usually called CSV (comma separated values), even if
the separator can be something else. Do a CPAN search for CSV.

Anno

Adam · Dec 15, 2003

// The file format is usually called CSV (comma separated values), even if
// the separator can be something else. Do a CPAN search for CSV.

There is a Text::CSV module, but it only handles commas as separators and
it uses the "Windows-like" format, e.g.

Fred, Smith, "Smith, Fred", (e-mail address removed)

whereas I'm trying to use the correct "escaped" format, analogous to this:

Fred, Smith, Smith\, Fred, (e-mail address removed)

as recommended by Eric Raymond.

http://catb.org/~esr/writings/taoup/html/ch05s02.html#id2901882
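
To make the "escaped" format concrete, here is a small sketch of splitting such a line on unescaped separators with a tokenizing regexp. The function name split_escaped is hypothetical, the separator is passed in as a parameter, and whitespace after a separator is kept as data (an assumption; trim it if it is not significant).

```perl
use strict;
use warnings;

# Hypothetical helper: split $line on unescaped occurrences of $sep.
# A backslash makes the following character literal.
sub split_escaped {
    my ($line, $sep) = @_;
    my @fields = ('');

    # Consume one token at a time: an escape pair, a separator,
    # or any other single character.
    while ($line =~ /\G(?:\\(.)|(\Q$sep\E)|(.))/gs) {
        if    (defined $1) { $fields[-1] .= $1 }   # escaped character
        elsif (defined $2) { push @fields, '' }    # separator starts a new field
        else               { $fields[-1] .= $3 }   # ordinary character
    }
    return @fields;
}

# The single-quoted string keeps the backslash before the comma as written.
my @fields = split_escaped('Fred, Smith, Smith\, Fred', ',');
print join('', map { "[$_]" } @fields), "\n";
# prints [Fred][ Smith][ Smith, Fred]
```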

I'll try looking for a canonical approach to parsing this in C.

-- Adam

ko · Dec 15, 2003

Adam said:
// I'm writing a hash (string keys, string values) to a text file (by STDOUT)
// for reading later, and I decided on the following format:
//
// key|value|
// another key|another value|
//
// to make the file clearly human-readable (the values and keys can contain
// spaces). I've also provided for escaping ``|'' and ``\'' in the data with
// ``\|'' and ``\\'' respectively.
..

// What's the best way to reverse the escapes?


[snip]

// Since this is a standard, traditional, Unix file format, isn't there a
// "canonical" way to analyse it?
//
// -- Adam

You can try this method:

#!/usr/bin/perl -w
use strict;
use Text::ParseWords;

while ( my $line = <DATA> ) {
    my ($key, $value) = quotewords('\|', 0, $line);
    print "'$key' => '$value'\n";
}

__DATA__
blah|blahblah|
bl ah|bla hblah|
bla\|h|blah\\\\blah|
\\blah|blahblah|

Text::ParseWords is a standard module, and the documentation is short
and straightforward.

HTH - keith
