Parsing CSV and " &nbsp;&nbsp;"

hotkitty

I'm trying to parse the following csv file in a linux environment:

"this is row1 column0","this is row1 column1","this is row1
column2","this is row1 column3","this
is row1 column4"
"this is row2 column0","this is row2 column1","this is row2
column2","this is row2 column3","this is
row2 column4"

Pretty standard CSV but with the last column running onto the next
line it gets screwed up somehow as my script doesn't recognize when a
new row starts. I tried substituting the carriage return but still no
luck. When I open up the file on my windows box w/ notepad I get the
following (notice the " &nbsp;&nbsp;" that is added to the end of the
line):
"this is row1 column0","this is row1 column1","this is row1
column2","this is row1 column3","this
is row1 column4"" &nbsp;&nbsp;"
"this is row2 column0","this is row2 column1","this is row2
column2","this is row2 column3","this is
row2 column4"" &nbsp;&nbsp;"

Maybe I'm just overlooking some simple solution but how do I deal w/
the " &nbsp;&nbsp;" as Linux doesn't recognize it?

Thanks in advance. My code is as follows:

my $csv = Text::CSV->new();  # I've also tried setting binary to 1, but it still didn't work
open(CSV, "<", $thisfile) or die $!;
while (<CSV>) {
    if ($csv->parse($_)) {
        my @csvcolumns = $csv->fields();
        my $newstuff = "$csvcolumns[1]";
        open(OUT, ">>$thatfile");
        print OUT "$newstuff\n";
        close(OUT);
    }
}
 
Ben Morrow

Quoth hotkitty said:
I'm trying to parse the following csv file in a linux environment:

"this is row1 column0","this is row1 column1","this is row1
column2","this is row1 column3","this
is row1 column4"
"this is row2 column0","this is row2 column1","this is row2
column2","this is row2 column3","this is
row2 column4"

Pretty standard CSV but with the last column running onto the next
line it gets screwed up somehow as my script doesn't recognize when a
new row starts.

How do you know when a new record starts? Can you guarantee that a line
beginning with " is always the start of a new record? If so, then
running the file through something like

perl -lne 'if (/^"/ && defined $line) { print $line; $line = "" } $line .= $_;
END { print $line if defined $line }'

may help. If any of your fields could end with a space (so the
terminating " might wrap onto the next line), or could end up not being
quoted, you might have a problem.

I tried substituting the carriage return but still no
luck. When I open up the file on my windows box w/ notepad I get the
following (notice the " &nbsp;&nbsp;" that is added to the end of the
line):
"this is row1 column0","this is row1 column1","this is row1
column2","this is row1 column3","this
is row1 column4"" &nbsp;&nbsp;"
"this is row2 column0","this is row2 column1","this is row2
column2","this is row2 column3","this is
row2 column4"" &nbsp;&nbsp;"

Maybe I'm just overlooking some simple solution but how do I deal w/
the " &nbsp;&nbsp;" as Linux doesn't recognize it?

Where did it come from? How did you transfer the file from Linux to
Windows: did you somehow use a web browser or a stupid mail client or
something else that has messed up the file? If " &nbsp;&nbsp;" is never
part of valid data then removing it is as simple as adding

s/" &nbsp;&nbsp;"//;

to the start of the above.
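
As a rough illustration, the two steps above (strip the junk marker, then glue lines together until the next line that starts with a double quote) could be combined in a small script. This is only a sketch: `join_records` is a hypothetical helper name, and joining the pieces with a single space is an assumption (the one-liner joins them with nothing in between).

```perl
use strict;
use warnings;

# Hypothetical helper: glue physical lines into logical records, assuming
# every record starts on a line whose first character is a double quote.
sub join_records {
    my @lines = @_;
    my @records;
    for my $line (@lines) {
        chomp $line;
        $line =~ s/" &nbsp;&nbsp;"//;      # drop the junk marker, if present
        if ($line =~ /^"/ || !@records) {
            push @records, $line;           # a new record starts here
        } else {
            $records[-1] .= " $line";       # continuation (space is an assumption)
        }
    }
    return @records;
}

# Sample input modelled on the data in the question.
my @records = join_records(
    qq{"this is row1 column0","this is row1 column1","this is row1\n},
    qq{column2","this is row1 column3","this\n},
    qq{is row1 column4"" &nbsp;&nbsp;"\n},
    qq{"this is row2 column0","this is row2 column1"\n},
);

print scalar(@records), " records\n";
print "$_\n" for @records;
```

The same caveat applies as for the one-liner: a continuation line that happens to begin with a double quote would be mistaken for a new record.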

Ben
 
Josef Moellers

hotkitty said:
I'm trying to parse the following csv file in a linux environment:

"this is row1 column0","this is row1 column1","this is row1
column2","this is row1 column3","this
is row1 column4"
"this is row2 column0","this is row2 column1","this is row2
column2","this is row2 column3","this is
row2 column4"

Pretty standard CSV but with the last column running onto the next
line it gets screwed up somehow as my script doesn't recognize when a
new row starts.

A pragmatic approach:
Collect lines until you have an even number of quotes:
(untested)
my $line = '';
while (defined(my $next = <$src>)) {   # stop at EOF instead of looping forever
    chomp $next;
    $line .= $next;
    last if ($line =~ tr/"//) % 2 == 0;
}
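
A self-contained sketch of the quote-counting idea, for anyone who wants to try it without a real file. It assumes a hypothetical helper `read_record`, an in-memory filehandle with made-up sample data, and a single space as the glue between physical lines; none of these come from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: read one logical record from $fh, gluing physical
# lines together until the accumulated text holds an even number of
# double quotes. Returns undef once the input is exhausted.
sub read_record {
    my ($fh) = @_;
    my $record;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        $record = defined $record ? "$record $line" : $line;
        # tr/"// with an empty replacement list only counts, it changes nothing
        my $quotes = ($record =~ tr/"//);
        last if $quotes % 2 == 0;
    }
    return $record;
}

# Demo: two records spread over five physical lines.
my $data = <<'END_CSV';
"r1 c0","r1 c1","r1
c2","r1 c3","r1
c4"
"r2 c0","r2
c1"
END_CSV

open(my $src, '<', \$data) or die "can't open in-memory file: $!";
my @records;
while (defined(my $record = read_record($src))) {
    push @records, $record;
}
close $src;

print "$_\n" for @records;
```

An escaped quote inside a field (`""`) adds two to the count, so the parity test still holds. A blank line has zero quotes and would come back as an empty record, so the caller may want to skip those.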
 
sln

This is one way. The criterion I used: buffer a row until a line ends with a
double quote (/"$/), and just process whatever is left in the buffer at EOF.
Otherwise, there is no delineation of rows.

sln

#############
# Csv1 Regex
#############

use strict;
use warnings;

my $fname = 'c:\temp\junk.csv';
# Parenthesized three-argument open, so "or die" applies to open itself
open(CSV, '<', $fname) or die "can't open $fname: $!";

my ($row, $tmp) = ('','');
my ($parsing, $count) = (1,1);

while ($parsing)
{
    if (!defined($_ = <CSV>))
    {
        $parsing = 0;
    } else {
        $tmp = $_;
        $tmp =~ s/\s+$//s;
        $row .= " $tmp" if (length($tmp));
        ## buffer until '"$' or eof
        next if ($tmp !~ /"$/);
    }
    print " (".$count++.") ----------\n";
    while ($row =~ /\s*"+\s*([^"]*?)\s*"+\s*|\s*([^,\n]+)\s*/g)
    {
        my $val = defined $1 ? $1 : $2;
        print "val = $val\n";
        # ... push @ary, $val;
    }
    $row = $tmp = '';
}

close CSV;

__END__

output:

(1) ----------
val = this is row1 column0
val = this is row1 column1
val = this is row1 column2
val = this is row1 column3
val = this is row1 column4
(2) ----------
val = this is row2 column0
val = this is row2 column1
val = this is row2 column2
val = this is row2 column3
val = this is row2 column4
(3) ----------
val = this is row3 column0
val = this is row3 column1


junk.csv
----------
"this is row1 column0","this is row1 column1","this is row1
column2","this is row1 column3","this
is row1 column4"
"this is row2 column0","this is row2 column1","this is row2
column2","this is row2 column3","this is
row2 column4"

"this is row3 column0","this is row3 column1",
 
sln

Btw, Excel will not do this correctly. Either you have to generate a
proper csv file (with EOR definition), or do this kind of a fix-up using your own criteria.
 
hotkitty

I apologize for the late reply but appreciate the quick responses you
have given me. Perhaps I am doing something wrong w/ the above
suggestions but I'll keep cracking away. Here is the actual .csv file
I am trying to parse:
http://www.nasdaq.com//asp/symbols.asp?exchange=Q&start=0

thx
 
sln

Actually, Josef Möllers posted a pragmatic method and it works (Occam's Razor),
i.e. counting the number of quotes.
And it makes sense because in reality the end of record is the EOL, but in this case
there are multiple double quotes scattered over multiple lines. The only intersection of
these two principles is EOL AND an even number of double quotes.
This file loaded up in Excel right away and parsed fine, although all the junk was left in there.

With a little extra effort I cleaned up the parsing and it works fine.

Good Luck...
sln


#############
# Csv2 Regex
#############

# http://www.nasdaq.com//asp/symbols.asp?exchange=Q&start=0

use strict;
use warnings;

my $fname = 'c:\temp\junkie.csv';
# Parenthesized three-argument open, so "or die" applies to open itself
open(CSV, '<', $fname) or die "can't open $fname: $!";

my ($row, $tmp) = ('','');
my ($parsing, $count) = (1,1);

while ($parsing)
{
    ## Buffer until a full row
    ## -------------------------
    if (!defined($_ = <CSV>)) {
        $parsing = 0;                         # eof, parse what's left
    } else {
        $tmp = $_;
        $tmp =~ s/\s+$//s;
        next if (!length($tmp));
        $row .= " $tmp";
        next if (($row =~ tr/"//) % 2 != 0);  # Odd number of double quotes? Keep buffering
    }                                         # Good to go, parse it ...

    print " (".$count++.") ----------\n";

    # parse the row
    # -------------------
    while ($row =~ /\s*"\s*([^"]*?)\s*"\s*,|\s*"\s*(.*?)\s*"\s*$/g)
    {
        my $val = $1;
        if (defined $2) {
            # do some cleanup
            # ----------------
            $val = $2;
            $val =~ s/""/"/g;
            $val =~ s/\.\.\. More\.\.\.//ig;
            $val =~ s/&nbsp;/ /ig;
        }
        print "val = $val\n";
    }
    $row = '';
}

close CSV;

__END__

Partial output:

(1) ----------
val = NASDAQ Securities as of 12/31/2008
val =
val =
val =
(2) ----------
val = Name
val = Symbol
val = Security Type
val = Shares Outstanding
val = Market Value (millions)
val = Description (as filed with the SEC)
(3) ----------
val = 012 Smile.Communications Ltd.
val = SMLC
val = Ordinary Shares
val = 25,360,000
val = $136.4
val = Prior to our October 2007 public offering, we were a wholly-owned subsidiary of Internet Gold, a public company traded on the NASDAQ Global Market and the Tel Aviv Stock Exchange, whose shares
are included in the TASE-100 Index. Internet Gold currently owns approximately 72.4% of our ordinary shares. In November 2004, Internet Gold became our sole shareholder after purchasing our ordinary
shares from our prior shareholders. As part of its internal restructuring in 2006, Internet Gold transferred its communications and media operations into two operating subsidiaries. Internet Gold
transferred to us its broadband and traditional voice services businesses, which we refer to in this annual report as the Communications Business.
"http://secfilings.nasdaq.com/edgar_...178913-08-001700.html#FIS_COMPANY_INFORMATION"
(4) ----------
val = 1-800 FLOWERS.COM, Inc.
val = FLWS
val = Common Stock
val = 26,528,000
val = $116.5
val = For more than 30 years, 1-800-FLOWERS.COM, Inc. - "Your Florist of Choice(R)" - has been providing customers around the world with the freshest flowers and finest selection of plants,
gift baskets, gourmet foods and confections, and plush stuffed animals perfect for every occasion. 1-800-FLOWERS.COM(R) offers the best of both worlds: exquisite, florist-designed arrangements
individually created by some of the nation's top floral artists and hand-delivered the same day, and spectacular flowers delivered through its "Fresh From Our Growers(TM)" program. Customers can
"call, click, or come in" to shop 1-800-FLOWERS.COM(R) 24 hours a day, 7 days a week at 1-800-356-9377 or www.1800flowers.com. Sales and Service Specialists are available 24/7, and fast and
reliable delivery is offered same day, any day. As always, 100 percent satisfaction and freshness is guaranteed. The 1-800-FLOWERS.
"http://secfilings.nasdaq.com/edgar_conv_html/2007/09/13/0001084869-07-000018.html#FIS_BUSINESS"
(5) ----------
val = 1st Constitution Bancorp (NJ)
val = FCCY
val = Common Stock
val = 3,998,000
val = $35.0
val = 1st Constitution Bancorp (the “Company”) is a bank holding company registered under the Bank Holding Company Act of 1956, as amended. The Company was organized under the laws of the State of New
Jersey in February 1999 for the purpose of acquiring all of the issued and outstanding stock of 1st Constitution Bank (the “Bank”) and thereby enabling the Bank to operate within a bank holding
company structure. The Company became an active bank holding company on July 1, 1999. The Bank is a wholly-owned subsidiary of the Company. Other than its investment in the Bank, the Company currently
conducts no other significant business activities. The main office of the Company and the Bank is located at 2650 Route 130 North, Cranbury, New Jersey 08512, and the telephone number is (609)
655-4500. 1st Constitution Bank The Bank, a commercial bank formed under the laws of the State of New Jersey, engages in the business of commercial and retail banking.
"http://secfilings.nasdaq.com/edgar_conv_html/2008/04/15/0001214659-08-000838.html#FIS_BUSINESS"
(6) ----------
val = 1st Pacific Bancorp (CA)
val = FPBN
val = Common Stock
val = 4,970,000
val = $25.3
val = 1st Pacific Bancorp (the "Company", "we", "our", or "us") is a California corporation incorporated on August 4, 2006 and is registered with the Board of Governors of the Federal Reserve System
as a bank holding company under the Bank Holding Company Act of 1956, as amended. 1st Pacific Bank of California (the "Bank") is a wholly-owned bank subsidiary of the Company and was incorporated in
California on April 17, 2000. The Bank is a California corporation licensed to operate as a commercial bank under the California Banking Law by the California Department of Financial Institutions (the
"DFI"). In accordance with the Federal Deposit Insurance Act, the Federal Deposit Insurance Corporation (the "FDIC") insures the deposits of the Bank. The Bank is a member of the Federal Reserve
System. "http://secfilings.nasdaq.com/edgar_conv_html/2008/03/31/0001047469-08-003795.html#FIS_BUSINESS"
(7) ----------
val = 1st Source Corporation
val = SRCE
val = Common Stock
val = 24,110,000
val = $530.2
val = 1st Source Corporation, an Indiana corporation incorporated in 1971, is a bank holding company headquartered in South Bend, Indiana that provides, through our subsidiaries (collectively referred
to as "1st Source"), a broad array of financial products and services. 1st Source Bank and First National Bank, Valparaiso (collectively referred to as the "Banks"), our banking subsidiaries, offer
commercial and consumer banking services, trust and investment management services, and insurance to individual and business clients through most of our 83 banking center locations in 17 counties in
Indiana and Michigan. 1st Source Bank's Specialty Finance Group, with 24 locations nationwide, offers specialized financing services for new and used private and cargo aircraft, automobiles and light
trucks for leasing and rental agencies, medium and heavy duty trucks, construction equipment, and environmental equipment.
"http://secfilings.nasdaq.com/edgar_conv_html/2008/02/22/0000034782-08-000022.html#FIS_BUSINESS"
(8) ----------
val = 21st Century Holding Company
val = TCHC
val = Common Stock
val = 8,014,000
val = $33.7
val = 21st Century Holding Company (“21st Century,” “Company,” “we,” “us”) is an insurance holding company, which, through our subsidiaries and our contractual relationships with our independent
agents and general agents, controls substantially all aspects of the insurance underwriting, distribution and claims process. We are authorized to underwrite homeowners’ property and casualty
insurance, commercial general liability insurance, personal automobile insurance and commercial automobile insurance in various states with various lines of authority through our wholly owned
subsidiaries, Federated National Insurance Company (“Federated National”) and American Vehicle Insurance Company (“American Vehicle”). The insurable events during 2007 and 2006 did not include any
weather related catastrophic events such as the well publicized series of hurricanes that occurred in Florida during 2005 and 2004.
"http://secfilings.nasdaq.com/edgar_conv_html/2008/03/17/0001144204-08-015873.html#FIS_BUSINESS"
(9) ----------
val = 3Com Corporation
val = COMS
val = Common Stock
val = 405,283,000
val = $911.9
val = We provide secure, converged networking solutions on a global scale to organizations of all sizes. Our products and solutions enable customers to manage business-critical voice, video and data
in a secure, scalable, reliable and efficient network environment. We deliver networking products and services for enterprises that view their networks as mission critical, and value cost-effective
superior performance. Our products form integrated solutions and function in multi-vendor environments based upon open, not proprietary, platforms. Our products are sold on a worldwide basis through a
combination of value added partners and direct sales representatives. We deliver products and solutions that support the increasingly complex and demanding application environments in today’s
businesses. We aspire to be one of the leading enterprise networking companies by delivering innovative, secure, feature-rich products and solutions built on open platform technology.
"http://secfilings.nasdaq.com/edgar_conv_html/2007/07/31/0000950135-07-004539.html#FIS_BUSINESS"
(10) ----------
val = 3D Systems Corporation
val = TDSC
val = Common Stock
val = 22,365,000
val = $188.5
val = 3D Systems Corporation (“3D Systems” or the “Company”) is a holding company that operates through subsidiaries in the United States, Europe and the Asia-Pacific region. We design, develop,
manufacture, market and service a suite of additive manufacturing solutions including 3-D modeling, rapid prototyping and manufacturing systems and related products and materials that enable complex
three-dimensional objects to be produced directly from computer data. Our customers use our proprietary systems to produce physical objects from digital data using commonly available computer-aided
design software, often referred to as CAD software, or other digital-media devices such as engineering scanners and MRI or CT medical scanners.
"http://secfilings.nasdaq.com/edgar_conv_html/2008/03/17/0000950144-08-002028.html#FIS_BUSINESS"
(11) ----------
val = 3SBio Inc.
val = SSRX
val = American Depositary Shares
val = 21,797,000
val = $126.4
val = We commenced business operations in 1993 through Shenyang Sunshine Pharmaceutical Co., Ltd., or Shenyang Sunshine, a limited liability company established in China. Prior to our initial public
offering in February 2007, we established a holding company structure through the following series of corporate reorganization transactions: • we formed Collected Mind Limited, a British Virgin
Islands company, in July 2006; • Collected Mind Limited acquired 100% of the equity interests of Shenyang Sunshine, which was reorganized as a wholly foreign owned enterprise,
or WFOE, in July 2006; and • we incorporated 3SBio Inc., an exempted company in the Cayman Islands, which acquired 100% equity interest in Collected Mind in September 2006.
"http://secfilings.nasdaq.com/edgar_...193125-07-146810.html#FIS_COMPANY_INFORMATION"
(12) ----------
val = 51job, Inc.
val = JOBS
val = American Depositary Shares
val = 28,260,000
val = $241.3
val = We commenced our business in 1998. Since our inception, we have conducted substantially all of our operations in China. In March 2000, our founders incorporated a new holding company, now called
51job, Inc., as an exempted limited liability company in the Cayman Islands under the Cayman Islands Companies Law (2004 Revision). Subsequently, 51job, Inc. acquired 51net.com Inc., or 51net, a
British Virgin Islands company, and other subsidiaries to become the holding company of our corporate group. We operate as a foreign investment enterprise in China through our wholly owned
subsidiaries, 51net, which is the registered owner of some of our trademarks and our domain name, 51net Beijing and 51net HR, which are both Cayman Islands companies, as well as our affiliated Chinese
entities, the primary ones being: • Shanghai Qianjin Advertising Co., Ltd. "http://secfilings.nasdaq.com/edgar_...145549-07-001142.html#FIS_COMPANY_INFORMATION"
(13) ----------
val = 8x8 Inc
val = EGHT
val = Common Stock
val = 62,175,000
val = $42.3
val = Statements contained in this annual report on Form 10-K, or Annual Report, regarding our expectations, beliefs, estimates, intentions or strategies are forward-looking statements within the
meaning of Section 27A of the Securities Act and Section 21E of the Exchange Act. Any statements contained herein that are not statements of historical fact may be deemed to be forward-looking
statements. For example, words such as "may," "will," "should," "estimates," "predicts," "potential," "continue," "strategy," "believes," "anticipates," "plans," "expects," "intends," and similar
expressions are intended to identify forward-looking statements. You should not place undue reliance on these forward-looking statements. Actual results and trends may differ materially from
historical results or those projected in any such forward-looking statements depending on a variety of factors.
"http://secfilings.nasdaq.com/edgar_conv_html/2007/06/29/0001023731-07-000014.html#FIS_BUSINESS"
(14) ----------
val = A-Power Energy Generation Systems, Ltd.
val = APWR
val = Common Stock
val = 32,707,000
val = $136.4
val = A-Power A-Power Energy Generated Systems, Ltd. (formerly known as China Energy Technology Limited) was incorporated under the laws of the British Virgin Islands on May 14, 2007. Until January
18, 2008, A-Power was a wholly-owned subsidiary of Chardan South China Acquisition Corporation. Chardan Chardan South China Acquisition Corporation was a blank check corporation organized under the
laws of the State of Delaware on March 10, 2005. Chardan was originally incorporated as “Chardan China Acquisition Corp. III,” but changed its name to Chardan South China Acquisition Corporation on
July 14, 2005. Chardan was formed to effect a business combination with an unidentified operating business that had its primary operating facilities located in the PRC in any city or province south of
the Yangtze River. "http://secfilings.nasdaq.com/edgar_...144204-08-039652.html#FIS_COMPANY_INFORMATION"
(15) ----------
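
The cleanup applied to the description field in the script above can also be tried in isolation. The following is only a sketch: `clean_description` is a hypothetical helper name and the sample string is made up, not taken from the NASDAQ file.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the three substitutions used for the
# last (description) field.
sub clean_description {
    my ($val) = @_;
    $val =~ s/""/"/g;                  # CSV-style doubled quotes back to single quotes
    $val =~ s/\.\.\. More\.\.\.//ig;   # drop the "... More..." link text
    $val =~ s/&nbsp;/ /ig;             # HTML entity to a plain space
    return $val;
}

# Made-up sample value for illustration only.
my $raw = q{The ""Company"" sells&nbsp;widgets. ... More...};
print "[", clean_description($raw), "]\n";
```

The order matters slightly: un-doubling the quotes first keeps the later substitutions from having to deal with `""` at all. Note that removing "... More..." can leave a trailing space behind.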
 
sln

On Sat, 11 Oct 2008 21:47:27 GMT, (e-mail address removed) wrote:

[snip]

Small changes...

- For performance, the transliteration now counts the quotes in $tmp (the
  line just read) and keeps a running total, instead of recounting all of $row.
- Added the span modifier (/s) on the regex loop, hence the option below to
  keep newlines and leave the original formatting intact, i.e. bullet point
  locations etc.

Just (un)comment the block that is needed. Try it both ways.
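
To see why the /s modifier matters once newlines are kept, here is a minimal sketch. It reuses the field regex from this thread, but `split_row` is a hypothetical wrapper and the sample row is made up. Without /s, `.` does not match a newline, so a last field that spans two lines is never captured.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the field regex: returns the fields of one
# buffered row, with or without the /s ("span lines") modifier.
sub split_row {
    my ($row, $span_lines) = @_;
    my $re = $span_lines
        ? qr/\s*"\s*([^"]*?)\s*"\s*,|\s*"\s*(.*?)\s*"\s*$/s
        : qr/\s*"\s*([^"]*?)\s*"\s*,|\s*"\s*(.*?)\s*"\s*$/;
    my @fields;
    while ($row =~ /$re/g) {
        # $1 is set for fields followed by a comma, $2 for the last field
        push @fields, defined $2 ? $2 : $1;
    }
    return @fields;
}

# Made-up row whose last field contains an embedded newline.
my $row = qq{"AAA","BBB","line one\nline two"};

my @without = split_row($row, 0);
my @with    = split_row($row, 1);

print "without /s: ", scalar(@without), " fields\n";
print "with /s:    ", scalar(@with), " fields\n";
```

With the "trim newlines" block active the row never contains a newline, so both variants behave the same; the difference only shows up in the "keep newlines" mode.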


#############
# Csv3 Regex
#############

# http://www.nasdaq.com//asp/symbols.asp?exchange=Q&start=0

use strict;
use warnings;

my $fname = 'c:\temp\symbols.csv';
# Parenthesized three-argument open, so "or die" applies to open itself
open(CSV, '<', $fname) or die "can't open $fname: $!";

my ($row, $tmp) = ('','');
my ($parsing, $records, $quotes) = (1,1,0);

while ($parsing)
{
    ## Buffer until a full row
    ## -------------------------
    if (!defined($_ = <CSV>)) {
        $parsing = 0;                 # eof, parse what's left
    } else {
        $tmp = $_;

        ## this block will trim newlines ---
        $tmp =~ s/\s+$//s;
        next if (!length($tmp));
        $row .= " $tmp";
        ## ---

        ## this block will keep newlines ---
        # $row .= $tmp;
        ## ---

        $quotes += ($tmp =~ tr/"//);  # running count of double quotes in this row
        next if ($quotes % 2 != 0);   # Odd number of double quotes? Keep buffering
    }                                 # Good to go, parse it ...

    print " (".$records++.") ----------\n";

    ## Parse the row
    ## -------------------
    while ($row =~ /\s*"\s*([^"]*?)\s*"\s*,|\s*"\s*(.*?)\s*"\s*$/gs) # span lines
    {
        my $val = $1;
        if (defined $2) {
            # cleanup the description field
            # ------------------------------
            $val = $2;
            $val =~ s/""/"/g;
            $val =~ s/\.\.\. More\.\.\.//ig;
            $val =~ s/&nbsp;/ /ig;
        }
        print "val = $val\n";
    }
    $row = '';
    $quotes = 0;
}
close CSV;

__END__
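
To try the buffering idea on its own, here is a minimal sketch of just the "collect lines until the quote count is even" step from the script above. The helper name join_records and the sample lines are made up for illustration. It assumes, as the script does, that every field is double-quoted, so an embedded "" adds two quotes and does not disturb the parity check.

```perl
use strict;
use warnings;

# Hypothetical helper: group physical lines into logical CSV records.
# A record is treated as complete once it holds an even number of
# double quotes (an escaped "" pair adds two, so parity is unaffected).
sub join_records {
    my @lines = @_;
    my @records;
    my ($row, $quotes) = ('', 0);

    for my $line (@lines) {
        (my $tmp = $line) =~ s/\s+$//;      # trim trailing whitespace/newline
        next unless length $tmp;            # skip blank lines
        $row .= length($row) ? " $tmp" : $tmp;
        $quotes += ($tmp =~ tr/"//);        # tr counts without modifying $tmp
        next if $quotes % 2;                # odd count: still inside a quoted field
        push @records, $row;
        ($row, $quotes) = ('', 0);
    }
    push @records, $row if length $row;     # keep an unterminated trailing record
    return @records;
}

# Sample input: the first record is split across two physical lines.
my @records = join_records(
    qq{"row1 col0","row1\n},
    qq{col1","row1 col2"\n},
    qq{"row2 col0","row2 col1","row2 col2"\n},
);
print scalar(@records), " records\n";
print "$_\n" for @records;
```

Like the "trim newlines" block in the script, this joins continuation lines with a single space, so the original line breaks inside a field are lost.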
 
H

hotkitty

This works great! Now, I realize that my next question should be
categorized in the beginner's group but for whatever reason I will
post here:
How would I just print out every 4th occurrence of $val (i.e. the
Market Value column)?
 
S

sln

On Sat, 11 Oct 2008 21:47:27 GMT, (e-mail address removed) wrote:

[snip]

Small changes:

- For performance, the transliteration (tr///) now counts the quotes in the $tmp string (the line just read).
- Added the /s ("span lines") modifier to the regex loop, so that . also matches newlines.
  Hence the option below to keep newlines and leave the original formatting intact
  (i.e. bullet point locations, etc.).
  Just (un)comment the block that is needed. Try it both ways.

#############
# Csv3 Regex
#############

#http://www.nasdaq.com//asp/symbols.asp?exchange=Q&start=0

use strict;
use warnings;

my $fname = 'c:\temp\symbols.csv';
open CSV, '<', $fname or die "can't open $fname: $!";

my ($row, $tmp) = ('','');
my ($parsing, $records, $quotes) = (1,1,0); my $MarketValueTotal = 0;

while ($parsing)
{
        ## Buffer until a full row
        ## -------------------------
        if (!($_ = <CSV>)) {
                $parsing = 0; # eof, parse what's left
        } else {
                $tmp = $_;

                ## this block will trim newlines ---
                  $tmp =~ s/\s+$//s;
                  next if (!length($tmp));
                  $row .= " $tmp";
                ## ---

                ## this block will keep newlines ---
                  # $row .= $tmp;
                ## ---

                $quotes += $tmp =~ tr/"//;
                next if (!($quotes % 2 == 0));  # Even number of double quotes?
        }                                       # Good to go, parse it ...

        print " (".$records++.") ----------\n";

        ## Parse the row
        ## -------------------

        my $field = 0;
        while ($row =~ /\s*"\s*([^"]*?)\s*"\s*,|\s*"\s*(.*?)\s*"\s*$/gs)   # span lines
        {
                my $val = $1;
                if (defined $2) {
                        # cleanup the description field
                        # ------------------------------
                        $val = $2;
                        $val =~ s/""/"/g;
                        $val =~ s/\.\.\. More\.\.\.//ig;
                        $val =~ s/&nbsp;/ /ig;
                }
                # Market Value is the fifth field (index 4)
                if ($field++ == 4)
                {
                        if ($val =~ /^[\$,\.\d]+$/)
                        {
                                $val =~ s/[\$,]//g;
                                $MarketValueTotal += $val;
                                print "val = $val\n";
                        }
                        else { print STDERR "'$val' is not numeric, record = ".($records-1)."\n";}
                }
        }
        $row = '';
        $quotes = 0;
}
close CSV;

print STDERR "Market Value Total = $MarketValueTotal (in millions)\n";

output:

'Market Value (millions)' is not numeric, record = 2
'N/A' is not numeric, record = 3122
Market Value Total = 2650685.5 (in millions)


Not bad, NASDAQ ~ 2.6 trillion dollars.

sln
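
For comparison, the same Market Value total can probably be had from Text::CSV itself, as long as the module reads the records rather than being handed one physical line at a time. The sketch below assumes Text::CSV is installed from CPAN and uses made-up sample data in the same column order as the output above (Market Value as the fifth field, index 4). With binary => 1, getline() returns one logical record per call even when a quoted field contains a newline.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, assumed to be installed

# Hypothetical sample data; the second record has a newline inside a quoted field.
my $data = <<'END';
"Name","Symbol","Type","Shares","Market Value (millions)","Description"
"Acme Widgets","ACME","Common Stock","1,000","$100.5","first line
second line"
"Nil Corp","NIL","Common Stock","500","N/A","no market value"
"Beta Inc","BETA","Common Stock","2,000","$1,200.5","plain"
END

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

# Read from an in-memory filehandle so the sketch needs no file on disk.
open my $fh, '<', \$data or die "can't open in-memory data: $!";

my ($total, $records) = (0, 0);
while (my $fields = $csv->getline($fh)) {    # one logical record per call
    $records++;
    my $val = $fields->[4];                  # fifth column: Market Value
    next unless defined $val;
    if ($val =~ /^[\$,\.\d]+$/) {            # same numeric check as in the script above
        (my $num = $val) =~ s/[\$,]//g;      # strip dollar sign and thousands separators
        $total += $num;
    }
    else {
        print STDERR "'$val' is not numeric, record = $records\n";
    }
}
$csv->eof or die "CSV parse error: " . $csv->error_diag();
close $fh;

print "Market Value Total = $total (in millions)\n";
```

For a real file, open it by name instead of using the in-memory handle; the header row and any N/A values are reported on STDERR and skipped, as in the regex version.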
 
