Dummy regex question

JayEs

I have been looking at this for a while, looked up regex tutorials, perldoc
for substr, etc... and just can't come up with the answer that is probably
staring me right in the face. Maybe the matter is obfuscated by the fact
that I am really trying to do 2 things.

Here is my stupid little problem:

Given a string that can be:
"$ 12.00"
"USD 187.54"
"CAD 1.20"
"??? 12.65"

(the latter due to special characters not available in ASCII)

How do I split these strings into 2, separating the currency code and the
value?

The only thing I know for sure is that there always is a space in the string
followed directly by a numerical (but not integer) value. Maybe I've just
been looking at it too long. Can't figure it out. Prompted me to order
"Learning Perl" from the online retailer whose name starts with A and ends
in mazon, but that is not very helpful right this second.

Any help would be appreciated.

JS
 
Tad McClellan

JayEs said:
How do I split these strings into 2, separating the currency code and the
value?


You have a SAQ.


Q: how do I split a string

A: split()
 
JayEs

The function you want is called split.

I tried:

@array = split(/ /,$test);

But when I test with:

print $array[0];

...it contains the entire $test and $array[1] is undef. I checked the part
of HTML that I am parsing and it looks like "$&nbsp;12.25" in the HTML
fragment, but when I do:

print $test;

....it shows me: "$ 12.25"

Is there something special I need to do to handle the &nbsp; ?

HELP!!
 
A. Sinan Unur

The function you want is called split.

I tried:

@array = split(/ /,$test);

But when I test with:

print $array[0];

...it contains the entire $test and $array[1] is undef. I checked the
part of html that I am parsing

Well, you never mentioned that you were parsing HTML.
and it looks like "$&nbsp;12.25" in the
HTML fragment, but when I do:

print $test;

...it shows me: "$ 12.25"

Where?

D:\Home>perl -e "print q{$&nbsp;12.25}"
$&nbsp;12.25
Is there something special I need to do to handle the &nbsp; ?

Yes. You need to convert it to a space.

You can help others help you by correctly and fully describing the
parameters of your questions. It looks like you would also benefit from
getting an introductory Perl book and actually studying it.

Here is one way to do it. You might also find the HTML::Entities module
useful if there is a chance other HTML entities may appear in the input.

D:\Home>cat t.pl
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    next unless $line;
    $line =~ s/&nbsp;/ /g;
    my ($currency, $amount) = split /\s+/, $line;
    print "Currency: $currency\tAmount: $amount\n";
}

__DATA__
$&nbsp;12.00
USD 187.54
CAD&nbsp;&nbsp;1.20
??? 12.65

D:\Home>perl t.pl
Currency: $ Amount: 12.00
Currency: USD Amount: 187.54
Currency: CAD Amount: 1.20
Currency: ??? Amount: 12.65


Sinan.
 
Jürgen Exner

JayEs said:
The function you want is called split.

I tried:

@array = split(/ /,$test);

But when I test with:

print $array[0];

...it contains the entire $test and $array[1] is undef. I checked the
part of html that I am parsing and it looks like "$&nbsp;12.25" in

Well, if you want to split at the text '&nbsp;' then you should tell split()
to split at the text '&nbsp;' and not at a non-existing space character.
the HTML fragment, but when I do:

print $test;

...it shows me: "$ 12.25"

I can not reproduce this behaviour:
C:\tmp>type t.pl
use strict; use warnings;
my $foo ='$&nbsp;12.25';
print $foo;

C:\tmp>t.pl
$&nbsp;12.25

As you can see, my mini test program prints '$&nbsp;12.25'.
Could you please show us a minimal sample script that exhibits the behaviour
you described above?
Is there something special I need to do to handle the &nbsp; ?

No, you just have to tell split the actual text at which you want to
split.

jue
 
Scott Bryce

JayEs said:
The function you want is called split.


I tried:

@array = split(/ /,$test);

But when I test with:

print $array[0];

...it contains the entire $test and $array[1] is undef. I checked the part
of html that I am parsing and it looks like "$&nbsp;12.25" in the HTML
fragment, but when I do:

What is in the HTML fragment is irrelevant. What is important is what is
in $test. Your original problem definition said that you wanted to split
on a space.

print $test;

...it shows me: "$ 12.25"

It looks like a space to me.

Is there something special I need to do to handle the &nbsp; ?

No, because $test does not contain the &nbsp;

The posting guidelines for this group suggest that you post a short but
complete program that demonstrates your problem. If you were to do that,
it would help us help you.



use strict;
use warnings;

my $test = '$ 12.25';

my @array = split / /, $test;

print "$array[0]\n$array[1]\n";



OUTPUT:

$
12.25
 
JayEs

You can help others help you by correctly and fully describing the
parameters of your questions. It looks like you would also benefit from
getting an introductory Perl book and actually studying it.


Ok, I put a script together that displays the problem. It's rather lengthy,
but see what you (or anyone else) can make of this:

use warnings;
use WWW::Mechanize; # Use Andy Lester's Mech for navigation
use HTML::TokeParser; # Use TokeParser to parse the HTML

my $url =
'http://www.hotels.com/property.do?m...ds=3&thisPageNumber=1&qKey=&roomTypeCode=-1';

my $agent = WWW::Mechanize->new();
$agent->get($url);

my $htmlpage = $agent->{content};
$out="c:/OpenPerl/HTML-TEST.html";
open OUT, ">$out" or die "Cannot open $out for write :$!";
print OUT "$htmlpage \n";
close OUT;

#my $htmlpage = $agent->{content};
my $stream = new HTML::TokeParser($out);
while (my $tag = $stream->get_tag("table")) {
    if ($tag->[1]{class} and $tag->[1]{class} eq "matrix-bg") {
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        print 'DESCRIPTION: '.$stream->get_trimmed_text("/span")."\n"; # RATE DESC.
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        $stream->get_tag;
        #print $stream->get_trimmed_text("/span");
        $rates = $stream->get_trimmed_text("/span");
        print "ALL-$rates \n";
        print "--- GOING TO SPLIT NOW --- \n";
        my ($currency, $rate) = split /\s/, $rates;
        print "--- DONE WITH SPLIT ---\n";
        print "currency-$currency \n";
        print "RATE-$rate \n";
        $stream->get_tag;
        if ($stream->get_trimmed_text("/td") =~ /\(max\)\*/i) { # RATE CHNG DURING LOS
            print 'max-change'; #MAX RATE WAS DISPLAYED
        }
        print "\n";
    }
}
 
Scott Bryce

JayEs said:
Ok, I put a script together that displays the problem. It's rather lengthy,
but see what you (or anyone else) can make of this:

The problem is that the character separating the currency and the amount
is \xA0, which Perl does not recognize as white space.

If I change this line:
my ($currency, $rate) = split /\s/, $rates;

to:

my ($currency, $rate) = split /[\xA0\s]/, $rates;

or:

my ($currency, $rate) = split /\xA0/, $rates;

I get the expected results.

I don't know if this solution is cross-platform. I am on Win98SE.
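
A minimal sketch of the character-class fix above, assuming the separator is either ordinary whitespace or the character \xA0, as it was in this case. The helper name split_price and the sample strings are made up for illustration. Listing \xA0 explicitly means the result does not depend on whether \s happens to match it, which in Perl varies with how the string is stored internally and with the regex modifiers in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: split "CURRENCY<sep>AMOUNT" into two fields.
# The separator may be ordinary whitespace or \xA0 (non-breaking space).
sub split_price {
    my ($text) = @_;
    # A limit of 2 keeps everything after the first separator run in the amount.
    return split /[\xA0\s]+/, $text, 2;
}

for my $sample ("\$\xA012.25", "USD 187.54", "CAD\xA0\xA01.20") {
    my ($currency, $amount) = split_price($sample);
    print "currency=$currency amount=$amount\n";
}
```

Using + after the character class also copes with runs of several separators, which a bare /\xA0/ would turn into empty fields.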
 
Paul Lalli

I am (relatively) new to Perl but having worked my way through the Llama
Book, I think I can say that I'm at least somewhat Perl literate. So,
here's my suggestion...
I don't get why everyone tries to use split() here. I'd take a simple
regexp looking like

m/(.+)\s+(\d+\.\d+)/

I fail to see how that's more 'simple' than split /\s+/...
my ($currency, $amount) = ($1, $2);

Now, if this is wrong as hell, don't flame me - I'm a newby after all
;-).

This is, of course, perfectly valid (it doesn't actually help the OP,
because the OP didn't give accurate information about his problem). And
there is, of course, more than one way to do it. In general, however,
you should use split when you know what you want to throw away, and use
regexps when you know what you want to keep.

Paul Lalli
 
JayEs

Scott Bryce said:
The problem is that the character separating the currency and the amount
is \xA0, which Perl does not recognize as white space.

how did you find that?? I guess I should have taken the script out of
OpenPerl (IDE) and exec from the command line. Although that gives me
ascii-whatever, which I still would not have interpreted as \xA0. Would
HTML::Entities have fixed this or is this just an artifact of being on a Win32
box (XP here)?
 
Scott Bryce

JayEs said:
how did you find that??

I looked at the output in a hex editor. I suspected something like this.

Would HTML::Entities have fixed this

I doubt it.
or is this just an artifact of being on a Win32 box (XP here)?

I don't know. It could be a "feature" of WWW::Mechanize. I suspect that
\xA0 is an ANSI non-breaking space. If so, it could be Win32 specific.
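
For the hex-editor step mentioned above, the same check can be done from within Perl. This is only a sketch of one alternative, not what was actually used in this thread; the helper name hex_dump is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as two hex digits.
sub hex_dump {
    my ($text) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $text;
}

# An ordinary space shows up as 20, a non-breaking space as A0.
print hex_dump('$ 12.25'), "\n";          # 24 20 31 32 2E 32 35
print hex_dump("\$\xA012.25"), "\n";      # 24 A0 31 32 2E 32 35
```

Dumping the suspect variable this way right before the split makes an invisible separator obvious without leaving the script.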
 
Tad McClellan

Sven-Thorsten Fahrbach said:
JayEs wrote:



I don't get why everyone tries to use split() here.


Because it is the right tool for the job.

You use split() when you know what you want to discard.

You use m// in a list context when you know what you want to keep.

I'd take a simple
regexp looking like

m/(.+)\s+(\d+\.\d+)/
my ($currency, $amount) = ($1, $2);


You should never use the dollar-digit variables unless you have
first ensured that the pattern match *succeeded*.

if ( m/(.+)\s+(\d+\.\d+)/ )
{ ($currency, $amount) = ($1, $2) }

or

my ($currency, $amount) = m/(.+)\s+(\d+\.\d+)/; # m// in list context
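
A small sketch of the list-context form with an explicit success check. It assumes the amount always has a decimal point, as the original question states. The helper name parse_price is hypothetical, and the separator class also accepts the \xA0 character that Scott Bryce identified elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return (currency, amount), or an empty list on no match.
sub parse_price {
    my ($text) = @_;
    # m// in list context returns the captures, or an empty list on failure.
    return $text =~ m/^(.+?)[\xA0\s]+(\d+\.\d+)$/;
}

for my $sample ('USD 187.54', 'no price here') {
    # A list assignment in boolean context is true only if the match succeeded.
    if (my ($currency, $amount) = parse_price($sample)) {
        print "Currency: $currency\tAmount: $amount\n";
    }
    else {
        print "No match for '$sample'\n";
    }
}
```

Because the captures are assigned directly, $1 and $2 are never read after a failed match.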
 
A. Sinan Unur

I looked at the output in a hex editor. I suspected something like
this.


I doubt it.


I don't know. It could be a "feature" of WWW::Mechanize. I suspect
that \xA0 is an ANSI non-breaking space. If so, it could be Win32
specific.

Oh, come on, take a look at the source code of HTML::Entities:

nbsp => "\240", # non breaking space

Sinan
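
To connect the "\240" above with the \xA0 discussed earlier: they are the same character written as an octal and a hex escape. The sketch below checks that and then normalizes both the literal entity text and the decoded character to a plain space before splitting. The helper name normalize_nbsp is made up for this example.

```perl
use strict;
use warnings;

# "\240" is an octal escape, "\xA0" a hex escape; both denote character 160.
print "same character\n" if "\240" eq "\xA0" && ord("\240") == 160;

# Hypothetical helper: turn the literal entity text and the decoded
# character into a plain space.
sub normalize_nbsp {
    my ($text) = @_;
    $text =~ s/&nbsp;|\xA0/ /g;
    return $text;
}

# split ' ' splits on runs of whitespace and ignores leading whitespace.
my ($currency, $amount) = split ' ', normalize_nbsp("CAD&nbsp;\xA01.20");
print "Currency: $currency\tAmount: $amount\n";
```

Normalizing first keeps the later split simple, at the cost of discarding the distinction between breaking and non-breaking spaces.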
 
Sven-Thorsten Fahrbach

JayEs wrote:

Given a string that can be:
"$ 12.00"
"USD 187.54"
"CAD 1.20"
"??? 12.65"

(the latter due to special characters not available in ASCII)

How do I split these strings into 2, separating the currency code and the
value?

First, I'm new to this group, so, hi everybody.
I am (relatively) new to Perl but having worked my way through the Llama
Book, I think I can say that I'm at least somewhat Perl literate. So,
here's my suggestion...
I don't get why everyone tries to use split() here. I'd take a simple
regexp looking like

m/(.+)\s+(\d+\.\d+)/
my ($currency, $amount) = ($1, $2);

Now, if this is wrong as hell, don't flame me - I'm a newby after all
;-).
Hope this will come in handy...
 
Sven-Thorsten Fahrbach

Paul said:
I fail to see how that's more 'simple' than split /\s+/...

Okay, it's simpler *for me* since I can't recall ever having used
split. I always got away using some regexp or another in the past...
This is, of course, perfectly valid (it doesn't actually help the OP,
because the OP didn't give accurate information about his problem). And
there is, of course, more than one way to do it. In general, however,
you should use split when you know what you want to throw away, and use
regexps when you know what you want to keep.

Okay, I think this is worth reading more about. Thanks for the comment
anyway.
 
JayEs

This is, of course, perfectly valid (it doesn't actually help the OP,
because the OP didn't give accurate information about his problem). And
there is, of course, more than one way to do it. In general, however,
you should use split when you know what you want to throw away, and use
regexps when you know what you want to keep.


How much more accurate do I need to be?

* I asked a question with a clear description of the problem: How to split
a string that contains a space.
* Proposed solution didn't do the job; after some discussion Scott figured
out what the problem was once I gave the NG a code sample reproducing the
problem.
* Not due to giving inaccurate information, but due to the problem being
obfuscated by an idiosyncrasy of the OS (or so it seems).

I don't think that saying I didn't give accurate information is entirely
fair...
Then again, problem is solved (thanks Mr. Bryce) so all is well that ends
well.
 
JayEs

I looked at the output in a hex editor. I suspected something like this.

Haven't used one of those in years... eheheh Thank God!
I don't know. It could be a "feature" of WWW::Mechanize. I suspect that
\xA0 is an ANSI non-breaking space. If so, it could be Win32 specific.

If it's any module "feature", then I would say it's TokeParser's. Before I
give it to TokeParser I write the HTML to a file, open the file and hand
*THAT* to the parser.

Regardless, the problem is solved (thank you) and I'll add this one to my
"things to watch for list" in future scraping attempts.
 
A. Sinan Unur

....

How much more accurate do I need to be?

* I asked a question with a clear description of the problem: How to
split string that contains a space.

Clearly, however, the string you were trying to split did not contain a
space but &nbsp;, the HTML entity for 'non breaking space'.
* Proposed solution didn't do the job,

The proposed solution did the job for the problem you described.
* Not due to giving inaccurate information, but due to
the problem being obfuscated by an idiosyncrasy of the OS

It has nothing to do with the OS. &nbsp; means 'non-breaking space'. As
such, HTML::Entities converts it to "\240" not a space character.
I don't think that saying I didn't give accurate information is
entirely fair...

It is more than fair, and your resistance to learning is astounding me.
 
Tad McClellan

JayEs said:
* I asked a question with a clear description of the problem: How to split
string that contains a space.


But that was *not* your problem, as you have seen.

So it was inaccurate.

I don't think that saying I didn't give accurate information is entirely
fair...


There is no concept of "fair" when discussing _facts_.

The Nameless quote above (please provide an attribution when you
quote someone) was not about _you_ it was about the _information_.

The information was clearly inaccurate, a factual truth.

Then again, problem is solved (thanks Mr. Bryce) so all is well that ends
well.


Bugs are most often a slap-the-forehead I-sure-do-feel-silly
type of thing. Get used to it. :)

Nameless did not say anything about you, Nameless (this is getting
tedious) said something about the facts that were presented.
 
