hCard parsing

  • Thread starter Klaus Alexander Seistrup

Klaus Alexander Seistrup

Hi group,

I am new to xgawk (and seemingly to XML as well), and I've been struggling
all afternoon to get xgawk¹ to parse an XHTML file containing an hCard²,
without luck. I wonder if you guys could give me a push...

Let's say I have the following XHTML file:

#v+

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>hCard example</title>
</head>
<body>
<h1>hCard</h1>
<div class="vcard">
<h2 class="fn n">
<span class="given-name">John</span>
<span class="additional-name">Brian</span>
<span class="family-name">Doe</span>
</h2>
<p class="adr">
<div class="street-address">123 Circle Drive</div>
<div class="locality">South Metropolis</div>
<span class="region">XYZ</span> <span class="postal-code">012345</span><br />
<abbr title="Aaland Islands"><span class="country">AX</span></abbr>
</p>
</div>
</body>
</html>

#v-

I would like to end up with something like:

#v+

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

#v-

I have been playing with xgawk and the ECB forex reference rates³,
and I have no problems extracting the exchange rates and having xgawk
calculate cross rates, but I can't seem to get xgawk to parse the
simple hCard above. I have read the XMLgawk documentation and
studied the examples, and I have been googling this group. Still
no luck.

Thanks for any help or hints.

Cheers,
Klaus.

¹) http://home.vrweb.de/~juergen.kahrs/gawk/XML/
²) http://microformats.org/wiki/hcard
³) http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml
 
Jürgen Kahrs

Klaus said:
I would like to end up with something like:

#v+

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

#v-

Try this one:

XMLCHARDATA { data = $0 }
XMLENDELEM == "span" { name = name data " " }
XMLENDELEM == "h2" ||
XMLENDELEM == "br" ||
XMLENDELEM == "abbr" { print name; name = "" }
XMLENDELEM == "div" { print data }


xgawk -lxml -f hcard.awk hcard.xml

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

Thanks for the link.
 
Klaus Alexander Seistrup

Jürgen Kahrs said:
Try this one:

XMLCHARDATA { data = $0 }
XMLENDELEM == "span" { name = name data " " }
XMLENDELEM == "h2" ||
XMLENDELEM == "br" ||
XMLENDELEM == "abbr" { print name; name = "" }
XMLENDELEM == "div" { print data }

Thanks. I guess I should have been more specific -- or general, as
it is.

What I want to achieve is a general hCard parser. While your code
above works for the specific hCard sample, it doesn't trigger on
hCard markup, and it will fail for most other hCards in the universe.

Rather than relying on certain tags, the parser should look for
things like 'class="given-name"', 'class="street-address"', and only
if found within a 'class="vcard"'.

Cheers,
Klaus.
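
The rule described here treats the class attribute as a space-separated list of tokens, so class="fn n" carries two property names at once. A minimal sketch of that token test follows, written in Python rather than xgawk so that it runs without the XML extension. The helper names class_tokens and has_class are made up for illustration, and attributes are assumed to arrive as a plain dict.

```python
def class_tokens(attrs):
    """Return the set of whitespace-separated tokens in the class attribute."""
    return set(attrs.get("class", "").split())


def has_class(attrs, name):
    """True if `name` is one of the class tokens (not merely a substring)."""
    return name in class_tokens(attrs)


root = {"class": "vcard"}
heading = {"class": "fn n"}
other = {"class": "vcards-list"}

print(has_class(root, "vcard"))     # True
print(has_class(heading, "n"))      # True
print(has_class(other, "vcard"))    # False: "vcard" is only a substring here
```

Comparing whole tokens rather than substrings keeps an unrelated class such as "vcards-list" from being mistaken for a card root.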
 
Jürgen Kahrs

Klaus said:
Rather than relying on certain tags, the parser should look for
things like 'class="given-name"', 'class="street-address"', and only
if found within a 'class="vcard"'.

That's easy. It should go like this:

XMLCHARDATA { data = $0 }
XMLSTARTELEM == "vcard" { delete addr }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == "vcard" {
print addr["given-name"]
print addr["street-address"]
}
 
Klaus Alexander Seistrup

Jürgen Kahrs said:
Rather than relying on certain tags, the parser should look
for things like 'class="given-name"', 'class="street-address"',
and only if found within a 'class="vcard"'.

That's easy. It should go like this:

XMLCHARDATA { data = $0 }
XMLSTARTELEM == "vcard" { delete addr }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == "vcard" {
print addr["given-name"]
print addr["street-address"]
}

Either it's a bit more complicated than that, or my brain is just
not working...

The hCard is not found within a <vcard/>, rather within "something"
with a "vcard" attribute, in my example case it's

<div class="vcard">
:
</div>

but it could be a <p/>, a <ul/> or any other tag. I guess I could
find the tag by looking for '"vcard" in XMLATTR', but how do I find
the corresponding XMLENDELEM (the </div>) so that I know when to stop?

Cheers,
 
Jürgen Kahrs

Klaus said:
The hCard is not found within a <vcard/>, rather within "something"
with a "vcard" attribute, in my example case it's

<div class="vcard">
:
</div>

Yes, you are right, I overlooked that "vcard" is
not a tag but a class attribute value that can appear on any tag.

but it could be a <p/>, a <ul/> or any other tag. I guess I could
find the tag by looking for '"vcard" in XMLATTR', but how do I find
the corresponding XMLENDELEM (the </div>) so that I know when to stop?

You are on the right track. It should go like this:

XMLCHARDATA { data = $0 }
"vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == vcard_in {
print addr["given-name"]
print addr["street-address"]
}
 
Klaus Alexander Seistrup

Jürgen Kahrs said:
It should go like this:

XMLCHARDATA { data = $0 }
"vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == vcard_in {
print addr["given-name"]
print addr["street-address"]
}

We're getting closer, but still no cigar.

In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
which is not what I want. I have better luck with:

vcard_in = XMLPATH
and
XMLENDELEM && XMLPATH == vcard_in

However, while xgawk will find all relevant attributes, it seems that
the data variable will always either be empty or contain a linefeed.
I don't quite understand why, but I'm trying to find out...

Cheers,
Klaus.
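
To see why matching on the element name alone fires too often, here is a small sketch using Python's expat binding instead of xgawk. It counts how many end tags would be taken as the end of the card, first by name only and then by name plus nesting depth. The depth counter stands in for comparing XMLPATH values; the function name count_card_ends and the trimmed-down document are assumptions for illustration.

```python
import xml.parsers.expat

DOC = """<body>
<div class="vcard">
<div class="street-address">123 Circle Drive</div>
<div class="locality">South Metropolis</div>
</div>
</body>"""


def count_card_ends(doc, by_position):
    """Count end tags that would be treated as the end of the hCard."""
    depth = 0
    card_name = None    # tag name of the element carrying class="vcard"
    card_depth = None   # nesting depth at which that element was opened
    ends = 0

    def start(name, attrs):
        nonlocal depth, card_name, card_depth
        depth += 1
        if card_name is None and "vcard" in attrs.get("class", "").split():
            card_name, card_depth = name, depth

    def end(name):
        nonlocal depth, ends
        if card_name is not None and name == card_name:
            # by name only: every </div>; by position: only the matching one
            if not by_position or depth == card_depth:
                ends += 1
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(doc, True)
    return ends


print(count_card_ends(DOC, by_position=False))  # 3
print(count_card_ends(DOC, by_position=True))   # 1
```

With name matching, each inner </div> is counted as well as the outer one; adding the position check leaves only the tag that closes the card.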
 
Jürgen Kahrs

Klaus said:
In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
which is not what I want. I have better luck with:

vcard_in = XMLPATH
and
XMLENDELEM && XMLPATH == vcard_in

This should be equivalent to

XMLENDELEM == vcard_in && XMLDEPTH == 1

However, while xgawk will find all relevant attributes, it seems that
the data variable will always either be empty or contain a linefeed.
I don't quite understand why, but I'm trying to find out...

The data variable is filled with the content of the very
last character data that occurred immediately before
the </div> (usually a newline). If you want anything else
to be assigned to the data variable, change the condition
in front of the assignment of the data variable.
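
The effect described above can be reproduced outside xgawk. The sketch below uses Python's expat binding and records, for every element that has a class attribute, both the most recent character data seen when its end tag arrives and the full text collected while the element was open. The function name collect and the two-property sample are assumptions, not part of the scripts in this thread.

```python
import xml.parsers.expat

DOC = """<div class="vcard">
<span class="given-name">John</span>
<span class="family-name">Doe</span>
</div>"""


def collect(doc):
    last_chunk = {}  # class -> most recent character data at the end tag
    full_text = {}   # class -> all text inside the element, whitespace-normalised
    stack = []       # one (class value, list of text pieces) per open element
    latest = ""

    def start(name, attrs):
        stack.append((attrs.get("class", ""), []))

    def chars(data):
        nonlocal latest
        latest = data                 # a single variable is overwritten each time
        for _, pieces in stack:       # every open element accumulates the text
            pieces.append(data)

    def end(name):
        cls, pieces = stack.pop()
        if cls:
            last_chunk[cls] = latest
            full_text[cls] = " ".join("".join(pieces).split())

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True         # deliver contiguous text as one chunk
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(doc, True)
    return last_chunk, full_text


last, full = collect(DOC)
print(repr(last["vcard"]))   # '\n'
print(full["vcard"])         # John Doe
print(full["given-name"])    # John
```

For a leaf element such as the given-name span, the last chunk and the full text coincide. For the enclosing card the last chunk is only the newline before the closing tag, which is why the text has to be accumulated per element.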
 
Manuel Collado

Jürgen Kahrs escribió:
The data variable is filled with the content of the very
last character data that occurred immediately before
the </div> (usually a newline). If you want anything else
to be assigned to the data variable, change the condition
in front of the assignment of the data variable.

Parsing hCard is a really challenging task. The XSLT transformation program
from Brian Suda exceeds 1900 lines of text. This is a touchstone for any
XML processor.

Recognizing the elements with hCard content is the easy part. Collecting
values is much more difficult. The following code shows how to collect text
data for hCard properties/subproperties that can be arbitrarily nested. It
is a quick and dirty hack. The hCard structure is lost, and converted to a
flat list of property name/values. Only text data is collected. Properties
with value in attributes (like 'href') are not handled properly.

Maybe someday I could write a useable hCard parser, just to show the
capabilities of xmlgawk.


-- hcard.awk ---------------------------------------------------------
# Extract content of hCard

# Partial work. Just a quick and dirty hack!
# Author: Manuel Collado

# Global variables:
# hcroot (string) : path of the hCard root element
# hcard (array prop.name->value) : the extracted hCard (flat structure)
# hclevel (number) : nesting level of hCard prop/subprop
# hcpath (array hclevel->path) : stack of nested prop/subprop
# hckey (array hclevel->prop.name) : stack of nested prop/subprop
# hcvalue (array hclevel->value) : stack of nested prop/subprop

@load xml

BEGIN {
    XMLCHARSET = "UTF-8"
    hcsep = "|"
    # hCard property and subproperty names, matched as whole words
    hckeys = "\\<(fn|n|family-name|given-name|additional-name|honorific-prefix|honorific-suffix|nickname|sort-string|url|email|type|value|tel|type|value|adr|post-office-box|extended-address|street-address|locality|region|postal-code|country-name|type|value|label|geo|latitude|longitude|tz|photo|logo|sound|bday|title|role|org|organization-name|organization-unit|category|note|class|key|mailer|uid|rev)\\>"
}

# hCard start
XMLSTARTELEM && (XMLATTR["class"] ~ "\\<vcard\\>") {
    hcroot = XMLPATH
    delete hcard
    hclevel = 0
    delete hcpath
    delete hckey
    delete hcvalue
}

# skip content outside the hCard
!hcroot {
    next
}

# data element (property) start
XMLSTARTELEM && hcroot {
    # push each keyword (if any) on the stack
    split( XMLATTR["class"], keylist, " " )
    for (k in keylist) {
        if (keylist[k] ~ hckeys) {
            hclevel++
            hckey[hclevel] = keylist[k]
            hcpath[hclevel] = XMLPATH
            hcvalue[hclevel] = ""
        }
    }
}

# character data
XMLCHARDATA && hclevel {
    # concatenate text fragments inside the same property
    hcvalue[hclevel] = hcvalue[hclevel] $0
}

# data element (property) end
XMLENDELEM && XMLPATH == hcpath[hclevel] {
    # pop the value from the stack and accumulate on parent data and hcard
    while (XMLPATH == hcpath[hclevel]) {
        value = hcvalue[hclevel]
        key = hckey[hclevel]
        if (key in hcard) {
            hcard[key] = hcard[key] hcsep xs_trim(value)
        } else {
            hcard[key] = xs_trim(value)
        }
        delete hcvalue[hclevel]
        hclevel--
        if (hclevel) {
            hcvalue[hclevel] = hcvalue[hclevel] value
        }
    }
}

# hCard end
XMLENDELEM && XMLPATH == hcroot {
    hcroot = ""
    # dump the collected data
    print "------------------------------------------"
    for (key in hcard) {
        print key ": " hcard[key]
    }
    print "------------------------------------------"
}

END {
    XmlCheckError()
}

# XMLgawk error reporting needs some redesign.
# Interim code: uses both ERRNO and XMLERROR to generate consistent messages
function XmlCheckError() {
    if (XMLERROR) {
        printf("\n%s:%d:%d:(%d) %s\n", FILENAME, XMLROW, XMLCOL, XMLLEN,
               XMLERROR)
    } else if (ERRNO) {
        printf("\n%s\n", ERRNO)
        ERRNO = ""
    }
}

#------------------------------------------------------------------
# xs_trim: remove leading and trailing [[:space:]] characters, and
# collapse repeated spaces into a single one
#------------------------------------------------------------------
function xs_trim( string ) {
    sub(/^[[:space:]]+/, "", string)
    if (string) sub( /[[:space:]]+$/, "", string )
    if (string) gsub( /[[:space:]]+/, " ", string )
    return string
}
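
One detail of the key matching may be worth checking before reuse. The test keylist[k] ~ hckeys is not anchored to the whole token, so a hyphenated class that merely ends in a property name should also match. In addition, class="country" from the sample at the top of the thread is not in the list, which has country-name instead. A rough Python sketch of the difference follows, assuming Python's \b behaves like gawk's \< and \> for these inputs, and using a shortened, hypothetical subset of the key list.

```python
import re

# Shortened, hypothetical subset of the hckeys alternation
KEYS = "fn|n|adr|note|country-name"

unanchored = re.compile(r"\b(" + KEYS + r")\b")  # word-boundary match, as posted
anchored = re.compile(r"^(" + KEYS + r")$")      # whole-token match


def matches(token):
    """Return (unanchored result, anchored result) for one class token."""
    return bool(unanchored.search(token)), bool(anchored.search(token))


for token in ["note", "my-note", "country", "country-name"]:
    print(token, matches(token))
# note (True, True)
# my-note (True, False)
# country (False, False)
# country-name (True, True)
```

If whole-token matching is what is wanted, anchoring the pattern with ^ and $ in the awk script would be the analogous change, at the cost of rejecting any class token that is not exactly a listed name.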
 
Jürgen Kahrs

Hello Manuel,
Parsing hCard is a really challenging task. The XSLT transformation
program from Brian Suda exceeds 1900 lines of text. This is a touchstone
for any XML processor.

I wasn't aware that this hCard format has such widespread support.
I just looked at the example from Klaus and tried to
help him analyse his particular example. Do you have
any link to some kind of specification for the hCard format?
 
Klaus Alexander Seistrup

Manuel said:
The following code shows how to collect text data for hCard
properties/subproperties that can be arbitrarily nested. It
is a quick and dirty hack.

Thanks a lot, this is of great help to me in order to understand
XML processing with xgawk, I certainly appreciate it!

Cheers,
 
