hCard parsing

Klaus Alexander Seistrup · Oct 17, 2006

Hi group,

I am new to xgawk (and seemingly to XML also), and I've been struggling
all afternoon to get xgawk¹ to parse an XHTML file containing an hCard²,
without luck. I wonder if you guys could give me a push...

Let's say I have the following XHTML file:

#v+

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>hCard example</title>
</head>
<body>
<h1>hCard</h1>
<div class="vcard">
<h2 class="fn n">
<span class="given-name">John</span>
<span class="additional-name">Brian</span>
<span class="family-name">Doe</span>
</h2>
<p class="adr">
<div class="street-address">123 Circle Drive</div>
<div class="locality">South Metropolis</div>
<span class="region">XYZ</span> <span class="postal-code">012345</span><br />
<abbr title="Aaland Islands"><span class="country">AX</span></abbr>
</p>
</div>
</body>
</html>

#v-

I would like to end up with something like:

#v+

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

#v-

I have been playing with xgawk and the ECB forex reference rates³,
and I have no problems extracting the exchange rates and having xgawk
calculate cross rates, but I can't seem to get xgawk to parse the
simple hCard above. I have read the XMLgawk documentation and
studied the examples, and I have been googling this group. Still
no luck.

Thanks for any help or hints.

Cheers,
Klaus.

¹) http://home.vrweb.de/~juergen.kahrs/gawk/XML/
²) http://microformats.org/wiki/hcard
³) http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml

Jürgen Kahrs · Oct 17, 2006

Klaus said:
I would like to end up with something like:

#v+

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

#v-

Try this one:

XMLCHARDATA { data = $0 }
XMLENDELEM == "span" { name = name data " " }
XMLENDELEM == "h2" ||
XMLENDELEM == "br" ||
XMLENDELEM == "abbr" { print name; name = "" }
XMLENDELEM == "div" { print data }

xgawk -lxml -f hcard.awk hcard.xml

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

³) http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml

Thanks for the link.

Klaus Alexander Seistrup · Oct 17, 2006

Jürgen Kahrs said:
Try this one:

XMLCHARDATA { data = $0 }
XMLENDELEM == "span" { name = name data " " }
XMLENDELEM == "h2" ||
XMLENDELEM == "br" ||
XMLENDELEM == "abbr" { print name; name = "" }
XMLENDELEM == "div" { print data }

Thanks. I guess I should have been more specific -- or general, as
it is.

What I want to achieve is a general hCard parser. While your code
above works for the specific hCard sample, it doesn't trigger on
hCard markup, and it will fail for most other hCards in the universe.

Rather than relying on certain tags, the parser should look for
things like 'class="given-name"', 'class="street-address"', and only
if found within a 'class="vcard"'.
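
To make the requirement concrete, here is a minimal sketch in Python rather than xgawk, using only the standard library's ElementTree. It treats the class attribute as a space-separated token list, looks for any element carrying the token "vcard" regardless of its tag, and only then collects the wanted classes inside it. The names SAMPLE, class_tokens, find_hcards and properties are made up for this illustration, and the sample is a cut-down version of the XHTML above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one property outside the card, three inside it.
SAMPLE = """<body>
  <p class="given-name">Outside any card</p>
  <div class="vcard">
    <h2 class="fn n">
      <span class="given-name">John</span>
      <span class="family-name">Doe</span>
    </h2>
    <div class="street-address">123 Circle Drive</div>
  </div>
</body>"""


def class_tokens(elem):
    """Return the class attribute split into its space-separated tokens."""
    return elem.get("class", "").split()


def find_hcards(root):
    """Return every element whose class tokens include 'vcard', whatever its tag."""
    return [elem for elem in root.iter() if "vcard" in class_tokens(elem)]


def properties(card, wanted):
    """Map each wanted class name to the texts found inside the card."""
    found = {}
    for elem in card.iter():
        # Join all nested text and collapse runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        for token in class_tokens(elem):
            if token in wanted:
                found.setdefault(token, []).append(text)
    return found


for card in find_hcards(ET.fromstring(SAMPLE)):
    print(properties(card, {"given-name", "family-name", "street-address"}))
```

The paragraph with class="given-name" that sits outside the vcard element is never visited, because the search for properties starts at the card element rather than at the document root.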

Cheers,
Klaus.

Jürgen Kahrs · Oct 17, 2006

Klaus said:
Rather than relying on certain tags, the parser should look for
things like 'class="given-name"', 'class="street-address"', and only
if found within a 'class="vcard"'.

That's easy. It should go like this:

XMLCHARDATA { data = $0 }
XMLSTARTELEM == "vcard" { delete addr }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == "vcard" {
print addr["given-name"]
print addr["street-address"]
}

Klaus Alexander Seistrup · Oct 17, 2006

Jürgen Kahrs said:
Rather than relying on certain tags, the parser should look
for things like 'class="given-name"', 'class="street-address"',
and only if found within a 'class="vcard"'.

That's easy. It should go like this:

XMLCHARDATA { data = $0 }
XMLSTARTELEM == "vcard" { delete addr }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == "vcard" {
print addr["given-name"]
print addr["street-address"]
}

Either it's a bit more complicated than that, or my brain is just
not working...

The hCard is not found within a <vcard/>, rather within "something"
with a "vcard" attribute, in my example case it's

<div class="vcard">
:
</div>

but it could be a <p/>, a <ul/> or any other tag. I guess I could
find the tag by looking for '"vcard" in XMLATTR', but how do I find
the corresponding XMLENDELEM (the </div>) so that I know when to stop?

Cheers,

Jürgen Kahrs · Oct 17, 2006

Klaus said:
The hCard is not found within a <vcard/>, rather within "something"
with a "vcard" attribute, in my example case it's

<div class="vcard">
:
</div>

Yes, you are right, I overlooked that "vcard" is
not a tag but an attribute of any tag.

but it could be a <p/>, a <ul/> or any other tag. I guess I could
find the tag by looking for '"vcard" in XMLATTR', but how do I find
the corresponding XMLENDELEM (the </div>) so that I know when to stop?

You are on the right track. It should go like this:

XMLCHARDATA { data = $0 }
"vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == vcard_in {
print addr["given-name"]
print addr["street-address"]
}

Klaus Alexander Seistrup · Oct 18, 2006

Jürgen Kahrs said:
It should go like this:

XMLCHARDATA { data = $0 }
"vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == vcard_in {
print addr["given-name"]
print addr["street-address"]
}

We're getting closer, but still no cigar.

In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
which is not what I want. I have better luck with:

vcard_in = XMLPATH
and
XMLENDELEM && XMLPATH == vcard_in
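
The idea of saving the element path at the start tag and comparing it at each end tag can be sketched outside xgawk as well. The following is a Python analogue using the standard xml.parsers.expat module, not xgawk code: the path list stands in for XMLPATH, and card_end_events and SAMPLE are names invented for this sketch. It also assumes cards are not nested inside one another.

```python
import xml.parsers.expat

# Hypothetical sample: a card with two inner <div>s, followed by an
# unrelated sibling <div> that has the same path as the card itself.
SAMPLE = (
    '<body><div class="vcard">'
    '<div class="locality">South Metropolis</div>'
    '<div class="region">XYZ</div>'
    '</div><div>after</div></body>'
)


def card_end_events(xml_text):
    """Return the path of each end tag that closes a class="vcard" element."""
    path = []          # names of the currently open elements
    card_path = None   # path saved when the card's start tag was seen
    closed = []

    def current():
        return "/" + "/".join(path)

    def start(name, attrs):
        nonlocal card_path
        path.append(name)
        if card_path is None and "vcard" in attrs.get("class", "").split():
            card_path = current()

    def end(name):
        nonlocal card_path
        if card_path is not None and current() == card_path:
            closed.append(card_path)
            card_path = None  # forget it so a later sibling is not mistaken for the card
        path.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return closed


print(card_end_events(SAMPLE))
```

Of the four closing div tags in the sample, only the one that ends the card is reported. Clearing the saved path once the card has ended matters, because the unrelated div after the card has exactly the same path.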

However, while xgawk will find all relevant attributes, it seems that
the data variable will always either be empty or contain a linefeed.
I don't quite understand why, but I'm trying to find out...

Cheers,
Klaus.

Jürgen Kahrs · Oct 18, 2006

Klaus said:
In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
which is not what I want. I have better luck with:

vcard_in = XMLPATH
and
XMLENDELEM && XMLPATH == vcard_in

This should be equivalent to

XMLENDELEM == vcard_in && XMLDEPTH == 1

However, while xgawk will find all relevant attributes, it seems that
the data variable will always either be empty or contain a linefeed.
I don't quite understand why, but I'm trying to find out...

The data variable is filled with the content of the very
last character data that occurred immediately before
the </div> (usually a newline). If you want anything else
to be assigned to the data variable, change the condition
in front of the assignment of the data variable.
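
One way to read this advice: instead of keeping only the last chunk of character data, accumulate the text per open element and use it when that element ends. A rough Python analogue with the standard xml.parsers.expat module follows (again not xgawk code; collect_text and SAMPLE are names invented for the sketch).

```python
import xml.parsers.expat

# Hypothetical sample with newlines and indentation between the tags,
# which is where the "empty or linefeed" data comes from.
SAMPLE = """<div class="vcard">
  <h2 class="fn">
    <span class="given-name">John</span>
    <span class="family-name">Doe</span>
  </h2>
</div>"""


def collect_text(xml_text, wanted):
    """Map each wanted class name to the text inside elements carrying it."""
    stack = []   # one (class_tokens, text_chunks) pair per open element
    found = {}

    def start(name, attrs):
        stack.append((attrs.get("class", "").split(), []))

    def chars(data):
        # Text may arrive in several chunks, including the whitespace
        # between tags, so append it to every element that is still open.
        for _, chunks in stack:
            chunks.append(data)

    def end(name):
        tokens, chunks = stack.pop()
        text = " ".join("".join(chunks).split())  # trim and collapse whitespace
        for token in tokens:
            if token in wanted:
                found.setdefault(token, []).append(text)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return found


print(collect_text(SAMPLE, {"fn", "given-name", "family-name"}))
```

With this approach the whitespace before a closing tag no longer overwrites the useful text, and a container such as the fn heading ends up with the combined text of its children.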

Manuel Collado · Oct 19, 2006

Jürgen Kahrs wrote:

The data variable is filled with the content of the very
last character data that occurred immediately before
the </div> (usually a newline). If you want anything else
to be assigned to the data variable, change the condition
in front of the assignment of the data variable.

Parsing hCard is a really challenging task. The XSLT transformation program
from Brian Suda exceeds 1900 lines of text. This is a touchstone for any
XML processor.

Recognizing the elements with hCard content is the easy part. Collecting
values is much more difficult. The following code shows how to collect text
data for hCard properties/subproperties that can be arbitrarily nested. It
is a quick and dirty hack. The hCard structure is lost, and converted to a
flat list of property name/values. Only text data is collected. Properties
with value in attributes (like 'href') are not handled properly.

Maybe someday I could write a useable hCard parser, just to show the
capabilities of xmlgawk.

-- hcard.awk ---------------------------------------------------------
# Extract content of hCard

# Partial work. Just a quick and dirty hack!
# Author: Manuel Collado

# Global variables:
# hcroot (string) : path of the hCard root element
# hcard (array prop.name->value) : the extracted hCard (flat structure)
# hclevel (number) : nesting level of hCard prop/subprop
# hcpath (array hclevel->path) : stack of nested prop/subprop
# hckey (array hclevel->prop.name) : stack of nested prop/subprop
# hcvalue (array hclevel->value) : stack of nested prop/subprop

@load xml

BEGIN {
    XMLCHARSET = "UTF-8"
    hcsep = "|"
    # hCard property and subproperty names
    hckeys = "\\<(fn|n|family-name|given-name|additional-name|honorific-prefix" \
             "|honorific-suffix|nickname|sort-string|url|email|type|value" \
             "|tel|type|value|adr|post-office-box|extended-address" \
             "|street-address|locality|region|postal-code|country-name" \
             "|type|value|label|geo|latitude|longitude|tz|photo|logo|sound" \
             "|bday|title|role|org|organization-name|organization-unit" \
             "|category|note|class|key|mailer|uid|rev)\\>"
}

# hCard start
XMLSTARTELEM && (XMLATTR["class"] ~ "\\<vcard\\>") {
    hcroot = XMLPATH
    delete hcard
    hclevel = 0
    delete hcpath
    delete hckey
    delete hcvalue
}

# skip content outside the hCard
!hcroot {
    next
}

# data element (property) start
XMLSTARTELEM && hcroot {
    # push each keyword (if any) on the stack
    split( XMLATTR["class"], keylist, " " )
    for (k in keylist) {
        if (keylist[k] ~ hckeys) {
            hclevel++
            hckey[hclevel] = keylist[k]
            hcpath[hclevel] = XMLPATH
            hcvalue[hclevel] = ""
        }
    }
}

# character data
XMLCHARDATA && hclevel {
    # concatenate text fragments inside the same property
    hcvalue[hclevel] = hcvalue[hclevel] $0
}

# data element (property) end
XMLENDELEM && XMLPATH == hcpath[hclevel] {
    # pop the value from the stack and accumulate on parent data and hcard
    while (XMLPATH == hcpath[hclevel]) {
        value = hcvalue[hclevel]
        key = hckey[hclevel]
        if (key in hcard) {
            hcard[key] = hcard[key] hcsep xs_trim(value)
        } else {
            hcard[key] = xs_trim(value)
        }
        delete hcvalue[hclevel]
        hclevel--
        if (hclevel) {
            hcvalue[hclevel] = hcvalue[hclevel] value
        }
    }
}

# hCard end
XMLENDELEM && XMLPATH == hcroot {
    hcroot = ""
    # dump the collected data
    print "------------------------------------------"
    for (key in hcard) {
        print key ": " hcard[key]
    }
    print "------------------------------------------"
}

END {
    XmlCheckError()
}

# XMLgawk error reporting needs some redesign.
# Interim code: uses both ERRNO and XMLERROR to generate consistent messages
function XmlCheckError() {
    if (XMLERROR) {
        printf("\n%s:%d:%d:(%d) %s\n", FILENAME, XMLROW, XMLCOL, XMLLEN, XMLERROR)
    } else if (ERRNO) {
        printf("\n%s\n", ERRNO)
        ERRNO = ""
    }
}

#------------------------------------------------------------------
# xs_trim: remove leading and trailing [[:space:]] characters, and
# collapse repeated spaces into a single one
#------------------------------------------------------------------
function xs_trim( string ) {
    sub(/^[[:space:]]+/, "", string)
    if (string) sub( /[[:space:]]+$/, "", string )
    if (string) gsub( /[[:space:]]+/, " ", string )
    return string
}

Jürgen Kahrs · Oct 19, 2006

Hello Manuel,

Parsing hCard is a really challenging task. The XSLT transformation
program from Brian Suda exceeds 1900 lines of text. This is a touchstone
for any XML processor.

I wasn't aware that this hCard format has such widespread support.
I just looked at the example from Klaus and tried to
help him analyse his particular example. Do you have
a link to some kind of specification for the hCard format?

Klaus Alexander Seistrup · Oct 19, 2006

Jürgen Kahrs said:
Do you have any link to some kind of specification for the
hCard format ?

There was a link to microformats in my original posting:
http://microformats.org/

Cheers,

Johannes Koch · Oct 19, 2006

Jürgen Kahrs said:
Do you have
any link to some kind of specification for the hCard format ?

<http://microformats.org/wiki/hcard>

Klaus Alexander Seistrup · Oct 19, 2006

Manuel said:
The following code shows how to collect text data for hCard
properties/subproperties that can be arbitrarily nested. It
is a quick and dirty hack.

Thanks a lot, this is of great help to me in order to understand
XML processing with xgawk, I certainly appreciate it!

Cheers,
