XML, DTDs and getting the newspaper on the web

  • Thread starter Brandons of mass destruction

Brandons of mass destruction

OK, I work for a small newspaper that's looking to translate our content
to the web.

XML sounds like a good way to do this, but I'm a little confused.

Is it possible to have <b> and <i> tags in an XML document? We'd need
them for editorial style (movie names are set in italics, etc.).

When writing a DTD for the paper, it seems as though everything must
appear in a linear fashion, i.e. headline, body, endnotes.
Yet how do I handle sidebars, which appear within the body?
Can I make sub-elements?

Also, our paper has different sections with different types of content.
Should I write one DTD that covers every possible style in our paper, or
break it down into several different DTDs? Does it matter?

Is anyone using InDesign CS to export content to XML? If so, how
successful is it?
 

Martin Honnen

Brandons said:
OK, I work for a small newspaper that's looking to translate our content
to the web.

XML sounds like a good way to do this, but I'm a little confused.

Is it possible to have <b> and <i> tags in an XML document? We'd need
them for editorial style (movie names are set in italics, etc.).

You can have any tag in an XML document (subject to the restrictions
that the XML specification puts on element names), so of course you can
have a <b> or <i> element. However, such elements do not automatically
carry the HTML presentational semantics of bold or italic text. Unless
you use the namespace of a well-specified document markup language
like XHTML, e.g.
<b xmlns="http://www.w3.org/1999/xhtml">text</b>
your element is just an element with the name "b", which can mean bold,
big, brother or whatever you want it to mean, and a browser will not
understand it as meaning bold.
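
A small sketch of that distinction, using Python's standard xml.etree.ElementTree as the parser (my choice for illustration, not something the original setup mentions). The helper name is_xhtml_bold is made up. The point is only that a namespace-aware parser sees two different element names:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A <b> element in no namespace: just an element that happens to be named "b".
plain = ET.fromstring("<b>text</b>")

# The same local name, this time placed in the XHTML namespace.
xhtml = ET.fromstring('<b xmlns="http://www.w3.org/1999/xhtml">text</b>')

print(plain.tag)  # b
print(xhtml.tag)  # {http://www.w3.org/1999/xhtml}b


def is_xhtml_bold(element):
    """Return True only for a <b> element in the XHTML namespace (hypothetical helper)."""
    return element.tag == "{%s}b" % XHTML_NS
```

Only the second element would pass a check like is_xhtml_bold; the first is "b" in name only.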
 

Brandons of mass destruction

ah.

Well, how would I go about preserving those bold and italic tags?
 

Victor

Brandons said:
Well, how would I go about preserving those bold and italic tags?

Two options come to mind:
- Save your files as XHTML, the XML-ified version of HTML. At least with
the older versions of XHTML <b> and <i> tags should have their
traditional meaning
- Enter the semantic web. This is a relatively new way of designing web
pages where content (data/semantics) is separated from presentation.
E.g., if you know movie names should be presented as italic text, you
have a <movie> tag with a child <name>, which is formatted using a
separate stylesheet rule (CSS or XSLT). This way, when the layout
changes, you only need to change the stylesheet to reflect the change in
all documents. Also, "italic" in this case is not information _about
the movie_, so you would avoid adding data which really isn't needed.
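
To make the second option concrete, here is a rough sketch in Python (standard library only). In practice the mapping would live in a CSS or XSLT stylesheet; the PRESENTATION table, the sample sentence and the render function are all invented for illustration and just stand in for that stylesheet:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical editorial markup: it says what the text *is*, not how it looks.
SOURCE = "<p>We saw <movie><name>Casablanca</name></movie> last night.</p>"

# Stand-in for the stylesheet: the one place that decides presentation.
# None means "emit the content without a wrapper element".
PRESENTATION = {"movie": None, "name": "i"}


def render(element):
    """Return XHTML-style markup for element, applying PRESENTATION."""
    inner = escape(element.text or "") + "".join(
        render(child) + escape(child.tail or "") for child in element
    )
    tag = PRESENTATION.get(element.tag, element.tag)
    if tag is None:
        return inner
    return "<%s>%s</%s>" % (tag, inner, tag)


print(render(ET.fromstring(SOURCE)))
# <p>We saw <i>Casablanca</i> last night.</p>
```

Changing "i" to "b" in PRESENTATION restyles every movie name without touching a single article.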

To learn how to use the semantic web, you can have a look at the
W3Schools XML tutorial (http://www.w3schools.com/xml/) to get you
started. Afterwards, you might want to check out the tutorials for other
W3C technologies (http://www.w3schools.com/), an interesting article
about the semantic web
(http://www.creativebehavior.com/index.php?PID=87), and the W3C
standards themselves (http://www.w3.org/).
 

Andy Dingley

Brandons of mass destruction said:
OK, I work for a small newspaper that's looking to translate our content
to the web.

Look first at existing standards like RSS 1.0, NITF, NewsML, XHTML,
DocBook, Dublin Core etc.

Secondly, look at off-the-shelf content management systems. This is a
big project you're dealing with here – it's the sort where
organisations can disappear into a software tar-pit (and some go bust
as a result!). If you aren't "a software guy", then grab a copy of
Steve McConnell's "Rapid Development", which is an essential for
anyone managing a software project for the first time.

You're not just writing a DTD here. You're writing a _system_. Think
system all the way through (or else your life will become tiresome).
How is authoring done, and how do things get into this DTD format ?
How do you store and manage the great many articles you'll have
(including editorial review and embargo) ? How do you finally publish
them ? How do you let users write dummy articles and publish them,
for training or demo purposes ? All this stuff needs to be thought
through, because the alternative is to go live in an unclear manner
(and we can imagine how publishing deadlines are conducive to
re-working software on a now-live system).

Brandons of mass destruction said:
When writing a DTD for the paper,

Don't set out by writing a DTD. Really, truly bad idea. DTD-writing
is for those who are the first team ever to address a particular
problem. If you're Yet Another Newspaper, then use someone else's
existing DTD rather than writing your own.

As a separate point, I suggest using XML Schema instead of a DTD.

If you can't find a schema to suit all your needs (which is likely),
then assemble a composite from several sources. In almost all cases,
you can solve 90% of your problem by bolting together (XML
namespacing) existing schemas.

Brandons of mass destruction said:
it seems as though everything must
appear in a linear fashion

No, everything must appear in a branching tree, which can be
serialised in a linear manner. It's not _quite_ as restrictive,
although it is close.
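
As a rough illustration of "a tree, serialised linearly", here is a sketch in Python's standard library. The element names (article, headline, body, sidebar, endnotes) are made up to match the question, not taken from any real newspaper DTD:

```python
import xml.etree.ElementTree as ET

# Hypothetical article: the sidebar is simply a child of the body.
ARTICLE = """
<article>
  <headline>Council approves budget</headline>
  <body>
    <p>First paragraph.</p>
    <sidebar>
      <p>Budget at a glance.</p>
    </sidebar>
    <p>Second paragraph.</p>
  </body>
  <endnotes>Reporting by staff.</endnotes>
</article>
"""


def outline(element, depth=0):
    """Yield (depth, tag) pairs in document order (a depth-first walk)."""
    yield depth, element.tag
    for child in element:
        yield from outline(child, depth + 1)


root = ET.fromstring(ARTICLE)
for depth, tag in outline(root):
    print("  " * depth + tag)
```

The file is a flat run of characters, but the indented outline shows the branching: the sidebar sits inside the body, between two paragraphs, with paragraphs of its own.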

My last few years of work have been with RDF for seriously
complex content management (i.e. automated editorial and content
_production_, not just publishing). That's an area where I needed to
go beyond XML's Infoset tree model. For the general publishing case of
articles authored and subbed by humans, though, XML is adequate.

These days I'm working on magazine sites like www.t3.co.uk,
www.laptopmagazine.co.uk and http://gamesradar.msn.co.uk. These are
examples of several dozen magazine sites generated by the same XML /
XSLT-based CMS. They use their own schema (bad move in hindsight).

Content Management seems to be a different problem with every user you
talk to. There are three broad directions to approach it from, though:
content, page layout and site structure. Some users worry much more
about one aspect than others. A newspaper probably has a fairly simple
site structure, and certainly one that's long-term stable (so the
one-off design costs are less crucial). Page layout may or may not be
an issue, depending on the destination of your content – if "publish
on the web" means offering RSS syndication, then you're effectively
avoiding the question anyway. For a page on your own site, you will
care about this though.

Brandons of mass destruction said:
i.e. headline, body, endnotes.
Yet how do I handle sidebars, which appear within the body?

No big deal. There is a "document order" (which as we've just noted
needs to fit onto XML) and there is a layout order (which is two or
more dimensional). You need to relate one to the other, but this isn't
a problem. The linearity of one doesn't enforce that same order onto
the other.

One question is how final layout is specified. Is the placing of a
call-out etc. a matter for the text content author (so they embed
positioning information in the article body), or the page designer ?
Does it need to move around, depending on how text is filling the
available space ? An entirely workable solution is to simply have a
"callout" property (or set of them) and leave the positioning up to
the final rendering engine. After all, this is how it's done by the
art ed. in a paper-based world.
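
One way to picture the "leave positioning to the renderer" approach, as a minimal Python sketch. The <callout> element and the split_for_layout function are hypothetical names, not part of any existing schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical body: the author marks a callout but says nothing about placement.
BODY = ET.fromstring(
    "<body>"
    "<p>Main story, first paragraph.</p>"
    "<callout>A pull quote.</callout>"
    "<p>Main story, second paragraph.</p>"
    "</body>"
)


def split_for_layout(body):
    """Separate the running text from callouts; the renderer places callouts later."""
    flow = [el for el in body if el.tag != "callout"]
    callouts = [el for el in body if el.tag == "callout"]
    return flow, callouts


flow, callouts = split_for_layout(BODY)
print([el.text for el in flow])
print([el.text for el in callouts])
```

The document order stays in the XML; where the callout ends up on the page is decided downstream, by whatever plays the role of the art ed.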

The "content" aspect of CMS for newspapers is one of the most complex
(in terms of Schema), but fortunately it's also one of the best
established. You don't need to invent here, there's a lot you can
borrow from pre-existing standards.

Think of your "article" structure at two levels. One is "newspaper"
structure: bylines, callouts etc. The other is lower than this:
generalised text markup such as <b> and <i>. Steal this from XHTML
(particularly the use of <div>, <span> and the concept of coreattrs).

A problem with article authoring is maintaining consistency between
authors. You'll extract some structure (i.e. "headline", "abstract")
into a formal Schema and out of the article body. Other less-obvious
properties, such as bylines, might not be treated so explicitly and
find themselves styled in-line by the text editing tool. If you don't
give your users a good way to do something, they'll only find a bad
way to try it instead. Encourage them to style regularly-used elements
as <div class="byline" >…, not just a mish-mash of <newline /><b><i>…

Can inline formatting (<em>, <b> etc.) be applied to the major
properties (headline etc.)? You need to decide this early on (either
way will work), then make your system consistent in where it permits
it! A schema is a specification of the data model you support – you
need to extend this to every part of the system, from the authoring
tool to the database to the publishing engine and the final rendering
as HTML/CSS or PDF. It doesn't always have to implement every feature
at every level, but you have to _know_ how it is handled (or not) at
_every_ step, or you will go crazy when debugging it. This is a big
system you're dealing with here.

For newspapers, I don't see content representation as being a big
problem. I don't even see page layout as insurmountable. Magazine
page-layout OTOH is much trickier. It varies far more, there's more
branding distinction between titles, and it's generally more
design-centred than content-centred. But that you should want to hear
_my_ troubles….

Brandons of mass destruction said:
Can I make sub-elements?

What's a sub element ? If you mean "paragraph, but not quite like a
standard paragraph" then look at permitting a "class" attribute on
_all_ of the elements in your text formatting set. This is what HTML
does with the coreattrs set. You now know that _every_ element you
have can be treated equally, with a class to allow sub-classing its
behaviour like this (and an easy binding to CSS when you publish it).
Also allow some arbitrary containers like <div> and <span>, just as
placeholders to carry such attributes. The rest of the set (title, id
and lang) is nearly as useful too.
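
A quick sketch of how a uniform "class" attribute binds to CSS at publishing time, in Python with the standard library. The class names (standfirst, movie-title) and the css_selectors helper are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical text markup where any element may carry a class attribute.
TEXT = ET.fromstring(
    '<div class="article">'
    '<p class="standfirst">Summary paragraph.</p>'
    '<p>Ordinary paragraph about a <span class="movie-title">film</span>.</p>'
    "</div>"
)


def css_selectors(root):
    """Return the sorted CSS selectors ("tag.class") used anywhere under root."""
    found = set()
    for el in root.iter():
        for name in el.get("class", "").split():
            found.add("%s.%s" % (el.tag, name))
    return sorted(found)


print(css_selectors(TEXT))
# ['div.article', 'p.standfirst', 'span.movie-title']
```

Because every element accepts the same attribute, one loop covers the whole text set, and each selector it finds is a place to hang a stylesheet rule.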

Brandons of mass destruction said:
Also, our paper has different sections with different types of content.
Should I write one DTD that covers every possible style in our paper, or
break it down into several different DTDs? Does it matter?

Write it in modules, because it makes their management easier. But be
able to generate a single composite schema from all of this, as a
global overview to check you're avoiding collisions.
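
The collision check can be as simple as the following Python sketch. The module names and element lists are hypothetical; a real version would read the declared names out of the schema modules themselves rather than from a hand-written table:

```python
from collections import defaultdict

# Hypothetical schema modules, each listing the element names it declares.
MODULES = {
    "core-text": {"p", "b", "i", "div", "span"},
    "news": {"article", "headline", "byline", "callout"},
    "competitions": {"competition-question-set", "question", "headline"},
}


def find_collisions(modules):
    """Return {element name: sorted module names} for names declared more than once."""
    declared_in = defaultdict(list)
    for module, names in modules.items():
        for name in names:
            declared_in[name].append(module)
    return {n: sorted(m) for n, m in declared_in.items() if len(m) > 1}


print(find_collisions(MODULES))
# {'headline': ['competitions', 'news']}
```

Here "headline" is declared by two modules, which is exactly the sort of thing the composite overview is meant to catch before it reaches the authoring tool.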

Also allow the "documents" to have a choice of root element. Maybe
everything gets published as "publication", but there will be many
times in the content management system when it's useful to deal with
an "article" or a "competition-question-set" and still be able to
validate this fragment against the schema.


I really should write a white paper on "What's needed in an ideal
magazine publishing CMS" – I have a meeting on Monday where it would
be useful, so I might even make time for it. Maybe…
 
