Simplifying DITA authoring by using a Markdown to HTML to DITA workflow

by Tom Johnson on Oct 29, 2014
categories: dita • technical-writing

12/2/15 update: For more technical details on how to convert Markdown to HTML to DITA, see Convert Markdown to DITA in 20 seconds.

The other day I started to organize my notes on Java, and knowing that I eventually plan to publish these notes, I wondered what format I should write the content in. My first thought was, hey, I wrote my DITA QRG in DITA, so why not store my Java notes in DITA as well?

And then I had this nasty feeling of dread where something in my chest cringes and shrivels. If I was going to be drafting this content, the last thing I wanted to worry about was XML markup tags and structural complexity.

Were I to draft the content in DITA, I would be running scenarios in my head like, what if I want to put a list after a section? What if I want several lists? Will I be nesting tasks? Should I use the context element to begin narration around tasks, or use sections with the general task instead? What if I have two examples following a task list? What if I don't want to use formal steps at all, but rather show a sequence of commands and responses, and maybe a developed illustration?

You know what. Java is complicated enough as it is. I don't need a lot of XML markup complexity to consider as well. When I'm drafting content, I want to focus on the actual content I'm creating, not the formatting and publishing validation. I want to write the content in the structure that best matches the content.

So I decided to write my Java notes in Markdown instead. As I settled on writing in Markdown, I felt a huge feeling of relief. Then I wondered, shouldn't authors like the syntax they use as they write? If I have a gut reaction of dread when it comes to the DITA XML syntax, isn't that generally a sign of doom for that markup?

In general, people dislike writing in DITA XML. But they enjoy writing in Markdown.

To use an analogy, DITA XML is that ugly turtleneck wool sweater that makes you feel hot, itchy, and uncomfortable when you wear it. In contrast, Markdown is the comfortable stretchy nylon duo-dry shirt that you can wear for a few days at a time without wanting to take it off.

But the problem is more difficult, because even if you feel great writing in Markdown, you're going to be red in the face if you have to cut and paste content for multiple outputs, or if your publishing scenario is more complicated than something Markdown supports (e.g., 7 outputs, 80% similar text, 3 vertical channels, and translation).

In that case, those XML tags that allow you to do conditional profiling come in quite handy. You configure your filters for your transform scenario, and voila, you bypassed a ton of copying and pasting and published your 7 guides with the click of a few buttons.
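To make the filtering idea concrete, here is a minimal Python sketch of what conditional profiling does in spirit. It is not how the DITA Open Toolkit implements filtering; the sample fragment, the audience values, and the function names filter_audience and paragraphs are all made up for illustration. The one rule it mimics is that untagged content always stays, and tagged content is dropped only when every one of its audience values is excluded.

```python
import xml.etree.ElementTree as ET

# Hypothetical DITA-like fragment; the audience values are invented for this sketch.
SOURCE = """
<body>
  <p>Shared introduction for every guide.</p>
  <p audience="admin">Only administrators see this paragraph.</p>
  <p audience="developer">Only developers see this paragraph.</p>
  <p audience="admin developer">Admins and developers both see this one.</p>
</body>
"""

def filter_audience(xml_text, excluded):
    """Parse xml_text and drop elements whose audience values are all excluded."""
    root = ET.fromstring(xml_text)
    # Snapshot the parents first so children can be removed in place.
    for parent in list(root.iter()):
        for child in list(parent):
            values = child.get("audience", "").split()
            # Untagged content is always kept; tagged content is dropped only
            # when every one of its audience values is in the excluded set.
            if values and all(v in excluded for v in values):
                parent.remove(child)
    return root

def paragraphs(root):
    """Collect paragraph text so the two outputs are easy to compare."""
    return [p.text for p in root.iter("p")]

admin_guide = filter_audience(SOURCE, excluded={"developer"})
developer_guide = filter_audience(SOURCE, excluded={"admin"})
print(paragraphs(admin_guide))
print(paragraphs(developer_guide))
```

Both guides come from the same source, and each keeps the shared paragraph plus the paragraphs tagged for its audience. That is the copy-and-paste work the filters save you.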

What solution gives you a pain-free authoring process like Markdown while also tapping into the robust publishing capability of DITA?

I know quite a few people are working on this problem. Their efforts include a DITA-based wiki, lightweight DITA, rST to DITA parsers, Microsoft Word-like DITA editors, WYSIWYG DITA browser editors, and more.

While it is tempting to abandon DITA for a static file generator such as Docpad or Jekyll, I have implemented DITA because my publishing requirements are complicated. I'm generating at least 8 different outputs from the same source. Between two product channels and four different audiences, the content outputs are tailored to show the audience only the content that's relevant to them.

If I switched to Docpad, which has some query-engine capability, I doubt the query syntax would be any easier than the DITA syntax for accomplishing the same tricks. Static file generators work well if you have a single output that you're not pushing to multiple channels, audiences, and translation.

So if I stick with DITA, how can I make this cross easier to bear both for me and other contributors?

Here are some thoughts on a possible two-step approach. Note that I'm in brainstorming mode here. These are fairly untested ideas. In fact, I'm hoping to learn from others, i.e., you, before throwing myself headlong down a rabbit hole chasing an idea.

Step 1: Use DITA topics instead of concepts or tasks

First, I can cut out some of the unnecessary complexity in the DITA markup. Instead of writing in concepts, tasks, and references, why not use the more general topic type?

The difference between concept types and topic types is rather subtle. With a concept, you can't follow a section with an element that isn't a section. Once you start using sections, it's sections all the way from there on out.

The big question is whether to forgo the task structure. All of those steps, step, cmd, info, stepxmp, etc. elements are robust. They allow you to provide a lot of semantic accuracy to a complex task structure, which is where we live, right, in tasks? But what's really the point? When you transform the DITA content to XHTML, all of these elements just get rendered as regular old HTML list and paragraph tags (and a few divs and spans). Deliberating about whether to use stepxmp or info in a step element is somewhat of a waste of time, since both get rendered into block elements.

For example, here's a simple task of three steps written in DITA:

<steps>
      <step>
        <cmd>This is the first step.</cmd>
        <info>This is some additional information included in an info element.</info>
      </step>
      <step>
        <cmd>This is the second step.</cmd>
        <stepxmp>this is an stepxmp content.</stepxmp>
      </step>
      <step>
        <cmd>This is the third step.</cmd>
        <tutorialinfo>This is some tutorial info stuff.</tutorialinfo>
      </step>
</steps>

The XHTML transform renders it into this:

<ol class="ol steps">
      <li class="li step stepexpand">
        <span class="ph cmd">This is the first step.</span>
        <div class="itemgroup info">This is some additional information included in an info element.</div>
      </li>
      <li class="li step stepexpand">
        <span class="ph cmd">This is the second step.</span>
        <div class="itemgroup stepxmp">this is an stepxmp content.</div>
      </li>
      <li class="li step stepexpand">
        <span class="ph cmd">This is the third step.</span>
        <div class="itemgroup tutorialinfo">This is some tutorial info stuff.</div>
      </li>
</ol>

We mostly get ol and li tags, plus span and div tags with some classes.

How about just writing the same content in a topic type? You would use ol, li, and p tags instead, like this:

<ol>
      <li>This is the first step.
        <p>This is some additional information included in an p element.</p>
      </li>
      <li>This is the second step.
        <p outputclass="special">this is a p content with class.</p>
      </li>
      <li>This is the third step.
        <p>This is some tutorial info stuff.</p>
      </li>
</ol>

(Notice how much cleaner and lighter that markup itself feels?) This is what it gets transformed into when you use the XHTML transform:

<ol class="ol">
      <li class="li">This is the first step.
        <p class="p">This is some additional information included in an p element.</p>
      </li>
      <li class="li">This is the second step.
        <p class="p special">this is a p content with class.</p>
      </li>
      <li class="li">This is the third step.
        <p class="p">This is some tutorial info stuff.</p>
      </li>
</ol>

It's mostly the same thing but with different classes. Your level of styling control is slightly reduced, but unless you were going to have different styles for info elements versus stepxmp elements versus tutorialinfo elements (potentially resulting in a cornucopia of styles), there isn't much point in giving each element its own class.

Was there any great win in using the task type instead of the general topic type? Not in my opinion. The HTML output is pretty much the same. Unless you're styling all of these things differently, there's little benefit in using specialized elements such as info or stepxmp instead of simply p. If you need a special class on a paragraph, you can just add it using the outputclass attribute.
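As a rough illustration of how outputclass ends up as a class in the output, here is a toy Python rendering rule. It is a sketch, not the actual XHTML transform: the function name add_classes is hypothetical, and the rule (class equals the element name plus any outputclass value) simply imitates the pattern visible in the output above.

```python
import xml.etree.ElementTree as ET

# One step from the general-topic example above.
TOPIC_LIST = """
<ol>
  <li>This is the second step.
    <p outputclass="special">this is a p content with class.</p>
  </li>
</ol>
"""

def add_classes(elem):
    """Toy rendering rule: class = element name plus any outputclass value."""
    for node in elem.iter():
        classes = [node.tag]
        # outputclass is a DITA attribute, so remove it and fold it into class.
        extra = node.attrib.pop("outputclass", None)
        if extra:
            classes.append(extra)
        node.set("class", " ".join(classes))
    return elem

html = add_classes(ET.fromstring(TOPIC_LIST))
print(ET.tostring(html, encoding="unicode"))
```

The paragraph comes out with class "p special", so a stylesheet can still target it without a dedicated element for every kind of step content.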

Using the general topic, you don't have to worry about whether an info element is allowed inside stepxmp or not, and so on. It's simpler and more straightforward.

The information type argument

DITA's information types do much more than simply transform content. They make sure that certain information patterns that are optimized for user learning are followed. In particular, the DITA information model encourages "small non-linear chunks readable in any order".

I would love to see an actual study (more than one guy's pre-Internet research 25 yrs ago at IBM) that explores whether DITA's information types actually optimize learning. (Any academics out there? This would be a great topic for a dissertation.) I'm much more persuaded by the "Every Page Is Page One" philosophy. At any rate, I trust my writerly instincts about how to communicate information more than I trust information typing patterns.

Regardless of the efficacy of information patterns, you can still follow them even if you aren't using DTDs that enforce them. For example, you can write lots of task-based information using the general topic type. You can choose to separate out the conceptual information into its own topics. You can choose to create topics that are just long tables of information.

In summary, if you want to follow the information types, you can do so in your general topics. But by using the general topic type, you'll also have the leeway to make exceptions to the information types if warranted -- without going out of your way to implement a workaround.

Step 2: Draft in Markdown

Here's a key principle to keep in mind. When you're still developing the content, focus on the content. When you're thinking about how to publish it, focus on the markup.

I'm planning to develop my content in Markdown (as I'm doing with this post now). When the content is ready to be published, I'll convert it to HTML or DITA XML and make sure all those little tags are in place.

If you use Eclipse, the workflow from Markdown to DITA is pretty quick. The Mylyn Wikitext plugin allows you to right-click a Markdown document and transform it to HTML. Then, using the Eclipse OxygenXML plugin, you right-click the HTML document and choose which DITA topic type you want to convert it to. Here's a 30-second video showing this conversion process:

[video width="984" height="729" mp4="https://s3.us-west-1.wasabisys.com/idbwmedia.com/video/mdtohtmltodita.mp4"][/video]
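For readers who don't use Eclipse, the HTML to DITA half of that workflow can be sketched in a few lines of Python. This is only an illustration of the idea, not what the OxygenXML plugin does: it assumes the HTML is well-formed XHTML, the function name html_to_topic and the sample content are hypothetical, and it handles only the handful of tags that share a name in HTML and the general topic type.

```python
import xml.etree.ElementTree as ET

# Assumed input: well-formed XHTML of the kind a Markdown converter emits.
HTML = """
<html><body>
  <h1>Install the SDK</h1>
  <p>Follow these steps.</p>
  <ol><li>Download the archive.</li><li>Unzip it.</li></ol>
</body></html>
"""

# Tags that exist under the same name in HTML and in a general DITA topic.
SHARED_TAGS = {"p", "ol", "ul", "li"}

def html_to_topic(html_text, topic_id):
    """Wrap the body of an XHTML page in a general DITA topic element."""
    body_in = ET.fromstring(html_text).find("body")
    topic = ET.Element("topic", id=topic_id)
    title = ET.SubElement(topic, "title")
    body_out = ET.SubElement(topic, "body")
    for child in body_in:
        if child.tag == "h1" and title.text is None:
            title.text = child.text   # the first h1 becomes the topic title
        elif child.tag in SHARED_TAGS:
            body_out.append(child)    # same element name in DITA, copied as-is
        # Anything else (h2, div, ...) is skipped in this sketch.
    return topic

topic = html_to_topic(HTML, "install_sdk")
print(ET.tostring(topic, encoding="unicode"))
```

Because the general topic type accepts p, ol, ul, and li directly, most of the converted content needs no restructuring. Inline HTML such as strong or em would still need mapping to DITA equivalents, which this sketch leaves out.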

(By the way, Mylyn Wikitext supposedly allows you to convert directly from a wiki syntax into DITA, but I couldn't figure it out. I think you have to configure some Ant scripts and run the conversion from there. If you know how to configure this, let me know.)

Feedback?

I'm interested to hear your process for writing. Do you draft your content in DITA markup from the start, or do you add the tags later? If you add the tags from the start, doesn't it add extra overhead in managing the tags? (I mean, you can't even write a paragraph without surrounding it with p tags.) If you add the tags later, doesn't that put you in a time crunch?
