DITA: Converting HTML to DITA

Although you can convert content to DITA manually, you can also convert your HTML content to DITA through the XHTML to DITA transform in Oxygen.

Tip: See also DITA: Author in Markdown, publish with DITA for another way of converting HTML to DITA.

Converting content to DITA isn't a small undertaking, because you essentially have to retag everything with DITA markup. There are some automated ways of converting content, but the conversion might require some restructuring if your source content isn't already in a DITA-friendly format: for example, if you have lots of topics that combine lists and concepts, or that have nested subsections (third-level headers). Nevertheless, you can speed up the process using a combination of HTML Tidy and Oxygen's XHTML to DITA transform.

Note: If you have a large conversion project, this method probably isn't suitable. If you have thousands of topics to convert, for example, take a look at Stilo or some other automated process. You may need to write custom scripts that tag content based on your structure. If, on the other hand, you have fewer than 100 pages to convert, the method described here might be just fine.

Grab and clean HTML source code

First view the source code and copy the HTML inside the body tags.

Most tools, including Microsoft Word, allow you to generate an HTML version of the content. You can view the source code in a browser page by right-clicking the page and choosing View source.
Go to HTML Tidy and run the copied content through this processor to clean it.

There are a variety of settings on the HTML Tidy page. You can just use the defaults. Paste your source content into the HTML box, click Tidy, and then click View Tidied HTML. You don't have to include all the page content. Most likely when you look at the source of a page, you'll see the navigation, header content, footer content, etc. You might not want to bring this over. Just insert the body content. Tidy will supply the necessary HTML head tags to make the page valid.

After cleaning the HTML, copy the entire output.
In Oxygen, go to File > New, expand the New Document folder, and select HTML.
Save the file with a generic name such as "html template."
You'll use this same html template file for converting each page. When you run the XHTML to DITA transform, Oxygen will create a new file from this template.
Press Ctrl+A to select everything in your html template file and delete it. Then paste in the HTML you copied from HTML Tidy and save the file.
For the title of your document, add the title between h1 tags right below the opening body tag.

The transform will look for the first h1 tag and insert this as the document title. If you don't have an h1 tag, the first heading-level tag will be rendered as the document title. That heading will then actually be removed! Therefore, it's important not to forget to add the h1 tag to your content before running the transform.
Note: If you're converting a page with a lot of code, the transform may not recognize the code samples unless they're wrapped in pre tags. If the transform can't recognize the code, it may eliminate the code section.
Click the Configure transformation scenarios button and select XHTML to DITA Concept.
You could also choose Topic or Task, but if you choose Task, you'll need to make sure the content already mostly conforms to the task topic type.
Save the new file with the proper name and, if desired, choose the .dita extension.
Compare the newly converted DITA file with the original HTML file and make sure all the sections carried over. Before you start applying post-processing, you want to be certain all the content is actually there.
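
As a rough illustration of the steps above, the html template file might look something like this just before you run the transform. This is only a sketch: the head content, the page title, and the sample text are made-up placeholders, and the exact markup HTML Tidy produces will depend on your source and settings.

```xml
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <!-- Head tags like these are supplied by HTML Tidy (placeholder title) -->
    <title>Sample page</title>
  </head>
  <body>
    <!-- Title added by hand right below the opening body tag -->
    <h1>Sample page</h1>
    <p>Body content pasted from HTML Tidy goes here.</p>
    <h2>Example request</h2>
    <!-- Code samples are wrapped in pre tags so the transform can recognize them -->
    <pre>curl https://example.com/api/items</pre>
  </body>
</html>
```

The two things to check are the h1 directly under the body tag and the pre tags around any code.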

Although you've converted the content to DITA, there are still some clean-up and other post-processing tasks to do.

Clean up the conversion notes

Look in the source code of the newly converted topics and address any warnings, notes, or other conversion problems.
When you look at the source of the newly converted DITA topics, you'll see that many of them have sections that have comments in them, such as this:
```
    
      <span class="ph aui-icon icon-warning">Icon</span>
                    
```
In this case, the original source used this class for notes. The transform doesn't know how to map classes to note elements, so you'll have to manually tag these sections as notes.

The transform converts HTML class attributes to the DITA outputclass attribute. (The outputclass attribute converts back to a class attribute when you transform your DITA content into HTML.) However, the classes from your previous platform most likely won't have the same meaning on your new platform.
You can bulk delete content across all DITA files by going to Find > Find/Replace in Files.
This is handy for cleaning up all of these conversion notes at once.
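
To make the manual retagging concrete, here is a minimal before-and-after sketch. The "before" line is the span from the sample above. The "after" line is an assumption about how you might choose to tag it: the warning type and the placeholder text are illustrative, not something the transform produces.

```xml
<!-- Before: converted content that still carries the old platform's icon class -->
<span class="ph aui-icon icon-warning">Icon</span>

<!-- After: the same section retagged by hand as a DITA note (placeholder text) -->
<note type="warning">Text of the original warning goes here.</note>
```

Other values of the type attribute on note, such as tip or important, may fit better depending on what the original class was used for.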

Find opportunities for re-use (DITA)

One of the reasons for converting to DITA is to harness its content re-use capabilities. Now you should extract redundant content into separate files for re-use. This is the tricky part. If you migrated content out of Confluence, and you were using Multiexcerpt include macros to single source content into multiple files, you'll want to assess the content and figure out how you want to single source the material.

You have a couple of options for re-using content:

Conref. You could create a generic file to store common content, and then use conref tags where you want to insert this content. See Conref (re-use of content) for more details. Using conref makes sense especially for notes and other small chunks that are re-used across many different files.
Conditionalization. You could conditionalize the content so that you have attributes corresponding to different outputs on parts of the page. See DITA: Conditional profiling for more details. Conditional profiling makes sense when you have a few variations of the same topic for different audiences.
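
The two options above might look roughly like this in practice. This is a sketch under stated assumptions: the file names, ids, audience values, and sample text are all hypothetical, and the two files are shown in one block only for convenience.

```xml
<!-- File: shared/notes.dita (hypothetical generic file that stores common content) -->
<topic id="shared_notes">
  <title>Shared notes</title>
  <body>
    <note id="backup_warning" type="warning">Back up your files before you upgrade.</note>
  </body>
</topic>

<!-- File: install.dita (hypothetical topic that re-uses and profiles content) -->
<concept id="install">
  <title>Installing</title>
  <conbody>
    <!-- Conref: pulls in the note by file, topic id, and element id -->
    <note conref="shared/notes.dita#shared_notes/backup_warning"/>
    <!-- Conditionalization: audience is a DITA profiling attribute; the values are made up -->
    <p audience="admin">Administrators can also install from the command line.</p>
    <p audience="enduser">Ask your administrator for the installer.</p>
  </conbody>
</concept>
```

At publish time, a DITAVAL file would typically include or exclude the audience values for each output, while the conref resolves to the shared note wherever it is referenced.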

About Tom Johnson

I'm an API technical writer based in the Seattle area. On this blog, I write about topics related to technical writing and communication — such as software documentation, API documentation, AI, information architecture, content strategy, writing processes, plain language, tech comm careers, and more. Check out my API documentation course if you're looking for more info about documenting APIs. Or see my posts on AI and AI course section for more on the latest in AI and tech comm.

If you're a technical writer and want to keep on top of the latest trends in the tech comm, be sure to subscribe to email updates below. You can also learn more about me or contact me. Finally, note that the opinions I express on my blog are my own points of view, not that of my employer.
