DITA: Converting HTML to DITA

Although you can convert content to DITA manually, you can also convert your HTML content to DITA through the XHTML to DITA transform in Oxygen.

Converting content to DITA isn't a small undertaking, because you essentially have to retag everything with DITA markup. There are some automated ways of converting content, but if your source content isn't already in a DITA-friendly format (for example, if you have lots of topics that combine lists and concepts, or that have nested subsections with third-level headers), the conversion might require some restructuring. Nevertheless, you can speed up the process using a combination of HTML Tidy and Oxygen's XHTML to DITA transform.

Grab and clean HTML source code

  1. First view the source code and copy the HTML inside the body tags.

    Most tools, including Microsoft Word, allow you to generate an HTML version of the content. You can view the source code in a browser page by right-clicking the page and choosing View source.

  2. Go to HTML Tidy and run the copied content through this processor to clean it.

    There are a variety of settings on the HTML Tidy page. You can just use the defaults. Paste your source content into the HTML box, click Tidy, and then click View Tidied HTML. You don't have to include all the page content. Most likely when you look at the source of a page, you'll see the navigation, header content, footer content, etc. You might not want to bring this over. Just insert the body content. Tidy will supply the necessary HTML head tags to make the page valid.

    After cleaning the HTML, copy the entire output.

  3. In OxygenXML, go to File > New, expand the New Document folder, and select HTML.
  4. Save the file with a generic name such as "html template."
    You'll use this same html template for converting each page. When you run the XHTML to DITA transform, Oxygen will create a new file from this template.
  5. Press Ctrl+A to select everything in your html template file and delete it. Then paste in the HTML you copied from HTML Tidy and save the file.
  6. For the title of your document, add the title between h1 tags right below the opening body tag.

    The transform will look for the first h1 tag and insert this as the document title. If you don't have an h1 tag, the first heading-level tag will get rendered as the document title. That heading will then actually be removed from the body! Therefore, it's important not to forget to add the h1 tag to your content before running the transform.
  7. Click the Configure transformation scenarios button and select XHTML to DITA Concept.
    You could also choose Topic or Task, but if you choose Task, you'll need to make sure the content already mostly conforms to the task topic type.
  8. Save the new file with the proper name and, if desired, choose the .dita extension.
  9. Compare the newly converted DITA file with the original HTML file and make sure all the sections carried over. Before you start applying post-processing, you want to be certain all the content is actually there.
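
As a rough sketch of steps 3 through 6, the html template file might look something like this just before you run the transform. The title text, headings, and paragraph content are hypothetical placeholders, and the markup HTML Tidy actually produces for your page will differ.

```xml
<html>
  <head>
    <!-- Head content supplied by HTML Tidy (placeholder title) -->
    <title>Hypothetical page title</title>
  </head>
  <body>
    <!-- Added by hand in step 6: the transform uses the first h1 as the topic title -->
    <h1>Hypothetical topic title</h1>
    <!-- Body content pasted from HTML Tidy follows -->
    <h2>Hypothetical section heading</h2>
    <p>Hypothetical paragraph text.</p>
  </body>
</html>
```

With the h1 in place, the h2 stays in the body as a section heading instead of being consumed as the document title.
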
Although you've converted the content to DITA, there are still some clean-up and other post-processing tasks to do.

Clean up the conversion notes

  1. Look in the source code of the newly converted topics and address any warnings, notes, or other conversion problems.

    When you look at the source of the newly converted DITA topics, you'll see that many of them have sections that have comments in them, such as this:

        <!--Original: <span @class=aui-icon icon-warning>-->
          <span class="ph aui-icon icon-warning">Icon</span>

    In this case, the original source used this class for notes. The transform doesn't know how to map classes to note elements, so you'll have to manually tag these sections as notes.

    The transform converts class attributes to outputclass attributes. (The outputclass attribute converts back to a class attribute when you transform your DITA content into HTML.) However, most likely the class values from your previous platform won't have the same meaning on your new platform.

  2. You can bulk delete content across all DITA files by going to Find > Find/Replace in Files.
    Find and replace across files is handy for cleaning up all of these conversion notes at once.
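
To illustrate the manual retagging described in step 1, here is one way the converted warning icon could become a DITA note. This is only a sketch: the sentence text and the surrounding p element are hypothetical, and type="warning" is an assumption about what the original class was meant to convey.

```xml
<!-- Before: output of the transform (the paragraph text is a hypothetical example) -->
<!--Original: <span @class=aui-icon icon-warning>-->
<p><span class="ph aui-icon icon-warning">Icon</span> Back up your files before upgrading.</p>

<!-- After: conversion comment and icon span removed, content retagged as a note -->
<note type="warning">Back up your files before upgrading.</note>
```

If the original class marked a plain note rather than a warning, the type attribute could simply be left off.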

Find opportunities for re-use (DITA)

One of the reasons for converting to DITA is to harness the content re-use capabilities. Now you should extract redundant content into separate files for re-use. This is the tricky part. If you migrated content out of Confluence, and you were using Multiexcerpt Include macros to single source content into multiple files, you'll want to assess the content and figure out how you want to single source the material.

You have a couple of options for re-using content:
  • Conref. You could create a generic file to store common content, and then use conref tags where you want to insert this content. See Conref (re-use of content) for more details. Using conref makes sense especially for notes and other small chunks that are re-used across many different files.
  • Conditionalization. You could conditionalize the content so that you have attributes corresponding to different outputs on parts of the page. See DITA: Conditional profiling for more details. Conditional profiling makes sense when you have a few variations of the same topic for different audiences.
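
The two options above can be sketched in markup. In the conref fragments below, the file name shared_notes.dita, the IDs, and the note text are hypothetical. The conref value follows the pattern file#topicid/elementid.

```xml
<!-- shared_notes.dita: hypothetical file that stores reusable content -->
<topic id="shared_notes">
  <title>Shared notes</title>
  <body>
    <note id="backup_warning" type="warning">Back up your files before upgrading.</note>
  </body>
</topic>

<!-- In any topic that needs the note: pulls in the content by reference -->
<note conref="shared_notes.dita#shared_notes/backup_warning"/>
```

For conditional profiling, a minimal sketch might tag variations with the audience attribute and filter them with a DITAVAL file at build time. The audience values, paragraph text, and DITAVAL file name are assumptions for illustration.

```xml
<!-- In the topic: each paragraph is tagged for a hypothetical audience -->
<p audience="administrator">Restart the server to apply the change.</p>
<p audience="end_user">Ask your administrator to apply the change.</p>

<!-- end_user.ditaval: hypothetical filter file that drops administrator content -->
<val>
  <prop att="audience" val="administrator" action="exclude"/>
</val>
```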