Structured content from old documents

Gone are the days when technical documentation was written in editors like Word or WordPerfect: modern technical documentation is written in small chunks of text (topics) and managed in Component Content Management Systems (CCMSs).

Converting old, unstructured files into a format that is compatible with a CCMS is not straightforward. I can help you plan and execute such a conversion.

Unstructured vs. structured formats

Unstructured formats like PDF or raster images focus on how a document looks, and little else.

Structured formats like XML or JSON support many more functionalities, for example:

  • semantic tagging: a description of the meaning of content, instead of what content looks like.

  • metadata management: data about your content, including author, audience, applicability to specific products, models, orders, or customers.

  • section hierarchy: an outline of the document chapters, sections, or sub-sections at different levels.

  • filtering and conditional processing: mechanisms to include or exclude chunks of content during a publication.

  • content re-use: the possibility to re-use content across different documents.

  • relationship management: an outline of how topics are related or linked to each other and to other documents.

  • single-source publishing: the possibility to publish the same source content to different outputs (PDF, HTML, JSON, EPUB, etc.).
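
Three of the features listed above (semantic tagging, metadata, and conditional processing) can be illustrated with a few lines of Python. This is only a minimal sketch: the element names, the audience attribute, and the filter_by_audience function are made up for the example and do not follow any particular standard or CCMS.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical topic: tags describe meaning (warning, step), not appearance.
SOURCE = """
<topic id="pump-maintenance">
  <title>Maintaining the pump</title>
  <metadata>
    <author>Jane Doe</author>
    <product>Model X</product>
  </metadata>
  <body>
    <warning>Disconnect the power supply first.</warning>
    <step audience="operator">Check the oil level.</step>
    <step audience="technician">Replace the seal.</step>
    <step>Record the maintenance date.</step>
  </body>
</topic>
"""


def filter_by_audience(root, audience):
    """Return a copy of the tree without elements meant for another audience."""
    result = copy.deepcopy(root)
    # Take a snapshot of all elements so removals do not disturb the traversal.
    for parent in list(result.iter()):
        for child in list(parent):
            tagged = child.get("audience")
            # Elements without an audience attribute apply to everyone.
            if tagged is not None and tagged != audience:
                parent.remove(child)
    return result


topic = ET.fromstring(SOURCE)
operator_view = filter_by_audience(topic, "operator")
steps = [step.text for step in operator_view.iter("step")]
print(steps)
```

The same source topic could be filtered for technicians instead, which is the idea behind publishing different variants from a single source. Nothing comparable is possible with a PDF, because a PDF does not record which paragraph is a step or who it is meant for.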

Unstructured documents have none of these functionalities. They are also a poor match for Artificial Intelligence (AI) applications and increase the chance of hallucinations.

Converting unstructured content

Converting unstructured documents to a structured format is not straightforward: nowhere will you find information about, for example, the hierarchy of sections or the metadata. This information can, however, be recreated thanks to AI models.

I am not talking here about popular, general-purpose models like ChatGPT, Claude, or Gemini, but about more specialized software like MinerU or Docling.

These programs can transform OCR-scanned documents or PDFs into Markdown or XML. After some additional clean-up, these formats can in turn be fed into a CCMS.
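
To give an idea of what such a clean-up step might look like, here is a minimal sketch that rebuilds the section hierarchy from Markdown headings and writes it out as nested XML. It assumes the Markdown has already been produced by a conversion tool. The function name markdown_to_sections and the document, section, title, and p elements are hypothetical, and each non-empty line is simply treated as a paragraph, which a real conversion would need to refine.

```python
import re
import xml.etree.ElementTree as ET

# Matches Markdown headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_to_sections(markdown):
    """Nest Markdown headings and paragraphs into <section> elements."""
    root = ET.Element("document")
    # Stack of (heading level, element), from outermost to innermost section.
    stack = [(0, root)]
    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section", level=str(level))
            ET.SubElement(section, "title").text = match.group(2)
            stack.append((level, section))
        else:
            # Simplification: every non-heading line becomes one paragraph.
            ET.SubElement(stack[-1][1], "p").text = line
    return root


MARKDOWN = """\
# Installation
Unpack the device.
## Requirements
A flat surface is required.
## Wiring
Connect the cable.
# Operation
Press the start button.
"""

doc = markdown_to_sections(MARKDOWN)
ET.indent(doc)  # pretty-printing helper, available from Python 3.9
print(ET.tostring(doc, encoding="unicode"))
```

In this sample, "Requirements" and "Wiring" end up nested inside "Installation", while "Operation" becomes a second top-level section. The resulting tree is only a starting point: it would still need to be mapped to the topic types and metadata that the target CCMS expects.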