Automated Documentation Check with LanguageTool
Here at CoreMedia we write our documentation in DocBook, using IntelliJ IDEA as an editor for the XML sources. From this XML we generate PDF and WebHelp manuals.
The documentation is part of our source code repository and is integrated into CoreMedia’s continuous integration process with Jenkins, Sonar and the like. Naturally, the demand for a Sonar-like quality measurement for documentation arose.
The first task is to determine the metrics that we want to monitor. Unfortunately there is, at least for now, no way to automatically test the accuracy and completeness of the information, so we have to stick to more obvious features, such as:
- Size of the manual measured through the number of chapters, tables, figures…
- Spelling errors
- Grammar errors
- CoreMedia style guide errors
The first point is easy: simply count the corresponding DocBook tags in the manual using XPath. The others require a checker that can be integrated into the build process and that delivers its results in a format usable for further processing.
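Counting tags with XPath could look roughly like the following sketch. It uses only the JDK's XML APIs; the class name DocBookStats and the method countElements are hypothetical and not part of the docbkx-maven-plugin, and the sample XML is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the code used in our plugin.
final class DocBookStats {

    /** Counts all elements with the given local name in a DocBook source. */
    static int countElements(String docBookXml, String elementName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(docBookXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // local-name() matches elements with or without a namespace.
            // elementName is assumed to be a trusted, plain tag name.
            String expression = "count(//*[local-name()='" + elementName + "'])";
            Double count = (Double) xpath.evaluate(expression, doc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("Could not count " + elementName, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document.
        String xml = "<book><chapter><table/><figure/></chapter>"
                + "<chapter><figure/></chapter></book>";
        for (String name : new String[] {"chapter", "table", "figure"}) {
            System.out.println(name + ": " + countElements(xml, name));
        }
    }
}
```

Matching on the local name is a design choice that keeps the same expression working whether or not the DocBook sources declare a namespace.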
After searching the web we stumbled upon LanguageTool (www.languagetool.org). LanguageTool is an open source tool that offers a stand-alone client, a web front-end and a Java library for all the checks we want to do.
Integrating the Java library into our adapted version of the docbkx-maven-plugin was easy. We added the Maven dependency to the project and created a new Maven goal that instantiates the LanguageTool object:
langTool = new JLanguageTool(new AmericanEnglish());
langTool.activateDefaultPatternRules();
The second statement shows the real strength of LanguageTool: its rules. Spell checking is done with Hunspell, but all of the grammar and style checks are defined in rules, written either in Java code or in XML. A simple XML rule that checks for the correct usage of email would look like this:
<rule id="mode" name="Style: Do not write e-mail">
  <pattern>
    <token>e-mail</token>
  </pattern>
  <message>CoreMedia Style: It's <suggestion>email</suggestion> not e-mail</message>
  <example type="correct">Send an <marker>email</marker></example>
  <example type="incorrect">Send an <marker>e-mail</marker></example>
</rule>
More complicated rules are possible using regular expressions and POS (part of speech, see http://en.wikipedia.org/wiki/Part_of_speech) tags. LanguageTool comes with a large set of predefined rules for common grammar errors and can be extended with your own rules. So, we implemented our style guide with XML rules.
When we start the check we get the results as a list of RuleMatch objects:
List<RuleMatch> matches = langTool.check(textString);
From a RuleMatch object we can get all the interesting information, such as the error message, the position, a suggested correction and more. In our HTML result pages, we show, for instance, the following information from a predefined rule:
In the build process we generate an overview site for all manuals:
At the beginning we got a lot of errors that were not real errors but shortcomings of the checker. There were three main reasons for this:
- Words not known by the spellchecker (all the acronyms used in IT writing, for example)
- Grammar rules not applicable to the format of our text
- Words like file names or class names that can’t be known by the spellchecker
We applied three measures to overcome the false positives:
- Creating a list of ignored words for the spellchecker. The list is managed in the repository so everyone can add new words.
- Deactivating rules in LanguageTool with langTool.disableRule(deactivatedRule). The list of deactivated rules is also managed in the repository.
- Tagging all specific words, such as file names and class names, with the appropriate DocBook element and filtering these elements out of the DocBook sources before the check.
With this approach we were able to remove nearly all false positives.
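The third measure, filtering tagged words out of the DocBook sources, might be sketched as follows. This is a minimal illustration built on the JDK's DOM parser, not the code from our plugin: the class DocBookTextFilter, the method extractCheckableText and the list of skipped elements are all assumptions made for the example.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class DocBookTextFilter {

    // Assumed selection of DocBook elements whose content should not be checked.
    private static final Set<String> SKIPPED = new HashSet<>(
            Arrays.asList("filename", "classname", "code", "programlisting"));

    /** Returns the text of the document without the content of skipped elements. */
    static String extractCheckableText(String docBookXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(docBookXml)));
            StringBuilder text = new StringBuilder();
            collect(doc.getDocumentElement(), text);
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse DocBook source", e);
        }
    }

    // Walks the tree depth-first and appends text nodes in document order.
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        if (type == Node.ELEMENT_NODE && SKIPPED.contains(node.getLocalName())) {
            return; // drop the whole element, including nested markup
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        // Made-up sample paragraph.
        String xml = "<para>Open <filename>contentserver.properties</filename>"
                + " and restart the server.</para>";
        System.out.println(extractCheckableText(xml));
    }
}
```

Dropping the element content entirely can leave gaps that some grammar rules stumble over, so a real filter might insert a neutral placeholder word instead.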
Having an overview page makes the state of the documentation visible and leads to better quality. LanguageTool is a great product for this. It’s easy to integrate and to use, and it is very powerful. Questions in the forum or on the mailing lists have been answered quickly. So, give it a try if you want to monitor the quality of your documentation.