Investigate additional Schematron Rules for GeekoDoc #6

Open
3 of 10 tasks
tomschr opened this issue Oct 24, 2016 · 13 comments
Labels: geekodoc (Version independent general GeekoDoc issues), question (Further information is requested), schematron (Issues about Schematron)

tomschr commented Oct 24, 2016

In openSUSE/suse-doc-style-checker#117, I raised the question of whether a Schematron schema could be useful for SDSC. The same question can be asked for GeekoDoc as well.

A Schematron schema can be used in two ways:

  • Embedded
    Schematron rules are embedded inside the RNG schema.
  • Separate
    Schematron rules are collected outside in a different file (extension .sch). They are independent of the existing GeekoDoc RNG.

The validation procedure would be different:

  • Validation with embedded Schematron rules
    Schematron validation would be an integral part of the process. In other words, rule-based
    validation would run right after structural validation. The two can't be separated.
  • Validation with separate Schematron schema
    Validation with a separate Schematron schema would be step-wise. The first step would always be
    structural validation with RNG. If wanted (or needed), additional validation can then be performed
    with Schematron. The two validation processes can be separated.
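
The step-wise variant could look roughly like the sketch below. It assumes lxml is available; the two miniature schemas are made up for illustration and only stand in for the real GeekoDoc RNG and a future Schematron file. The Schematron step runs only if the structural step passes.

```python
from lxml import etree, isoschematron

# Hypothetical miniature schemas, NOT the real GeekoDoc RNG or geekodoc5.sch.
RNG = b"""
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://docbook.org/ns/docbook">
  <start>
    <element name="procedure">
      <oneOrMore>
        <element name="step"><text/></element>
      </oneOrMore>
    </element>
  </start>
</grammar>
"""
SCH = b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="d" uri="http://docbook.org/ns/docbook"/>
  <pattern>
    <rule context="d:procedure">
      <assert test="count(d:step) &gt; 1">A procedure needs more than one step.</assert>
    </rule>
  </pattern>
</schema>
"""


def validate(doc):
    """Return (structure_ok, rules_ok); rules_ok is None if step 1 failed."""
    # Step 1: structural validation with RELAX NG.
    relaxng = etree.RelaxNG(etree.fromstring(RNG))
    if not relaxng.validate(doc):
        return False, None
    # Step 2 (optional in principle): rule-based validation with Schematron.
    schematron = isoschematron.Schematron(etree.fromstring(SCH))
    return True, schematron.validate(doc)
```

Because the second step is a separate call, a caller could skip it or treat its result as a warning only.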

Rick Jelliffe, the inventor of Schematron, describes the language as "a feather duster to reach the parts other schema languages cannot reach". ;-)

Benefits

  • Additional checks which cannot be expressed by RNG.
  • Relationship conditions don't need to be checked in SDSC.
  • Kind of structural quality checks (are there any lonely sections? Procedure with a single step?)
  • Conformance checks (IDs should adhere to a certain pattern?)
  • Schematron validation step can be optional or imperative depending on our definition of validation.
  • Additional validation step can be included into DAPS gradually.

Schematron Versions

Currently, there are two versions of Schematron:

  • ISO-Schematron (published May 2006)
    The de facto standard of Schematron. The new namespace is http://purl.oclc.org/dsdl/schematron.

  • Schematron 1.5 (published 2001)
    The old reference implementation in pure XSLT. The namespace is http://xml.ascc.net/schematron/.
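
Since the two versions differ in their namespace, a schema file's flavor can be guessed from its root element. A minimal sketch using only the standard library; the function name and the returned labels are made up for this example:

```python
import xml.etree.ElementTree as ET

# Namespace URIs as listed above; the labels are just for display.
NAMESPACES = {
    "http://purl.oclc.org/dsdl/schematron": "ISO Schematron",
    "http://xml.ascc.net/schematron/": "Schematron 1.5",
}


def schematron_flavor(sch_source):
    """Guess the Schematron version from the namespace of the root element."""
    root = ET.fromstring(sch_source)
    # ElementTree spells a qualified tag as "{namespace}localname".
    namespace = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
    return NAMESPACES.get(namespace, "unknown")
```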

Tools

Schematron validation is supported by:

  • xmllint and option --schematron.
  • The Python library lxml, see http://lxml.de/validation.html#id2
  • Jing supports Schematron 1.5. The implementation is partly XSLT and partly Java.

Personal

From my perspective, I prefer the separate Schematron schema (assuming all is possible, feasible, or useful). It seems this doesn't introduce too many changes and gives greater flexibility.

I see it more as a "conformance and consistency" check rather than a hard validation. Of course, the rules shouldn't bother our writers too much.

Maybe we should also (re?)think about our definition of "validity/validation".

--

Update: List of Checks

Hard Rules

  • Import/check against the rules from docbook.sch (upstream DocBook).
  • Check for spaces in xml:id
  • Check for more than 1 step inside a procedure.
  • Check for more than 1 member inside a simplelist.

Soft Rules

  • Check for more than 1 listitem inside orderedlist or itemizedlist.
  • Check for more than 1 varlistentry inside variablelist.
  • Check if you have more than 10 steps inside a procedure.
  • Check for a title inside admonition elements (note, tip, warning).
  • Check for specific rules following xml:id attributes.
  • Check for lonely sections(?)
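
To give an impression of how two of the soft rules might look in ISO Schematron, here is a sketch run through lxml. The rule set and its messages are made up for illustration and are not an agreed GeekoDoc schema; the SVRL namespace is the one used by the ISO reference stylesheets.

```python
from lxml import etree, isoschematron

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"

# Hypothetical rule set; element names follow DocBook 5, message wording is invented.
SOFT_RULES = b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="d" uri="http://docbook.org/ns/docbook"/>
  <pattern>
    <title>Admonitions carry a title</title>
    <rule context="d:note | d:tip | d:warning">
      <assert test="d:title">Admonition without a title.</assert>
    </rule>
  </pattern>
  <pattern>
    <title>Procedures stay short</title>
    <rule context="d:procedure">
      <assert test="count(d:step) &lt;= 10">Procedure with more than 10 steps.</assert>
    </rule>
  </pattern>
</schema>
"""


def soft_messages(doc):
    """Run the soft rules and return the text of every failed assertion."""
    schematron = isoschematron.Schematron(
        etree.fromstring(SOFT_RULES), store_report=True
    )
    schematron.validate(doc)
    # The SVRL report holds one failed-assert element per violated assertion.
    failed = schematron.validation_report.xpath(
        "//svrl:failed-assert/svrl:text", namespaces={"svrl": SVRL_NS}
    )
    return [" ".join(node.xpath("string()").split()) for node in failed]
```

Returning the messages instead of a plain boolean keeps it open whether a caller treats them as errors or as hints.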

@sknorr I've separated the discussion in SDSC from the GeekoDoc aspect. Feel free to comment. :)

@tomschr tomschr added question ❓ Further information is requested docbook5 labels Oct 24, 2016
@tomschr tomschr self-assigned this Oct 24, 2016

ghost commented Oct 26, 2016

I guess adding this to GeekoDoc might be the better idea for the time being...

For an idea of what we could do with Schematron directly in GeekoDoc, see openSUSE/suse-xsl#222. There are quite a number of cases associated with table markup, and you currently tend to notice those issues only at the FO->PDF step because FOP balks.

This is also not really style checker territory because it really leads to hard errors that are not caught by current validation methods. Then again, if we have more such cases, we could move some checks from the style checker to GeekoDoc.

@tomschr tomschr modified the milestones: 0.9.7, Future Nov 22, 2016

tomschr commented Nov 28, 2016

DocBook >= 5.0 also brings some (ISO) Schematron files, see /usr/share/xml/docbook/schema/sch/5.1/docbook.sch. For example, it checks if footnote contains another footnote child.

However, it seems, oXygen is not that happy with the schema. It shows this error message:

cvc-complex-type.3.2.2: Attribute 'name' is not allowed to appear in element 's:pattern'.

This is the respective line:

<s:pattern name="Glossary 'firstterm' type constraint">

which should be corrected like this:

<s:pattern>
    <s:title>Glossary 'firstterm' type constraint</s:title>
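
If more patterns in that file carry a name attribute, the correction could be scripted. A rough sketch with the standard library, assuming the file uses the ISO namespace; the function name is made up:

```python
import xml.etree.ElementTree as ET

SCH_NS = "http://purl.oclc.org/dsdl/schematron"


def move_pattern_names_to_titles(root):
    """Turn each pattern/@name into a leading <title> child; return the count."""
    changed = 0
    # Copy the matches into a list first so the tree is not modified mid-iteration.
    for pattern in list(root.iter(f"{{{SCH_NS}}}pattern")):
        name = pattern.attrib.pop("name", None)
        if name is None:
            continue
        title = ET.Element(f"{{{SCH_NS}}}title")
        title.text = name
        pattern.insert(0, title)  # title has to come first inside pattern
        changed += 1
    return changed
```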

@tomschr tomschr added the geekodoc Version independent general GeekoDoc issues label Nov 28, 2016

ghost commented Nov 28, 2016

The tools side of Schematron seems to be interesting ...

  • jing supports Schematron 1.5 (with some limitations, according to toms); toms says he does not really want to use the older version of the standard that is supported there
  • libxml (i.e. xmllint) has (some) Schematron 1.5 support [which is not mentioned in the man page]
  • lxml has ISO Schematron support (written in Python, needs a small wrapper, there is active development, provides Schematron->XSLT conversion based on reference implementation but no native Schematron implementation)
  • ph-schematron supports ISO Schematron but would be a new tool (written in Java, seems like there is active development, provides Schematron->XSLT conversion or native Schematron implementation) -- seems like our best shot
  • Probatron supports ?? (basically dead, but there are lots of forked projects on GitHub)

Websites related to Schematron are also interesting: They seem to either show lots of 404 errors (schematron.com has a working front page but all sub pages 404), lead to ad farms (Rick Jelliffe's home page with the reference implementation, Probatron) or advertise proprietary software (Oxygen, XML Buddy, Topologi).

I am starting to think that investing in Schematron at this point might not be such a good idea.

[edit 1, sknorr: libxml does have Schematron 1.5 support but it is not mentioned in the man page.]
[edit 2, sknorr: lxml has ISO Schematron support which I overlooked initially.]


tomschr commented Nov 28, 2016

libxml (i.e. xmllint, xsltproc & lxml) do not support Schematron

Actually, this is not quite true. There is the option --schematron. However, as far as I can see, you can only use Schematron 1.5 with that. So in a way, you can say libxml "supports" Schematron---although I wouldn't say nicely.

I wouldn't consider this a valid alternative...


tomschr commented Nov 28, 2016

I think the best approach would be to write a wrapper in Python using the lxml library. This library supports ISO Schematron.

A quick fix reveals some nice features:

from lxml import isoschematron
from lxml import etree

# Create a Schematron parser:
sch_doc = etree.parse("geekodoc5.sch")
schematron = isoschematron.Schematron(sch_doc)

# Parse our DocBook5 source:
doc = etree.parse("foo.xml")
schematron.validate(doc)
# => False

print(schematron.error_log)
# => Prints an extensive error log (XML) which can be parsed

I think this can easily be turned into a small Python "Schematron validation script". ;-)
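
A minimal sketch of such a script, assuming lxml is installed. The command-line shape (schema first, document second) and the exit codes are assumptions made for this example, not an existing tool's interface:

```python
import sys

from lxml import etree, isoschematron


def main(argv):
    """Hypothetical usage: script.py SCHEMA.sch DOCUMENT.xml"""
    if len(argv) != 3:
        print(f"usage: {argv[0]} SCHEMA.sch DOCUMENT.xml", file=sys.stderr)
        return 2
    schematron = isoschematron.Schematron(etree.parse(argv[1]))
    doc = etree.parse(argv[2])
    if schematron.validate(doc):
        return 0
    # Each log entry carries the failed assertion as serialized XML.
    for entry in schematron.error_log:
        print(entry.message, file=sys.stderr)
    return 1
```

As a command-line tool it would end with sys.exit(main(sys.argv)), so that the exit code tells a calling program whether validation passed.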


tomschr commented Nov 28, 2016

[...] I am starting to think that investing in Schematron at this point might not be such a good idea.

Yes, I can understand that you get this impression. I've recently discovered this 404 page as well. Not sure why this isn't available anymore. Nevertheless, I don't think it is that bad. As I've shown in my earlier post, it can be used in lxml, with some minimal scripting efforts.

All in all, I wouldn't abandon Schematron at this stage. Of course, if lxml reveals some technical problems, we will need to think again.


tomschr commented Nov 29, 2016

Apart from my last comment, we should add specific rules depending on GeekoDoc and our style guide.

Definitions

I would suggest distinguishing between "hard" and "soft" rules:

  • Hard rules are "must have" rules; if the result is false, these rules issue an error message and abort the validation.
  • Soft rules are recommendations. They issue informative messages, but neither break nor abort the validation.
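
In plain Python, the difference in behavior could be sketched as follows. This only illustrates the two definitions; the Finding class, the severity labels, and the exception are made up and say nothing about how the rules would be marked up in Schematron.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str  # "hard" or "soft" (labels assumed for this sketch)
    message: str


class ValidationAborted(Exception):
    """Raised when a hard rule fails."""


def process(findings):
    """Collect soft findings as notes; abort at the first hard finding."""
    notes = []
    for finding in findings:
        if finding.severity == "hard":
            raise ValidationAborted(finding.message)
        notes.append(finding.message)
    return notes
```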

Hard Rules

  • Import/check against the rules from docbook.sch (upstream DocBook).
  • Check for more than 1 step inside a procedure.
  • Check for more than 1 listitem inside orderedlist or itemizedlist.
  • Check for more than 1 varlistentry inside variablelist.
  • Check for more than 1 member inside a simplelist.

Soft Rules

  • Check if you have more than 10 steps inside a procedure.
  • Check for a title inside admonition elements (note, tip, warning).
  • Check for specific rules following xml:id attributes.
  • Check for lonely sections(?)

I've probably missed other rules.


ghost commented Nov 29, 2016

toms wrote...

  • Check for more than 1 listitem inside orderedlist or itemizedlist.
  • Check for more than 1 varlistentry inside variablelist.

Both of those rules are good ways to make our "documentation updates" sections fail validation... :/


tomschr commented Nov 29, 2016

Both of those rules are good ways to make our "documentation updates" sections fail validation... :/

Ahh, right! Ok, we could move these from hard to soft rules. I'm just trying to collect some examples...


ghost commented Nov 29, 2016

As I said somewhere above: within tables, comparing the actual number of columns with the columns set up via colspec would be great. And there are more table issues that should make validation fail but don't, such as bad column name references.

We could also check for spaces in ID attributes, such as xml:id=" foo.bar", which also passes current validation unhindered but fails when building HTML or PDF.
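
The logic behind the ID check is small. Here it is as a plain-Python sketch using only the standard library, not as a Schematron rule; the function name is made up:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def ids_with_whitespace(root):
    """Return every xml:id value that contains whitespace."""
    bad = []
    for elem in root.iter():
        value = elem.get(XML_ID)
        if value is not None and any(ch.isspace() for ch in value):
            bad.append(value)
    return bad
```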

These would also give us added value as opposed to reimplementing something that is already covered by SDSC.


tomschr commented Nov 29, 2016

counting the column numbers of tables v/ within colspec would be great. And there are more issues concerning tables that should make validation fail but don't: such as the column name references etc.

Well, we could check if the value of @cols and the number of colspec elements are the same. That is easy. Also checking column name references shouldn't be too hard. I'll add that into our list.
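
Both table checks can be sketched in plain Python with the standard library. This is a simplification under stated assumptions: it looks at one tgroup, compares @cols with the number of colspec children, and treats colname, namest, and nameend on entry as column references. Nested entrytbl tables and spanspec are ignored, and the function name is made up.

```python
import xml.etree.ElementTree as ET

DB = "{http://docbook.org/ns/docbook}"


def check_tgroup(tgroup):
    """Return a list of problem descriptions for one tgroup element."""
    problems = []
    colspecs = tgroup.findall(f"{DB}colspec")
    declared = tgroup.get("cols")
    # Check 1: @cols has to match the number of colspec children.
    if colspecs and declared != str(len(colspecs)):
        problems.append(f"cols={declared!r} but {len(colspecs)} colspec element(s)")
    # Check 2: every column reference has to name a declared column.
    names = {c.get("colname") for c in colspecs if c.get("colname")}
    for entry in tgroup.iter(f"{DB}entry"):
        for attr in ("colname", "namest", "nameend"):
            ref = entry.get(attr)
            if ref is not None and ref not in names:
                problems.append(f"entry/@{attr} refers to unknown column {ref!r}")
    return problems
```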

However, tables can get complicated when cells that span columns or rows are involved.

We could also check for spaces in ID attributes

Great idea!

These would also give us added value as opposed to reimplementing something that is already covered by SDSC.

But don't we want to move these parts into the Schematron schema?


tomschr commented Nov 29, 2016

Moved the list of checks into original description.


tomschr commented Dec 21, 2016

From #6 (comment), I've tried to create a script which can validate our (yet to be defined) Schematron schema. In the long run, the script can be integrated into daps (if not, it was a good exercise 😀 ).

@sknorr: For a first draft, see https://github.com/openSUSE/schvalidator
