OBO Format and Manchester Syntax
At Neuroinformatics 2009, David Sutherland and I talked about the problems of ontology building. One of the current (and past!) difficulties is to choose an appropriate language for representing the knowledge in your ontology. I thought I would write my thoughts up as a post; this will probably result in the most boring thing I have ever written (I am sure someone will point out worse offenses); syntax is dull but distressingly important.
In bioinformatics, there are essentially two choices: OWL and OBO (format). A second issue is finding a good environment for developing the ontology; this divides between Protege, OBO-Edit and the ever-present “text editor”. It’s often the case that we want to use both of these at the same time. Take, for example, OBI, which I am involved in. While the ontology itself is being developed in OWL, many of its dependent ontologies are built using OBO; being purist and demanding one is really not an option. OWL itself has many different syntaxes; at the moment, I generally prefer Manchester syntax because you can edit it with a text editor, which is really not so easy with any of the XML representations.
While these two languages have somewhat different expressivity, how to translate both the syntax and the semantics between them has been described elsewhere a number of times. One of the recurrent problems, however, stems from the best practices and the syntax of identifiers.
OBO makes use of a numerical, semantics-free identifier and a namespace, with a syntax of NAMESPACE:IDENTIFIER. So, a Gene Ontology term looks like GO:0003674. The namespace is not constrained to be two letters and has mechanisms for world-uniqueness, in that people talk to each other and sort it out if they clash. The use of a semantics-free identifier means that term names can be changed while maintaining the meaning implied by the term; the label for the term, meanwhile, provides a human-readable version, which can be shown to users of the ontology. I will call these the OBO identifier and OBO label respectively.
Translating this into OWL, including Manchester syntax, however, causes significant problems. The natural translation is to turn the OBO identifier into the identifier in OWL; the OBO namespace would become an XML namespace, the OBO identifier would become an XML identifier. Unfortunately, this doesn’t work. First, the OBO identifier is genuinely just a short string and XML requires a URI; so a mapping between OBO identifiers and URIs is necessary. Second, the OBO identifier is numerical; unfortunately, while the identifiers in OWL can contain numbers, they have to start with a non-numerical character. The standard translation, therefore, uses in most cases an OBO-wide URL (http://purl.obolibrary.org/obo/), although some ontologies have their own namespace (GO uses http://purl.org/obo/owl/GO#). The OBO identifier is mapped to a valid identifier by sticking a prefix onto the numbers. So, we have identifiers such as GO:GO_0042101 or obo:OBI_1110045. There are also some OBO ontologies for which this does NOT occur; for instance, BFO classes in OBI come out with identifiers of the form snap:Continuant or span:Process, except for one which is bfo:Entity.
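The standard translation just described can be sketched in a few lines of Python. This is only a rough illustration, not the actual converter: the function name obo_to_owl is made up for this post, it assumes the OBO-wide base URL quoted above, and it ignores the exceptions (GO's own namespace, the BFO names).

```python
from typing import Tuple

# OBO-wide base URL quoted in the text; ontologies with their own namespace are ignored here.
OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_to_owl(obo_id: str, base: str = OBO_BASE) -> Tuple[str, str]:
    """Return (local name, full URI) for an OBO identifier such as 'OBI:1110045'.

    Hypothetical helper for illustration only.
    """
    namespace, sep, local = obo_id.partition(":")
    if not sep or not namespace or not local:
        raise ValueError(f"not a NAMESPACE:IDENTIFIER string: {obo_id!r}")
    # Prefixing the number with the namespace means the name no longer starts with a digit.
    local_name = f"{namespace}_{local}"
    return local_name, base + local_name


local_name, uri = obo_to_owl("OBI:1110045")
print("obo:" + local_name)  # obo:OBI_1110045
print(uri)                  # http://purl.obolibrary.org/obo/OBI_1110045
```

The mapping is mechanical in both directions, which is exactly why the resulting names carry no more meaning for a human than the OBO identifiers did.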
Again, all perfectly reasonable, but unfortunately, when converted to Manchester syntax it means that we end up with classes that look like this slightly elided class from OBI:
Class: obo:OBI_1110161
    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,
    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
            and (obo:IAO_0000136 some obo:OBI_1110196))
which completely defeats the aim of a human-readable syntax. Now OBO format has much the same problem; relationships to other classes are specified using cross-references to their identifiers which are, essentially, unreadable. OBO format works around this with a denormalisation, as can be seen from this somewhat elided example from IAO:
[Term]
id: IAO:0000027
name: data item
def: "a data item is an information content entity that is intended...."
is_a: IAO:0000030 ! information content entity
The cross reference in this case is a subsumption link to IAO:0000030; the text after the ! is a comment that repeats the label of that term for the human reader.
One solution would be to use the rdfs:label in place of the identifier. So, we would have something that looked like this:
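To make the denormalisation concrete, here is a small Python sketch that pulls an is_a line apart into the identifier a parser cares about and the trailing ! comment that only humans read. The function name parse_is_a is my own invention, and the splitting is deliberately naive (it does not deal with escaped characters or trailing modifiers).

```python
from typing import Optional, Tuple


def parse_is_a(line: str) -> Tuple[str, Optional[str]]:
    """Split an OBO 'is_a' line into (target identifier, trailing comment or None).

    Hypothetical helper; a simplified reading of the format, for illustration only.
    """
    tag, sep, value = line.partition(":")
    if not sep or tag.strip() != "is_a":
        raise ValueError(f"not an is_a line: {line!r}")
    # Everything after '!' is a comment to the end of the line.
    target, bang, comment = value.partition("!")
    return target.strip(), (comment.strip() if bang else None)


print(parse_is_a("is_a: IAO:0000030 ! information content entity"))
# ('IAO:0000030', 'information content entity')
```

Note that nothing forces the comment to agree with the real label of IAO:0000030; the format simply ignores it.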
Class: "T cell epitope ELISA IL-1b assay" @en
    Annotations:
        obo:identifier "1110161"
    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
            and (obo:IAO_0000136 some obo:OBI_1110196))
The other identifiers would have to be changed in the same way. I’ve also added the obo:identifier line (which I think would be valid, but might require the creation of an OWL individual). Without this, it would not be possible to go backward.
However, this is problematic as it changes the serialisation between the OWL Manchester syntax and other syntaxes of OWL. The class identifier has to be URI legal, and the OBO label here is not. We could do a syntactic conversion (e.g. T%20cell%20epitope) but this, again, reduces readability, defeating the point. Also, the rdfs:label would become part of the final identifier URI, which then becomes a semantics-heavy identifier. Finally, it would require an OBO-specific loading of the Manchester syntax, taking the URI identifier from the annotation block, and the rdfs:label from the class name.
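For anyone who wants to see what the syntactic conversion does to a whole label, this short Python snippet percent-encodes the OBI label used above with the standard library. Nothing here is specific to OWL or OBO; it is just generic URI escaping.

```python
from urllib.parse import quote, unquote

label = "T cell epitope ELISA IL-1b assay"

# Percent-encode everything that is not legal in a URI path segment (here, the spaces).
encoded = quote(label)
print(encoded)  # T%20cell%20epitope%20ELISA%20IL-1b%20assay

# The conversion is reversible, so no information is lost; only readability is.
assert unquote(encoded) == label
```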
So, is there any solution? First, there are tooling solutions. In Protege, it is already possible to use any component of the definition in the display. So, you can set the rdfs:label as the main display form. Tooling solutions are attractive, but there is a problem; you have to extend all tools to support this view. I realise that the number of freaks who wish to edit OWL with emacs is not that large, so this might not seem an issue. However, many people wish to develop ontologies collaboratively using version control; if you want to compare versions you use diff, so we now need a Manchester syntax diff viewer. Also, if you want to do some perl hacking, or straightforward search and replace, again, it’s all harder.
To some extent this might seem trivial, but then the entire purpose of Manchester syntax (and the functional syntax) is to have a syntax that is easy to read and manipulate, which the XML version of OWL is not. This purpose is defeated if it’s hard to read.
So, a second, non-tooling solution. The obvious answer is to take the OBO approach and add comments. Now, the Manchester syntax includes a comment character (#), although last time I tried, the Protege parser didn’t implement this. Nonetheless, it allows this:
Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en
    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,
    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
            and (obo:IAO_0000136 some obo:OBI_1110196))
This is not too bad, but it doesn’t work well for complex class expressions. I can’t be bothered to look up the labels and have reused one, but you get something like:
Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en,
    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,
    SubClassOf:
        obo:OBI_0000661, #"T cell epitope ELISA IL-1b assay"@en
        obo:OBI_0000299 #"T cell epitope ELISA IL-1b assay"@en
        some (obo:IAO_0000109 #"T cell epitope ELISA IL-1b assay"@en
            and (obo:IAO_0000136 #"T cell epitope ELISA IL-1b assay"@en
                some obo:OBI_11101 #"T cell epitope ELISA IL-1b assay"@en
        ))
This has three problems. Firstly, we have used comments “meaningfully”, yet we can’t distinguish between these comments and other, normal comments. Secondly, we have had to reformat the output because we have only a “to-end-of-line” comment character. Thirdly, it looks horrible.
So, my minimal solution would be this: we introduce some new comment characters, which are treated as comments normally, but which carry enough semantics to allow a warning when they are wrong; rather like Javadoc, which is a comment wrt the language, but is structured and meaningful wrt the documentation. Tooling could be used to check that the labels masquerading as comments are correct wrt the identifiers.
Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay],
    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,
    SubClassOf:
        obo:OBI_0000661 [blah],
        obo:OBI_0000299 [longer blah]
        some (obo:IAO_0000109 [more]
            and (obo:IAO_0000136 [stuff]
                some obo:OBI_11101 [OBI Thing]
        ))
This is still not ideal; it would require an extension to Manchester syntax, but it’s minimal, and it does support the semantics-free identifiers of OBO in a way which does not require extensive tooling. It’s worth reiterating here that OBO’s semantics-free identifiers are a good thing; so, supporting them supports other people who may wish to do the same, sensible thing. It does have the disadvantage of duplicating information, but at least in a way that is checkable.
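To show that "checkable" need not mean much tooling, here is a rough Python sketch of such a checker. Everything in it is an assumption on my part: the bracket notation is only the proposal above and not part of Manchester syntax, the function name check_bracket_labels is invented, the regular expression is a crude approximation of a prefixed name, and the labels are supplied as a plain dictionary rather than read from the ontology.

```python
import re
from typing import Dict, List, Optional, Tuple

# A prefixed name followed by a bracketed label, e.g. "obo:OBI_0000661 [blah]".
# Crude approximation of the proposed notation, not a real Manchester syntax parser.
BRACKETED = re.compile(r"(?P<id>[A-Za-z][\w.-]*:\w+)\s*\[(?P<label>[^\]]*)\]")


def check_bracket_labels(
    text: str, labels: Dict[str, str]
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (identifier, label found, label expected) for every mismatch.

    Identifiers missing from `labels` are reported with an expected label of None.
    """
    problems = []
    for match in BRACKETED.finditer(text):
        ident = match.group("id")
        found = match.group("label").strip()
        expected = labels.get(ident)
        if expected != found:
            problems.append((ident, found, expected))
    return problems


# Hypothetical label table; "blah" stands in for a real label, as in the example above.
known = {
    "obo:OBI_1110161": "T cell epitope ELISA IL-1b assay",
    "obo:OBI_0000661": "blah",
}
snippet = """Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay]
SubClassOf:
    obo:OBI_0000661 [wrong label]
"""
print(check_bracket_labels(snippet, known))
# [('obo:OBI_0000661', 'wrong label', 'blah')]
```

A tool like this only warns; the bracketed text never changes what the ontology means, which is the whole point of keeping it a comment.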
Comments welcome!