The research leading to these results has received
support from the Innovative Medicines Initiative
(IMI) Joint Undertaking under
grant agreement n° 115191, resources of which are composed of financial
contribution from the European Union's Seventh Framework Programme
(FP7/2007-2013) and EFPIA companies’ in kind contribution.
These RDF-guidelines are intended for data providers who want to expose their data as RDF to the Open PHACTS platform.
The HTML source code and full revision history of this document can be found in this git repository.
There are many sides to making data semantic. This guidelines document restricts itself to using RDF and will not go into ontological discussions, such as when to use a class or an instance. The document limits itself to giving pointers and some rules of thumb; the reader is warmly invited to consult the further reading listed below.
The most important message when using RDF is not to find the best possible representation for your data, but to be explicit in how you represent your data.
Open PHACTS requires:
Open PHACTS does not specify requirements or guidelines around:
Before you start thinking about converting something into RDF, the first two questions you should ask yourself are:
Because this information is also important to everyone who will want to use your data, you must provide these crucial pieces of information as metadata along with the shared data. This step does not imply that the data must be Open, but it does simplify a lot of things when it is. The least you must do is provide clarity as to whether the data is Open or not.
VoID should be used to encode license information for data sets to be used in Open PHACTS [[OPSDD]].
The Dublin Core ontology [[Nilsson2008]] is a common alternative, and the following example is purely
illustrative of how license information can be encoded in RDF:
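A minimal, purely illustrative sketch (the ex: namespace and dataset IRI are placeholders) attaches a dcterms:license to a VoID dataset description:

@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/> .

ex:myDataset
  a void:Dataset ;
  dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> .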
When creating triples from your data, it is important to think about the data in terms of scientific concepts and their relations, not in terms of database terminology. The triples must in no way reflect concepts like database tables or other details that originate from the format in which the data was previously stored.
The following code example therefore shows bad practice. This generated example RDF describes a compound database, listing molecules, their synonyms, molecular weights, and SMILES representations. The RDF output reflects the original data structure and adds little useful meaning (i.e. semantics) to the data.
Input tables with header on the first line, for names, smiles, and properties:
ID:MW
1:180.1578
2:151.1629
ID:Synonym
1:aspirin
1:acetyl salicylic acid
2:paracetamol
ID:SMILES
1:O=C(O)c1ccccc1(OC(=O)C)
1:CC(=O)Oc1ccccc1C(O)=O
2:O=C(Nc1ccc(O)cc1)C
Output:
@prefix any23: <http://any23.org/tmp/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix props: <http://any23.org/tmp/propsTable/> .
@prefix names: <http://any23.org/tmp/synonymsTable/> .
@prefix smiles: <http://any23.org/tmp/smilesTable/> .
props:row1
any23:molid "1" ;
any23:mw "180.1578" .
props:row2
any23:molid "2" ;
any23:mw "151.1629" .
names:row1
any23:molid "1" ;
rdfs:label "aspirin" .
names:row2
any23:molid "1" ;
rdfs:label "acetyl salicylic acid" .
names:row3
any23:molid "2" ;
rdfs:label "paracetamol" .
smiles:row1
any23:molid "1" ;
any23:smiles "O=C(O)c1ccccc1(OC(=O)C)" .
smiles:row2
any23:molid "1" ;
any23:smiles "CC(=O)Oc1ccccc1C(O)=O" .
smiles:row3
any23:molid "2" ;
any23:smiles "O=C(Nc1ccc(O)cc1)C" .
Importantly, the notion of columns and rows in the RDF must be removed. Better would be:
@prefix any23: <http://any23.org/tmp/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix compound: <http://any23.org/tmp/compound/> .
compound:molid1
rdfs:label "aspirin" ;
rdfs:label "acetyl salicylic acid" ;
any23:smiles "O=C(O)c1ccccc1(OC(=O)C)" ;
any23:smiles "CC(=O)Oc1ccccc1C(O)=O" ;
any23:mw "180.1578" .
compound:molid2
rdfs:label "paracetamol" ;
any23:smiles "O=C(Nc1ccc(O)cc1)C" ;
any23:mw "151.1629" .
Note that identifiers used in relational databases typically find a role in the URIs of the resources, as was done in this example too.
Another important difference is that tables require an external schema to provide meaning, whereas RDF is much more self-explanatory. Therefore, one must think not only about the structure, but also about data types. With this approach, we can further improve the semantic equivalent of the three tables:
@prefix any23: <http://any23.org/tmp/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix compound: <http://any23.org/tmp/compound/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
compound:molid1
rdfs:label "aspirin"@en ;
rdfs:label "acetyl salicylic acid"@en ;
any23:smiles "O=C(O)c1ccccc1(OC(=O)C)" ;
any23:smiles "CC(=O)Oc1ccccc1C(O)=O" ;
any23:mw "180.1578"^^xsd:float .
compound:molid2
rdfs:label "paracetamol"@en ;
any23:smiles "O=C(Nc1ccc(O)cc1)C" ;
any23:mw "151.1629"^^xsd:float .
Be aware of making assumptions that are not true. For example, the enriched example above
assumes that all labels are in English, which is not generally true for compound
databases.
The first step in creating your RDF is to create a list of all concepts that are found in your data. Are there proteins, metabolites, cell cultures, organisms, targets, assays? At what level are those concepts represented in your data? Are they protein names without exactly known point mutations? Are they accurate masses resulting from metabolomics experiments? Are the exact metabolite structures known? Are multiple identifiers known? Does your data contain references with a PubMed identifier or a DOI? In this step you define the types, rather than the entities: you observe that they are proteins, but do not enumerate each of them.
The purpose here is not to develop an ontology, but to get clear what the content of your data is, allowing you to identify existing ontologies (see Step 4) that capture that information. To support this process, for each concept found in your data a human-readable label and a short definition must be provided, both in English. Here too, the underlying rule is that everything must have an explicit and well-defined meaning. This list may be kept in a Word document or an Excel spreadsheet, but also in RDF itself, for example using SKOS [[SKOS]]. The format, however, must be chosen such that it improves the thinking about the concepts.
For example:
| "Activity" | Biochemical property that chemical entities exhibit in some experiment. |
Once you know what concepts are found in your data, it is time to identify how those concepts are linked in your data set. These relations must be identified and listed too, in the same manner as in Step 2. These relations preferably have a verb form, making them easier to understand. For example, a predicate label "has name" is preferred over just "name".
For each relation, the list should provide a human-readable label and a short definition, again in English. Again, the method of recording the list of relations and properties must be chosen such that it improves the thinking about the information in the data.
For example:
| "has sequence" | Protein property linking an amino acid sequence to a protein. |
| "binds to" | Relation between a drug and a drug target reflecting a chemical interaction. |
As in Step 2, you focus on the types only, not on actual binding interactions, etc.
Because existing software already knows about existing, common ontologies, you should use those ontologies if you care about having an impact. This section lists a number of suggestions below, for the various data types that will be covered in Open PHACTS. You should explore those ontologies and check whether, for each concept and relation, you find matching entries in those existing, common ontologies. If you find that only a small number of items is missing, you should contact the ontology authors and see if the missing terms can be added. Only if that fails should you look for less common ontologies and see if these provide substantially higher coverage. Services that allow you to find such less common ontologies are listed below. You can use the SKOS vocabulary to express relatedness to existing concepts.
As an alternative, rdfs:subClassOf should be discussed here.
It is OK to keep a number of entries not mapped to existing ontologies. In that case the entries have to be made openly available, e.g. at purl.org. This of course also applies when creating a new ontology.
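As an illustration (the ex: terms are placeholders; CHEBI_23367 is ChEBI's class for 'molecular entity'), a local concept could be related to an existing ontology term in either of these ways:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/concepts/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .

# declare the local concept as a subclass of an existing ontology class
ex:Compound rdfs:subClassOf obo:CHEBI_23367 .
# or, more loosely, relate it with a SKOS mapping
ex:Compound skos:closeMatch obo:CHEBI_23367 .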
In all cases, you must never use ontologies that you are not allowed to share along with your data, as that would effectively leave you with triplified data for which your users have no means, in the future, to figure out what is what, and which is therefore "meaningless".
Importantly, the following two resources must be watched with respect to recommended vocabularies: first, Open PHACTS project deliverables, such as D1.6 and D1.7 [[D16D17]]; second, the Open PHACTS project on BioPortal: http://bioportal.bioontology.org/projects/163. These resources take precedence over the suggestions below.
Below is a list of pointers to ontologies and vocabularies related to the scope of Open PHACTS. For each ontology, the prefix or website is also given. Furthermore, a list of search engines is provided where further ontologies and vocabularies can be found.
For document identifiers use (in order, if existing)
This requires that the structures have been deposited with ChemSpider already. If not, then use in descending order of preference:
The use of the CHEMINF ontology is encouraged for these identifiers [[Hastings2011]]. Also, you should register small molecule names with ChemSpider. Documentation on how to deposit individual structures in ChemSpider can be found at https://www.chemspider.com/Help_DepositStructures.aspx. Larger sets of compounds can be deposited as SD files, but if the purpose is to have those exposed via Open PHACTS, the Scientific Advisory Board should be contacted to give permission for that exposure.
The following search engines can be useful to find suitable ontologies.
Schema.org is a very generic ontology. It may not be directly applicable to life sciences data, but it is adopted by major search engines like Google. You may consider using this ontology in addition to more detailed domain ontologies, making your data more easily found. It has types for creative works, non-text objects, events, health and medical types, organizations, persons, places, products, and reviews.
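For example, a data set could additionally be typed with Schema.org along these lines (the ex: IRI and literal values are placeholders):

@prefix schema: <http://schema.org/> .
@prefix ex: <http://example.org/> .

ex:myDataset
  a schema:Dataset ;
  schema:name "Example compound data set" ;
  schema:description "Compounds with synonyms, SMILES strings, and molecular weights." .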
The next step is to explore what related data sets are available as Linked (Open) Data,
and link out to those data sets. For example, if your data contains ChemSpider, ChEBI,
ChEMBL, PubChem, DrugBank, KEGG, UniProt, and PDB identifiers, you can link to the
respective RDF variants of those databases. Various RDF versions of these databases are
around, including Bio2RDF [[Belleau2008]], LODD [[Samwald2011]], and Chem2Bio2RDF [[Chen2010]],
but preferably you link to the original source directly. The figure below (CC-BY-SA, [[Cyganiak2011]])
shows a diagram of the larger network, including Linked Data relevant to Open PHACTS:
Careful consideration must be given here to which relation (predicate) is used. The table below outlines various options, their specific meaning, and how and when each predicate can be used. The Data Set guidelines specify the Open PHACTS standards in detail [[OPSDD]] and are the definitive documentation, but here follows a general outline of predicates demonstrating the different implications they have:
| rdfs:seeAlso | General link indicating that the resource linked to is relevant to the subject. See http://www.w3.org/TR/rdf-schema/. |
| skos:relatedMatch | This link indicates that the linked resources are somewhat related. See http://www.w3.org/2004/02/skos/core.html. |
| skos:closeMatch | This link indicates that the linked resources are the same under some assumptions or for some applications. This link is not transitive. See http://www.w3.org/2004/02/skos/core.html. |
| skos:exactMatch | This link is a sub-property of skos:closeMatch but is stronger: the linked resources can be used interchangeably across a wide range of applications. This link is transitive. See http://www.w3.org/2004/02/skos/core.html. |
| owl:sameAs | Link indicating that both the subject and the object resources are instances, and that they are the same resource. This link is transitive. See http://www.w3.org/TR/owl-ref/. |
| owl:equivalentClass | The same as owl:sameAs, but for OWL classes instead of instances. This link is transitive. See http://www.w3.org/TR/owl-ref/. |
The owl:sameAs and owl:equivalentClass predicates are very powerful and should be used with
care, since all attributes and relations of the two connected entities are merged
together. In Open PHACTS the use of the less restrictive skos:exactMatch is recommended.
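For example, the aspirin record from the earlier compound example could be linked out as follows, using the recommended skos:exactMatch and ChEBI's entry for acetylsalicylic acid as an illustrative target:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix compound: <http://any23.org/tmp/compound/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .

# link the aspirin record to ChEBI's acetylsalicylic acid entry
compound:molid1 skos:exactMatch obo:CHEBI_15365 .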
Now that these first steps have ensured you have IRIs for all resources and predicates, and you know where to place all relations, it is time to create the triples. The tool used is irrelevant to the triple creation process, and it is thus up to the user to pick whatever tool they find most convenient. Triples can be created with dedicated semantic web tools, as listed below, but also using simple regular expressions or scripting tools in any language. Of course, generated triples should be validated, but the tool used to create them is merely a tool; there is nothing semantic about that. The output in which the triples are serialized can be in any of the standardized or proposed RDF serialization formats, such as RDF/XML [[RDFXML]], Notation3 [[Notation3]], Turtle (preferred) [[Turtle]], or plain N-Triples [[NTriples]]. These guidelines neither encourage nor disallow named graphs; the user is free to use them, but they are not required.
Importantly, this process should be well documented. You must keep track of which versions of the input data were used, who created the RDF data, when that was done, and preferably which tools were used. This information should be made available to users along with the data itself. Provenance is really important in the process of creating RDF, and you should track in detail how the transformation was done. However, the exact guidelines for tracking provenance information are still under development, and future Open PHACTS guidelines will document in detail how it is to be captured. The reader is referred to the W3C PROV Model Primer as a reference for now [[Gil2012]].
Update the above paragraph to point to the specifications describing how Open PHACTS decides to expose provenance information.
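Purely as an illustration, and pending those Open PHACTS guidelines, such provenance could be sketched with the W3C PROV ontology (all IRIs and values are placeholders):

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .

ex:compoundsRdf
  a prov:Entity ;
  prov:wasDerivedFrom ex:compoundsTableVersion2 ;    # which version of the input data was used
  prov:wasGeneratedBy ex:csvToRdfConversion ;        # how the RDF was generated
  prov:wasAttributedTo ex:janeDoe .                  # who created the RDF data

ex:csvToRdfConversion
  a prov:Activity ;
  prov:endedAtTime "2013-01-15T12:00:00Z"^^xsd:dateTime .   # when that was done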
Because the Open PHACTS GUI is human language oriented, all entities in the data must be associated with a human-readable label. It is important that for all texts, like labels and definitions, the language they are represented in is explicitly identified. For example (not a full RDF serialization):
ex:methane rdfs:label "methaan"@nl .
Occasionally, there are concepts in your data that do not have labels in the data source, for example the interaction between two proteins or a property of a molecule. Labels like "The interaction between protein A and protein B" and "The molecular weight of molecule A" can be autogenerated. If such a label follows implicitly from the semantic typing of the entities and the relations between those entities, then the label may be omitted. An example is the molecular weight property in the next section.
Relations can be modeled as both a predicate and a concept. In the former case it is commonly the relation type that is represented by the predicate. For example, "binds to" clearly has a label. But if the relationship is a unique one, for example a specific binding affinity with which you want to associate further information, you would commonly model it as a concept instead of a predicate. In such cases you can omit the label, which is favored over labels like "an affinity between target X and compound Y".
Blank nodes must not be used, following for example the Banff Manifesto [[banffmanifesto]]; each concept or thing should have a unique IRI, which may be similar to those of more principal resources. For example, the following CHEMINF example describes a molecule with one of its properties, where the property is an instance itself and has an IRI quite similar to that of the molecule it characterizes:
ex:m1 cheminf:CHEMINF_000200 ex:m1/full_mwt .    # the molecule is linked to its molecular weight attribute
ex:m1/full_mwt a cheminf:CHEMINF_000198 .        # the attribute is typed with a CHEMINF class
ex:m1/full_mwt cheminf:SIO_000300 "341.748" .    # SIO 'has value' links the attribute to its value
At this moment, no further restrictions on the RDF triple structure are made, but it may be useful to read up on considerations about URI patterns, label names, etc. that others have made in the past. For example, the OBO Foundry wrote up a series of principles which provide an interesting read on what the consequences can be of the practices you adopt.
Below is a brief overview of tools that may assist the triple generation.
Description: Sesame is a Java framework for handling RDF data. It includes functionality for parsing, storing, inferencing and querying of RDF data. Development is supported by the Dutch company Aduna.
Homepage: http://www.openrdf.org
Audience: Java programmers
Tutorial: http://www.openrdf.org/doc/sesame2/users/
Description: Jena is another Java framework for building Semantic Web applications, originally developed by HP and now under the Apache umbrella. It provides an environment for handling RDF, RDFS, OWL, and SPARQL.
Homepage: http://jena.sourceforge.net/
Audience: Java programmers
Tutorial: http://www.ibm.com/developerworks/xml/library/j-jena/
Description: Tripliser is a Java library and command-line tool for creating triple graphs from XML. It is particularly suitable for data exhibiting any of the following characteristics: messy - missing data, badly formatted data, changeable structure; bulky - large volumes; and volatile - ongoing changes to data and structure.
Homepage: http://daverog.github.com/tripliser/
Audience: Java programmers
Tutorial: http://daverog.github.com/tripliser/
Description: A Ruby library for working with RDF data.
Homepage: http://rdf.rubyforge.org/
Audience: Ruby programmers
Tutorial: http://rdf.rubyforge.org/
Description: A tool that can convert almost anything to triples, supporting microformats, RDFa, Microdata, RDF/XML, Turtle, N-Triples and N-Quads.
Homepage: http://any23.org
Audience: Everybody
Tutorial: http://any23.org
Description: An Eclipse plugin [[Marx2013]].
Homepage: http://lod2.inf.puc-rio.br/site/download/
Audience: Everybody
While dedicated semantic web tools make it hard to introduce syntactic errors, it is still possible to make mistakes in the resulting RDF, and the generated triples should be validated.
There are various levels at which the data should be validated. First, it should be validated that the created syntax notation is correct, for which various online services are available. Note that some encodings of special characters may pose problems and may have to be converted or replaced. One such validator tool is the W3C RDF Validation Service, at http://www.w3.org/RDF/Validator/.
Second, the output should be checked to ensure that the selected common ontologies are used correctly, for example that predicates expecting literal values are indeed used with literals in the output. An example of common misuse is using the wrong Dublin Core namespace [[Nilsson2008]]: there are two namespaces, both defining a title predicate, but only one of them should be used with literal values.
This also applies to the use of links as outlined in Step 5, where these linking predicates can make claims about the nature of resources. For example, skos:closeMatch implies that the subject and object resources are also SKOS concepts. That should not conflict with other triples.
One aspect here is that the resulting data should be verified for internal consistency. This is particularly important if the common ontologies used define relations (predicates) that specify which types of objects they link (RDF domain and range). Tools like Protégé (http://protege.stanford.edu/plugins/owl/api/) and Pellet (http://clarkparsia.com/pellet/) can be used for that.
Last but not least, the whole transformation should be unit tested. This testing can be done as part of this step, or after later steps. These tests make assertions about the number of resources in the RDF data, testing that they match those in the original data. Additionally, the tests should check that the anticipated RDF structure is accurately reflected in the triple data set. Unit tests can be expressed as SPARQL queries, and a query tool like Rasqal can then be used to see if the expected results are returned.
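For example, the following query, a sketch against the compound example used earlier in this document, counts the compounds that have a molecular weight; the result can then be compared with the number of rows in the original table:

PREFIX any23: <http://any23.org/tmp/>

SELECT (COUNT(DISTINCT ?compound) AS ?n)
WHERE {
  ?compound any23:mw ?mw .
}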
In addition to the tools mentioned above, the Manchester University team has developed a few validation webservices, including validators for RDF documents.
Below is a brief overview of tools that may assist the triple validation.
Description: Webpage that accepts raw triple content and URLs pointing to RDF documents.
Homepage: http://www.w3.org/RDF/Validator/
Description: Command line utility.
Homepage: http://librdf.org/raptor/rapper.html
cat data.ttl | rapper -i turtle -t -q - . > /dev/null
Description: Command line utility.
Homepage: http://librdf.org/rasqal/
roqet --results=ntriples -i sparql -e 'CONSTRUCT WHERE { ?s ?p ?o } LIMIT 5' -D data.ttl
Description: Website that spots common mistakes.
Homepage: http://graphite.ecs.soton.ac.uk/checker/
Description: A general-purpose command line utility for the semantic web.
Homepage: http://www.w3.org/2000/10/swap/doc/cwm.html
There are various ways to make your data available for others to use:
Linked Open Data requires the data to be linkable, and therefore that URIs are dereferenceable (see the Linked Open Data star system). Dereferenceable means that IRIs identifying resources can be resolved via the architecture of the web (domain names and web servers) into triples about those resources. For example, the following resource IRI for methane is dereferenceable:
http://rdf.openmolecules.net/?InChI=1S/CH4/h1H4
However, because data used in Open PHACTS will be loaded into a central cache, all data must be available for bulk download. This means that all triples, including provenance, etc., are archived into a .zip or .tar.bz2 file and shared via an HTTP or FTP server, allowing others to download all triples and use them locally.
Additionally, a third option is highly recommended as a minimal way to make the triples accessible: via a SPARQL end point. Various tools are available for this purpose, including tools mentioned earlier for creating triples, such as Sesame and Jena. These both provide store functionality, including SPARQL functionality, but are primarily APIs, and can wrap around triple stores that scale better, such as Virtuoso and OWLIM. A comparison of some triple stores was done by the FU Berlin and can be found here, but we also note that performance depends strongly on your use case [[Erling2011]]. Information about the capacity of triple stores can be found at w3.org (link). We note that these statistics change every half year, and the reader is strongly encouraged to look up recent numbers.
The list of tools that provide SPARQL end point functionality includes those below. Other overviews exist, like this one on w3.org.
Homepage: http://www.openrdf.org/
Documentation: http://www.openrdf.org/documentation.jsp
Homepage: http://jena.sourceforge.net/
Documentation: http://jena.sourceforge.net/documentation.html
Homepage: http://virtuoso.openlinksw.com/
Documentation: http://docs.openlinksw.com/virtuoso/
Homepage: http://www.ontotext.com/owlim
Documentation: http://owlim.ontotext.com/display/OWLIMv42/Home
Homepage: http://4store.org/
Documentation: http://4store.org/trac/wiki/Documentation
Homepage: http://www.mulgara.org/
Documentation: http://www.mulgara.org/documentation.html
Homepage: http://www.systap.com/bigdata.htm
Documentation: http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
Homepage: https://github.com/semsol/arc2/wiki
Documentation: https://github.com/semsol/arc2/wiki/Getting-started-with-ARC2
Now that you know what data you started with, what the RDF looks like, how you generated it, and how you make it available, you must document this provenance. In Open PHACTS, the Dataset Descriptions for the Open Pharmacological Space specification details how you are expected to do this [[OPSDD]]. The key ontology used by this specification is VoID, the Vocabulary of Interlinked Datasets [[Cyganiak2011b]].
The data you should record includes but is not limited to:
The Manchester University team has developed a validator specifically aimed at VoID provenance information, which verifies compliance with the Open PHACTS data set description specification [[OPSDD]].
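Purely as an illustration (IRIs, dates, and the download location are placeholders; consult [[OPSDD]] for the normative requirements), a minimal VoID description could look like:

@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .

ex:compoundsDataset
  a void:Dataset ;
  dcterms:title "Example compound data set"@en ;
  dcterms:creator ex:janeDoe ;
  dcterms:issued "2013-01-15"^^xsd:date ;
  dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
  void:dataDump <http://example.org/downloads/compounds.ttl.gz> ;
  void:sparqlEndpoint <http://example.org/sparql> .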
The final step in creating RDF is to advertise your RDF so that it gets used and linked to. Various options can be considered, such as announcing the data on mailing lists, on the Open PHACTS website (or in the Open PHACTS weblogs), or via more traditional channels like presenting a poster at a conference.
As with conference posters, advertising RDF comes with certain requirements. Similar to the requirement that conference posters must be of a certain size, advertised RDF data sets must include, for example, license (or waiver) information (see Step 0), which ontologies are used (see Step 4), and their embedding in the Linked Open Data network (see Step 5). The VoID-encoded metadata from Step 11 can be reused.
Additionally, your data set should be registered with the appropriate registries. One of these is the Data Hub, formerly known as CKAN (http://thedatahub.org/).
To make data provided by a SPARQL end point available as linked data, the PHP library Puelia can be used.
It is useful to compare your results at the end with what others have been doing. One option is to look at the Linked Open Data star scheme.
As an additional feature, the steps are complemented with details on how each step addresses the requirements for Linked Open Data outlined by Berners-Lee [[BernersLee2006]] and popularized as the Linked Open Data star scheme by Hausenblas [[Hausenblas2012]], which can be found at http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/. These are provided as further information on the context of those steps, rather than as requirements resulting from these guidelines. In short, the stars have the following meaning according to Hausenblas:
| ★ | make your stuff available on the Web (whatever format) under an open license |
| ★★ | make it available as structured data (e.g., Excel instead of image scan of a table) |
| ★★★ | use non-proprietary formats (e.g., CSV instead of Excel) |
| ★★★★ | use URIs to identify things, so that people can point at your stuff |
| ★★★★★ | link your data to other data to provide context |
To comply with any of the stars, the license is the first step, as it is in these guidelines too (see Step 0). For most readers of this document, the second star comes for free: the data you are converting into RDF is most likely already in a structured format (see Step 1). The third star does not require your data to be RDF either, but does insist on using Open Standards (see Step 6). Therefore, once your data is converted into some RDF, you have reached a three-star state. The fourth star, in fact, requires the use of RDF (see Step 6), but also requires you to provide a linked data version of your RDF data (see Step 8). Linked Data is also about people linking to your data. The interlinking of your RDF data with other linked data is rewarded with a fifth star (see Step 5).