Welcome to PyTripalSerializer’s documentation!

PyTripalSerializer

Documentation Status build & test

Serialize Tripal’s JSON-LD API into RDF format

This package implements a recursive algorithm that parses the JSON-LD API of a Tripal genomic database web service and serializes the encountered terms into an RDF document. The output is saved as a Turtle file (.ttl).

Motivation

This work is a byproduct of a data integration project for multiomics data at MPI for Evolutionary Biology. Among various other data sources, we run an instance of the Tripal genomic database website engine. This service provides a JSON-LD API, i.e., all data in the underlying relational database is accessible through appropriate HTTP GET requests against that API. So far so good. Now, in our project, we are working on integrating data based on Linked Data technology; in particular, all data sources should be accessible via (federated) SPARQL queries. Hence, the task is to convert the JSON-LD API into a SPARQL endpoint.

The challenge here is that the JSON-LD API only provides one document at a time. Querying a single document with e.g. the arq utility (part of the Apache-Jena package) is no problem. The problem starts when one then attempts to run queries against other JSON-LD documents that are referenced in the first document as object URIs. These object URIs are not part of the current document (graph). Instead, they point to separate graphs. SPARQL in its current implementation does not support dynamic generation of graph URIs from e.g. object URIs. Hence the need for code that recursively parses a JSON-LD document including all referenced documents.

Of course, this is a generic problem. This package implements a solution targeted at Tripal JSON-LD APIs, but with minimal changes it should be adaptable to other JSON-LD APIs.

Installation

PyPI Releases

This package is released via the Python Package Index (PyPI). To install it, run

$ pip install pytripalserializer

GitHub development snapshot

To install the latest development snapshot from GitHub, clone this repository:

git clone https://github.com/mpievolbio-scicomp/PyTripalSerializer

Navigate into the cloned directory and run a local pip install:

cd PyTripalSerializer
pip install [-e] .

The optional flag -e instructs pip to install the package in editable (development) mode, so that changes to the source files take effect without reinstalling. This is recommended for developers.

Usage

The simplest way to use the package is via the command line interface. The following example should be instructive enough to get started:

$ cd PyTripalSerializer
$ cd src
$ ./tripser http://pflu.evolbio.mpg.de/web-services/content/v0.1/CDS/11846 -o cds11846.ttl

Running this command should produce the RDF turtle file “cds11846.ttl” in the src/ directory. “cds11846” has only 42 triples.

Be aware that running the command on a top-level URL such as http://pflu.evolbio.mpg.de/web-services/content/v0.1/ would parse the entire tree of documents, which results in a graph of ~2 million triples and takes roughly 14 hours to complete on a reasonably well-equipped workstation with 48 CPUs.

Testing

Run the test suite with

pytest tests

Documentation

Click the documentation badge at the top of this README to access the online manual.

Reference Manual

module:

tripser - main module.

class tripser.RecursiveJSONLDParser(entry_point=None, graph=None, serialize_nodes=False)

This class implements recursive parsing of JSON-LD documents.

property graph

Access the graph of the parser.

parse_page(page)

This method attempts to download the JSON-LD document from the passed page (URL) and passes it on to Graph.parse(). It then calls recursively_add on the resulting graph for each member's URI.

The constructed Graph instance is returned.

Parameters:

page (str) – URL of the json-ld document

Returns:

A Graph instance constructed from the downloaded json document.

Return type:

Graph

recursively_add(g, ref)

Parse the document in ref into the graph g. Then call this function on all ‘member’ objects of the subgraph with the same graph g.

Parameters:
  • g (rdflib.Graph) – The graph into which all terms are to be inserted.

  • ref (URIRef | str) – The URL of the document to (recursively) parse into the graph
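The recursion described for recursively_add can be illustrated with a minimal sketch that uses only the Python standard library. It is not the package's implementation: the in-memory DOCS table, the fetch helper and the function name recursively_collect are hypothetical stand-ins for the HTTP requests and the rdflib.Graph that tripser works with, and triples are modelled as plain tuples. The visited set is an added assumption that keeps the sketch from looping on cyclic references.

```python
# Hypothetical in-memory stand-in for a JSON-LD web service: URL -> document.
DOCS = {
    "http://example.org/api/": {
        "@id": "http://example.org/api/",
        "member": ["http://example.org/api/gene/1", "http://example.org/api/gene/2"],
    },
    "http://example.org/api/gene/1": {
        "@id": "http://example.org/api/gene/1",
        "name": "geneA",
        "member": ["http://example.org/api/gene/2"],
    },
    "http://example.org/api/gene/2": {
        "@id": "http://example.org/api/gene/2",
        "name": "geneB",
        "member": [],
    },
}


def fetch(url):
    """Stand-in for an HTTP GET; returns the document or None if unknown."""
    return DOCS.get(url)


def recursively_collect(triples, url, visited=None):
    """Add the triples of `url` and of every document reachable via 'member'."""
    if visited is None:
        visited = set()
    if url in visited:  # already parsed, do not descend again
        return triples
    visited.add(url)

    doc = fetch(url)
    if doc is None:
        return triples

    subject = doc["@id"]
    for key, value in doc.items():
        if key == "@id":
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.add((subject, key, v))

    # Recurse into all referenced member documents, reusing the same set.
    for member in doc.get("member", []):
        recursively_collect(triples, member, visited)
    return triples


graph = recursively_collect(set(), "http://example.org/api/")
print(len(graph))  # 5
```

As in recursively_add, the same container is passed down through every call, so all referenced documents end up in a single graph.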

recursively_add_serial(g, ref)

Parse the document in ref into the graph g. Then call this function on all ‘member’ objects of the subgraph with the same graph g. Serial implementation.

Parameters:
  • g (rdflib.Graph) – The graph into which all terms are to be inserted.

  • ref (URIRef | str) – The URL of the document to (recursively) parse into the graph
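This reference does not state how the non-serial variant distributes its work. Purely as an illustration of the difference between a serial and a concurrent traversal, the following sketch fetches all documents of one level at the same time with a thread pool. Everything in it (DOCS, fetch, collect_parallel, the max_workers default) is hypothetical and independent of the tripser code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory documents keyed by URL (no network access needed).
DOCS = {
    "ex:root": {"member": ["ex:a", "ex:b"]},
    "ex:a": {"name": "A", "member": ["ex:b"]},
    "ex:b": {"name": "B"},
}


def fetch(url):
    """Stand-in for an HTTP GET; unknown URLs yield an empty document."""
    return DOCS.get(url, {})


def collect_parallel(start, max_workers=4):
    """Level-by-level traversal; each level is fetched concurrently."""
    visited = {start}
    frontier = [start]
    triples = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Fetch every document of the current level in parallel.
            docs = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, doc in zip(frontier, docs):
                for key, value in doc.items():
                    values = value if isinstance(value, list) else [value]
                    for v in values:
                        triples.add((url, key, v))
                for member in doc.get("member", []):
                    if member not in visited:
                        visited.add(member)
                        next_frontier.append(member)
            frontier = next_frontier
    return triples


result = collect_parallel("ex:root")
print(len(result))  # 5
```

Triples are merged only in the main thread, so the sketch needs no locking; a serial traversal would produce the same set, just one request at a time.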

property serialize_nodes

Get the ‘serialize_nodes’ flag.

tripser.cleanup(grph)

Remove:

  • All subjects of type <http://pflu.evolbio.mpg.de/web-services/content/v0.1/PartialCollectionView>

  • All objects with property <hydra:PartialCollectionView>

Parameters:

grph – The graph to clean up
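The effect of such a cleanup can be sketched on a graph modelled as a plain set of tuples. This is an illustration under assumptions, not the body of tripser.cleanup: the name cleanup_sketch and the sample data are hypothetical, and "objects with property <hydra:PartialCollectionView>" is interpreted here as "triples whose predicate is hydra:PartialCollectionView".

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PCV_TYPE = "http://pflu.evolbio.mpg.de/web-services/content/v0.1/PartialCollectionView"
HYDRA_PCV = "http://www.w3.org/ns/hydra/core#PartialCollectionView"


def cleanup_sketch(triples):
    """Return a copy of `triples` without the paging-related statements."""
    # Subjects that are declared to be of the PartialCollectionView type.
    paging_subjects = {s for (s, p, o) in triples if p == RDF_TYPE and o == PCV_TYPE}
    return {
        (s, p, o)
        for (s, p, o) in triples
        # Drop everything said about those subjects, and (by assumption)
        # every triple that uses hydra:PartialCollectionView as predicate.
        if s not in paging_subjects and p != HYDRA_PCV
    }


# Hypothetical sample data.
triples = {
    ("ex:page1", RDF_TYPE, PCV_TYPE),
    ("ex:page1", "hydra:next", "ex:page2"),
    ("ex:collection", HYDRA_PCV, "ex:page1"),
    ("ex:gene1", "ex:name", "geneA"),
}
cleaned = cleanup_sketch(triples)
print(cleaned)  # {('ex:gene1', 'ex:name', 'geneA')}
```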

tripser.get_graph(page, serialize=False)

Workhorse function that downloads the JSON-LD document and parses it into the graph to be returned.

Parameters:

page (str | URIRef) – The URL of the json-ld document to download and parse.

Returns:

A graph containing all terms found in the downloaded json-ld document.

Return type:

rdflib.Graph

tripser.remove_terms(grph, terms)

Remove terms matching the passed pattern terms from grph.

Parameters:
  • grph (rdflib.Graph) – The graph from which to remove terms.

  • terms (3-tuple (subject, predicate, object)) – Triple pattern to match against.
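Matching a triple pattern can be sketched as follows, again on a set of plain tuples. The convention that None acts as a wildcard is an assumption borrowed from rdflib, whose Graph.remove() accepts patterns of that form; whether tripser.remove_terms delegates to it is not stated here. The name remove_matching and the sample data are hypothetical.

```python
def remove_matching(triples, pattern):
    """Remove, in place, every triple matching `pattern`; None matches anything.

    Returns the set of removed triples.
    """
    def matches(triple):
        return all(want is None or want == got for want, got in zip(pattern, triple))

    doomed = {t for t in triples if matches(t)}
    triples.difference_update(doomed)
    return doomed


# Hypothetical sample data.
graph = {
    ("ex:gene1", "ex:name", "geneA"),
    ("ex:gene1", "ex:label", "first gene"),
    ("ex:gene2", "ex:name", "geneB"),
}

# Remove every 'ex:name' statement, whatever its subject and object.
removed = remove_matching(graph, (None, "ex:name", None))
print(len(removed), len(graph))  # 2 1
```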
