Scheduling Refresh Queries for Keeping Results from a SPARQL Endpoint Up-to-Date

Magnus Knuth, Olaf Hartig, and Harald Sack

This work will be presented as a short paper (preprint) at ODBASE 2016, the 15th International Conference on Ontologies, DataBases, and Applications of Semantics.

An extended version of the ODBASE paper is available at arXiv.org.

Abstract

Many datasets change over time. As a consequence, long-running applications that cache and repeatedly use query results obtained from a SPARQL endpoint may resubmit the queries regularly to keep their results up to date. While this approach may be feasible as long as the number of such regular refresh queries remains manageable, with an increasing number of applications adopting it, the SPARQL endpoint may become overloaded with refresh queries. A more scalable approach is therefore to use a middleware component with which the applications register their queries and from which they are notified with updated query results once the results have changed. This middleware can then schedule the repeated execution of the refresh queries without overloading the endpoint. In this paper, we study the problem of scheduling refresh queries for a large number of registered queries, assuming an overload-avoiding upper bound on the length of a regular time slot available for testing refresh queries. We investigate a variety of scheduling strategies and compare them experimentally in terms of the number of time slots they need before they recognize changes and the number of changes that they miss.
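To make the middleware idea from the abstract concrete, the following is a minimal sketch (not the system described in the paper) of such a component: applications register queries with callbacks, and each scheduling round executes refresh queries only until a fixed, overload-avoiding time budget is exhausted, notifying subscribers whose results have changed. All names (RefreshScheduler, execute_query, slot_budget_seconds) are illustrative assumptions, and the simple round-robin rotation merely stands in for the scheduling strategies the paper actually evaluates.

    import hashlib
    import time
    from collections import deque

    class RefreshScheduler:
        """Illustrative middleware sketch: register queries, refresh them
        within a bounded time slot, and notify on changed results."""

        def __init__(self, execute_query, slot_budget_seconds=30.0):
            self.execute_query = execute_query      # callable: SPARQL string -> serialized result
            self.slot_budget = slot_budget_seconds  # upper bound on the length of a refresh slot
            self.queue = deque()                    # rotation order of registered queries
            self.state = {}                         # query -> (last result hash, callbacks)

        def register(self, query, callback):
            """Register a query; callback(query, result) is invoked when the result changes."""
            if query not in self.state:
                self.state[query] = (None, [])
                self.queue.append(query)
            self.state[query][1].append(callback)

        def run_slot(self):
            """Execute refresh queries until the slot budget is used up."""
            deadline = time.monotonic() + self.slot_budget
            executed = 0
            # Stop once the budget is spent or every registered query was refreshed once.
            while self.queue and time.monotonic() < deadline and executed < len(self.state):
                query = self.queue.popleft()
                self.queue.append(query)            # keep rotating across future slots
                result = self.execute_query(query)
                digest = hashlib.sha256(result.encode("utf-8")).hexdigest()
                old_digest, callbacks = self.state[query]
                if digest != old_digest:
                    self.state[query] = (digest, callbacks)
                    for notify in callbacks:
                        notify(query, result)
                executed += 1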

Experimental Data

We provide the data gathered from the experiments in the form of a full MySQL database dump and an RDF dump containing the query executions as planned by the evaluated strategies.

The database dump includes the plain results of all query executions, while the RDF dataset refers to their hash values. The RDF dataset uses the LSQ vocabulary, which we extended to describe relevant metadata, such as the delay and the missed updates of individual query executions.
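As a hedged illustration of how a query result can be reduced to such a hash value, the sketch below assumes SPARQL JSON results and sorts the bindings so that row order does not affect the digest. The exact hashing scheme used to produce the values in the RDF dataset is not specified here; result_hash is a hypothetical helper, not part of the published data or code.

    import hashlib
    import json

    def result_hash(sparql_json_result: dict) -> str:
        """Compute an order-independent SHA-256 digest of a SPARQL JSON result."""
        bindings = sparql_json_result.get("results", {}).get("bindings", [])
        # Canonicalize: serialize each row with sorted keys, then sort the rows.
        canonical_rows = sorted(json.dumps(row, sort_keys=True) for row in bindings)
        digest = hashlib.sha256("\n".join(canonical_rows).encode("utf-8"))
        return digest.hexdigest()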