Scheduling Refresh Queries for Keeping Results from a SPARQL Endpoint Up-to-Date

Magnus Knuth, Olaf Hartig, and Harald Sack

This work will be presented as a short paper (preprint) at ODBASE 2016, the 15th International Conference on Ontologies, DataBases, and Applications of Semantics.

An extended version of the ODBASE paper is available at arXiv.org.

Abstract

Many datasets change over time. As a consequence, long-running applications that cache and repeatedly use query results obtained from a SPARQL endpoint may resubmit the queries regularly to keep their results up to date. While this approach may be feasible as long as the number of such regular refresh queries remains manageable, with an increasing number of applications adopting it, the SPARQL endpoint may become overloaded with refresh queries. A more scalable approach is therefore to use a middleware component with which the applications register their queries and from which they are notified with updated query results once the results have changed. This middleware can then schedule the repeated execution of the refresh queries without overloading the endpoint. In this paper, we study the problem of scheduling refresh queries for a large number of registered queries, assuming an overload-avoiding upper bound on the length of a regular time slot available for testing refresh queries. We investigate a variety of scheduling strategies and compare them experimentally in terms of the number of time slots they need before they recognize changes and the number of changes that they miss.
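To make the middleware idea from the abstract concrete, the following is a minimal sketch (not the system described in the paper) of such a component: applications register queries with callbacks, and each scheduling round executes refresh queries only until a fixed, overload-avoiding time budget is exhausted, notifying subscribers whose results have changed. All names (RefreshScheduler, execute_query, slot_budget_seconds) are illustrative assumptions, and the simple round-robin rotation merely stands in for the scheduling strategies the paper actually evaluates.

    import hashlib
    import time
    from collections import deque

    class RefreshScheduler:
        """Illustrative middleware sketch: register queries, refresh them
        within a bounded time slot, and notify on changed results."""

        def __init__(self, execute_query, slot_budget_seconds=30.0):
            self.execute_query = execute_query      # callable: SPARQL string -> serialized result
            self.slot_budget = slot_budget_seconds  # upper bound on the length of a refresh slot
            self.queue = deque()                    # rotation order of registered queries
            self.state = {}                         # query -> (last result hash, callbacks)

        def register(self, query, callback):
            """Register a query; callback(query, result) is invoked when the result changes."""
            if query not in self.state:
                self.state[query] = (None, [])
                self.queue.append(query)
            self.state[query][1].append(callback)

        def run_slot(self):
            """Execute refresh queries until the slot budget is used up."""
            deadline = time.monotonic() + self.slot_budget
            executed = 0
            # Stop once the budget is spent or every registered query was refreshed once.
            while self.queue and time.monotonic() < deadline and executed < len(self.state):
                query = self.queue.popleft()
                self.queue.append(query)            # keep rotating across future slots
                result = self.execute_query(query)
                digest = hashlib.sha256(result.encode("utf-8")).hexdigest()
                old_digest, callbacks = self.state[query]
                if digest != old_digest:
                    self.state[query] = (digest, callbacks)
                    for notify in callbacks:
                        notify(query, result)
                executed += 1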

Experimental Data

We provide the data gathered from the experiments in the form of a full MySQL database dump and an RDF dump containing the query executions as planned by the evaluated strategies.

The database dump includes the plain results of all query executions, while the RDF dataset refers to their hash values. The RDF dataset uses the LSQ vocabulary, which we extended to describe relevant metadata, such as the delay and the missed updates of individual query executions.
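As a hedged illustration of how a query result can be reduced to such a hash value, the sketch below assumes SPARQL JSON results and sorts the bindings so that row order does not affect the digest. The exact hashing scheme used to produce the values in the RDF dataset is not specified here; result_hash is a hypothetical helper, not part of the published data or code.

    import hashlib
    import json

    def result_hash(sparql_json_result: dict) -> str:
        """Compute an order-independent SHA-256 digest of a SPARQL JSON result."""
        bindings = sparql_json_result.get("results", {}).get("bindings", [])
        # Canonicalize: serialize each row with sorted keys, then sort the rows.
        canonical_rows = sorted(json.dumps(row, sort_keys=True) for row in bindings)
        digest = hashlib.sha256("\n".join(canonical_rows).encode("utf-8"))
        return digest.hexdigest()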