Abstract
The decentralization envisioned for the current centralized web requires querying approaches capable of accessing multiple small data sources while complying with legal constraints related to personal data, such as licenses and the GDPR. Link Traversal-based Query Processing (LTQP) is a querying approach designed for highly decentralized environments that can satisfy these legal requirements. An important optimization avenue in LTQP is the order in which links are dereferenced, which involves prioritizing links to query-relevant documents. However, assessing and comparing the algorithmic performance of these prioritization approaches is challenging due to various compounding factors during query execution. Therefore, researchers need an implementation-agnostic and deterministic metric that accurately measures the marginal effectiveness of link prioritization algorithms in LTQP engines. In this paper, we motivate the need for accurately measuring link prioritization performance, define and test such a metric, and outline its challenges and potential extensions. Our findings show that the proposed metric highlights differences in link prioritization performance depending on the fragmentation strategy of the queried data. The metric enables the evaluation of current link prioritization performance and makes it easy to assess the effectiveness of future link prioritization algorithms.
Introduction
Currently, web user data is stored in centralized data silos, which restricts innovation and poses privacy concerns [1]. Decentralization efforts like Solid distribute user data across multiple permissioned personal data stores, requiring a querying approach that can effectively handle decentralized and permissioned data.
Link Traversal-based Query Processing (LTQP) is an integrated querying approach designed to meet these requirements [2]. Starting from seed data sources, the LTQP query engine dynamically discovers new data sources by following hyperlinks found in previously discovered sources.
However, LTQP is currently slower than centralized alternatives. One optimization avenue is link prioritization, where the engine dynamically determines the order in which links should be dereferenced based on their expected relevance to the query [3, 4].
To optimize LTQP prioritization strategies, researchers need insights into the marginal performance of proposed algorithms and the theoretically optimal performance these strategies can achieve.
Previous studies have used metrics such as result arrival times, total execution time, and other arrival-time-based metrics [5] to assess prioritization strategies [3, 4, 6]. These measures fail to capture how close implemented prioritization approaches come to the theoretically optimal strategy. Additionally, these metrics fall short when comparing different engine implementations and environments, because unrelated differences may skew link prioritization performance results:
- Programming Languages
- Different languages can substantially affect engine performance [7], hindering purely time-based comparisons of algorithms.
- Dereferencing Overhead
- Network communication times vary widely [8, 9], introducing significant noise into execution times, even for comparisons of the same engine.
- Orthogonal Optimization Strategies
- Different query optimization strategies [10] might create performance differences independent of the prioritization strategy.
This leads to our research questions: (i) how can we compare the marginal algorithmic performance of prioritization algorithms in an implementation-agnostic manner, and (ii) how can we define an oracle baseline that represents optimal prioritization performance?
In this paper, we introduce the Relevant Retrieval Ratio ($R^3$, pronounced r-cubed). This metric allows for straightforward observation of the algorithmic performance of prioritization approaches, independent of their implementation, and enables comparison against the theoretically optimal prioritization strategy.
Existing Metrics
Existing LTQP and federated querying metrics [11] are insufficient for measuring the marginal algorithmic performance of link prioritization algorithms:
- Query execution time
- Prioritization algorithms are unlikely to improve total query execution time [3, 4]. However, link prioritization can improve the arrival times of the first results [3, 4]. These arrival times are influenced by both prioritization strategies and other, unrelated engine optimizations, obscuring the prioritization algorithm's effect.
- Diefficiency Metrics
- Diefficiency metrics measure the continuous efficiency of query engines by reflecting the density of produced results over a given time interval [5] (see the sketch after this list). This contrasts with first-result arrival times, which measure engine performance only at discrete points. However, the same limitation remains, as these metrics are also based on result arrival times.
- Number of HTTP Requests
- For federated query engines, the number of HTTP requests represents the efficiency of source selection and join planning during query execution [12, 13]. Federated engines often query SPARQL endpoints, allowing them to delegate parts of the query execution to the server [14]. In contrast, in LTQP, the engine uses HTTP requests to download documents, so the number of requests corresponds to the number of dereferenced documents. Thus, the number of HTTP requests measures the engine's ability to prune irrelevant links. However, it does not indicate the effectiveness of prioritization algorithms, which change the order in which links are dereferenced rather than pruning them.
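To make the diefficiency notion above concrete, the following is a minimal sketch in the spirit of dief@t [5], computing the area under the answer-trace curve up to a time cutoff. The function name and the step-function treatment of the answer count are our own simplifications, not the reference implementation.

```python
def dief_at_t(answer_times: list[float], t: float) -> float:
    """Area under the answer-trace curve up to time t, treating the number
    of produced answers as a step function of time. Higher values mean
    results were produced earlier and more densely."""
    area, prev, count = 0.0, 0.0, 0
    for ts in sorted(x for x in answer_times if x <= t):
        area += count * (ts - prev)   # `count` answers held since `prev`
        prev, count = ts, count + 1
    return area + count * (t - prev)  # tail from last answer until cutoff t

# Example: three answers arriving at 1s, 2s, and 4s, evaluated at t = 5s.
print(dief_at_t([1.0, 2.0, 4.0], 5.0))  # 0*1 + 1*1 + 2*2 + 3*1 = 8.0
```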
The metric proposed in this paper complements existing metrics by measuring the marginal performance of link prioritization algorithms rather than the overall performance of the LTQP system.
Relevant Retrieval Ratio
The subweb [15] traversed during LTQP can be represented as a directed graph, where an edge from document A to document B indicates that a URI pointing to document B was first discovered in the data obtained from document A. Fig. 1 shows an example of such a directed graph.
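As an illustration, below is a minimal sketch of how an engine could record such a discovery graph during traversal; the `TraversalGraph` class and its method names are hypothetical, not part of Comunica or any existing engine.

```python
class TraversalGraph:
    """Records the traversed subweb as a directed graph: an edge (a, b) is
    added only when the URI of document b is *first* discovered in the data
    obtained from document a, matching the definition above."""

    def __init__(self) -> None:
        self.edges: list[tuple[str, str]] = []  # first-discovery edges
        self.seen: set[str] = set()             # URIs already discovered
        self.dereferenced: list[str] = []       # engine's dereference order

    def record_discovery(self, source_doc: str, uri: str) -> None:
        # Only the first discovery of a URI creates an edge.
        if uri not in self.seen:
            self.seen.add(uri)
            self.edges.append((source_doc, uri))

    def record_dereference(self, uri: str) -> None:
        # Called whenever the engine actually downloads a document.
        self.dereferenced.append(uri)
```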
During LTQP query execution, there exists an optimal link dereference order that finds all documents needed to answer a query, as shown in the right-most directed graph in Fig. 1. This theoretical optimum can be considered an oracle link prioritization algorithm with perfect information on the location of query-relevant documents. The optimal path is always achievable, as the oracle only uses links between documents that an engine could also use. An engine should strive for a traversal path close to this optimal path. As such, we use the ratio between the optimal traversal path length and the actual traversal path length during query execution to evaluate the performance of link prioritization algorithms. The optimal traversal path is defined as the shortest path needed to dereference all documents required for the complete query result, and it is determined retrospectively.
Given the length of the optimal traversal path and the actual traversal path, we define the metric as

$$R^3 = \frac{l_o}{l_e}$$

where $l_e$ is the length of the engine traversal path and $l_o$ is the length of the optimal path. In this definition, a higher value is better, with $R^3 = 1$ indicating optimal algorithmic link prioritization performance. Note that when $l_e = 0$ or $l_o = 0$, the query has no results and the notion of optimal link prioritization does not exist. When an engine performs poorly, resulting in smaller differences in the metric's values, taking $\log_2(R^3)$ can help visualize relative differences.
Using the $R^3$ metric, we can compare the algorithmic efficiency of the two traversal algorithms in our example. Using breadth-first search, the path to dereference the query-relevant document contains six nodes, while the optimal traversal order contains two. Thus, the metric for breadth-first traversal equals $\frac{2}{6} \approx 0.33$. Depth-first traversal visits only three nodes, yielding an $R^3$ value of $\frac{2}{3} \approx 0.67$. From this analysis, we conclude that depth-first traversal outperforms breadth-first traversal on this subweb.
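This definition translates directly into code; the following sketch (function and variable names are our own) reproduces the worked example above.

```python
import math

def r_cubed(optimal_length: int, engine_length: int) -> float:
    """Relevant Retrieval Ratio: optimal over actual traversal path length."""
    if optimal_length == 0 or engine_length == 0:
        raise ValueError("The query has no results; R^3 is undefined.")
    return optimal_length / engine_length

# Worked example from Fig. 1: the optimal path contains two documents.
r3_bf = r_cubed(2, 6)  # breadth-first dereferences six documents -> ~0.33
r3_df = r_cubed(2, 3)  # depth-first dereferences three documents -> ~0.67
print(math.log2(r3_df) - math.log2(r3_bf))  # delta log2(R^3) = 1.0
```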
Finding the optimal traversal path that dereferences all query-relevant documents is equivalent to solving the directed Steiner tree problem [16] on our directed graph, with the query-relevant documents serving as terminals. As such, we can use existing Steiner tree solvers to efficiently find the optimal traversal path.
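For illustration, the sketch below approximates the rooted directed Steiner tree with a simple nearest-terminal heuristic using networkx. This is our own simplification for exposition, not the solver [18] used in our experiments.

```python
import networkx as nx

def approx_steiner_nodes(g: nx.DiGraph, seed: str, terminals: set[str]) -> set[str]:
    """Nearest-terminal heuristic for the rooted directed Steiner tree:
    repeatedly attach the unreached terminal that is closest (in hops)
    to any node already in the tree. Returns the tree's node set, whose
    size approximates the optimal traversal path length."""
    tree = {seed}
    remaining = set(terminals) - tree
    while remaining:
        best = None
        for terminal in remaining:
            for src in tree:
                try:
                    path = nx.shortest_path(g, src, terminal)
                except nx.NetworkXNoPath:
                    continue
                if best is None or len(path) < len(best):
                    best = path
        if best is None:
            raise ValueError("A query-relevant document is unreachable.")
        tree.update(best)  # add the connecting path to the tree
        remaining -= tree  # a path may pass through other terminals
    return tree

# Toy subweb: the optimal traversal reaches the relevant document via 3 nodes.
g = nx.DiGraph([("seed", "a"), ("a", "b"), ("seed", "c"), ("c", "relevant")])
print(len(approx_steiner_nodes(g, "seed", {"relevant"})))  # 3
```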
Results
Our experiment uses the Discover queries from the SolidBench benchmark [3]. To track the information required for computing the metric during query execution, we use the Comunica query engine [17]. To compute the optimal path, we use a heuristic Steiner tree solver [18] to speed up computations. Although an exact solver would be ideal, it significantly increases computational complexity for large subwebs [19, 20].
Our experiments compare depth-first and breadth-first prioritization [4] using $R^3$ and the time until the final query result. We aim to validate the substantiated assumption that Discover queries do not benefit from improved link prioritization [21].
Table 1 shows the computed metric per template instantiation, together with $\Delta \log_2(R^3)$, the difference in the $\log_2$ of the metric values. We find that either method can outperform the other depending on the template used. Furthermore, there is a significant difference in performance across template instantiations, likely due to the different fragmentation strategies used in SolidBench [3]: some pods store query-relevant data in a single file, while others fragment it into directories, complicating optimal traversal.
These differences in prioritization performance are not reflected in the time until the last result; the Pearson correlation coefficients between $R^3$ and the time until the last result are -0.042 and 0.201 for depth-first and breadth-first, respectively. These low correlations suggest a weak link between $R^3$ and query execution time, possibly caused by noise in the execution times.
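Such a correlation is straightforward to reproduce given per-query measurements. The sketch below uses a handful of depth-first $R^3$ values from Table 1 together with placeholder execution times, since the measured times are not listed in this paper.

```python
import numpy as np

# R^3 values for depth-first prioritization, taken from Table 1 (D1.0-D1.4).
r3_values = np.array([0.23, 0.50, 0.50, 0.75, 0.18])
# Placeholder times until the last result, in seconds (illustrative only).
last_result_times = np.array([4.2, 1.8, 2.1, 1.5, 5.0])

# Pearson correlation coefficient between the metric and execution time.
print(np.corrcoef(r3_values, last_result_times)[0, 1])
```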
Using $R^3$, we confirm the findings of our previous paper [21] by computing a single metric value instead of performing an engine-specific, in-depth analysis of the query execution progress.
| Query | DF | BF | $\Delta \log_2(R^3)$ | Query | DF | BF | $\Delta \log_2(R^3)$ |
|-------|------|------|--------|-------|------|------|--------|
| D1.0 | 0.23 | 0.08 | 1.617 | D5.0 | 0.13 | 0.15 | -0.134 |
| D1.1 | 0.50 | 0.50 | 0.000 | D5.1 | 0.50 | 0.50 | 0.000 |
| D1.2 | 0.50 | 0.50 | 0.000 | D5.2 | 0.33 | 0.67 | -1.000 |
| D1.3 | 0.75 | 0.75 | 0.000 | D5.3 | 0.04 | 0.03 | 0.250 |
| D1.4 | 0.18 | 0.18 | 0.000 | D5.4 | 0.18 | 0.16 | 0.196 |
| D2.0 | 0.50 | 0.51 | -0.031 | D6.0 | 0.53 | 0.53 | 0.000 |
| D2.1 | 0.50 | 0.67 | -0.415 | D6.1 | 0.40 | 0.40 | 0.000 |
| D2.2 | 0.50 | 0.67 | -0.415 | D6.2 | 0.40 | 0.40 | 0.000 |
| D2.3 | 0.90 | 0.90 | 0.000 | D6.3 | 0.13 | 0.13 | 0.000 |
| D2.4 | 0.44 | 0.56 | -0.349 | D6.4 | 0.73 | 0.46 | 0.690 |
| D3.0 | 0.50 | 0.48 | 0.074 | D7.0 | 0.05 | 0.05 | 0.000 |
| D3.1 | 0.91 | 0.95 | -0.058 | D7.1 | 0.40 | 0.40 | 0.000 |
| D3.2 | 0.93 | 0.93 | -0.000 | D7.2 | 0.40 | 0.40 | 0.000 |
| D3.3 | 0.62 | 0.62 | 0.000 | D7.3 | 0.15 | 0.15 | 0.000 |
| D3.4 | 0.61 | 0.61 | 0.000 | D7.4 | 0.06 | 0.03 | 0.955 |
| D4.0 | 0.09 | 0.06 | 0.520 | D8.0 | 0.45 | 0.64 | -0.500 |
| D4.1 | 0.44 | 0.44 | 0.000 | D8.1 | - | - | - |
| D4.2 | 0.50 | 0.44 | 0.170 | D8.2 | 0.47 | 1.00 | -1.100 |
| D4.3 | 0.03 | 0.03 | 0.000 | D8.3 | - | - | - |
| D4.4 | 0.11 | 0.11 | 0.000 | D8.4 | 0.36 | 0.43 | -0.263 |
Table 1: The $R^3$ values for depth-first (DF) and breadth-first (BF) prioritization, separated by query template and instantiation; D1.2 denotes template one, instantiation two. Dashes mark queries without results, for which the metric is undefined.
Conclusion
The current definition and implementation of the Relevant Retrieval Ratio ($R^3$) have several limitations. First, due to the computational complexity of the Steiner tree problem and the lack of readily available exact solvers for directed graphs, our implementation uses heuristics, potentially yielding suboptimal estimates of the optimal traversal path length. Second, the metric cannot be computed when a query produces no results, either due to a timeout or because no results exist for the query. Finally, our metric uses theoretically optimal paths; in practice, document dereference times vary, making the theoretically optimal path potentially suboptimal in wall-clock terms. In future work, more extensive benchmarking of the metric is required to validate its effectiveness in measuring prioritization performance. Furthermore, a new metric that includes a penalty term for HTTP request time would account for real-world uncertainties in LTQP scenarios. Finally, an $R^3$ metric for the first results can be defined.
Acknowledgements
This research was supported by SolidLab Vlaanderen (Flemish Government, EWI and RRF project VV023/10). Ruben Taelman is a postdoctoral researcher at the Research Foundation – Flanders (FWO).