KGCW 2023 Challenge @ ESWC 2023
- Van Assche, Dylan 1
- Chaves-Fraga, David 2
- Dimou, Anastasia 3
- Şimşek, Umutcan 4
- Iglesias, Ana 2
- 1 IDLab - Ghent University - imec
- 2 Universidad Politécnica de Madrid
- 3 KU Leuven
- 4 STI Innsbruck
Publisher: Zenodo
Year of publication: 2023
Type: Dataset
Abstract
Knowledge Graph Construction Workshop 2023: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics, e.g. CPU or memory usage, are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU, memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also covers computing resources to achieve the most efficient pipeline.

We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics necessary for this challenge, such as execution time, CPU and memory usage, as CSV files. Moreover, information about the hardware used during the execution of the pipeline is available as well, to allow a fair comparison of different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases (e.g. MySQL and PostgreSQL), and triplestores (e.g. Apache Jena Fuseki and OpenLink Virtuoso), which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.
Data

- Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
- Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
- Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

- Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
- Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
- Number and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M).

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

Scaling

- GTFS-1 SQL
- GTFS-10 SQL
- GTFS-100 SQL
- GTFS-1000 SQL

Heterogeneity

- GTFS-100 XML + JSON
- GTFS-100 CSV + XML
- GTFS-100 CSV + JSON
- GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different steps for each parameter:

1. The provided CSV files and SQL schema are loaded into a MySQL relational database.
2. Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.
3. The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.
4. The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

The pipeline is executed 5 times, from which the median execution time of each step is calculated and reported.
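
As an illustration of the median aggregation over 5 runs, the following is a minimal sketch and not the challenge tool's own implementation. The step names (`load_mysql`, `construct_kg`, `query`), the function name `median_per_step`, and the timings are hypothetical values chosen for the example.

```python
from statistics import median


def median_per_step(runs):
    """Return the median execution time per step.

    `runs` is a list with one dict per pipeline run, mapping a
    (hypothetical) step name to its execution time in seconds.
    """
    steps = runs[0].keys()
    return {step: median(run[step] for run in runs) for step in steps}


# Hypothetical timings (seconds) for 5 runs of a 3-step pipeline.
runs = [
    {"load_mysql": 10.2, "construct_kg": 95.0, "query": 3.1},
    {"load_mysql": 9.8, "construct_kg": 101.5, "query": 2.9},
    {"load_mysql": 10.5, "construct_kg": 97.3, "query": 3.4},
    {"load_mysql": 10.0, "construct_kg": 99.9, "query": 3.0},
    {"load_mysql": 10.1, "construct_kg": 96.2, "query": 3.2},
]

print(median_per_step(runs))
```

With an odd number of runs, the median of each step is always one of the measured values, so a concrete run can be associated with it.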

Each step with the median execution time is then reported in the baseline results with all its measured metrics. The query timeout is set to 1 hour and the knowledge graph construction timeout to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool. You can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with the following files:

- Input dataset as CSV.
- Mapping file as RML.
- Queries as SPARQL.
- Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

- Input dataset as CSV for each parameter.
- Mapping file as RML for each parameter.
- SPARQL queries to retrieve the results for each parameter.
- Baseline results for each parameter with the example pipeline.
- Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV. Depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

- Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.
- Mapping file as RML for both scaling and heterogeneity.
- SPARQL queries to retrieve the results.
- Baseline results with the example pipeline.
- Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

- Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.
- CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.
- Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum of the memory consumption measured during the execution of that step.

Expected output

Duplicate values

| Scale | Number of Triples |
| --- | --- |
| 0 percent | 2000000 triples |
| 25 percent | 1500020 triples |
| 50 percent | 1000020 triples |
| 75 percent | 500020 triples |
| 100 percent | 20 triples |

Empty values

| Scale | Number of Triples |
| --- | --- |
| 0 percent | 2000000 triples |
| 25 percent | 1500000 triples |
| 50 percent | 1000000 triples |
| 75 percent | 500000 triples |
| 100 percent | 0 triples |

Mappings

| Scale | Number of Triples |
| --- | --- |
| 1TM + 15POM | 1500000 triples |
| 3TM + 5POM | 1500000 triples |
| 5TM + 3POM | 1500000 triples |
| 15TM + 1POM | 1500000 triples |

Properties

| Scale | Number of Triples |
| --- | --- |
| 1M rows 1 column | 1000000 triples |
| 1M rows 10 columns | 10000000 triples |
| 1M rows 20 columns | 20000000 triples |
| 1M rows 30 columns | 30000000 triples |

Records

| Scale | Number of Triples |
| --- | --- |
| 10K rows 20 columns | 200000 triples |
| 100K rows 20 columns | 2000000 triples |
| 1M rows 20 columns | 20000000 triples |
| 10M rows 20 columns | 200000000 triples |

Joins

1-1 joins

| Scale | Number of Triples |
| --- | --- |
| 0 percent | 0 triples |
| 25 percent | 125000 triples |
| 50 percent | 250000 triples |
| 75 percent | 375000 triples |
| 100 percent | 500000 triples |

1-N joins

| Scale | Number of Triples |
| --- | --- |
| 1-10 0 percent | 0 triples |
| 1-10 25 percent | 125000 triples |
| 1-10 50 percent | 250000 triples |
| 1-10 75 percent | 375000 triples |
| 1-10 100 percent | 500000 triples |
| 1-5 50 percent | 250000 triples |
| 1-10 50 percent | 250000 triples |
| 1-15 50 percent | 250005 triples |
| 1-20 50 percent | 250000 triples |

N-1 joins

| Scale | Number of Triples |
| --- | --- |
| 10-1 0 percent | 0 triples |
| 10-1 25 percent | 125000 triples |
| 10-1 50 percent | 250000 triples |
| 10-1 75 percent | 375000 triples |
| 10-1 100 percent | 500000 triples |
| 5-1 50 percent | 250000 triples |
| 10-1 50 percent | 250000 triples |
| 15-1 50 percent | 250005 triples |
| 20-1 50 percent | 250000 triples |

N-M joins

| Scale | Number of Triples |
| --- | --- |
| 5-5 50 percent | 1374085 triples |
| 10-5 50 percent | 1375185 triples |
| 5-10 50 percent | 1375290 triples |
| 5-5 25 percent | 718785 triples |
| 5-5 50 percent | 1374085 triples |
| 5-5 75 percent | 1968100 triples |
| 5-5 100 percent | 2500000 triples |
| 5-10 25 percent | 719310 triples |
| 5-10 50 percent | 1375290 triples |
| 5-10 75 percent | 1967660 triples |
| 5-10 100 percent | 2500000 triples |
| 10-5 25 percent | 719370 triples |
| 10-5 50 percent | 1375185 triples |
| 10-5 75 percent | 1968235 triples |
| 10-5 100 percent | 2500000 triples |

GTFS Madrid Bench

Generated Knowledge Graph

| Scale | Number of Triples |
| --- | --- |
| 1 | 395953 triples |
| 10 | 3959530 triples |
| 100 | 39595300 triples |
| 1000 | 395953000 triples |

Queries

| Query | Scale 1 | Scale 10 | Scale 100 | Scale 1000 |
| --- | --- | --- | --- | --- |
| Q1 | 58540 results | 585400 results | No results available | No results available |
| Q2 | 636 results | 11998 results | 125565 results | 1261368 results |
| Q3 | 421 results | 4207 results | 42067 results | 420667 results |
| Q4 | 13 results | 130 results | 1300 results | 13000 results |
| Q5 | 35 results | 350 results | 3500 results | 35000 results |
| Q6 | 1 result | 1 result | 1 result | 1 result |
| Q7 | 68 results | 67 results | 67 results | 53 results |
| Q8 | 35460 results | 354600 results | No results available | No results available |
| Q9 | 130 results | 1300 results | 13000 results | 130000 results |
| Q10 | 1 result | 1 result | 1 result | 1 result |
| Q11 | 130 results | 260 results | 260 results | 260 results |
| Q12 | 13 results | 130 results | 1300 results | 13000 results |
| Q13 | 265 results | 2650 results | 26500 results | 265000 results |
| Q14 | 2234 results | 22340 results | 223400 results | No results available |
| Q15 | 592 results | 8684 results | 35502 results | 206628 results |
| Q16 | 390 results | 780 results | 260 results | 780 results |
| Q17 | 855 results | 8550 results | 85500 results | 855000 results |
| Q18 | 104 results | 1300 results | 13000 results | 130000 results |
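
A generated knowledge graph can be checked against the expected triple counts with a small script. The sketch below is an assumption-laden example, not part of the challenge tool: the function names are hypothetical, and the "rows × columns" and "scale × 395953" relations are read off the Properties, Records, and GTFS tables above; the challenge does not state them as rules. It relies only on the fact that N-Triples stores one triple per line, with `#` starting a comment line.

```python
def expected_triples_tabular(rows, columns):
    """Expected count for the Properties/Records parameters.

    Assumption: one triple per cell, as the listed counts suggest.
    """
    return rows * columns


def expected_triples_gtfs(scale, base=395953):
    """Expected count for GTFS-Madrid-Bench at a given scale.

    Assumption: the scale-1 count (395953) grows linearly with scale.
    """
    return scale * base


def count_ntriples(lines):
    """Count statements in N-Triples content, skipping blanks and comments."""
    return sum(
        1
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )


# Hypothetical N-Triples content; a real check would iterate over a file.
sample = [
    '<http://example.com/s1> <http://example.com/p> "a" .',
    "",
    "# a comment line",
    '<http://example.com/s2> <http://example.com/p> "b" .',
]

print(count_ntriples(sample))
print(expected_triples_tabular(1_000_000, 20))
print(expected_triples_gtfs(1000))
```

For a real output file, passing the open file object to `count_ntriples` avoids loading the whole graph into memory, which matters for the larger scales.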