The Linked Dataset Profiling (LDP) tool is an implementation of the approach proposed in [1]. Its main purpose is to generate structured profiles of Linked Datasets. A profile in this case is a graph consisting of linked datasets, resource instances, and topics. The topics are DBpedia categories, extracted through a Named Entity Disambiguation (NED) process that analyzes the textual literals of resources. The main steps executed by the profiling tool are:
- Dataset metadata extraction from DataHub
- Resource instance extraction
- Entity and topic extraction through NED from extracted resources using NED tools like Spotlight or TagMe!
- Profile graph construction and topic ranking
- Exporting of profiles in JSON format.
The individual steps are explained in detail in [1]. Here we provide a brief overview of the output from each step.
Step (1): Dataset metadata extraction
The input required by the tool is a dataset ID from DataHub, e.g., lak-dataset, or a group ID of datasets, e.g., lodcloud.
As output, the LDP tool extracts metadata from the datasets, such as the SPARQL endpoint, name, maintainer, etc., and stores it in a user-specified directory.
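The exact record format LDP stores is not specified here. As an illustration only, DataHub is backed by CKAN, so a CKAN-style dataset record could be reduced to the fields mentioned above roughly like this (the record contents and field names are assumptions for the example, not LDP's actual schema):

```python
import json

# Illustrative CKAN-style dataset record, similar in shape to what
# DataHub's API returns (values here are made up for the example).
record = json.loads("""
{
  "name": "lak-dataset",
  "maintainer": "LinkedUp Project",
  "resources": [
    {"format": "api/sparql", "url": "http://example.org/lak/sparql"}
  ]
}
""")

def extract_metadata(rec):
    """Keep only the metadata fields the profiling step needs."""
    endpoint = next((r["url"] for r in rec.get("resources", [])
                     if r.get("format") == "api/sparql"), None)
    return {"name": rec["name"],
            "maintainer": rec.get("maintainer"),
            "sparql_endpoint": endpoint}

meta = extract_metadata(record)
print(meta["name"], meta["sparql_endpoint"])
```

The SPARQL endpoint found here is what the resource-extraction step (2) would query next.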
Step (2): Resource instance extraction
LDP extracts resource instances from the datasets given in step (1). It offers the option to sample the extracted resources based on three sampling strategies: random, weighted, and centrality (see [1]). Furthermore, the user can define what percentage of resources to extract, e.g., 5%, 10%, ..., 95%.
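The strategies themselves are defined in [1]; as a rough sketch (not LDP's code), the three options could be implemented over a list of resource URIs like this, with the scores standing in for, e.g., precomputed centrality values:

```python
import random

# Toy inputs: resource URIs and hypothetical importance scores
# (in the centrality strategy these would be graph-centrality values, see [1]).
resources = [f"http://example.org/resource/{i}" for i in range(100)]
weights = [1.0 + (i % 10) for i in range(100)]

def sample(resources, weights, ratio, strategy="random", seed=42):
    """Return a sample covering `ratio` percent of the resources."""
    k = max(1, len(resources) * ratio // 100)
    rng = random.Random(seed)
    if strategy == "random":
        return rng.sample(resources, k)          # uniform, without replacement
    if strategy == "centrality":
        # take the top-k resources by (precomputed) centrality score
        ranked = sorted(zip(resources, weights), key=lambda p: -p[1])
        return [r for r, _ in ranked[:k]]
    # weighted: score-biased draws (with replacement, for simplicity)
    return rng.choices(resources, weights=weights, k=k)

print(len(sample(resources, weights, 10)))  # 10% sample -> 10 resources
```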
Step (3): Entity and topic extraction
From the extracted resources, the tool performs the NED process by analyzing the textual literals of resources. One can define which datatype properties are of interest for the NED process and feed them into the tool. In this step, LDP extracts entities as DBpedia entities and derives the topics of the extracted entities through the datatype property dcterms:subject.
Step (4): Profile graph construction and topic ranking
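LDP's own query is not shown here, but the kind of SPARQL query that pulls the topics (categories) of a disambiguated entity via dcterms:subject might look as follows; the entity URI is only an example of what the NED step could return:

```python
# Builds (but does not send) a SPARQL query fetching the DBpedia categories
# of one entity via dcterms:subject; the entity is an illustrative example.
entity = "http://dbpedia.org/resource/Machine_learning"

query = f"""
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?topic WHERE {{
  <{entity}> dcterms:subject ?topic .
}}
"""
print(query.strip().splitlines()[0])
```

Such a query would be sent to the DBpedia endpoint configured under dbpedia_url.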
LDP builds the dataset topic graph as the profile from the extracted datasets, their corresponding sampled resources, and the entities and topics extracted in step (3). The topics are ranked for their relevance to the respective datasets by different graph-based ranking models that can be selected by the user: prank, kstep, and hits, for PageRank with Priors, K-Step Markov, and HITS, respectively. Finally, after ranking the topics, the LDP tool can export the profiles into JSON format so that they can be further analyzed or converted into RDF or other formats. For RDF, we provide a tool which exposes the profiles as RDF using the VoID and VoL schemas.
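To illustrate the ranking step (this is a generic sketch, not LDP's implementation), PageRank with Priors on a toy dataset-topic graph can be written in a few lines: the prior mass is concentrated on the dataset node, so topics reachable from it through more entities score higher.

```python
def pagerank_with_priors(edges, prior, beta=0.85, iterations=10):
    """Rank the nodes of a directed graph, biased toward the `prior` nodes.

    edges: dict node -> list of successor nodes
    prior: dict node -> prior probability (should sum to 1)
    """
    nodes = set(edges) | {t for succ in edges.values() for t in succ}
    rank = {n: prior.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - beta) * prior.get(n, 0.0) for n in nodes}
        for n, succ in edges.items():
            if succ:
                share = beta * rank[n] / len(succ)
                for t in succ:
                    nxt[t] += share
        rank = nxt
    # Note: sink (topic) nodes absorb rank here; that is fine for a sketch
    # that only needs the relative ordering of topics.
    return rank

# Toy graph: a dataset links to its entities, entities link to their topics.
edges = {
    "dataset": ["e1", "e2"],
    "e1": ["Category:Education"],
    "e2": ["Category:Education", "Category:Statistics"],
}
rank = pagerank_with_priors(edges, prior={"dataset": 1.0})
print(rank["Category:Education"] > rank["Category:Statistics"])  # True
```

Here Category:Education outranks Category:Statistics because both entities point to it, which is the behavior the ranking step relies on when ordering topics by relevance.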
In order to run the LDP tool, a few variables need to be set in its config file. We show here the possible input values for the different variables (with "|" separating all values accepted and recognized by the tool), whereas for the others we provide a short textual description. The variables and values should be stored in a separate file, whose path is given as a command-line argument to the LDP tool, e.g.,
java -jar ldp.jar config.ini

loadcase=0|1|2|3|4
    Only one value at a time: 0 is for step (1), 1 for step (2), and so on.
datasetpath=directory location
    An existing directory where the extracted datasets and resources will be stored.
normalised_topic_score=file location
    Path and name of the file which will hold the computed normalized topic relevance scores, computed as in [1].
annotationindex=file location
    Path and name of the file which will hold the entities and topics extracted from DBpedia.
sample_size=1|2|...|95
    Sample size, i.e., the percentage of resources extracted per dataset. Be aware that step (3) takes a long time for large sample sizes; as shown in [1], a sample size of 10% is representative.
sampling_type=random|weighted|centrality
    Sampling strategy used to extract the resources; 'centrality' performs best in terms of profiling accuracy.
outdir=directory location
    Path to an existing directory for the output.
topic_ranking_objects=directory location
    Path to an existing directory for the output generated by the different topic ranking approaches.
query_str=datahub_dataset_id|datahub_group_id
    Dataset ID or group ID from DataHub for which to perform the profiling.
is_dataset_group_search=true|false
    Set to false when query_str is a dataset ID, and to true when it is a group ID.
topic_ranking_strategies=prank|kstep|hits
    Topic ranking approaches in step (4), which determine the relevance of topics for a dataset.
property_lookup=file location
    Path to a file containing the datatype properties of interest for the NED process, one property per line; their object values should be textual literals.
raw_graph_dir=directory location
    Directory where the generated dataset topic graphs are stored.
ned_operation=tagme|spotlight
    Which NED tool to use. Spotlight requires no further setup, while TagMe! requires API credentials (contact at http://tagme.di.unipi.it/), provided under tagme_api_key.
tagme_api_key=TagMe! API KEY
    Required when the NED process is carried out by the TagMe! NED tool; request an API key at http://tagme.di.unipi.it/ and provide it as the value of this variable.
dbpedia_endpoint=en http://dbpedia.org/sparql,de http://de.dbpedia.org/live/sparql
    DBpedia SPARQL endpoints in different languages.
load_entity_categories=true
    Has to be set to true; it checks whether the corresponding topics (categories) of the extracted entities have been extracted.
dbpedia_url=http://dbpedia.org/sparql
    URL of the English DBpedia endpoint used for the extraction of entity categories.
timeout=10000
    Timeout (in seconds) when extracting resources from the datasets.
includeEntities=false
    Whether the entities should be included in the profiles or left out of the ranking process.
dataset_topic_graph=raw_graph/dataset_topic_graph.obj
    File location where the dataset topic graph is stored.
alpha=0.1
    Value used to initialize the K-Step Markov and PageRank models.
k_steps=3
    K value for K-Step Markov.
ranking_iterations=10
    Number of iterations used for the ranking of topics with K-Step Markov and PageRank with Priors.

The code and the tool are provided under the GNU GENERAL PUBLIC LICENSE. When using the LDP tool, please cite the paper in [1]. For additional information, refer to this website.
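Combining the variables above, an illustrative config file could look as follows; the paths and the dataset ID are placeholders, and all values except sample_size=10 (which [1] reports as representative) are examples rather than recommendations:

```ini
; Illustrative config.ini for the LDP tool (placeholder paths and IDs)
loadcase=0
query_str=lak-dataset
is_dataset_group_search=false
datasetpath=/data/ldp/datasets
outdir=/data/ldp/out
sample_size=10
sampling_type=centrality
ned_operation=spotlight
property_lookup=/data/ldp/properties.txt
annotationindex=/data/ldp/out/annotations.idx
normalised_topic_score=/data/ldp/out/topic_scores.txt
raw_graph_dir=/data/ldp/raw_graph
dataset_topic_graph=raw_graph/dataset_topic_graph.obj
topic_ranking_strategies=prank
topic_ranking_objects=/data/ldp/rankings
dbpedia_url=http://dbpedia.org/sparql
dbpedia_endpoint=en http://dbpedia.org/sparql,de http://de.dbpedia.org/live/sparql
load_entity_categories=true
includeEntities=false
timeout=10000
alpha=0.1
k_steps=3
ranking_iterations=10
```

With loadcase=0 this file runs only step (1); rerunning with loadcase=1 through 4 executes the remaining steps over the same directories.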
[1] Besnik Fetahu, Stefan Dietze, Bernardo Pereira Nunes, Marco Antonio Casanova, Davide Taibi, Wolfgang Nejdl: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles. ESWC 2014: 519-53