rdflib-hdt

A Store back-end for rdflib to allow for reading and querying HDT documents.

Requirements

Python version 3.6.4 or higher
pip
gcc/clang with C++11 support
Python development headers

You should have the Python.h header available on your system.
For example, for Python 3.6, install the python3.6-dev package on Debian/Ubuntu systems.
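
One way to check whether the Python.h header is visible to the interpreter you plan to install into is sketched below. It only uses the standard library; the helper names python_header_path and has_python_headers are illustrative and not part of rdflib-hdt, and the check assumes the headers live in the include directory reported by sysconfig.

```python
import os
import sysconfig


def python_header_path() -> str:
    """Return the path where Python.h is expected for the running interpreter."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers() -> bool:
    """Return True if Python.h exists at the expected location."""
    return os.path.isfile(python_header_path())


print(f"Expected header location: {python_header_path()}")
print(f"Python.h found: {has_python_headers()}")
```

If the header is reported as missing, installing the development package for your Python version is the usual fix.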

Installation

Installation using pipenv or a virtualenv is strongly advised!

PyPI installation (recommended)

# you can install using pip
pip install rdflib-hdt

# or you can use pipenv
pipenv install rdflib-hdt

Manual installation

Requirement: pipenv

git clone https://github.com/Callidon/pyHDT
cd pyHDT/
./install.sh
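
After either installation route, a quick way to confirm that the package is visible to the current interpreter is sketched below. The helper is_installed is illustrative, not part of rdflib-hdt; it relies on the module name rdflib_hdt used in the import statements of this README. Note that it only checks that the module can be located, not that the compiled C++ extension loads correctly.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


print("rdflib_hdt importable:", is_installed("rdflib_hdt"))
```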

Getting started

You can use the rdflib-hdt library in two modes: as an rdflib Graph or as a raw HDT document.

Graph usage (recommended)

from rdflib import Graph
from rdflib_hdt import HDTStore
from rdflib.namespace import FOAF

# Load an HDT file. Missing indexes are generated automatically
# You can provide the index file by putting it in the same directory as the HDT file.
store = HDTStore("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {len(store)}")
print(f"Number of subjects: {store.nb_subjects}")
print(f"Number of predicates: {store.nb_predicates}")
print(f"Number of objects: {store.nb_objects}")
print(f"Number of shared subject-object: {store.nb_shared}")

# Create an RDFlib Graph with the HDT document as a backend
graph = Graph(store=store)

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
for s, p, o in graph.triples((None, FOAF.name, None)):
  print(s, p, o)

Using the RDFlib API, you can also execute SPARQL queries over an HDT document. If you do so, we recommend that you first call the optimize_sparql function, which optimizes the RDFlib SPARQL query engine for HDT documents.

from rdflib import Graph
from rdflib_hdt import HDTStore, optimize_sparql

# Calling this function optimizes the RDFlib SPARQL engine for HDT documents
optimize_sparql()

graph = Graph(store=HDTStore("test.hdt"))

# You can execute SPARQL queries using the regular RDFlib API
qres = graph.query("""
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name ?friend WHERE {
    ?a foaf:knows ?b.
    ?a foaf:name ?name.
    ?b foaf:name ?friend.
  }""")

for row in qres:
  print(f"{row.name} knows {row.friend}")

HDT Document usage

from rdflib_hdt import HDTDocument
from rdflib.namespace import FOAF

# Load an HDT file. Missing indexes are generated automatically.
# You can provide the index file by putting it in the same directory as the HDT file.
document = HDTDocument("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {document.total_triples}")
print(f"Number of subjects: {document.nb_subjects}")
print(f"Number of predicates: {document.nb_predicates}")
print(f"Number of objects: {document.nb_objects}")
print(f"Number of shared subject-object: {document.nb_shared}")

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
triples, cardinality = document.search((None, FOAF.name, None))

print(f"Cardinality of (?s foaf:name ?o): {cardinality}")
for s, p, o in triples:
  print(s, p, o)

# The search also supports limit and offset
triples, cardinality = document.search((None, FOAF.name, None), limit=10, offset=100)
# etc ...
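
The limit and offset parameters can be combined to walk through a large result set page by page. The sketch below is one possible approach, not part of the library: iter_pages is a hypothetical helper that accepts any callable with the same shape as document.search (returning an iterator of triples and a cardinality), and fake_search is a stand-in used only so the example runs without an HDT file. It assumes offsets start at 0.

```python
from typing import Callable, Iterator, List, Tuple

Triple = Tuple[object, object, object]


def iter_pages(search: Callable, pattern: tuple, page_size: int) -> Iterator[List[Triple]]:
    """Yield successive pages of results by advancing the offset."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        triples, _cardinality = search(pattern, limit=page_size, offset=offset)
        page = list(triples)
        if not page:
            return
        yield page
        # A short page means the result set is exhausted
        if len(page) < page_size:
            return
        offset += page_size


def fake_search(pattern, limit=0, offset=0):
    """Stand-in for document.search, backed by an in-memory list (illustration only)."""
    data = [(f"s{i}", "p", f"o{i}") for i in range(5)]
    return iter(data[offset:offset + limit]), len(data)


pages = list(iter_pages(fake_search, (None, "p", None), page_size=2))
print([len(page) for page in pages])  # [2, 2, 1]
```

With a real document, the same helper could be used by passing document.search as the search argument.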

An HDT document also supports evaluating joins over a set of triple patterns.

from rdflib_hdt import HDTDocument
from rdflib import Variable
from rdflib.namespace import FOAF, RDF

document = HDTDocument("test.hdt")

# Find the names of two entities that know each other
tp_a = (Variable("a"), FOAF.knows, Variable("b"))
tp_b = (Variable("a"), FOAF.name, Variable("name"))
tp_c = (Variable("b"), FOAF.name, Variable("friend"))
query = {tp_a, tp_b, tp_c}

iterator = document.search_join(query)
print(f"Estimated join cardinality: {len(iterator)}")

# Join results are produced as ResultRow, like in the RDFlib SPARQL API
for row in iterator:
  print(f"{row.name} knows {row.friend}")

Handling non-UTF-8 strings in Python

If the HDT document was encoded with a non-UTF-8 encoding, the previous code will not work correctly and will raise a UnicodeDecodeError. The pybind11 documentation on string conversions between C++ and Python has more details.

To handle this, the HDT document API is duplicated with variants that return raw bytes:

search_triples_bytes(...) returns an iterator of triples as (py::bytes, py::bytes, py::bytes)
search_join_bytes(...) returns an iterator of sets of solution mappings as py::set(py::bytes, py::bytes)
convert_tripleid_bytes(...) returns a triple as (py::bytes, py::bytes, py::bytes)
convert_id_bytes(...) returns a py::bytes

Parameters and documentation are the same as for the standard versions.

from rdflib_hdt import HDTDocument

document = HDTDocument("test.hdt")
it = document.search_triples_bytes("", "", "")

for s, p, o in it:
  print(s, p, o) # print b'...', b'...', b'...'
  # now decode it, or handle any error
  try:
    s, p, o = s.decode('UTF-8'), p.decode('UTF-8'), o.decode('UTF-8')
  except UnicodeDecodeError as err:
    # try another codec, ignore the error, etc.
    pass
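
The "try another codec" step can be sketched as a small fallback decoder. This is only an illustration: decode_term is a hypothetical helper, not part of rdflib-hdt, and the default list of encodings is an assumption that you should adapt to how your HDT file was produced. Since latin-1 can decode any byte sequence, it works as a last resort but may produce incorrect characters for data that was really in another encoding.

```python
from typing import Sequence


def decode_term(raw: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")) -> str:
    """Decode raw bytes by trying each encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate worked: keep the term but mark undecodable bytes
    return raw.decode("utf-8", errors="replace")


print(decode_term("café".encode("utf-8")))
print(decode_term("café".encode("latin-1")))
```

Each element of a triple returned by the bytes API could be passed through such a helper before further processing.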
