
Commit b643e31

Pierre-Yves Vandenbussche authored and committed

Update paper

1 parent cea2897 commit b643e31

File tree

4 files changed: +12 −9 lines changed

.gitignore (+2)

@@ -2,3 +2,5 @@
 *.bbl
 *.blg
 *.syntex.gz
+.idea/
+*.gz

paper/iswc2016/qa4lov.tex (+10 −9)
@@ -55,7 +55,7 @@
 \begin{abstract}
-There is no doubt that there is an increasing presence of structured data on the web. At the same time, web users have different skills and want to be able to interact with Linked datasets in various manners, such as by asking questions in natural language.
+\todo{don't like ``there is no doubt''}There is no doubt that there is an increasing presence of structured data on the web. At the same time, web users have different skills and want to be able to interact with Linked datasets in various manners, such as by asking questions in natural language.
 %Over the last years, the QALD challenge series has become the reference for benchmarking question answering systems. However, QALD questions are targeted at datasets, not at vocabulary catalogues.
 This paper proposes a first implementation of a Question Answering (QA) system applied to the Linked Open Vocabularies (LOV) catalogue, mainly focused on metadata information retrieval. The goal is to provide end users with yet another means of access to the metadata information available in LOV using natural language questions.
 % Currently, the system handles almost 92.99\% of the vocabularies in LOV.
@@ -76,7 +76,7 @@ \section{Introduction}\label{sec:introduction}
 However, users need a minimal knowledge of the SPARQL language and RDF skills to query RDF datasets. This is one barrier that is overcome by Question Answering (QA) systems, which directly take questions in natural language as input. The need for more advanced tools and QA systems that operate over large repositories of Linked Data has also been the motivation for the Question Answering over Linked Data (QALD) series of workshops \cite{lopezetal2013}.
 %Started in 2011, the QALD challenges offer test sets of questions whose answers can be found in DBpedia and MusicBrainz, with more complex questions added each year. However, there is not yet a set of questions specific to retrieving metadata at the schema level.
-Vocabulary catalogues and semantic search engines are special datasets, as classes and properties are used to model and generate the datasets available in the Linked Data space. Ontologies are an important part of the Semantic Web stack for building fully interoperable datasets. Most vocabulary catalogues provide term search and APIs to access their datasets. LOV provides five types of search criteria: metadata search, ontology search, API access, a dump file in RDF and SPARQL endpoint access \cite{vandenbusschelov}.
+Vocabulary catalogues and semantic search engines are special datasets, as classes and properties are used to model and generate the datasets available in the Linked Data space. Ontologies are an important part of the Semantic Web stack for building fully interoperable datasets. Most vocabulary catalogues provide term search and APIs to access their datasets. LOV provides five types of search methods: metadata search, ontology search, API access, a dump file in RDF and SPARQL endpoint access \cite{vandenbusschelov}.
 %Consider the following question in natural language \textit{``What is dcterms?''}. The answer in English corresponds to the following SPARQL query in the LOV endpoint:\footnote{\url{http://lov.okfn.org/dataset/lov/sparql}}
 %\begin{verbatim}
@@ -93,7 +93,7 @@ \section{Introduction}\label{sec:introduction}
 %}}
 %\end{verbatim}
-This paper presents a prototype of a vocabulary-backed question answering system that can transform natural language questions into SPARQL queries, thus giving end users access to the information stored in vocabulary repositories. The paper is structured as follows: Section \ref{sec:questions} describes the set of questions, followed by a system description in Section \ref{sec:system}. An evaluation is presented in Section \ref{sec:evaluation}, and a short conclusion and future work in Section \ref{sec:conclusion}.
+This paper presents a prototype of a vocabulary-backed question answering system that can transform natural language questions into SPARQL queries, thus giving end users access to the information stored in vocabulary repositories. The paper is structured as follows: Section \ref{sec:questions} describes the set of questions, followed by a system description in Section \ref{sec:system}\todo{May be more flowing if presenting the system first and then the type of questions supported?}. An evaluation is presented in Section \ref{sec:evaluation}, and a short conclusion and future work in Section \ref{sec:conclusion}.
@@ -115,7 +115,7 @@ \section{Types of Questions}
 %\item The different languages in which the vocabulary is available.
 %\end{itemize}
-Additionally, users may be interested in other facts, such as the number of versions, the number of datasets using the vocabulary, the number of external vocabularies reusing the vocabulary, and the category to which the vocabulary belongs. A first set of 14 questions in natural language can be handled by the prototype, covering the different types of metadata information available in a vocabulary. Table \ref{tab:qtable} shows the list of questions, where, in the column ``Template'', [be] can be either the present or the past form of the verb, and [vocab] is the preferred namespace of the vocabulary.
+Additionally, users may be interested in other facts, such as the number of versions, the number of datasets using the vocabulary, the number of external vocabularies reusing the vocabulary, and the category to which the vocabulary belongs. A first set of 14 questions in natural language can be handled by the prototype, covering the different types of metadata information available in a vocabulary. Table \ref{tab:qtable} shows the list of questions, where, in the column ``Template'', [be] can be either the present or the past form of the verb, and [vocab] is the preferred prefix of the vocabulary.\todo{Here we have the feeling only be and vocab can change whereas in reality we use lemma or other words...}
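The placeholder expansion described above can be sketched in a few lines. The template string and prefixes below are illustrative stand-ins, not the actual Q1–Q14 templates from the paper's table:

```python
from itertools import product

def expand_template(template, be_forms=("is", "was"), vocabs=("dcterms", "foaf")):
    """Expand the [be] and [vocab] placeholders of a question template
    into concrete natural language questions (illustrative sketch)."""
    questions = []
    for be, vocab in product(be_forms, vocabs):
        questions.append(template.replace("[be]", be).replace("[vocab]", vocab))
    return questions

# Hypothetical template; the real templates are listed in Table 1.
print(expand_template("What [be] [vocab]?"))
# → ['What is dcterms?', 'What is foaf?', 'What was dcterms?', 'What was foaf?']
```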
 \begin{table*}
 \centering
@@ -156,13 +156,14 @@ \section{System Description}
 \includegraphics[scale=.6]{qa4lov-archi.pdf}
 \label{fig:q4lovarchi}
 \end{figure}
+\todo{make this figure way shorter and add a screenshot}

-The implementation uses the Quepy tool from Machinalis\footnote{\url{https://github.com/machinalis/quepy/}}. The POS tagset used by Quepy is the Penn Tagset \cite{marcus1993building}. First, regular expressions are defined to match the natural language questions and transform them into an abstract semantic representation. Then, specific templates are defined for the questions that the system can handle. To handle regular expressions, Quepy uses the \texttt{refo} library\footnote{\url{https://github.com/machinalis/refo}}, which works with regular expressions as objects.
+The implementation uses the Quepy tool from Machinalis\footnote{\url{https://github.com/machinalis/quepy/}}. The POS tagset used by Quepy is the Penn Tagset \cite{marcus1993building}. First, regular expressions are defined to match the natural language questions and transform them into an abstract semantic representation. Then, specific templates are defined for the system to handle users' questions. To handle regular expressions, Quepy uses the \texttt{refo} library\footnote{\url{https://github.com/machinalis/refo}}, which works with regular expressions as objects.

 A vocabulary is defined by a fixed relation \texttt{voaf:Vocabulary}\footnote{All the prefixes are the ones used at \url{http://lov.okfn.org/dataset/lov/vocabs}} and a POS associated to a \texttt{vann:preferredNamespacePrefix}. LOV uses a unique prefix to identify a namespace, which is a string of length 2 to 17, although the recommendation for publishers is to use a prefix of fewer than 10 characters.
 %\cite{pybernard12}.

-The syntactic processor is based on regular expressions over POS terms. As a vocabulary is identified by its prefix, we use the following syntactic patterns: NN, NNS, FW, DT, JJ and VBN. Each question from Q1 to Q14 is associated with a unique template. Then, when a prefix is recognized, the semantic interpreter uses fixed relations with the English tag, which are properties in RDF triple patterns.
+The syntactic processor is based on regular expressions over POS terms. As a vocabulary is identified by its prefix, we use the following syntactic patterns: NN, NNS, FW, DT, JJ and VBN. Each question from Q1 to Q14 is associated with a unique template. Then, when a prefix is recognized, the semantic interpreter uses fixed relations with the English tag, which are properties in RDF triple patterns. \todo{provide an example of regex?}
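The matching step described above can be sketched with the standard library alone. Quepy and refo express the same idea with pattern objects rather than string regexes, and the exact SPARQL emitted by the system is not shown in the paper, so the tags, pattern, and triple pattern below are assumptions for illustration:

```python
import re

# A POS-tagged question, as a Penn-tagset tagger would produce it.
tagged = [("What", "WP"), ("is", "VBZ"), ("dcterms", "NN"), ("?", ".")]

# Encode the question as "token/TAG" pairs so one string regex can
# constrain both words and POS tags, the way refo patterns do.
encoded = " ".join("%s/%s" % (tok, tag) for tok, tag in tagged)

# Hypothetical template for "What [be] [vocab]?": the prefix slot accepts
# the POS tags listed in the paper (NN, NNS, FW, DT, JJ, VBN).
PATTERN = re.compile(
    r"What/WP (?:is|was)/VB[DZ] (?P<prefix>\w+)/(?:NN|NNS|FW|DT|JJ|VBN)"
)

match = PATTERN.search(encoded)
assert match is not None
prefix = match.group("prefix")

# The semantic interpreter then plugs the recognized prefix into a fixed
# RDF triple pattern (assumed shape, based on the properties named above).
sparql = (
    "SELECT ?vocab WHERE { "
    "?vocab a voaf:Vocabulary ; "
    'vann:preferredNamespacePrefix "%s" . }' % prefix
)
print(prefix)  # → dcterms
```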
 %Table \ref{tab:propTable} presents the different fixed relations currently used in the system to cover the set of the 14 questions.

 %\begin{table}
@@ -199,8 +200,8 @@ \section{System Evaluation}
 \label{sec:evaluation}

 %This first set of questions is not found in the QALD challenges, where answers had to be found either in DBpedia or in a federated dataset such as Yago2 and MusicBrainz. In terms of complexity, the test questions are simple questions aimed at retrieving metadata information.
-The system allows the user to interact with the LOV catalogue through the generated answers. Depending on the types of the results (e.g., agents, versions, categories), the system allows the user to further explore the dataset through more interactions.
-All the questions for which a SPARQL query was generated give satisfactory results. The most challenging issue is to determine the most suitable POS patterns that cover all the vocabulary prefixes. For example, out of 528\footnote{This number corresponds to the total number of vocabularies inserted in LOV as of January 8th, 2016.} vocabularies in LOV, 13 contain a hyphen, for which the system cannot generate a query: \texttt{elseweb-modelling}, \texttt{sdmx-dimension}, \texttt{omn-federation}, \texttt{omn-lifecycle}, \texttt{elseweb-data}, \texttt{dbpedia-owl}, \texttt{geo-deling}, \texttt{sdmx-code}, \texttt{wf-invoc}, \texttt{iso-thes}, \texttt{eac-cpf}, \texttt{p-plan} and \texttt{ma-ont}. Moreover, 21 prefixes containing a number (e.g., \texttt{g50k}) and 3 special cases (\texttt{homeActivity}, \texttt{LiMo}, and \texttt{juso.kr}) are not currently covered by the system. Currently, the system handles almost 92.99\% of the prefixes in LOV.
+The system allows users to interact with the LOV catalogue through the generated answers. Depending on the types of the results (e.g., agents, versions, categories), the system allows users to further explore the dataset through more interactions.
+All the questions for which a SPARQL query was generated give satisfactory results. The most challenging issue is to determine the most suitable POS patterns that cover all the vocabulary prefixes. For example, out of 528\footnote{This number corresponds to the total number of vocabularies inserted in LOV as of January 8th, 2016.} vocabularies in LOV, 13 contain a hyphen, for which the system cannot generate a query\todo{may be not necessary to list them all}: \texttt{elseweb-modelling}, \texttt{sdmx-dimension}, \texttt{omn-federation}, \texttt{omn-lifecycle}, \texttt{elseweb-data}, \texttt{dbpedia-owl}, \texttt{geo-deling}, \texttt{sdmx-code}, \texttt{wf-invoc}, \texttt{iso-thes}, \texttt{eac-cpf}, \texttt{p-plan} and \texttt{ma-ont}. Moreover, 21 prefixes containing a number (e.g., \texttt{g50k}) and 3 special cases (\texttt{homeActivity}, \texttt{LiMo}, and \texttt{juso.kr}) are not currently covered by the system. Currently, the system handles 92.99\% of the prefixes in LOV.
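The coverage gap above can be illustrated with a minimal sketch. The pattern is an assumption for illustration (a single lowercase alphabetic token of the length range the paper gives), not the system's actual POS-based matching:

```python
import re

# Assumed shape of a "simple" prefix: 2 to 17 lowercase letters, per the
# length constraint mentioned in the system description.
SIMPLE_PREFIX = re.compile(r"^[a-z]{2,17}$")

covered = ["dcterms", "foaf", "voaf"]
# Hyphens, digits, dots, and mixed case all fall outside the pattern,
# matching the failure cases reported in the evaluation.
uncovered = ["sdmx-dimension", "p-plan", "g50k", "juso.kr", "LiMo"]

assert all(SIMPLE_PREFIX.match(p) for p in covered)
assert not any(SIMPLE_PREFIX.match(p) for p in uncovered)
```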
 %One of the issues in dealing with QA over a vocabulary catalogue is finding relevant questions that can be asked of the system, based on the prefix. Since there is no uniform way of describing a prefix, it is quite challenging to find a suitable pattern with NLP tools.
@@ -224,7 +225,7 @@ \section{Conclusion and Future Work}
 \label{sec:conclusion}
 %\input{conclusion}

-In this paper, we have presented a prototype system for answering a set of natural language questions backed by a vocabulary catalogue. The questions target the retrieval of metadata information from vocabularies.
+In this paper, we have presented a prototype system for answering a set of natural language questions backed by a vocabulary catalogue. The questions target the retrieval of metadata information from vocabularies. \todo{may be emphasise on the impact for end users...}
 %The approach is based on the identification of the vocabulary prefix combining POS and regular expressions.
 The implementation uses the LOV dataset in RDF and the Quepy tool. The first results show that the system handles 92.99\% of the metadata of vocabularies in the LOV catalogue.
 We plan to extend the types of queries to more complex ones. Moreover, we can use various semantic relationships in LOV to perform query expansion, for instance by using sub-properties.

src/webapp/lov/authors.pyc (−2.88 KB) Binary file not shown.

src/webapp/lov/people.pyc (−8.83 KB) Binary file not shown.
