\chapter{Representing Information for GDPR Compliance using Ontologies}
\label{chapter:vocabularies}
This chapter presents OWL2 ontologies developed to fulfil research objective $RO3$ defined in \autoref{sec:intro:RO} regarding representation of information.
The chapter first presents a more detailed description of the methodology for developing and evaluating ontologies in \autoref{sec:voc:methodology}, based on the summary presented earlier in \autoref{sec:intro:ontology-engineering}.
It then presents the ontologies of: (i) GDPRtEXT (\autoref{sec:voc:GDPRtEXT}) which provides a linked data representation of GDPR text and a glossary of GDPR compliance concepts, and which satisfies research objective $RO3(a)$ by providing an ontological representation of concepts and text of GDPR; (ii) GDPRov (\autoref{sec:voc:GDPRov}) which enables representing provenance of activities associated with personal data and consent in ex-ante and ex-post phases, and fulfils research objective $RO3(b)$; and (iii) GConsent (\autoref{sec:voc:GConsent}) which enables representing information regarding consent, and fulfils research objective $RO3(c)$.
Each ontology is presented with a summary of its motivation, engineering process, and dissemination. Ontologies are presented with their evaluation based on the extent to which they satisfy the competency questions used in their development and through comparison with analysed approaches in state of the art in \autoref{sec:sota:analysis}.
In addition to these, the Data Privacy Vocabulary (DPV), initially presented in \autoref{sec:intro:dpvcg}, is also included as an external contribution of the thesis, based on work done by the author of this thesis within the W3C Data Protection Vocabularies and Controls Community Group (DPVCG) and on the overlap of DPV with the research presented. \autoref{sec:voc:DPV} presents an overview of DPV and compares it with GDPRtEXT, GDPRov, and GConsent, demonstrating their similarity in representing information while drawing attention to distinguishing features.
The section also presents a comparison of DPV with state of the art as identified in \autoref{chapter:sota}.
\section{Methodology for Ontology Engineering}\label{sec:voc:methodology}
\subsection{Utilisation of Existing Ontology Engineering Methodologies}
The creation of ontologies followed guidelines and methodologies deemed `best practice' by the semantic web community. In this, `Ontology development 101: A guide to creating your first ontology' by Noy and McGuinness \cite{noy_ontology_2001} was utilised as a guiding document for ontology creation. It provided steps for construction of an ontology with attention to avoiding bad design decisions and common pitfalls. It also suggested use of competency questions to determine scope of an ontology and for evaluation after creation.
% For this, the compliance questions presented in \autoref{} were used as competency questions.
The guide suggested Protégé\footnote{\url{https://protege.stanford.edu/}} - a popular and widely adopted tool - for ontology development as it supports semantic reasoners to detect logical inconsistencies arising from asserted facts and axioms in ontology.
Use of this guide provided a foundational basis for initiating the ontology development process and for using compliance questions from \autoref{sec:info:compliance-questions} as competency questions to identify concepts and relationships, testing for inconsistencies using Protégé, and iteratively building an ontology.
The development of ontologies followed a combination of NeOn methodology \cite{suarez-figueroa_neon_2012} and UPON Lite methodology \cite{de_nicola_lightweight_2016}. NeOn provides a flexible workflow for ontology development through use of scenarios such as using a specification, reusing and re-engineering existing ontological and non-ontological resources, and utilising ontological design patterns.
UPON Lite is a lightweight methodology for rapid ontology engineering that was used in combination with NeOn for iteratively developing ontologies in an agile fashion. UPON Lite consists of six steps: identification of domain terminology, construction of domain glossary, creating a taxonomy, predication as properties, meronymy for complex components, and conceptualisation into an ontology.
The combination of NeOn and UPON Lite consisted of identifying development scenarios and specifying them as requirements using NeOn, then using UPON Lite to derive actionable tasks and implementing ontology creation.
The methodology used to develop ontologies presented in this chapter explicitly specifies competency questions used to derive its concepts based on compliance questions from previous chapter (\autoref{sec:info:compliance-questions}). This approach enables tracing lineage of a concept to its role in compliance process and provides transparency in development process.
The methodology used in this thesis can be compared with those used to develop ontologies within relevant approaches in SotA. Some, such as the research projects SPECIAL and MIREL (see \autoref{sec:sota:gdpr-semweb}), utilise legal experts who act as domain experts to validate developed ontologies and their interpretation. Such projects involve commercial partners who provide real-world use-cases and data to inform and evaluate developed research.
Others - such as Ujcich et al. \cite{belhajjame_provenance_2018} - interpret GDPR as a set of requirements for compliance in their modelling of information. In either case, approaches within SotA do not provide competency questions that could be used to develop ontologies\footnote{Deliverables of research projects provide a description of how the concepts of their ontologies were developed from legal requirements, but such descriptions are argumentative and limited to the specified domain or use-case, and hence do not provide a concrete requirement that can be used to develop an ontology to answer the compliance questions.}.
% The aims and motivations of this thesis (see \autoref{sec:intro:background}) are based on representing information for assisting the compliance process rather than an evaluation of compliance itself. Therefore, the research has been documented methodologically to indicate its aims, motivations, methodology, and resources used to shape conceptualisations and rationalisations, and published in an open and accessible manner to enable transparency and reuse.
\subsection{Ontology Quality}
The quality of an ontology refers to quality of its design of concepts and relationships, and quality as a semantic dataset. While following a suitable ontology engineering methodology provides a structured ontology, it still needs to be inspected for quality in terms of ontology as well as for intended use-cases and scenarios. For this, existing publications \cite{gurk_towards_2017,vrandecic_ontology_2010} list various methods for detecting and evaluating ontology quality, and suggest solutions to fix identified problems.
OOPS!\footnote{\url{http://oops.linkeddata.es/}} \cite{poveda-villalon_oops!_2014} is a useful tool for ontology evaluation which detects common pitfalls in design of concepts and relationships and provides a documented output which can be persisted for provenance of ontology development. Each pitfall detected by OOPS! is categorised along structural, functional, and usability-profiling dimensions. The tool also provides an indicative measure of importance regarding pitfalls in terms of critical, important, and minor levels.
OOPS! was used for detecting catalogued common pitfalls in evaluation of developed ontologies. Identified pitfalls were corrected by changing underlying relationships to remove them.
Quality was also assessed and maintained by asserting sufficiency of developed ontology to represent and query information based on collected use-cases presented in \autoref{sec:info:use-cases}. In this process, missing concepts and relationships were added to the ontology, while incorrect ones were removed or rectified.
% Where existing models were found to be insufficient or incorrect, these were rectified by either removing the offending parts or re-designing them - based on its impact on compatibility with previous versions.
\subsection{Ontology Documentation}
Ontology documentation was created by using WIDOCO\footnote{\url{https://dgarijo.github.io/Widoco/}} \cite{garijo_widoco_2017} - a tool which uses ontology metadata to create HTML documents listing its classes and properties. Ontology metadata consists of information regarding the ontology as well as its concepts and properties integrated into its serialisation as annotations. WIDOCO provides a document of suggested metadata indicating best practices for ontology documentation. It builds upon LODE\footnote{\url{http://www.essepuntato.it/lode}} which is itself a popular ontology documentation service.
The output of WIDOCO is an HTML document along with various serialisations of the ontology for content negotiation that can be published and used as an online resource. Additional information was manually added to the HTML documentation to specify aims and methodologies used in development of ontologies as well as examples of use-cases and diagrams intended for human consumption. WIDOCO integrates OOPS! to detect pitfalls and documents the output. It also provides an interactive visualisation of the ontology using WebVOWL\footnote{\url{http://vowl.visualdataweb.org/webvowl.html}}.
\subsection{Dissemination}
The ontologies were published on the internet using a stable IRI through persistent identifiers on servers hosted by ADAPT Research Centre\footnote{\url{https://adaptcentre.ie/}} and School of Computer Science \& Statistics\footnote{\url{https://scss.tcd.ie/}} within Trinity College Dublin. Initially, persistent identifiers were provided using purl\footnote{\url{https://purl.org/}}, which later had issues regarding maintenance and frequent problems with URL resolution. The ontologies were then modified to utilise W3ID\footnote{\url{w3id.org/}} persistent identifiers maintained by the W3C Permanent Identifier Community Group\footnote{\url{https://www.w3.org/community/perma-id/}}. The ontologies published in this manner followed best practices and principles related to use of Linked Open Data\footnote{\url{https://www.w3.org/TR/ld-bp/}}, Linked Open Vocabularies\footnote{\url{https://dgarijo.github.io/Widoco/doc/bestPractices/index-en.html}}, and FAIR\footnote{Findability, Accessibility, Interoperability, and Reusability (FAIR) \url{https://doi.org/10.1038\%2Fsdata.2016.18}} principles.
Each ontology was added to Linked Open Vocabularies\footnote{\url{https://lov.linkeddata.es/}} (LOV) - a community listing that catalogues vocabularies in semantic web community. Each ontology was published in Zenodo\footnote{\url{zenodo.org/}} which provides open repositories and assigns a unique DOI to repositories. The ontology and its resources were also added to public hosting repositories such as GitHub\footnote{\url{github.com/}} and an instance of OpenGogs\footnote{\url{opengogs.adaptcentre.ie/}} hosted on institution servers. Each ontology was published under an open and permissive license (CC-by-4.0\footnote{\url{https://creativecommons.org/licenses/by/4.0/}}) to promote its use and adoption.
\subsection{Evaluation}
Evaluation was carried out by analysing sufficiency of each ontology to provide concepts for representing information for answering competency questions. This was carried out in an iterative manner where each iteration consisted of developing the ontology, evaluating it, and utilising results of evaluation as feedback to identify areas of improvement such as missing concepts and relationships or incorrect assumptions.
The ontology was also evaluated against common pitfalls using OOPS! as described earlier regarding ontology quality. The OOPS! ontology report is published along with ontology documentation, and can be manually generated by using the OOPS! online service. Documentation and publishing standards were evaluated by assessing whether ontologies met existing criteria advocated by the community (such as the 5-star principle for linked data\footnote{\url{https://5stardata.info/en/}} and FAIR principles). Finally, each ontology was published and presented in a peer-reviewed venue and publication, with more information about publications provided in the respective ontology's section.
Evaluation of work as a research contribution was carried out based on whether it satisfied its research objectives motivating its development and whether it provided novel contributions compared to existing approaches within state of the art. The details of this are presented in evaluation sections of each ontology.
\subsection*{Summary of Methodology}
Based on the above description of ontology engineering processes, the methodology used for ontology engineering and development is summarised as:
\begin{enumerate}
\item \textbf{Identification of aims, objectives, scope:} The first step was to identify aim and objectives of information to be represented, followed by deciding on scope regarding relation to GDPR compliance. For ontologies presented in this chapter, aims and objectives are listed in \autoref{sec:intro:RQ}. % introduction
\item \textbf{Identify and analyse relevant information:} Using identified scope, relevant information was gathered from various sources including authoritative, community, and publications - and analysed to identify terms of importance and requirements regarding GDPR compliance. The information is presented as background of GDPR in \autoref{sec:background:GDPR} and analysed with regards to compliance in \autoref{chapter:information}.
\item \textbf{Create use-cases and competency questions:} From the analysed information, different use-cases were identified to better understand application of information in compliance scenarios and requirements of different stakeholders in this process. This was done using information interoperability model presented in \autoref{sec:info:model}. The analysed information was used to create compliance questions, as presented in \autoref{sec:info:compliance-questions}, which identify relevant information for evaluation of compliance. These compliance questions were utilised as competency questions in development and evaluation of ontologies.
\item \textbf{Identify concepts and relationships:} Relevant concepts and relationships were identified to express information required to answer compliance questions in identified use-cases. This was an iterative and cyclic process where identified concepts and relationships were re-purposed to better suit some design pattern or compliance requirements.
\item \textbf{Create Ontology:} The identified concepts and relationships were formalised as an ontology in OWL2 using the Protégé ontology development environment. In this process, semantic reasoners (Pellet\footnote{\url{https://github.com/stardog-union/pellet}} and HermiT\footnote{\url{http://www.hermit-reasoner.com/}}) were used to identify logical inconsistencies in the ontology. Minor inconsistencies were fixed by changing appropriate relationships between concepts, while major inconsistencies required re-evaluation of information identified in step 4. Development of the ontology utilised best practices advocated by the semantic web community in terms of ontology metadata, documentation, design patterns, publication, and dissemination.
\item \textbf{Evaluate:} The ontology was evaluated for sufficiency towards representing information for answering competency questions. The use of a semantic reasoner detected logical inconsistencies in expressed facts and axioms, while OOPS! provided detection of common pitfalls and bad design patterns.
The quality of metadata and documentation was evaluated in terms of sufficiency based on community guidelines. Where an ontology was published and/or presented as a resource or as part of a peer-reviewed publication, resulting comments and feedback were used to identify areas of improvement. Citations of ontologies and associated publications were used to identify criticisms (if provided) and to compare them with work presented in citing publication.
\item \textbf{Dissemination:} The ontology and its documentation were published online with a persistent identifier as a FAIR resource with an open and permissive license. This included publication of ontology, datasets, and code in a public repository accompanied by human-readable documentation about its creation and utilisation.
\item \textbf{Progressive iterations:} Within a single iteration of development, an ontology was created and evaluated by following steps 2 to 6. Multiple iterations consisted of repeating these steps as in a development cycle to progressively improve the ontology by adding new concepts or removing existing undesirable ones. Previous versions of the ontology were retained with their documentation for provenance where possible to indicate milestones in its development.
\end{enumerate}
% \subsection*{Modularity of Ontologies}
% The research question stated in \autoref{sec:intro:RQ} lays the scope for the work of developing an ontological re{}presentation of information associated with GDPR compliance and concerning activities associated with processing of personal data and consent.
% This information is identified through the research objectives $RO1$ and $RO2$.
% They provide the analysis of GDPR and the compliance questions as presented in \autoref{chapter:information} which act as requirements in the ontology engineering process.
% This directly provides the objective of developing an ontology to represent information about processing of personal data and consent.
\section{GDPRtEXT - Linked Open Dataset of GDPR text \& Glossary of Concepts}\label{sec:voc:GDPRtEXT}
This section describes the GDPRtEXT ontology and dataset which provides a linked data version of text of GDPR and a SKOS glossary of concepts associated with its compliance. The section presents the motivation and creation of GDPRtEXT, its publication, dissemination, and comparison with relevant approaches in state of the art. The latest iteration of GDPRtEXT (v0.6) is available online\footnote{\url{https://w3id.org/GDPRtEXT/}} with its documentation and code repository\footnote{\url{https://github.com/coolharsh55/GDPRtEXT/}}.
\subsection{Motivation}
GDPR as a legislation consists of text which is structured into 173 Recitals, 99 Articles (further grouped into Chapters and Sections), and 21 Citations. Each Article may have one or more Paragraphs, each of which may have one or more Sub-Paragraphs. As per norms used in legislations, each individual clause - whether an article, paragraph, or sub-paragraph - is identified with an alphanumeric identifier. These are commonly referenced in textual notation as identifiers, for example \textit{Article 8 Paragraph 2 Sub-Paragraph c} can be referred to as: \textit{A8(2-c), A(8-2c), A8-2(c), Art.8 2(c), Art-8-2-c}. Because such notations are intended for human readability and interpretation, there is no standard or commonly accepted specification for them. This presents difficulty when representing such information in machine-readable formats.
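The difficulty can be illustrated with a minimal Python sketch that maps the five compact notations listed above to one structured reference. The names \texttt{ClauseRef}, \texttt{parse\_clause\_ref}, and \texttt{local\_name}, as well as the \texttt{article8-2-c} naming pattern, are hypothetical and are not taken from GDPRtEXT. The parser only covers the compact notations shown here and is not a general solution.

```python
import re
from typing import NamedTuple, Optional


class ClauseRef(NamedTuple):
    article: int
    paragraph: Optional[int]
    subparagraph: Optional[str]


def parse_clause_ref(text: str) -> ClauseRef:
    """Parse a compact reference such as 'A8(2-c)' or 'Art-8-2-c'."""
    # Drop the leading 'A', 'Art' or 'Art.' marker and any separator after it.
    body = re.sub(r"^\s*(?:art|a)[\s.\-(]*", "", text, flags=re.IGNORECASE)
    # What remains is a run of numbers and single letters split by punctuation.
    tokens = re.findall(r"\d+|[a-z]", body.lower())
    if not tokens or not tokens[0].isdigit():
        raise ValueError(f"no article number in {text!r}")
    article = int(tokens[0])
    paragraph = int(tokens[1]) if len(tokens) > 1 and tokens[1].isdigit() else None
    subparagraph = tokens[2] if len(tokens) > 2 and tokens[2].isalpha() else None
    return ClauseRef(article, paragraph, subparagraph)


def local_name(ref: ClauseRef) -> str:
    """Build one canonical identifier (hypothetical naming pattern)."""
    parts = [f"article{ref.article}"]
    if ref.paragraph is not None:
        parts.append(str(ref.paragraph))
    if ref.subparagraph is not None:
        parts.append(ref.subparagraph)
    return "-".join(parts)


# The five notations from the text collapse to a single identifier.
notations = ["A8(2-c)", "A(8-2c)", "A8-2(c)", "Art.8 2(c)", "Art-8-2-c"]
canonical = {local_name(parse_clause_ref(n)) for n in notations}
```

Each new notation requires extending such a parser, which is the practical argument for assigning every clause a single machine-readable identifier instead.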
The EU Publications Office currently publishes legislation metadata at document level which provides information about GDPR as a legislation using ELI ontology and standard \cite{thomas_european_2019} but does not specify granular information about its contents - such as its articles. The EU Publications Office has indicated its intention to provide such granular metadata in future (see footnote in \autoref{sota:analysis:representation}).
Currently, concepts arising from legislations as well as those used in context of GDPR compliance have no standardised reference to provide commonality between two representations in different use-cases.
Within the larger scope of legal compliance, information is always associated with clauses and concepts of a law - in this case GDPR.
Amongst approaches part of SotA presented in \autoref{chapter:sota}, only two approaches consider association of information with GDPR within scope of their work as presented in \autoref{sota:analysis:representation}.
Other approaches, where they reference concepts and clauses of GDPR, do so in an ad-hoc manner using textual notations such as ``Article 4-11''.
As the analysis in \autoref{sota:analysis:representation} points out, the two approaches modelling clauses of GDPR have three drawbacks - (a) neither representation is compatible with the ELI ontology, (b) neither provides a glossary of terms relevant for compliance, and more importantly (c) neither resource can be reused as it is not published in an open and accessible manner.
Addressing this gap is required to fulfil research objectives related to linking of information with GDPR.
With this motivation research objective $RO3(a)$ was established in \autoref{sec:intro:RO} and is fulfilled by GDPRtEXT - which provides an OWL2 ontology for granular representation of GDPR text and a dataset created using this ontology for a linked data representation of GDPR where each clause has a unique IRI.
GDPRtEXT also provides a SKOS glossary of terms associated with compliance and links them with their definition and relevance to clauses within GDPR using the developed ontology.
It thus enables linking of information with specific clauses and concepts of GDPR.
\subsection{Ontology Engineering and Creation of Resource}\label{sec:voc:gdprtext-engineering}
Following the methodology described in \autoref{sec:voc:methodology}, development of competency questions was based on understanding and analysis of how legal articles are referenced in text in relation to compliance.
Competency questions presented here are categorised based on whether they concern structure of GDPR or representation of concepts associated with compliance. These competency questions investigate structure and concepts of GDPR as a document and thereby differ from compliance questions presented in \autoref{chapter:information} which are concerned with investigating information for compliance. The competency questions for GDPRtEXT are outlined below with identified requirements:
\subsubsection{Structure of GDPR text}
\begin{enumerate}[label={\texttt{CQ.\theenumi}}]
\item How many Recitals are there within GDPR?
\item How many Chapters are there within GDPR?
\item How many Sections are there within GDPR?
\item How many Articles are there within GDPR?
\item How many Paragraphs are there within GDPR?
\item How many Sub-paragraphs are there within GDPR?
\item How many References or Citations are there within GDPR?
\item Article 4 belongs to which Chapter? (generalised to which \textit{Chapter} does \textit{Article X} belong?)
\item Which clause contains the definition of `personal data'? (generalised to definition of \textit{concept X})
\item What is the structural hierarchy of the document?
\item What are the `Principles' defined in GDPR? (generalised to \textit{conceptA} with types \textit{conceptB}, e.g. `Accountability' as a type of `Principle')
\item Which articles, paragraphs, and sub-paragraphs are relevant to the validity of given consent? (generalised to relevant to \textit{concept X})
\item How to associate information regarding given consent to relevant clauses in the GDPR? (generalised to associating information regarding \textit{concept X})
\item How to associate information regarding compliance to a specific article of the GDPR?
\end{enumerate}
Based on these, following requirements were identified with regards to extending the existing ELI ontology:
\begin{itemize}
\item Structure of text must be specified with granularity and a hierarchy of Document, Chapter, Section, Article, Paragraph, Sub-Paragraph along with Recitals and Citations.
\item Relations between clauses must be specified e.g. Paragraph belongs to an Article.
\item Relations must be transitive e.g. Paragraph in an Article must also be in the Article's Chapter.
\item Each individual clause must have a unique IRI to enable linking of information to it.
\end{itemize}
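The requirements above can be sketched in a few lines of Python, under stated assumptions. The namespace \texttt{https://example.org/gdpr\#}, the local names, and the helper functions are hypothetical placeholders and do not reproduce the IRIs or properties of GDPRtEXT. The sketch stores only the direct containment relation and derives the transitive one by walking upwards, which is one way to answer a question such as `Article 4 belongs to which Chapter?' for any clause.

```python
from typing import Dict, Iterator

BASE = "https://example.org/gdpr#"  # hypothetical namespace, not the GDPRtEXT one

# Direct containment only: each clause points at its immediate container.
PART_OF: Dict[str, str] = {
    "chapter1": "gdpr",
    "article4": "chapter1",
    "article4-11": "article4",
}


def iri(local: str) -> str:
    """Give a clause a unique IRI so that other data can link to it."""
    return BASE + local


def containers(local: str) -> Iterator[str]:
    """Yield every clause that transitively contains the given clause."""
    seen = set()
    while local in PART_OF:
        local = PART_OF[local]
        if local in seen:
            raise ValueError("containment cycle detected")
        seen.add(local)
        yield local


def chapter_of(local: str) -> str:
    """Return the chapter a clause belongs to, however deeply it is nested."""
    return next(c for c in containers(local) if c.startswith("chapter"))


ancestors = list(containers("article4-11"))
```

In an OWL2 ontology the same effect can be obtained declaratively by marking the containment property as transitive and letting a reasoner infer the indirect relations.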
\subsubsection{Concepts associated with GDPR compliance}
\begin{enumerate}[label={\texttt{CQ.\theenumi}},resume]
\item What type of data does the GDPR define?
\item What types of consent does the GDPR define?
\item What are the different entities referred to within GDPR?
\item Which activities are associated with processing of personal data?
\item Which activities are associated with consent?
\item What are the conditions or criteria which affect sensitivity of processing?
\item What activities are relevant to a data breach?
\item Which activities are relevant regarding compliance?
\item What are the principles defined in GDPR?
\item What are the rights provided by the GDPR?
\item Which criteria does the GDPR mention for right to data portability?
\item Which criteria does the GDPR mention for right to be informed?
\item What are the obligations mentioned within GDPR?
\item What are the obligations of the Controller?
\item What are the obligations of the Processor?
\item What are the obligations of a DPO?
\item What are the lawful bases for processing of personal data specified in the GDPR?
\item What are the conditions for valid consent under GDPR?
\item Which obligations are mentioned in relation to data collection?
\item Which obligations are mentioned in relation to obtaining consent?
\item Which obligations are mentioned in relation to retaining personal data?
\item Which obligations are mentioned in relation to security of personal data?
\item What concepts are defined regarding seals and certifications?
\end{enumerate}
Based on these, following requirements were identified with regards to representing concepts as a glossary using W3C SKOS standard:
\begin{itemize}
\item The glossary should express concepts in a hierarchy of relations as associated with compliance. This hierarchy is based on which additional concepts are relevant to the given concept. For example, all principles are referred to when referring to the concept of `Principle', and activities and actions associated with compliance are referred to when using the concept of `Compliance'.
\item The glossary should reference concepts with their definitions within the clauses of the GDPR.
\item The glossary should indicate relevant concepts within GDPR for a given concept in the context of compliance.
\item The glossary should provide concepts regarding:
\begin{itemize}
\item types of data
\item types of consent
\item types of entities
\item types of activities associated with - consent, data, processing, data breaches
\item actions associated with compliance
\item principles defined in the GDPR
\item rights provided by the GDPR
\item obligations mentioned in the GDPR
\item conditions required for valid consent
\item conditions associated with seals and certifications
\end{itemize}
\end{itemize}
\subsubsection{Extending ELI}
The suitability of extending existing ELI ontology for representing hierarchy of clauses in GDPR was evaluated and found to be feasible based on existence of generic extendible concepts. Information models used by ELI - namely FORMEX\footnote{\url{https://op.europa.eu/en/web/eu-vocabularies/formex//}} and Common Data Model\footnote{\url{https://op.europa.eu/en/web/eu-vocabularies/model/-/resource/dataset/cdm}} were also taken into consideration in formulating an extension mechanism for compatibility.
Where the GDPR as a document contains metadata regarding its expression in multiple languages, the extension was modelled to be language agnostic with labels in English for the scope of this thesis. Therefore language specifications provided by FRBR model\footnote{\url{https://www.ifla.org/publications/functional-requirements-for-bibliographic-records}} were not modelled nor included as language labels. The FRBR functionality can be easily integrated in future by extending relevant GDPRtEXT concepts with language tags and expressions.
\subsubsection{Creation of datasets}
Based on the requirements stated earlier, three outputs were produced under GDPRtEXT: an OWL2 ontology for representing structure of GDPR text, a dataset of GDPR text using this ontology, and a SKOS glossary of concepts associated with compliance. The OWL2 ontology and SKOS glossary were combined within a single deliverable, namely the GDPRtEXT ontology, and the representation of GDPR text was provided as an RDF dataset with metadata defined using DCAT\footnote{\url{https://www.w3.org/TR/vocab-dcat/}} and VoID\footnote{\url{https://www.w3.org/TR/void/}} standards.
% Extracting text from GDPR's legislation document as individual clauses proved to be a challenge due to the way the official legislation is structured and provided in HTML, PDF, and XML serialisations.
The task of extracting individual clauses and annotating their structure (e.g. chapter, article, paragraph) was automated using a JavaScript script\footnote{\url{https://github.com/coolharsh55/GDPRtEXT/blob/master/scripts/parse_gdpr.js}}.
Three datasets were produced and published through this process. The first provides description of canonical versions of official legislation, i.e. those published by EU Publications Office which specifies GDPR legislation in HTML, PDF, and XML formats. The second provides a copy of GDPR hosted on institution servers and provides identifiers for individual clauses in HTML, JSON, and plain-text documents. The third provides RDF serialisations of GDPR using GDPRtEXT ontology in RDF/XML, N3, Turtle, and JSON-LD.
The glossary published using SKOS utilised IRIs of individual clauses of GDPR in GDPRtEXT to indicate source, definitions, and related concepts. Definitions were declared using \texttt{rdfs:isDefinedBy} property, and a new property called \texttt{gdprtext:involves} was created to indicate associations between concepts.
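As an illustrative sketch of this pattern (not an excerpt from the published glossary), a concept could be linked to the clauses that define and relate to it as shown below. The namespace IRIs bound to \texttt{gdprtext:} and \texttt{gdpr:}, the clause identifiers, and the typing of the concept as \texttt{skos:Concept} are assumptions made for illustration.
\begin{minted}{turtle}
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
# Assumed namespaces; substitute the IRIs published with GDPRtEXT.
@prefix gdprtext: <https://w3id.org/GDPRtEXT#> .
@prefix gdpr:     <https://w3id.org/GDPRtEXT/gdpr#> .

gdprtext:Consent
    a skos:Concept ;
    skos:prefLabel "Consent"@en ;
    # Clause providing the definition (hypothetical clause IRI)
    rdfs:isDefinedBy gdpr:Article4-11 ;
    # Clause relevant to the concept (hypothetical clause IRI)
    gdprtext:involves gdpr:Article7 .
\end{minted}
Using clause IRIs as the objects of both properties keeps every glossary entry traceable to the text of GDPR.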
\subsubsection{Publication \& Dissemination}
The ontology and dataset are exposed through a SPARQL endpoint\footnote{\url{https://w3id.org/GDPRtEXT/sparql}} on a triple-store hosted on institution servers. Pubby\footnote{\url{http://wifo5-03.informatik.uni-mannheim.de/pubby/}} was used to provide a front-end for browsing an RDF serialisation of GDPRtEXT, as shown in \autoref{fig:vocab:gdprtext-pubby}. The dataset was released under the CC-BY-4.0 license to make resources open and reusable, and was published in the Irish Open Data Portal\footnote{\url{https://data.gov.ie/dataset/gdprtext}}, which rated it as a 5-star dataset, the highest rating for following linked open data principles.
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{img/gdprtext-pubby}
\caption{Article 12(3) in GDPRtEXT as RDF displayed using Pubby \cite{pandit_gdprtext_2018}}
\label{fig:vocab:gdprtext-pubby}
\end{figure}
\subsection{Resource Description \& Application}
A visual overview of concepts within GDPRtEXT is presented in \autoref{fig:vocab:gdprtext-summary-a} and \autoref{fig:vocab:gdprtext-summary-b}.
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{img/gdprtext-summary-a}
\caption{Visual overview of concepts in GDPRtEXT - part (a) \cite{pandit_gdprtext_2018}}
\label{fig:vocab:gdprtext-summary-a}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.75\linewidth]{img/gdprtext-summary-b}
\caption{Visual overview of concepts in GDPRtEXT - part (b) \cite{pandit_gdprtext_2018}}
\label{fig:vocab:gdprtext-summary-b}
\end{figure}
\subsubsection{Concepts for describing structure of text}
GDPRtEXT extends European Legislation Identifier (ELI) \cite{thomas_european_2019} ontology published by European Publications Office with granular concepts to represent individual clauses within GDPR.
ELI provides the class \texttt{LegalResource} to indicate a legislative document and its sub-class \texttt{LegalSubResource} to indicate a component or part of that resource. GDPRtEXT extends \texttt{LegalSubResource} with sub-classes \texttt{Chapter}, \texttt{Section}, \texttt{Article}, \texttt{Point} (indicating Paragraph), \texttt{SubPoint} (indicating Sub-Paragraph), \texttt{Recital}, and \texttt{Citation}.
ELI provides properties \texttt{has\_part} and its inverse \texttt{is\_part\_of} to indicate connections between two legal resources, which GDPRtEXT extends using sub-properties to indicate hierarchical relations between chapters, sections, articles, points, and sub-points.
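As a minimal sketch of how such an extension could be declared (an illustration, not an excerpt from the published ontology), the Turtle below sub-classes \texttt{LegalSubResource} and specialises \texttt{is\_part\_of}. The IRIs bound to \texttt{gdprtext:} and \texttt{gdpr:} and the instance identifiers are assumptions, the ELI class name is used as given in the text above and should be checked against the ELI release in use, and the \texttt{rdfs:range} axiom is an illustrative modelling choice.
\begin{minted}{turtle}
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:      <http://www.w3.org/2002/07/owl#> .
@prefix eli:      <http://data.europa.eu/eli/ontology#> .
# Assumed namespaces; substitute the IRIs published with GDPRtEXT.
@prefix gdprtext: <https://w3id.org/GDPRtEXT#> .
@prefix gdpr:     <https://w3id.org/GDPRtEXT/gdpr#> .

# Structural classes specialise the generic ELI sub-resource class.
gdprtext:Chapter a owl:Class ; rdfs:subClassOf eli:LegalSubResource .
gdprtext:Article a owl:Class ; rdfs:subClassOf eli:LegalSubResource .
gdprtext:Point   a owl:Class ; rdfs:subClassOf eli:LegalSubResource .

# A hierarchical relation declared as a sub-property of eli:is_part_of.
gdprtext:isPartOfChapter
    a owl:ObjectProperty ;
    rdfs:subPropertyOf eli:is_part_of ;
    rdfs:range gdprtext:Chapter .

# Hypothetical instance data: Article 20 sits within Chapter III.
gdpr:Article20
    a gdprtext:Article ;
    gdprtext:isPartOfChapter gdpr:Chapter3 .
\end{minted}
Because the new property is a sub-property of \texttt{eli:is\_part\_of}, a reasoner that understands only ELI can still infer the generic part-of relation.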
\subsubsection{Concepts about Data}
GDPR mentions different types of data which determine applicable obligations and requirements of compliance. GDPRtEXT provides \texttt{Data} as a top-level concept to indicate abstract term of `data'.
GDPR primarily focuses on personal data as defined in Article 4(1) - represented in GDPRtEXT as \texttt{PersonalData}, with special categories of personal data defined in Article 9(1) requiring additional obligations for processing and handling being represented by \texttt{SpecialCategoryPersonalData}. Types of special categories mentioned include criminal data, genetic data, health data, and racial data - which are defined as sub-classes in GDPRtEXT.
GDPR also mentions data in the context of anonymisation and pseudonymisation processes, represented in GDPRtEXT as \texttt{AnonymousData} and \texttt{PseudoAnonymousData}.
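The resulting hierarchy can be sketched in Turtle as follows. The class names are those given above, whereas the namespace IRI and the exact placement of each class in the hierarchy (for instance, whether \texttt{PseudoAnonymousData} sits directly under \texttt{Data}) are assumptions for illustration and may differ from the published ontology.
\begin{minted}{turtle}
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:      <http://www.w3.org/2002/07/owl#> .
# Assumed namespace; substitute the IRI published with GDPRtEXT.
@prefix gdprtext: <https://w3id.org/GDPRtEXT#> .

gdprtext:Data a owl:Class .

# Personal data and its special categories (assumed nesting).
gdprtext:PersonalData
    rdfs:subClassOf gdprtext:Data .
gdprtext:SpecialCategoryPersonalData
    rdfs:subClassOf gdprtext:PersonalData .
gdprtext:HealthData
    rdfs:subClassOf gdprtext:SpecialCategoryPersonalData .
gdprtext:GeneticData
    rdfs:subClassOf gdprtext:SpecialCategoryPersonalData .

# Data resulting from anonymisation and pseudonymisation (assumed placement).
gdprtext:AnonymousData       rdfs:subClassOf gdprtext:Data .
gdprtext:PseudoAnonymousData rdfs:subClassOf gdprtext:Data .
\end{minted}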
\subsubsection{Concepts about Consent}
The top-level concept of `consent' is represented by \texttt{Consent} in GDPRtEXT, with its definition based on Articles 4(11), 6(1) and Recitals 32, 40. It is sub-classed as \texttt{GivenConsent}, which is a legal basis and is therefore also a sub-class of \texttt{LegalBasis}. \texttt{GivenConsent} is further sub-classed as \texttt{ValidConsent} to indicate `valid consent', which carries obligations of ensuring consent is valid and meets requirements of GDPR, and is therefore also defined as a sub-class of \texttt{ObligationForObtainingConsent}. Obligations regarding conditions of valid consent are represented by sub-classing \texttt{ValidConsent} to indicate that consent is freely given, informed, specific, voluntary, and opt-in.
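The multiple inheritance described above can be sketched in Turtle as follows. The sub-class relations follow the description in the text, while the namespace IRI is an assumption and the two condition classes shown are taken from the names listed later in \autoref{table:gdprtext:eval-cq}.
\begin{minted}{turtle}
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
# Assumed namespace; substitute the IRI published with GDPRtEXT.
@prefix gdprtext: <https://w3id.org/GDPRtEXT#> .

# Given consent is both a kind of consent and a legal basis.
gdprtext:GivenConsent
    rdfs:subClassOf gdprtext:Consent ,
                    gdprtext:LegalBasis .

# Valid consent is given consent that also carries obligations.
gdprtext:ValidConsent
    rdfs:subClassOf gdprtext:GivenConsent ,
                    gdprtext:ObligationForObtainingConsent .

# Conditions of valid consent specialise ValidConsent.
gdprtext:FreelyGivenConsentObligation
    rdfs:subClassOf gdprtext:ValidConsent .
gdprtext:InformedConsentObligation
    rdfs:subClassOf gdprtext:ValidConsent .
\end{minted}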
\subsubsection{Concepts about Entities}
\texttt{Entity} represents an `entity' which could be an individual, institution, company, corporation, partnership, or government agency - to name a few.
It is sub-classed to indicate entities specifically mentioned in GDPR: Data Subject, Controller, Processor, Sub-Processor, Data Protection Officer (DPO), and Data Protection Authority (DPA). Additionally, relevant concepts associated with entities are also defined: Representative of Controller, Representative of Processor, Certification Body, and Regulatory Authority.
\subsubsection{Concepts about Activities}
`Activity' refers to some process or action mentioned, referred to, implied, or defined by requirements of GDPR compliance. To represent these, GDPRtEXT defines activities regarding consent and personal data processing, as well as other activities related to functioning of GDPR, such as reporting a data breach and demonstrating consent. The top-level concept `Activity' represents the abstraction of all activities. `ConsentActivity' and `DataActivity' represent specialised activities involving consent and personal data respectively.
Consent activities defined within GDPRtEXT consist of obtaining consent and withdrawing consent. Data activities include use, archival, collection, cross-border transfer, erasure, copying, rectifying, sharing, and storage of personal data. Of these, the activity associated with use of personal data is treated as synonymous with the commonly used term `processing'. Activities for indicating the context of processing include automated processing, automated decision making with significant effects, confirming or matching datasets, large scale processing, processing affected or vulnerable individuals, processing sensitive data, processing using untested technologies, and unlawful processing.
GDPRtEXT also provides activities associated with reporting of data breaches, which include obligations and actions such as: report data breach, maintain record of breach, notify data subject of breach, report breach to controller (for processors), and report breach to DPA within 72 hours. Other activities provided are: security of personal data, appointment of processors, demonstrating consent, exercise rights, identification of data subject, impact assessment, marketing, direct marketing, monitor compliance, propagate rights to third parties, and systematic monitoring.
\subsubsection{Concepts about Compliance}
Concepts associated with compliance are provided to indicate actions or terms used in the process of maintaining, documenting, evaluating, and demonstrating compliance. The top-level concept \texttt{Compliance} represents an abstract notion of compliance. Other terms derived from this include Demonstration of Consent, Monitor Compliance, and Report Data Breach.
\subsubsection{Concepts about Principles}
GDPRtEXT represents principles using top-level concept \texttt{Principle}, which is specialised to indicate principles associated with: Accountability; Accuracy; Data Minimisation; Integrity and Confidentiality; Lawfulness, Fairness, and Transparency; Purpose Limitation; and Storage Limitation.
\subsubsection{Concepts about Rights}
To represent rights, GDPRtEXT provides top-level concepts representing each individual right, with further concepts associated with each right represented as sub-classes.
The right of data portability is represented by \texttt{RightOfDataPortability} with related concepts regarding: providing copy of personal data, commonly used data format, machine readable format, structured, and supporting reuse.
The right of erasure is represented by \texttt{RightOfErasure} with related concepts provided regarding obligation to erase data when consent is withdrawn, or when data is no longer needed for original purpose. The right to access personal data is represented by concept \texttt{RightToAccessPersonalData} with related concepts for indicating if and where controller is processing data, whether there is automated processing with significant effects on data subject, categories of data being processed, categories of recipients data is shared with, existence of rights, information about processing, source of data, storage period, and ensuring no charges are levied for provision of rights.
\texttt{RightToBasicInformationAboutProcessing} represents the right to basic information about processing and is accompanied by its related concept regarding information about third parties. The concept \texttt{RightToRestrictProcessing} represents the right to restrict processing, and is accompanied by conditions such as: accuracy is contested, data is no longer needed for original purpose, and processing is unlawful. The right to transparency is represented by \texttt{RightToTransparency} with related concepts regarding the conditions of being concise, easily accessible, intelligible, and transparent. Other represented rights include: right to not be evaluated through automated processing, right to object to direct marketing, right to object to processing, and right of rectification.
\subsubsection{Concepts about Obligations}
GDPRtEXT defines concepts regarding obligations of controllers, processors, DPOs, consent, and compliant processing of personal data based on a legal basis. Obligations of controllers are represented by \texttt{ControllerObligation} with related concepts provided regarding appointment of processors, accountability, controller responsibility, co-operation with DPA, data protection by design and default, data security, liability of joint controller(s), maintaining records of processing activities, privacy by design, propagate rights to third parties, and reporting data breach.
Obligations of processors are represented by \texttt{ProcessorObligation} with related concepts for appointing sub-processors, assisting in complying with rights, compliance with controller's instructions, co-operating with DPA, data security, imposing confidentiality on personnel, informing controller of conflict with law, maintaining records of processing activities, only acting on documented instructions, propagating rights to third parties, providing controller with information for compliance, reporting data breach to controller, restrictions on cross-border transfers, and returning or destroying personal data at end of term.
The concept \texttt{DPOObligation} represents obligations of a DPO, which include the monitoring of compliance represented by \texttt{MonitoringCompliance}. The obligations related to lawful basis for processing are represented by \texttt{LawfulBasisForProcessing} along with related concepts for contract with data subject, exempted by national law, employment law, given consent, historic, statistical, or scientific purposes, legal claims, legal obligation, legitimate interest, made public by data subject, medical or diagnostics use, not for profit organisation, public interest, purpose of new processing, and vital interest.
Obligations regarding valid consent are represented by \texttt{ValidConsent} with related concepts provided to indicate consent should be freely given, informed, specific, voluntary, and opt-in.
Obligations for obtaining consent are represented by \texttt{ObligationForObtainingConsent} and include concepts for information about third parties, indicating that consent can be withdrawn easily, and conditions regarding how consent is obtained: the request should be clear, should provide an explanation of processing, should be distinguishable from other matters, and consent should not be inferred from silence or inactivity, should be demonstrable, and should be valid.
Obligations for data collection are represented by \texttt{ObligationForDataCollection}, which is accompanied with related concepts for indicating accurate collection, specification of explicit purpose, ensuring legitimate purpose, ensuring it is not further processed than original purpose, and ensuring it is limited to specified purpose.
Obligations for retention of personal data are represented by \texttt{ObligationForRetentionOfPersonalData} and include related concepts about retention of personal data, ensuring it is adequate for processing, ensuring it is identifiable for required processing, obligation to keep it up to date, ensuring it is limited for processing, obligation to rectify inaccuracies, and ensuring it is relevant for processing.
The concept \texttt{ObligationForSecurityOfPersonalData} represents obligations regarding security of personal data with related concepts provided regarding accidental loss, damage, destruction, and unlawful processing.
\subsubsection{Concepts about Seals and Certifications}
GDPRtEXT provides concepts of \texttt{Seal} and \texttt{Certification} for representing seals and certifications as provided by GDPR to assist with maintenance and demonstration of compliance.
The conditions of these are represented by \texttt{ConditionsForSealsAndCertifications}, which is further expanded to represent conditions for seal/certification such as having a maximum validity of 3 years and having a voluntary system of accreditation.
\subsubsection{Example Use-Case: Compliance Reporting}
This example use-case, taken from documentation of GDPRtEXT \cite{pandit_gdprtext_2018}, shows how references to GDPR can aid in creation of reports which document information regarding compliance.
Consider a system for creation of compliance reports that stores information related to obligations it addresses from GDPR. It uses the EARL\footnote{\url{https://www.w3.org/TR/EARL10-Schema/}} vocabulary for expressing results of conformance checks within the report. GDPRtEXT is used to link resources in EARL reports with articles and points within GDPR and to express and define concepts related to compliance in a suitable and comprehensible manner. Through this, information about compliance checks is linked and associated with specific articles of GDPR.
EARL provides a standardised vocabulary to describe specific resources and relationships that are relevant to test reporting. The core construct of EARL is an \texttt{Assertion}, which describes context and outcome of an individual test execution. It uses following concepts (copied verbatim from EARL specification):
\begin{itemize}
\item \texttt{Assertor} - This can include information about who or what ran the test. For example human evaluators, automated accessibility checkers, or combinations of these.
\item \texttt{Test Subject} - This can include web content (such as web pages, videos, applets, etc.), software (such as authoring tools, user agents, etc.), or other things being tested.
\item \texttt{Test Criterion} - What are we evaluating test subject against? This could be a specification, a set of guidelines, a test from a test suite, or some other testable statement.
\item \texttt{Test Result} - What was the outcome of test? A test result could also include contextual information such as error messages or relevant locations.
\end{itemize}
Taking the example of the Right to Data Portability, the EARL report in \autoref{earl} represents compliance checks for conditions associated with the linked article in GDPR (Article 20). The compliance system has a module \texttt{ex:system\_dataportability} that checks the software handling provision of a copy of personal data, \texttt{ex:org\_dataportability}, through the test case \texttt{ex:test\_provide\_data\_copy}, and generates a report showing that the test has passed through \texttt{ex:result\_pass}.
\begin{listing}
\begin{minted}{turtle}
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix earl:     <http://www.w3.org/ns/earl#> .
@prefix dct:      <http://purl.org/dc/terms/> .
@prefix gdprtext: <http://purl.org/adaptcentre/resources/GDPRtEXT#> .
@prefix ex:       <http://example.com/phd-thesis#> .

ex:org_dataportability
    a earl:TestSubject, earl:Software ;
    dct:description "System that handles data portability requests"@en ;
    dct:title "Data Portability Handler"@en .

ex:system_dataportability
    a earl:Assertor ;
    dct:description "Module checking data portability obligations"@en ;
    dct:hasVersion "1.4" ;
    dct:title "DataPortability Module"@en ;
    earl:asserts [ a earl:Assertion ;
        rdf:subject ex:org_dataportability ;
        rdf:predicate ex:result_pass ;
        rdf:object ex:test_provide_data_copy ] .

ex:result_pass
    a earl:ResultProperty ;
    earl:date "2018-01-01" ;
    earl:validity earl:Pass ;
    earl:confidence earl:High .

ex:test_provide_data_copy
    a earl:TestCase ;
    earl:testMode earl:automatic ;
    dct:title "Test provision of data copy"@en ;
    dct:description "Tests data portability"@en ;
    dct:subject gdprtext:article20 .
\end{minted}
\caption{Use of GDPRtEXT to link tests with GDPR Articles in EARL report}
\label{earl}
\end{listing}
To gather related resources together, a SPARQL query (simplified) would focus on link between \texttt{TestCase} and its result using \texttt{earl:validity}, as shown in \autoref{code:voc:gdprtext-sparql}.
These tests can be further combined into test suites to group compliance checks related to each article or a particular concept and structure documentation around this form of logical grouping of concepts.
In this manner, use of GDPRtEXT to link tests and results with documentation enables automation of information retrieval and management.
A similar use-case of GDPRtEXT in linking constraints and their outcomes with GDPR is demonstrated in \autoref{chapter:testing}.
\begin{listing}
\begin{minted}{sparql}
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX earl: <http://www.w3.org/ns/earl#>
PREFIX dct:  <http://purl.org/dc/terms/>

SELECT ?gdpr ?result ?confidence ?mode WHERE {
    ?assertor a earl:Assertor .
    ?assertor earl:asserts ?assertion .
    # The assertion refers to its test case through rdf:object
    ?assertion rdf:object ?testcase .
    ?testcase a earl:TestCase .
    ?testcase dct:subject ?gdpr .
    ?testcase earl:testMode ?mode .
    # The assertion refers to its result through rdf:predicate
    ?assertion rdf:predicate ?testresult .
    ?testresult a earl:ResultProperty .
    ?testresult earl:validity ?result .
    ?testresult earl:confidence ?confidence .
}

# | gdpr      | result | confidence | mode      |
# -----------------------------------------------
# | article16 | pass   | low        | automatic |
# | article17 | pass   | high       | automatic |
# | article18 | fail   | high       | manual    |
# | article19 | pass   | high       | automatic |
\end{minted}
\caption{SPARQL query and results showing retrieved GDPR test results by article}
\label{code:voc:gdprtext-sparql}
\end{listing}
\subsubsection{Example Use-Case: Mapping between DPD and GDPR obligations}
The second application of GDPRtEXT, taken from its publication \cite{pandit_gdprtext_2018}, demonstrates linking of obligations between GDPR and its predecessor, the Data Protection Directive (DPD). Given that DPD was adopted in 1995 and was superseded by GDPR, which was adopted in 2016, a large number of solutions and approaches regarding compliance with DPD already exist and are used in practice. By linking obligations between DPD and GDPR it is possible to investigate reuse of these existing solutions for GDPR compliance. To that end, a mapping from DPD obligations to GDPR obligations, containing annotations that describe the nature of changes, is constructed by linking articles of DPD and GDPR.
To model annotations as an RDF resource, a linked data version of DPD was created, similar to GDPRtEXT, by assigning URIs to every clause in the legislation. This enabled referring to each individual clause in DPD and linking it with relevant clauses in GDPR.
The annotations (available online\footnote{\url{https://openscience.adaptcentre.ie/projects/GDPRtEXT/dpd_mapping.html}}) consist of references from a clause in DPD to its corresponding clause in GDPR with an expression of change between the two. The nature of change is represented by values: same - indicating no change; reduced - indicating reduction of obligation; slightly changed - indicating minor change; completely changed - indicating major change; and extended - indicating addition of obligations.
The example demonstration used XACML\footnote{\url{https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml}} rules that control access to data and were modelled on DPD obligations.
For each link between DPD and GDPR obligations, a record was created indicating whether the corresponding XACML rules for DPD compliance needed to be changed to be applicable for GDPR. The notation \texttt{N/A} was used to denote cases where no XACML rules existed for a particular DPD obligation and where corresponding obligations in GDPR had changed or had additional requirements.
\begin{listing}[htbp]
\begin{minted}{turtle}
@prefix gdpr: <https://w3id.org/GDPRtEXT/gdpr#> .
@prefix dpd:  <https://w3id.org/GDPRtEXT/dpd#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

dpd:mappingrule6
    a dpd:DPDToGDPR_Annotation ;
    dpd:hasChange dpd:ChangeExtended ;
    dpd:hasXACMLChange dpd:XACMLNoChange ;
    dpd:resourceInDPD dpd:Article7-a ;
    dpd:resourceInGDPR gdpr:Article6-1-a ;
    rdfs:comment "added consent given to ..." .
\end{minted}
\caption{Example annotation of associating existing DPD compliance XACML rules with requirements of GDPR}
\label{code:voc:gdprtext-xacml}
\end{listing}
% The value \texttt{No} was used to indicate no changes in the GDPR obligation compared to the DPD obligation, so that the existing XACML rule would be sufficient to meet GDPR requirements. Similarly \texttt{Yes} was used to indicate a change required in the XACML rule to handle the obligation.
The class \texttt{DPDToGDPR\_Annotation} represents annotations between DPD and GDPR, with an example instance depicted in \autoref{code:voc:gdprtext-xacml}. The property \texttt{resourceInDPD} is used to refer to a specific clause within DPD using its IRI. Similarly, the property \texttt{resourceInGDPR} is used to refer to a corresponding clause in GDPR. The nature of change is defined using property \texttt{hasChange} whose value is an instance of class \texttt{ChangeInObligation}, with instances defined for \textit{Extended, Same, Reduced, CompletelyChanged}, and \textit{SlightlyChanged}. Similarly, a change in XACML rules is defined using class \texttt{ChangeInXACMLRule} with instances \textit{Yes, No}, and \textit{N/A}.
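The vocabulary behind such annotations could be declared along the following lines. This is a sketch under assumptions rather than the published schema: only \texttt{dpd:ChangeExtended} and \texttt{dpd:XACMLNoChange} appear in \autoref{code:voc:gdprtext-xacml}, the remaining instance names are hypothetical and merely follow the same naming pattern, and the domain and range axioms are illustrative.
\begin{minted}{turtle}
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix dpd:  <https://w3id.org/GDPRtEXT/dpd#> .

dpd:DPDToGDPR_Annotation a owl:Class .
dpd:ChangeInObligation   a owl:Class .
dpd:ChangeInXACMLRule    a owl:Class .

# Nature of change between a DPD clause and its GDPR counterpart.
dpd:hasChange
    a owl:ObjectProperty ;
    rdfs:domain dpd:DPDToGDPR_Annotation ;
    rdfs:range  dpd:ChangeInObligation .

# Whether the existing XACML rule needs to change.
dpd:hasXACMLChange
    a owl:ObjectProperty ;
    rdfs:domain dpd:DPDToGDPR_Annotation ;
    rdfs:range  dpd:ChangeInXACMLRule .

dpd:ChangeExtended a dpd:ChangeInObligation .  # used in the example
dpd:ChangeSame     a dpd:ChangeInObligation .  # hypothetical name
dpd:ChangeReduced  a dpd:ChangeInObligation .  # hypothetical name

dpd:XACMLNoChange  a dpd:ChangeInXACMLRule .   # used in the example
\end{minted}
Modelling the change values as named individuals rather than string literals allows annotations to be grouped by kind of change with a simple triple pattern.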
\subsection{Evaluation}\label{sec:voc:gdprtext:evaluation}
In terms of ontology assessment, the methodology outlined in \autoref{sec:voc:methodology} provides criteria for evaluation of ontology quality and documentation. GDPRtEXT fulfils these by using the OOPS! tool\footnote{OOPS! results published with ontology documentation. The results can also be independently obtained using the OOPS! online service.} to identify and rectify bad design patterns, and by following best practices and community guidelines for ontology documentation.
GDPRtEXT and the work described in this section were published \cite{pandit_gdprtext_2018} in the Resource Track of the Extended Semantic Web Conference (ESWC). The publication described the creation of the resource, summarised its contents, and described the mapping of DPD obligations to GDPR using a linked data approach and XACML to denote which obligations from DPD could be re-used towards GDPR compliance.
ESWC is a premier and top-tier conference within semantic web domain, and has a rigorous review process with an open review policy.
The acceptance of GDPRtEXT in this venue demonstrates its value as a semantic web resource.
To date, the publication has received 19 citations from peer-reviewed publications (excluding self-citations) on Google Scholar\footnote{\url{https://scholar.google.com/scholar?cites=2776106745007214232}}.
In addition to these, the 5-star rating given to GDPRtEXT as a dataset in the Irish open data portal indicates its adherence to linked data principles.
% The publications associated with PrOnto \cite{palmirani_pronto_2018,palmirani_pronto_compliance_2018} cite GDPRtEXT as a resource of GDPR concepts and comment on modelling of the norms and the legal axioms - which are not within the scope of GDPRtEXT or this thesis. It also mentions the lack of FRBR
% \footnote{Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity–relationship model which allows expressing legal text as an abstract document with expressions in different languages and manifestations in different representations. It was adopted for use in ELI in the legislation passed on 6 November 2017 (OJ C 441, 22.12.2017, p. 8–12 \url{https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52017XG1222(02)}).}
% information for managing versioning of the legal text over the time.
% However, it must be noted that since the ELI ontology itself uses FRBR and that GDPRtEXT extends ELI, it is capable of supporting the FRBR concepts as well - but does not provide them since the aim of the work is to enable granular references to its clauses rather than a provision of its text in multiple languages and representations.
A survey of legal approaches within state of the art \cite{leone_taking_2019} undertaken by MIREL project analysed GDPRtEXT amongst other legal ontologies and found that GDPRtEXT is singular in its use of ELI and provision of GDPR as a glossary of concepts - a finding shared with the analyses of SotA in \autoref{sec:sota:analysis}.
\subsubsection{Fulfilment of Competency Questions}
The assessment of GDPRtEXT consists of evaluating the extent to which it answers competency questions outlined in \autoref{sec:voc:gdprtext-engineering}.
For this, \autoref{table:gdprtext:eval-cq} shows the concepts and relationships of GDPRtEXT relevant to answering each competency question.
\begin{center}
\footnotesize
\begin{tabularx}{\textwidth}{|l|X|}
\caption{Concepts in GDPRtEXT for answering competency questions} \label{table:gdprtext:eval-cq} \\
\toprule
\textbf{CQ} & \textbf{Concepts/Relationships} \\
\midrule
\endfirsthead
\caption*{Concepts in GDPRtEXT for answering competency questions (cont'd)} \\
\toprule
\textbf{CQ} & \textbf{Concepts/Relationships} \\
\midrule
\endhead
\midrule
\multicolumn{2}{r@{}}{\footnotesize (Cont'd on following page)}\\
\endfoot
\endlastfoot
% & \textbf{Concepts associated with structure of GDPR} \\ \hline
\textit{CQ1-7} & \textit{Recital, Chapter, Section, Article, Point, SubPoint, Reference, Citation} \\ \hline
% \textit{CQ2} & \textit{Chapter} \\ \hline
% \textit{CQ3} & \textit{Section} \\ \hline
% \textit{CQ4} & \textit{Article} \\ \hline
% \textit{CQ5} & \textit{Point} \\ \hline
% \textit{CQ6} & \textit{SubPoint} \\ \hline
% \textit{CQ7} & \textit{Reference}, \textit{Citation} \\ \hline
\textit{CQ8} & \textit{isPartOfChapter} \\ \hline
\textit{CQ9,11} & \textit{rdfs:isDefinedBy [Article, Point, SubPoint]} \\ \hline
\textit{CQ10} & \textit{:\_ hasPart/isPartOf :\_} \\ \hline
% \textit{CQ11} & \textit{Accountability} \\ \hline
\textit{CQ12} & \textit{:GivenConsent rdfs:seeAlso [Article, Point, SubPoint]} \\ \hline
% \textit{CQ13} & \textit{GivenConsent} \\ \hline
\textit{CQ13,14} & \textit{GivenConsent/Compliance :involves [Article, Point, SubPoint]} \\ \hline
% & \textbf{Concepts associated with GDPR compliance} \\ \hline
\textit{CQ15} & \textit{Data}, \textit{PersonalData}, \textit{SensitivePersonalData}, \textit{CriminalData}, \textit{GeneticData}, \textit{HealthData}, \textit{RacialData}, \textit{AnonymousData}, \textit{PseudoAnonymousData} \\ \hline
\textit{CQ16} & \textit{Consent}, \textit{GivenConsent}, \textit{WithdrawnConsent} \\ \hline
\textit{CQ17} & \textit{Entity}, \textit{DataSubject}, \textit{Controller}, \textit{JointController}, \textit{Processor}, \textit{SubProcessor}, \textit{DPO}, \textit{DPA}, \textit{ControllerRepresentative}, \textit{ProcessorRepresentative}, \textit{CertificationBody}, \textit{RegulatoryAuthority} \\ \hline
\textit{CQ18,19} & \textit{DataActivity, ConsentActivity} \\ \hline
% \textit{CQ19} & \textit{ConsentActivity} \\ \hline
\textit{CQ20} & \textit{Processing}, \textit{AutomatedProcessing}, \textit{AutomatedDecisionMakingWithSignificantEffect}, \textit{ConfirmingOrMatchingDatasets}, \textit{LargeScaleProcessing}, \textit{ProcessingAffectedOrVulnerableIndividuals}, \textit{ProcessingSensitiveData}, \textit{ProcessingUsingUntestedTechnologies}, \textit{UnlawfulProcessing} \\ \hline
\textit{CQ21} & \textit{ReportDataBreach}, \textit{MaintainRecordOfBreach}, \textit{NotifyDataSubjectOfBreach}, \textit{ReportBreachToController}, \textit{ReportBreachToDPAWithin72Hours} \\ \hline
\textit{CQ22} & \textit{Compliance}, \textit{DemonstrationOfConsent}, \textit{MonitorCompliance}, \textit{ReportDataBreach} \\ \hline
\textit{CQ23} & \textit{Principle}, \textit{Accountability}, \textit{Accuracy}, \textit{DataMinimisation}, \textit{IntegrityAndConfidentiality}, \textit{LawfulnessFairnessAndTransparency}, \textit{PurposeLimitation}, \textit{StorageLimitation} \\ \hline
\textit{CQ24} & \textit{Rights}, \textit{RightOfDataPortability}, \textit{RightOfErasure}, \textit{RightToAccessPersonalData}, \textit{RightToTransparency}, \textit{RightToBasicInformationAboutProcessing}, \textit{RightToNotBeEvaluatedThroughAutomatedProcessing}, \textit{RightToObjectForDirectMarketting}, \textit{RightToObjectToProcessing}, \textit{RightToRectify}, \textit{RightToRestrictProcessing} \\ \hline
\textit{CQ25} & \textit{RightOfDataPortability}, \textit{ProvideCopyOfPersonalData}, \textit{ShouldBeCommonlyUsedFormat}, \textit{ShouldBeMachineReadable}, \textit{ShouldBeStructured}, \textit{ShouldSupportReuse} \\ \hline
\textit{CQ26} & \textit{RightToBasicInformationAboutProcessing}, \textit{InformationAboutThirdParties} \\ \hline
% \textit{CQ27} & \textit{Obligation} \\ \hline
\textit{CQ27,28} & \textit{Obligation}, \textit{ControllerObligation}, \textit{AppointmentOfProcessors}, \textit{Accountability}, \textit{ControllerResponsibility}, \textit{CooperateWithDPA}, \textit{DataProtectionByDesignAndDefault}, \textit{DataSecurity}, \textit{LiabilityOfJointControllers}, \textit{MaintainRecordsOfProcessingActivities}, \textit{PrivacyByDesign}, \textit{PropogateRightsToThirdParties}, \textit{ReportDataBreach} \\ \hline
\textit{CQ29} & \textit{ProcessorObligation}, \textit{AppointingSubprocessors}, \textit{AssistInComplyingWithRights}, \textit{ComplianceWithControllersInstructions}, \textit{CooperateWithDpa}, \textit{DataSecurity}, \textit{ImposeConfidentialityOnPersonnel}, \textit{InformControllerOfConflictWithLaw}, \textit{MaintainRecordsOfProcessingActivities}, \textit{OnlyActOnDocumentedInstructions}, \textit{PropogateRightsToThirdParties}, \textit{ProvideControllerWithInformationForCompliance}, \textit{ReportDataBreachToController}, \textit{RestrictionsOnCross-borderTransfers}, \textit{ReturnOrDestroyPersonalDataAtEndTerm} \\ \hline
\textit{CQ30} & \textit{DPOObligation}, \textit{MonitorCompliance} \\ \hline
\textit{CQ31} & \textit{LawfulBasisForProcessing}, \textit{ContractWithDataSubject}, \textit{ExemptedByNationalLaw}, \textit{EmploymentLaw}, \textit{GivenConsent}, \textit{HistoricStatisticalOrScientificPurposes}, \textit{LegalClaims}, \textit{LegalObligation}, \textit{LegitimateInterest}, \textit{MadePublicByDataSubject}, \textit{MedicalDiagnosticOrTreatement}, \textit{NotForProfitOrg}, \textit{PublicInterest}, \textit{PurposeOfNewProcessing}, \textit{VitalInterest} \\ \hline
\textit{CQ32} & \textit{ValidConsent}, \textit{FreelyGivenConsentObligation}, \textit{InformedConsentObligation}, \textit{SpecificConsentObligation}, \textit{VoluntaryOptInConsentObligation} \\ \hline
\textit{CQ33} & \textit{ObligationForDataCollection}, \textit{AccurateCollection}, \textit{ExplicitPurpose}, \textit{LegitimatePurpose}, \textit{NotFurtherProcessedThanOriginalPurpose}, \textit{SpecifiedPurpose} \\ \hline
\textit{CQ34} & \textit{InformationAboutThirdParties}, \textit{ConsentCanBeWithdrawnEasily}, \textit{ClearExplanatinOfProcessing}, \textit{NotFromSilenceOrInactivity}, \textit{Demonstrable}, \textit{DistinguishableFromOtherMatters}, \textit{ValidConsent} \\ \hline
\textit{CQ35} & \textit{RetentionOfPersonalData}, \textit{AdequateForProcessing}, \textit{IdentifiableForRequiredProcessing}, \textit{KeptUpToDate}, \textit{LimitedForProcessing}, \textit{RectifyInaccuracies}, \textit{RelevantForProcessing} \\ \hline
\textit{CQ36} & \textit{SecurityofPersonalData}, \textit{AccidentalLoss}, \textit{Damage}, \textit{Destruction}, \textit{UnlawfulProcessing} \\ \hline
\textit{CQ37} & \textit{Seal}, \textit{Certification} \\
\bottomrule
\end{tabularx}
\end{center}
The table demonstrates that GDPRtEXT provides concepts to answer all competency questions. GDPRtEXT thus meets requirements of representing and linking information with text and concepts of GDPR in a granular manner and fulfils $RO3(a)$.
\subsubsection{Comparison with SotA}
The SotA in representing text of GDPR in machine-readable formats presented in \autoref{sota:analysis:representation} compared three approaches: ELI \cite{thomas_european_2019}, Agarwal et al \cite{agarwal_legislative_2018}, and PrOnto \cite{palmirani_pronto_2018,palmirani_pronto_compliance_2018}.
Their comparison and analysis, summarised in \autoref{table:sota:analysis:GDPR}, depicts relevance of each approach in representing the GDPR as a glossary of concepts, providing a permanent identifier for resources, modelling of GDPR's text, and whether resources are open and accessible.
The conclusion drawn from these is the lack of an approach fulfilling all criteria along with a lack of open and reusable resources concerning GDPR.
The additional resource of ELI+ mentioned in the analysis shows the intention of the EU Publications Office to remedy this gap through an update to the ELI ontology at some point in the future.
A comparison of GDPRtEXT with these approaches, depicted in \autoref{table:gdprtext:sota}, shows that GDPRtEXT provides a glossary of concepts, uses permanent identifiers, provides linked data version of text of GDPR, and is available under an open and permissive license (CC-BY-4.0).
This matches the intended contributions of ELI+ (update to ELI) planned by the EU Publications Office, and therefore enables GDPRtEXT to fill this gap in the meantime.
\begin{center}
\footnotesize
\begin{tabularx}{\textwidth}{|l|>{\columncolor[gray]{0.9}}l|l|l|l|l|}
\caption{Comparison of GDPRtEXT with SotA}\label{table:gdprtext:sota} \\
\toprule
Work & \textbf{GDPRtEXT} & ELI & ELI+ & Agarwal et al & PrOnto \\
\midrule
\endhead
Vocabulary & ELI & OWL2 & OWL2 & RDFS & Akoma Ntoso \\ \hline
Granularity & Sub-Paragraph & Legislation & Sub-Paragraph & Paragraph & Sub-Paragraph \\ \hline
Glossary & \cmark & \xmark & \cmark & \xmark & \xmark \\ \hline
PID & \cmark & \cmark & \cmark & \xmark & \xmark \\ \hline
OA & \cmark & \cmark & \cmark & \xmark & \xmark \\ \hline
GDPR text & \cmark & \xmark & \cmark & \xmark & \cmark \\
\bottomrule
\end{tabularx}
\end{center}
A survey of legal ontologies by Leone et al. \cite{leone_taking_2019} includes GDPRtEXT as an ontology relevant for data protection. The survey also includes ELI and PrOnto within the scope of data protection ontologies - which provides external comparison between these and GDPRtEXT. The survey outlines the role of GDPRtEXT in acting as a glossary of concepts rather than a prescriptive set of norms and rules for specification of compliance - such as made available through PrOnto. In this role, GDPRtEXT is novel within state of the art given a lack of other similar resources.
Based on this, GDPRtEXT is argued to provide a novel contribution to the state of the art by addressing gaps associated with representation of concepts and GDPR text at a granular level, with its open availability enabling usage and adoption.
\subsection*{Summary}
The GDPRtEXT resource represents the first major contribution of this thesis. It provides a linked data version of text of GDPR and a glossary of its concepts, fulfils research objective $RO3(a)$, and assists with research objective $RO5(b)$ - as outlined in \autoref{sec:intro:RQ}. It enables representing each article or point within GDPR as a unique resource through IRIs defined using RDF and semantic web.
GDPRtEXT thus enables machine-readable links to be established between information and clauses of GDPR as well as concepts pertaining to its compliance.
The use of GDPRtEXT makes it possible to create approaches that automate generation and querying of information associated with GDPR - such as for compliance, management of business processes, or generation of privacy policies. The compatibility provided through extension of ELI ontology ensures alignment with official documents produced by European Publications Office.
Finally, GDPRtEXT fills an important gap in the state of the art regarding machine-readable approaches for linking information with legal text.
GDPRtEXT has been released as an open resource, has been published in Zenodo and Datahub, and has been incorporated into Ireland’s open data portal as a 5-star linked open dataset.
% GDPRov
\section{GDPRov - Ontology for GDPR activities associated with Personal Data and Consent}\label{sec:voc:GDPRov}
This section describes the GDPRov ontology for representing activities in ex-ante and ex-post phases associated with processing of personal data and consent for GDPR compliance. GDPRov stands for GDPR Provenance - a reference to the requirement of maintaining provenance information of processes in both ex-ante and ex-post phases for demonstrating GDPR compliance. This section presents motivation, overview, dissemination, and evaluation of GDPRov ontology. It also presents comparisons with relevant approaches in state of the art.
The ontology satisfies the research objective $RO3(b)$ presented in \autoref{sec:intro:RQ}.
It uses the compliance questions presented in \autoref{sec:info:compliance-questions} as competency questions to identify requirements and for evaluation.
GDPRtEXT is used to define and associate the source of concepts within the text of GDPR.
An earlier version (v0.4) of GDPRov was described in a peer-reviewed publication \cite{pandit_modelling_2017}.
Subsequent revisions included addition of new concepts associated with real-world implementation and interpretation of GDPR compliance requirements (see \autoref{sec:testing:sparql}) and for representing information about consent mechanisms on the internet (see \autoref{sec:testing:shacl}).
The latest version of GDPRov (v0.7) is available online\footnote{\url{http://w3id.org/GDPRov}} with its documentation and code repository\footnote{\url{https://github.com/coolharsh55/GDPRov}}.
\subsection{Identification of requirements from competency questions}\label{sec:gdprov:cq}
The compliance questions presented in \autoref{sec:info:compliance-questions} were selected based on relevance to information regarding activities and provided competency questions for deriving concepts and relationships regarding processes associated with personal data and consent based on GDPR compliance requirements.
These concepts and relationships were collected, combined, and analysed to ensure their cohesion as an ontology and evaluated using compliance questions to ensure they satisfied requirements regarding GDPR compliance and documentation of associated processes.
In this, ex-ante and ex-post representation provide repetition of some information as most processes have counterparts in both phases. The linking of information between phases enables them to be documented in a manner so as to demonstrate prior planning of processes to ensure compliance and later their execution which also needs to be documented for compliance.
Therefore, while GDPR requirements and compliance questions do not explicitly mention or refer to ex-ante and ex-post phases for each activity, GDPRov implicitly considers each activity to have representations in both phases.
The sub-sections below present concepts to answer competency questions derived from compliance questions.
This is followed by an analysis of discovered concepts in ex-ante and ex-post phases.
The analysis is used to derive requirements for construction of GDPRov ontology, and is presented to describe motivations for the ontology design and implementation.
\subsubsection{Actors and Agents involved in activities}
\begin{itemize}
\item \texttt{CMQ2} - Provides concept of \textit{Controller} as an agent controlling processes and its representative \textit{Data Protection Officer (DPO)}.
\item \texttt{CMQ17} - Describes \textit{Processor} as an executor of processes and its representative \textit{DPO}. In this relationship, \textit{Controller} provides processes to be executed as instructions to \textit{Processor} through a \textit{Data Processing Agreement (DPA)}.
\item \texttt{CMQ35} - Describes \textit{Data Subject} as an agent who is associated with provision of personal data, consent, and who is related to exercising of rights.
\end{itemize}
\subsubsection{Details of processing}
\begin{itemize}
\item \texttt{CMQ3} and \texttt{CMQ37} provide concept of \textit{Purpose} which describes purpose of personal data processing. Each purpose can incorporate multiple processing operations, and each processing operation taking place can be associated with multiple purposes.
    \item \texttt{CMQ4} describes the necessity to specify data subject categories whose personal data is being processed.
\item \texttt{CMQ36} describes personal data, while \texttt{CMQ5} describes categories of personal data being processed. \texttt{CMQ34} specifies special categories of personal data as a sub-category of personal data that needs to be explicitly stated as being processed.
\item \texttt{CMQ38} defines processing of personal data as defined by Article 4-2 of GDPR. The GDPR definition of processing provides types of operations as specified by ``\textit{any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;}''.
    \item \texttt{CMQ6} defines sharing of data as a type of processing. Additional information associated with sharing of data is provided by: \texttt{CMQ7} and \texttt{CMQ20} for categories of recipients; \texttt{CMQ8} and \texttt{CMQ21} for identities of recipients; \texttt{CMQ9} and \texttt{CMQ22} for locations to which data is shared; \texttt{CMQ10} and \texttt{CMQ23} for safeguards associated with data transfer; and \texttt{CMQ15} and \texttt{CMQ25} for purposes of sharing, which is the same concept as purpose of processing applied to sharing of personal data.
\item \texttt{CMQ11} defines data storage, with additional concepts provided by \texttt{CMQ12} for existence of time limits or conditions for erasure and \texttt{CMQ13} for specification of time limits or conditions for erasure for categories of data.
    \item \texttt{CMQ26} defines legal basis for justifying processing of personal data, and \texttt{CMQ27} specifies legal basis associated with a particular purpose. Each purpose can have one or more legal bases associated with it.
\end{itemize}
\subsubsection{Life-cycle of data}
\begin{itemize}
\item \texttt{CMQ28} and \texttt{CMQ30} describe source of personal data which in turn implies an activity that collects data and specifies actors or agents as source of data.
\item \texttt{CMQ29} specifically refers to personal data collected from data subject.
\end{itemize}
\subsubsection{Anonymisation}
\begin{itemize}
\item \texttt{CMQ31} specifies anonymisation of personal data with \texttt{CMQ32} inquiring about different `levels' of anonymisation which affect the application of obligations and requirements of compliance.
\item Levels of anonymisation are specified based on their relevance to investigation of compliance, and are defined as data which is: completely anonymised, pseudo-anonymised, not anonymised. In this, data that is pseudo-anonymised can be considered and used as anonymous data when an organisation does not have additional information to de-anonymise it.
\item Processing activities associated with anonymisation and de-anonymisation of personal data are defined to produce anonymised data.
\end{itemize}
\subsubsection{Activities associated with Consent}
\begin{itemize}
\item Regarding consent, \texttt{CMQ48} inquires about activities associated with provision and collection of consent. This includes information about how consent is requested and collected, used within processes as a legal basis, and is archived for future demonstration of compliance.
\item \texttt{CMQ49} and \texttt{CMQ50} inquire about artefacts associated with collection of consent for determination of consent validity under GDPR, which requires investigation of how choices for consent were offered. This also includes forms through which consent is provided or collected from data subjects. Artefacts such as forms or dialogues are associated with processes where consent choices are offered or requested and whose result is collection of consent or given consent.
\end{itemize}
\subsubsection{Provision of Rights}
\begin{itemize}
\item The rights associated with GDPR need processes to internally (in perspective of an organisation) handle their execution as well as for interactions with data subjects. Therefore, such processes need to be defined and documented for compliance purposes.
\item For right to be informed, \texttt{CMQ88 - CMQ105} provide competency questions regarding how the right is provided and how it is executed or implemented.
\item This includes activities associated with provision of information to data subject, artefacts associated with information provision, inclusion of details such as controller and DPO, purposes, processing, legal basis, and personal data categories.
\item It also includes information about sources of personal data (where not obtained directly from data subject), and whether legal basis used is legitimate interest.
\item Regarding data sharing, information to be specified includes categories of recipients and their location.
\item Right to be informed also includes provision of information regarding existence and application of rights.
\item Information associated with right to be informed is common to other information documented in due course of processing of personal data, and therefore does not require a separate notation or representation of this information in order to execute the right. Existing information or concepts for representation of processing activities can be reused for specifying required information. However, activities associated with executing rights need to be defined to demonstrate existence of processes for handling rights.
\end{itemize}
\subsubsection{Compliance procedures such as Reporting of Data Breach}
\begin{itemize}
\item Reporting of data breach requires information about data breach to be maintained as specified by \texttt{CMQ106 - CMQ120}.
\item This includes information about the data breach consisting of: timestamp of when breach occurred (\texttt{CMQ106}), timestamp of when controller became aware of it (\texttt{CMQ107}), timestamp and method of it being notified to supervisory authority (\texttt{CMQ108}).
\item Information about contents of breach include information about its affected personal data and categories of data subject (\texttt{CMQ112}).
\item Information also needs to be provided to supervisory authorities and in some cases to data subjects based on extent of breach (\texttt{CMQ113}). It therefore requires prior planning of processes for handling data breaches and sending information to data subjects along with any remedial measures (\texttt{CMQ116}).
\end{itemize}
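The timestamps listed above lend themselves to a simple automated check. The following is a minimal sketch and not part of GDPRov: it assumes a hypothetical \texttt{BreachRecord} structure with illustrative field names, computes the delay between the controller becoming aware of a breach and its notification to the supervisory authority, and compares the delay against the 72 hour period stated in Article 33(1) of GDPR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Article 33(1) GDPR: notify, where feasible, within 72 hours of becoming aware.
NOTIFICATION_WINDOW = timedelta(hours=72)


@dataclass
class BreachRecord:
    """Hypothetical record of the timestamps asked for by CMQ106 to CMQ108."""
    occurred_at: datetime
    controller_aware_at: datetime
    authority_notified_at: Optional[datetime] = None  # None means not yet notified


def notification_delay(record: BreachRecord) -> Optional[timedelta]:
    """Return the time between awareness and notification, or None if not notified."""
    if record.authority_notified_at is None:
        return None
    return record.authority_notified_at - record.controller_aware_at


def notified_within_window(record: BreachRecord) -> bool:
    """True only if a notification exists and was sent within the 72 hour window."""
    delay = notification_delay(record)
    return delay is not None and delay <= NOTIFICATION_WINDOW


record = BreachRecord(
    occurred_at=datetime(2019, 3, 1, 9, 0),
    controller_aware_at=datetime(2019, 3, 2, 10, 0),
    authority_notified_at=datetime(2019, 3, 4, 16, 0),
)
print(notification_delay(record), notified_within_window(record))
```

A delay beyond the window does not by itself establish non-compliance, since Article 33(1) permits a later notification when it is accompanied by reasons for the delay. The check is therefore better read as a flag for further review than as a verdict.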
\subsubsection{Specifying requirements for ex-ante and ex-post phases}
Process logs are a convenient and demonstrable form of information to store and document compliant processing of personal data. By verifying logs, it is possible to document, evaluate, and demonstrate that executed processes were compliant with requirements of GDPR. This constitutes the ex-post phase of compliance and consists of evaluating information after processing has been carried out. It also fulfils Article 30 of GDPR concerning processing records to be maintained. Along with ex-post records, it is also essential to demonstrate that executed processes were based on a preconceived plan or template that was ensured to be compliant before execution. Storing such plans is essential to demonstrate prior planning and maintenance of a compliant processing system. This constitutes the ex-ante phase of compliance, and consists of evaluating compliance of plans for processing yet to be carried out. This is necessary for Article 35 of GDPR concerning carrying out a DPIA.
Associating executed processes with their plans allows demonstration of compliance throughout the life-cycle of processes, i.e. from planning of processes to their eventual execution. It also enables documenting changes in plans and their effects on execution of processes - i.e. demonstrating that when a plan changes it also brings about corresponding changes in executed processes. GDPR compliance requires documentation, maintenance, and demonstration of processes across both ex-ante (planning) and ex-post (execution) phases. Ex-ante plans of processes are described as an organisational measure and their compliance is associated with ensuring processes meet legal requirements before they are actually carried out. In some instances, such as for a Data Protection Impact Assessment (DPIA), existence of ex-ante information about processes is essential in evaluation of compliance.
While compliance questions provide a basis for identifying information to be modelled, requirements of expressing this information in ex-ante and ex-post phase require specification of their intended usage in planning and processing stages respectively which further determines whether compliance evaluation consists of verification of a plan or analysis of processing logs.
The argument and motivation for representing processes in ex-ante and ex-post phases represents a design decision based on separating representation of information across phases of compliance rather than being a compliance requirement itself.
Approaches within state of the art that also follow a similar representation of ex-ante and ex-post information include SPECIAL (\autoref{sec:sota:SPECIAL}) which uses PROV-O to log information in both phases, and MIREL (\autoref{sec:sota:MIREL}) which uses a workflow model to represent a plan and its executions.
Requirements for modelling information about activities are summarised in the following points:
\begin{enumerate}
\item Represent process in ex-post phase as a log or record.
\item Represent process in ex-ante phase as plan or template.
\item Link ex-ante plan with its instantiations or executions in ex-post phase.
\item Track provenance of ex-ante plans i.e. changes in plans of processes.
\item Enable tracking changes in ex-post logs based on corresponding changes in ex-ante plans.
\item Associate information used/generated in activities as artefacts in both ex-ante and ex-post phases.
\item Associate actors/agents with processes.
\item Link processes based on:
\begin{enumerate}
        \item dependency - where one process is dependent on another through use of a generated artefact,
\item order of execution - where one process is or will be executed before or after another, and
\item composition - where one process is constituted by several sub-processes.
\end{enumerate}
\end{enumerate}
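The relationship between dependency and order of execution in the final requirement can be illustrated with a short sketch. It is not part of GDPRov: the step and artefact names are hypothetical, and the code assumes Python 3.9 or later for the standard \texttt{graphlib} module. A step is treated as dependent on another when it uses an artefact the other generates, and one order of execution consistent with these dependencies is then derived by topological sorting.

```python
from graphlib import TopologicalSorter

# Hypothetical ex-ante steps with the artefacts each one uses and generates.
steps = {
    "CollectConsent": {"uses": set(), "generates": {"GivenConsent"}},
    "CollectData": {"uses": {"GivenConsent"}, "generates": {"PersonalData"}},
    "AnonymiseData": {"uses": {"PersonalData"}, "generates": {"AnonymisedData"}},
    "StoreData": {"uses": {"PersonalData"}, "generates": set()},
}


def dependencies(steps):
    """Map each step to the steps that generate an artefact it uses."""
    producers = {}
    for name, io in steps.items():
        for artefact in io["generates"]:
            producers.setdefault(artefact, set()).add(name)
    return {
        name: {p for artefact in io["uses"] for p in producers.get(artefact, set())}
        for name, io in steps.items()
    }


deps = dependencies(steps)
# One order of execution consistent with the dependencies (predecessors first).
order = list(TopologicalSorter(deps).static_order())
print(deps)
print(order)
```

Steps with no dependency between them, such as the anonymisation and storage steps here, may appear in either order, which is why order of execution is listed as a separate kind of link rather than being derived from dependency alone.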
\subsection{Extending PROV-O and P-Plan}
Based on above stated requirements for representing activities or processes in ex-ante and ex-post phases, existing semantic web ontologies of PROV-O \cite{lebo_prov-o_2013} and P-Plan \cite{garijo_p-plan_2014} were extended with relevant GDPR concepts and relationships to create the GDPRov ontology. The necessity of this process and a brief overview of PROV-O and P-Plan ontologies is presented below along with the process of extending ontologies.
\subsubsection{PROV - W3C standard for representing provenance information}
Provenance is information about entities, activities, and people (or software)
involved in producing data or a component which can be used to form an
assessment about its quality, reliability, or trustworthiness. The PROV-O ontology \cite{lebo_prov-o_2013} along with PROV family\footnote{\url{https://www.w3.org/TR/2013/NOTE-prov-overview-20130430/}} of schemas and documents is the W3C recommendation for representing provenance information since 30\textsuperscript{th} April 2013 and has seen significant adoption by semantic web and industrial communities.
It provides definitions for interchange of provenance information by representing entities
and relations between them such as generated by, derived from, and attributions.
The core concepts of PROV-O are summarised in \autoref{fig:prov-o-model} and consist of interactions between \textit{Activities}, \textit{Entities}, and \textit{Agents}.
An \texttt{Entity} in PROV-O is defined as being physical, digital, conceptual, or other
kind of `thing' with some fixed aspects. PROV-O defines an \texttt{Activity} as something
that occurs over a period of time and acts upon or with entities; it may include
consuming, processing, transforming, modifying, relocating, using, or generating
entities.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\linewidth]{img/prov-o-model.png}
\caption{Overview of PROV-O model \cite{lebo_prov-o_2013}}
\label{fig:prov-o-model}
\end{figure}
PROV-O is a generic and domain independent ontology for representing provenance information.
In order for it to be applied in domain of GDPR compliance, it needs to incorporate relevant terminology and enable distinction between different types of activities and entities.
Furthermore, PROV-O as a provenance ontology is intended to represent information about activities that have been executed in the past, and is therefore suitable to represent information only in the ex-post phase.
While PROV-O does provide the concept of \texttt{Plan}\footnote{PROV-O defines a \textit{plan} as a set of actions or steps towards some goal. It notes the absence of further concepts for plans, stating: ``\textit{There exist no prescriptive requirement on the nature of plans, their representation, the actions or steps they consist of, or their intended goals.}''} to represent ex-ante information, it does not provide further concepts or relationships to associate a plan with activities and entities\footnote{PROV-O provides the concept of \texttt{Association} which assigns responsibility to an agent for an activity and indicates that the agent had a role in the activity, which can include a \texttt{Plan} associated using the \texttt{hadPlan} property.}.
In order to adopt PROV-O and use \texttt{Plan} for representing ex-ante information for GDPR compliance, it needs to be extended with additional concepts and relationships.
\subsubsection{P-Plan - extending PROV-O Plans as Workflows}
P-Plan \cite{garijo_p-plan_2014} extends the concept of \texttt{Plan} in PROV-O towards representing scientific
workflows which enable creating a template of a `step' and linking it to executions of activities.
A \texttt{p-plan:Plan} is a subclass of \texttt{prov:Plan} and is composed of smaller activities or steps (\texttt{p-plan:Step}) that use and generate (as inputs or outputs) variables (\texttt{p-plan:Variable}).
An overview of relationship between PROV-O and P-Plan is described in \autoref{fig:p-plan-model}.
P-Plan enables representation of provenance information associated with both ex-ante and ex-post processes by representing them as scientific workflows. It also enables associating plans with their executions, thereby providing a link between ex-ante and ex-post provenance information.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.75\linewidth]{img/p-plan-model.png}
\caption{Overview of P-Plan model and its relationship with PROV-O \cite{garijo_p-plan_2014}}
\label{fig:p-plan-model}
\end{figure}
A \texttt{p-plan:Plan} represents information of `how' something should happen or a `template' for executions. A \texttt{p-plan:Activity} is a subclass of \texttt{prov:Activity} and represents execution of the process described in a \texttt{p-plan:Step}.
A \texttt{p-plan:Entity} is a subclass of \texttt{prov:Entity} that corresponds to a \texttt{p-plan:Variable} in \texttt{p-plan:Plan}. Therefore, a
\texttt{p-plan:Step} may describe the template including inputs and outputs which can
then be instantiated into multiple instances of \texttt{p-plan:Activity} that can have
distinct inputs to produce different outputs.
As \texttt{p-plan:Plan} extends \texttt{prov:Plan}, which itself extends \texttt{prov:Entity}, a
\texttt{p-plan:Plan} can be treated as an object whose provenance can be tracked using
PROV-O or P-Plan. This makes it possible to express `provenance of provenance' - thereby creating a record of how plans were formulated and executed over time.
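The correspondence between a step template and its executions can be sketched in a few lines. The following is an illustrative in-memory model rather than an RDF implementation: the class and field names are assumptions chosen to mirror \texttt{p-plan:Step}, \texttt{p-plan:Activity}, and the \texttt{p-plan:correspondsToStep} property, and the example data is hypothetical. It shows one step being instantiated into several activities with distinct inputs, and a check that the entities of each activity match the variables declared by its step.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """Ex-ante template, analogous to p-plan:Step with its variables."""
    name: str
    input_vars: List[str]
    output_vars: List[str]


@dataclass
class Activity:
    """Ex-post execution, analogous to p-plan:Activity."""
    name: str
    corresponds_to_step: Step  # analogous to p-plan:correspondsToStep
    used: Dict[str, str] = field(default_factory=dict)  # variable -> entity
    generated: Dict[str, str] = field(default_factory=dict)  # variable -> entity


def conforms(activity: Activity) -> bool:
    """Check that the entities of an activity match the variables of its step."""
    step = activity.corresponds_to_step
    return (set(activity.used) == set(step.input_vars)
            and set(activity.generated) == set(step.output_vars))


send_email = Step("SendNewsletter", input_vars=["Email"], output_vars=["DeliveryLog"])

# Two executions of the same template with distinct inputs and outputs.
run1 = Activity("run-1", send_email, used={"Email": "alice@example.com"},
                generated={"DeliveryLog": "log-001"})
run2 = Activity("run-2", send_email, used={"Email": "bob@example.com"},
                generated={"DeliveryLog": "log-002"})
# An execution that deviates from the template: wrong input, no output.
bad = Activity("run-3", send_email, used={"Phone": "555-0100"})

print([conforms(a) for a in (run1, run2, bad)])
```

Keeping the template separate from its executions is what allows a deviation such as the third run to be detected by comparison with the plan.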
\subsubsection{Extending ontologies for GDPR}
The PROV-O and P-Plan ontologies were extended to represent concepts and relationships of ex-ante and ex-post activities associated with personal data and consent based on requirements of GDPR compliance.
The decision to extend PROV-O and P-Plan with GDPR concepts was made as both ontologies contain generic concepts associated with activities and workflows which can be used for representing information about GDPR compliance, but doing so would not be intuitive due to differences in terminology and structuring of information as expected for GDPR compliance.
Extending existing ontologies of PROV-O and P-Plan enables expressing a `template'
or `plan' using \texttt{p-plan:Plan} describing ex-ante activities (as \texttt{p-plan:Step}) that can take place. This template can then be used to denote execution of activities in ex-post phase using \texttt{p-plan:Activity}.
This provides a machine-readable and documented information model for both ex-ante and ex-post activities whose provenance itself can be expressed (using PROV-O and P-Plan) to record how they were created and how they change over time.
This is beneficial in documenting state of a system at a given time through a set of activities that deal with consent and personal data, and
can be helpful in determining changes to be made based on changes in processing of personal data over time.
The extended ontology derived from PROV-O and P-Plan incorporates concepts and relationships associated with GDPR in order to normalise terminology for representing information associated with GDPR compliance.
Concepts and relationships are derived from competency questions and linked with relevant clauses within GDPR by referencing GDPRtEXT concepts using \texttt{rdfs:isDefinedBy} and \texttt{rdfs:seeAlso}.
This provides documentation regarding origin of concepts and their use in representation of information associated with specific clauses of GDPR.
It also provides a machine-readable link from the ontology to GDPR which can be used to compare, analyse, and align relevant ontologies.
The extension consists of sub-classing existing concepts in PROV-O and P-Plan to represent specific activities associated with GDPR compliance.
The use of subclass mechanism preserves existing concepts and relationships of PROV-O and P-Plan to provide compatibility and reuse. This is particularly important for PROV-O as it is a W3C standard and therefore is more likely to have existing uses.
The compatibility with PROV-O also enables information defined using the ontology to be bundled as an artefact to record its provenance and planning as a form of meta-documentation regarding planning and maintenance of compliance activities. This is particularly useful to maintain periodic snapshots of organisational processes associated with compliance and provides opportunities to automate querying and validation of information within a use-case - as demonstrated in \autoref{chapter:testing}.
\subsection{Ontology Description \& Application}
The resulting ontology is named GDPRov (GDPR Provenance Ontology) and is published online along with its documentation at \url{https://w3id.org/GDPRov/} under an open and permissive CC-BY-4.0 license.
The ontology was created, documented, and published using the methodology presented in \autoref{sec:voc:methodology}.
The aim of GDPRov is to provide representations of ex-ante and ex-post activities regarding personal data and consent for GDPR compliance.
It uses GDPRtEXT concepts to define origin and relevance of its concepts to GDPR.
\subsubsection{Overview of GDPRov concepts}
GDPRov extends concepts from PROV-O and P-Plan using the sub-class and sub-property mechanisms to represent activities associated with GDPR compliance, with a visual overview provided in \autoref{fig:vocabs:gdprov-overview}.
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{img/GDProv_relation_prov_pplan.png}
\caption{GDPRov concepts derived by extending PROV-O and P-Plan}
\label{fig:vocabs:gdprov-overview}
\end{figure}
GDPRov extends \texttt{p-plan:Plan} as \texttt{Process} to represent ex-ante plans of activities that will take place. The terminology is based on common use of term in expressions such as `business processes' and `compliance processes'.
Each \texttt{Process} can contain steps (represented by \texttt{p-plan:Step}) to represent activities that interact with data and agents.
To associate steps with a process, the property \texttt{p-plan:isStepOfPlan} is extended as \texttt{isPartOfProcess}.
An additional property, \texttt{refersToProcess}, enables a step to refer to a process without being a part of it.
Similarly, to associate data (defined in P-Plan as \texttt{p-plan:Variable}) the properties \texttt{p-plan:hasInputVar} and \texttt{p-plan:isOutputVarOf} are extended for activities using inputs and producing outputs respectively.
Ex-post activities in P-Plan are represented by \texttt{p-plan:Activity}.
Data interactions with these activities are represented by \texttt{p-plan:Entity} and the properties \texttt{prov:used} and \texttt{prov:wasGeneratedBy} are used to indicate inputs and outputs respectively.
GDPRov defines steps to indicate automated execution and user interactions regarding collecting data from user (input) and providing data (output).
To indicate a legal basis associated with a process or a step, the property \texttt{hasLegalBasis} is provided.
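The practical effect of extending \texttt{p-plan:isStepOfPlan} as \texttt{isPartOfProcess} can be illustrated with a small sketch of sub-property inference. It is not an RDF library and not part of GDPRov: triples are plain tuples, the \texttt{gdprov:} prefix is used only as an abbreviation, and all \texttt{ex:} resources are hypothetical. Under the RDFS sub-property rule, a statement made with the GDPRov property also holds for the P-Plan property it extends, which is how compatibility with tools that only understand P-Plan is retained.

```python
# Triples are (subject, predicate, object) tuples; all ex: names are hypothetical.
SUBPROPERTY_OF = "rdfs:subPropertyOf"

triples = {
    ("gdprov:isPartOfProcess", SUBPROPERTY_OF, "p-plan:isStepOfPlan"),
    ("ex:CollectEmail", "gdprov:isPartOfProcess", "ex:NewsletterProcess"),
    ("ex:NewsletterProcess", "gdprov:hasLegalBasis", "ex:GivenConsent"),
}


def infer_subproperties(triples):
    """Apply the RDFS rule (s p o), (p subPropertyOf q) => (s q o) to a fixpoint."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        supers = {(p, q) for p, pred, q in inferred if pred == SUBPROPERTY_OF}
        for s, p, o in list(inferred):
            for sub, sup in supers:
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


closure = infer_subproperties(triples)
print(("ex:CollectEmail", "p-plan:isStepOfPlan", "ex:NewsletterProcess") in closure)
```

The original triples are preserved in the result, so a query written against P-Plan terms and a query written against GDPRov terms can both be answered from the same data.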
\subsubsection{Depicting Data Life-cycle}
Activities associated with life-cycle of personal data consist of collecting, processing or using, storing, sharing, deleting, transferring, transforming, anonymising, and rectifying data. GDPR defines several more categories of actions in Article 4-2 in its definition of `processing'.
GDPRov provides broad and abstract processes to represent data access, data archival, data erasure, and data rectification given the need to execute these using one or more steps.
GDPRov provides representations of actions in ex-ante phase as \texttt{DataStep} which extends \texttt{p-plan:Step} and in ex-post phase as \texttt{DataActivity} which extends \texttt{p-plan:Activity}.
These are further extended to distinguish between data collection, data deletion, data sharing, data storage, data archival, data transfer, data transformation, data usage, and rectification of data.
A visual overview of steps describing a data life-cycle using GDPRov is provided in \autoref{fig:vocabs:gdprov-data-lifecycle}.
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{img/GDPRov_data_lifecycle.png}
\caption{Example steps depicting data life-cycle using GDPRov}
\label{fig:vocabs:gdprov-data-lifecycle}
\end{figure}
Anonymisation of data is defined as a sub-class of data transformation to indicate transformation of data that takes place when anonymising it.
As GDPR obligations depend on the level of anonymity of data and on an organisation's capability to de-anonymise it, GDPRov provides the concept of \texttt{anonymisation level} to indicate the state of anonymity of data.
GDPRov defines four levels of anonymisation based on existing work in representing anonymous data \cite{hintze_meeting_2017}, which consist of data that is: (i) completely anonymised, (ii) completely de-anonymised, (iii) pseudo-anonymised, and (iv) pseudo-organisational-anonymised, where an organisation does not have the additional data required to de-anonymise it and can thus utilise it internally as if it were completely anonymous data.
Sharing of data consists of interactions with actors or agents, which are represented by \texttt{prov:Agent} and associated with respective steps and activities using extended properties.
Personal data used within activities is represented by \texttt{PersonalData} which is sub-classed from \texttt{p-plan:Variable} for ex-ante representation and by \texttt{PersonalDataEntity} which is sub-classed from \texttt{prov:Entity} for ex-post representation.
Further categorisation of personal data into anonymised, sensitive, and representing user identifier is provided through sub-classes.
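The distinction between ex-ante and ex-post representations of data can be sketched as follows. This is a minimal example under stated assumptions: the default namespace and individuals are hypothetical, and the links \texttt{p-plan:correspondsToStep} and \texttt{p-plan:correspondsToVariable} are the P-Plan properties for tying an execution back to its plan, used here on the assumption that GDPRov classes inherit them from their P-Plan and PROV-O parents.
\begin{listing}[htbp]
\begin{minted}{turtle}
@prefix gdprov: <https://w3id.org/GDPRov#> .
@prefix p-plan: <http://purl.org/net/p-plan#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix : <http://example.com/fitness#> .  # hypothetical namespace

# Ex-ante: a planned collection step and the data it collects
:CollectHeartRate a gdprov:DataCollectionStep ;
    gdprov:collectsData :HeartRate .
:HeartRate a gdprov:PersonalData .

# Ex-post: one execution of that step and the data it produced
:CollectHeartRateRun1 a gdprov:DataCollectionActivity ;
    p-plan:correspondsToStep :CollectHeartRate .
:HeartRateRun1 a gdprov:PersonalDataEntity ;
    p-plan:correspondsToVariable :HeartRate ;
    prov:wasGeneratedBy :CollectHeartRateRun1 .
\end{minted}
\caption{Illustrative sketch linking ex-ante and ex-post representations of personal data (hypothetical individuals)}
\end{listing}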
\subsubsection{Depicting Consent Life-cycle}
Activities associated with the life-cycle of consent are represented in ex-ante phase by sub-classing \texttt{p-plan:Step} as \texttt{ConsentStep} and similarly in ex-post phase by sub-classing \texttt{p-plan:Activity} as \texttt{ConsentActivity}.
These are further sub-classed to represent acquisition, archival, modification, and withdrawal of consent.
Amongst these, withdrawal of consent is defined as sub-class of modification since it modifies state of consent.
A visual summary of steps in a consent life-cycle is provided in \autoref{fig:vocabs:gdprov-consent-lifecycle}.
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{img/GDPRov_consent_lifecycle.png}
\caption{Consent life-cycle defined using GDPRov}
\label{fig:vocabs:gdprov-consent-lifecycle}
\end{figure}
Artefacts associated with consent and used in activities include the choices or offer provided to request consent and the subsequent consent given by an individual.
To represent these in ex-ante phase, GDPRov provides concepts of \texttt{ConsentAgreementTemplate} to represent template offered to collect consent, \texttt{ConsentAgreement} to indicate given consent, and \texttt{TermsAndConditions} to indicate applicable policies or terms and conditions.
Corresponding concepts in ex-post phase are \texttt{GivenConsentTemplate}, \texttt{GivenConsent}, and \texttt{TermsAndConditionsEntity}.
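A minimal sketch of how these consent artefacts might be combined is given below. The individuals and the default namespace are hypothetical, and attaching \texttt{p-plan:correspondsToVariable} and \texttt{prov:used} to the consent concepts assumes they specialise \texttt{p-plan:Variable} and \texttt{prov:Entity}, which should be checked against the GDPRov documentation.
\begin{listing}[htbp]
\begin{minted}{turtle}
@prefix gdprov: <https://w3id.org/GDPRov#> .
@prefix p-plan: <http://purl.org/net/p-plan#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix : <http://example.com/consent#> .  # hypothetical namespace

# Ex-ante: the template offered and the agreement a step is planned to produce
:ConsentForm a gdprov:ConsentAgreementTemplate .
:Agreement a gdprov:ConsentAgreement .
:RequestConsent a gdprov:ConsentAcquisitionStep ;
    gdprov:generatesConsentAgreement :Agreement .

# Ex-post: consent actually collected from a data subject
:Sue a gdprov:DataSubject .
:ConsentFormShown a gdprov:GivenConsentTemplate ;
    p-plan:correspondsToVariable :ConsentForm .
:RequestConsentRun1 a gdprov:AcquireConsentActivity ;
    p-plan:correspondsToStep :RequestConsent ;
    prov:used :ConsentFormShown ;
    gdprov:collectedConsentFromAgent :Sue .
:SueConsent a gdprov:GivenConsent ;
    p-plan:correspondsToVariable :Agreement ;
    prov:wasGeneratedBy :RequestConsentRun1 .
\end{minted}
\caption{Illustrative sketch of consent artefacts in ex-ante and ex-post phases (hypothetical individuals)}
\end{listing}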
\subsubsection{Depicting Compliance-related processes}
In addition to representing activities associated with personal data and consent, GDPRov also provides representations for compliance-related processes.
These include actions such as appointing processor (by a controller), carrying out an impact assessment, marketing and its special case of direct marketing, and monitoring compliance.
Processes are also provided for handling data breaches which include notifying controller (by a processor), notifying data subject, and notifying data protection authority.
The handling of rights required by GDPR is represented through sub-classes of \texttt{Process} for data portability, erasure, access to personal data, basic information about processing, no automated processing, objection to direct marketing, objection to processing, rectification, restriction of processing, transparency, and SAR (subject access request).
% \subsubsection{Actors and Agents}
% Controller Representative
% Data Subject
% DPO
% Processor Representative
% Third Party
% Controller
% Joint Controller
% Processor
% SubProcessor
% \subsubsection{Documentation \& Dissemination}
\subsubsection{Example Use-Case: Querying anonymised sharing of data}
The applicability and usefulness of GDPRov is demonstrated through its use for querying and validation of information for GDPR compliance in \autoref{chapter:testing}.
A simplified example demonstrating a similar application through use of a SPARQL query was published in a peer-reviewed publication \cite{pandit_modelling_2017}. It is presented in \autoref{code:gdprov:sparql} to demonstrate how GDPRov can assist in answering compliance questions for GDPR.
The query uses GDPRov concepts to retrieve data being shared, specific steps that share it, anonymisation level of shared data, and steps used to anonymise it. The query is meant to retrieve information relevant in investigation of data being shared and its anonymity.
\begin{listing}[htbp]
\begin{minted}{sparql}
PREFIX gdprov: <https://w3id.org/GDPRov#>
SELECT ?data ?sharestep ?isAnonymised ?anonymisationStep
WHERE {
?data a gdprov:Data .
?sharestep a gdprov:DataSharingStep .
?sharestep gdprov:sharesData ?data.
BIND (
EXISTS { ?data a gdprov:AnonymisedData . }
as ?isAnonymised ) .
OPTIONAL {
?anonymisationStep
gdprov:generatesAnonymisedData ?data .
}
}
\end{minted}
\caption{SPARQL query representing compliance question \texttt{G5} concerning legal basis for processing}
\label{code:gdprov:sparql}
\end{listing}
\subsubsection{Example Use-Case: Detecting changes in activities for updates to consent}
As a use-case, consider the case where a data controller updates a plan of processing activities - such as when a purpose changes or a new processing operation is added to an existing purpose - and where the legal basis for such processing is consent.
In such cases, a data controller is required to evaluate whether updating an individual's consent is required based on changes between given consent and new purposes or processing activities. By storing plans of processing operations using GDPRov, it is possible to compare the old and new versions of a plan, detect changes, and identify whether corresponding updates to consent are needed.
An exploration of the above as change detection was published in the Managing the Evolution and Preservation of the Data Web workshop co-located with ESWC 2018 \cite{pandit_gdpr-driven_2018}. It described the comparison of plans represented using P-Plan to identify changes based on the above obligation.
The change detection, visualised in \autoref{fig:vocabs:gdprov-change-detection}, is based on identifying differences between two plans in terms of steps and variables and whether they have been added, removed, or modified.
In the figure, the change reflects removal of a step - which by itself does not require any changes to given consent since no new purposes have been added to an existing given consent.
Using this approach, the detected change can be analysed - manually for complex and legal interpretations and automatically for simpler or simplified graphs - and used to identify whether corresponding changes are necessary based on compliance obligations.
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{img/GDPRov-change-detection.png}
\caption{Modelling changes in workflows using P-Plan \cite{pandit_gdpr-driven_2018}}
\label{fig:vocabs:gdprov-change-detection}
\end{figure}
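The kind of change described above can be sketched with two versions of a plan, where the second version omits a sharing step. This is only an illustration: the individuals and default namespace are hypothetical, and the use of \texttt{prov:wasRevisionOf} to link plan versions and of \texttt{rdfs:label} to align steps across versions are assumptions made for this example rather than the mechanism used in the cited publication.
\begin{listing}[htbp]
\begin{minted}{turtle}
@prefix gdprov: <https://w3id.org/GDPRov#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://example.com/plans#> .  # hypothetical namespace

# Version 1 of the plan: collect, then share
:PlanV1 a gdprov:Process .
:CollectV1 a gdprov:DataCollectionStep ;
    rdfs:label "collect" ;
    gdprov:isPartOfProcess :PlanV1 .
:ShareV1 a gdprov:DataSharingStep ;
    rdfs:label "share" ;
    gdprov:isPartOfProcess :PlanV1 .

# Version 2 of the plan: the sharing step has been removed
:PlanV2 a gdprov:Process ;
    prov:wasRevisionOf :PlanV1 .
:CollectV2 a gdprov:DataCollectionStep ;
    rdfs:label "collect" ;
    gdprov:isPartOfProcess :PlanV2 .
\end{minted}
\caption{Illustrative sketch of two plan versions differing by a removed step (hypothetical individuals)}
\end{listing}
Comparing the steps attached to each version shows that the step labelled ``share'' exists only in the first version, which corresponds to a removal and does not introduce any new purpose or processing.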
\subsection{Evaluation}\label{sec:voc:gdprov:evaluation}
The ontology assessment was based on the methodology outlined in \autoref{sec:voc:methodology} regarding criteria for ontology quality and documentation.
The ontology was used in two applications developed to demonstrate use of SPARQL to query information (see \autoref{sec:testing:sparql}) and SHACL to validate information for GDPR compliance (see \autoref{sec:testing:shacl}). The experience demonstrates suitability of GDPRov in representing the required information, and led to addition of \texttt{ConsentAgreementTemplateBundle} (in v0.7) as a concept for convenience in representing `bundled' consent requests and provisions based on consent workflows on a website where a single dialogue is used to collect consent involving multiple distinct purposes and third parties.
GDPRov was published \cite{pandit_modelling_2017} as a peer-reviewed publication in Workshop on Society, Privacy and the Semantic Web - Policy and Technology (PrivOn) co-located with the 16th International Semantic Web Conference (ISWC). The workshop provided reviews from domain experts in privacy, legal, and semantic web domains; with ISWC being a top-tier conference in semantic web domain.
As of February 2020, this publication has received 18 citations (excluding self-citations) on Google Scholar\footnote{\url{https://scholar.google.com/scholar?cites=2287149512924017207}} of which 2 are deliverables of CitySPIN research project (see \autoref{sec:sota:gdpr-semweb}).
% Of these, the publications associated with PrOnto \cite{palmirani_pronto_2018,palmirani_pronto_compliance_2018} incorrectly conclude that ``GDPRov aims to describe the provenance of the consent and data lifecycle in the light of the Linked Open Data principles such as Fairness and Trust'' - since the basis of GDPRov is the scientific workflow model similar to the one used within PrOnto.
% The publication of a PROV-O model representing activities for GDPR compliance by Ujcich et al. \cite{belhajjame_provenance_2018} cites GDPRov as a relevant approach with the comparison to GDPRov specified as - ``Our model allows for more flexible specifications of how data can be used (i.e.,under consent for particular purposes while being legally valid for a period of time). Furthermore, our model focuses on temporal reasoning and online data usage control, whereas it is not clear how amenable GDPRov is to such reasoning or enforcement.'' While the example provided in the publication does provide temporal annotation of activities for GDPR compliance - these are achieved through the use of PROV-O rather than specialised classes using GDPR terminology. Since GDPRov extends PROV-O, it supports such use of inherent PROV-O concepts. In addition, GDPRov provides expression of ex-ante activities and workflows which cannot be expressed using the model proposed by Ujcich et. al.
A publication describing an approach for annotating DFDs (data flow diagrams) with information for analysing compliance \cite{debruyneOntologyRepresentingAnnotating2019} utilised GDPRov to represent personal data as an entity used in activities within its ontology for representing DFDs to abstract processing operations as data flows.
\subsubsection{Fulfilment of Competency Questions}
An assessment of the extent to which GDPRov satisfies competency questions by providing concepts and relationships is presented here as part of its evaluation.
The competency questions, summarised in \autoref{sec:gdprov:cq}, were used to guide development of the ontology and are therefore used to evaluate the extent to which the developed ontology meets the requirements of representing this information.
\autoref{table:gdprov:cq} lists concepts and properties for answering each competency question (with \textit{N/S} used to indicate not in scope).
\begin{center}
\footnotesize
\begin{tabularx}{\textwidth}{|l|X|p{5cm}|l|}
\caption{Concepts in GDPRov for answering competency questions} \label{table:gdprov:cq} \\
\toprule
\textbf{CQ} & \textbf{Class} & \textbf{Property} & \textbf{Phase} \\
\midrule
\endfirsthead
\caption*{Concepts in GDPRov for answering competency questions (cont'd)} \\
\toprule
\textbf{CQ} & \textbf{Class} & \textbf{Property} & \textbf{Phase} \\
\midrule
\endhead
\midrule
\multicolumn{4}{r@{}}{\footnotesize (Cont'd on following page)}\\
\endfoot
\endlastfoot
\multicolumn{4}{|l|}{\textbf{Actors and Agents involved in activities}} \\ \hline
\textit{CMQ2} & \textit{Controller}, \textit{ControllerRepresentative}, \textit{DPO} & & \\ \hline
\textit{CMQ17} & \textit{Processor}, \textit{ProcessorRepresentative}, \textit{DPO} & & \\ \hline
\textit{CMQ35} & \textit{DataSubject} & & \\ \hline
\multicolumn{4}{|l|}{\textbf{Details of processing}} \\ \hline
\textit{CMQ3} & \textit{Process} & \textit{refersToProcess} & \textit{Ex}-\textit{ante} \\ \hline
\textit{CMQ4} & \textit{DataSubject} & & \\ \hline
\textit{CMQ5} & \textit{PersonalData}, \textit{SensitiveData} & \textit{usesData} & \textit{Ex}-\textit{ante} \\ \hline
& \textit{PersonalDataEntity}, \textit{SensitiveDataEntity} & & \textit{Ex}-\textit{post} \\ \hline
\textit{CMQ6} & \textit{DataSharingStep} & \textit{sharesData}, \textit{sharesDataWith} & \textit{Ex}-\textit{ante} \\ \hline
& \textit{DataSharingActivity} & \textit{hasSharedDataWith} & \textit{Ex}-\textit{post} \\ \hline
\textit{CMQ7} & \textit{ThirdParty} & \textit{sharesDataWithThirdParty} & \textit{Ex}-\textit{ante} \\ \hline
\textit{CMQ8} & \textit{ThirdParty} & \textit{sharesData}, \textit{sharesDataWith} & \textit{Ex}-\textit{ante} \\ \hline
\textit{CMQ9} & \textit{N}/\textit{S} & \textit{N}/\textit{S} & \\ \hline
\textit{CMQ10} & \textit{N}/\textit{S} & \textit{N}/\textit{S} & \\ \hline
\textit{CMQ11} & \textit{DataStorageStep} & \textit{usesData}, \textit{generatesData} & \textit{Ex}-\textit{ante} \\ \hline
& \textit{DataStorageActivity} & & \textit{Ex}-\textit{post} \\ \hline
\textit{CMQ12} & \textit{N}/\textit{S} & \textit{N}/\textit{S} & \\ \hline
\textit{CMQ13} & \textit{N}/\textit{S} & \textit{N}/\textit{S} & \\ \hline
\textit{CMQ26} & & \textit{hasLegalBasis} & \\ \hline
\multicolumn{4}{|l|}{\textbf{Lifecycle of data}} \\ \hline
\textit{CMQ28} & \textit{DataCollectionStep} & \textit{collectsData} & \textit{Ex}-\textit{ante} \\ \hline
& \textit{DataCollectionActivity} & & \textit{Ex}-\textit{post} \\ \hline
\textit{CMQ29} & \textit{DataCollectionStep} & \textit{collectsDataFromAgent} & \textit{Ex}-\textit{ante} \\ \hline
& \textit{DataCollectionActivity} & \textit{collectedDataFromAgent} & \textit{Ex}-\textit{post} \\ \hline
\multicolumn{4}{|l|}{\textbf{Anonymisation}} \\ \hline
\textit{CMQ31} & \textit{DataAnonymisationStep}, \textit{AnonymisedData} & & \textit{Ex}-\textit{ante} \\ \hline
& \textit{DataAnonymisationActivity}, \textit{AnonymisedDataEntity} & & \textit{Ex}-\textit{post} \\ \hline
\textit{CMQ32} & \textit{PersonalData}, \textit{SensitiveData} & \textit{hasAnonymityLevel} & \textit{Ex}-\textit{ante} \\ \hline
\multicolumn{4}{|l|}{\textbf{Activities associated with Consent}} \\ \hline
\textit{CMQ48} & \textit{ConsentStep}, \textit{ConsentAcquisitionStep}, \textit{ConsentModificationStep}, \textit{ConsentArchivalStep}, \textit{ConsentAgreement}, \textit{ConsentAgreementTemplate} & \textit{usesConsentAgreement}, \textit{generatesConsentAgreement} & \textit{Ex}-\textit{ante} \\ \hline
& \textit{ConsentActivity}, \textit{AcquireConsentActivity}, \textit{ArchiveConsentActivity}, \textit{ModifyConsentActivity}, \textit{GivenConsent}, \textit{GivenConsentTemplate} & \textit{collectedConsentFromAgent} & \textit{Ex}-\textit{post} \\ \hline
\textit{CMQ49} & \textit{ConsentAgreementTemplate} & & \textit{Ex}-\textit{ante} \\ \hline
& \textit{GivenConsentTemplate} & & \textit{Ex}-\textit{post} \\
\bottomrule
\end{tabularx}
\end{center}
% Some questions which were not considered in scope regarding expression of activities are marked with \textit{N/S}.
Questions not in scope (marked as \textit{N/S}) either require clarity from authoritative sources regarding interpretation of information to provide a concrete design pattern, or have multiple possible representations of which it cannot be determined which is more useful from a legal compliance point of view. Examples include location of recipients - which can be expressed either through a property/annotation associated with a data sharing activity or attached to a particular third party; and specifying time limits or duration or conditional events associated with data storage and deletion periods. These have been identified as future work regarding further development of the ontology based on differing interpretations of representation, complexity of specifying values such as ``EU membership'' and ``as long as required'', and pending expert opinion of legal authorities on these issues through courts or executive decisions.
These are considered minor issues regarding representation of information as they do not have a major impact on design and use of GDPRov.
Approaches and ontologies in SotA provide alternative design patterns which can be used as non-authoritative approaches for short-term mitigation of this issue.
The presented evaluation demonstrates that GDPRov satisfies the requirements of answering competency questions regarding representation of activities, and identifies those that need to be resolved as future work based on availability of legal opinion and decisions. GDPRov thus fulfils research objective $RO3(b)$ by providing representations of activities associated with personal data and consent in ex-ante and ex-post phases.
\subsubsection{Comparison with SotA}
The representation of process flows and activities associated with GDPR compliance in existing approaches was presented and analysed as part of state of the art in \autoref{sota:analysis:process-flows}.
The attributes for this analysis involved features or concepts that could be represented using the specified approach and basis for representation in existing vocabularies and standards.
The analysis demonstrated existence of a variety of approaches that utilised existing standards of PROV-O and BPMN to model GDPR-specific information regarding process flows and activities in both ex-ante and ex-post phase. It found that approaches modelling both ex-post and ex-ante phases exist and utilise PROV-O as their basis for representation of information.
A comparison of GDPRov with SotA is provided in \autoref{table:gdprov:sota} using the same attributes used for analysis.
The table lists features supported by each approach using a check mark (\cmark) with a blank indicating no information regarding the feature was found.
Column headings correspond to the information an approach can express, and use the following abbreviations - (Repr): method used for representation of process flow; (EA): whether it permits Ex-ante modelling; (EP): whether it permits Ex-post modelling; (Pu): whether Purpose can be specified; (Pr): whether Processing can be specified; (DS): if Data Sharing can be modelled; (Rp): if Recipients are associated with data sharing; (St): whether Data Storage can be modelled; (Rg): if provision of Rights can be modelled; and (LB): if Legal Basis can be associated with a process flow.
The table demonstrates that GDPRov supports all of the analysed features and is the only approach currently providing all of them. However, this analysis only takes into consideration abstract existence or provision of features and does not take into consideration the context of an approach or its granularity. For example, while an approach may provide representation of data storage concepts, there are additional features such as storage duration, condition, form or medium, security, and policy which are also relevant in evaluation of GDPR compliance.
These are highly dependent on individual use-cases and domains, and can be represented using existing work such as Time ontology \cite{cox_time_2017} for temporal annotations and ODRL ontology \cite{iannella_odrl_2018} for conditions and events as policies.
Since the scope of GDPRov is limited to expression of information regarding activities in ex-ante and ex-post phases, representation of such granular attributes is relevant but not the primary focus within its scope and is therefore not considered in its evaluation or comparison with SotA.
% Table GDPRov vs SotA
\begin{center}
\footnotesize
\begin{tabularx}{\textwidth}{|l|l|X|X|X|X|X|X|X|X|X|}
\caption{Comparison of GDPRov with SotA}\label{table:gdprov:sota} \\
\toprule
\textbf{Work} & \textbf{Repr} & \textbf{EA} & \textbf{EP} & \textbf{Pu} & \textbf{Pr} & \textbf{DS} & \textbf{Rp} & \textbf{St} & \textbf{Rg} & \textbf{LB} \\
\midrule
\endfirsthead
\caption*{Comparison of GDPRov with SotA (cont'd)} \\
\toprule
\textbf{Work} & \textbf{Repr} & \textbf{EA} & \textbf{EP} & \textbf{Pu} & \textbf{Pr} & \textbf{DS} & \textbf{Rp} & \textbf{St} & \textbf{Rg} & \textbf{LB} \\
\midrule
\endhead
\midrule
\multicolumn{11}{r@{}}{\footnotesize (Cont'd on following page)}\\
\endfoot
\endlastfoot
\rowcolor[gray]{0.8}
\textbf{GDPRov} & PROV-O,P-Plan & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark \\ \hline
SPECIAL & PROV-O & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & & \\ \hline
SPL+CitySPIN & PROV-O & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & & \\ \hline
MIREL & PWO & \cmark & & \cmark & \cmark & & & \cmark & \cmark & \\ \hline
MRL+DAPRECO & PWO & \cmark & & \cmark & \cmark & & & \cmark & \cmark & \\ \hline
BPR4GDPR & & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & & & \\ \hline
Ujcich et al. & PROV-O & & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark & \cmark \\ \hline
Lodge et al. & & \cmark & & \cmark & & & & & & \\ \hline
Tom et al. & BPMN & \cmark & & & \cmark & \cmark & \cmark & \cmark & \cmark & \\ \hline
LUCE & & \cmark & \cmark & & & \cmark & \cmark & & & \\ \hline
Sion et al. & & \cmark & & \cmark & \cmark & \cmark & \cmark & \cmark & & \cmark \\ \hline
privacyTracker & & \cmark & \cmark & & & \cmark & \cmark & & & \\ \hline
Basin et al. & & \cmark & & \cmark & & & & & & \\ \hline
RestAssured & & & & \cmark & \cmark & \cmark & \cmark & \cmark & & \\
\bottomrule
\end{tabularx}
\end{center}
Regarding expression of information as being in either ex-ante or ex-post phase - GDPRov is the only approach to do so using existing ontologies of PROV-O and P-Plan based on a scientific workflow model which is useful in investigation of executions based on a plan and association of information between ex-ante and ex-post phases. The use of PWO \cite{gangemi_publishing_2017} in MIREL project also follows a similar rationale though it does not provide the same extent of concepts and representations as GDPRov.
In addition, utilisation of PROV-O as its base ontology enables information represented using GDPRov to be captured and recorded as provenance information using PROV-O or GDPRov itself. This provides capabilities for documenting evolution of a system as well as representing state of compliance at a given moment in time.
This feature is shared by all approaches that utilise a provenance based ontology at their core, and especially the ones that utilise PROV-O as it is a well-recognised and adopted standard.
The combination of PROV-O and P-Plan enables a flexible representation of activities by specifying their constituent steps and involved artefacts at an arbitrary level of granularity while providing annotations and classes to link these with GDPR. In this, GDPRov is unique and novel within SotA.
GDPRov's novelty within SotA therefore lies in it being one of the few approaches using PROV-O to represent activities for GDPR compliance in both ex-ante and ex-post phases. GDPRov is also novel in its provision of concepts associated with GDPR and in the granularity afforded by use of PROV-O and P-Plan to link information in ex-ante and ex-post phases.
Furthermore, GDPRov is one of the few approaches to be available under an open and permissive license (CC-BY-4.0), thereby enabling its use, adoption, and further evolution.
\subsubsection{Application to external use-case from SPECIAL project}\label{sec:gdprov:use-case:SPECIAL}
% description of the use-case
The SPECIAL project uses a scenario\footnote{NOTE: the scenario explicitly mentions use of an immutable distributed ledger developed by the SPECIAL project to provide transparency and log accountability regarding metadata. This is omitted from the adapted use-case used to evaluate GDPRov.} to motivate their work and demonstrate use of developed technologies in their deliverables (D1.7 \cite{bonatti_d1.7_2018}) and peer-reviewed publications (\cite{kirrane_scalable_2018}).
In this section, the scenario is adapted as an external use-case to evaluate GDPRov's suitability to express required concepts.
The use-case is summarised as follows with GDPR concepts added in parentheses for relevance: Sue (Data Subject) buys a wearable appliance for fitness tracking from BeFit (Data Controller), and is presented with an informed consent request that describes the collection of biomedical parameters such as heart rate (Personal Data) and how they will be processed. The parameters are stored in BeFit's cloud and transmitted for the purposes of: giving Sue feedback on her activity, such as calories consumption; and creating an activity profile that will be shared with other companies for targeted ads related to fitness - an optional purpose to which Sue opts in. After two years, Sue starts receiving annoying SMS messages from a local gym that advertise its activities. Sue discovers the following facts: (i) the gym has an activity profile referring to Sue that, due to the appliance's malfunctioning, reports that she is not doing any physical exercise; (ii) the gym received Sue's profile from BeFit, associated with a policy that allows the gym to send targeted ads to Sue based on the profile; (iii) BeFit built Sue's profile by mining data collected by the appliance; and (iv) all these operations are permitted by the consent agreement previously signed by Sue and BeFit. Using this information, BeFit and the gym prove that they used Sue's data in accordance with Sue's given consent. Sue now asks both BeFit and the gym to delete all of her data.
The use-case is accompanied by information on its interpretation in terms of GDPR terminology \cite{bonatti_d1.7_2018} and its representation using SPECIAL vocabularies \cite{kirrane_scalable_2018}. To represent the use-case using GDPRov, concepts used by SPECIAL are mapped or aligned to their closest relative concepts within GDPRov (see \autoref{table:gdprov:use-case:SPECIAL}) with its RDF/Turtle representation provided in \autoref{code:gdprov:use-case:special}. The SPARQL queries used to retrieve information depicted by statements (i) to (iv) in the scenario are provided in \autoref{code:gdprov:use-case:special-sparql}. The RDF representation and SPARQL queries use a simplified representation of the scenario to present only the essential facts needed to answer the questions.
\begin{center}
\footnotesize
\begin{tabularx}{\textwidth}{|p{0.35\linewidth}|X|X|}
\caption{GDPRov concepts to represent external use-case from SPECIAL} \label{table:gdprov:use-case:SPECIAL} \\
\toprule
\textbf{Statement} & \textbf{GDPRov concept} & \textbf{SPECIAL concept} \\
\midrule
\endhead
Sue & \texttt{DataSubject} & \texttt{DataSubject} \\ \hline
BeFit & \texttt{DataController} & \texttt{Controller} \\ \hline
Biomedical parameters, heart rate, calories consumption, activity profile & \texttt{PersonalData} & \texttt{Data} \\ \hline
Collect data & \texttt{DataCollectionActivity} & \texttt{Collect} \\ \hline
Provide feedback on activity & \texttt{Purpose} & \texttt{Purpose} \\ \hline
Give consent (opt-in) & \texttt{AcquireConsentActivity} & \texttt{ConsentAssertion} \\ \hline
Targeted ads related to fitness & \texttt{Purpose} & \texttt{Purpose} \\ \hline
Share data & \texttt{DataSharingActivity} & \texttt{Recipient} \\ \hline
Gym & \texttt{ThirdParty} & \texttt{Recipient} \\ \hline
Consent agreement & \texttt{GivenConsent} & \texttt{LogEntryContent} \\ \hline
Delete data & \texttt{DataDeletionActivity} & N/A \\ \hline
Withdraw consent & \texttt{WithdrawConsentActivity} & \texttt{ConsentRevocation} \\
\bottomrule
\end{tabularx}
\end{center}
\begin{listing}[htbp]
\begin{minted}{turtle}
# Entities
:Sue a gdprov:DataSubject .
:BeFit a gdprov:DataController .
:Gym a gdprov:ThirdParty .
# Personal Data
:Biomedical_Parameters a gdprov:PersonalData .
:Activity_Profile a gdprov:PersonalData .
# Register with BeFit, given consent, and generate activity profile
:Registration a gdprov:Process .
:Sue_consent a gdprov:GivenConsent, gdprtext:LawfulBasisForProcessing .
:Collect_consent a gdprov:AcquireConsentActivity ;
gdprov:isPartOfProcess :Registration ;
gdprov:collectedConsentFromAgent :Sue ;
gdprov:generatedConsent :Sue_consent .
:Collect_data a gdprov:DataCollectionActivity ;
gdprov:isPartOfProcess :Registration ;
prov-o:wasInformedBy :Collect_consent ;
gdprov:collectedDataFromAgent :Sue ; # from Sue's device
gdprov:generatedData :Activity_Profile .