From bc4932708254edea3187b7eaa4d5d311c7463859 Mon Sep 17 00:00:00 2001 From: acezen Date: Mon, 4 Mar 2024 16:46:44 +0800 Subject: [PATCH 1/6] [Feat][Doc] Refact and update the format specification document --- docs/format/file-format.rst | 17 +++++++++++++++-- docs/index.rst | 9 +++++++++ docs/overview.rst | 32 -------------------------------- docs/overview/concepts.rst | 15 +++++++++++++++ docs/overview/motivation.rst | 10 ++++++++++ docs/overview/overview.rst | 16 ++++++++++++++++ 6 files changed, 65 insertions(+), 34 deletions(-) delete mode 100644 docs/overview.rst create mode 100644 docs/overview/concepts.rst create mode 100644 docs/overview/motivation.rst create mode 100644 docs/overview/overview.rst diff --git a/docs/format/file-format.rst b/docs/format/file-format.rst index 35b8fa8d1..17c6eaeaf 100644 --- a/docs/format/file-format.rst +++ b/docs/format/file-format.rst @@ -1,10 +1,16 @@ GraphAr File Format ============================ -What is Property Graph +Property Graph ------------------------ -GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges. Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties. Since carrying additional information than non-property graphs, the property graph is able to represent connections among data scattered across diverse data databases and with different schemas. Compared with the relational database schema, the property graph excels at showing data dependencies. Therefore, it is widely-used in modeling modern applications including social network analytics, data mining, network routing, scientific computing and so on. +GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges. 
+Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties. +Since carrying additional information than non-property graphs, the property graph is able to represent +connections among data scattered across diverse data databases and with different schemas. +Compared with the relational database schema, the property graph excels at showing data dependencies. +Therefore, it is widely-used in modeling modern applications including social network analytics, data mining, +network routing, scientific computing and so on. A property graph consists of vertices and edges, with each vertex contains a unique identifier and: @@ -25,6 +31,13 @@ The following is an example property graph containing two types of vertices ("pe :align: center :alt: property graph +Property +-------- + +Property Group +-------------- + + Vertices in GraphAr ------------------------ diff --git a/docs/index.rst b/docs/index.rst index 266489519..94d568b93 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -10,6 +10,15 @@ :caption: Overview :hidden: + Overview + Motivation + Concepts + +.. toctree:: + :maxdepth: 1 + :caption: Format Specification + :hidden: + Overview File Format Implementation Status diff --git a/docs/overview.rst b/docs/overview.rst deleted file mode 100644 index 5a3ca80df..000000000 --- a/docs/overview.rst +++ /dev/null @@ -1,32 +0,0 @@ -Overview -============================ - -What is GraphAr ------------------------- - -Graph processing serves as the essential building block for a diverse variety of real-world applications such as social network analytics, data mining, network routing, and scientific computing. As the graph processing becomes increasingly important, there are many in-memory and out-of-core graph storages, databases, graph computing systems and interactive graph query frameworks have emerged. 
- -To accommodate this fragmented graph processing ecology, **GraphAr (Graph Archive, GAR)** is established to enable diverse graph applications or existing systems to build and access the graph data conveniently and efficiently. It specifies a standardized system-independent file format for graphs and provides a set of libraries to generate, access and transform such formatted files. - -GraphAr is intended to serve as the standard file format for importing/exporting and persistent storage of the graph data which can be used by diverse existing systems, reducing the overhead when various systems co-work. Additionally, it can also serve as the direct data source for graph processing applications. - -The GraphAr project includes such topics as: - -- Design of the Graph Archive (GAR) file format. (see `GraphAr File Format `_) -- A set of libraries for reading, writing and transforming GAR files. (now `the C++ library `_ , `JAVA library `_ and `the Spark library `_ are available) -- Examples about how to use GraphAr to write graph algorithms, or to work with existing systems such as GraphScope. (see `Application Cases <../cpp/examples/out-of-core.html>`_) - -.. image:: images/overview.png - :alt: overview - - -GraphAr Features ------------------------- - -The features of GraphAr include: - -- It supports the property graphs and different representations for the graph structure (COO, CSR and CSC). -- It is compatible with existing widely-used file types including CSV, ORC and Parquet. -- Apache Spark can be utilized to generate, load and transform the GAR files. -- It is convenient to be used by a variety of single-machine/distributed graph processing systems, databases, and other downstream computing tasks. -- It enables to modify the topology structure or the properties of the graph, or to construct a new graph with a set of selected vertices/edges. 
\ No newline at end of file diff --git a/docs/overview/concepts.rst b/docs/overview/concepts.rst new file mode 100644 index 000000000..cbeb70380 --- /dev/null +++ b/docs/overview/concepts.rst @@ -0,0 +1,15 @@ +Concepts +========= + +Glossary of relevant concepts and terms. + +- Property Graph: A graph with nodes and edges, where both nodes and edges can have properties. Nodes and edges can be labeled with one or more labels. This is the most common type of graph database. +- Vertex +- Edge +- Property group +- Adjacency list +- CSR +- CSC +- COO +- Vertex chunk +- Edge chunk diff --git a/docs/overview/motivation.rst b/docs/overview/motivation.rst new file mode 100644 index 000000000..078f6e2d4 --- /dev/null +++ b/docs/overview/motivation.rst @@ -0,0 +1,10 @@ +Motivation +=========== + +Numerous graph systems, such as Neo4j, Nebula Graph, and Apache HugeGraph, have been developed in recent years. +Each of these systems has its own graph data storage format, complicating the exchange of graph data between different systems. +The need for a standard data file format for large-scale graph data storage and processing that can be used by diverse existing systems is evident, as it would reduce overhead when various systems work together. + +Our aim is to fill this gap and contribute to the open-source community by providing a standard data file format for graph data storage and exchange, as well as for out-of-core querying. +This format, which we have named GraphAr, is engineered to be efficient, cross-language compatible, and to support out-of-core processing scenarios, such as those commonly found in data lakes. +Furthermore, GraphAr's flexible design ensures that it can be easily extended to accommodate a broader array of graph data storage and exchange use cases in the future. 
diff --git a/docs/overview/overview.rst b/docs/overview/overview.rst new file mode 100644 index 000000000..5dd963835 --- /dev/null +++ b/docs/overview/overview.rst @@ -0,0 +1,16 @@ +Overview +========= + +.. image:: images/overview.png + :alt: overview + +GraphAr is a project to standardize the graph data format and provide a set of libraries to generate, access and transform such formatted files. + +It is intended to serve as the standard file format for importing/exporting and persistent storage of the graph data which can be used by diverse existing systems, reducing the overhead when various systems co-work. + +Additionally, it can also serve as the direct data source for graph processing applications. + + +Motivation + +Concepts From 05e09c5b4c0a3f296592509be6944df16b00ae6f Mon Sep 17 00:00:00 2001 From: acezen Date: Mon, 4 Mar 2024 17:45:16 +0800 Subject: [PATCH 2/6] Add some items to format --- docs/format/file-format.rst | 42 +++++++++++++++++++++++++++++++++++-- 1 file changed, 40 insertions(+), 2 deletions(-) diff --git a/docs/format/file-format.rst b/docs/format/file-format.rst index 17c6eaeaf..f27b31411 100644 --- a/docs/format/file-format.rst +++ b/docs/format/file-format.rst @@ -31,12 +31,50 @@ The following is an example property graph containing two types of vertices ("pe :align: center :alt: property graph -Property --------- + +Property Data Types +------------------- +GraphAr supports a set of built-in property data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), including: + +``` +- Boolean +- Int32: Integer with 32 bits +- Int64: Integer with 64 bits +- Float: 32-bit floating point values +- Double: 64-bit floating point values +- String: Textual data +- Date: days since the Unix epoch +- Timestamp: milliseconds since the Unix epoch +- Time: milliseconds since midnight +- List: A list of values of the same type +``` + Property Group -------------- +GraphAr splits the properties of vertices and edges 
into groups, with each group containing a set of properties and the properties in the same group are stored in the same file. + + +Adjacency List +-------------- + +GraphAr supports the storage of multiple types of adjLists for a given group of edges, e.g., a group of edges could be accessed in both CSR and CSC way when two copies (one is **ordered_by_source** and the other is **ordered_by_dest**) of the relevant data are present in GraphAr. + + +Configurations +-------------- + +Vertex Chunk Size +````````````````` + +Edge Chunk Size +```````````````` + +File Format +```````````` + + Vertices in GraphAr From 44c1d2dc77fda8f60d2389da2347a6e44f448616 Mon Sep 17 00:00:00 2001 From: acezen Date: Tue, 5 Mar 2024 13:11:39 +0800 Subject: [PATCH 3/6] Update --- docs/format/file-format.rst | 45 ++++++++++++++----------------------- 1 file changed, 17 insertions(+), 28 deletions(-) diff --git a/docs/format/file-format.rst b/docs/format/file-format.rst index f27b31411..5b41dbfd2 100644 --- a/docs/format/file-format.rst +++ b/docs/format/file-format.rst @@ -2,7 +2,7 @@ GraphAr File Format ============================ Property Graph ------------------------- +--------------- GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges. Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties. @@ -55,11 +55,15 @@ Property Group GraphAr splits the properties of vertices and edges into groups, with each group containing a set of properties and the properties in the same group are stored in the same file. +TODO: add a figure to illustrate the property group. 
+ Adjacency List -------------- -GraphAr supports the storage of multiple types of adjLists for a given group of edges, e.g., a group of edges could be accessed in both CSR and CSC way when two copies (one is **ordered_by_source** and the other is **ordered_by_dest**) of the relevant data are present in GraphAr. +GraphAr supports the storage of multiple types of adjLists for a given group of edges, e.g., a group of edges could be accessed in both CSR, CSC or COO. The adjacency list is used to store the topology information in a way similar to CSR/CSC + +TODO: add a figure to illustrate the adjList types. Configurations @@ -71,13 +75,14 @@ Vertex Chunk Size Edge Chunk Size ```````````````` -File Format -```````````` - +Data File Format +```````````````` +Adjacency List Type +```````````````````` -Vertices in GraphAr +Vertex Chunks in GraphAr ------------------------ Logical table of vertices @@ -110,7 +115,7 @@ Take the "person" vertex table as an example, if the chunk size is set to be 500 **Note**: For efficiently utilize the filter push-down of the payload file format like Parquet, the internal vertex id is stored in the payload file as a column. And since the internal vertex id is continuous, the payload file format can use the delta encoding for the internal vertex id column, which would not bring too much overhead for the storage. -Edges in GraphAr +Edge Chunks in GraphAr ------------------------ Logical table of edges @@ -133,10 +138,10 @@ As same with the vertex table, the logical edge table is also partitioned into s - **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format. 
- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format. -After that, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as edge chunk size). Finally, an edge chunk is separated into physical tables in the following way: +After that, the whole logical table of edges will be divided into multiple sub-logical-tables with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. Then, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as edge chunk size). Finally, an edge chunk is separated into physical tables in the following way: - an adjList table (which contains only two columns: the internal vertex id of the source and the destination). -- 0 or more edge property tables, with each table contains a group of properties. +- 0 or more property group tables (each contains the properties of the edges). Additionally, there would be an offset table for **ordered_by_source** or **ordered_by_dest** edges. The offset table is used to record the starting point of the edges for each vertex. The partition of the offset table should be in alignment with the partition of the corresponding vertex table. The first row of each offset chunk is always 0, indicating the starting point for the corresponding sub-logical-table for edges. @@ -158,11 +163,11 @@ Take the "person knows person" edges to illustrate. Suppose the vertex chunk siz File Format ------------------------- +------------ Information files ````````````````` -GraphAr uses two kinds of files to store a graph: a group of Yaml files to describe meta information; and data files to store actual data for vertices and edges. 
+GraphAr uses two kinds of files to store a graph: a group of Yaml files to describe metadata information; and data files to store actual data for vertices and edges. A graph information file which named ".graph.yml" describes the meta information for a graph whose name is . The content of this file includes: - the graph name; @@ -201,24 +206,8 @@ As previously mentioned, each logical vertex/edge table is divided into multiple - `Apache ORC `_ - `Apache Parquet `_ - CSV +- JSON Both of Apache ORC and Apache Parquet are column-oriented data storage formats. In practice of graph processing, it is common to only query a subset of columns of the properties. Thus, the column-oriented formats are more efficient, which eliminate the need to read columns that are not relevant. They are also used by a large number of data processing frameworks like `Apache Spark `_, `Apache Hive `_, `Apache Flink `_, and `Apache Hadoop `_. See also `Gar Data Files `_ for an example. - -Data Types -`````````` -GraphAr provides a set of built-in data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), includes: - -- bool -- int32 -- int64 -- float -- double -- string -- list (of int32, int64, float, double, string; not supported by CSV) - -.. tip:: - - We are continuously adding more built-in data types in GraphAr, and self-defined data types will be supported. 
- \ No newline at end of file From 72f193239f070a53cbd6ded39fbb7b28550d41dc Mon Sep 17 00:00:00 2001 From: acezen Date: Tue, 12 Mar 2024 18:57:49 +0800 Subject: [PATCH 4/6] Update the document Signed-off-by: acezen --- docs/format/file-format.rst | 55 +++++++++++++++++++----------------- docs/format/status.rst | 2 ++ docs/overview/concepts.rst | 46 +++++++++++++++++++++++------- docs/overview/motivation.rst | 3 +- 4 files changed, 69 insertions(+), 37 deletions(-) diff --git a/docs/format/file-format.rst b/docs/format/file-format.rst index 5b41dbfd2..8e47c8364 100644 --- a/docs/format/file-format.rst +++ b/docs/format/file-format.rst @@ -49,37 +49,45 @@ GraphAr support a set of built-in property data types that are common in real us - List: A list of values of the same type ``` - -Property Group --------------- - -GraphAr splits the properties of vertices and edges into groups, with each group containing a set of properties and the properties in the same group are stored in the same file. - -TODO: add a figure to illustrate the property group. - - -Adjacency List --------------- - -GraphAr supports the storage of multiple types of adjLists for a given group of edges, e.g., a group of edges could be accessed in both CSR, CSC or COO. The adjacency list is used to store the topology information in a way similar to CSR/CSC - -TODO: add a figure to illustrate the adjList types. - +GraphAr also supports the user-defined data types, which can be used to represent complex data structures, +such as the struct, map, and union types. Configurations -------------- Vertex Chunk Size ````````````````` +The vertex chunk size is a configuration parameter that determines the number of vertices in a vertex chunk +and used to partition the logical vertex table into multiple physical vertex tables. +The vertex chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files, +but small enough to avoid reading/writing too many vertices at once. 
We recommend setting the vertex chunk size to +empirical value 2^18 (262,144) for most cases. Edge Chunk Size ```````````````` +The edge chunk size is a configuration parameter that determines the number of edges in an edge chunk +and used to partition the logical edge table into multiple physical edge tables. +The edge chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files, +but small enough to avoid reading/writing too many edges at once. We recommend setting the edge chunk size to +empirical value 2^22 (4,194,304) for most cases. + Data File Format ```````````````` +GraphAr supports multiple file formats for storing the actual data of vertices and edges, +including Apache ORC, Apache Parquet, CSV, and JSON. The file format should be chosen +based on the specific use case and the data processing framework that will be used to +process the graph data. For example, if the graph data will be processed using Apache Spark, +then the Apache Parquet file format is recommended. Adjacency List Type ```````````````````` +Adjacency list is a data structure used to represent the edges of a graph. GraphAr supports multiple types of adjacency lists for a given group of edges, including: + +- **ordered_by_source**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the source, which can be seen as the CSR format. +- **ordered_by_dest**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the destination, which can be seen as the CSC format. +- **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format. 
+- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format. Vertex Chunks in GraphAr @@ -131,12 +139,7 @@ Take the logical table for "person likes person" edges as an example, the logica Physical table of edges ``````````````````````` -As same with the vertex table, the logical edge table is also partitioned into some sub-logical-tables, with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. According to the partition strategy and the order of the edges, edges can be stored in GraphAr following one of the four types: - -- **ordered_by_source**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the source, which can be seen as the CSR format. -- **ordered_by_dest**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the destination, which can be seen as the CSC format. -- **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format. -- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format. +As same with the vertex table, the logical edge table is also partitioned into some sub-logical-tables, with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. According to the partition strategy and the order of the edges, edges can be stored in GraphAr following the setting adjacency list type. 
After that, the whole logical table of edges will be divided into multiple sub-logical-tables with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. Then, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as edge chunk size). Finally, an edge chunk is separated into physical tables in the following way: @@ -161,10 +164,6 @@ Take the "person knows person" edges to illustrate. Suppose the vertex chunk siz When the edge type is **ordered_by_source**, the sorted adjList table together with the offset table can be used as CSR, supporting the fast access of the outgoing edges for a given vertex. Similarly, a CSC view can be constructed by sorting the edges by destination and recording corresponding offsets, supporting the fast access of the incoming edges for a given vertex. - -File Format ------------- - Information files ````````````````` GraphAr uses two kinds of files to store a graph: a group of Yaml files to describe metadata information; and data files to store actual data for vertices and edges. @@ -211,3 +210,7 @@ As previously mentioned, each logical vertex/edge table is divided into multiple Both of Apache ORC and Apache Parquet are column-oriented data storage formats. In practice of graph processing, it is common to only query a subset of columns of the properties. Thus, the column-oriented formats are more efficient, which eliminate the need to read columns that are not relevant. They are also used by a large number of data processing frameworks like `Apache Spark `_, `Apache Hive `_, `Apache Flink `_, and `Apache Hadoop `_. See also `Gar Data Files `_ for an example. + +Implementation +``````````````` +The GraphAr libraries may implement part of the GraphAr format. The implementation status of the GraphAr libraries can refer to the `GraphAr implementation status `_. 
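As a rough illustration of the edge chunking that file-format.rst describes in this patch (sub-logical-tables keyed by the vertex chunk of the source vertex, then fixed-size edge chunks), the following Python sketch may help. It is not GraphAr library code: `vertex_chunk_index` and `partition_edges` are hypothetical names, the integer-division mapping from internal vertex id to vertex chunk is an assumption based on the ids being continuous, and the tiny chunk sizes are chosen only to keep the example readable.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source internal id, destination internal id)


def vertex_chunk_index(vertex_id: int, vertex_chunk_size: int) -> int:
    """Assumed mapping: continuous internal ids are cut into fixed-size chunks."""
    return vertex_id // vertex_chunk_size


def partition_edges(
    edges: List[Edge],
    vertex_chunk_size: int,
    edge_chunk_size: int,
    ordered: bool = True,
) -> Dict[int, List[List[Edge]]]:
    """Group edges by the vertex chunk of their source, then cut edge chunks."""
    # Step 1: one sub-logical-table per source vertex chunk.
    sub_tables: Dict[int, List[Edge]] = defaultdict(list)
    for src, dst in edges:
        sub_tables[vertex_chunk_index(src, vertex_chunk_size)].append((src, dst))

    result: Dict[int, List[List[Edge]]] = {}
    for chunk_idx, table in sub_tables.items():
        if ordered:
            # ordered_by_source: sort by source id (ties broken by destination).
            table.sort()
        # Step 2: split the sub-logical-table into edge chunks of a fixed row count.
        result[chunk_idx] = [
            table[i:i + edge_chunk_size]
            for i in range(0, len(table), edge_chunk_size)
        ]
    return result


# Hypothetical sample data, not taken from the GraphAr examples.
edges = [(0, 3), (5, 1), (1, 2), (4, 0), (1, 0)]
chunks = partition_edges(edges, vertex_chunk_size=4, edge_chunk_size=2)
```

Passing `ordered=False` keeps the insertion order inside each sub-logical-table, which corresponds loosely to the unordered_by_source layout; partitioning by destination instead would only require keying on `dst`.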
diff --git a/docs/format/status.rst b/docs/format/status.rst index c8c04da71..79602bef8 100644 --- a/docs/format/status.rst +++ b/docs/format/status.rst @@ -59,6 +59,8 @@ Payload Data File Formats +-----------------------------+---------+---------+-------+------------+ | HDF5 | | | | | +-----------------------------+---------+---------+-------+------------+ +| JSON | | | | | ++-----------------------------+---------+---------+-------+------------+ Notes: diff --git a/docs/overview/concepts.rst b/docs/overview/concepts.rst index cbeb70380..d52bad278 100644 --- a/docs/overview/concepts.rst +++ b/docs/overview/concepts.rst @@ -3,13 +3,39 @@ Concepts Glossary of relevant concepts and terms. -- Property Graph: A graph with nodes and edges, where both nodes and edges can have properties. Nodes and edges can be labeled with one or more labels. This is the most common type of graph database. -- Vertex -- Edge -- Property group -- Adjacency list -- CSR -- CSC -- COO -- Vertex chunk -- Edge chunk +- *Property Group*: GraphAr splits the properties of vertex/edge into groups to + allow for efficient storage and access without the need to load all properties. + Also benefits appending of new properties. Each property group is the unit + of storage and is stored in a separate directory. + +- *Adjacency List*: The storage method to store the edges of certain vertex type. Which include + - ordered by source vertex id: the edges are ordered and aligned by the source vertex + - ordered by destination vertex id: the edges are ordered and aligned by the destination vertex + - unordered by source vertex id: the edges are unordered but aligned by the source vertex + - unordered by destination vertex id: the edges are unordered but aligned by the destination vertex + +- *Compressed Sparse Row (CSR)*: The storage layout the edges of certain vertex type. 
+ Corresponding to the ordered by source vertex id adjacency list, the edges are + stored in a single array and the offsets of the edges of each vertex are stored + in a separate array. + +- *Compressed Sparse Column (CSC)*: The storage layout the edges of certain vertex type. + Corresponding to the ordered by destination vertex id adjacency list, the edges are + stored in a single array and the offsets of the edges of each vertex are stored in + a separate array. + +- *Coordinate List (COO)*: The storage layout the edges of certain vertex type. + Corresponding to the unordered by source vertex id or unordered by target vertex id + adjacency list, the edges are stored in a single array and no offsets are stored. + +- *Vertex Chunk*: The storage unit of vertex. Each vertex chunk contains a fixed number + of vertices and is stored in a separate file. + +- *Edge Chunk*: The storage unit of edge. Each edge chunk contains a fixed number of edges + and is stored in a separate file. + +*Highlights*: + The design of property group and vertex/edge chunk allows users to + - Access the data without reading all the data into memory + - Conveniently append new properties to the graph without the need to reorganize the data + - Efficiently store and access the data in a distributed environment and parallel processing diff --git a/docs/overview/motivation.rst b/docs/overview/motivation.rst index 078f6e2d4..fa48ee730 100644 --- a/docs/overview/motivation.rst +++ b/docs/overview/motivation.rst @@ -1,7 +1,8 @@ Motivation =========== -Numerous graph systems, such as Neo4j, Nebula Graph, and Apache HugeGraph, have been developed in recent years. +Numerous graph systems, +such as Neo4j, Nebula Graph, and Apache HugeGraph, have been developed in recent years. Each of these systems has its own graph data storage format, complicating the exchange of graph data between different systems. 
The need for a standard data file format for large-scale graph data storage and processing that can be used by diverse existing systems is evident, as it would reduce overhead when various systems work together. From 30d2cd5dd228282bb1802c29480fc99e85545764 Mon Sep 17 00:00:00 2001 From: acezen Date: Wed, 13 Mar 2024 11:42:19 +0800 Subject: [PATCH 5/6] Update Signed-off-by: acezen --- CONTRIBUTING.rst | 13 ++++++++ docs/format/file-format.rst | 36 +++++++++++----------- docs/format/status.rst | 6 ++-- docs/index.rst | 5 ++-- docs/overview/concepts.rst | 60 +++++++++++++++++-------------------- docs/overview/overview.rst | 6 ++-- 6 files changed, 69 insertions(+), 57 deletions(-) diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst index 122e62398..cde16e861 100644 --- a/CONTRIBUTING.rst +++ b/CONTRIBUTING.rst @@ -422,6 +422,15 @@ however, it is not uncommon for the CI infrastructure itself to fail on specific platforms ("be red"). It is vital to visually inspect the results of all failed ("red") tests to determine whether the failure was caused by the changes in the pull request. +Format specification & Libraries implementation +----------------------------------------------- + +GraphAr consists of the format specification and the library implementations. The library implementations are based on the format specification. +When you request a new feature for the format specification, you should first open a feature request issue and discuss it with the community. +If the feature is accepted, you can submit a pull request to update the `format specification design`_. After the format specification is updated, +you can submit a pull request to the related library implementation to implement the new feature and update the `implementation status`_. + + .. _pre-commit: https://pre-commit.com/ .. _Code of Conduct: https://github.com/alibaba/GraphAr/blob/main/CODE_OF_CONDUCT.md @@ -447,3 +456,7 @@ to determine whether the failure was caused by the changes in the pull request. .. 
_Contributor License Agreement: https://cla-assistant.io/alibaba/GraphAr .. _glossary: https://chromium.googlesource.com/chromiumos/docs/+/HEAD/glossary.md + +.. _format specification design: https://github.com/alibaba/GraphAr/tree/main/docs/format/file-format.rst + +.. _implementation status: https://github.com/alibaba/GraphAr/tree/main/docs/format/status.rst diff --git a/docs/format/file-format.rst b/docs/format/file-format.rst index 8e47c8364..197e22696 100644 --- a/docs/format/file-format.rst +++ b/docs/format/file-format.rst @@ -36,18 +36,16 @@ Property Data Types ------------------- GraphAr support a set of built-in property data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), includes: -``` -- Boolean -- Int32: Integer with 32 bits -- Int64: Integer with 64 bits -- Float: 32-bit floating point values -- Double: 64-bit floating point values -- String: Textual data -- Date: days since the Unix epoch -- Timestamp: milliseconds since the Unix epoch -- Time: milliseconds since midnight -- List: A list of values of the same type -``` +- **Boolean** +- **Int32**: Integer with 32 bits +- **Int64**: Integer with 64 bits +- **Float**: 32-bit floating point values +- **Double**: 64-bit floating point values +- **String**: Textual data +- **Date**: days since the Unix epoch +- **Timestamp**: milliseconds since the Unix epoch +- **Time**: milliseconds since midnight +- **List**: A list of values of the same type GraphAr also supports the user-defined data types, which can be used to represent complex data structures, such as the struct, map, and union types. @@ -59,6 +57,7 @@ Vertex Chunk Size ````````````````` The vertex chunk size is a configuration parameter that determines the number of vertices in a vertex chunk and used to partition the logical vertex table into multiple physical vertex tables. 
+ The vertex chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files, but small enough to avoid reading/writing too many vertices at once. We recommend setting the vertex chunk size to empirical value 2^18 (262,144) for most cases. @@ -68,6 +67,7 @@ Edge Chunk Size The edge chunk size is a configuration parameter that determines the number of edges in an edge chunk and used to partition the logical edge table into multiple physical edge tables. + The edge chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files, but small enough to avoid reading/writing too many edges at once. We recommend setting the edge chunk size to empirical value 2^22 (4,194,304) for most cases. @@ -75,8 +75,9 @@ empirical value 2^22 (4,194,304) for most cases. Data File Format ```````````````` GraphAr supports multiple file formats for storing the actual data of vertices and edges, -including Apache ORC, Apache Parquet, CSV, and JSON. The file format should be chosen -based on the specific use case and the data processing framework that will be used to +including Apache ORC, Apache Parquet, CSV, and JSON. + +The file format should be chosen based on the specific use case and the data processing framework that will be used to process the graph data. For example, if the graph data will be processed using Apache Spark, then the Apache Parquet file format is recommended. @@ -165,7 +166,7 @@ Take the "person knows person" edges to illustrate. Suppose the vertex chunk siz When the edge type is **ordered_by_source**, the sorted adjList table together with the offset table can be used as CSR, supporting the fast access of the outgoing edges for a given vertex. Similarly, a CSC view can be constructed by sorting the edges by destination and recording corresponding offsets, supporting the fast access of the incoming edges for a given vertex. 
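The CSR view mentioned in the paragraph above can be sketched in Python. This is a minimal illustration under stated assumptions, not GraphAr library code: `build_offsets` and `out_edges` are hypothetical helper names, the edge list is made-up sample data, and the sketch ignores chunking (in GraphAr the offset table is partitioned in alignment with the vertex chunks, and each offset chunk starts at 0). Given an adjList sorted by source, the offset table records where the edges of each vertex start, so the outgoing edges of a vertex are one contiguous slice.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (source internal id, destination internal id)


def build_offsets(sorted_edges: List[Edge], num_vertices: int) -> List[int]:
    """Return num_vertices + 1 offsets for an edge list sorted by source."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in sorted_edges:
        offsets[src + 1] += 1  # count the out-degree of each source vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]  # prefix sum: counts become start positions
    return offsets


def out_edges(sorted_edges: List[Edge], offsets: List[int], vertex: int) -> List[Edge]:
    """Outgoing edges of `vertex`, read as one contiguous slice of the adjList."""
    return sorted_edges[offsets[vertex]:offsets[vertex + 1]]


# Hypothetical sample data: three vertices, four edges, sorted by source.
edges = sorted([(2, 0), (0, 1), (0, 2), (1, 2)])
offsets = build_offsets(edges, num_vertices=3)
first_vertex_edges = out_edges(edges, offsets, 0)
```

Sorting by destination and building the offsets over destination ids would give the corresponding CSC view for incoming edges.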
Information files -````````````````` +------------------ GraphAr uses two kinds of files to store a graph: a group of YAML files that describe the metadata, and data files that store the actual data for vertices and edges. A graph information file named ".graph.yml" describes the meta information for a graph whose name is . The content of this file includes: @@ -199,7 +200,7 @@ An edge information file which named "__`_ for an example. Data files -`````````` +---------- As previously mentioned, each logical vertex/edge table is divided into multiple physical tables stored in one of the following file formats: - `Apache ORC `_ @@ -212,5 +213,6 @@ Both of Apache ORC and Apache Parquet are column-oriented data storage formats. See also `Gar Data Files `_ for an example. Implementation -``````````````` +-------------- The GraphAr libraries may implement only part of the GraphAr format. For the implementation status of each library, refer to the `GraphAr implementation status `_. + diff --git a/docs/format/status.rst b/docs/format/status.rst index 79602bef8..19a0078c1 100644 --- a/docs/format/status.rst +++ b/docs/format/status.rst @@ -25,9 +25,11 @@ Data Types +-------------------+-------+-------+-------+------------+ | String | ✓ | ✓ | ✓ | ✓ | +-------------------+-------+-------+-------+------------+ -| Date | | | | | +| Date | ✓ | | | | +-------------------+-------+-------+-------+------------+ -| Timestamp | | | | | +| Timestamp | ✓ | | | | ++-------------------+-------+-------+-------+------------+ +| Time | | | | | +-------------------+-------+-------+-------+------------+ +-------------------+-------+-------+-------+------------+ diff --git a/docs/index.rst b/docs/index.rst index 94d568b93..b7093300e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -10,16 +10,15 @@ :caption: Overview :hidden: - Overview + Overview Motivation - Concepts + Concepts ..
toctree:: :maxdepth: 1 :caption: Format Specification :hidden: - Overview File Format Implementation Status diff --git a/docs/overview/concepts.rst b/docs/overview/concepts.rst index d52bad278..44271634b 100644 --- a/docs/overview/concepts.rst +++ b/docs/overview/concepts.rst @@ -3,38 +3,34 @@ Concepts Glossary of relevant concepts and terms. -- *Property Group*: GraphAr splits the properties of vertex/edge into groups to - allow for efficient storage and access without the need to load all properties. - Also benefits appending of new properties. Each property group is the unit - of storage and is stored in a separate directory. - -- *Adjacency List*: The storage method to store the edges of certain vertex type. Which include - - ordered by source vertex id: the edges are ordered and aligned by the source vertex - - ordered by destination vertex id: the edges are ordered and aligned by the destination vertex - - unordered by source vertex id: the edges are unordered but aligned by the source vertex - - unordered by destination vertex id: the edges are unordered but aligned by the destination vertex - -- *Compressed Sparse Row (CSR)*: The storage layout the edges of certain vertex type. - Corresponding to the ordered by source vertex id adjacency list, the edges are - stored in a single array and the offsets of the edges of each vertex are stored - in a separate array. - -- *Compressed Sparse Column (CSC)*: The storage layout the edges of certain vertex type. - Corresponding to the ordered by destination vertex id adjacency list, the edges are - stored in a single array and the offsets of the edges of each vertex are stored in - a separate array. - -- *Coordinate List (COO)*: The storage layout the edges of certain vertex type. - Corresponding to the unordered by source vertex id or unordered by target vertex id - adjacency list, the edges are stored in a single array and no offsets are stored. - -- *Vertex Chunk*: The storage unit of vertex. 
Each vertex chunk contains a fixed number - of vertices and is stored in a separate file. - -- *Edge Chunk*: The storage unit of edge. Each edge chunk contains a fixed number of edges - and is stored in a separate file. - -*Highlights*: +- **Property Group**: GraphAr splits the properties of vertex/edge into groups to allow for efficient storage + and access without the need to load all properties. Also benefits appending of new properties. Each property + group is the unit of storage and is stored in a separate directory. + +- **Adjacency List**: The storage method to store the edges of certain vertex type. Which include: + - *ordered by source vertex id*: the edges are ordered and aligned by the source vertex + - *ordered by destination vertex id*: the edges are ordered and aligned by the destination vertex + - *unordered by source vertex id*: the edges are unordered but aligned by the source vertex + - *unordered by destination vertex id*: the edges are unordered but aligned by the destination vertex + +- **Compressed Sparse Row (CSR)**: The storage layout the edges of certain vertex type. Corresponding to the + ordered by source vertex id adjacency list, the edges are stored in a single array and the offsets of the + edges of each vertex are stored in a separate array. + +- **Compressed Sparse Column (CSC)**: The storage layout the edges of certain vertex type. Corresponding to the + ordered by destination vertex id adjacency list, the edges are stored in a single array and the offsets of the + edges of each vertex are stored in a separate array. + +- **Coordinate List (COO)**: The storage layout the edges of certain vertex type. Corresponding to the unordered + by source vertex id or unordered by target vertex id adjacency list, the edges are stored in a single array and + no offsets are stored. + +- **Vertex Chunk**: The storage unit of vertex. Each vertex chunk contains a fixed number of vertices and is stored + in a separate file. 
+ +- **Edge Chunk**: The storage unit of edge. Each edge chunk contains a fixed number of edges and is stored in a separate file. + +**Highlights**: The design of property group and vertex/edge chunk allows users to - Access the data without reading all the data into memory - Conveniently append new properties to the graph without the need to reorganize the data diff --git a/docs/overview/overview.rst b/docs/overview/overview.rst index 5dd963835..449b84805 100644 --- a/docs/overview/overview.rst +++ b/docs/overview/overview.rst @@ -1,7 +1,7 @@ Overview ========= -.. image:: images/overview.png +.. image:: ../images/overview.png :alt: overview GraphAr is a project to standardize the graph data format and provide a set of libraries to generate, access and transform such formatted files. @@ -11,6 +11,6 @@ It is intended to serve as the standard file format for importing/exporting and Additionally, it can also serve as the direct data source for graph processing applications. -Motivation +`Motivation `_ -Concepts +`Concepts `_ From c24bc7a9b3f30ec49aaab314e66af367502957bf Mon Sep 17 00:00:00 2001 From: acezen Date: Wed, 13 Mar 2024 11:43:48 +0800 Subject: [PATCH 6/6] Update Signed-off-by: acezen --- CONTRIBUTING.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst index cde16e861..dbbcac40f 100644 --- a/CONTRIBUTING.rst +++ b/CONTRIBUTING.rst @@ -425,7 +425,7 @@ to determine whether the failure was caused by the changes in the pull request. Format specification & Libraries implementation ----------------------------------------------- -The GraphAr is consist of the format specification and libraries implementation. The libraries implementation is based on the format specification. +The GraphAr includes the format specification and libraries implementation. The libraries implementation is based on the format specification. 
When you request a new feature for the format specification, you should first open a feature request issue and discuss it with the community. If the feature is accepted, you can submit a pull request to update the `format specification design`_. After the format specification is updated, you can submit a pull request to the related library implementations to implement the new feature and update the `implementation status`_.