diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst index 122e62398..dbbcac40f 100644 --- a/CONTRIBUTING.rst +++ b/CONTRIBUTING.rst @@ -422,6 +422,15 @@ however, it is not uncommon for the CI infrastructure itself to fail on specific platforms ("be red"). It is vital to visually inspect the results of all failed ("red") tests to determine whether the failure was caused by the changes in the pull request. +Format specification & Libraries implementation +----------------------------------------------- + +GraphAr consists of a format specification and library implementations. The library implementations are based on the format specification. +To request a new feature for the format specification, first open a feature request issue and discuss it with the community. +If the feature is accepted, you can submit a pull request to update the `format specification design`_. After the format specification is updated, +you can submit a pull request to the related library implementation to implement the new feature and update the `implementation status`_. + + .. _pre-commit: https://pre-commit.com/ .. _Code of Conduct: https://github.com/alibaba/GraphAr/blob/main/CODE_OF_CONDUCT.md @@ -447,3 +456,7 @@ to determine whether the failure was caused by the changes in the pull request. .. _Contributor License Agreement: https://cla-assistant.io/alibaba/GraphAr .. _glossary: https://chromium.googlesource.com/chromiumos/docs/+/HEAD/glossary.md + +.. _format specification design: https://github.com/alibaba/GraphAr/tree/main/docs/format/file-format.rst + +.. 
_implementation status: https://github.com/alibaba/GraphAr/tree/main/docs/format/status.rst diff --git a/docs/format/file-format.rst b/docs/format/file-format.rst index 35b8fa8d1..197e22696 100644 --- a/docs/format/file-format.rst +++ b/docs/format/file-format.rst @@ -1,10 +1,16 @@ GraphAr File Format ============================ -What is Property Graph ------------------------- +Property Graph +--------------- -GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges. Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties. Since carrying additional information than non-property graphs, the property graph is able to represent connections among data scattered across diverse data databases and with different schemas. Compared with the relational database schema, the property graph excels at showing data dependencies. Therefore, it is widely-used in modeling modern applications including social network analytics, data mining, network routing, scientific computing and so on. +GraphAr is designed for representing and storing property graphs. A graph (in discrete mathematics) is a structure made of vertices and edges. +A property graph is a type of graph model in which the vertices/edges can carry a name (also called a type or label) and some properties. +Because it carries more information than a non-property graph, the property graph is able to represent +connections among data scattered across diverse databases and with different schemas. +Compared with the relational database schema, the property graph excels at showing data dependencies. +Therefore, it is widely used in modeling modern applications including social network analytics, data mining, +network routing, scientific computing and so on. 
A property graph consists of vertices and edges, with each vertex contains a unique identifier and: @@ -26,7 +32,66 @@ The following is an example property graph containing two types of vertices ("pe :alt: property graph -Vertices in GraphAr +Property Data Types +------------------- +GraphAr supports a set of built-in property data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), including: + +- **Boolean** +- **Int32**: Integer with 32 bits +- **Int64**: Integer with 64 bits +- **Float**: 32-bit floating point values +- **Double**: 64-bit floating point values +- **String**: Textual data +- **Date**: Days since the Unix epoch +- **Timestamp**: Milliseconds since the Unix epoch +- **Time**: Milliseconds since midnight +- **List**: A list of values of the same type + +GraphAr also supports user-defined data types, which can be used to represent complex data structures, +such as the struct, map, and union types. + +Configurations +-------------- + +Vertex Chunk Size +````````````````` +The vertex chunk size is a configuration parameter that determines the number of vertices in a vertex chunk +and is used to partition the logical vertex table into multiple physical vertex tables. + +The vertex chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files, +but small enough to avoid reading/writing too many vertices at once. We recommend setting the vertex chunk size to +the empirical value 2^18 (262,144) for most cases. + +Edge Chunk Size +```````````````` + +The edge chunk size is a configuration parameter that determines the number of edges in an edge chunk +and is used to partition the logical edge table into multiple physical edge tables. + +The edge chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files, +but small enough to avoid reading/writing too many edges at once. 
We recommend setting the edge chunk size to +the empirical value 2^22 (4,194,304) for most cases. + +Data File Format +```````````````` +GraphAr supports multiple file formats for storing the actual data of vertices and edges, +including Apache ORC, Apache Parquet, CSV, and JSON. + +The file format should be chosen based on the specific use case and the data processing framework that will be used to +process the graph data. For example, if the graph data will be processed using Apache Spark, +then the Apache Parquet file format is recommended. + +Adjacency List Type +```````````````````` +An adjacency list is a data structure used to represent the edges of a graph. GraphAr supports multiple types of adjacency lists for a given group of edges, including: + +- **ordered_by_source**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the source, which can be seen as the CSR format. +- **ordered_by_dest**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the destination, which can be seen as the CSC format. +- **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format. +- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format. + + +Vertex Chunks in GraphAr ------------------------ Logical table of vertices @@ -59,7 +124,7 @@ Take the "person" vertex table as an example, if the chunk size is set to be 500 **Note**: For efficiently utilize the filter push-down of the payload file format like Parquet, the internal vertex id is stored in the payload file as a column. 
And since the internal vertex id is continuous, the payload file format can use the delta encoding for the internal vertex id column, which would not bring too much overhead for the storage. -Edges in GraphAr +Edge Chunks in GraphAr ------------------------ Logical table of edges @@ -75,17 +140,12 @@ Take the logical table for "person likes person" edges as an example, the logica Physical table of edges ``````````````````````` -As same with the vertex table, the logical edge table is also partitioned into some sub-logical-tables, with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. According to the partition strategy and the order of the edges, edges can be stored in GraphAr following one of the four types: - -- **ordered_by_source**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the source, which can be seen as the CSR format. -- **ordered_by_dest**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the destination, which can be seen as the CSC format. -- **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format. -- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format. +As with the vertex table, the logical edge table is also partitioned into sub-logical-tables, each containing the edges whose source (or destination) vertices are in the same vertex chunk. According to the partition strategy and the order of the edges, edges are stored in GraphAr following the configured adjacency list type. 
-After that, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as edge chunk size). Finally, an edge chunk is separated into physical tables in the following way: +After that, the whole logical table of edges is divided into multiple sub-logical-tables, each containing the edges whose source (or destination) vertices are in the same vertex chunk. Then, each sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as the edge chunk size). Finally, an edge chunk is separated into physical tables in the following way: - an adjList table (which contains only two columns: the internal vertex id of the source and the destination). -- 0 or more edge property tables, with each table contains a group of properties. +- 0 or more property group tables (each contains a group of edge properties). Additionally, there would be an offset table for **ordered_by_source** or **ordered_by_dest** edges. The offset table is used to record the starting point of the edges for each vertex. The partition of the offset table should be in alignment with the partition of the corresponding vertex table. The first row of each offset chunk is always 0, indicating the starting point for the corresponding sub-logical-table for edges. @@ -105,13 +165,9 @@ Take the "person knows person" edges to illustrate. Suppose the vertex chunk siz When the edge type is **ordered_by_source**, the sorted adjList table together with the offset table can be used as CSR, supporting the fast access of the outgoing edges for a given vertex. Similarly, a CSC view can be constructed by sorting the edges by destination and recording corresponding offsets, supporting the fast access of the incoming edges for a given vertex. 
- -File Format ------------------------- - Information files -````````````````` -GraphAr uses two kinds of files to store a graph: a group of Yaml files to describe meta information; and data files to store actual data for vertices and edges. +------------------ +GraphAr uses two kinds of files to store a graph: a group of YAML files to describe metadata, and data files to store the actual data for vertices and edges. A graph information file which named ".graph.yml" describes the meta information for a graph whose name is . The content of this file includes: - the graph name; @@ -144,30 +200,19 @@ An edge information file which named "__`_ for an example. Data files -`````````` +---------- As previously mentioned, each logical vertex/edge table is divided into multiple physical tables stored in one of the following file formats: - `Apache ORC `_ - `Apache Parquet `_ - CSV +- JSON Both of Apache ORC and Apache Parquet are column-oriented data storage formats. In practice of graph processing, it is common to only query a subset of columns of the properties. Thus, the column-oriented formats are more efficient, which eliminate the need to read columns that are not relevant. They are also used by a large number of data processing frameworks like `Apache Spark `_, `Apache Hive `_, `Apache Flink `_, and `Apache Hadoop `_. See also `Gar Data Files `_ for an example. -Data Types -`````````` -GraphAr provides a set of built-in data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), includes: - -- bool -- int32 -- int64 -- float -- double -- string -- list (of int32, int64, float, double, string; not supported by CSV) - -.. tip:: +Implementation +-------------- +The GraphAr libraries may implement only part of the GraphAr format. For the implementation status of each library, refer to the `GraphAr implementation status `_. 
- We are continuously adding more built-in data types in GraphAr, and self-defined data types will be supported. - \ No newline at end of file diff --git a/docs/format/status.rst b/docs/format/status.rst index c8c04da71..19a0078c1 100644 --- a/docs/format/status.rst +++ b/docs/format/status.rst @@ -25,9 +25,11 @@ Data Types +-------------------+-------+-------+-------+------------+ | String | ✓ | ✓ | ✓ | ✓ | +-------------------+-------+-------+-------+------------+ -| Date | | | | | +| Date | ✓ | | | | +-------------------+-------+-------+-------+------------+ -| Timestamp | | | | | +| Timestamp | ✓ | | | | ++-------------------+-------+-------+-------+------------+ +| Time | | | | | +-------------------+-------+-------+-------+------------+ +-------------------+-------+-------+-------+------------+ @@ -59,6 +61,8 @@ Payload Data File Formats +-----------------------------+---------+---------+-------+------------+ | HDF5 | | | | | +-----------------------------+---------+---------+-------+------------+ +| JSON | | | | | ++-----------------------------+---------+---------+-------+------------+ Notes: diff --git a/docs/index.rst b/docs/index.rst index 266489519..b7093300e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -10,7 +10,15 @@ :caption: Overview :hidden: - Overview + Overview + Motivation + Concepts + +.. toctree:: + :maxdepth: 1 + :caption: Format Specification + :hidden: + File Format Implementation Status diff --git a/docs/overview.rst b/docs/overview.rst deleted file mode 100644 index 5a3ca80df..000000000 --- a/docs/overview.rst +++ /dev/null @@ -1,32 +0,0 @@ -Overview -============================ - -What is GraphAr ------------------------- - -Graph processing serves as the essential building block for a diverse variety of real-world applications such as social network analytics, data mining, network routing, and scientific computing. 
As the graph processing becomes increasingly important, there are many in-memory and out-of-core graph storages, databases, graph computing systems and interactive graph query frameworks have emerged. - -To accommodate this fragmented graph processing ecology, **GraphAr (Graph Archive, GAR)** is established to enable diverse graph applications or existing systems to build and access the graph data conveniently and efficiently. It specifies a standardized system-independent file format for graphs and provides a set of libraries to generate, access and transform such formatted files. - -GraphAr is intended to serve as the standard file format for importing/exporting and persistent storage of the graph data which can be used by diverse existing systems, reducing the overhead when various systems co-work. Additionally, it can also serve as the direct data source for graph processing applications. - -The GraphAr project includes such topics as: - -- Design of the Graph Archive (GAR) file format. (see `GraphAr File Format `_) -- A set of libraries for reading, writing and transforming GAR files. (now `the C++ library `_ , `JAVA library `_ and `the Spark library `_ are available) -- Examples about how to use GraphAr to write graph algorithms, or to work with existing systems such as GraphScope. (see `Application Cases <../cpp/examples/out-of-core.html>`_) - -.. image:: images/overview.png - :alt: overview - - -GraphAr Features ------------------------- - -The features of GraphAr include: - -- It supports the property graphs and different representations for the graph structure (COO, CSR and CSC). -- It is compatible with existing widely-used file types including CSV, ORC and Parquet. -- Apache Spark can be utilized to generate, load and transform the GAR files. -- It is convenient to be used by a variety of single-machine/distributed graph processing systems, databases, and other downstream computing tasks. 
-- It enables to modify the topology structure or the properties of the graph, or to construct a new graph with a set of selected vertices/edges. \ No newline at end of file diff --git a/docs/overview/concepts.rst b/docs/overview/concepts.rst new file mode 100644 index 000000000..44271634b --- /dev/null +++ b/docs/overview/concepts.rst @@ -0,0 +1,37 @@ +Concepts +========= + +Glossary of relevant concepts and terms. + +- **Property Group**: GraphAr splits the properties of a vertex/edge type into groups to allow for efficient storage + and access without the need to load all properties. This also makes it easy to append new properties. Each property + group is the unit of storage and is stored in a separate directory. + +- **Adjacency List**: The storage method for the edges of a certain vertex type. The supported types are: + - *ordered by source vertex id*: the edges are ordered and aligned by the source vertex + - *ordered by destination vertex id*: the edges are ordered and aligned by the destination vertex + - *unordered by source vertex id*: the edges are unordered but aligned by the source vertex + - *unordered by destination vertex id*: the edges are unordered but aligned by the destination vertex + +- **Compressed Sparse Row (CSR)**: A storage layout for the edges of a certain vertex type. It corresponds to the + ordered by source vertex id adjacency list: the edges are stored in a single array and the offsets of the + edges of each vertex are stored in a separate array. + +- **Compressed Sparse Column (CSC)**: A storage layout for the edges of a certain vertex type. It corresponds to the + ordered by destination vertex id adjacency list: the edges are stored in a single array and the offsets of the + edges of each vertex are stored in a separate array. + +- **Coordinate List (COO)**: A storage layout for the edges of a certain vertex type. 
It corresponds to the unordered + by source vertex id or unordered by destination vertex id adjacency list: the edges are stored in a single array and + no offsets are stored. + +- **Vertex Chunk**: The storage unit of vertices. Each vertex chunk contains a fixed number of vertices and is stored + in a separate file. + +- **Edge Chunk**: The storage unit of edges. Each edge chunk contains a fixed number of edges and is stored in a separate file. + +**Highlights**: + The design of property group and vertex/edge chunk allows users to + - Access the data without reading all the data into memory + - Conveniently append new properties to the graph without the need to reorganize the data + - Efficiently store and access the data in a distributed environment and with parallel processing diff --git a/docs/overview/motivation.rst b/docs/overview/motivation.rst new file mode 100644 index 000000000..fa48ee730 --- /dev/null +++ b/docs/overview/motivation.rst @@ -0,0 +1,11 @@ +Motivation +=========== + +Numerous graph systems, +such as Neo4j, Nebula Graph, and Apache HugeGraph, have been developed in recent years. +Each of these systems has its own graph data storage format, complicating the exchange of graph data between different systems. +The need for a standard data file format for large-scale graph data storage and processing that can be used by diverse existing systems is evident, as it would reduce overhead when various systems work together. + +Our aim is to fill this gap and contribute to the open-source community by providing a standard data file format for graph data storage and exchange, as well as for out-of-core querying. +This format, which we have named GraphAr, is engineered to be efficient, cross-language compatible, and to support out-of-core processing scenarios, such as those commonly found in data lakes. 
+Furthermore, GraphAr's flexible design ensures that it can be easily extended to accommodate a broader array of graph data storage and exchange use cases in the future. diff --git a/docs/overview/overview.rst b/docs/overview/overview.rst new file mode 100644 index 000000000..449b84805 --- /dev/null +++ b/docs/overview/overview.rst @@ -0,0 +1,16 @@ +Overview +========= + +.. image:: ../images/overview.png + :alt: overview + +GraphAr is a project to standardize the graph data format and provide a set of libraries to generate, access and transform such formatted files. + +It is intended to serve as the standard file format for importing/exporting and persistent storage of the graph data which can be used by diverse existing systems, reducing the overhead when various systems co-work. + +Additionally, it can also serve as the direct data source for graph processing applications. + + +`Motivation `_ + +`Concepts `_
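Review note on the file-format.rst changes in this patch: the **ordered_by_source** layout (partition edges by the vertex chunk of the source, sort them, record per-vertex offsets starting at 0, then cut fixed-size edge chunks) may be easier to follow with a small illustration. The following is a minimal Python sketch under stated assumptions, not the GraphAr library API: the function name `build_ordered_by_source`, the dictionary layout, and the choice of storing one extra trailing offset per vertex chunk are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_ordered_by_source(edges, num_vertices, vertex_chunk_size, edge_chunk_size):
    """Illustrative only: lay out (src, dst) pairs in an ordered_by_source style.

    Assumes internal vertex ids are the integers 0..num_vertices-1.
    """
    # Step 1: one sub-logical-table per vertex chunk of the source vertex.
    groups = defaultdict(list)
    for src, dst in edges:
        groups[src // vertex_chunk_size].append((src, dst))

    num_vertex_chunks = (num_vertices + vertex_chunk_size - 1) // vertex_chunk_size
    layout = {}
    for chunk in range(num_vertex_chunks):
        # Step 2: order the edges of this sub-logical-table by source id.
        rows = sorted(groups.get(chunk, []), key=lambda edge: edge[0])

        # Step 3: offsets aligned with the vertex chunk; the first row is 0.
        # Assumption: a trailing entry marks the end of the last vertex's edges.
        first = chunk * vertex_chunk_size
        last = min(first + vertex_chunk_size, num_vertices)
        offsets = [0]
        pos = 0
        for vertex in range(first, last):
            while pos < len(rows) and rows[pos][0] == vertex:
                pos += 1
            offsets.append(pos)

        # Step 4: cut the sub-logical-table into fixed-size edge chunks.
        edge_chunks = [
            rows[i:i + edge_chunk_size] for i in range(0, len(rows), edge_chunk_size)
        ]
        layout[chunk] = {"offsets": offsets, "edge_chunks": edge_chunks}
    return layout


# Tiny example: 4 vertices, 2 vertices per vertex chunk, 2 edges per edge chunk.
example_edges = [(0, 1), (2, 0), (1, 3), (0, 2), (3, 1)]
layout = build_ordered_by_source(example_edges, 4, 2, 2)
```

With these offsets, the outgoing edges of a vertex `v` in vertex chunk `c` are rows `offsets[v - first]` up to (but not including) `offsets[v - first + 1]` of that chunk's sub-logical-table, which is the CSR-style access the specification text describes. The recommended chunk sizes in the patch (2^18 and 2^22) are replaced by tiny values here only to keep the example readable.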