Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat][Doc] Refactor and update the format specification document #387

Merged
merged 6 commits into from
Mar 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions CONTRIBUTING.rst
Original file line number Diff line number Diff line change
Expand Up @@ -422,6 +422,15 @@ however, it is not uncommon for the CI infrastructure itself to fail on specific
platforms ("be red"). It is vital to visually inspect the results of all failed ("red") tests
to determine whether the failure was caused by the changes in the pull request.

Format specification & Libraries implementation
-----------------------------------------------

The GraphAr includes the format specification and libraries implementation. The libraries implementation is based on the format specification.
When you request a new feature to the format specification, you should first open a feature request issue and discuss with the community.
If the feature is accepted, you can submit a pull request update the `format specification design`_. After the format specification is updated,
you can submit a pull request to the related libraries implementation to implement the new feature and update the `implementation status`_.


.. _pre-commit: https://pre-commit.com/

.. _Code of Conduct: https://github.com/alibaba/GraphAr/blob/main/CODE_OF_CONDUCT.md
Expand All @@ -447,3 +456,7 @@ to determine whether the failure was caused by the changes in the pull request.
.. _Contributor License Agreement: https://cla-assistant.io/alibaba/GraphAr

.. _glossary: https://chromium.googlesource.com/chromiumos/docs/+/HEAD/glossary.md

.. _format specification design: https://github.com/alibaba/GraphAr/tree/main/docs/format/file-format.rst

.. _implementation status: https://github.com/alibaba/GraphAr/tree/main/docs/format/status.rst
115 changes: 80 additions & 35 deletions docs/format/file-format.rst
Original file line number Diff line number Diff line change
@@ -1,10 +1,16 @@
GraphAr File Format
============================

What is Property Graph
------------------------
Property Graph
---------------

GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges. Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties. Since carrying additional information than non-property graphs, the property graph is able to represent connections among data scattered across diverse data databases and with different schemas. Compared with the relational database schema, the property graph excels at showing data dependencies. Therefore, it is widely-used in modeling modern applications including social network analytics, data mining, network routing, scientific computing and so on.
GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges.
Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties.
Since carrying additional information than non-property graphs, the property graph is able to represent
connections among data scattered across diverse data databases and with different schemas.
Compared with the relational database schema, the property graph excels at showing data dependencies.
Therefore, it is widely-used in modeling modern applications including social network analytics, data mining,
network routing, scientific computing and so on.

A property graph consists of vertices and edges, with each vertex contains a unique identifier and:

Expand All @@ -26,7 +32,66 @@ The following is an example property graph containing two types of vertices ("pe
:alt: property graph


Vertices in GraphAr
Property Data Types
-------------------
GraphAr support a set of built-in property data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), includes:

- **Boolean**
- **Int32**: Integer with 32 bits
- **Int64**: Integer with 64 bits
- **Float**: 32-bit floating point values
- **Double**: 64-bit floating point values
- **String**: Textual data
- **Date**: days since the Unix epoch
- **Timestamp**: milliseconds since the Unix epoch
- **Time**: milliseconds since midnight
- **List**: A list of values of the same type

GraphAr also supports the user-defined data types, which can be used to represent complex data structures,
such as the struct, map, and union types.

Configurations
--------------

Vertex Chunk Size
`````````````````
The vertex chunk size is a configuration parameter that determines the number of vertices in a vertex chunk
and used to partition the logical vertex table into multiple physical vertex tables.

The vertex chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files,
but small enough to avoid reading/writing too many vertices at once. We recommend setting the vertex chunk size to
empirical value 2^18 (262,144) for most cases.

Edge Chunk Size
````````````````

The edge chunk size is a configuration parameter that determines the number of edges in an edge chunk
and used to partition the logical edge table into multiple physical edge tables.

The edge chunk size should be set to a value that is large enough to reduce the overhead of reading/writing files,
but small enough to avoid reading/writing too many edges at once. We recommend setting the edge chunk size to
empirical value 2^22 (4,194,304) for most cases.

Data File Format
````````````````
GraphAr supports multiple file formats for storing the actual data of vertices and edges,
including Apache ORC, Apache Parquet, CSV, and JSON.

The file format should be chosen based on the specific use case and the data processing framework that will be used to
process the graph data. For example, if the graph data will be processed using Apache Spark,
then the Apache Parquet file format is recommended.

Adjacency List Type
````````````````````
Adjacency list is a data structure used to represent the edges of a graph. GraphAr supports multiple types of adjacency lists for a given group of edges, including:

- **ordered_by_source**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the source, which can be seen as the CSR format.
- **ordered_by_dest**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the destination, which can be seen as the CSC format.
- **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format.
- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format.


Vertex Chunks in GraphAr
------------------------

Logical table of vertices
Expand Down Expand Up @@ -59,7 +124,7 @@ Take the "person" vertex table as an example, if the chunk size is set to be 500

**Note**: For efficiently utilize the filter push-down of the payload file format like Parquet, the internal vertex id is stored in the payload file as a column. And since the internal vertex id is continuous, the payload file format can use the delta encoding for the internal vertex id column, which would not bring too much overhead for the storage.

Edges in GraphAr
Edge Chunks in GraphAr
------------------------

Logical table of edges
Expand All @@ -75,17 +140,12 @@ Take the logical table for "person likes person" edges as an example, the logica

Physical table of edges
```````````````````````
As same with the vertex table, the logical edge table is also partitioned into some sub-logical-tables, with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. According to the partition strategy and the order of the edges, edges can be stored in GraphAr following one of the four types:

- **ordered_by_source**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the source, which can be seen as the CSR format.
- **ordered_by_dest**: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the destination, which can be seen as the CSC format.
- **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format.
- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format.
As same with the vertex table, the logical edge table is also partitioned into some sub-logical-tables, with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. According to the partition strategy and the order of the edges, edges can be stored in GraphAr following the setting adjacency list type.

After that, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as edge chunk size). Finally, an edge chunk is separated into physical tables in the following way:
After that, the whole logical table of edges will be divided into multiple sub-logical-tables with each sub-logical-table contains edges that the source (or destination) vertices are in the same vertex chunk. Then, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as edge chunk size). Finally, an edge chunk is separated into physical tables in the following way:

- an adjList table (which contains only two columns: the internal vertex id of the source and the destination).
- 0 or more edge property tables, with each table contains a group of properties.
- 0 or more property group tables (each contains the properties of the edges).

Additionally, there would be an offset table for **ordered_by_source** or **ordered_by_dest** edges. The offset table is used to record the starting point of the edges for each vertex. The partition of the offset table should be in alignment with the partition of the corresponding vertex table. The first row of each offset chunk is always 0, indicating the starting point for the corresponding sub-logical-table for edges.

Expand All @@ -105,13 +165,9 @@ Take the "person knows person" edges to illustrate. Suppose the vertex chunk siz

When the edge type is **ordered_by_source**, the sorted adjList table together with the offset table can be used as CSR, supporting the fast access of the outgoing edges for a given vertex. Similarly, a CSC view can be constructed by sorting the edges by destination and recording corresponding offsets, supporting the fast access of the incoming edges for a given vertex.


File Format
------------------------

Information files
`````````````````
GraphAr uses two kinds of files to store a graph: a group of Yaml files to describe meta information; and data files to store actual data for vertices and edges.
------------------
GraphAr uses two kinds of files to store a graph: a group of Yaml files to describe metadata information; and data files to store actual data for vertices and edges.
A graph information file which named "<name>.graph.yml" describes the meta information for a graph whose name is <name>. The content of this file includes:

- the graph name;
Expand Down Expand Up @@ -144,30 +200,19 @@ An edge information file which named "<source label>_<edge label>_<destination l
See also `Gar Information Files <cpp/getting-started.html#gar-information-files>`_ for an example.

Data files
``````````
----------
As previously mentioned, each logical vertex/edge table is divided into multiple physical tables stored in one of the following file formats:

- `Apache ORC <https://orc.apache.org/>`_
- `Apache Parquet <https://parquet.apache.org/>`_
- CSV
- JSON

Both of Apache ORC and Apache Parquet are column-oriented data storage formats. In practice of graph processing, it is common to only query a subset of columns of the properties. Thus, the column-oriented formats are more efficient, which eliminate the need to read columns that are not relevant. They are also used by a large number of data processing frameworks like `Apache Spark <https://spark.apache.org/>`_, `Apache Hive <https://hive.apache.org/>`_, `Apache Flink <https://flink.apache.org/>`_, and `Apache Hadoop <https://hadoop.apache.org/>`_.

See also `Gar Data Files <cpp/getting-started.html#gar-data-files>`_ for an example.

Data Types
``````````
GraphAr provides a set of built-in data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), includes:

- bool
- int32
- int64
- float
- double
- string
- list (of int32, int64, float, double, string; not supported by CSV)

.. tip::
Implementation
--------------
The GraphAr libraries may implement part of the GraphAr format. The implementation status of the GraphAr libraries can refer to the `GraphAr implementation status <status.html>`_.

We are continuously adding more built-in data types in GraphAr, and self-defined data types will be supported.

8 changes: 6 additions & 2 deletions docs/format/status.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,9 +25,11 @@ Data Types
+-------------------+-------+-------+-------+------------+
| String | ✓ | ✓ | ✓ | ✓ |
+-------------------+-------+-------+-------+------------+
| Date | | | | |
| Date | | | | |
+-------------------+-------+-------+-------+------------+
| Timestamp | | | | |
| Timestamp | ✓ | | | |
+-------------------+-------+-------+-------+------------+
| Time | | | | |
+-------------------+-------+-------+-------+------------+

+-------------------+-------+-------+-------+------------+
Expand Down Expand Up @@ -59,6 +61,8 @@ Payload Data File Formats
+-----------------------------+---------+---------+-------+------------+
| HDF5 | | | | |
+-----------------------------+---------+---------+-------+------------+
| JSON | | | | |
+-----------------------------+---------+---------+-------+------------+

Notes:

Expand Down
10 changes: 9 additions & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,15 @@
:caption: Overview
:hidden:

Overview <overview>
Overview <overview/overview>
Motivation <overview/motivation>
Concepts <overview/concepts>

.. toctree::
:maxdepth: 1
:caption: Format Specification
:hidden:

File Format <format/file-format>
Implementation Status <format/status>

Expand Down
32 changes: 0 additions & 32 deletions docs/overview.rst

This file was deleted.

37 changes: 37 additions & 0 deletions docs/overview/concepts.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
Concepts
=========

Glossary of relevant concepts and terms.

- **Property Group**: GraphAr splits the properties of vertex/edge into groups to allow for efficient storage
and access without the need to load all properties. Also benefits appending of new properties. Each property
group is the unit of storage and is stored in a separate directory.

- **Adjacency List**: The storage method to store the edges of certain vertex type. Which include:
- *ordered by source vertex id*: the edges are ordered and aligned by the source vertex
- *ordered by destination vertex id*: the edges are ordered and aligned by the destination vertex
- *unordered by source vertex id*: the edges are unordered but aligned by the source vertex
- *unordered by destination vertex id*: the edges are unordered but aligned by the destination vertex

- **Compressed Sparse Row (CSR)**: The storage layout the edges of certain vertex type. Corresponding to the
ordered by source vertex id adjacency list, the edges are stored in a single array and the offsets of the
edges of each vertex are stored in a separate array.

- **Compressed Sparse Column (CSC)**: The storage layout the edges of certain vertex type. Corresponding to the
ordered by destination vertex id adjacency list, the edges are stored in a single array and the offsets of the
edges of each vertex are stored in a separate array.

- **Coordinate List (COO)**: The storage layout the edges of certain vertex type. Corresponding to the unordered
by source vertex id or unordered by target vertex id adjacency list, the edges are stored in a single array and
no offsets are stored.

- **Vertex Chunk**: The storage unit of vertex. Each vertex chunk contains a fixed number of vertices and is stored
in a separate file.

- **Edge Chunk**: The storage unit of edge. Each edge chunk contains a fixed number of edges and is stored in a separate file.

**Highlights**:
The design of property group and vertex/edge chunk allows users to
- Access the data without reading all the data into memory
- Conveniently append new properties to the graph without the need to reorganize the data
- Efficiently store and access the data in a distributed environment and parallel processing
Loading
Loading