[Improve][Doc] Revise the README and APIs docstring of GraphAr (#64)
acezen authored Jan 9, 2023
1 parent 71a7ec4 commit e8edfe3
Showing 34 changed files with 1,029 additions and 659 deletions.
122 changes: 47 additions & 75 deletions README.rst
@@ -3,53 +3,41 @@ GraphAr

|GraphAr CI| |Docs CI| |GraphAr Docs|

GraphAr (short for "Graph Archive") is an open source, standard data file format with C++ SDK and Spark tools for graph data storage and retrieval.

The GraphAr project includes such modules as:

- The design of the standardized file format (GAR) for graph data.
- A C++ Library for reading and writing GAR files.
- Apache Spark tools for generating, loading and transforming GAR files (coming soon).
- Examples of applying GraphAr to graph processing applications or existing systems such as GraphScope.


Welcome to GraphAr (short for "Graph Archive"), an open source, standardized file format for graph data storage and retrieval.

What is GraphAr?
-----------------

|Overview Pic|

Graph processing serves as the essential building block for a diverse variety of
real-world applications such as social network analytics, data mining, network routing,
and scientific computing.

Motivation
----------

Graph processing serves as the essential building block for a diverse variety of real-world applications such as social network analytics, data mining, network routing, and scientific computing.

GraphAr (GAR) is established to enable diverse graph applications and systems (in-memory and out-of-core storages, databases, graph computing systems and interactive graph query frameworks) to build and access the graph data conveniently and efficiently. It specifies a standardized system-independent file format for graph and provides a set of interfaces to generate and access such formatted files.

GraphAr (GAR) targets two main scenarios:

- To serve as the standard file format for importing/exporting and persistent storage of the graph data for diverse existing systems, reducing the overhead when various systems co-work.
- To serve as the direct data source for graph processing applications.


What's in GraphAr
---------------------
GraphAr is a project that aims to make it easier for diverse applications and
systems (in-memory and out-of-core storages, databases, graph computing systems, and interactive graph query frameworks)
to build and access graph data conveniently and efficiently.

The **GAR** file format that defines a standard store file format for graph data.
It can be used for importing/exporting and persistent storage of graph data,
thereby reducing the burden on systems when working together. Additionally, it can
serve as a direct data source for graph processing applications.

The **GAR SDK** library that contains a C++ library to provide APIs for accessing and generating the GAR format files.
To achieve this, GraphAr provides:

- The Graph Archive(GAR) file format: a standardized system-independent file format for storing graph data
- Libraries: a set of libraries for reading and writing or transforming GAR files

GraphAr File Format
---------------------
By using GraphAr, you can:

GraphAr specifies a standardized system-independent file format (GAR) for storing property graphs.
It uses metadata to record all the necessary information of a graph, and maintains the actual data
in a chunked way.
- Store and persist your graph data in a system-independent way with the GAR file format
- Easily access and generate GAR files using the libraries
- Use the Apache Spark library to quickly manipulate and transform your GAR files

What is Property Graph
^^^^^^^^^^^^^^^^^^^^^^^

GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges. Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties. Since carrying additional information than non-property graphs, the property graph is able to represent connections among data scattered across diverse data databases and with different schemas. Compared with the relational database schema, the property graph excels at showing data dependencies. Therefore, it is widely-used in modeling modern applications including social network analytics, data mining, network routing, scientific computing and so on.
The GAR File Format
-------------------
The GAR file format is designed for storing property graphs. It uses metadata to
record all the necessary information of a graph, and maintains the actual data in
a chunked way.
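
To make the chunked layout concrete, here is a minimal C++ sketch of how a vertex could be located under fixed-size chunking. It assumes vertices carry contiguous internal ids starting at 0 and that every chunk holds the same number of rows; `ChunkPosition`, `LocateVertex` and `ChunkCount` are hypothetical names used for illustration only, not GraphAr APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not part of the GraphAr API. They assume contiguous
// internal vertex ids starting at 0 and a fixed number of rows per chunk.
struct ChunkPosition {
  std::int64_t chunk_index;  // which chunk holds the vertex
  std::int64_t offset;       // row of the vertex inside that chunk
};

// Maps an internal vertex id to its chunk and its row within the chunk.
inline ChunkPosition LocateVertex(std::int64_t vertex_id,
                                  std::int64_t chunk_size) {
  assert(vertex_id >= 0 && chunk_size > 0);
  return {vertex_id / chunk_size, vertex_id % chunk_size};
}

// Number of chunks needed to hold vertex_count vertices (ceiling division).
inline std::int64_t ChunkCount(std::int64_t vertex_count,
                               std::int64_t chunk_size) {
  assert(vertex_count >= 0 && chunk_size > 0);
  return (vertex_count + chunk_size - 1) / chunk_size;
}
```

With a chunk size of 100, `LocateVertex(250, 100)` yields chunk 2 at offset 50, and 903 vertices need `ChunkCount(903, 100) == 10` chunks.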

A property graph includes vertices and edges. Each vertex contains:

@@ -70,7 +58,7 @@ The following is an example property graph containing two types of vertices "per
|Property Graph|

Vertices in GraphAr
^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^

Logical table of vertices
""""""""""""""""""""""""""
@@ -123,29 +111,31 @@ Take the "person knows person" edges to illustrate, when the vertex chunk size i
|Edge Physical Table2|


Building SDK Steps
---------------------
Building the Libraries
----------------------

Libraries are available for C++ and Spark.

Dependencies
^^^^^^^^^^^^^
Prerequisites
^^^^^^^^^^^^^^

**GraphAr** is developed and tested on ubuntu 20.04. It should also work on other unix-like distributions. Building GraphAr requires the following softwares installed as dependencies.
Basic dependencies:

- A modern C++ compiler compliant with C++17 standard (g++ >= 7.1 or clang++ >= 5).
- `CMake <https://cmake.org/>`_ (>=2.8)

Here are the dependencies for optional features:
Dependencies for optional features:

- `Doxygen <https://www.doxygen.nl/index.html>`_ (>= 1.8) for generating documentation;
- `sphinx <https://www.sphinx-doc.org/en/master/index.html>`_ for generating documentation.

Extra dependencies are required by examples and unit tests:
Extra dependencies are required by examples:

- `BGL <https://www.boost.org/doc/libs/1_80_0/libs/graph/doc/index.html>`_ (>= 1.58).


Building and install GraphAr C++ library
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Building
^^^^^^^^^

Once the required dependencies have been installed, go to the root directory of GraphAr and do an out-of-source build using CMake.

@@ -158,9 +148,9 @@ Once the required dependencies have been installed, go to the root directory of
**Optional**: Using a Custom Namespace

The `namespace` that `gar` is defined in is configurable. By default,
it is defined in `namespace GraphArchive`; however this can be toggled by
setting `NAMESPACE` option with cmake:
The :code:`namespace` is configurable. By default,
it is defined in :code:`namespace GraphArchive`; however this can be toggled by
setting :code:`NAMESPACE` option with cmake:

.. code:: shell
@@ -181,45 +171,25 @@ Install the GraphAr library:
sudo make install
Build the documentation of GraphAr library:
Optionally, you can build the documentation for GraphAr library:

.. code-block:: shell
# assume doxygen and sphinx have been installed.
pip3 install -r ../requirements-dev.txt --user
make doc
Using GraphAr C++ library in your own project
-----------------------------------------------

The way we recommend to integrate the GraphAr C++ library in your own C++ project is to use
CMake's `find_package` function for locating and integrating dependencies.

Here is a minimal `CMakeLists.txt` that compiles a source file `my_example.cc` into an executable
target linked with GraphAr C++ shared library.

.. code-block:: cmake
project(MyExample)
The Spark Library
-----------------

find_package(gar REQUIRED)
include_directories(${GAR_INCLUDE_DIRS})
See `GraphAr Spark Library`_ for details about the Spark library.

add_executable(my_example my_example.cc)
target_compile_features(my_example PRIVATE cxx_std_17)
target_link_libraries(my_example PRIVATE ${GAR_LIBRARIES})
Please refer to `examples/pagerank_example.cc` for details.

Contributing to GraphAr
-----------------------

- Read the `Contribution Guide`_.
- Please report bugs by submitting `GitHub Issues`_ or ask me anything in `Github Discussions`_.
- Submit contributions using pull requests.

Thank you in advance for your contributions to GraphAr!
----------------------------

See `Contribution Guide`_ for details on submitting patches and the contribution workflow.

License
-------
@@ -269,6 +239,8 @@ third-party libraries may not have the same license as GraphAr.

.. _GraphAr File Format: https://alibaba.github.io/GraphAr/user-guide/file-format.html

.. _GraphAr Spark Library: https://alibaba.github.io/GraphAr/user-guide/spark-lib.html

.. _example files: https://github.com/GraphScope/gar-test/blob/main/ldbc_sample/

.. _Contribution Guide: https://alibaba.github.io/GraphAr/user-guide/contributing.html
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -13,7 +13,7 @@
user-guide/overview.rst
user-guide/getting-started.rst
user-guide/file-format.rst
user-guide/spark-tool.rst
user-guide/spark-lib.rst

.. toctree::
:maxdepth: 1
@@ -34,6 +34,7 @@
.. toctree::
:maxdepth: 1
:caption: API Reference
:hidden:

reference/api-reference-cpp.rst
Spark API Reference <reference/spark-api/index>
3 changes: 1 addition & 2 deletions docs/user-guide/overview.rst
@@ -13,8 +13,7 @@ GraphAr aims to serve as the standard file format for importing/exporting and pe
The GraphAr project includes such topics as:

- Design of the standardized file format for graph data. (see `GraphAr File Format <file-format.html>`_)
- The C++ SDK library for reading and writing GAR files. (see `API Reference <../api-reference.html>`_)
- A set of Apache Spark tools for generating, loading and transforming GAR files. (see `GraphAr Spark Tools <spark-tool.html>`_)
- A set of libraries for reading and writing or transforming GAR files. (the `C++ library <../reference/api-reference-cpp.html>`_ and `Spark library <spark-lib.html>`_ are now available)
- How to use GraphAr to write graph algorithms, or to work with existing systems such as GraphScope. (see `Application Cases <../applications/out-of-core.html>`_)

.. image:: ../images/overview.png
16 changes: 8 additions & 8 deletions docs/user-guide/spark-tool.rst → docs/user-guide/spark-lib.rst
@@ -1,20 +1,20 @@
GraphAr Spark Tools
GraphAr Spark Library
============================

Overview
-----------

GraphAr Spark tools are provided as a library for generating, loading and transforming GAR files with Apache Spark easy. It consists of the following parts:
The GraphAr Spark library is provided to make generating, loading and transforming GAR files with Apache Spark easy. It consists of the following parts:

- **Information Classes**: As same with in C++ SDK, the information classes are implemented as a part of Spark tools for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr.
- **Information Classes**: As in the C++ SDK, the information classes are implemented as part of the Spark library for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr.
- **IndexGenerator**: The IndexGenerator helps to generate the indices for vertex/edge DataFrames. In most cases, IndexGenerator is first utilized to generate the indices for a DataFrame (e.g., from primary keys), and then this DataFrame can be written into GAR files through the Writer.
- **Writer**: The GraphAr Spark Writer provides a set of interfaces that can be used to write Spark DataFrames into GAR files. Every time it takes a DataFrame as the logical table for a type of vertices or edges, assembles the data in specified format (e.g., reorganize the edges in the CSR way) and then dumps it to standard GAR files (orc, parquet or CSV files) under the specific directory path.
- **Reader**: The GraphAr Spark Reader provides a set of interfaces that can be used to read GAR files. It reads a set of vertices or edges at a time and assembles the result into Spark DataFrames. Similar to the Reader in the C++ SDK, it lets users specify the data they need, e.g., to read a single property group instead of all properties.
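
The index generation and CSR reorganization steps described above can be illustrated with a small, library-independent sketch. It is written in C++ (the language of the examples in this commit) rather than Spark, and it does not use or mirror the GraphAr Spark API: `GenerateIndices`, `BuildCsrOffsets` and `Edge` are hypothetical names, and the first-seen ordering of indices is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative only: these helpers are hypothetical and are not GraphAr APIs.

// Assigns consecutive indices 0..n-1 to primary keys in first-seen order.
// A repeated key keeps the index it received the first time.
inline std::unordered_map<std::string, std::int64_t> GenerateIndices(
    const std::vector<std::string>& primary_keys) {
  std::unordered_map<std::string, std::int64_t> index_of;
  for (const auto& key : primary_keys) {
    index_of.emplace(key, static_cast<std::int64_t>(index_of.size()));
  }
  return index_of;
}

using Edge = std::pair<std::int64_t, std::int64_t>;  // (source, destination)

// Sorts edges by source (then destination) and returns CSR offsets: the edges
// of source vertex v occupy positions [offsets[v], offsets[v + 1]) in the
// sorted edge list.
inline std::vector<std::int64_t> BuildCsrOffsets(std::vector<Edge>& edges,
                                                 std::int64_t vertex_count) {
  assert(vertex_count >= 0);
  std::sort(edges.begin(), edges.end());
  std::vector<std::int64_t> offsets(
      static_cast<std::size_t>(vertex_count) + 1, 0);
  // Count the out-degree of each source vertex, shifted by one slot.
  for (const auto& edge : edges) {
    assert(edge.first >= 0 && edge.first < vertex_count);
    ++offsets[static_cast<std::size_t>(edge.first) + 1];
  }
  // Prefix sum turns per-vertex counts into start positions.
  for (std::size_t v = 0; v + 1 < offsets.size(); ++v) {
    offsets[v + 1] += offsets[v];
  }
  return offsets;
}
```

In this sketch the indices play the role of internal vertex ids, and the offsets array is what allows the edges of one source vertex to be read as a contiguous range.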

Use Cases
----------

The GraphAr Spark Tools can be applied to these scenarios:
The GraphAr Spark Library can be applied to these scenarios:

- Take GAR as data sources to execute SQL queries or do graph processing (e.g., using GraphX).
- Transform data between GAR and other data sources (e.g., Hive, Neo4j, NebulaGraph, ...).
@@ -23,10 +23,10 @@ The GraphAr Spark Tools can be applied to these scenarios:
- Modify existing GAR data (e.g., add new vertices/edges).


Get GraphAr Spark Tools
Get GraphAr Spark Library
------------------------------

Make the graphar-spark-tools directory as the current working directory:
Make the graphar-spark-library directory the current working directory:

.. code-block:: shell
@@ -46,7 +46,7 @@ How to Use

Information Classes
`````````````````````
The information classes are included in Spark tools for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr. They are also used as the essential parameters for constructing readers/writers. In common cases, the information can be built from reading and parsing existing meta files (Yaml files). Also, we support to construct them in memory from nothing.
The information classes are included in the Spark library for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr. They are also used as the essential parameters for constructing readers/writers. In common cases, the information can be built by reading and parsing existing meta files (Yaml files). It can also be constructed in memory from scratch.

To build information from Yaml files, please refer to the following code.

4 changes: 2 additions & 2 deletions examples/bgl_example.cc
@@ -33,8 +33,8 @@ int main(int argc, char* argv[]) {
std::string path =
TEST_DATA_DIR + "/ldbc_sample/parquet/ldbc_sample.graph.yml";
auto graph_info = GAR_NAMESPACE::GraphInfo::Load(path).value();
assert(graph_info.GetAllVertexInfo().size() == 1);
assert(graph_info.GetAllEdgeInfo().size() == 1);
assert(graph_info.GetVertexInfos().size() == 1);
assert(graph_info.GetEdgeInfos().size() == 1);

// construct vertices collection
std::string label = "person";
12 changes: 6 additions & 6 deletions examples/construct_info_example.cc
@@ -24,8 +24,8 @@ int main(int argc, char* argv[]) {
// validate
assert(graph_info.GetName() == name);
assert(graph_info.GetPrefix() == prefix);
const auto& vertex_infos = graph_info.GetAllVertexInfo();
const auto& edge_infos = graph_info.GetAllEdgeInfo();
const auto& vertex_infos = graph_info.GetVertexInfos();
const auto& edge_infos = graph_info.GetEdgeInfos();
assert(vertex_infos.size() == 0);
assert(edge_infos.size() == 0);

@@ -85,7 +85,7 @@ int main(int argc, char* argv[]) {

/*------------------add vertex info to graph------------------*/
graph_info.AddVertex(vertex_info);
assert(graph_info.GetAllVertexInfo().size() == 1);
assert(graph_info.GetVertexInfos().size() == 1);
assert(graph_info.GetVertexInfo(vertex_label).status().ok());
assert(graph_info.GetVertexPropertyGroup(vertex_label, id.name).value() ==
group1);
@@ -124,7 +124,7 @@ int main(int argc, char* argv[]) {
GAR_NAMESPACE::FileType::PARQUET)
.ok());
assert(
edge_info.GetAdjListFileType(GAR_NAMESPACE::AdjListType::ordered_by_dest)
edge_info.GetFileType(GAR_NAMESPACE::AdjListType::ordered_by_dest)
.value() == GAR_NAMESPACE::FileType::PARQUET);
assert(
edge_info
@@ -185,7 +185,7 @@ int main(int argc, char* argv[]) {
assert(res1.status().ok());
edge_info = res1.value();
assert(edge_info
.GetAdjListFileType(GAR_NAMESPACE::AdjListType::ordered_by_source)
.GetFileType(GAR_NAMESPACE::AdjListType::ordered_by_source)
.value() == GAR_NAMESPACE::FileType::PARQUET);
auto res2 = edge_info.ExtendPropertyGroup(
group3, GAR_NAMESPACE::AdjListType::ordered_by_source);
@@ -198,7 +198,7 @@ int main(int argc, char* argv[]) {
/*------------------add edge info to graph------------------*/
graph_info.AddEdge(edge_info);
graph_info.AddEdgeInfoPath("person_knows_person.edge.yml");
assert(graph_info.GetAllEdgeInfo().size() == 1);
assert(graph_info.GetEdgeInfos().size() == 1);
assert(
graph_info.GetEdgeInfo(src_label, edge_label, dst_label).status().ok());
assert(graph_info