[Improve][Doc] Revise the README and APIs docstring of GraphAr #64

Merged
merged 9 commits into from
Jan 9, 2023
121 changes: 45 additions & 76 deletions README.rst
@@ -3,53 +3,38 @@ GraphAr

|GraphAr CI| |Docs CI| |GraphAr Docs|

GraphAr (short for "Graph Archive") is an open source, standard data file format with C++ SDK and Spark tools for graph data storage and retrieval.

The GraphAr project includes such modules as:

- The design of the standardized file format (GAR) for graph data.
- A C++ Library for reading and writing GAR files.
- Apache Spark tools for generating, loading and transforming GAR files (coming soon).
- Examples of applying GraphAr to graph processing applications or existing systems such as GraphScope.


Welcome to GraphAr (short for "Graph Archive"), an open source, standardized file format for graph data storage and retrieval.

What is GraphAr?
-----------------

|Overview Pic|

Graph processing serves as the essential building block for a diverse variety of
real-world applications such as social network analytics, data mining, network routing,
and scientific computing.

Motivation
----------

Graph processing serves as the essential building block for a diverse variety of real-world applications such as social network analytics, data mining, network routing, and scientific computing.

GraphAr (GAR) is established to enable diverse graph applications and systems (in-memory and out-of-core storages, databases, graph computing systems and interactive graph query frameworks) to build and access graph data conveniently and efficiently. It specifies a standardized, system-independent file format for graph data and provides a set of interfaces to generate and access files in that format.

GraphAr (GAR) targets two main scenarios:

- To serve as the standard file format for importing/exporting and persistent storage of the graph data for diverse existing systems, reducing the overhead when various systems co-work.
- To serve as the direct data source for graph processing applications.


What's in GraphAr
---------------------
GraphAr is a project that aims to make it easier for diverse applications and
systems (in-memory and out-of-core storages, databases, graph computing systems, and interactive graph query frameworks)
to build and access graph data conveniently and efficiently.

The **GAR** file format, which defines a standard storage file format for graph data.
To achieve this, GraphAr provides:

The **GAR SDK** library that contains a C++ library to provide APIs for accessing and generating the GAR format files.
- The Graph Archive(GAR) file format: a standardized system-independent file format for storing graph data
- Libraries: a set of language-independent libraries for reading and writing GAR files
- The Spark Library: a set of tools for generating, loading and transforming GAR files with Apache Spark.

By using GraphAr, you can:

GraphAr File Format
---------------------
- Store and persist your graph data in a system-independent way with the GAR file format
- Easily access and generate GAR files using the libraries or the Spark tools
- Use the Apache Spark tools to quickly manipulate and transform your GAR files

GraphAr specifies a standardized system-independent file format (GAR) for storing property graphs.
It uses metadata to record all the necessary information of a graph, and maintains the actual data
in a chunked way.

What is Property Graph
^^^^^^^^^^^^^^^^^^^^^^^

GraphAr is designed for representing and storing property graphs. A graph (in discrete mathematics) is a structure made of vertices and edges. A property graph is a type of graph model in which the vertices and edges can carry a name (also called a type or label) and some properties. Because it carries more information than a non-property graph, a property graph can represent connections among data scattered across diverse databases with different schemas. Compared with a relational database schema, the property graph excels at showing data dependencies. Therefore, it is widely used in modeling modern applications including social network analytics, data mining, network routing, and scientific computing.
The GAR File Format
-------------------
The GAR file format is designed for storing property graphs. It uses metadata to
record all the necessary information of a graph, and maintains the actual data in
a chunked way.
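
To make the chunked layout concrete, here is a minimal sketch of how a vertex could be mapped to the chunk that holds it. It assumes vertices carry consecutive internal IDs starting at 0 and that every chunk holds the same number of rows. The names ``ChunkLocation``, ``LocateVertex`` and ``ChunkCount`` are hypothetical and are not part of the GraphAr API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the GraphAr API.
struct ChunkLocation {
  std::int64_t chunk_index;  // which chunk holds the vertex
  std::int64_t offset;       // row offset of the vertex inside that chunk
};

// Map an internal vertex ID to its chunk, assuming IDs are consecutive from 0
// and every chunk holds `chunk_size` rows.
inline ChunkLocation LocateVertex(std::int64_t vertex_id,
                                  std::int64_t chunk_size) {
  assert(vertex_id >= 0 && chunk_size > 0);
  return {vertex_id / chunk_size, vertex_id % chunk_size};
}

// Number of chunks needed to store `vertex_count` vertices (ceiling division).
inline std::int64_t ChunkCount(std::int64_t vertex_count,
                               std::int64_t chunk_size) {
  assert(vertex_count >= 0 && chunk_size > 0);
  return (vertex_count + chunk_size - 1) / chunk_size;
}
```

Under these assumptions, a reader that needs a single vertex only has to open one chunk rather than scan the whole table, which is the main point of keeping the data chunked.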

A property graph includes vertices and edges. Each vertex contains:

@@ -70,7 +55,7 @@ The following is an example property graph containing two types of vertices "per
|Property Graph|

Vertices in GraphAr
^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^

Logical table of vertices
""""""""""""""""""""""""""
@@ -123,29 +108,31 @@ Take the "person knows person" edges to illustrate, when the vertex chunk size i
|Edge Physical Table2|


Building SDK Steps
---------------------
Building the libraries
----------------------

Libraries are available for C++ and Spark.

Dependencies
^^^^^^^^^^^^^
Prerequisites
^^^^^^^^^^^^^^

**GraphAr** is developed and tested on Ubuntu 20.04. It should also work on other Unix-like distributions. Building GraphAr requires the following software to be installed as dependencies.
Basic dependencies:

- A modern C++ compiler compliant with C++17 standard (g++ >= 7.1 or clang++ >= 5).
- `CMake <https://cmake.org/>`_ (>=2.8)

Here are the dependencies for optional features:
Dependencies for optional features:

- `Doxygen <https://www.doxygen.nl/index.html>`_ (>= 1.8) for generating documentation;
- `sphinx <https://www.sphinx-doc.org/en/master/index.html>`_ for generating documentation.

Extra dependencies are required by examples and unit tests:
Extra dependencies are required by examples:

- `BGL <https://www.boost.org/doc/libs/1_80_0/libs/graph/doc/index.html>`_ (>= 1.58).


Building and install GraphAr C++ library
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Building
^^^^^^^^^

Once the required dependencies have been installed, go to the root directory of GraphAr and do an out-of-source build using CMake.

@@ -158,9 +145,9 @@ Once the required dependencies have been installed, go to the root directory of

**Optional**: Using a Custom Namespace

The `namespace` that `gar` is defined in is configurable. By default,
it is defined in `namespace GraphArchive`; however this can be toggled by
setting `NAMESPACE` option with cmake:
The namespace in which :code:`gar` is defined is configurable. By default,
it is :code:`namespace GraphArchive`; however, this can be changed by
setting the :code:`NAMESPACE` option with CMake:

.. code:: shell

@@ -181,45 +168,25 @@ Install the GraphAr library:

sudo make install

Build the documentation of GraphAr library:
Optionally, you can build the documentation for the GraphAr library:

.. code-block:: shell

# assume doxygen and sphinx have been installed.
pip3 install -r ../requirements-dev.txt --user
make doc

Using GraphAr C++ library in your own project
-----------------------------------------------

The way we recommend to integrate the GraphAr C++ library in your own C++ project is to use
CMake's `find_package` function for locating and integrating dependencies.

Here is a minimal `CMakeLists.txt` that compiles a source file `my_example.cc` into an executable
target linked with GraphAr C++ shared library.

.. code-block:: cmake
The Spark Library
-----------------

project(MyExample)
See `GraphAr Spark Library`_ for details about the Spark library.

find_package(gar REQUIRED)
include_directories(${GAR_INCLUDE_DIRS})

add_executable(my_example my_example.cc)
target_compile_features(my_example PRIVATE cxx_std_17)
target_link_libraries(my_example PRIVATE ${GAR_LIBRARIES})

Please refer to `examples/pagerank_example.cc` for details.

Contributing to GraphAr
-----------------------

- Read the `Contribution Guide`_.
- Please report bugs by submitting `GitHub Issues`_ or ask me anything in `Github Discussions`_.
- Submit contributions using pull requests.

Thank you in advance for your contributions to GraphAr!
----------------------------

See `Contribution Guide`_ for details on submitting patches and the contribution workflow.

License
-------
@@ -269,6 +236,8 @@ third-party libraries may not have the same license as GraphAr.

.. _GraphAr File Format: https://alibaba.github.io/GraphAr/user-guide/file-format.html

.. _GraphAr Spark Library: https://alibaba.github.io/GraphAr/user-guide/spark-tool.html
Review comment (Contributor): rename "spark-tool.html" to "spark-library.html"


.. _example files: https://github.com/GraphScope/gar-test/blob/main/ldbc_sample/

.. _Contribution Guide: https://alibaba.github.io/GraphAr/user-guide/contributing.html
1 change: 1 addition & 0 deletions docs/index.rst
@@ -34,6 +34,7 @@
.. toctree::
:maxdepth: 1
:caption: API Reference
:hidden:

reference/api-reference-cpp.rst
Spark API Reference <reference/spark-api/index>
16 changes: 8 additions & 8 deletions docs/user-guide/spark-tool.rst
@@ -1,20 +1,20 @@
GraphAr Spark Tools
GraphAr Spark Library
============================

Overview
-----------

The GraphAr Spark tools are provided as a library to make generating, loading and transforming GAR files with Apache Spark easy. It consists of the following parts:
The GraphAr Spark library is provided to make generating, loading and transforming GAR files with Apache Spark easy. It consists of the following parts:

- **Information Classes**: As in the C++ SDK, the information classes are implemented as part of the Spark tools for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr.
- **Information Classes**: As in the C++ SDK, the information classes are implemented as part of the Spark library for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr.
- **IndexGenerator**: The IndexGenerator helps to generate the indices for vertex/edge DataFrames. In most cases, IndexGenerator is first used to generate the indices for a DataFrame (e.g., from primary keys), and then this DataFrame can be written into GAR files through the Writer.
- **Writer**: The GraphAr Spark Writer provides a set of interfaces that can be used to write Spark DataFrames into GAR files. Each time, it takes a DataFrame as the logical table for a type of vertices or edges, assembles the data in the specified format (e.g., reorganizes the edges in the CSR way) and then dumps it to standard GAR files (ORC, Parquet or CSV files) under the specified directory path.
- **Reader**: The GraphAr Spark Reader provides a set of interfaces that can be used to read GAR files. It reads a set of vertices or edges at a time and assembles the result into Spark DataFrames. Similar to the Reader SDK in C++, it lets users specify the data they need, e.g., to read a single property group instead of all properties.
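
The two preprocessing ideas mentioned above, generating indices from primary keys and reorganizing edges in the CSR way, can be sketched independently of Spark. The following is a conceptual illustration in plain C++ only, not the library's implementation, which operates on Spark DataFrames. All names (``GenerateIndices``, ``BuildCsr``, ``Csr``) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// All names below are hypothetical and for illustration only.
using Index = std::int64_t;

// Assign a consecutive index (0, 1, 2, ...) to each distinct primary key,
// in order of first appearance. Repeated keys keep their first index.
inline std::unordered_map<std::string, Index> GenerateIndices(
    const std::vector<std::string>& primary_keys) {
  std::unordered_map<std::string, Index> index_of;
  for (const auto& key : primary_keys) {
    index_of.emplace(key, static_cast<Index>(index_of.size()));
  }
  return index_of;
}

struct Csr {
  std::vector<Index> offsets;    // edges of v are neighbors[offsets[v] .. offsets[v+1])
  std::vector<Index> neighbors;  // destination indices, grouped by source
};

// Reorganize (source, destination) index pairs into CSR form, ordered by source.
inline Csr BuildCsr(std::vector<std::pair<Index, Index>> edges,
                    Index vertex_count) {
  std::sort(edges.begin(), edges.end());
  Csr csr;
  csr.offsets.assign(static_cast<std::size_t>(vertex_count) + 1, 0);
  // Count the out-degree of each source vertex.
  for (const auto& edge : edges) {
    ++csr.offsets[static_cast<std::size_t>(edge.first) + 1];
  }
  // A prefix sum turns the degree counts into start offsets.
  for (std::size_t v = 1; v < csr.offsets.size(); ++v) {
    csr.offsets[v] += csr.offsets[v - 1];
  }
  csr.neighbors.reserve(edges.size());
  for (const auto& edge : edges) {
    csr.neighbors.push_back(edge.second);
  }
  return csr;
}
```

In this sketch, a caller would first map primary keys to indices, translate each edge's endpoints through that map, and then pass the index pairs to ``BuildCsr``. Grouping edges by source with an offsets array is what allows all edges of one vertex to be read as a contiguous range.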

Use Cases
----------

The GraphAr Spark Tools can be applied to these scenarios:
The GraphAr Spark Library can be applied to these scenarios:

- Take GAR as data sources to execute SQL queries or do graph processing (e.g., using GraphX).
- Transform data between GAR and other data sources (e.g., Hive, Neo4j, NebulaGraph, ...).
@@ -23,10 +23,10 @@ The GraphAr Spark Tools can be applied to these scenarios:
- Modify existing GAR data (e.g., add new vertices/edges).


Get GraphAr Spark Tools
Get GraphAr Spark Library
------------------------------

Make the graphar-spark-tools directory the current working directory:
Make the graphar-spark-library directory the current working directory:

.. code-block:: shell

@@ -46,7 +46,7 @@ How to Use

Information Classes
`````````````````````
The information classes are included in the Spark tools for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr. They are also used as the essential parameters for constructing readers/writers. In common cases, the information can be built by reading and parsing existing meta files (YAML files). They can also be constructed in memory from scratch.
The information classes are included in the Spark library for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr. They are also used as the essential parameters for constructing readers/writers. In common cases, the information can be built by reading and parsing existing meta files (YAML files). They can also be constructed in memory from scratch.

To build information from YAML files, please refer to the following code.

4 changes: 2 additions & 2 deletions examples/bgl_example.cc
@@ -33,8 +33,8 @@ int main(int argc, char* argv[]) {
std::string path =
TEST_DATA_DIR + "/ldbc_sample/parquet/ldbc_sample.graph.yml";
auto graph_info = GAR_NAMESPACE::GraphInfo::Load(path).value();
assert(graph_info.GetAllVertexInfo().size() == 1);
assert(graph_info.GetAllEdgeInfo().size() == 1);
assert(graph_info.GetVertexInfos().size() == 1);
assert(graph_info.GetEdgeInfos().size() == 1);

// construct vertices collection
std::string label = "person";
12 changes: 6 additions & 6 deletions examples/construct_info_example.cc
@@ -24,8 +24,8 @@ int main(int argc, char* argv[]) {
// validate
assert(graph_info.GetName() == name);
assert(graph_info.GetPrefix() == prefix);
const auto& vertex_infos = graph_info.GetAllVertexInfo();
const auto& edge_infos = graph_info.GetAllEdgeInfo();
const auto& vertex_infos = graph_info.GetVertexInfos();
const auto& edge_infos = graph_info.GetEdgeInfos();
assert(vertex_infos.size() == 0);
assert(edge_infos.size() == 0);

@@ -85,7 +85,7 @@ int main(int argc, char* argv[]) {

/*------------------add vertex info to graph------------------*/
graph_info.AddVertex(vertex_info);
assert(graph_info.GetAllVertexInfo().size() == 1);
assert(graph_info.GetVertexInfos().size() == 1);
assert(graph_info.GetVertexInfo(vertex_label).status().ok());
assert(graph_info.GetVertexPropertyGroup(vertex_label, id.name).value() ==
group1);
@@ -124,7 +124,7 @@ int main(int argc, char* argv[]) {
GAR_NAMESPACE::FileType::PARQUET)
.ok());
assert(
edge_info.GetAdjListFileType(GAR_NAMESPACE::AdjListType::ordered_by_dest)
edge_info.GetFileType(GAR_NAMESPACE::AdjListType::ordered_by_dest)
.value() == GAR_NAMESPACE::FileType::PARQUET);
assert(
edge_info
@@ -185,7 +185,7 @@ int main(int argc, char* argv[]) {
assert(res1.status().ok());
edge_info = res1.value();
assert(edge_info
.GetAdjListFileType(GAR_NAMESPACE::AdjListType::ordered_by_source)
.GetFileType(GAR_NAMESPACE::AdjListType::ordered_by_source)
.value() == GAR_NAMESPACE::FileType::PARQUET);
auto res2 = edge_info.ExtendPropertyGroup(
group3, GAR_NAMESPACE::AdjListType::ordered_by_source);
@@ -198,7 +198,7 @@ int main(int argc, char* argv[]) {
/*------------------add edge info to graph------------------*/
graph_info.AddEdge(edge_info);
graph_info.AddEdgeInfoPath("person_knows_person.edge.yml");
assert(graph_info.GetAllEdgeInfo().size() == 1);
assert(graph_info.GetEdgeInfos().size() == 1);
assert(
graph_info.GetEdgeInfo(src_label, edge_label, dst_label).status().ok());
assert(graph_info