refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader #3154
Merged
zhanglei1949 changed the title from "refactor(flex): Replace Adhoc csv reader with arrow csv reader" to "refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader" on Aug 29, 2023
zhanglei1949 force-pushed the rt_more_data_types branch 2 times, most recently from 7a6723c to 497bb8b on September 7, 2023 07:13
author xiaolei.zl <[email protected]> 1691050268 +0800 committer xiaolei.zl <[email protected]> 1692585489 +0800 add new graph schema and bulk load file, but to be revised stash impl new schema reading fix fix fix fix test ldbc snb 0.1 minor fix introduce vertexLoadingMeta, edgeLoadingMeta Add more log info restore changes use gstest from GraphScope to debug fix fix fix add impl fix edge column mapping fix todo: change back to graphscope gstest some minor fix f Fix hqps engine's support for 3 simple match cases add test in hqps-db-ci fix fix fix stash changes dev: support edge expand with multiple labels from one source fix fix fix test refine log level fix rebase s add dt enum impl csv reader impl csv reader fix test minor changes fix add more configurable options use new images fix grin ci fix dockerfile new implementation refine csv metadata fix grin ci todo: debug string utf fix config parsing fix block size fix parsing to test date loading add batch_reader option fix perf add perf info f fix renmae option
zhanglei1949 force-pushed the rt_more_data_types branch from 200e207 to 40b7d33 on September 8, 2023 02:09
luoxiaojian approved these changes on Sep 8, 2023
longbinlai approved these changes on Sep 8, 2023
zhanglei1949 added a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 11, 2023
…libaba#3154)

1. Use the Arrow CSV reader to replace the current Adhoc CSV reader, so that more configurable options are supported in `bulk_load.yaml`.
2. Introduce `CSVFragmentLoader` and `BasicFragmentLoader` for `MutablePropertyFragment`.

With this PR merged, `MutablePropertyFragment` supports loading a fragment from CSV with the following options:

- delimiter: default '|'
- header_row: default true
- quoting: default false
- quoting_char: default '"'
- escaping: default false
- escaping_char: default '\\'
- batch_size: the batch size used when reading a file into memory, default 1MB.
- batch_reader: default false. If set to true, `arrow::csv::StreamingReader` is used to parse the input file. Otherwise, `arrow::csv::TableReader` is used.

With this PR merged, graph loading performance improves. In the tables below, "Adhoc Reader" denotes the previously implemented CSV parser, and 1, 2, 4, 8 denote the parallelism of graph loading, i.e. how many vertex/edge labels are processed concurrently. Note that TableReader is around 10x faster than StreamingReader in the ReadFile phase. A possible reason is that TableReader uses multi-threading. See [arrow-csv-doc](https://arrow.apache.org/docs/cpp/csv.html) for details.

| Reader | Phase | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- | --- |
| Adhoc Reader | ReadFile\+LoadGraph | 805s | 468s | 349s | 313s |
| Adhoc Reader | Serialization | 126s | 126s | 126s | 126s |
| Adhoc Reader | **Total** | 931s | 594s | 475s | 439s |
| Table Reader | ReadFile | 9s | 9s | 9s | 9s |
| Table Reader | LoadGraph | 455s | 280s | 211s | 182s |
| Table Reader | Serialization | 126s | 126s | 126s | 126s |
| Table Reader | **Total** | 600s | 415s | 346s | 317s |
| Streaming Reader | ReadFile | 91s | 91s | 91s | 91s |
| Streaming Reader | LoadGraph | 555s | 289s | 196s | 149s |
| Streaming Reader | Serialization | 126s | 126s | 126s | 126s |
| Streaming Reader | **Total** | 772s | 506s | 413s | 366s |

| Reader | Phase | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- | --- |
| Adhoc Reader | ReadFile\+LoadGraph | 2720s | 1548s | 1176s | 948s |
| Adhoc Reader | Serialization | 409s | 409s | 409s | 409s |
| Adhoc Reader | **Total** | 3129s | 1957s | 1585s | 1357s |
| Table Reader | ReadFile | 24s | 24s | 24s | 24s |
| Table Reader | LoadGraph | 1576s | 949s | 728s | 602s |
| Table Reader | Serialization | 409s | 409s | 409s | 409s |
| Table Reader | **Total** | 2009s | 1382s | 1161s | 1035s |
| Streaming Reader | ReadFile | 300s | 300s | 300s | 300s |
| Streaming Reader | LoadGraph | 1740s | 965s | 669s | 497s |
| Streaming Reader | Serialization | 409s | 409s | 409s | 409s |
| Streaming Reader | **Total** | 2539s | 1674s | 1378s | 1206s |

| Reader | Phase | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- | --- |
| Adhoc Reader | ReadFile\+LoadGraph | 8260s | 4900s | 3603s | 2999s |
| Adhoc Reader | Serialization | 1201s | 1201s | 1201s | 1201s |
| Adhoc Reader | **Total** | 9461s | 6101s | 4804s | 4200s |
| Table Reader | ReadFile | 73s | 73s | 96s | 96s |
| Table Reader | LoadGraph | 4650s | 2768s | 2155s | 1778s |
| Table Reader | Serialization | 1201s | 1201s | 1201s | 1201s |
| Table Reader | **Total** | 5924s | 4042s | 3452s | 3075s |
| Streaming Reader | ReadFile | 889s | 889s | 889s | 889s |
| Streaming Reader | LoadGraph | 5589s | 3005s | 2200s | 1712s |
| Streaming Reader | Serialization | 1201s | 1201s | 1201s | 1201s |
| Streaming Reader | **Total** | 7679s | 5095s | 4290s | 3802s |

Fix alibaba#3116

minor fix and move modern graph fix grin test todo: do_start fix fix stash fix fix make rules unique dockerfile stash
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 12, 2023
refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)

minor change remove plugin-dir fix minor fix debug debug fix fix
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 15, 2023
refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)

fix bulk_load.yaml bash format some fix fix format fix grin test some fi check ci fix ci set fix ci fix dd f disable tmate
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 19, 2023
refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)

fix some bug fix fix refactor fix fix fix minor some fix
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 21, 2023
refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)

fix support default src_dst primarykey mapping in bulk load fix fix fix fix Ci
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 21, 2023
refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)

rename fix java and add get_person_name.cypher
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 21, 2023
refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 22, 2023
author shirly121 <[email protected]> 1694167237 +0800 committer xiaolei.zl <[email protected]> 1695348300 +0800 parent 6ab796e author shirly121 <[email protected]> 1694167237 +0800 committer xiaolei.zl <[email protected]> 1695348286 +0800

[GIE Compiler] fix bugs of columnId in schema

refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)

[GIE Compiler] minor fix use graphscope gstest format add movie queries
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 24, 2023
author shirly121 <[email protected]> 1694167237 +0800
committer xiaolei.zl <[email protected]> 1695348300 +0800

parent 6ab796e
author shirly121 <[email protected]> 1694167237 +0800
committer xiaolei.zl <[email protected]> 1695348286 +0800

[GIE Compiler] fix bugs of columnId in schema

refactor(flex): Replace the Adhoc csv reader with Arrow CSV reader (alibaba#3154)

1. Use the Arrow CSV reader to replace the current ad hoc CSV reader, to support more configurable options in `bulk_load.yaml`.
2. Introduce `CSVFragmentLoader` and `BasicFragmentLoader` for `MutablePropertyFragment`.

With this PR merged, `MutablePropertyFragment` supports loading a fragment from CSV with the following options:

- delimeter: default '|'
- header_row: default true
- quoting: default false
- quoting_char: default '"'
- escaping: default false
- escaping_char: default '\\'
- batch_size: the batch size used when reading the file into memory, default 1MB.
- batch_reader: default false. If set to true, `arrow::csv::StreamingReader` is used to parse the input file. Otherwise, `arrow::csv::TableReader` is used.

With this PR merged, graph loading performance improves. In the tables below, "Adhoc Reader" denotes the CSV parser implemented before this PR, and the columns 1, 2, 4, 8 denote the parallelism of graph loading, i.e. how many vertex/edge labels are processed concurrently.

Note that in the ReadFile phase TableReader is around 10x faster than StreamingReader. A possible reason is that TableReader uses multi-threading. See [arrow-csv-doc](https://arrow.apache.org/docs/cpp/csv.html) for details.

**SF30**

| Reader | Phase | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- | --- |
| Adhoc Reader | ReadFile\+LoadGraph | 805s | 468s | 349s | 313s |
| Adhoc Reader | Serialization | 126s | 126s | 126s | 126s |
| Adhoc Reader | **Total** | 931s | 594s | 475s | 439s |
| Table Reader | ReadFile | 9s | 9s | 9s | 9s |
| Table Reader | LoadGraph | 455s | 280s | 211s | 182s |
| Table Reader | Serialization | 126s | 126s | 126s | 126s |
| Table Reader | **Total** | 600s | 415s | 346s | 317s |
| Streaming Reader | ReadFile | 91s | 91s | 91s | 91s |
| Streaming Reader | LoadGraph | 555s | 289s | 196s | 149s |
| Streaming Reader | Serialization | 126s | 126s | 126s | 126s |
| Streaming Reader | **Total** | 772s | 506s | 413s | 366s |

**SF100**

| Reader | Phase | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- | --- |
| Adhoc Reader | ReadFile\+LoadGraph | 2720s | 1548s | 1176s | 948s |
| Adhoc Reader | Serialization | 409s | 409s | 409s | 409s |
| Adhoc Reader | **Total** | 3129s | 1957s | 1585s | 1357s |
| Table Reader | ReadFile | 24s | 24s | 24s | 24s |
| Table Reader | LoadGraph | 1576s | 949s | 728s | 602s |
| Table Reader | Serialization | 409s | 409s | 409s | 409s |
| Table Reader | **Total** | 2009s | 1382s | 1161s | 1035s |
| Streaming Reader | ReadFile | 300s | 300s | 300s | 300s |
| Streaming Reader | LoadGraph | 1740s | 965s | 669s | 497s |
| Streaming Reader | Serialization | 409s | 409s | 409s | 409s |
| Streaming Reader | **Total** | 2539s | 1674s | 1378s | 1206s |

**SF300**

| Reader | Phase | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- | --- |
| Adhoc Reader | ReadFile\+LoadGraph | 8260s | 4900s | 3603s | 2999s |
| Adhoc Reader | Serialization | 1201s | 1201s | 1201s | 1201s |
| Adhoc Reader | **Total** | 9461s | 6101s | 4804s | 4200s |
| Table Reader | ReadFile | 73s | 73s | 96s | 96s |
| Table Reader | LoadGraph | 4650s | 2768s | 2155s | 1778s |
| Table Reader | Serialization | 1201s | 1201s | 1201s | 1201s |
| Table Reader | **Total** | 5924s | 4042s | 3452s | 3075s |
| Streaming Reader | ReadFile | 889s | 889s | 889s | 889s |
| Streaming Reader | LoadGraph | 5589s | 3005s | 2200s | 1712s |
| Streaming Reader | Serialization | 1201s | 1201s | 1201s | 1201s |
| Streaming Reader | **Total** | 7679s | 5095s | 4290s | 3802s |

Fix alibaba#3116

minor fix and move modern graph fix grin test todo: do_start fix fix stash fix fix make rules unique dockerfile stash minor change remove plugin-dir fix minor fix debug debug fix fix fix bulk_load.yaml bash format some fix fix format fix grin test some fi check ci fix ci set fix ci fix dd f disable tmate fix some bug fix fix refactor fix fix fix minor some fix fix support default src_dst primarykey mapping in bulk load fix fix fix fix Ci rename fix java and add get_person_name.cypher [GIE Compiler] minor fix use graphscope gstest format add movie queries dd debug add movie test format format
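The "around 10x" remark can be checked against the ReadFile rows, and it helps to set it next to the end-to-end totals. The sketch below does that arithmetic in C++ with figures copied from the tables. It assumes the three tables correspond, in order, to SF30, SF100 and SF300 as listed in the PR description; `ScaleTimings`, `kTimings`, `Ratio` and `PrintRatios` are names made up for this illustration.

```cpp
#include <cstdio>

// Timings in seconds, copied from the tables above.
// Assumption: the tables are SF30, SF100, SF300 in that order.
struct ScaleTimings {
  const char* scale;
  double table_read_s;          // Table Reader, ReadFile (parallelism 1)
  double streaming_read_s;      // Streaming Reader, ReadFile (parallelism 1)
  double table_total_p8_s;      // Table Reader, Total at parallelism 8
  double streaming_total_p8_s;  // Streaming Reader, Total at parallelism 8
};

constexpr ScaleTimings kTimings[] = {
    {"SF30", 9.0, 91.0, 317.0, 366.0},
    {"SF100", 24.0, 300.0, 1035.0, 1206.0},
    {"SF300", 73.0, 889.0, 3075.0, 3802.0},
};

// How many times longer the slower measurement took.
constexpr double Ratio(double slower_s, double faster_s) {
  return slower_s / faster_s;
}

// Prints the ReadFile-only ratio next to the end-to-end ratio per scale.
void PrintRatios() {
  for (const ScaleTimings& t : kTimings) {
    std::printf("%s: ReadFile %.1fx, Total (parallelism 8) %.2fx\n", t.scale,
                Ratio(t.streaming_read_s, t.table_read_s),
                Ratio(t.streaming_total_p8_s, t.table_total_p8_s));
  }
}
```

Calling `PrintRatios()` from `main` shows that the gap is roughly 10x to 12x for ReadFile alone, while the end-to-end totals at parallelism 8 differ by well under 1.3x, since LoadGraph and Serialization dominate the total time.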
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 25, 2023
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 25, 2023
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 25, 2023
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 25, 2023
zhanglei1949 pushed a commit to zhanglei1949/GraphScope that referenced this pull request on Sep 26, 2023
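1. Use the Arrow CSV reader to replace the current ad hoc CSV reader, to support more configurable options in `bulk_load.yaml`.
2. Introduce `CSVFragmentLoader` and `BasicFragmentLoader` for `MutablePropertyFragment`.

With this PR merged, `MutablePropertyFragment` supports loading a fragment from CSV with the following options:

- delimeter: default '|'
- header_row: default true
- quoting: default false
- quoting_char: default '"'
- escaping: default false
- escaping_char: default '\\'
- batch_size: the batch size used when reading the file into memory, default 1MB.
- batch_reader: default false. If set to true, `arrow::csv::StreamingReader` is used to parse the input file. Otherwise, `arrow::csv::TableReader` is used.

With this PR merged, graph loading performance improves.
"Adhoc Reader" denotes the CSV parser implemented before this PR, and 1, 2, 4, 8 denote the parallelism of graph loading, i.e. how many vertex/edge labels are processed concurrently.
Note that in the ReadFile phase TableReader is around 10x faster than StreamingReader. A possible reason is that TableReader uses multi-threading.
See [arrow-csv-doc](https://arrow.apache.org/docs/cpp/csv.html) for details.

SF30

SF100

SF300

Fix #3116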
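As a rough illustration of the loading options described in this PR, the C++ sketch below collects them with their stated defaults and shows the `batch_reader` switch between the two Arrow readers. `CsvLoadOptions`, `ReaderKind` and `SelectReader` are hypothetical names introduced here, not taken from the GraphScope source. The comments naming Arrow fields describe a plausible mapping onto `arrow::csv::ParseOptions` and `arrow::csv::ReadOptions`; how the loader actually wires them is an assumption.

```cpp
#include <cstddef>

// Hypothetical holder for the CSV options listed in the PR description.
// Field names follow the option names as written there.
struct CsvLoadOptions {
  char delimeter = '|';      // plausibly arrow::csv::ParseOptions::delimiter
  bool header_row = true;    // whether the first row holds column names
  bool quoting = false;      // plausibly ParseOptions::quoting
  char quoting_char = '"';   // plausibly ParseOptions::quote_char
  bool escaping = false;     // plausibly ParseOptions::escaping
  char escaping_char = '\\'; // plausibly ParseOptions::escape_char
  // "1MB" is taken here as 1 << 20 bytes (assumption);
  // plausibly arrow::csv::ReadOptions::block_size.
  std::size_t batch_size = std::size_t{1} << 20;
  bool batch_reader = false; // true selects the streaming reader
};

enum class ReaderKind { kTable, kStreaming };

// Mirrors the rule stated for batch_reader: true selects
// arrow::csv::StreamingReader, otherwise arrow::csv::TableReader.
inline ReaderKind SelectReader(const CsvLoadOptions& options) {
  return options.batch_reader ? ReaderKind::kStreaming : ReaderKind::kTable;
}

// Human-readable name of the Arrow class behind each choice.
inline const char* ReaderName(ReaderKind kind) {
  return kind == ReaderKind::kStreaming ? "arrow::csv::StreamingReader"
                                        : "arrow::csv::TableReader";
}
```

With a default-constructed `CsvLoadOptions`, `SelectReader` returns `ReaderKind::kTable`, matching the default `batch_reader: false`. Setting `batch_reader = true` switches to the streaming reader, which per the timings in this PR trades ReadFile speed for batch-wise reading.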