From 1acddc344442501c063502a441c961cc8dfe0b1d Mon Sep 17 00:00:00 2001 From: lxsbupt Date: Wed, 21 Dec 2022 17:01:13 +0800 Subject: [PATCH] Merge gpugraph to develop (#48507) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * merge gpugraph to develop, fix code style * update for untrainable params for stage3. (#48577) * merge gpugraph to develop, trigger ci * [CodeStyle][isort][Dy2St] sort imports in test_error (#48746) * [CodeStyle][isort][Dy2St] sort imports in test_error * update lineno * Clear extra input (Bias, ResidualData) in OpMaker of conv2d (#47579) * delete Bias and ResidualData in OpMaker of conv2d * delete extra input of conv3d * refactor pass of conv_bias_fusion * fix mkldnn dependency * fix mkldnn compile * fix test_conv_bias_mkldnn_fuse_pass * polish some code * remove useless log * fix analyzer_vit_ocr_tester * fix conv_activation_mkldnn_fuse_pass * fix test_analyzer_ocr * add fused_conv_sig * fix performence regression * fix performance regression * make bilinear interpolate stable. (#48644) * make bilinear interpolate stable. * fix code * clear tmp var in ptq (#48660) * merge gpugraph to develop, fix py-api comment * merge gpugraph to develop, fix mac-python3 * merge gpugraph to develop, fix mac-python3 * [Dy2St] replace deprecated `load_module` with `exec_module` (#48679) * merge gpugraph to develop, fix mac-python3 * modify d2d copy to xpu::copy in xpu kernel, test=kunlun (#48710) * rm _test_eager_guard (#48767) * delete sampling_id api (#48543) * [NPU] add FLAGS_npu_storage_format env to enable npu storage format, test=develop (#48774) * optimize nchw<->nhwc kernel in fp16 model (#48692) * fix: oss just support sm>=75 (#48731) * update kl1 op list and optimize matmul unitest for kunlun (#48775) *test=kunlun * Fix accuracy fp16 kernel return fp32 tensor error (#48803) * [phi::DenseTensor] Replace Tensor with phi::DenseTensor (#48682) * [Zero-Dim] Support 0D for paddle.diagflat (#48735) * [Zero-Dim] Support 0D for paddle.diagflat * 【fluid api clear】Move batch norm1 (#47965) * modify slice infershape * code style * modify slice_unittest * temp fix * batch_norm api move * code_style * codestyle * ci_static * add __init__ * reset other change * revert .cc * add import batchnorm * conflict and revert * fix bug * fix third conflict one day * fix conflict * fix conflict bug * fix conflict bug * modify api * code_style * modify doc * add lost doc stable * fix conflict bug * ci lack of gpu * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv (#48654) * [remove fluid] PRelu BilinearTensorProduct * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv * [remove fluid] PRelu BilinearTensorProduct Conv2DTranspose SequenceConv RowConv * merge gpugraph to develop, rollback graph_send_recv * fix ci (#48730) * Remove redundant numpy output in Example code (1/3), test=document_fix (#48678) * Revise the English API docs (#48219) * Revise the example code of paddle.nn.dynamic_decode and paddle.nn.functional.diag_embed * mma qk tensor_core (#48087) * use mma for QK dot computing in fused_multi_transformer. 
* Update fused_multi_transformer_op.cu.h * remove lrn which is not used in paddle 2.0 (#47945) * replace scatter_nd and scatter_nd_add with paddle.scatter_nd and (#47960) paddle.scatter_nd_add * [PHI] Migrate mul_grad kernel (#48061) * cleanup unused code * unify is_int8 is_bfloat16 * Simplify matmul_v2 FWD kernel * remove RunKernel methods * remove import namespace * remove headers * clean fluid/phi cross imports * remove fluid axpy_handler * delete fluid methods * activations * OneDNNMemDesc * MKLDNNFormatForSize * MatchShapeToLayout * MKLDNNMemoryFormat * MKLDNNFormat * ReorderMKLDNNHandler * to_void_cast * review suggestions * interpolate * remove fluid dependency * init * ExecuteMatMulV2 * rm fluid kernel * matmul_grad * remove mutable_data * mul_grad * delete unnecessary shape and slice op (#48112) * Revise the English docs. * Revise the English docs of segment operator and others. * Re-revise the English doc format of paddle.einsum, paddle.unique_consecutive and paddle.disable_signal_handler. * Re-revise the English doc format; test=docs_preview * Update extension.py * Re-revise the English doc format; test=docs_preview * Re-revise the English doc format. To be reviewed: - paddle.linalg.svd - paddle.nn.functional.diag_embed - paddle.set_grad_enabled - paddle.disable_signal_handler - paddle.cumprod - paddle.device.cuda.stream_guard To be revised: - paddle.nn.dynamic_decode - paddle.einsum - paddle.unique_consecutive - paddle.linalg.svd - paddle.incubate.segment_min - paddle.incubate.segment_max - paddle.incubate.segment_sum - paddle.incubate.segment_mean ;test=docs_preview * Re-revise the English doc format. To be reviewed: - paddle.linalg.svd - paddle.nn.functional.diag_embed - paddle.set_grad_enabled - paddle.disable_signal_handler - paddle.cumprod - paddle.device.cuda.stream_guard - paddle.nn.dynamic_decode - paddle.unique_consecutive - paddle.linalg.svd To be revised: - paddle.einsum - paddle.incubate.segment_min - paddle.incubate.segment_max - paddle.incubate.segment_sum - paddle.incubate.segment_mean ;test=docs_preview * Re-revise the English doc format. To be reviewed: - paddle.linalg.svd - paddle.nn.functional.diag_embed - paddle.set_grad_enabled - paddle.disable_signal_handler - paddle.cumprod - paddle.device.cuda.stream_guard - paddle.nn.dynamic_decode - paddle.unique_consecutive - paddle.linalg.svd To be revised: - paddle.einsum - paddle.incubate.segment_min - paddle.incubate.segment_max - paddle.incubate.segment_sum - paddle.incubate.segment_mean ;test=docs_preview * update * test=docs_preview * update formula; test=docs_preview * update formula; test=docs_preview * remove this operator; test=docs_preview * add hyper link; test=docs_preview * add default value; test=docs_preview * update format; test=docs_preview * empty commit; test=docs_preview * fix codestyle issues; test=docs_preview * empty commit; test=docs_preview Co-authored-by: lzy <569782149@qq.com> Co-authored-by: Vvsmile <450864116@qq.com> Co-authored-by: Sławomir Siwek Co-authored-by: RichardWooSJTU <37864677+RichardWooSJTU@users.noreply.github.com> Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> Co-authored-by: Nyakku Shigure * [PHI] Migrate squeeze and squeeze_grad kernels (#48634) * squeeze kernel * squeeze fwd * whitespace * Fix the API docs under the paddle.nn.functional and paddle.nn packages (#48581) * assign cve number to pdsa, test=document_fix (#48846) * [fluid remove]: remove paddle.fluid.layers.yolo_box and paddle.fluid.layers.yolov3_loss (#48722) * remove paddle.fluid.layers.nn.temporal_shift * code check * rm unittest * remove fluid.yolo_box * remove fluid.yolov3_loss * change the comments of yolov3_loss to yolo_loss * merge gpugraph to develop, fix windows compile * merge gpugraph to develop, fix windows compile * merge gpugraph to develop, fix 
windows compile * Try add eval() to speedup the eigen performance. (#48855) * [Fluid Clean]move inplace_apis_indygraph_only from paddle.flud.dygraph.inplace_utils to paddle.utils (#48744) * move inplace_apis_indygraph_only from paddle.flud.dygraph.inplace_utils to paddle.utils * modify conflict * modify conflict * modify conflict * modify conflict * modify conflict * modify conflict * modify conflict * modify static-check ci error * fix conflict * modify failed tests * fix conflict * fix conflict * fix pool2d examples * modify conflict * fix failed tests * fix conflict * fix failed tests * modfiy problem of deleting pool2d * merge gpugraph to develop, fix windows compile * clean fluid task: transfer gaussian random api (#48529) * Delete duplicate quant nodes in QAT (#48751) * rm autograd func dynamic eager tests (#48788) * Setuptools optimization (#48770) * optimize setup.py * modify setup.py * modify setup.py * modify setup.py * modify setup.py after zhangbo reviewed * [CodeStyle][F811] fix some test cases shadowed by the same name (#48745) * [CodeStyle][F811] fix some unittests * fix setup.py * remove ignore from flake8 config * remove repeat TestAbsDoubleGradCheck * fix rrelu test * fix fft ut * add noqa in fluid.lstm ut * add rtol and atol in test_matmul_v2_op * update rtol * empty commit * empty commit * revert changes in matmul ut and add noqa * rename test case name * set free_when_no_cache_hit default value to true (#48815) * [Clean Fluid] Rm and mv some fluid dygrah apis (#48576) Remove fluid dygrah apis GroupNorm TreeConv Move fluid dygraph apis Flatten SpectralNorm * [Inference] inference add cinn interface (#48741) * Clean and migrate fluid APIs of paddle.fluid.layers.control_flow (#48233) * Merge branch 'reduce_sum' of https://github.com/GhostScreaming/Paddle into mine_fluid_clean_common. * Fix some bugs. * Clean APIs in python/paddle/fluid/layers/control_flow.py * Polish code style. * Change API. * Fix some bugs. * Fix some bugs. * remove gpu_info.h from phi dependencies (#48811) * [Paddle Inference] Add add onehot trt converter (#48655) * add onehot trt converter * add unitest * fix bug * opt code * fix bug * fix depth_tensor * fix unitest * fix bug * fix unitest * fix bug * fix bug * fix bug * fix bug * [PHI decoupling] remove bbox_util.h from phi dependencies (#48761) * remove bbox_util.h from phi * add file bbox_util.h * reframe bbox_util.h * Optimize Paddle diagonal (#47904) * [API Clean]Clean __all__ to avoid exposing usless API (#48713) * [API Clean]Clean __all__ to avoid exposing usless API * fix import * fix typo * remove tracedLayer unittest * Clean fluid APIs in distributed and fleet files (#48851) * Fix bug of reduce_sum op. When input.numel() > INT32_MAX, its result is wrong. * Remove climits. * Clean fluid API in paddle/distributed and paddle/fleetx folders. Include following files: python/paddle/distributed/__init__.py python/paddle/distributed/collective.py python/paddle/distributed/fleet/utils/fs.py python/paddle/distributed/fleet/utils/hybrid_parallel_inference.py python/paddle/distributed/fleet/utils/hybrid_parallel_util.py python/paddle/distributed/fleet/utils/internal_storage.py python/paddle/distributed/launch/context/device.py python/paddle/distributed/parallel.py python/paddle/distributed/parallel_with_gloo.py python/paddle/distributed/spawn.py python/paddle/framework/__init__.py To be mentioned, 'paddle.fluid.dygraph.parallel.ParallelEnv' and 'fluid.framework.core' keeps unchanged in those files. 
ParallelEnv is used by paddle.fluid.dygraph.parallel.DataParallel. However, APIs in paddle.fluid.dygraph.parallel can't be migrated to paddle.distributed, as there exists cyclic import dependencies in modules like paddle.static, paddle.tensor. And 'fluid.framework.core' will be changed to import framework.core after fluid.core is transmitted. * Change TODO authors. * rm kunlun xpu2_op_list (#48826) *test=kunlun * remove detection_output, iou_similarity and bipartite_match (#48773) * Set WaiterType of kGpuSync to kCPU (#48758) * [Migrate Fluid] Migrate Decoder, BeamSearchDecoder (#48754) * [Inference] Enable infer shape cache. (#48312) * [Fluid Clean] remove unfold, deformable_roi_pooling, shard_index, hard_swish, mish, uniform_random, unbind (#48451) * fix-gpups setup.py (#48888) * fix-gpups * test=document_fix * [PHI decoupling] move cuda_graph from fluid to phi (#48686) * move cuda_graph from fluid to phi * move device_memory_aligment from fluid to phi * Revert "move device_memory_aligment from fluid to phi" This reverts commit b92fcd39a0a50fdac13278f49be0237a85f3a13f. * update xpu cmake * fix english docs typo errors (#48599) * fix english docs typo errors the errors in docs as same as chinese pr 5468 * update docs; test=docs_preview Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> * [XPU] add load op into oplist. (#48860) * [XPU] add load op into oplist. * remove test_sampling_id_op_xpu.py * 【fluid clean】remove fluid.dygraph.rnn.lstmcell and fluid.dygraph.rnn.grucell (#48719) * refine bsd doc (#48882) * [Paddle Inference] General optimization for no_varlen embedding layernorm (#48580) * general optimization no_varlen embedding layernorm * fix tmp directories (#48863) * rm dygraph_to_static eager guard tests part2 minst2ptb_lm (#48793) * rm dygraph_to_static eager guard tests part2 minst2ptb_lm * merge gpugraph to develop, fix the_one_ps.py for gpups * [remove fluid] under unittesets of linear api (#48564) * [remove fluid] under unittesets of linear api * [remove fluid] under unittesets of linear api * [remove fluid] under unittesets of linear api * [remove fluid] under unittesets of linear api * [remove fluid] under unittesets of linear api * [remove fluid] under unittesets of linear api * [remove fluid] fluid dygrapn linear api * [remove fluid] fluid dygrapn linear api * [remove fluid] fluid dygrapn linear api * [remove fluid.layers.cross_entropy] remove unit tests (part 1) (#48726) * replace layers.cross_entropy with paddle.entropy * fix args * fix codestyle * proper fix (#48360) Reenabled ext_reorder recording for TransDataLayoutFromOneDNN * [remove fluid.layers.matmul] remove fluid.layers.matmul in example code (#48818) * replace fluid.layers.matmul in fluid/io.py * fix doc error in fluid.layers.nn.sampling_id * remove test_auto_search_dist_matmul_op.py (#48794) * delete mean api (#48764) * clean test_op_name_conflict (#48704) * opt kernel_selection error msg (#48864) * rewrite delete_weight_dequant_linear_op_encoder/decoder pass (#48650) * rewrite delete_weight_deqquant_linear_op_encoder/decoder pass * [XPU] add set_value and set_value_grad (#48845) * merge gpugraph to develop, fix gpups ut * Add QuantizedMatmul in QAT (#47997) * fix 'BlasAXPBY unimplemented' error with custom device (#48762) * fix 'BlasAXPBY unimplemented' error with custom device * fix utils CmakeLists bug * first commit (#38143) * [Auto Parallel] Add cluster partition and dm to pm (#48320) * add cluster_partition and device_meshes to process_meshes funcs * add unitest * fix paddle2cinn float16 
type support bug (#48249) * remove pool2d from fluid (#48512) * remove pool2d * [fluid remove]: remove paddle.fluid.layers.detection_map, paddle.fluid.metrics.DetectionMAP and paddle.fluid.evaluator.DetectionMAP (#48674) * remove paddle.fluid.layers.nn.temporal_shift * code check * rm unittest * remove paddle.fluid.layers.detection_map and the class:DetectionMAP * [PHI decoupling] move "flags.h" from fluid to phi (#48696) * add set_lr & get_lr for stage2 optimizer. (#48857) * move share_buffer kernel to phi (#48858) * move share_buffer kernel to phi * fix ut * add source file * fix window links * [Kernel Selection] Simplify kernel selection process in phi, reduce search number to half (#47771) * simplify SelectKernelOrThrowError function in phi * opt kernel_selection process * polish code, fix backend error * Support static graph code-gen for scalar and int_array (#48792) * add suppport_tensor for code_gen to static graph * support code-gen for int_array * polish code * fix bug of data_type * clean unittest test_model_cast_to_bf16 (#48705) * rm dy2static eager tests part1 bert2loop (#48790) * rm dygraph_to_static eager guard tests part3 reinforce2yolo (#48795) * rm distribution uniform eager guard test (#48768) * rm distribution uniform eager guard test * review * replace cross_entropy in python/paddle/fluid/tests/unittests/test_[a-n]*.py except test_dist_transpiler.py (#48913) * replace cross_entropy except in python/paddle/fluid/tests/unittests/*.py && unittests/*/*.py (#48922) * [Paddle Inference]add cutlass act set in conv_elementwise_add_act_fuse_pass (#48838) * add cutlass act set in conv_elementwise_add_act_fuse_pass * move fluid.layers.create_global_var to static.create_global_var (#48777) * Modified the Kernel policy. When the compute is NHWC (#48563) * temporally disable set_value (#48942) * xpu support inplace flatten (#48909) This is a PR to catch up with latest xpu white list strategy (https://github.com/PaddlePaddle/Paddle/pull/48606) , since original list only include 'fluid' fashion names, but new list must include 'phi' fashion as well. Refer to paddle/phi/core/kernel_factory.cc for more details. 
* fix:vit_attention ut (#48884) * mv fused_bias_dropout_residual_ln to fluid manual dir (#48824) * mv fused_bias_dropout_residual_ln to fluid manual dir * rm useless comments * bug fix (#48829) * move ops_extra_info_gen.py from phi to fluid (#48926) * fix scale type in alpha and beta (#48887) * [inference][trt] upgrade prelu op (#48528) * add prelu * Revise multiple docs as required, corresponding to Chinese docs PR #5453 (#48886) * fix doc * test=document_fix Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> * replace cross_entropy in python/paddle/fluid/tests/unittests/*.py except test*.py (#48919) * [remove fluid] Remove fluid APIs (#48641) * [CodeStyle] fix renamed files not being monitored by Codestyle Check (#48892) * [fluid remove]: remove paddle.fluid.layers.box_coder and paddle.fluid.layers.polygon_box_transform (#48896) * remove fluid_box_coder and polygon_box_transform * code check * [Custom XPU Support] Custom extension support xpu backend (#48733) * support custom_xpu * update cmake to test xpu * support custom_xpu, verify mechanism * fix test_custom_relu_op_xpu_setup.py, test=kunlun * fix FLAGS_init_allocated_mem * cancel TIMEOUT property * reset FLAGS_init_allocated_mem property * rm mlu ops eager guard tests (#48769) * rm npu instance_np op for eager guard tests (#48785) * remove xpu eager guard tests (#48786) * [remove fluid.layers.cross_entropy] remove unit tests (part 3) (#48918) * replace cross_entropy in python/paddle/fluid/tests/unittests/test_[o-z]*.py plus test_dist_transpiler.py * fix test_prune * [Inference] optimize some code and fix some bug (#48780) * clean ir_pass_manager and fix map_depthwise_conv_to_conv_pass * fix unitest timeout * [PHI] Migrate reshape kernel (#48749) * reshape * typo * remove header * support py3 in setup.py (#48905) * support py3 in setup.py * support setup.py bdist_wheel in py3 * support py3 in setup.py * modify run_setup * [Paddle-TRT] add cast between int64 tensor and Paddle-TRT (#45547) * Add cast between int64 tensor and Paddle-TRT * Add Unit testing. * fix sharding_stage1 amp O2 decorate bug (#48960) * [remove fluid] fluid dygraph Embedding (#48806) * [remove fluid] fluid dygraph Embedding * [remove fluid] fluid dygraph Embedding * [remove fluid] fluid dygraph Embedding * [remove fluid] fluid dygraph Embedding * [remove fluid] fluid dygraph Embedding * [remove fluid] fluid dygraph Embedding * fix for mkldnn (#48852) * H2D data transfer optimization with usage of structure type for stack kernel (#48899) * first commit. 
* refine performance with fast_divmod * refine performance with fast_divmod * rm accuracy and auc in extra __all__ (#48986) * Add dynamic checks for collective communication on NCCL (#48915) * chore: unify `SingleTensor` * feat: dynamic check * support sharding in fp16 on xpu, (#48897) * support sharding in fp16 on xpu, change reduce_max to reduce_sum for found nan or inf * update * Support cross-step stream synchronization for standalone executor (#48809) * Add UT * Support cross-step stream synchronization for standalone executor * Fix typos * Fix typos * Update UTs * Generate static graph code of some ops by yaml (#48771) * generate static graph code of some ops by yaml, test = develop * fix 'take_along_axis' yaml style * reset scatter/scatter_nd_add * delete the comments of put_along_axis * fix a bug in GetTrtWeight (#48993) * add static_ops.yaml for static op (#48991) * [PHI decoupling] move norm_utils.cu.h from fluid to phi and remove norm_utils.h in fluid (#48930) * move norm_utils.cu.h from fluid to phi * remove norm_utils.h in fluid * fix bugs and replace mutable_data with Alloc * replace mutable_data with Alloc * forbid conv op whose weight is not a persistable weight into Paddle-TRT (#48763) * fix: Move the pass location to the appropriate location (#48951) * Enhance check_nan_inf implementation for CPU. (#48591) * Enable to print device info. * Enhance the nan and inf checking for cpu. * Implement a common print function. * Unify the check of complex numbers. * Rewrite the omp method. * Count and print the number of nan and inf. * Change the print content. * Add unittest. * [PHI] OneDNN version of Copy (#48539) * OneDNN version of Copy, tranpose kernels adjusted * style fixes in tranpose_grad * redundant headers deleted * fix: there are some bugs with trt 8.0 (#48921) * fix: there are some bugs with trt 8.0 * fix:windows CI trt is too old * Optimization of Eigh op with ssyevj_batched runtime api (#48560) * fix codestyle * add double complex complex dtype support for syevj_batched * fix use_syevj flag for precision loss when input dtype of syevj_batch is complex128 in some case * optimize eigh in different case * fix missing ; bug * fix use_syevj bug * fix use_cusolver_syevj_batched flag * replace cross_entropy in python/paddle/fluid/tests/unittests/*/*.py except unittests/*.py (#48920) * [PHI decoupling] replace dependency of inclusive_scan.h from phi (#48980) * replace dependency of inclusive_scan.h from phi * format code * fluid API magration : Assert, increment, cond (#48885) * [Clean fluid] Add inner function _elementwise_op_with_axis (#48748) * add inner function _elementwise_op_with_axis * fix transformer_model * polish API code * remove elementwise_div/mul api * delete API in __all__ * delete elementwise_mul completely * polish elementwise_mul call * polish internal api * resolve conflict, fix rnn.py * use non-inplace call * delete elementwise_mul api test * delete elementwise_mul api test * clean elementwise_add/sub * restore _elementwise_op_in_dygraph in nn.py * test_convert_to_mixed_precision.py use tempfile for temporary models/params (#48819) * Tighten the Interception strategy (#48947) * test approve ,test=document_fix * test approve ,test=document_fix * test approve ,test=document_fix * [CodeStyle][isort][F401] fix some regression issues (#48936) * [CodeStyle][isort][F401] fix some regression issues * add import paddle to fix eval call * rm multinode eager guard tests (#48766) * rm multinode eager guard tests * remove unwanted tests * reset process_mpi test * rm 
unittests eager guard tests part5 dataloader2dygraph_mnist (#48816) * [PHI]Add new Tensor type and migrate save_combine kernel (#47856) * add new tensor * fix windows compile bugs * fix ci bugs * fix ci bugs * fix ci bugs * perfect according comment * fix ci compile bugs * add raw tensor * fix ci bugs * modify code by comment * delete String * [Fluid Clean]move BatchNorm from flud.dygraph.nn to paddle.nn.layer.norm (#48734) * move BatchNorm from flud.dygraph.nn to paddle.nn.layer.norm * modfiy conflict * modify pre-commit error * modify static-check ci error * fix failed tests * modify conflict * modify conflict * delete import modelu GRUUnit * fix falied test * fix failed testes * fix failed tests * fix failed tests * fix failed test * fix error in test_fused_resenet_basic_block_op_xpu.py * modify after xiaoguang reviewed * [Setup] Ignore @PADDLE_BINARY_DIR@ files (#49002) * [Setup] Ignore @PADDLE_BINARY_DIR@ files * test=document_fix * reshape onednn test reimplemented (#48850) * - UT reshape onednn - Fix test test2 - test4 - test5 - test6 test7 - test8 - Ut reinvented - cosmetic * - fix * - fix * - fix * - fix * - Fix * - fix * - fix * - fix * - Fix * lint * update fused_multi_transformer_encoder_pass support GPT new matmul API (#48953) * fit paddle.matmul in fleetx.gpt * Revert "set free_when_no_cache_hit default value to true (#48815)" (#48968) This reverts commit 592ed40b58d7bc015de87368cc611e865bdcd6ea. * [Paddle Inference]fix some transformer unitest (#48929) * fix some transformer unitest * Enable Generic-Plugin support FP16 (#48807) * support conv1d quant & skip calibrate zero-size tensor (#48912) * enable custom device save model on device memory && fix conflict (#48221) * [api move] cvm (#48989) * [api move] cvm * [api move] cvm * [api move] cvm * [api move] cvm * [api move] cvm * [api move] cvm * [api move] cvm * [api move] ci test * [api move] ci test * [api move] ci test * Bugfix: xpu now only support single node multi-card, bkcl_comm_num should always set to 1 (#48961) * rm unittests eager guard tests part23 where2zeros (#48895) * rm unittests eager guard tests part17 number2pool1d (#48840) * [NPU] fix FLAGS_npu_storage_format flag in python, test=develop (#48976) * remove fleet eager guard tests (#48765) * rm unittests eager guard tests part6 eager_run2expand_v2 (#48817) * rm unittests eager guard tests part12 imperative_optimizer2resnet (#48833) * [fluid clean] remove 4 fluid.layers api and imigrate 2 fluid.layer api (#48972) * fluid clean layer * docs * remove reset reference in unittest for `fluid.layers.cross_entropy` (#49012) * replace cross_entropy in test*.py except python/paddle/fluid/tests/unittests/*.py (#48978) * remove linear_chain_crf and crf_decoding from fluid (#48996) * remove linear_chain_crf and crf_decoding * Generate static graph code of some ops by yaml (#48977) * generate static graph code of some ops by yaml * fix the code-style of yaml * fix the framework_ci for triangular_solve * change the 'data_type' of scatter * add the 'out: Out' of scatter_nd_add * [tools] Update summary env (#48627) * [tools] remove deprecated api , fix macOS get version error * [tools] Rename the value that returns null * [tools] add gcc, clang, cmak, libc version * [tools] fix cudnn read error * [tools] add gpu devices list, drive based * [issue] update 3_build-installation-issue.yml * [tools] fix get gpu list AttributeError * [Dy2St] transforms.RandomVerticalFlip Support static mode (#49024) * add static RandomVerticalFlip * object => unittest.TestCase * Save 
fused_attention op memory when dropout_rate = 0.0 (#48902) * save fused_attention memory when dropout_rate = 0.0 * add ut * fix ut bug * fix fused_layernorm_residual_dropout_bias_test.cu * Correct multiple inputs and outputs (#48872) * [CodeStyle][isort][Dy2St] sort imports for paddle.jit (#48637) * isort jit * refine comment * remove non-public apis from __all__ (#48952) * remove non-public apis from __all__ * fix code style * fix rmsprop_ yaml bug (#49026) * fix rmsprop_ yaml bug * Fixed the dead link bug in the API documentation (#48969) * first pr * Revise nn.py * Revise nn.py 2.0 * Revise rnn.py;test=document_fix * test=document_fix Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> * Change mutable_data to ctx.Alloc. (#49001) * [inference][trt] add more unary op and square (#48534) * add more unary op and square * Support ninja (#48932) * move inplace_apis_indygraph_only from paddle.flud.dygraph.inplace_utils to paddle.utils * modify conflict * modify conflict * modify conflict * modify conflict * modify conflict * modify conflict * modify conflict * modify static-check ci error * fix conflict * modify failed tests * fix conflict * fix conflict * fix pool2d examples * modify conflict * fix failed tests * fix conflict * fix failed tests * modfiy problem of deleting pool2d * support Ninja in setup.py * support different cmake_generators * modify after reviewed * delete unused denotes * Deleted mkldnn_inplace_pass code (#47818) * Deleted mkldnn_inplace_pass code * Fixed error with cmake * Resolve conflicts * hide log (#49045) * test=doucment_fix * test=document_fix * [Sparse]Optimize performance of sparse conv on T4 (#49009) * modify cmake file for cuda11.8 compile (#49020) * modify cmake file for cuda11.8 compile * add op_library(fused_embedding_eltwise_layernorm_op DEPS bert_encoder_functor) * remove dropout from fluid (#48319) * remove dropout * nullptr bugfix for XPU pg mode (#49043) * nullptr bugfix for XPU pg mode Also a few kernels is added to xpu whitelist * increase error msg length * Divide elementwise case from BroadcastKernel and refine transpose autotune (#33051) * First Commit. * add some codes * add elementwise loader * fix code styles * merge with develop * add some changes both in elementwise and transpose * add init operation in broadcast kernel. 
* change codes according to pr suggestions about transpose file * fix error for op-benchmark ci * fix according to ci * add condition of skipif (#48791) * add condition of skipif * fix code format error * Update test_fused_gate_attention_op.py update * rm unittests eager guard tests part9 histogram2imperative_dataloader (#48825) * rm unittests eager guard tests part9 histogram2imperative_dataloader * rm basic * rm unittests eager guard test part14 initializer2layer_norm (#48835) * rm unittests eager guard test part14 initializer2layer_norm * minor change * [Bugfix] recompute dep filter param (#49010) * recompute dep filter param * recompute dep for reshard * [Paddle Inference] rewrite convert_to_mixed_precision (#48853) * [CodeStyle] fix c++17-extensions warning on macos (#49017) * fix c++17-extensions warning on macos * fix type fix c++17-extensions warning on macos fix c++17-extensions warning on macos * Add custom CUDNN finding paths for 64bit Windows (#49066) * remove prior_box (#49006) * remove prior_box * modify the sequence of paras of prior_box in multi_box_head api * InstanceNorm1D, InstanceNorm2D, InstanceNorm3D (#48940) * modified: python/paddle/nn/layer/norm.py * modified: python/paddle/nn/layer/norm.py * modified: python/paddle/nn/layer/norm.py * modified: python/paddle/nn/layer/norm.py * modified: python/paddle/nn/layer/norm.py * modified: python/paddle/nn/layer/norm.py * test=docs_preview * fix doc format of InstanceNorm2D * test=docs_preview * modified: python/paddle/nn/functional/loss.py modified: python/paddle/nn/functional/norm.py modified: python/paddle/nn/layer/loss.py modified: python/paddle/nn/layer/norm.py * test=docs_preview * test=docs_preview * [AutoParallel] recompute tuning (#48608) * [AutoParallel] recompute tuning * fix conflict * update comment * bug fix * update rc algo * tiny fix * fix clear process_group * remove comment * update segment print * fix import OpRole * adapt amp pass and grad_clip pass for opt_tuner * update tuning config * fix import * annotate recompute info on ops and upgrade recompute pass * add op_namescope for seed op * record reserved vars * fix recompute var's dist_attr * fix strategy unittest * adapt for fp16 * update unittest * revert copy opt * update unittest * rename set_recompute_segments * fix unittest * fluid API migration: array_read, array_write (#49022) * del array_write & array_read * fix import err * fix import err * fix example codes * Keep double-buffer reader for static mode (#49068) * Fix nullptr to TestFuseGemmEpilogueReluBWDFP* (#48997) * support fp16 index sample (#47897) * add index sample fp16 support * remove fluid APIs in distributed_strategy.py and role_maker.py * Revert "remove fluid APIs in distributed_strategy.py and role_maker.py" This reverts commit 223bbee990d3bf69e252fc3c0f19e3873550a264. 
* fix instantiated more than once * clean codes * rm unittest eager guard tests part20 sparse_mv2split (#48879) * rm unittests eager guard tests part11 imperative_layer2ocr (#48828) * rm unittests eager guard tests part11 imperative_layer2ocr * review * rm eager guard tests part3_1 (#49059) * fix: gloo compatible (#49084) * rm eager guard tests part3_3 (#49061) * fix bug (#49081) * [Inference] memory_optimize and mkldnn problem (#49054) * memory_optimize and mkldnn problem * update * update * update * Remove/move 16 fluid APIs (#48377) * remove density_prior_box * remove anchor_generator * remove roi_perspective_transform * remove generate_proposal_labels * remove generate_mask_labels * remove generate_proposals * remove box_clip * remove retinanet_detection_output * remove multiclass_nms * remove locality_aware_nms * remove matrix_nms * remove distribute_fpn_proposals * remove box_decoder_and_assign * remove collect_fpn_proposals * remove 2 trt files * move prior_box to static/nn/common.py * move multi_box_head to static/nn/common.py * fix for CI/CE * remove retinanet_detection_output * restore compile_vs_runtime_white_list.py * restore test_retinanet_detection_output to white list * replace nn.flatten by paddle.flatten, and fix doc for retinanet_target_assign * add enable_static in demo and fix bug * remove roi_perspective_transform in test_layers * remove multi_box_head * change self.multiclass_nms to _legacy_C_ops.multiclass_nms * empty commit * empty commit * check code style * fix prior_box * fix CI * remove redundant prior_box in detection.py * fix docs * remove detection * fix prior_box en doc * delete prior_box in common * remove prior_box from __init__.py * fix embedding multihead (#49085) * SetDeviceId in StreamSafeCUDAAllocation (#49080) * [PHI decoupling] Remove fluid imports from MKLDNN code (#48981) * fix wrong handler name * mkldnn_engine -> onednn_engine * remove fluid/errors.h imports * remove fluid/enforce.h imports * remove note and unnecessary import * remove fluid/pretty_log.h imports * remove fluid/place.h imports * remove fluid/data_layout_transform.h imports * remove fluid/device_context.h imports * remove mkldnn_helper code * remove fluid/mkldnn_reuse.h imports * pretty_log import * replace cross_entropy in python/paddle/fluid/tests/unittests/*.py (#48975) * Fix the docs of paddle.amp.decorate and other APIs (#48983) * The APIs involved are paddle.amp.decorate paddle.static.npu_places paddle.signal.istft paddle.signal.stft paddle.linalg.eigvalsh paddle.randint_like * change signal.stft * mark the low argument of randint_like as optional * ; test=docs_preview * fix the annotation format; test=docs_preview * fix the formula format * fix models and other arguments of decorate * test=document_fix Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> * Update some docs per online documentation requests 61~70 (#49014) * Update docstring: 1. Remove "This OP" from the description of the cast function in python/paddle/tensor/manipulation.py; 2. Adjust the order of the parameter descriptions of the Print function in python/paddle/fluid/layers/control_flow.py and add optional notes; 3. Add an optional note for the logical_and function in python/paddle/tensor/logic.py; 4. Add optional notes for the from_generator and from_dataset methods of the DataLoader class in python/paddle/fluid/reader.py; 5. Since the param_attr argument of the crf_decoding function in python/paddle/fluid/layers/nn.py can in practice be regarded as having a default value of None, add an optional note; 6. 
修复 python/paddle/static/nn/common.py 中 data_norm 函数描述里 tex 语法错误的问题,并一并修复同一文件中的相同问题。 * 根据 review 意见修改部分内容。 * 将谓语动词去掉第三人称单数形式。 * 同步中文文档变更。 * string-->str; test=document_fix Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> * merge gpugraph to develop, fix gloo wrapper * merge gpugraph to develop, fix ci * merge gpugraph to develop, fix gloo wrapper * merge gpugraph to develop, fix ci * merge gpugraph to develop, fix fleet.py * merge gpugraph to develop, fix merge error * merge gpugraph to develop, fix merge error * merge gpugraph to develop, add python ut * merge gpugraph to develop, fix code style * merge gpugraph to develop, add c++ ut * merge gpugraph to develop, fix code style * merge gpugraph to develop, fix data_feed.h * merge gpugraph to develop, fix code style * merge gpugraph to develop, fix code style * merge gpugraph to develop, fix code style * merge gpugraph to develop, fix code style Co-authored-by: wuhuachaocoding <77733235+wuhuachaocoding@users.noreply.github.com> Co-authored-by: Nyakku Shigure Co-authored-by: zyfncg Co-authored-by: xiongkun Co-authored-by: ceci3 Co-authored-by: zhangyikun02 <48021248+zhangyk0314@users.noreply.github.com> Co-authored-by: Weilong Wu Co-authored-by: 201716010711 <87008376+201716010711@users.noreply.github.com> Co-authored-by: Qi Li Co-authored-by: zhoutianzi666 <39978853+zhoutianzi666@users.noreply.github.com> Co-authored-by: feng_shuai Co-authored-by: QingshuChen Co-authored-by: WangZhen <23097963+0x45f@users.noreply.github.com> Co-authored-by: 张春乔 <83450930+Liyulingyue@users.noreply.github.com> Co-authored-by: 傅剑寒 Co-authored-by: xiaoguoguo626807 <100397923+xiaoguoguo626807@users.noreply.github.com> Co-authored-by: wangzhen38 <41941775+wangzhen38@users.noreply.github.com> Co-authored-by: Zhou Wei <1183042833@qq.com> Co-authored-by: Kevin吴嘉文 <417333277@qq.com> Co-authored-by: Zman <35071129+Atlantisming@users.noreply.github.com> Co-authored-by: lzy <569782149@qq.com> Co-authored-by: Vvsmile <450864116@qq.com> Co-authored-by: Sławomir Siwek Co-authored-by: RichardWooSJTU <37864677+RichardWooSJTU@users.noreply.github.com> Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> Co-authored-by: hjyp <53164956+Tomoko-hjf@users.noreply.github.com> Co-authored-by: Vigi Zhang Co-authored-by: zqw_1997 <118182234+zhengqiwen1997@users.noreply.github.com> Co-authored-by: Yiqun Liu Co-authored-by: risemeup1 <62429225+risemeup1@users.noreply.github.com> Co-authored-by: Guanghua Yu <742925032@qq.com> Co-authored-by: 姜永久 <34344716+yjjiang11@users.noreply.github.com> Co-authored-by: wanghuancoder Co-authored-by: Roc <30228238+sljlp@users.noreply.github.com> Co-authored-by: Wilber Co-authored-by: Ghost Screaming Co-authored-by: Netpunk <69072522+Patrick-Star125@users.noreply.github.com> Co-authored-by: 六个骨头 <46243324+zrr1999@users.noreply.github.com> Co-authored-by: Aurelius84 Co-authored-by: Ruibiao Chen Co-authored-by: liu zhengxi <380185688@qq.com> Co-authored-by: heyanru <81976792+heyanru01@users.noreply.github.com> Co-authored-by: tianshuo78520a <707759223@qq.com> Co-authored-by: huangjiyi <43315610+huangjiyi@users.noreply.github.com> Co-authored-by: Infinity_lee Co-authored-by: houj04 <35131887+houj04@users.noreply.github.com> Co-authored-by: lugimzzz <63761690+lugimzzz@users.noreply.github.com> Co-authored-by: Wangzheee <634486483@qq.com> Co-authored-by: sneaxiy <32832641+sneaxiy@users.noreply.github.com> Co-authored-by: kangguangli Co-authored-by: jakpiase Co-authored-by: HongyuJia Co-authored-by: haosicheng 
<47998305+HarperCy@users.noreply.github.com> Co-authored-by: Chang Xu Co-authored-by: Kai Song <50285351+USTCKAY@users.noreply.github.com> Co-authored-by: limingshu <61349199+JamesLim-sy@users.noreply.github.com> Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: jiangcheng Co-authored-by: ccrrong <101700995+ccrrong@users.noreply.github.com> Co-authored-by: PuQing Co-authored-by: Leo Chen Co-authored-by: cyber-pioneer <116002591+cyber-pioneer@users.noreply.github.com> Co-authored-by: niuliling123 <51102941+niuliling123@users.noreply.github.com> Co-authored-by: james Co-authored-by: wenbin Co-authored-by: ZZK <359521840@qq.com> Co-authored-by: Zhang Jun Co-authored-by: yjphhw <43883055+yjphhw@users.noreply.github.com> Co-authored-by: Yuanle Liu Co-authored-by: Wen Sun <35923278+HermitSun@users.noreply.github.com> Co-authored-by: lzydev <1528794076@qq.com> Co-authored-by: Paulina Gacek Co-authored-by: feifei-111 <2364819892@qq.com> Co-authored-by: YuanRisheng Co-authored-by: Jacek Czaja Co-authored-by: weishengying <63448337+weishengying@users.noreply.github.com> Co-authored-by: engineer1109 Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com> Co-authored-by: Ryan <44900829+DrRyanHuang@users.noreply.github.com> Co-authored-by: joanna.wozna.intel Co-authored-by: JYChen Co-authored-by: jjyaoao <88936287+jjyaoao@users.noreply.github.com> Co-authored-by: Hulek Co-authored-by: zhangkaihuo Co-authored-by: YUNSHEN XIE <1084314248@qq.com> Co-authored-by: JZ-LIANG Co-authored-by: Tinson Lai Co-authored-by: Ayuan <79981115+Ayuan2021@users.noreply.github.com> Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com> Co-authored-by: Ming-Xu Huang Co-authored-by: wangxiaoning <71813629+wangxn12138@users.noreply.github.com> Co-authored-by: Haohongxiang <86215757+haohongxiang@users.noreply.github.com> Co-authored-by: HydrogenSulfate <490868991@qq.com> Co-authored-by: mjxs <52824616+kk-2000@users.noreply.github.com> Co-authored-by: 学渣戊 --- cmake/external/jemalloc.cmake | 35 + cmake/external/rocksdb.cmake | 31 +- cmake/third_party.cmake | 3 + paddle/fluid/distributed/common/afs_warpper.h | 12 + .../fluid/distributed/ps/service/ps_client.h | 15 +- .../distributed/ps/service/ps_local_client.cc | 30 +- .../distributed/ps/service/ps_local_client.h | 17 +- .../fluid/distributed/ps/table/CMakeLists.txt | 11 - paddle/fluid/distributed/ps/table/accessor.h | 14 +- .../ps/table/common_graph_table.cc | 521 ++- .../distributed/ps/table/common_graph_table.h | 156 +- .../distributed/ps/table/ctr_dymf_accessor.cc | 39 +- .../distributed/ps/table/ctr_dymf_accessor.h | 43 +- .../ps/table/depends/rocksdb_warpper.h | 227 +- .../distributed/ps/table/graph/graph_node.h | 61 +- .../ps/table/memory_sparse_table.cc | 22 +- .../ps/table/memory_sparse_table.h | 6 +- .../distributed/ps/table/ssd_sparse_table.cc | 1664 ++++++++-- .../distributed/ps/table/ssd_sparse_table.h | 79 +- paddle/fluid/distributed/ps/table/table.cc | 7 +- paddle/fluid/distributed/ps/table/table.h | 3 + paddle/fluid/distributed/ps/wrapper/fleet.cc | 11 + paddle/fluid/distributed/ps/wrapper/fleet.h | 3 + paddle/fluid/distributed/the_one_ps.proto | 1 + paddle/fluid/framework/barrier.h | 113 + paddle/fluid/framework/data_feed.cc | 18 +- paddle/fluid/framework/data_feed.cu | 2524 ++++++++++++--- paddle/fluid/framework/data_feed.h | 191 +- paddle/fluid/framework/data_feed.proto | 4 + paddle/fluid/framework/data_set.cc | 132 +- paddle/fluid/framework/data_set.h | 18 +- 
paddle/fluid/framework/device_worker.h | 2 + .../framework/dist_multi_trainer_test.cc | 29 + .../framework/distributed_strategy.proto | 1 + paddle/fluid/framework/fleet/CMakeLists.txt | 3 +- paddle/fluid/framework/fleet/gloo_wrapper.cc | 3 +- paddle/fluid/framework/fleet/heter_context.h | 2 + .../cudf/concurrent_unordered_map.cuh.h | 19 +- .../framework/fleet/heter_ps/feature_value.cu | 46 +- .../framework/fleet/heter_ps/feature_value.h | 175 +- .../framework/fleet/heter_ps/gpu_graph_node.h | 92 +- .../fleet/heter_ps/gpu_graph_utils.h | 55 +- .../fleet/heter_ps/graph_gpu_ps_table.h | 137 +- .../fleet/heter_ps/graph_gpu_ps_table_inl.cu | 1266 +++++++- .../fleet/heter_ps/graph_gpu_wrapper.cu | 515 ++- .../fleet/heter_ps/graph_gpu_wrapper.h | 107 +- .../framework/fleet/heter_ps/hashtable.h | 30 +- .../fleet/heter_ps/hashtable_kernel.cu | 306 +- .../fleet/heter_ps/hashtable_kernel.kps | 112 +- .../framework/fleet/heter_ps/heter_comm.h | 425 ++- .../framework/fleet/heter_ps/heter_comm_inl.h | 2884 ++++++++++++++--- .../fleet/heter_ps/heter_comm_kernel.cu | 796 ++++- .../fleet/heter_ps/heter_comm_kernel.h | 140 +- .../fleet/heter_ps/heter_comm_kernel.kps | 196 +- .../framework/fleet/heter_ps/heter_ps.cc | 2 +- .../framework/fleet/heter_ps/heter_ps.cu | 7 +- .../fluid/framework/fleet/heter_ps/heter_ps.h | 17 +- .../framework/fleet/heter_ps/heter_ps_base.h | 9 +- .../fleet/heter_ps/heter_resource.cc | 126 +- .../framework/fleet/heter_ps/heter_resource.h | 33 + .../fluid/framework/fleet/heter_ps/mem_pool.h | 60 +- .../framework/fleet/heter_ps/optimizer.cuh.h | 7 +- .../framework/fleet/heter_ps/optimizer_conf.h | 6 +- .../fleet/heter_ps/test_cpu_query.cu | 19 +- .../fluid/framework/fleet/ps_gpu_wrapper.cc | 1128 +++++-- paddle/fluid/framework/fleet/ps_gpu_wrapper.h | 92 +- paddle/fluid/framework/hogwild_worker.cc | 56 +- paddle/fluid/framework/multi_trainer.cc | 1 + paddle/fluid/inference/CMakeLists.txt | 3 +- paddle/fluid/inference/tensorrt/helper.h | 1 + .../auto_growth_best_fit_allocator.cc | 33 +- .../auto_growth_best_fit_allocator.h | 7 + paddle/fluid/operators/shuffle_batch_op.cu | 34 +- paddle/fluid/platform/monitor.cc | 1 + paddle/fluid/platform/profiler.proto | 2 +- paddle/fluid/pybind/data_set_py.cc | 6 + paddle/fluid/pybind/fleet_py.cc | 24 +- paddle/fluid/pybind/gloo_wrapper_py.cc | 6 + paddle/fluid/pybind/ps_gpu_wrapper_py.cc | 6 + paddle/phi/core/flags.cc | 65 +- paddle/phi/kernels/gpu/graph_reindex_funcs.h | 4 +- .../phi/kernels/gpu/graph_reindex_kernel.cu | 285 +- paddle/phi/kernels/graph_reindex_kernel.h | 26 + paddle/phi/kernels/send_u_recv_grad_kernel.h | 1 + paddle/scripts/paddle_build.sh | 1 - paddle/utils/string/string_helper.h | 36 + paddle/utils/string/string_helper_test.cc | 20 + python/paddle/distributed/fleet/__init__.py | 1 + .../fleet/base/distributed_strategy.py | 17 +- .../distributed/fleet/base/runtime_factory.py | 8 +- python/paddle/distributed/fleet/fleet.py | 121 +- python/paddle/distributed/ps/the_one_ps.py | 8 +- python/paddle/fluid/dataset.py | 43 + .../fluid/tests/unittests/CMakeLists.txt | 5 + .../fleet/test_fleet_rolemaker_new.py | 6 +- .../fluid/tests/unittests/dist_fleet_ctr.py | 4 + .../fluid/tests/unittests/test_dataset.py | 92 + .../unittests/test_dist_fleet_minimize.py | 249 ++ .../tests/unittests/test_dist_fleet_spmt.py | 251 ++ .../fluid/tests/unittests/test_downpoursgd.py | 1 + python/paddle/fluid/trainer_factory.py | 7 + python/paddle/fluid/transpiler/collective.py | 181 +- 102 files changed, 13524 insertions(+), 2946 deletions(-) create mode 100644 
cmake/external/jemalloc.cmake mode change 100755 => 100644 paddle/fluid/distributed/ps/wrapper/fleet.h create mode 100644 paddle/fluid/framework/barrier.h create mode 100644 python/paddle/fluid/tests/unittests/test_dist_fleet_minimize.py create mode 100644 python/paddle/fluid/tests/unittests/test_dist_fleet_spmt.py diff --git a/cmake/external/jemalloc.cmake b/cmake/external/jemalloc.cmake new file mode 100644 index 0000000000000..efce686b20929 --- /dev/null +++ b/cmake/external/jemalloc.cmake @@ -0,0 +1,35 @@ +include(ExternalProject) + +set(JEMALLOC_PROJECT "extern_jemalloc") +set(JEMALLOC_URL + https://github.com/jemalloc/jemalloc/releases/download/5.1.0/jemalloc-5.1.0.tar.bz2 +) +set(JEMALLOC_BUILD ${THIRD_PARTY_PATH}/jemalloc/src/extern_jemalloc) +set(JEMALLOC_SOURCE_DIR "${THIRD_PARTY_PATH}/jemalloc") +set(JEMALLOC_INSTALL ${THIRD_PARTY_PATH}/install/jemalloc) +set(JEMALLOC_INCLUDE_DIR ${JEMALLOC_INSTALL}/include) +set(JEMALLOC_DOWNLOAD_DIR "${JEMALLOC_SOURCE_DIR}/src/${JEMALLOC_PROJECT}") + +set(JEMALLOC_STATIC_LIBRARIES + ${THIRD_PARTY_PATH}/install/jemalloc/lib/libjemalloc_pic.a) +set(JEMALLOC_LIBRARIES + ${THIRD_PARTY_PATH}/install/jemalloc/lib/libjemalloc_pic.a) + +ExternalProject_Add( + extern_jemalloc + PREFIX ${JEMALLOC_SOURCE_DIR} + URL ${JEMALLOC_URL} + INSTALL_DIR ${JEMALLOC_INSTALL} + DOWNLOAD_DIR "${JEMALLOC_DOWNLOAD_DIR}" + BUILD_COMMAND $(MAKE) + BUILD_IN_SOURCE 1 + INSTALL_COMMAND $(MAKE) install + CONFIGURE_COMMAND "${JEMALLOC_DOWNLOAD_DIR}/configure" + --prefix=${JEMALLOC_INSTALL} --disable-initial-exec-tls) + +add_library(jemalloc STATIC IMPORTED GLOBAL) +set_property(TARGET jemalloc PROPERTY IMPORTED_LOCATION + ${JEMALLOC_STATIC_LIBRARIES}) + +include_directories(${JEMALLOC_INCLUDE_DIR}) +add_dependencies(jemalloc extern_jemalloc) diff --git a/cmake/external/rocksdb.cmake b/cmake/external/rocksdb.cmake index 40af6b564b3fc..8ad2bf1e2e8f1 100644 --- a/cmake/external/rocksdb.cmake +++ b/cmake/external/rocksdb.cmake @@ -14,6 +14,13 @@ include(ExternalProject) +# find_package(jemalloc REQUIRED) + +set(JEMALLOC_INCLUDE_DIR ${THIRD_PARTY_PATH}/install/jemalloc/include) +set(JEMALLOC_LIBRARIES + ${THIRD_PARTY_PATH}/install/jemalloc/lib/libjemalloc_pic.a) +message(STATUS "rocksdb jemalloc:" ${JEMALLOC_LIBRARIES}) + set(ROCKSDB_PREFIX_DIR ${THIRD_PARTY_PATH}/rocksdb) set(ROCKSDB_INSTALL_DIR ${THIRD_PARTY_PATH}/install/rocksdb) set(ROCKSDB_INCLUDE_DIR @@ -22,21 +29,39 @@ set(ROCKSDB_INCLUDE_DIR set(ROCKSDB_LIBRARIES "${ROCKSDB_INSTALL_DIR}/lib/librocksdb.a" CACHE FILEPATH "rocksdb library." 
FORCE) -set(ROCKSDB_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC") +set(ROCKSDB_COMMON_FLAGS + "-g -pipe -O2 -W -Wall -Wno-unused-parameter -fPIC -fno-builtin-memcmp -fno-omit-frame-pointer" +) +set(ROCKSDB_FLAGS + "-DNDEBUG -DROCKSDB_JEMALLOC -DJEMALLOC_NO_DEMANGLE -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -DOS_LINUX -DROCKSDB_FALLOCATE_PRESENT -DHAVE_SSE42 -DHAVE_PCLMUL -DZLIB -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_PTHREAD_ADAPTIVE_MUTEX -DROCKSDB_BACKTRACE -DROCKSDB_SUPPORT_THREAD_LOCAL -DROCKSDB_USE_RTTI -DROCKSDB_SCHED_GETCPU_PRESENT -DROCKSDB_RANGESYNC_PRESENT -DROCKSDB_AUXV_GETAUXVAL_PRESENT" +) +set(ROCKSDB_CMAKE_CXX_FLAGS + "${ROCKSDB_COMMON_FLAGS} -DROCKSDB_LIBAIO_PRESENT -msse -msse4.2 -mpclmul ${ROCKSDB_FLAGS} -fPIC -I${JEMALLOC_INCLUDE_DIR} -lz -ldl" +) +set(ROCKSDB_CMAKE_C_FLAGS + "${ROCKSDB_COMMON_FLAGS} ${ROCKSDB_FLAGS} -DROCKSDB_LIBAIO_PRESENT -fPIC -I${JEMALLOC_INCLUDE_DIR}" +) include_directories(${ROCKSDB_INCLUDE_DIR}) +set(CMAKE_CXX_LINK_EXECUTABLE + "${CMAKE_CXX_LINK_EXECUTABLE} -pthread -ldl -lrt -lz") ExternalProject_Add( extern_rocksdb ${EXTERNAL_PROJECT_LOG_ARGS} PREFIX ${ROCKSDB_PREFIX_DIR} - GIT_REPOSITORY "https://github.com/facebook/rocksdb" - GIT_TAG v6.10.1 + GIT_REPOSITORY "https://github.com/Thunderbrook/rocksdb" + GIT_TAG 6.19.fb UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -DWITH_BZ2=OFF -DPORTABLE=1 -DWITH_GFLAGS=OFF + -DWITH_TESTS=OFF + -DWITH_JEMALLOC=ON + -DWITH_BENCHMARK_TOOLS=OFF + -DJeMalloc_LIBRARIES=${JEMALLOC_LIBRARIES} + -DJeMalloc_INCLUDE_DIRS=${JEMALLOC_INCLUDE_DIR} -DCMAKE_CXX_FLAGS=${ROCKSDB_CMAKE_CXX_FLAGS} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} INSTALL_COMMAND diff --git a/cmake/third_party.cmake b/cmake/third_party.cmake index 8516fe8177ca5..199a61fca7cb7 100755 --- a/cmake/third_party.cmake +++ b/cmake/third_party.cmake @@ -423,6 +423,9 @@ if(WITH_PSCORE) include(external/rocksdb) # download, build, install rocksdb list(APPEND third_party_deps extern_rocksdb) + + include(external/jemalloc) # download, build, install jemalloc + list(APPEND third_party_deps extern_jemalloc) endif() if(WITH_RPC diff --git a/paddle/fluid/distributed/common/afs_warpper.h b/paddle/fluid/distributed/common/afs_warpper.h index 542d65d7a649f..44ba6485fafa1 100644 --- a/paddle/fluid/distributed/common/afs_warpper.h +++ b/paddle/fluid/distributed/common/afs_warpper.h @@ -66,6 +66,10 @@ class FsReadChannel { return 0; } + inline int read(char* data, size_t size) { + return fread(data, 1, size, _file.get()); + } + private: uint32_t _buffer_size; FsChannelConfig _config; @@ -114,6 +118,14 @@ class FsWriteChannel { return write_line(data.c_str(), data.size()); } + inline uint32_t write(const char* data, size_t size) { + size_t write_count = fwrite(data, 1, size, _file.get()); + if (write_count != size) { + return -1; + } + return 0; + } + private: uint32_t _buffer_size; FsChannelConfig _config; diff --git a/paddle/fluid/distributed/ps/service/ps_client.h b/paddle/fluid/distributed/ps/service/ps_client.h index 5654669d76fdb..74f946b2253aa 100644 --- a/paddle/fluid/distributed/ps/service/ps_client.h +++ b/paddle/fluid/distributed/ps/service/ps_client.h @@ -148,10 +148,12 @@ class PSClient { return fut; } - virtual ::std::future PullSparsePtr(char **select_values, + virtual ::std::future PullSparsePtr(int shard_id, + char **select_values, size_t table_id, const uint64_t *keys, - size_t num) { + size_t num, + uint16_t pass_id) { VLOG(0) << "Did not implement"; std::promise promise; std::future fut = 
promise.get_future(); @@ -160,6 +162,15 @@ class PSClient { } virtual std::future PrintTableStat(uint32_t table_id) = 0; + virtual std::future SaveCacheTable(uint32_t table_id, + uint16_t pass_id, + size_t threshold) { + VLOG(0) << "Did not implement"; + std::promise promise; + std::future fut = promise.get_future(); + promise.set_value(-1); + return fut; + } // 确保所有积攒中的请求都发起发送 virtual std::future Flush() = 0; diff --git a/paddle/fluid/distributed/ps/service/ps_local_client.cc b/paddle/fluid/distributed/ps/service/ps_local_client.cc index e8bf426710bc3..8d7071ad9ea69 100644 --- a/paddle/fluid/distributed/ps/service/ps_local_client.cc +++ b/paddle/fluid/distributed/ps/service/ps_local_client.cc @@ -260,10 +260,12 @@ ::std::future PsLocalClient::PushDense(const Region* regions, // return done(); //} -::std::future PsLocalClient::PullSparsePtr(char** select_values, +::std::future PsLocalClient::PullSparsePtr(int shard_id, + char** select_values, size_t table_id, const uint64_t* keys, - size_t num) { + size_t num, + uint16_t pass_id) { // FIXME // auto timer = // std::make_shared("pslib_downpour_client_pull_sparse"); @@ -278,6 +280,8 @@ ::std::future PsLocalClient::PullSparsePtr(char** select_values, table_context.pull_context.ptr_values = select_values; table_context.use_ptr = true; table_context.num = num; + table_context.shard_id = shard_id; + table_context.pass_id = pass_id; // table_ptr->PullSparsePtr(select_values, keys, num); table_ptr->Pull(table_context); @@ -285,6 +289,28 @@ ::std::future PsLocalClient::PullSparsePtr(char** select_values, return done(); } +::std::future PsLocalClient::PrintTableStat(uint32_t table_id) { + auto* table_ptr = GetTable(table_id); + std::pair ret = table_ptr->PrintTableStat(); + VLOG(0) << "table id: " << table_id << ", feasign size: " << ret.first + << ", mf size: " << ret.second; + return done(); +} + +::std::future PsLocalClient::SaveCacheTable(uint32_t table_id, + uint16_t pass_id, + size_t threshold) { + auto* table_ptr = GetTable(table_id); + std::pair ret = table_ptr->PrintTableStat(); + VLOG(0) << "table id: " << table_id << ", feasign size: " << ret.first + << ", mf size: " << ret.second; + if (ret.first > (int64_t)threshold) { + VLOG(0) << "run cache table"; + table_ptr->CacheTable(pass_id); + } + return done(); +} + ::std::future PsLocalClient::PushSparseRawGradient( size_t table_id, const uint64_t* keys, diff --git a/paddle/fluid/distributed/ps/service/ps_local_client.h b/paddle/fluid/distributed/ps/service/ps_local_client.h index 593805547af84..583ea8052eb01 100644 --- a/paddle/fluid/distributed/ps/service/ps_local_client.h +++ b/paddle/fluid/distributed/ps/service/ps_local_client.h @@ -76,18 +76,19 @@ class PsLocalClient : public PSClient { return fut; } - virtual ::std::future PullSparsePtr(char** select_values, + virtual ::std::future PullSparsePtr(int shard_id, + char** select_values, size_t table_id, const uint64_t* keys, - size_t num); + size_t num, + uint16_t pass_id); - virtual ::std::future PrintTableStat(uint32_t table_id) { - std::promise prom; - std::future fut = prom.get_future(); - prom.set_value(0); + virtual ::std::future PrintTableStat(uint32_t table_id); + + virtual ::std::future SaveCacheTable(uint32_t table_id, + uint16_t pass_id, + size_t threshold); - return fut; - } virtual ::std::future PushSparse(size_t table_id, const uint64_t* keys, const float** update_values, diff --git a/paddle/fluid/distributed/ps/table/CMakeLists.txt b/paddle/fluid/distributed/ps/table/CMakeLists.txt index 4b233a8b27f28..2a5c4ad25d16b 100644 
--- a/paddle/fluid/distributed/ps/table/CMakeLists.txt +++ b/paddle/fluid/distributed/ps/table/CMakeLists.txt @@ -53,16 +53,6 @@ cc_library( set_source_files_properties( tensor_accessor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) -cc_library( - tensor_table - SRCS - DEPS eigen3 - ps_framework_proto - executor - scope - device_context - tensor - ${TABLE_DEPS}) set_source_files_properties(table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) @@ -98,7 +88,6 @@ cc_library( table.cc DEPS ${TABLE_DEPS} common_table - tensor_table ps_framework_proto string_helper device_context diff --git a/paddle/fluid/distributed/ps/table/accessor.h b/paddle/fluid/distributed/ps/table/accessor.h index b55c77bf52d84..5ac0de018eeaf 100644 --- a/paddle/fluid/distributed/ps/table/accessor.h +++ b/paddle/fluid/distributed/ps/table/accessor.h @@ -121,6 +121,9 @@ class ValueAccessor { virtual void UpdateStatAfterSave(float* value, int param) {} // 判断该value是否保存到ssd virtual bool SaveSSD(float* value) = 0; + // 判断热启时是否过滤slot对应的feasign + virtual bool FilterSlot(float* value) { return false; } + // virtual bool SaveCache(float* value, int param, @@ -162,9 +165,18 @@ class ValueAccessor { return 0; } + virtual bool SaveMemCache(float* value, + int param, + double global_cache_threshold, + uint16_t pass_id) { + return true; + } + + virtual void UpdatePassId(float* value, uint16_t pass_id) {} + virtual float GetField(float* value, const std::string& name) { return 0.0; } #define DEFINE_GET_INDEX(class, field) \ - virtual int get_##field##_index() override { return class ::field##_index(); } + virtual int get_##field##_index() { return class ::field##_index(); } protected: size_t _value_size; diff --git a/paddle/fluid/distributed/ps/table/common_graph_table.cc b/paddle/fluid/distributed/ps/table/common_graph_table.cc index 5687d94309ea9..2b545047c3bfe 100644 --- a/paddle/fluid/distributed/ps/table/common_graph_table.cc +++ b/paddle/fluid/distributed/ps/table/common_graph_table.cc @@ -24,6 +24,8 @@ #include "gflags/gflags.h" #include "paddle/fluid/distributed/common/utils.h" #include "paddle/fluid/distributed/ps/table/graph/graph_node.h" +#include "paddle/fluid/framework/fleet/fleet_wrapper.h" +#include "paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h" #include "paddle/fluid/framework/generator.h" #include "paddle/fluid/framework/io/fs.h" #include "paddle/fluid/platform/timer.h" @@ -31,6 +33,9 @@ #include "paddle/fluid/string/string_helper.h" DECLARE_bool(graph_load_in_parallel); +DECLARE_bool(graph_get_neighbor_id); +DECLARE_int32(gpugraph_storage_mode); +DECLARE_uint64(gpugraph_slot_feasign_max_num); namespace paddle { namespace distributed { @@ -53,27 +58,38 @@ int32_t GraphTable::Load_to_ssd(const std::string &path, } paddle::framework::GpuPsCommGraphFea GraphTable::make_gpu_ps_graph_fea( - std::vector &node_ids, int slot_num) { - std::vector> bags(task_pool_size_); - for (int i = 0; i < task_pool_size_; i++) { - auto predsize = node_ids.size() / task_pool_size_; + int gpu_id, std::vector &node_ids, int slot_num) { + size_t shard_num = 64; + std::vector> bags(shard_num); + std::vector feature_array[shard_num]; + std::vector slot_id_array[shard_num]; + std::vector node_id_array[shard_num]; + std::vector node_fea_info_array[shard_num]; + for (size_t i = 0; i < shard_num; i++) { + auto predsize = node_ids.size() / shard_num; bags[i].reserve(predsize * 1.2); + feature_array[i].reserve(predsize * 1.2 * slot_num); + slot_id_array[i].reserve(predsize * 1.2 * slot_num); + 
node_id_array[i].reserve(predsize * 1.2); + node_fea_info_array[i].reserve(predsize * 1.2); } for (auto x : node_ids) { - int location = x % shard_num % task_pool_size_; + int location = x % shard_num; bags[location].push_back(x); } std::vector> tasks; - std::vector feature_array[task_pool_size_]; - std::vector slot_id_array[task_pool_size_]; - std::vector node_id_array[task_pool_size_]; - std::vector - node_fea_info_array[task_pool_size_]; + if (slot_feature_num_map_.size() == 0) { + slot_feature_num_map_.resize(slot_num); + for (int k = 0; k < slot_num; ++k) { + slot_feature_num_map_[k] = 0; + } + } + for (size_t i = 0; i < bags.size(); i++) { if (bags[i].size() > 0) { - tasks.push_back(_shards_task_pool[i]->enqueue([&, i, this]() -> int { + tasks.push_back(_cpu_worker_pool[gpu_id]->enqueue([&, i, this]() -> int { uint64_t node_id; paddle::framework::GpuPsFeaInfo x; std::vector feature_ids; @@ -90,15 +106,12 @@ paddle::framework::GpuPsCommGraphFea GraphTable::make_gpu_ps_graph_fea( x.feature_offset = feature_array[i].size(); int total_feature_size = 0; for (int k = 0; k < slot_num; ++k) { - v->get_feature_ids(k, &feature_ids); - total_feature_size += feature_ids.size(); - if (!feature_ids.empty()) { - feature_array[i].insert(feature_array[i].end(), - feature_ids.begin(), - feature_ids.end()); - slot_id_array[i].insert( - slot_id_array[i].end(), feature_ids.size(), k); + auto feature_ids_size = + v->get_feature_ids(k, feature_array[i], slot_id_array[i]); + if (slot_feature_num_map_[k] < feature_ids_size) { + slot_feature_num_map_[k] = feature_ids_size; } + total_feature_size += feature_ids_size; } x.feature_size = total_feature_size; node_fea_info_array[i].push_back(x); @@ -110,32 +123,48 @@ paddle::framework::GpuPsCommGraphFea GraphTable::make_gpu_ps_graph_fea( } } for (size_t i = 0; i < tasks.size(); i++) tasks[i].get(); + + std::stringstream ss; + for (int k = 0; k < slot_num; ++k) { + ss << slot_feature_num_map_[k] << " "; + } + VLOG(0) << "slot_feature_num_map: " << ss.str(); + + tasks.clear(); + paddle::framework::GpuPsCommGraphFea res; uint64_t tot_len = 0; - for (int i = 0; i < task_pool_size_; i++) { + for (size_t i = 0; i < shard_num; i++) { tot_len += feature_array[i].size(); } VLOG(0) << "Loaded feature table on cpu, feature_list_size[" << tot_len << "] node_ids_size[" << node_ids.size() << "]"; res.init_on_cpu(tot_len, (unsigned int)node_ids.size(), slot_num); unsigned int offset = 0, ind = 0; - for (int i = 0; i < task_pool_size_; i++) { - for (size_t j = 0; j < node_id_array[i].size(); j++) { - res.node_list[ind] = node_id_array[i][j]; - res.fea_info_list[ind] = node_fea_info_array[i][j]; - res.fea_info_list[ind++].feature_offset += offset; - } - for (size_t j = 0; j < feature_array[i].size(); j++) { - res.feature_list[offset + j] = feature_array[i][j]; - res.slot_id_list[offset + j] = slot_id_array[i][j]; - } + for (size_t i = 0; i < shard_num; i++) { + tasks.push_back( + _cpu_worker_pool[gpu_id]->enqueue([&, i, ind, offset, this]() -> int { + auto start = ind; + for (size_t j = 0; j < node_id_array[i].size(); j++) { + res.node_list[start] = node_id_array[i][j]; + res.fea_info_list[start] = node_fea_info_array[i][j]; + res.fea_info_list[start++].feature_offset += offset; + } + for (size_t j = 0; j < feature_array[i].size(); j++) { + res.feature_list[offset + j] = feature_array[i][j]; + res.slot_id_list[offset + j] = slot_id_array[i][j]; + } + return 0; + })); offset += feature_array[i].size(); + ind += node_id_array[i].size(); } + for (size_t i = 0; i < tasks.size(); i++) 
tasks[i].get(); return res; } paddle::framework::GpuPsCommGraph GraphTable::make_gpu_ps_graph( - int idx, std::vector ids) { + int idx, const std::vector &ids) { std::vector> bags(task_pool_size_); for (int i = 0; i < task_pool_size_; i++) { auto predsize = ids.size() / task_pool_size_; @@ -213,7 +242,8 @@ int32_t GraphTable::add_node_to_ssd( ch, sizeof(int) * 2 + sizeof(uint64_t), str) == 0) { - uint64_t *stored_data = ((uint64_t *)str.c_str()); // NOLINT + const uint64_t *stored_data = + reinterpret_cast(str.c_str()); // NOLINT int n = str.size() / sizeof(uint64_t); char *new_data = new char[n * sizeof(uint64_t) + len]; memcpy(new_data, stored_data, n * sizeof(uint64_t)); @@ -221,14 +251,14 @@ int32_t GraphTable::add_node_to_ssd( _db->put(src_id % shard_num % task_pool_size_, ch, sizeof(int) * 2 + sizeof(uint64_t), - (char *)new_data, // NOLINT + reinterpret_cast(new_data), n * sizeof(uint64_t) + len); delete[] new_data; } else { _db->put(src_id % shard_num % task_pool_size_, ch, sizeof(int) * 2 + sizeof(uint64_t), - (char *)data, // NOLINT + reinterpret_cast(data), len); } } @@ -254,7 +284,7 @@ char *GraphTable::random_sample_neighbor_from_ssd( ch, sizeof(int) * 2 + sizeof(uint64_t), str) == 0) { - uint64_t *data = ((uint64_t *)str.c_str()); // NOLINT + const uint64_t *data = reinterpret_cast(str.c_str()); int n = str.size() / sizeof(uint64_t); std::unordered_map m; // std::vector res; @@ -281,7 +311,8 @@ char *GraphTable::random_sample_neighbor_from_ssd( // res.push_back(data[pos]); } for (int i = 0; i < actual_size; i += 8) { - VLOG(2) << "sampled an neighbor " << *(uint64_t *)&buff[i]; // NOLINT + VLOG(2) << "sampled an neighbor " + << *reinterpret_cast(&buff[i]); } return buff; } @@ -311,7 +342,8 @@ int64_t GraphTable::load_graph_to_memory_from_ssd(int idx, if (_db->get(i, ch, sizeof(int) * 2 + sizeof(uint64_t), str) == 0) { count[i] += (int64_t)str.size(); for (size_t j = 0; j < str.size(); j += sizeof(uint64_t)) { - uint64_t id = *(uint64_t *)(str.c_str() + j); // NOLINT + uint64_t id = + *reinterpret_cast(str.c_str() + j); add_comm_edge(idx, v, id); } } @@ -364,8 +396,8 @@ void GraphTable::make_partitions(int idx, int64_t byte_size, int device_len) { continue; } std::string key = iters[next]->key().ToString(); - int type_idx = *(int *)key.c_str(); // NOLINT - int temp_idx = *(int *)(key.c_str() + sizeof(int)); // NOLINT + int type_idx = *(reinterpret_cast(key.c_str())); + int temp_idx = *(reinterpret_cast(key.c_str() + sizeof(int))); if (type_idx != 0 || temp_idx != idx) { iters[next]->Next(); next++; @@ -373,7 +405,7 @@ void GraphTable::make_partitions(int idx, int64_t byte_size, int device_len) { } std::string value = iters[next]->value().ToString(); std::uint64_t i_key = - *(uint64_t *)(key.c_str() + sizeof(int) * 2); // NOLINT + *reinterpret_cast(key.c_str() + sizeof(int) * 2); for (int i = 0; i < part_len; i++) { if (memory_remaining[i] < (int64_t)value.size()) { score[i] = -100000.0; @@ -382,7 +414,7 @@ void GraphTable::make_partitions(int idx, int64_t byte_size, int device_len) { } } for (size_t j = 0; j < value.size(); j += sizeof(uint64_t)) { - uint64_t v = *((uint64_t *)(value.c_str() + j)); // NOLINT + uint64_t v = *(reinterpret_cast(value.c_str() + j)); int index = -1; if (id_map.find(v) != id_map.end()) { index = id_map[v]; @@ -464,6 +496,7 @@ void GraphTable::export_partition_files(int idx, std::string file_path) { } void GraphTable::clear_graph(int idx) { for (auto p : edge_shards[idx]) { + p->clear(); delete p; } @@ -472,6 +505,127 @@ void 
GraphTable::clear_graph(int idx) { edge_shards[idx].push_back(new GraphShard()); } } + +void GraphTable::release_graph() { + // Before releasing graph, prepare for sampling ids and embedding keys. + build_graph_type_keys(); + + if (FLAGS_gpugraph_storage_mode == + paddle::framework::GpuGraphStorageMode::WHOLE_HBM) { + build_graph_total_keys(); + } + // clear graph + if (FLAGS_gpugraph_storage_mode == paddle::framework::GpuGraphStorageMode:: + MEM_EMB_FEATURE_AND_GPU_GRAPH || + FLAGS_gpugraph_storage_mode == paddle::framework::GpuGraphStorageMode:: + SSD_EMB_AND_MEM_FEATURE_GPU_GRAPH) { + clear_edge_shard(); + } else { + clear_graph(); + } +} + +void GraphTable::release_graph_edge() { + if (FLAGS_gpugraph_storage_mode == + paddle::framework::GpuGraphStorageMode::WHOLE_HBM) { + build_graph_total_keys(); + } + clear_edge_shard(); +} + +void GraphTable::release_graph_node() { + build_graph_type_keys(); + if (FLAGS_gpugraph_storage_mode != paddle::framework::GpuGraphStorageMode:: + MEM_EMB_FEATURE_AND_GPU_GRAPH && + FLAGS_gpugraph_storage_mode != paddle::framework::GpuGraphStorageMode:: + SSD_EMB_AND_MEM_FEATURE_GPU_GRAPH) { + clear_feature_shard(); + } else { + merge_feature_shard(); + feature_shrink_to_fit(); + } +} + +void GraphTable::clear_edge_shard() { + VLOG(0) << "begin clear edge shard"; + std::vector> tasks; + for (auto &type_shards : edge_shards) { + for (auto &shard : type_shards) { + tasks.push_back( + load_node_edge_task_pool->enqueue([&shard, this]() -> int { + delete shard; + return 0; + })); + } + } + for (size_t i = 0; i < tasks.size(); i++) tasks[i].get(); + for (auto &shards : edge_shards) { + shards.clear(); + for (size_t i = 0; i < shard_num_per_server; i++) { + shards.push_back(new GraphShard()); + } + } + VLOG(0) << "finish clear edge shard"; +} + +void GraphTable::clear_feature_shard() { + VLOG(0) << "begin clear feature shard"; + std::vector> tasks; + for (auto &type_shards : feature_shards) { + for (auto &shard : type_shards) { + tasks.push_back( + load_node_edge_task_pool->enqueue([&shard, this]() -> int { + delete shard; + return 0; + })); + } + } + for (size_t i = 0; i < tasks.size(); i++) tasks[i].get(); + for (auto &shards : feature_shards) { + shards.clear(); + for (size_t i = 0; i < shard_num_per_server; i++) { + shards.push_back(new GraphShard()); + } + } + VLOG(0) << "finish clear feature shard"; +} + +void GraphTable::feature_shrink_to_fit() { + std::vector> tasks; + for (auto &type_shards : feature_shards) { + for (auto &shard : type_shards) { + tasks.push_back( + load_node_edge_task_pool->enqueue([&shard, this]() -> int { + shard->shrink_to_fit(); + return 0; + })); + } + } + for (size_t i = 0; i < tasks.size(); i++) tasks[i].get(); +} + +void GraphTable::merge_feature_shard() { + VLOG(0) << "begin merge_feature_shard"; + std::vector> tasks; + for (size_t i = 0; i < feature_shards[0].size(); i++) { + tasks.push_back(load_node_edge_task_pool->enqueue([i, this]() -> int { + for (size_t j = 1; j < feature_shards.size(); j++) { + feature_shards[0][i]->merge_shard(feature_shards[j][i]); + } + return 0; + })); + } + for (size_t i = 0; i < tasks.size(); i++) tasks[i].get(); + feature_shards.resize(1); +} + +void GraphTable::clear_graph() { + VLOG(0) << "begin clear_graph"; + clear_edge_shard(); + clear_feature_shard(); + VLOG(0) << "finish clear_graph"; +} + int32_t GraphTable::load_next_partition(int idx) { if (next_partition >= static_cast(partitions[idx].size())) { VLOG(0) << "partition iteration is done"; @@ -519,7 +673,7 @@ int32_t 
GraphTable::load_edges_to_ssd(const std::string &path, add_node_to_ssd(0, idx, src_id, - (char *)dist_data.data(), // NOLINT + reinterpret_cast(dist_data.data()), static_cast(dist_data.size() * sizeof(uint64_t))); } } @@ -545,7 +699,7 @@ int32_t GraphTable::dump_edges_to_ssd(int idx) { add_node_to_ssd(0, idx, v[j]->get_id(), - (char *)s.data(), // NOLINT + (char *)(s.data()), // NOLINT s.size() * sizeof(uint64_t)); } return cost; @@ -557,7 +711,6 @@ int32_t GraphTable::dump_edges_to_ssd(int idx) { int32_t GraphTable::make_complementary_graph(int idx, int64_t byte_size) { VLOG(0) << "make_complementary_graph"; const size_t fixed_size = byte_size / 8; - // std::vector edge_array[task_pool_size_]; std::vector> count(task_pool_size_); std::vector> tasks; auto &shards = edge_shards[idx]; @@ -654,7 +807,7 @@ int CompleteGraphSampler::run_graph_sampling() { node.node_id = v[j]->get_id(); node.neighbor_size = v[j]->get_neighbor_size(); node.neighbor_offset = - (int)sample_neighbors_ex[ind][location].size(); + static_castsample_neighbors_ex[ind][location].size(); sample_nodes_ex[ind][location].emplace_back(node); for (int k = 0; k < node.neighbor_size; k++) sample_neighbors_ex[ind][location].push_back( @@ -770,7 +923,7 @@ int BasicBfsGraphSampler::run_graph_sampling() { for (size_t i = 0; i < graph_table->shards.size(); ++i) { std::vector &v = graph_table->shards[i]->get_bucket(); if (v.size() > 0) { - int search_size = std::min(init_search_size, (int)v.size()); + int search_size = std::min(init_search_size, static_castv.size()); for (int k = 0; k < search_size; k++) { init_size++; __sync_add_and_fetch(&task_size, 1); @@ -818,7 +971,7 @@ int BasicBfsGraphSampler::run_graph_sampling() { node.node_id = iter->first; node.neighbor_size = iter->second.size(); node.neighbor_offset = - (int)sample_neighbors_ex[ind][location].size(); + static_castsample_neighbors_ex[ind][location].size(); sample_nodes_ex[ind][location].emplace_back(node); for (auto k : iter->second) sample_neighbors_ex[ind][location].push_back(k); @@ -902,7 +1055,7 @@ void BasicBfsGraphSampler::init(size_t gpu_num, GraphTable *graph_table, std::vector GraphShard::get_batch(int start, int end, int step) { if (start < 0) start = 0; std::vector res; - for (int pos = start; pos < std::min(end, (int)bucket.size()); // NOLINT + for (int pos = start; pos < std::min(end, static_cast(bucket.size())); pos += step) { res.push_back(bucket[pos]); } @@ -1004,7 +1157,7 @@ GraphNode *GraphShard::add_graph_node(uint64_t id) { node_location[id] = bucket.size(); bucket.push_back(new GraphNode(id)); } - return (GraphNode *)bucket[node_location[id]]; // NOLINT + return reinterpret_cast(bucket[node_location[id]]); } GraphNode *GraphShard::add_graph_node(Node *node) { @@ -1013,17 +1166,17 @@ GraphNode *GraphShard::add_graph_node(Node *node) { node_location[id] = bucket.size(); bucket.push_back(node); } - return (GraphNode *)bucket[node_location[id]]; // NOLINT + return reinterpret_cast(bucket[node_location[id]]); } FeatureNode *GraphShard::add_feature_node(uint64_t id, bool is_overlap) { if (node_location.find(id) == node_location.end()) { node_location[id] = bucket.size(); bucket.push_back(new FeatureNode(id)); - return (FeatureNode *)bucket[node_location[id]]; // NOLINT + return reinterpret_cast(bucket[node_location[id]]); } if (is_overlap) { - return (FeatureNode *)bucket[node_location[id]]; // NOLINT + return reinterpret_cast(bucket[node_location[id]]); } return NULL; @@ -1039,19 +1192,9 @@ Node *GraphShard::find_node(uint64_t id) { } 
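The cast cleanups in the ssd read/write paths above all assume the same value layout: a neighbor list is stored as the raw bytes of consecutive uint64_t ids, so the element count is the byte length divided by sizeof(uint64_t). A minimal round-trip sketch of that encoding, written as standalone code rather than the table implementation itself:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Encode neighbor ids as raw bytes, the layout used for the ssd values above.
std::string encode_neighbors(const std::vector<uint64_t>& ids) {
  return std::string(reinterpret_cast<const char*>(ids.data()),
                     ids.size() * sizeof(uint64_t));
}

// Decode with the same arithmetic the patch uses when reading a stored value:
// element count is str.size() / sizeof(uint64_t).
std::vector<uint64_t> decode_neighbors(const std::string& str) {
  const uint64_t* data = reinterpret_cast<const uint64_t*>(str.c_str());
  int n = static_cast<int>(str.size() / sizeof(uint64_t));
  return std::vector<uint64_t>(data, data + n);
}

int main() {
  std::vector<uint64_t> neighbors = {11, 42, 1024, 65536};
  std::string bytes = encode_neighbors(neighbors);
  for (uint64_t id : decode_neighbors(bytes)) {
    std::cout << id << " ";
  }
  std::cout << "\n";
  return 0;
}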
GraphTable::~GraphTable() { - for (size_t i = 0; i < edge_shards.size(); i++) { - for (auto p : edge_shards[i]) { - delete p; - } - edge_shards[i].clear(); - } - - for (size_t i = 0; i < feature_shards.size(); i++) { - for (auto p : feature_shards[i]) { - delete p; - } - feature_shards[i].clear(); - } +#ifdef PADDLE_WITH_GPU_GRAPH + clear_graph(); +#endif } int32_t GraphTable::Load(const std::string &path, const std::string ¶m) { @@ -1080,16 +1223,136 @@ std::string GraphTable::get_inverse_etype(std::string &etype) { return res; } -int32_t GraphTable::load_node_and_edge_file(std::string etype, - std::string ntype, - std::string epath, - std::string npath, +int32_t GraphTable::parse_type_to_typepath( + std::string &type2files, + std::string graph_data_local_path, + std::vector &res_type, + std::unordered_map &res_type2path) { + auto type2files_split = + paddle::string::split_string(type2files, ","); + if (type2files_split.size() == 0) { + return -1; + } + for (auto one_type2file : type2files_split) { + auto one_type2file_split = + paddle::string::split_string(one_type2file, ":"); + auto type = one_type2file_split[0]; + auto type_dir = one_type2file_split[1]; + res_type.push_back(type); + res_type2path[type] = graph_data_local_path + "/" + type_dir; + } + return 0; +} + +int32_t GraphTable::parse_edge_and_load(std::string etype2files, + std::string graph_data_local_path, + int part_num, + bool reverse) { + std::vector etypes; + std::unordered_map edge_to_edgedir; + int res = parse_type_to_typepath( + etype2files, graph_data_local_path, etypes, edge_to_edgedir); + if (res != 0) { + VLOG(0) << "parse edge type and edgedir failed!"; + return -1; + } + VLOG(0) << "etypes size: " << etypes.size(); + VLOG(0) << "whether reverse: " << reverse; + is_load_reverse_edge = reverse; + std::string delim = ";"; + size_t total_len = etypes.size(); + + std::vector> tasks; + for (size_t i = 0; i < total_len; i++) { + tasks.push_back( + _shards_task_pool[i % task_pool_size_]->enqueue([&, i, this]() -> int { + std::string etype_path = edge_to_edgedir[etypes[i]]; + auto etype_path_list = paddle::framework::localfs_list(etype_path); + std::string etype_path_str; + if (part_num > 0 && + part_num < static_cast(etype_path_list.size())) { + std::vector sub_etype_path_list( + etype_path_list.begin(), etype_path_list.begin() + part_num); + etype_path_str = + paddle::string::join_strings(sub_etype_path_list, delim); + } else { + etype_path_str = + paddle::string::join_strings(etype_path_list, delim); + } + this->load_edges(etype_path_str, false, etypes[i]); + if (reverse) { + std::string r_etype = get_inverse_etype(etypes[i]); + this->load_edges(etype_path_str, true, r_etype); + } + return 0; + })); + } + for (size_t i = 0; i < tasks.size(); i++) tasks[i].get(); + return 0; +} + +int32_t GraphTable::parse_node_and_load(std::string ntype2files, + std::string graph_data_local_path, + int part_num) { + std::vector ntypes; + std::unordered_map node_to_nodedir; + int res = parse_type_to_typepath( + ntype2files, graph_data_local_path, ntypes, node_to_nodedir); + if (res != 0) { + VLOG(0) << "parse node type and nodedir failed!"; + return -1; + } + std::string delim = ";"; + std::string npath = node_to_nodedir[ntypes[0]]; + auto npath_list = paddle::framework::localfs_list(npath); + std::string npath_str; + if (part_num > 0 && part_num < static_cast(npath_list.size())) { + std::vector sub_npath_list(npath_list.begin(), + npath_list.begin() + part_num); + npath_str = paddle::string::join_strings(sub_npath_list, delim); + } else 
{ + npath_str = paddle::string::join_strings(npath_list, delim); + } + + if (ntypes.size() == 0) { + VLOG(0) << "node_type not specified, nothing will be loaded "; + return 0; + } + if (FLAGS_graph_load_in_parallel) { + this->load_nodes(npath_str, ""); + } else { + for (size_t j = 0; j < ntypes.size(); j++) { + this->load_nodes(npath_str, ntypes[j]); + } + } + return 0; +} + +int32_t GraphTable::load_node_and_edge_file(std::string etype2files, + std::string ntype2files, + std::string graph_data_local_path, int part_num, bool reverse) { - auto etypes = paddle::string::split_string(etype, ","); - auto ntypes = paddle::string::split_string(ntype, ","); + std::vector etypes; + std::unordered_map edge_to_edgedir; + int res = parse_type_to_typepath( + etype2files, graph_data_local_path, etypes, edge_to_edgedir); + if (res != 0) { + VLOG(0) << "parse edge type and edgedir failed!"; + return -1; + } + std::vector ntypes; + std::unordered_map node_to_nodedir; + res = parse_type_to_typepath( + ntype2files, graph_data_local_path, ntypes, node_to_nodedir); + if (res != 0) { + VLOG(0) << "parse node type and nodedir failed!"; + return -1; + } + VLOG(0) << "etypes size: " << etypes.size(); VLOG(0) << "whether reverse: " << reverse; + is_load_reverse_edge = reverse; std::string delim = ";"; size_t total_len = etypes.size() + 1; // 1 is for node @@ -1098,11 +1361,11 @@ int32_t GraphTable::load_node_and_edge_file(std::string etype, tasks.push_back( _shards_task_pool[i % task_pool_size_]->enqueue([&, i, this]() -> int { if (i < etypes.size()) { - std::string etype_path = epath + "/" + etypes[i]; + std::string etype_path = edge_to_edgedir[etypes[i]]; auto etype_path_list = paddle::framework::localfs_list(etype_path); std::string etype_path_str; if (part_num > 0 && - part_num < (int)etype_path_list.size()) { // NOLINT + part_num < static_cast(etype_path_list.size())) { std::vector sub_etype_path_list( etype_path_list.begin(), etype_path_list.begin() + part_num); etype_path_str = @@ -1117,9 +1380,11 @@ int32_t GraphTable::load_node_and_edge_file(std::string etype, this->load_edges(etype_path_str, true, r_etype); } } else { + std::string npath = node_to_nodedir[ntypes[0]]; auto npath_list = paddle::framework::localfs_list(npath); std::string npath_str; - if (part_num > 0 && part_num < (int)npath_list.size()) { // NOLINT + if (part_num > 0 && + part_num < static_cast(npath_list.size())) { std::vector sub_npath_list( npath_list.begin(), npath_list.begin() + part_num); npath_str = paddle::string::join_strings(sub_npath_list, delim); @@ -1131,7 +1396,6 @@ int32_t GraphTable::load_node_and_edge_file(std::string etype, VLOG(0) << "node_type not specified, nothing will be loaded "; return 0; } - if (FLAGS_graph_load_in_parallel) { this->load_nodes(npath_str, ""); } else { @@ -1157,7 +1421,8 @@ int32_t GraphTable::get_nodes_ids_by_ranges( res.clear(); auto &shards = type_id == 0 ? 
edge_shards[idx] : feature_shards[idx]; std::vector> tasks; - for (size_t i = 0; i < shards.size() && index < (int)ranges.size(); // NOLINT + for (size_t i = 0; + i < shards.size() && index < static_cast(ranges.size()); i++) { end = total_size + shards[i]->get_size(); start = total_size; @@ -1183,7 +1448,7 @@ int32_t GraphTable::get_nodes_ids_by_ranges( for (auto &id : keys) { res.push_back(id); std::swap(res[rand() % res.size()], - res[(int)res.size() - 1]); // NOLINT + res[static_cast(res.size()) - 1]); } mutex.unlock(); @@ -1458,6 +1723,8 @@ int32_t GraphTable::load_edges(const std::string &path, } VLOG(0) << valid_count << "/" << count << " edge_type[" << edge_type << "] edges are loaded successfully"; + std::string edge_size = edge_type + ":" + std::to_string(valid_count); + edge_type_size.push_back(edge_size); #ifdef PADDLE_WITH_HETERPS if (search_level == 2) { @@ -1558,10 +1825,8 @@ int32_t GraphTable::random_sample_nodes(int type_id, int remain = sample_size, last_pos = -1, num; std::set separator_set; for (int i = 0; i < range_num - 1; i++) { - unsigned int seed = time(0); - while (separator_set.find(num = rand_r(&seed) % (sample_size - 1)) != + while (separator_set.find(num = rand() % (sample_size - 1)) != // NOLINT separator_set.end()) { - continue; } separator_set.insert(num); } @@ -1573,10 +1838,8 @@ int32_t GraphTable::random_sample_nodes(int type_id, remain = total_size - sample_size + range_num; separator_set.clear(); for (int i = 0; i < range_num; i++) { - unsigned int seed = time(0); - while (separator_set.find(num = rand_r(&seed) % remain) != + while (separator_set.find(num = rand() % remain) != // NOLINT separator_set.end()) { - continue; } separator_set.insert(num); } @@ -1589,13 +1852,12 @@ int32_t GraphTable::random_sample_nodes(int type_id, used += ranges_len[index++]; } std::vector> first_half, second_half; - unsigned int seed = time(0); - int start_index = rand_r(&seed) % total_size; + int start_index = rand() % total_size; // NOLINT for (size_t i = 0; i < ranges_len.size() && i < ranges_pos.size(); i++) { if (ranges_pos[i] + ranges_len[i] - 1 + start_index < total_size) { first_half.push_back({ranges_pos[i] + start_index, ranges_pos[i] + ranges_len[i] + start_index}); - } else if (ranges_pos[i] + start_index >= total_size) { + } else if ((ranges_pos[i] + start_index) >= total_size) { second_half.push_back( {ranges_pos[i] + start_index - total_size, ranges_pos[i] + ranges_len[i] + start_index - total_size}); @@ -1821,9 +2083,17 @@ int GraphTable::parse_feature(int idx, // "") thread_local std::vector fields; fields.clear(); - const char c = feature_separator_.at(0); + char c = slot_feature_separator_.at(0); paddle::string::split_string_ptr(feat_str, len, c, &fields); + thread_local std::vector fea_fields; + fea_fields.clear(); + c = feature_separator_.at(0); + paddle::string::split_string_ptr(fields[1].ptr, + fields[1].len, + c, + &fea_fields, + FLAGS_gpugraph_slot_feasign_max_num); std::string name = fields[0].to_string(); auto it = feat_id_map[idx].find(name); if (it != feat_id_map[idx].end()) { @@ -1834,26 +2104,27 @@ int GraphTable::parse_feature(int idx, // string_vector_2_string(fields.begin() + 1, fields.end(), ' ', // fea_ptr); FeatureNode::parse_value_to_bytes( - fields.begin() + 1, fields.end(), fea_ptr); + fea_fields.begin(), fea_fields.end(), fea_ptr); return 0; } else if (dtype == "string") { - string_vector_2_string(fields.begin() + 1, fields.end(), ' ', fea_ptr); + string_vector_2_string( + fea_fields.begin(), fea_fields.end(), ' ', fea_ptr); return 
0; } else if (dtype == "float32") { FeatureNode::parse_value_to_bytes( - fields.begin() + 1, fields.end(), fea_ptr); + fea_fields.begin(), fea_fields.end(), fea_ptr); return 0; } else if (dtype == "float64") { FeatureNode::parse_value_to_bytes( - fields.begin() + 1, fields.end(), fea_ptr); + fea_fields.begin(), fea_fields.end(), fea_ptr); return 0; } else if (dtype == "int32") { FeatureNode::parse_value_to_bytes( - fields.begin() + 1, fields.end(), fea_ptr); + fea_fields.begin(), fea_fields.end(), fea_ptr); return 0; } else if (dtype == "int64") { FeatureNode::parse_value_to_bytes( - fields.begin() + 1, fields.end(), fea_ptr); + fea_fields.begin(), fea_fields.end(), fea_ptr); return 0; } } else { @@ -2023,6 +2294,16 @@ int GraphTable::get_all_feature_ids( return 0; } +int GraphTable::get_node_embedding_ids( + int slice_num, std::vector> *output) { + if (is_load_reverse_edge && !FLAGS_graph_get_neighbor_id) { + return get_all_id(0, slice_num, output); + } else { + get_all_id(0, slice_num, output); + return get_all_neighbor_id(0, slice_num, output); + } +} + int32_t GraphTable::pull_graph_list(int type_id, int idx, int start, @@ -2080,6 +2361,10 @@ void GraphTable::set_feature_separator(const std::string &ch) { feature_separator_ = ch; } +void GraphTable::set_slot_feature_separator(const std::string &ch) { + slot_feature_separator_ = ch; +} + int32_t GraphTable::get_server_index_by_id(uint64_t id) { return id % shard_num / shard_num_per_server; } @@ -2149,8 +2434,7 @@ int32_t GraphTable::Initialize(const GraphParameter &graph) { if (use_cache) { cache_size_limit = graph.cache_size_limit(); cache_ttl = graph.cache_ttl(); - make_neighbor_sample_cache((size_t)cache_size_limit, // NOLINT - (size_t)cache_ttl); // NOLINT + make_neighbor_sample_cache(cache_size_limit, cache_ttl); } _shards_task_pool.resize(task_pool_size_); for (size_t i = 0; i < _shards_task_pool.size(); ++i) { @@ -2172,7 +2456,7 @@ int32_t GraphTable::Initialize(const GraphParameter &graph) { feat_name.resize(node_types.size()); feat_shape.resize(node_types.size()); feat_dtype.resize(node_types.size()); - VLOG(0) << "got " << node_types.size() << "node types in total"; + VLOG(0) << "got " << node_types.size() << " node types in total"; for (int k = 0; k < node_types.size(); k++) { feature_to_id[node_types[k]] = k; auto node_type = node_types[k]; @@ -2232,5 +2516,50 @@ int32_t GraphTable::Initialize(const GraphParameter &graph) { return 0; } +void GraphTable::init_worker_poll(int gpu_num) { + _cpu_worker_pool.resize(gpu_num); + for (int i = 0; i < gpu_num; i++) { + _cpu_worker_pool[i].reset(new ::ThreadPool(16)); + } +} + +void GraphTable::build_graph_total_keys() { + VLOG(0) << "begin insert edge to graph_total_keys"; + // build node embedding id + std::vector> keys; + this->get_node_embedding_ids(1, &keys); + graph_total_keys_.insert( + graph_total_keys_.end(), keys[0].begin(), keys[0].end()); + + VLOG(0) << "finish insert edge to graph_total_keys"; +} + +void GraphTable::build_graph_type_keys() { + VLOG(0) << "begin build_graph_type_keys"; + graph_type_keys_.clear(); + graph_type_keys_.resize(this->feature_to_id.size()); + + int cnt = 0; + for (auto &it : this->feature_to_id) { + auto node_idx = it.second; + std::vector> keys; + this->get_all_id(1, node_idx, 1, &keys); + type_to_index_[node_idx] = cnt; + graph_type_keys_[cnt++] = std::move(keys[0]); + } + VLOG(0) << "finish build_graph_type_keys"; + + VLOG(0) << "begin insert feature into graph_total_keys"; + // build feature embedding id + for (auto &it : this->feature_to_id) 
{ + auto node_idx = it.second; + std::vector> keys; + this->get_all_feature_ids(1, node_idx, 1, &keys); + graph_total_keys_.insert( + graph_total_keys_.end(), keys[0].begin(), keys[0].end()); + } + VLOG(0) << "finish insert feature into graph_total_keys"; +} + } // namespace distributed }; // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/common_graph_table.h b/paddle/fluid/distributed/ps/table/common_graph_table.h index f78011cedfa7d..b988ffa5fc3b5 100644 --- a/paddle/fluid/distributed/ps/table/common_graph_table.h +++ b/paddle/fluid/distributed/ps/table/common_graph_table.h @@ -60,7 +60,7 @@ class GraphShard { std::vector get_batch(int start, int end, int step); void get_ids_by_range(int start, int end, std::vector *res) { res->reserve(res->size() + end - start); - for (int i = start; i < end && i < (int)bucket.size(); i++) { + for (int i = start; i < end && i < static_cast(bucket.size()); i++) { res->emplace_back(bucket[i]->get_id()); } } @@ -93,7 +93,7 @@ class GraphShard { size_t get_all_feature_ids(std::vector> *total_res, int slice_num) { std::vector keys; - for (int i = 0; i < (int)bucket.size(); i++) { + for (size_t i = 0; i < bucket.size(); i++) { bucket[i]->get_feature_ids(&keys); } return dedup2shard_keys(&keys, total_res, slice_num); @@ -130,7 +130,29 @@ class GraphShard { return node_location; } - private: + void shrink_to_fit() { + bucket.shrink_to_fit(); + for (size_t i = 0; i < bucket.size(); i++) { + bucket[i]->shrink_to_fit(); + } + } + + void merge_shard(GraphShard *&shard) { // NOLINT + bucket.reserve(bucket.size() + shard->bucket.size()); + for (size_t i = 0; i < shard->bucket.size(); i++) { + auto node_id = shard->bucket[i]->get_id(); + if (node_location.find(node_id) == node_location.end()) { + node_location[node_id] = bucket.size(); + bucket.push_back(shard->bucket[i]); + } + } + shard->node_location.clear(); + shard->bucket.clear(); + delete shard; + shard = NULL; + } + + public: std::unordered_map node_location; std::vector bucket; }; @@ -161,7 +183,7 @@ class SampleResult { public: size_t actual_size; std::shared_ptr buffer; - SampleResult(size_t _actual_size, std::shared_ptr &_buffer) + SampleResult(size_t _actual_size, std::shared_ptr &_buffer) // NOLINT : actual_size(_actual_size), buffer(_buffer) {} SampleResult(size_t _actual_size, char *_buffer) : actual_size(_actual_size), @@ -187,7 +209,7 @@ class ScaledLRU; template class RandomSampleLRU { public: - RandomSampleLRU(ScaledLRU *_father) { + explicit RandomSampleLRU(ScaledLRU *_father) { father = _father; remove_count = 0; node_size = 0; @@ -204,7 +226,9 @@ class RandomSampleLRU { node_head = p; } } - LRUResponse query(K *keys, size_t length, std::vector> &res) { + LRUResponse query(K *keys, + size_t length, + std::vector> &res) { // NOLINT if (pthread_rwlock_tryrdlock(&father->rwlock) != 0) return LRUResponse::blocked; // pthread_rwlock_rdlock(&father->rwlock); @@ -271,7 +295,6 @@ class RandomSampleLRU { remove(node_head); remove_count--; } - // std::cerr<<"after remove_count = "< *node) { @@ -356,25 +379,25 @@ class ScaledLRU { LRUResponse query(size_t index, K *keys, size_t length, - std::vector> &res) { + std::vector> &res) { // NOLINT return lru_pool[index].query(keys, length, res); } LRUResponse insert(size_t index, K *keys, V *data, size_t length) { return lru_pool[index].insert(keys, data, length); } int Shrink() { - int node_size = 0; + size_t node_size = 0; for (size_t i = 0; i < lru_pool.size(); i++) { node_size += lru_pool[i].node_size - lru_pool[i].remove_count; } - if 
((size_t)node_size <= size_t(1.1 * size_limit) + 1) return 0; + if (node_size <= static_cast(1.1 * size_limit) + 1) return 0; if (pthread_rwlock_wrlock(&rwlock) == 0) { global_count = 0; for (size_t i = 0; i < lru_pool.size(); i++) { global_count += lru_pool[i].node_size - lru_pool[i].remove_count; } - if ((size_t)global_count > size_limit) { + if (static_cast(global_count) > size_limit) { size_t remove = global_count - size_limit; for (size_t i = 0; i < lru_pool.size(); i++) { lru_pool[i].total_diff = 0; @@ -392,7 +415,7 @@ class ScaledLRU { void handle_size_diff(int diff) { if (diff != 0) { __sync_fetch_and_add(&global_count, diff); - if (global_count > int(1.25 * size_limit)) { + if (global_count > static_cast(1.25 * size_limit)) { thread_pool->enqueue([this]() -> int { return Shrink(); }); } } @@ -507,8 +530,8 @@ class GraphTable : public Table { int idx, int start, int size, - std::unique_ptr &buffer, - int &actual_size, + std::unique_ptr &buffer, // NOLINT + int &actual_size, // NOLINT bool need_feature, int step); @@ -516,40 +539,48 @@ class GraphTable : public Table { int idx, uint64_t *node_ids, int sample_size, - std::vector> &buffers, - std::vector &actual_sizes, + std::vector> &buffers, // NOLINT + std::vector &actual_sizes, // NOLINT bool need_weight); int32_t random_sample_nodes(int type_id, int idx, int sample_size, - std::unique_ptr &buffers, - int &actual_sizes); + std::unique_ptr &buffers, // NOLINT + int &actual_sizes); // NOLINT virtual int32_t get_nodes_ids_by_ranges( int type_id, int idx, std::vector> ranges, - std::vector &res); + std::vector &res); // NOLINT virtual int32_t Initialize() { return 0; } virtual int32_t Initialize(const TableParameter &config, const FsClientParameter &fs_config); virtual int32_t Initialize(const GraphParameter &config); + void init_worker_poll(int gpu_num); int32_t Load(const std::string &path, const std::string ¶m); - - int32_t load_node_and_edge_file(std::string etype, - std::string ntype, - std::string epath, - std::string npath, + int32_t load_node_and_edge_file(std::string etype2files, + std::string ntype2files, + std::string graph_data_local_path, int part_num, bool reverse); - - std::string get_inverse_etype(std::string &etype); - + int32_t parse_edge_and_load(std::string etype2files, + std::string graph_data_local_path, + int part_num, + bool reverse); + int32_t parse_node_and_load(std::string ntype2files, + std::string graph_data_local_path, + int part_num); + std::string get_inverse_etype(std::string &etype); // NOLINT + int32_t parse_type_to_typepath( + std::string &type2files, // NOLINT + std::string graph_data_local_path, + std::vector &res_type, // NOLINT + std::unordered_map &res_type2path); // NOLINT int32_t load_edges(const std::string &path, bool reverse, const std::string &edge_type); - int get_all_id(int type, int slice_num, std::vector> *output); @@ -568,6 +599,8 @@ class GraphTable : public Table { int idx, int slice_num, std::vector> *output); + int get_node_embedding_ids(int slice_num, + std::vector> *output); int32_t load_nodes(const std::string &path, std::string node_type = std::string()); std::pair parse_edge_file(const std::string &path, @@ -578,23 +611,23 @@ class GraphTable : public Table { int idx); std::pair parse_node_file(const std::string &path); int32_t add_graph_node(int idx, - std::vector &id_list, - std::vector &is_weight_list); + std::vector &id_list, // NOLINT + std::vector &is_weight_list); // NOLINT - int32_t remove_graph_node(int idx, std::vector &id_list); + int32_t remove_graph_node(int 
idx, std::vector &id_list); // NOLINT int32_t get_server_index_by_id(uint64_t id); Node *find_node(int type_id, int idx, uint64_t id); Node *find_node(int type_id, uint64_t id); - virtual int32_t Pull(TableContext &context) { return 0; } - virtual int32_t Push(TableContext &context) { return 0; } + virtual int32_t Pull(TableContext &context) { return 0; } // NOLINT + virtual int32_t Push(TableContext &context) { return 0; } // NOLINT virtual int32_t clear_nodes(int type, int idx); virtual void Clear() {} virtual int32_t Flush() { return 0; } virtual int32_t Shrink(const std::string ¶m) { return 0; } - //指定保存路径 + // 指定保存路径 virtual int32_t Save(const std::string &path, const std::string &converter) { return 0; } @@ -617,19 +650,28 @@ class GraphTable : public Table { size_t len, FeatureNode *node); - virtual int32_t get_node_feat(int idx, - const std::vector &node_ids, - const std::vector &feature_names, - std::vector> &res); - - virtual int32_t set_node_feat( + virtual int32_t get_node_feat( int idx, const std::vector &node_ids, const std::vector &feature_names, - const std::vector> &res); + std::vector> &res); // NOLINT + + virtual int32_t set_node_feat( + int idx, + const std::vector &node_ids, // NOLINT + const std::vector &feature_names, // NOLINT + const std::vector> &res); // NOLINT size_t get_server_num() { return server_num; } + void clear_graph(); void clear_graph(int idx); + void clear_edge_shard(); + void clear_feature_shard(); + void feature_shrink_to_fit(); + void merge_feature_shard(); + void release_graph(); + void release_graph_edge(); + void release_graph_node(); virtual int32_t make_neighbor_sample_cache(size_t size_limit, size_t ttl) { { std::unique_lock lock(mutex_); @@ -662,20 +704,24 @@ class GraphTable : public Table { uint64_t id, int sample_size, const std::shared_ptr rng, - int &actual_size); + int &actual_size); // NOLINT virtual int32_t add_node_to_ssd( int type_id, int idx, uint64_t src_id, char *data, int len); virtual paddle::framework::GpuPsCommGraph make_gpu_ps_graph( - int idx, std::vector ids); + int idx, const std::vector &ids); virtual paddle::framework::GpuPsCommGraphFea make_gpu_ps_graph_fea( - std::vector &node_ids, int slot_num); + int gpu_id, std::vector &node_ids, int slot_num); // NOLINT int32_t Load_to_ssd(const std::string &path, const std::string ¶m); - int64_t load_graph_to_memory_from_ssd(int idx, std::vector &ids); + int64_t load_graph_to_memory_from_ssd(int idx, + std::vector &ids); // NOLINT int32_t make_complementary_graph(int idx, int64_t byte_size); int32_t dump_edges_to_ssd(int idx); int32_t get_partition_num(int idx) { return partitions[idx].size(); } - std::vector get_partition(int idx, int index) { - if (idx >= (int)partitions.size() || index >= (int)partitions[idx].size()) + std::vector slot_feature_num_map() const { + return slot_feature_num_map_; + } + std::vector get_partition(size_t idx, size_t index) { + if (idx >= partitions.size() || index >= partitions[idx].size()) return std::vector(); return partitions[idx][index]; } @@ -691,10 +737,19 @@ class GraphTable : public Table { #endif virtual int32_t add_comm_edge(int idx, uint64_t src_id, uint64_t dst_id); virtual int32_t build_sampler(int idx, std::string sample_type = "random"); + void set_slot_feature_separator(const std::string &ch); void set_feature_separator(const std::string &ch); + + void build_graph_total_keys(); + void build_graph_type_keys(); + + std::vector graph_total_keys_; + std::vector> graph_type_keys_; + std::unordered_map type_to_index_; + std::vector> 
edge_shards, feature_shards; size_t shard_start, shard_end, server_num, shard_num_per_server, shard_num; - int task_pool_size_ = 24; + int task_pool_size_ = 64; int load_thread_num = 160; const int random_sample_nodes_ranges = 3; @@ -708,8 +763,10 @@ class GraphTable : public Table { std::vector id_to_feature, id_to_edge; std::string table_name; std::string table_type; + std::vector edge_type_size; std::vector> _shards_task_pool; + std::vector> _cpu_worker_pool; std::vector> _shards_task_rng_pool; std::shared_ptr<::ThreadPool> load_node_edge_task_pool; std::shared_ptr> scaled_lru; @@ -720,6 +777,7 @@ class GraphTable : public Table { int cache_ttl; mutable std::mutex mutex_; bool build_sampler_on_cpu; + bool is_load_reverse_edge = false; std::shared_ptr rw_lock; #ifdef PADDLE_WITH_HETERPS // paddle::framework::GpuPsGraphTable gpu_graph_table; @@ -728,7 +786,9 @@ class GraphTable : public Table { // std::shared_ptr graph_sampler; // REGISTER_GRAPH_FRIEND_CLASS(2, CompleteGraphSampler, BasicBfsGraphSampler) #endif + std::string slot_feature_separator_ = std::string(" "); std::string feature_separator_ = std::string(" "); + std::vector slot_feature_num_map_; }; /* diff --git a/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.cc b/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.cc index 4feee70fed751..4824ab8946b9d 100644 --- a/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.cc +++ b/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.cc @@ -43,6 +43,12 @@ int CtrDymfAccessor::Initialize() { if (_config.ctr_accessor_param().show_scale()) { _show_scale = true; } + for (int i = 0; i < _config.ctr_accessor_param().load_filter_slots_size(); + i++) { + _filtered_slots.insert(_config.ctr_accessor_param().load_filter_slots(i)); + VLOG(0) << "CtrDymfAccessor::Initialize() load filter slot:" + << _config.ctr_accessor_param().load_filter_slots(i); + } VLOG(0) << " INTO CtrDymfAccessor::Initialize(); embed_sgd_dim:" << common_feature_value.embed_sgd_dim << " embedx_dim:" << common_feature_value.embedx_dim @@ -104,6 +110,15 @@ bool CtrDymfAccessor::SaveSSD(float* value) { return false; } +bool CtrDymfAccessor::FilterSlot(float* value) { + // 热启时过滤掉_filtered_slots中的feasign + if (_filtered_slots.find(common_feature_value.Slot(value)) != + _filtered_slots.end()) { + return true; + } + return false; +} + bool CtrDymfAccessor::Save(float* value, int param) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); @@ -177,7 +192,8 @@ void CtrDymfAccessor::UpdateStatAfterSave(float* value, int param) { int32_t CtrDymfAccessor::Create(float** values, size_t num) { for (size_t value_item = 0; value_item < num; ++value_item) { float* value = values[value_item]; - value[common_feature_value.UnseenDaysIndex()] = 0; + common_feature_value.UnseenDays(value) = 0; + common_feature_value.PassId(value) = 0; value[common_feature_value.DeltaScoreIndex()] = 0; value[common_feature_value.ShowIndex()] = 0; value[common_feature_value.ClickIndex()] = 0; @@ -292,7 +308,8 @@ std::string CtrDymfAccessor::ParseToString(const float* v, int param) { thread_local std::ostringstream os; os.clear(); os.str(""); - os << v[0] << " " << v[1] << " " << v[2] << " " << v[3] << " " << v[4]; + os << common_feature_value.UnseenDays(const_cast(v)) << " " << v[1] + << " " << v[2] << " " << v[3] << " " << v[4]; // << v[5] << " " << v[6]; for (int i = common_feature_value.EmbedG2SumIndex(); i < common_feature_value.EmbedxG2SumIndex(); @@ -302,7 +319,8 @@ 
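The Create()/ParseFromString() changes in this file depend on the PassId()/UnseenDays() accessors declared in ctr_dymf_accessor.h further below, which pack two uint16_t fields into the single float slot at UnseenDaysIndex() (index 0 carries pass_id, index 1 unseen_days). A standalone sketch of that packing, with the value layout reduced to one float for illustration only:

#include <cstdint>
#include <iostream>

static_assert(sizeof(float) == 2 * sizeof(uint16_t),
              "two uint16_t fields share one float slot");

// Same idea as CtrDymfFeatureValue V2: reinterpret the float slot as a pair
// of uint16_t values, [0] = pass_id, [1] = unseen_days.
uint16_t& PassId(float* val) { return reinterpret_cast<uint16_t*>(val)[0]; }
uint16_t& UnseenDays(float* val) { return reinterpret_cast<uint16_t*>(val)[1]; }

int main() {
  float slot = 0.0f;      // stand-in for value[UnseenDaysIndex()]
  PassId(&slot) = 7;      // written by UpdatePassId()
  UnseenDays(&slot) = 3;  // written on create / load
  std::cout << "pass_id=" << PassId(&slot)
            << " unseen_days=" << UnseenDays(&slot) << "\n";
  return 0;
}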
std::string CtrDymfAccessor::ParseToString(const float* v, int param) { auto show = common_feature_value.Show(const_cast(v)); auto click = common_feature_value.Click(const_cast(v)); auto score = ShowClickScore(show, click); - auto mf_dim = int(common_feature_value.MfDim(const_cast(v))); + auto mf_dim = + static_cast(common_feature_value.MfDim(const_cast(v))); if (score >= _config.embedx_threshold() && param > common_feature_value.EmbedxG2SumIndex()) { for (auto i = common_feature_value.EmbedxG2SumIndex(); @@ -316,9 +334,24 @@ std::string CtrDymfAccessor::ParseToString(const float* v, int param) { int CtrDymfAccessor::ParseFromString(const std::string& str, float* value) { auto ret = paddle::string::str_to_float(str.data(), value); + float unseen_day = value[common_feature_value.UnseenDaysIndex()]; + common_feature_value.UnseenDays(value) = (uint16_t)(unseen_day); + common_feature_value.PassId(value) = 0; CHECK(ret >= 7) << "expect more than 7 real:" << ret; return ret; } +bool CtrDymfAccessor::SaveMemCache(float* value, + int param, + double global_cache_threshold, + uint16_t pass_id) { + return common_feature_value.Show(value) > global_cache_threshold || + common_feature_value.PassId(value) >= pass_id; +} + +void CtrDymfAccessor::UpdatePassId(float* value, uint16_t pass_id) { + common_feature_value.PassId(value) = pass_id; +} + } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.h b/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.h index b820d617d06ae..be813cdf9624f 100644 --- a/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.h +++ b/paddle/fluid/distributed/ps/table/ctr_dymf_accessor.h @@ -21,6 +21,7 @@ #include "paddle/fluid/distributed/common/registerer.h" #include "paddle/fluid/distributed/ps/table/accessor.h" #include "paddle/fluid/distributed/ps/table/sparse_sgd_rule.h" +#include "paddle/fluid/distributed/ps/thirdparty/round_robin.h" #include "paddle/fluid/distributed/the_one_ps.pb.h" namespace paddle { @@ -30,7 +31,7 @@ namespace distributed { class CtrDymfAccessor : public ValueAccessor { public: struct CtrDymfFeatureValue { - /* + /*v1: old version float unseen_days; float delta_score; float show; @@ -44,6 +45,20 @@ class CtrDymfAccessor : public ValueAccessor { // float embedx_g2sum; std::vector embedx_w; */ + /* V2: support pass_id + uint16_t pass_id; + uint16_t unseen_days; + float show; + float click; + float embed_w; + // float embed_g2sum; + std::vector embed_g2sum; + float slot; + float mf_dim + std::float embedx_g2sum; + // float embedx_g2sum; + std::vector embedx_w; + */ int Dim() { return 7 + embed_sgd_dim + embedx_sgd_dim + embedx_dim; } int DimSize(size_t dim, int embedx_dim) { return sizeof(float); } @@ -60,7 +75,7 @@ class CtrDymfAccessor : public ValueAccessor { int EmbedxWIndex() { return EmbedxG2SumIndex() + embedx_sgd_dim; } // 根据mf_dim计算的总长度 - int Dim(int& mf_dim) { + int Dim(int mf_dim) { int tmp_embedx_sgd_dim = 1; if (optimizer_name == "SparseAdamSGDRule") { // adam tmp_embedx_sgd_dim = mf_dim * 2 + 2; @@ -71,9 +86,19 @@ class CtrDymfAccessor : public ValueAccessor { } // 根据mf_dim计算的总byte数 - int Size(int& mf_dim) { return (Dim(mf_dim)) * sizeof(float); } + int Size(int mf_dim) { return (Dim(mf_dim)) * sizeof(float); } - float& UnseenDays(float* val) { return val[UnseenDaysIndex()]; } + uint16_t& PassId(float* val) { + uint16_t* int16_val = + reinterpret_cast(val + UnseenDaysIndex()); + return int16_val[0]; + } + + uint16_t& UnseenDays(float* val) { + uint16_t* int16_val = + 
reinterpret_cast(val + UnseenDaysIndex()); + return int16_val[1]; + } float& DeltaScore(float* val) { return val[DeltaScoreIndex()]; } float& Show(float* val) { return val[ShowIndex()]; } float& Click(float* val) { return val[ClickIndex()]; } @@ -184,6 +209,7 @@ class CtrDymfAccessor : public ValueAccessor { int param, double global_cache_threshold) override; bool SaveSSD(float* value) override; + bool FilterSlot(float* value); // update delta_score and unseen_days after save void UpdateStatAfterSave(float* value, int param) override; // keys不存在时,为values生成随机值 @@ -217,6 +243,14 @@ class CtrDymfAccessor : public ValueAccessor { return 0.0; } + // 根据pass_id和show_threashold阈值来判断cache到ssd + bool SaveMemCache(float* value, + int param, + double global_cache_threshold, + uint16_t pass_id); + // 更新pass_id + void UpdatePassId(float* value, uint16_t pass_id); + private: // float ShowClickScore(float show, float click); @@ -233,6 +267,7 @@ class CtrDymfAccessor : public ValueAccessor { float ShowClickScore(float show, float click); SparseValueSGDRule* _embed_sgd_rule; SparseValueSGDRule* _embedx_sgd_rule; + robin_hood::unordered_set _filtered_slots; }; } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/depends/rocksdb_warpper.h b/paddle/fluid/distributed/ps/table/depends/rocksdb_warpper.h index eb3ff2e254f56..873644f8ca416 100644 --- a/paddle/fluid/distributed/ps/table/depends/rocksdb_warpper.h +++ b/paddle/fluid/distributed/ps/table/depends/rocksdb_warpper.h @@ -16,8 +16,12 @@ #include #include #include +#include #include +#include +#include #include +#include #include #include @@ -27,6 +31,55 @@ namespace paddle { namespace distributed { +class Uint64Comparator : public rocksdb::Comparator { + int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const { + uint64_t A = *(reinterpret_cast(a.data())); + uint64_t B = *(reinterpret_cast(b.data())); + if (A < B) { + return -1; + } + if (A > B) { + return 1; + } + return 0; + } + const char* Name() const { return "Uint64Comparator"; } + void FindShortestSeparator(std::string*, const rocksdb::Slice&) const {} + void FindShortSuccessor(std::string*) const {} +}; + +class RocksDBItem { + public: + RocksDBItem() {} + ~RocksDBItem() {} + void reset() { + batch_keys.clear(); + batch_index.clear(); + batch_values.clear(); + status.clear(); + } + std::vector batch_keys; + std::vector batch_index; + std::vector batch_values; + std::vector status; +}; + +class RocksDBCtx { + public: + RocksDBCtx() { + items[0].reset(); + items[1].reset(); + cur_index = 0; + } + ~RocksDBCtx() {} + RocksDBItem* switch_item() { + cur_index = (cur_index + 1) % 2; + return &items[cur_index]; + } + RocksDBItem items[2]; + int cur_index; +}; + class RocksDBHandler { public: RocksDBHandler() {} @@ -38,55 +91,69 @@ class RocksDBHandler { } int initialize(const std::string& db_path, const int colnum) { - VLOG(3) << "db path: " << db_path << " colnum: " << colnum; - rocksdb::Options options; - rocksdb::BlockBasedTableOptions bbto; - bbto.block_size = 4 * 1024; - bbto.block_cache = rocksdb::NewLRUCache(64 * 1024 * 1024); - bbto.block_cache_compressed = rocksdb::NewLRUCache(64 * 1024 * 1024); - bbto.cache_index_and_filter_blocks = false; - bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(20, false)); - bbto.whole_key_filtering = true; - options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto)); - - options.keep_log_file_num = 100; - options.max_log_file_size = 50 * 1024 * 1024; // 50MB - options.create_if_missing = true; - 
options.use_direct_reads = true; - options.max_background_flushes = 5; - options.max_background_compactions = 5; - options.base_background_compactions = 10; - options.write_buffer_size = 256 * 1024 * 1024; // 256MB - options.max_write_buffer_number = 8; - options.max_bytes_for_level_base = - options.max_write_buffer_number * options.write_buffer_size; - options.min_write_buffer_number_to_merge = 1; - options.target_file_size_base = 1024 * 1024 * 1024; // 1024MB - options.memtable_prefix_bloom_size_ratio = 0.02; - options.num_levels = 4; - options.max_open_files = -1; - - options.compression = rocksdb::kNoCompression; - options.level0_file_num_compaction_trigger = 8; - options.level0_slowdown_writes_trigger = - 1.8 * options.level0_file_num_compaction_trigger; - options.level0_stop_writes_trigger = - 3.6 * options.level0_file_num_compaction_trigger; - - if (!db_path.empty()) { - std::string rm_cmd = "rm -rf " + db_path; - system(rm_cmd.c_str()); - } - - rocksdb::Status s = rocksdb::DB::Open(options, db_path, &_db); - assert(s.ok()); - _handles.resize(colnum); + VLOG(0) << "db path: " << db_path << " colnum: " << colnum; + _dbs.resize(colnum); for (int i = 0; i < colnum; i++) { - s = _db->CreateColumnFamily( - options, "shard_" + std::to_string(i), &_handles[i]); + rocksdb::Options options; + options.comparator = &_comparator; + rocksdb::BlockBasedTableOptions bbto; + // options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory(65536)); + // options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(2)); + bbto.format_version = 5; + bbto.use_delta_encoding = false; + bbto.block_size = 4 * 1024; + bbto.block_restart_interval = 6; + bbto.block_cache = rocksdb::NewLRUCache(64 * 1024 * 1024); + // bbto.block_cache_compressed = rocksdb::NewLRUCache(64 * 1024 * 1024); + bbto.cache_index_and_filter_blocks = false; + bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(15, false)); + bbto.whole_key_filtering = true; + options.statistics = rocksdb::CreateDBStatistics(); + options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto)); + + // options.IncreaseParallelism(); + options.OptimizeLevelStyleCompaction(); + options.keep_log_file_num = 100; + // options.db_log_dir = "./log/rocksdb"; + options.max_log_file_size = 50 * 1024 * 1024; // 50MB + // options.threads = 8; + options.create_if_missing = true; + options.use_direct_reads = true; + options.max_background_flushes = 37; + options.max_background_compactions = 64; + options.base_background_compactions = 10; + options.write_buffer_size = 256 * 1024 * 1024; // 256MB + options.max_write_buffer_number = 8; + options.max_bytes_for_level_base = + options.max_write_buffer_number * options.write_buffer_size; + options.min_write_buffer_number_to_merge = 1; + options.target_file_size_base = 1024 * 1024 * 1024; // 1024MB + // options.verify_checksums_in_compaction = false; + // options.disable_auto_compactions = true; + options.memtable_prefix_bloom_size_ratio = 0.02; + options.num_levels = 4; + options.max_open_files = -1; + + options.compression = rocksdb::kNoCompression; + // options.compaction_options_fifo = rocksdb::CompactionOptionsFIFO(); + // options.compaction_style = + // rocksdb::CompactionStyle::kCompactionStyleFIFO; + options.level0_file_num_compaction_trigger = 5; + options.level0_slowdown_writes_trigger = + 1.8 * options.level0_file_num_compaction_trigger; + options.level0_stop_writes_trigger = + 3.6 * options.level0_file_num_compaction_trigger; + + std::string shard_path = db_path + "_" + std::to_string(i); + if 
(!shard_path.empty()) { + std::string rm_cmd = "rm -rf " + shard_path; + system(rm_cmd.c_str()); + } + + rocksdb::Status s = rocksdb::DB::Open(options, shard_path, &_dbs[i]); assert(s.ok()); } - LOG(INFO) << "DB initialize success, colnum:" << colnum; + VLOG(0) << "DB initialize success, colnum:" << colnum; return 0; } @@ -94,36 +161,32 @@ class RocksDBHandler { int id, const char* key, int key_len, const char* value, int value_len) { rocksdb::WriteOptions options; options.disableWAL = true; - rocksdb::Status s = _db->Put(options, - _handles[id], - rocksdb::Slice(key, key_len), - rocksdb::Slice(value, value_len)); + rocksdb::Status s = _dbs[id]->Put(options, + rocksdb::Slice(key, key_len), + rocksdb::Slice(value, value_len)); assert(s.ok()); return 0; } int put_batch(int id, - std::vector>& ssd_keys, - std::vector>& ssd_values, + std::vector>& ssd_keys, // NOLINT + std::vector>& ssd_values, // NOLINT int n) { rocksdb::WriteOptions options; options.disableWAL = true; rocksdb::WriteBatch batch(n * 128); for (int i = 0; i < n; i++) { - batch.Put(_handles[id], - rocksdb::Slice(ssd_keys[i].first, ssd_keys[i].second), + batch.Put(rocksdb::Slice(ssd_keys[i].first, ssd_keys[i].second), rocksdb::Slice(ssd_values[i].first, ssd_values[i].second)); } - rocksdb::Status s = _db->Write(options, &batch); + rocksdb::Status s = _dbs[id]->Write(options, &batch); assert(s.ok()); return 0; } - int get(int id, const char* key, int key_len, std::string& value) { - rocksdb::Status s = _db->Get(rocksdb::ReadOptions(), - _handles[id], - rocksdb::Slice(key, key_len), - &value); + int get(int id, const char* key, int key_len, std::string& value) { // NOLINT + rocksdb::Status s = _dbs[id]->Get( + rocksdb::ReadOptions(), rocksdb::Slice(key, key_len), &value); if (s.IsNotFound()) { return 1; } @@ -131,33 +194,63 @@ class RocksDBHandler { return 0; } + void multi_get(int id, + const size_t num_keys, + const rocksdb::Slice* keys, + rocksdb::PinnableSlice* values, + rocksdb::Status* status, + const bool sorted_input = true) { + rocksdb::ColumnFamilyHandle* handle = _dbs[id]->DefaultColumnFamily(); + auto read_opt = rocksdb::ReadOptions(); + read_opt.fill_cache = false; + _dbs[id]->MultiGet( + read_opt, handle, num_keys, keys, values, status, sorted_input); + } + int del_data(int id, const char* key, int key_len) { rocksdb::WriteOptions options; options.disableWAL = true; - rocksdb::Status s = - _db->Delete(options, _handles[id], rocksdb::Slice(key, key_len)); + rocksdb::Status s = _dbs[id]->Delete(options, rocksdb::Slice(key, key_len)); assert(s.ok()); return 0; } int flush(int id) { - rocksdb::Status s = _db->Flush(rocksdb::FlushOptions(), _handles[id]); + rocksdb::Status s = _dbs[id]->Flush(rocksdb::FlushOptions()); assert(s.ok()); return 0; } rocksdb::Iterator* get_iterator(int id) { - return _db->NewIterator(rocksdb::ReadOptions(), _handles[id]); + return _dbs[id]->NewIterator(rocksdb::ReadOptions()); } - int get_estimate_key_num(uint64_t& num_keys) { - _db->GetAggregatedIntProperty("rocksdb.estimate-num-keys", &num_keys); + int get_estimate_key_num(uint64_t& num_keys) { // NOLINT + num_keys = 0; + for (size_t i = 0; i < _dbs.size(); i++) { + uint64_t cur_keys = 0; + _dbs[i]->GetAggregatedIntProperty("rocksdb.estimate-num-keys", &cur_keys); + num_keys += cur_keys; + } + return 0; + } + + Uint64Comparator* get_comparator() { return &_comparator; } + + int ingest_externel_file(int id, + const std::vector& sst_filelist) { + rocksdb::IngestExternalFileOptions ifo; + ifo.move_files = true; + rocksdb::Status s = 
_dbs[id]->IngestExternalFile(sst_filelist, ifo); + assert(s.ok()); return 0; } private: std::vector _handles; - rocksdb::DB* _db; + // rocksdb::DB* _db; + std::vector _dbs; + Uint64Comparator _comparator; }; } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/graph/graph_node.h b/paddle/fluid/distributed/ps/table/graph/graph_node.h index 9c384a9744b8a..fc26b20da9390 100644 --- a/paddle/fluid/distributed/ps/table/graph/graph_node.h +++ b/paddle/fluid/distributed/ps/table/graph/graph_node.h @@ -31,7 +31,7 @@ namespace distributed { class Node { public: Node() {} - Node(uint64_t id) : id(id) {} + explicit Node(uint64_t id) : id(id) {} virtual ~Node() {} static int id_size, int_size, weight_size; uint64_t get_id() { return id; } @@ -56,8 +56,14 @@ class Node { virtual int get_feature_ids(int slot_idx, std::vector *res) const { return 0; } + virtual int get_feature_ids(int slot_idx, + std::vector &feature_id, // NOLINT + std::vector &slot_id) const { // NOLINT + return 0; + } virtual void set_feature(int idx, const std::string &str) {} virtual void set_feature_size(int size) {} + virtual void shrink_to_fit() {} virtual int get_feature_size() { return 0; } virtual size_t get_neighbor_size() { return 0; } @@ -69,7 +75,8 @@ class Node { class GraphNode : public Node { public: GraphNode() : Node(), sampler(nullptr), edges(nullptr) {} - GraphNode(uint64_t id) : Node(id), sampler(nullptr), edges(nullptr) {} + explicit GraphNode(uint64_t id) + : Node(id), sampler(nullptr), edges(nullptr) {} virtual ~GraphNode(); virtual void build_edges(bool is_weighted); virtual void build_sampler(std::string sample_type); @@ -92,13 +99,13 @@ class GraphNode : public Node { class FeatureNode : public Node { public: FeatureNode() : Node() {} - FeatureNode(uint64_t id) : Node(id) {} + explicit FeatureNode(uint64_t id) : Node(id) {} virtual ~FeatureNode() {} virtual int get_size(bool need_feature); virtual void to_buffer(char *buffer, bool need_feature); virtual void recover_from_buffer(char *buffer); virtual std::string get_feature(int idx) { - if (idx < (int)this->feature.size()) { + if (idx < static_cast(this->feature.size())) { return this->feature[idx]; } else { return std::string(""); @@ -135,7 +142,7 @@ class FeatureNode : public Node { "get_feature_ids res should not be null")); res->clear(); errno = 0; - if (slot_idx < (int)this->feature.size()) { + if (slot_idx < static_cast(this->feature.size())) { const std::string &s = this->feature[slot_idx]; const uint64_t *feas = (const uint64_t *)(s.c_str()); @@ -155,21 +162,51 @@ class FeatureNode : public Node { return 0; } + virtual int get_feature_ids(int slot_idx, + std::vector &feature_id, // NOLINT + std::vector &slot_id) const { // NOLINT + errno = 0; + size_t num = 0; + if (slot_idx < static_cast(this->feature.size())) { + const std::string &s = this->feature[slot_idx]; + const uint64_t *feas = (const uint64_t *)(s.c_str()); + num = s.length() / sizeof(uint64_t); + CHECK((s.length() % sizeof(uint64_t)) == 0) + << "bad feature_item: [" << s << "]"; + for (size_t i = 0; i < num; ++i) { + feature_id.push_back(feas[i]); + slot_id.push_back(slot_idx); + } + } + PADDLE_ENFORCE_EQ( + errno, + 0, + paddle::platform::errors::InvalidArgument( + "get_feature_ids get errno should be 0, but got %d.", errno)); + return num; + } + virtual std::string *mutable_feature(int idx) { - if (idx >= (int)this->feature.size()) { + if (idx >= static_cast(this->feature.size())) { this->feature.resize(idx + 1); } return &(this->feature[idx]); } 
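The new get_feature_ids overload above appends into two flat parallel arrays (feature ids and their slot ids) and returns the per-slot count, which is what make_gpu_ps_graph_fea uses to fill feature_array/slot_id_array and to update slot_feature_num_map_. The sketch below shows only that accumulation pattern; the uint8_t slot-id element type is an assumption (the template arguments are elided in this hunk), and the slot payloads are mocked as already-decoded id vectors.

#include <cstdint>
#include <cstdio>
#include <vector>

// Append the ids of one slot into flat parallel arrays and return how many
// ids that slot contributed, mirroring the new get_feature_ids overload.
int append_slot(int slot_idx,
                const std::vector<uint64_t>& slot_ids,  // mocked payload
                std::vector<uint64_t>* feature_id,
                std::vector<uint8_t>* slot_id) {
  for (uint64_t id : slot_ids) {
    feature_id->push_back(id);
    slot_id->push_back(static_cast<uint8_t>(slot_idx));
  }
  return static_cast<int>(slot_ids.size());
}

int main() {
  std::vector<std::vector<uint64_t>> slots = {{100, 101}, {}, {200, 201, 202}};
  std::vector<uint64_t> feature_id;
  std::vector<uint8_t> slot_id;
  std::vector<int> slot_feature_num_map(slots.size(), 0);  // per-slot max count
  for (int k = 0; k < static_cast<int>(slots.size()); ++k) {
    int n = append_slot(k, slots[k], &feature_id, &slot_id);
    if (slot_feature_num_map[k] < n) slot_feature_num_map[k] = n;
  }
  for (size_t i = 0; i < feature_id.size(); ++i) {
    std::printf("slot %d -> feasign %llu\n", static_cast<int>(slot_id[i]),
                static_cast<unsigned long long>(feature_id[i]));
  }
  return 0;
}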
virtual void set_feature(int idx, const std::string &str) { - if (idx >= (int)this->feature.size()) { + if (idx >= static_cast(this->feature.size())) { this->feature.resize(idx + 1); } this->feature[idx] = str; } virtual void set_feature_size(int size) { this->feature.resize(size); } virtual int get_feature_size() { return this->feature.size(); } + virtual void shrink_to_fit() { + feature.shrink_to_fit(); + for (auto &slot : feature) { + slot.shrink_to_fit(); + } + } template static std::string parse_value_to_bytes(std::vector feat_str) { @@ -179,7 +216,8 @@ class FeatureNode : public Node { for (size_t i = 0; i < feat_str.size(); i++) { std::stringstream ss(feat_str[i]); ss >> v; - std::memcpy(buffer + sizeof(T) * i, (char *)&v, sizeof(T)); + std::memcpy( + buffer + sizeof(T) * i, reinterpret_cast(&v), sizeof(T)); } return std::string(buffer, Tsize); } @@ -196,7 +234,8 @@ class FeatureNode : public Node { for (size_t i = 0; i < feat_str_size; i++) { std::stringstream ss(*(feat_str_begin + i)); ss >> v; - std::memcpy(buffer + sizeof(T) * i, (char *)&v, sizeof(T)); + std::memcpy( + buffer + sizeof(T) * i, reinterpret_cast(&v), sizeof(T)); } output->assign(buffer); } @@ -208,7 +247,7 @@ class FeatureNode : public Node { size_t start = 0; const char *buffer = feat_str.data(); while (start < feat_str.size()) { - std::memcpy((char *)&v, buffer + start, sizeof(T)); + std::memcpy(reinterpret_cast(&v), buffer + start, sizeof(T)); start += sizeof(T); out.push_back(v); } @@ -225,7 +264,7 @@ class FeatureNode : public Node { size_t num = output->length(); output->resize(num + Tsize); - T *fea_ptrs = (T *)(&(*output)[num]); + T *fea_ptrs = reinterpret_cast(&(*output)[num]); thread_local paddle::string::str_ptr_stream ss; for (size_t i = 0; i < feat_str_size; i++) { diff --git a/paddle/fluid/distributed/ps/table/memory_sparse_table.cc b/paddle/fluid/distributed/ps/table/memory_sparse_table.cc index f668ec951589b..975067d300ecb 100644 --- a/paddle/fluid/distributed/ps/table/memory_sparse_table.cc +++ b/paddle/fluid/distributed/ps/table/memory_sparse_table.cc @@ -137,7 +137,12 @@ int32_t MemorySparseTable::Load(const std::string &path, size_t feature_value_size = _value_accesor->GetAccessorInfo().size / sizeof(float); +#ifdef PADDLE_WITH_HETERPS + int thread_num = _real_local_shard_num; +#else int thread_num = _real_local_shard_num < 15 ? _real_local_shard_num : 15; +#endif + omp_set_num_threads(thread_num); #pragma omp parallel for schedule(dynamic) for (int i = 0; i < _real_local_shard_num; ++i) { @@ -167,12 +172,6 @@ int32_t MemorySparseTable::Load(const std::string &path, value.resize(feature_value_size); int parse_size = _value_accesor->ParseFromString(++end, value.data()); value.resize(parse_size); - - // for debug - for (int ii = 0; ii < parse_size; ++ii) { - VLOG(2) << "MemorySparseTable::load key: " << key << " value " << ii - << ": " << value.data()[ii] << " local_shard: " << i; - } } read_channel->close(); if (err_no == -1) { @@ -340,7 +339,7 @@ int32_t MemorySparseTable::Save(const std::string &dirname, size_t file_start_idx = _avg_local_shard_num * _shard_idx; -#ifdef PADDLE_WITH_GPU_GRAPH +#ifdef PADDLE_WITH_HETERPS int thread_num = _real_local_shard_num; #else int thread_num = _real_local_shard_num < 20 ? 
_real_local_shard_num : 20; @@ -723,7 +722,8 @@ int32_t MemorySparseTable::Pull(TableContext &context) { if (context.use_ptr) { char **pull_values = context.pull_context.ptr_values; const uint64_t *keys = context.pull_context.keys; - return PullSparsePtr(pull_values, keys, context.num); + return PullSparsePtr( + context.shard_id, pull_values, keys, context.num, context.pass_id); } else { float *pull_values = context.pull_context.values; const PullSparseValue &pull_value = context.pull_context.pull_value; @@ -820,9 +820,11 @@ int32_t MemorySparseTable::PullSparse(float *pull_values, return 0; } -int32_t MemorySparseTable::PullSparsePtr(char **pull_values, +int32_t MemorySparseTable::PullSparsePtr(int shard_id, // fake num + char **pull_values, const uint64_t *keys, - size_t num) { + size_t num, + uint16_t pass_id) { CostTimer timer("pscore_sparse_select_all"); size_t value_size = _value_accesor->GetAccessorInfo().size / sizeof(float); size_t mf_value_size = diff --git a/paddle/fluid/distributed/ps/table/memory_sparse_table.h b/paddle/fluid/distributed/ps/table/memory_sparse_table.h index 17018d5e5dfc3..658446d770c71 100644 --- a/paddle/fluid/distributed/ps/table/memory_sparse_table.h +++ b/paddle/fluid/distributed/ps/table/memory_sparse_table.h @@ -90,7 +90,11 @@ class MemorySparseTable : public Table { std::pair PrintTableStat() override; int32_t PullSparse(float* values, const PullSparseValue& pull_value); - int32_t PullSparsePtr(char** pull_values, const uint64_t* keys, size_t num); + int32_t PullSparsePtr(int shard_id, + char** pull_values, + const uint64_t* keys, + size_t num, + uint16_t pass_id); int32_t PushSparse(const uint64_t* keys, const float* values, size_t num); diff --git a/paddle/fluid/distributed/ps/table/ssd_sparse_table.cc b/paddle/fluid/distributed/ps/table/ssd_sparse_table.cc index 3e0f631ed41bc..aa626a5e49fb5 100644 --- a/paddle/fluid/distributed/ps/table/ssd_sparse_table.cc +++ b/paddle/fluid/distributed/ps/table/ssd_sparse_table.cc @@ -24,8 +24,10 @@ DECLARE_bool(pserver_print_missed_key_num_every_push); DECLARE_bool(pserver_create_value_when_push); DECLARE_bool(pserver_enable_create_feasign_randomly); DEFINE_bool(pserver_open_strict_check, false, "pserver_open_strict_check"); -DEFINE_string(rocksdb_path, "database", "path of sparse table rocksdb file"); DEFINE_int32(pserver_load_batch_size, 5000, "load batch size for ssd"); +PADDLE_DEFINE_EXPORTED_string(rocksdb_path, + "database", + "path of sparse table rocksdb file"); namespace paddle { namespace distributed { @@ -34,6 +36,9 @@ int32_t SSDSparseTable::Initialize() { MemorySparseTable::Initialize(); _db = paddle::distributed::RocksDBHandler::GetInstance(); _db->initialize(FLAGS_rocksdb_path, _real_local_shard_num); + VLOG(0) << "initalize SSDSparseTable succ"; + VLOG(0) << "SSD FLAGS_pserver_print_missed_key_num_every_push:" + << FLAGS_pserver_print_missed_key_num_every_push; return 0; } @@ -44,7 +49,8 @@ int32_t SSDSparseTable::Pull(TableContext& context) { if (context.use_ptr) { char** pull_values = context.pull_context.ptr_values; const uint64_t* keys = context.pull_context.keys; - return PullSparsePtr(pull_values, keys, context.num); + return PullSparsePtr( + context.shard_id, pull_values, keys, context.num, context.pass_id); } else { float* pull_values = context.pull_context.values; const PullSparseValue& pull_value = context.pull_context.pull_value; @@ -171,90 +177,142 @@ int32_t SSDSparseTable::PullSparse(float* pull_values, return 0; } -int32_t SSDSparseTable::PullSparsePtr(char** pull_values, - const 
uint64_t* keys, - size_t num) { +int32_t SSDSparseTable::PullSparsePtr(int shard_id, + char** pull_values, + const uint64_t* pull_keys, + size_t num, + uint16_t pass_id) { CostTimer timer("pserver_ssd_sparse_select_all"); size_t value_size = _value_accesor->GetAccessorInfo().size / sizeof(float); size_t mf_value_size = _value_accesor->GetAccessorInfo().mf_size / sizeof(float); { // 从table取值 or create - std::vector> tasks(_real_local_shard_num); - std::vector>> task_keys( - _real_local_shard_num); + RocksDBCtx context; + std::vector> tasks; + RocksDBItem* cur_ctx = context.switch_item(); + cur_ctx->reset(); + FixedFeatureValue* ret = NULL; + auto& local_shard = _local_shards[shard_id]; + float data_buffer[value_size]; // NOLINT + float* data_buffer_ptr = data_buffer; + for (size_t i = 0; i < num; ++i) { - int shard_id = (keys[i] % _sparse_table_shard_num) % _avg_local_shard_num; - task_keys[shard_id].push_back({keys[i], i}); + uint64_t key = pull_keys[i]; + auto itr = local_shard.find(key); + if (itr == local_shard.end()) { + cur_ctx->batch_index.push_back(i); + cur_ctx->batch_keys.push_back(rocksdb::Slice( + (char*)&(pull_keys[i]), sizeof(uint64_t))); // NOLINT + if (cur_ctx->batch_keys.size() == 1024) { + cur_ctx->batch_values.resize(cur_ctx->batch_keys.size()); + cur_ctx->status.resize(cur_ctx->batch_keys.size()); + auto fut = + _shards_task_pool[shard_id % _shards_task_pool.size()]->enqueue( + [this, shard_id, cur_ctx]() -> int { + _db->multi_get(shard_id, + cur_ctx->batch_keys.size(), + cur_ctx->batch_keys.data(), + cur_ctx->batch_values.data(), + cur_ctx->status.data()); + return 0; + }); + cur_ctx = context.switch_item(); + for (size_t x = 0; x < tasks.size(); ++x) { + tasks[x].wait(); + for (size_t idx = 0; idx < cur_ctx->status.size(); idx++) { + uint64_t cur_key = *(reinterpret_cast( + const_cast(cur_ctx->batch_keys[idx].data()))); + if (cur_ctx->status[idx].IsNotFound()) { + auto& feature_value = local_shard[cur_key]; + int init_size = value_size - mf_value_size; + feature_value.resize(init_size); + _value_accesor->Create(&data_buffer_ptr, 1); + memcpy(const_cast(feature_value.data()), + data_buffer_ptr, + init_size * sizeof(float)); + ret = &feature_value; + } else { + int data_size = + cur_ctx->batch_values[idx].size() / sizeof(float); + // from rocksdb to mem + auto& feature_value = local_shard[cur_key]; + feature_value.resize(data_size); + memcpy(const_cast(feature_value.data()), + paddle::string::str_to_float( + cur_ctx->batch_values[idx].data()), + data_size * sizeof(float)); + _db->del_data(shard_id, + reinterpret_cast(&cur_key), + sizeof(uint64_t)); + ret = &feature_value; + } + _value_accesor->UpdatePassId(ret->data(), pass_id); + int pull_data_idx = cur_ctx->batch_index[idx]; + pull_values[pull_data_idx] = reinterpret_cast(ret); + } + } + cur_ctx->reset(); + tasks.clear(); + tasks.push_back(std::move(fut)); + } + } else { + ret = itr.value_ptr(); + // int pull_data_idx = keys[i].second; + _value_accesor->UpdatePassId(ret->data(), pass_id); + pull_values[i] = reinterpret_cast(ret); + } } - - std::atomic missed_keys{0}; - for (int shard_id = 0; shard_id < _real_local_shard_num; ++shard_id) { - tasks[shard_id] = + if (cur_ctx->batch_keys.size() != 0) { + cur_ctx->batch_values.resize(cur_ctx->batch_keys.size()); + cur_ctx->status.resize(cur_ctx->batch_keys.size()); + auto fut = _shards_task_pool[shard_id % _shards_task_pool.size()]->enqueue( - [this, - shard_id, - &task_keys, - value_size, - mf_value_size, - pull_values, - &missed_keys]() -> int { - auto& keys = 
task_keys[shard_id]; - auto& local_shard = _local_shards[shard_id]; - float data_buffer[value_size]; // NOLINT - float* data_buffer_ptr = data_buffer; - for (size_t i = 0; i < keys.size(); ++i) { - uint64_t key = keys[i].first; - auto itr = local_shard.find(key); - size_t data_size = value_size - mf_value_size; - FixedFeatureValue* ret = NULL; - if (itr == local_shard.end()) { - // pull rocksdb - std::string tmp_string(""); - if (_db->get(shard_id, - reinterpret_cast(&key), - sizeof(uint64_t), - tmp_string) > 0) { - ++missed_keys; - auto& feature_value = local_shard[key]; - feature_value.resize(data_size); - float* data_ptr = - const_cast(feature_value.data()); - _value_accesor->Create(&data_buffer_ptr, 1); - memcpy( - data_ptr, data_buffer_ptr, data_size * sizeof(float)); - ret = &feature_value; - } else { - data_size = tmp_string.size() / sizeof(float); - memcpy(data_buffer_ptr, - paddle::string::str_to_float(tmp_string), - data_size * sizeof(float)); - // from rocksdb to mem - auto& feature_value = local_shard[key]; - feature_value.resize(data_size); - memcpy(const_cast(feature_value.data()), - data_buffer_ptr, - data_size * sizeof(float)); - _db->del_data(shard_id, - reinterpret_cast(&key), - sizeof(uint64_t)); - ret = &feature_value; - } - } else { - ret = itr.value_ptr(); - } - int pull_data_idx = keys[i].second; - pull_values[pull_data_idx] = reinterpret_cast(ret); - } + [this, shard_id, cur_ctx]() -> int { + _db->multi_get(shard_id, + cur_ctx->batch_keys.size(), + cur_ctx->batch_keys.data(), + cur_ctx->batch_values.data(), + cur_ctx->status.data()); return 0; }); + tasks.push_back(std::move(fut)); } - for (int i = 0; i < _real_local_shard_num; ++i) { - tasks[i].wait(); + for (size_t x = 0; x < tasks.size(); ++x) { + tasks[x].wait(); } - if (FLAGS_pserver_print_missed_key_num_every_push) { - LOG(WARNING) << "total pull keys:" << num - << " missed_keys:" << missed_keys.load(); + for (size_t x = 0; x < 2; x++) { + cur_ctx = context.switch_item(); + for (size_t idx = 0; idx < cur_ctx->status.size(); idx++) { + uint64_t cur_key = *(reinterpret_cast( + const_cast(cur_ctx->batch_keys[idx].data()))); + if (cur_ctx->status[idx].IsNotFound()) { + auto& feature_value = local_shard[cur_key]; + int init_size = value_size - mf_value_size; + feature_value.resize(init_size); + _value_accesor->Create(&data_buffer_ptr, 1); + memcpy(const_cast(feature_value.data()), + data_buffer_ptr, + init_size * sizeof(float)); + ret = &feature_value; + } else { + int data_size = cur_ctx->batch_values[idx].size() / sizeof(float); + // from rocksdb to mem + auto& feature_value = local_shard[cur_key]; + feature_value.resize(data_size); + memcpy( + const_cast(feature_value.data()), + paddle::string::str_to_float(cur_ctx->batch_values[idx].data()), + data_size * sizeof(float)); + _db->del_data( + shard_id, reinterpret_cast(&cur_key), sizeof(uint64_t)); + ret = &feature_value; + } + _value_accesor->UpdatePassId(ret->data(), pass_id); + int pull_data_idx = cur_ctx->batch_index[idx]; + pull_values[pull_data_idx] = reinterpret_cast(ret); + } + cur_ctx->reset(); } } return 0; @@ -499,9 +557,9 @@ int32_t SSDSparseTable::UpdateTable() { for (auto it = shard.begin(); it != shard.end();) { if (_value_accesor->SaveSSD(it.value().data())) { _db->put(i, - (char*)&it.key(), + reinterpret_cast(&it.key()), sizeof(uint64_t), - (char*)it.value().data(), + reinterpret_cast(it.value().data()), it.value().size() * sizeof(float)); count++; it = shard.erase(it); @@ -526,166 +584,778 @@ int64_t SSDSparseTable::LocalSize() { int32_t 
SSDSparseTable::Save(const std::string& path, const std::string& param) { + std::lock_guard guard(_table_mutex); +#ifdef PADDLE_WITH_HETERPS + int save_param = atoi(param.c_str()); + int32_t ret = 0; + if (save_param > 3) { + ret = SaveWithStringMultiOutput(path, param); // batch_model:4 xbox:5 + } else { + ret = SaveWithBinary(path, param); // batch_model:0 xbox:1 + } + return ret; +#else + // CPUPS PSCORE + return SaveWithString(path, param); // batch_model:0 xbox:1 +#endif +} + +// save shard_num 个文件 +int32_t SSDSparseTable::SaveWithString(const std::string& path, + const std::string& param) { + std::lock_guard guard(_table_mutex); if (_real_local_shard_num == 0) { _local_show_threshold = -1; return 0; } int save_param = atoi(param.c_str()); // batch_model:0 xbox:1 +#ifdef PADDLE_WITH_HETERPS + save_param -= 4; +#endif // if (save_param == 5) { // return save_patch(path, save_param); // } // LOG(INFO) << "table cache rate is: " << _config.sparse_table_cache_rate(); - LOG(INFO) << "table cache rate is: " << _config.sparse_table_cache_rate(); - LOG(INFO) << "enable_sparse_table_cache: " - << _config.enable_sparse_table_cache(); - LOG(INFO) << "LocalSize: " << LocalSize(); + VLOG(0) << "table cache rate is: " << _config.sparse_table_cache_rate(); + VLOG(0) << "enable_sparse_table_cache: " + << _config.enable_sparse_table_cache(); + VLOG(0) << "LocalSize: " << LocalSize(); if (_config.enable_sparse_table_cache()) { - LOG(INFO) << "Enable sparse table cache, top n:" << _cache_tk_size; + VLOG(0) << "Enable sparse table cache, top n:" << _cache_tk_size; } _cache_tk_size = LocalSize() * _config.sparse_table_cache_rate(); TopkCalculator tk(_real_local_shard_num, _cache_tk_size); + VLOG(0) << "TopkCalculator top n:" << _cache_tk_size; size_t file_start_idx = _avg_local_shard_num * _shard_idx; std::string table_path = TableDir(path); _afs_client.remove(paddle::string::format_string( "%s/part-%03d-*", table_path.c_str(), _shard_idx)); +#ifdef PADDLE_WITH_GPU_GRAPH + int thread_num = _real_local_shard_num; +#else int thread_num = _real_local_shard_num < 20 ? 
_real_local_shard_num : 20; +#endif // std::atomic feasign_size; std::atomic feasign_size_all{0}; // feasign_size = 0; - omp_set_num_threads(thread_num); -#pragma omp parallel for schedule(dynamic) - for (int i = 0; i < _real_local_shard_num; ++i) { + std::vector< + paddle::framework::Channel>>> + fs_channel; + for (int i = 0; i < _real_local_shard_num; i++) { + fs_channel.push_back( + paddle::framework::MakeChannel>>( + 10240)); + } + std::vector threads; + threads.resize(_real_local_shard_num); + + auto save_func = [this, + &save_param, + &table_path, + &file_start_idx, + &fs_channel](int file_num) { + int err_no = 0; FsChannelConfig channel_config; if (_config.compress_in_save() && (save_param == 0 || save_param == 3)) { channel_config.path = paddle::string::format_string("%s/part-%03d-%05d.gz", table_path.c_str(), _shard_idx, - file_start_idx + i); + file_start_idx + file_num); } else { - channel_config.path = paddle::string::format_string("%s/part-%03d-%05d", - table_path.c_str(), - _shard_idx, - file_start_idx + i); + channel_config.path = + paddle::string::format_string("%s/part-%03d-%05d", + table_path.c_str(), + _shard_idx, + file_start_idx + file_num); } channel_config.converter = _value_accesor->Converter(save_param).converter; channel_config.deconverter = _value_accesor->Converter(save_param).deconverter; - int err_no = 0; - int retry_num = 0; - bool is_write_failed = false; + auto write_channel = + _afs_client.open_w(channel_config, 1024 * 1024 * 40, &err_no); + paddle::framework::ChannelReader>> + reader(fs_channel[file_num].get()); + std::pair> out_str; + while (reader >> out_str) { + std::string format_value = _value_accesor->ParseToString( + out_str.second.data(), out_str.second.size()); + if (0 != write_channel->write_line(paddle::string::format_string( + "%lu %s", out_str.first, format_value.c_str()))) { + LOG(FATAL) << "SSDSparseTable save failed, retry it! path:" + << channel_config.path; + } + } + write_channel->close(); + }; + for (size_t i = 0; i < threads.size(); i++) { + threads[i] = std::thread(save_func, i); + } + + std::vector< + paddle::framework::ChannelWriter>>> + writers(_real_local_shard_num); + omp_set_num_threads(thread_num); +#pragma omp parallel for schedule(dynamic) + for (int i = 0; i < _real_local_shard_num; ++i) { int feasign_size = 0; auto& shard = _local_shards[i]; - do { - err_no = 0; - feasign_size = 0; - is_write_failed = false; - auto write_channel = - _afs_client.open_w(channel_config, 1024 * 1024 * 40, &err_no); + auto& writer = writers[i]; + writer.Reset(fs_channel[i].get()); + { for (auto it = shard.begin(); it != shard.end(); ++it) { if (_config.enable_sparse_table_cache() && - (save_param == 1 || save_param == 2) && - _value_accesor->Save(it.value().data(), 4)) { - // tk.push(i, it.value().data()[2]); + (save_param == 1 || save_param == 2)) { + // get_field get right decayed show tk.push(i, _value_accesor->GetField(it.value().data(), "show")); } if (_value_accesor->Save(it.value().data(), save_param)) { - std::string format_value = _value_accesor->ParseToString( - it.value().data(), it.value().size()); - if (0 != write_channel->write_line(paddle::string::format_string( - "%lu %s", it.key(), format_value.c_str()))) { - ++retry_num; - is_write_failed = true; - LOG(ERROR) << "SSDSparseTable save failed, retry it! 
path:" - << channel_config.path << ", retry_num=" << retry_num; - break; - } + std::vector feature_value; + feature_value.resize(it.value().size()); + memcpy(const_cast(feature_value.data()), + it.value().data(), + it.value().size() * sizeof(float)); + writer << std::make_pair(it.key(), std::move(feature_value)); ++feasign_size; } } + } - if (err_no == -1 && !is_write_failed) { - ++retry_num; - is_write_failed = true; - LOG(ERROR) << "SSDSparseTable save failed after write, retry it! " - << "path:" << channel_config.path - << " , retry_num=" << retry_num; + if (save_param != 1) { + auto* it = _db->get_iterator(i); + for (it->SeekToFirst(); it->Valid(); it->Next()) { + bool need_save = _value_accesor->Save( + paddle::string::str_to_float(it->value().data()), save_param); + _value_accesor->UpdateStatAfterSave( + paddle::string::str_to_float(it->value().data()), save_param); + if (need_save) { + std::vector feature_value; + feature_value.resize(it->value().size() / sizeof(float)); + memcpy(const_cast(feature_value.data()), + paddle::string::str_to_float(it->value().data()), + it->value().size()); + writer << std::make_pair(*(reinterpret_cast( + const_cast(it->key().data()))), + std::move(feature_value)); + ++feasign_size; + } } - if (is_write_failed) { - _afs_client.remove(channel_config.path); - continue; + delete it; + } + + writer.Flush(); + fs_channel[i]->Close(); + feasign_size_all += feasign_size; + for (auto it = shard.begin(); it != shard.end(); ++it) { + _value_accesor->UpdateStatAfterSave(it.value().data(), save_param); + } + } + for (size_t i = 0; i < threads.size(); i++) { + threads[i].join(); + } + for (size_t i = 0; i < fs_channel.size(); i++) { + fs_channel[i].reset(); + } + fs_channel.clear(); + + if (save_param == 3) { + // UpdateTable(); + _cache_tk_size = LocalSize() * _config.sparse_table_cache_rate(); + VLOG(0) << "SSDSparseTable update success."; + } + VLOG(0) << "SSDSparseTable save success, feasign size:" << feasign_size_all + << ", path:" + << paddle::string::format_string("%s/%03d/part-%03d-", + path.c_str(), + _config.table_id(), + _shard_idx) + << " from " << file_start_idx << " to " + << file_start_idx + _real_local_shard_num - 1; + _local_show_threshold = tk.top(); + VLOG(0) << "local cache threshold: " << _local_show_threshold; + return 0; +} + +// save shard_num * n 个文件, n由模型大小决定 +int32_t SSDSparseTable::SaveWithStringMultiOutput(const std::string& path, + const std::string& param) { + if (_real_local_shard_num == 0) { + _local_show_threshold = -1; + return 0; + } + int save_param = atoi(param.c_str()); +#ifdef PADDLE_WITH_HETERPS + save_param -= 4; +#endif + VLOG(0) << "table cache rate is: " << _config.sparse_table_cache_rate(); + VLOG(0) << "enable_sparse_table_cache: " + << _config.enable_sparse_table_cache(); + VLOG(0) << "LocalSize: " << LocalSize(); + if (_config.enable_sparse_table_cache()) { + VLOG(0) << "Enable sparse table cache, top n:" << _cache_tk_size; + } + _cache_tk_size = LocalSize() * _config.sparse_table_cache_rate(); + TopkCalculator tk(_real_local_shard_num, _cache_tk_size); + VLOG(0) << "TopkCalculator top n:" << _cache_tk_size; + size_t file_start_idx = _avg_local_shard_num * _shard_idx; + std::string table_path = TableDir(path); + _afs_client.remove(paddle::string::format_string( + "%s/part-%03d-*", table_path.c_str(), _shard_idx)); +#ifdef PADDLE_WITH_GPU_GRAPH + int thread_num = _real_local_shard_num; +#else + int thread_num = _real_local_shard_num < 20 ? 
_real_local_shard_num : 20; +#endif + + std::atomic feasign_size_all{0}; + std::vector>> + busy_channel; + std::vector>> + free_channel; + std::vector threads; + + for (int i = 0; i < _real_local_shard_num; i++) { + busy_channel.push_back( + paddle::framework::MakeChannel>()); + free_channel.push_back( + paddle::framework::MakeChannel>()); + } + threads.resize(_real_local_shard_num); + + auto save_func = [this, + &save_param, + &table_path, + &file_start_idx, + &free_channel, + &busy_channel](int file_num) { + int err_no = 0; + int shard_num = file_num; + int part_num = 0; + shard_num = file_num; + part_num = 0; + FsChannelConfig channel_config; + channel_config.converter = _value_accesor->Converter(save_param).converter; + channel_config.deconverter = + _value_accesor->Converter(save_param).deconverter; + + auto get_filename = [](int compress, + int save_param, + const char* table_path, + int node_num, + int shard_num, + int part_num, + int split_num) { + if (compress && (save_param == 0 || save_param == 3)) { + // return + // paddle::string::format_string("%s/part-%03d-%05d-%03d-%03d.gz", + // table_path, node_num, shard_num, part_num, split_num); + return paddle::string::format_string( + "%s/part-%05d-%03d.gz", table_path, shard_num, split_num); + } else { + // return paddle::string::format_string("%s/part-%03d-%05d-%03d-%03d", + // table_path, node_num, shard_num, part_num, split_num); + return paddle::string::format_string( + "%s/part-%05d-%03d", table_path, shard_num, split_num); } + }; + std::shared_ptr region = nullptr; + // std::shared_ptr afs_writer = nullptr; + // std::shared_ptr xbox_converter = nullptr; + std::string filename; + int last_file_idx = -1; + std::shared_ptr write_channel = nullptr; - // delta and cache and revert is all in mem, base in rocksdb - if (save_param != 1) { - auto* it = _db->get_iterator(i); - for (it->SeekToFirst(); it->Valid(); it->Next()) { - bool need_save = _value_accesor->Save( - paddle::string::str_to_float(it->value().data()), save_param); - _value_accesor->UpdateStatAfterSave( - paddle::string::str_to_float(it->value().data()), save_param); - if (need_save) { - std::string format_value = _value_accesor->ParseToString( - paddle::string::str_to_float(it->value().data()), - it->value().size() / sizeof(float)); - if (0 != write_channel->write_line(paddle::string::format_string( - "%lu %s", - *((uint64_t*)const_cast(it->key().data())), - format_value.c_str()))) { - ++retry_num; - is_write_failed = true; - LOG(ERROR) << "SSDSparseTable save failed, retry it! 
path:" - << channel_config.path << ", retry_num=" << retry_num; - break; + while (busy_channel[shard_num]->Get(region)) { + if (region->_file_idx != last_file_idx) { + filename = get_filename(_config.compress_in_save(), + save_param, + table_path.c_str(), + _shard_idx, + file_start_idx + shard_num, + part_num, + region->_file_idx); + channel_config.path = filename; + write_channel = + _afs_client.open_w(channel_config, 1024 * 1024 * 40, &err_no); + // afs_writer = _api_wrapper.open_writer(filename); + last_file_idx = region->_file_idx; + // xbox_converter = std::make_shared(afs_writer); + } + char* cursor = region->_buf; + int remain = region->_cur; + while (remain) { + uint32_t len = *reinterpret_cast(cursor); + len -= sizeof(uint32_t); + remain -= sizeof(uint32_t); + cursor += sizeof(uint32_t); + + uint64_t k = *reinterpret_cast(cursor); + cursor += sizeof(uint64_t); + len -= sizeof(uint64_t); + remain -= sizeof(uint64_t); + + float* value = reinterpret_cast(cursor); + int dim = len / sizeof(float); + + std::string format_value = _value_accesor->ParseToString(value, dim); + if (0 != write_channel->write_line(paddle::string::format_string( + "%lu %s", k, format_value.c_str()))) { + VLOG(0) << "SSDSparseTable save failed, retry it! path:" + << channel_config.path; + } + remain -= len; + cursor += len; + } + region->reset(); + free_channel[shard_num]->Put(region); + } + }; + for (size_t i = 0; i < threads.size(); i++) { + threads[i] = std::thread(save_func, i); + } + + omp_set_num_threads(thread_num); +#pragma omp parallel for schedule(dynamic) + for (size_t i = 0; i < static_cast(_real_local_shard_num); ++i) { + std::shared_ptr region = nullptr; + std::vector> regions; + free_channel[i]->Put(std::make_shared()); + free_channel[i]->Put(std::make_shared()); + free_channel[i]->Get(region); + int feasign_size = 0; + auto& shard = _local_shards[i]; + int file_idx = 0; + int switch_cnt = 0; + region->_file_idx = 0; + { + // auto ssd_timer = + // std::make_shared("pslib_downpour_memtable_iterator_v2"); + for (auto it = shard.begin(); it != shard.end(); ++it) { + if (_config.enable_sparse_table_cache() && + (save_param == 1 || save_param == 2)) { + // get_field get right decayed show + tk.push(i, _value_accesor->GetField(it.value().data(), "show")); + } + if (_value_accesor->Save(it.value().data(), save_param)) { + uint32_t len = sizeof(uint64_t) + it.value().size() * sizeof(float) + + sizeof(uint32_t); + int region_idx = i; + if (!region->buff_remain(len)) { + busy_channel[region_idx]->Put(region); + free_channel[region_idx]->Get(region); + // region->_file_idx = 0; + switch_cnt += 1; + if (switch_cnt % 1024 == 0) { + file_idx += 1; } - if (save_param == 3) { - _db->put(i, - it->key().data(), - it->key().size(), - it->value().data(), - it->value().size()); + region->_file_idx = file_idx; + } + int read_count = 0; + char* buf = region->acquire(len); + // CHECK(buf); + *reinterpret_cast(buf + read_count) = len; + read_count += sizeof(uint32_t); + + *reinterpret_cast(buf + read_count) = it.key(); + read_count += sizeof(uint64_t); + + memcpy(buf + read_count, + it.value().data(), + sizeof(float) * it.value().size()); + // if (save_param == 1 || save_param == 2) { + // _value_accesor->update_time_decay((float*)(buf + read_count), + // false); + // } + ++feasign_size; + } + } + } + // delta and cache is all in mem, base in rocksdb + if (save_param != 1) { + // int file_idx = 1; + // int switch_cnt = 0; + file_idx++; + switch_cnt = 0; + // ssd里的参数必须按key值升序, 而内存里的参数是乱序的, + // 这里必须重新申请region + 
busy_channel[i]->Put(region); + free_channel[i]->Get(region); + region->_file_idx = file_idx; + auto* it = _db->get_iterator(i); + for (it->SeekToFirst(); it->Valid(); it->Next()) { + bool need_save = _value_accesor->Save( + paddle::string::str_to_float(it->value().data()), save_param); + _value_accesor->UpdateStatAfterSave( + paddle::string::str_to_float(it->value().data()), save_param); + if (need_save) { + uint32_t len = + sizeof(uint64_t) + it->value().size() + sizeof(uint32_t); + int region_idx = i; + uint64_t key = *( + reinterpret_cast(const_cast(it->key().data()))); + if (!region->buff_remain(len)) { + busy_channel[region_idx]->Put(region); + free_channel[region_idx]->Get(region); + switch_cnt += 1; + if (switch_cnt % 1024 == 0) { + // if (switch_cnt % 1 == 0) { + file_idx += 1; } - ++feasign_size; + region->_file_idx = file_idx; + } + int read_count = 0; + char* buf = region->acquire(len); + *reinterpret_cast(buf + read_count) = len; + read_count += sizeof(uint32_t); + + *reinterpret_cast(buf + read_count) = key; + read_count += sizeof(uint64_t); + + memcpy(buf + read_count, it->value().data(), it->value().size()); + // if (save_param == 2) { + // _value_accesor->update_time_decay((float*)(buf + read_count), + // false); + // } + ++feasign_size; + } + } + delete it; + } + if (region->_cur) { + busy_channel[i]->Put(region); + } + feasign_size_all += feasign_size; + for (auto it = shard.begin(); it != shard.end(); ++it) { + _value_accesor->UpdateStatAfterSave(it.value().data(), save_param); + } + } + for (auto& channel : busy_channel) { + channel->Close(); + } + for (size_t i = 0; i < threads.size(); i++) { + threads[i].join(); + } + for (size_t i = 0; i < busy_channel.size(); i++) { + busy_channel[i].reset(); + free_channel[i].reset(); + } + busy_channel.clear(); + free_channel.clear(); + if (save_param == 3) { + // update_table(); + uint64_t ssd_key_num = 0; + _db->get_estimate_key_num(ssd_key_num); + _cache_tk_size = + (LocalSize() + ssd_key_num) * _config.sparse_table_cache_rate(); + VLOG(0) << "DownpourSparseSSDTable update success."; + } + VLOG(0) << "DownpourSparseSSDTable save success, feasign size:" + << feasign_size_all << " ,path:" + << paddle::string::format_string("%s/%03d/part-%03d-", + path.c_str(), + _config.table_id(), + _shard_idx) + << " from " << file_start_idx << " to " + << file_start_idx + _real_local_shard_num - 1; + if (_config.enable_sparse_table_cache()) { + _local_show_threshold = tk.top(); + VLOG(0) << "local cache threshold: " << _local_show_threshold; + } + // int32 may overflow need to change return value + return 0; +} + +int32_t SSDSparseTable::SaveWithBinary(const std::string& path, + const std::string& param) { + if (_real_local_shard_num == 0) { + _local_show_threshold = -1; + return 0; + } + int save_param = atoi(param.c_str()); + VLOG(0) << "table cache rate is: " << _config.sparse_table_cache_rate(); + VLOG(0) << "enable_sparse_table_cache: " + << _config.enable_sparse_table_cache(); + VLOG(0) << "LocalSize: " << LocalSize(); + if (_config.enable_sparse_table_cache()) { + VLOG(0) << "Enable sparse table cache, top n:" << _cache_tk_size; + } + _cache_tk_size = LocalSize() * _config.sparse_table_cache_rate(); + TopkCalculator tk(_real_local_shard_num, _cache_tk_size); + VLOG(0) << "TopkCalculator top n:" << _cache_tk_size; + size_t file_start_idx = _avg_local_shard_num * _shard_idx; + std::string table_path = TableDir(path); + _afs_client.remove(paddle::string::format_string( + "%s/part-%03d-*", table_path.c_str(), _shard_idx)); +#ifdef 
PADDLE_WITH_GPU_GRAPH + int thread_num = _real_local_shard_num; +#else + int thread_num = _real_local_shard_num < 20 ? _real_local_shard_num : 20; +#endif + + std::atomic feasign_size_all{0}; + std::vector>> + busy_channel; + std::vector>> + free_channel; + std::vector threads; + + for (int i = 0; i < _real_local_shard_num; i++) { + busy_channel.push_back( + paddle::framework::MakeChannel>()); + free_channel.push_back( + paddle::framework::MakeChannel>()); + } + threads.resize(_real_local_shard_num); + + auto save_func = [this, + &save_param, + &table_path, + &file_start_idx, + &free_channel, + &busy_channel](int file_num) { + int err_no = 0; + int shard_num = file_num; + int part_num = 0; + shard_num = file_num; + part_num = 0; + FsChannelConfig channel_config; + channel_config.converter = _value_accesor->Converter(save_param).converter; + channel_config.deconverter = + _value_accesor->Converter(save_param).deconverter; + + auto get_filename = [](int compress, + int save_param, + const char* table_path, + int node_num, + int shard_num, + int part_num, + int split_num) { + if (compress && (save_param == 0 || save_param == 3)) { + return paddle::string::format_string("%s/part-%03d-%05d-%03d-%03d.gz", + table_path, + node_num, + shard_num, + part_num, + split_num); + } else { + return paddle::string::format_string("%s/part-%03d-%05d-%03d-%03d", + table_path, + node_num, + shard_num, + part_num, + split_num); + } + }; + std::shared_ptr region = nullptr; + std::string filename; + int last_file_idx = -1; + std::shared_ptr write_channel = nullptr; + if (save_param != 1 && save_param != 2) { + while (busy_channel[shard_num]->Get(region)) { + if (region->_file_idx != last_file_idx) { + filename = get_filename(_config.compress_in_save(), + save_param, + table_path.c_str(), + _shard_idx, + file_start_idx + shard_num, + part_num, + region->_file_idx); + channel_config.path = filename; + write_channel = + _afs_client.open_w(channel_config, 1024 * 1024 * 40, &err_no); + last_file_idx = region->_file_idx; + } + if (0 != write_channel->write(region->_buf, region->_cur)) { + LOG(FATAL) << "DownpourSparseSSDTable save failed, retry it! path:" + << channel_config.path; + CHECK(false); + } + region->reset(); + free_channel[shard_num]->Put(region); + } + } else { + while (busy_channel[shard_num]->Get(region)) { + if (region->_file_idx != last_file_idx) { + filename = get_filename(_config.compress_in_save(), + save_param, + table_path.c_str(), + _shard_idx, + file_start_idx + shard_num, + part_num, + region->_file_idx); + channel_config.path = filename; + write_channel = + _afs_client.open_w(channel_config, 1024 * 1024 * 40, &err_no); + last_file_idx = region->_file_idx; + } + char* cursor = region->_buf; + int remain = region->_cur; + while (remain) { + uint32_t len = *reinterpret_cast(cursor); + len -= sizeof(uint32_t); + remain -= sizeof(uint32_t); + cursor += sizeof(uint32_t); + + uint64_t k = *reinterpret_cast(cursor); + cursor += sizeof(uint64_t); + len -= sizeof(uint64_t); + remain -= sizeof(uint64_t); + + float* value = reinterpret_cast(cursor); + int dim = len / sizeof(float); + + std::string format_value = _value_accesor->ParseToString(value, dim); + if (0 != write_channel->write_line(paddle::string::format_string( + "%lu %s", k, format_value.c_str()))) { + LOG(FATAL) << "SSDSparseTable save failed, retry it! 
path:" + << channel_config.path; } + remain -= len; + cursor += len; } - delete it; + region->reset(); + free_channel[shard_num]->Put(region); } + } + // write_channel->close(); + }; + for (size_t i = 0; i < threads.size(); i++) { + threads[i] = std::thread(save_func, i); + } - write_channel->close(); - if (err_no == -1) { - ++retry_num; - is_write_failed = true; - LOG(ERROR) << "SSDSparseTable save failed after write, retry it! " - << "path:" << channel_config.path - << " , retry_num=" << retry_num; + omp_set_num_threads(thread_num); +#pragma omp parallel for schedule(dynamic) + for (size_t i = 0; i < static_cast(_real_local_shard_num); ++i) { + std::shared_ptr region = nullptr; + std::vector> regions; + free_channel[i]->Put(std::make_shared()); + free_channel[i]->Put(std::make_shared()); + free_channel[i]->Get(region); + int feasign_size = 0; + auto& shard = _local_shards[i]; + region->_file_idx = 0; + { + for (auto it = shard.begin(); it != shard.end(); ++it) { + if (_config.enable_sparse_table_cache() && + (save_param == 1 || save_param == 2)) { + // get_field get right decayed show + tk.push(i, _value_accesor->GetField(it.value().data(), "show")); + } + if (_value_accesor->Save(it.value().data(), save_param)) { + uint32_t len = sizeof(uint64_t) + it.value().size() * sizeof(float) + + sizeof(uint32_t); + int region_idx = i; + if (!region->buff_remain(len)) { + busy_channel[region_idx]->Put(region); + free_channel[region_idx]->Get(region); + region->_file_idx = 0; + } + int read_count = 0; + char* buf = region->acquire(len); + // CHECK(buf); + *reinterpret_cast(buf + read_count) = len; + read_count += sizeof(uint32_t); + + *reinterpret_cast(buf + read_count) = it.key(); + read_count += sizeof(uint64_t); + + memcpy(buf + read_count, + it.value().data(), + sizeof(float) * it.value().size()); + ++feasign_size; + } } - if (is_write_failed) { - _afs_client.remove(channel_config.path); + } + // delta and cache is all in mem, base in rocksdb + if (save_param != 1) { + int file_idx = 1; + int switch_cnt = 0; + busy_channel[i]->Put(region); + free_channel[i]->Get(region); + region->_file_idx = file_idx; + auto* it = _db->get_iterator(i); + for (it->SeekToFirst(); it->Valid(); it->Next()) { + bool need_save = _value_accesor->Save( + paddle::string::str_to_float(it->value().data()), save_param); + _value_accesor->UpdateStatAfterSave( + paddle::string::str_to_float(it->value().data()), save_param); + if (need_save) { + uint32_t len = + sizeof(uint64_t) + it->value().size() + sizeof(uint32_t); + int region_idx = i; + uint64_t key = *( + reinterpret_cast(const_cast(it->key().data()))); + if (!region->buff_remain(len)) { + busy_channel[region_idx]->Put(region); + free_channel[region_idx]->Get(region); + switch_cnt += 1; + if (switch_cnt % 1024 == 0) { + file_idx += 1; + } + region->_file_idx = file_idx; + } + int read_count = 0; + char* buf = region->acquire(len); + *reinterpret_cast(buf + read_count) = len; + read_count += sizeof(uint32_t); + + *reinterpret_cast(buf + read_count) = key; + read_count += sizeof(uint64_t); + + memcpy(buf + read_count, it->value().data(), it->value().size()); + // if (save_param == 2) { + // _value_accesor->update_time_decay((float*)(buf + read_count), + // false); + // } + ++feasign_size; + } } - } while (is_write_failed); + delete it; + } + if (region->_cur) { + busy_channel[i]->Put(region); + } feasign_size_all += feasign_size; for (auto it = shard.begin(); it != shard.end(); ++it) { _value_accesor->UpdateStatAfterSave(it.value().data(), save_param); } } + for (auto& 
channel : busy_channel) { + channel->Close(); + } + for (size_t i = 0; i < threads.size(); i++) { + threads[i].join(); + } + for (size_t i = 0; i < busy_channel.size(); i++) { + busy_channel[i].reset(); + free_channel[i].reset(); + } + + busy_channel.clear(); + free_channel.clear(); if (save_param == 3) { - UpdateTable(); - _cache_tk_size = LocalSize() * _config.sparse_table_cache_rate(); - LOG(INFO) << "SSDSparseTable update success."; - } - LOG(INFO) << "SSDSparseTable save success, path:" - << paddle::string::format_string("%s/%03d/part-%03d-", - path.c_str(), - _config.table_id(), - _shard_idx) - << " from " << file_start_idx << " to " - << file_start_idx + _real_local_shard_num - 1; - // return feasign_size_all; - _local_show_threshold = tk.top(); - LOG(INFO) << "local cache threshold: " << _local_show_threshold; + // update_table(); + uint64_t ssd_key_num = 0; + _db->get_estimate_key_num(ssd_key_num); + _cache_tk_size = + (LocalSize() + ssd_key_num) * _config.sparse_table_cache_rate(); + VLOG(0) << "DownpourSparseSSDTable update success."; + } + VLOG(0) << "DownpourSparseSSDTable save success, feasign size:" + << feasign_size_all << " ,path:" + << paddle::string::format_string("%s/%03d/part-%03d-", + path.c_str(), + _config.table_id(), + _shard_idx) + << " from " << file_start_idx << " to " + << file_start_idx + _real_local_shard_num - 1; + if (_config.enable_sparse_table_cache()) { + _local_show_threshold = tk.top(); + VLOG(0) << "local cache threshold: " << _local_show_threshold; + } // int32 may overflow need to change return value return 0; } @@ -857,142 +1527,582 @@ int32_t SSDSparseTable::SaveCache( int32_t SSDSparseTable::Load(const std::string& path, const std::string& param) { - return MemorySparseTable::Load(path, param); + VLOG(0) << "LOAD FLAGS_rocksdb_path:" << FLAGS_rocksdb_path; + std::string table_path = TableDir(path); + auto file_list = _afs_client.list(table_path); + + // std::sort(file_list.begin(), file_list.end()); + for (auto file : file_list) { + VLOG(1) << "SSDSparseTable::Load() file list: " << file; + } + + int load_param = atoi(param.c_str()); + size_t expect_shard_num = _sparse_table_shard_num; + if (file_list.size() != expect_shard_num) { + LOG(WARNING) << "SSDSparseTable file_size:" << file_list.size() + << " not equal to expect_shard_num:" << expect_shard_num; + return -1; + } + if (file_list.size() == 0) { + LOG(WARNING) << "SSDSparseTable load file is empty, path:" << path; + return -1; + } + if (load_param > 3) { + size_t file_start_idx = _shard_idx * _avg_local_shard_num; + return LoadWithString(file_start_idx, + file_start_idx + _real_local_shard_num, + file_list, + param); + } else { + return LoadWithBinary(table_path, load_param); + } } -//加载path目录下数据[start_idx, end_idx) -int32_t SSDSparseTable::Load(size_t start_idx, - size_t end_idx, - const std::vector& file_list, - const std::string& param) { - if (start_idx >= file_list.size()) { +int32_t SSDSparseTable::LoadWithString( + size_t file_start_idx, + size_t end_idx, + const std::vector& file_list, + const std::string& param) { + if (file_start_idx >= file_list.size()) { return 0; } int load_param = atoi(param.c_str()); +#ifdef PADDLE_WITH_HETERPS + load_param -= 4; +#endif size_t feature_value_size = _value_accesor->GetAccessorInfo().size / sizeof(float); size_t mf_value_size = _value_accesor->GetAccessorInfo().mf_size / sizeof(float); - end_idx = static_cast(end_idx) < _sparse_table_shard_num - ? end_idx - : _sparse_table_shard_num; - int thread_num = (end_idx - start_idx) < 20 ? 
(end_idx - start_idx) : 20; - omp_set_num_threads(thread_num); -#pragma omp parallel for schedule(dynamic) - for (size_t i = start_idx; i < end_idx; ++i) { +#ifdef PADDLE_WITH_HETERPS + int thread_num = _real_local_shard_num; +#else + int thread_num = _real_local_shard_num < 15 ? _real_local_shard_num : 15; +#endif + + for (int i = 0; i < _real_local_shard_num; i++) { + _fs_channel.push_back(paddle::framework::MakeChannel(30000)); + } + + std::vector threads; + threads.resize(thread_num); + auto load_func = [this, &file_start_idx, &file_list, &load_param]( + int file_num) { + int err_no = 0; FsChannelConfig channel_config; - channel_config.path = file_list[i]; + channel_config.path = file_list[file_num + file_start_idx]; + VLOG(1) << "SSDSparseTable::load begin load " << channel_config.path + << " into local shard " << file_num; channel_config.converter = _value_accesor->Converter(load_param).converter; channel_config.deconverter = _value_accesor->Converter(load_param).deconverter; - int retry_num = 0; - int err_no = 0; - bool is_read_failed = false; + std::string line_data; + auto read_channel = _afs_client.open_r(channel_config, 0, &err_no); + paddle::framework::ChannelWriter writer( + _fs_channel[file_num].get()); + while (read_channel->read_line(line_data) == 0 && line_data.size() > 1) { + writer << line_data; + } + writer.Flush(); + read_channel->close(); + _fs_channel[file_num]->Close(); + }; + for (size_t i = 0; i < threads.size(); i++) { + threads[i] = std::thread(load_func, i); + } + + omp_set_num_threads(thread_num); +#pragma omp parallel for schedule(dynamic) + for (int i = 0; i < _real_local_shard_num; ++i) { std::vector> ssd_keys; std::vector> ssd_values; std::vector tmp_key; ssd_keys.reserve(FLAGS_pserver_load_batch_size); ssd_values.reserve(FLAGS_pserver_load_batch_size); tmp_key.reserve(FLAGS_pserver_load_batch_size); - do { - ssd_keys.clear(); - ssd_values.clear(); - tmp_key.clear(); - err_no = 0; - is_read_failed = false; - std::string line_data; - auto read_channel = _afs_client.open_r(channel_config, 0, &err_no); - char* end = NULL; - int local_shard_id = i % _avg_local_shard_num; - auto& shard = _local_shards[local_shard_id]; - float data_buffer[FLAGS_pserver_load_batch_size * feature_value_size]; - float* data_buffer_ptr = data_buffer; - uint64_t mem_count = 0; - uint64_t ssd_count = 0; - uint64_t mem_mf_count = 0; - uint64_t ssd_mf_count = 0; - try { - while (read_channel->read_line(line_data) == 0 && - line_data.size() > 1) { - uint64_t key = std::strtoul(line_data.data(), &end, 10); - if (FLAGS_pserver_open_strict_check) { - if (key % _sparse_table_shard_num != i) { - LOG(WARNING) << "SSDSparseTable key:" << key - << " not match shard," - << " file_idx:" << i - << " shard num:" << _sparse_table_shard_num - << " file:" << channel_config.path; - continue; - } + ssd_keys.clear(); + ssd_values.clear(); + tmp_key.clear(); + std::string line_data; + char* end = NULL; + int local_shard_id = i % _avg_local_shard_num; + auto& shard = _local_shards[local_shard_id]; + float data_buffer[FLAGS_pserver_load_batch_size * + feature_value_size]; // NOLINT + float* data_buffer_ptr = data_buffer; + uint64_t mem_count = 0; + uint64_t ssd_count = 0; + uint64_t mem_mf_count = 0; + uint64_t ssd_mf_count = 0; + uint64_t filtered_count = 0; + uint64_t filter_time = 0; + uint64_t filter_begin = 0; + + paddle::framework::ChannelReader reader(_fs_channel[i].get()); + + while (reader >> line_data) { + uint64_t key = std::strtoul(line_data.data(), &end, 10); + if 
(FLAGS_pserver_open_strict_check) { + if (key % _sparse_table_shard_num != (i + file_start_idx)) { + LOG(WARNING) << "SSDSparseTable key:" << key << " not match shard," + << " file_idx:" << i + << " shard num:" << _sparse_table_shard_num; + continue; + } + } + size_t value_size = + _value_accesor->ParseFromString(++end, data_buffer_ptr); + filter_begin = butil::gettimeofday_ms(); + if (!_value_accesor->FilterSlot(data_buffer_ptr)) { + filter_time += butil::gettimeofday_ms() - filter_begin; + // ssd or mem + if (_value_accesor->SaveSSD(data_buffer_ptr)) { + tmp_key.emplace_back(key); + ssd_keys.emplace_back(std::make_pair( + reinterpret_cast(&tmp_key.back()), sizeof(uint64_t))); + ssd_values.emplace_back( + std::make_pair(reinterpret_cast(data_buffer_ptr), + value_size * sizeof(float))); + data_buffer_ptr += feature_value_size; + if (static_cast(ssd_keys.size()) == + FLAGS_pserver_load_batch_size) { + _db->put_batch( + local_shard_id, ssd_keys, ssd_values, ssd_keys.size()); + ssd_keys.clear(); + ssd_values.clear(); + tmp_key.clear(); + data_buffer_ptr = data_buffer; } - size_t value_size = - _value_accesor->ParseFromString(++end, data_buffer_ptr); - // ssd or mem - if (_value_accesor->SaveSSD(data_buffer_ptr)) { - tmp_key.emplace_back(key); - ssd_keys.emplace_back( - std::make_pair((char*)&tmp_key.back(), sizeof(uint64_t))); - ssd_values.emplace_back(std::make_pair((char*)data_buffer_ptr, - value_size * sizeof(float))); - data_buffer_ptr += feature_value_size; - if (static_cast(ssd_keys.size()) == - FLAGS_pserver_load_batch_size) { - _db->put_batch( - local_shard_id, ssd_keys, ssd_values, ssd_keys.size()); - ssd_keys.clear(); - ssd_values.clear(); - tmp_key.clear(); - data_buffer_ptr = data_buffer; - } - ssd_count++; - if (value_size > feature_value_size - mf_value_size) { - ssd_mf_count++; - } - } else { - auto& value = shard[key]; - value.resize(value_size); - _value_accesor->ParseFromString(end, value.data()); - mem_count++; - if (value_size > feature_value_size - mf_value_size) { - mem_mf_count++; - } + ssd_count++; + if (value_size > feature_value_size - mf_value_size) { + ssd_mf_count++; + } + } else { + auto& value = shard[key]; + value.resize(value_size); + _value_accesor->ParseFromString(end, value.data()); + mem_count++; + if (value_size > feature_value_size - mf_value_size) { + mem_mf_count++; } } - // last batch - if (ssd_keys.size() > 0) { - _db->put_batch(local_shard_id, ssd_keys, ssd_values, ssd_keys.size()); - } - read_channel->close(); - if (err_no == -1) { - ++retry_num; - is_read_failed = true; - LOG(ERROR) << "SSDSparseTable load failed after read, retry it! path:" - << channel_config.path << " , retry_num=" << retry_num; - continue; - } - - _db->flush(local_shard_id); - LOG(INFO) << "Table>> load done. ALL[" << mem_count + ssd_count - << "] MEM[" << mem_count << "] MEM_MF[" << mem_mf_count - << "] SSD[" << ssd_count << "] SSD_MF[" << ssd_mf_count - << "]."; - } catch (...) { - ++retry_num; - is_read_failed = true; - LOG(ERROR) << "SSDSparseTable load failed after read, retry it! path:" - << channel_config.path << " , retry_num=" << retry_num; + } else { + filter_time += butil::gettimeofday_ms() - filter_begin; + filtered_count++; } - } while (is_read_failed); + } + // last batch + if (ssd_keys.size() > 0) { + _db->put_batch(local_shard_id, ssd_keys, ssd_values, ssd_keys.size()); + } + + _db->flush(local_shard_id); + VLOG(0) << "Table>> load done. 
ALL[" << mem_count + ssd_count << "] MEM[" + << mem_count << "] MEM_MF[" << mem_mf_count << "] SSD[" << ssd_count + << "] SSD_MF[" << ssd_mf_count << "] FILTERED[" << filtered_count + << "] filter_time cost:" << filter_time / 1000 << " s"; + } + for (size_t i = 0; i < threads.size(); i++) { + threads[i].join(); + } + for (size_t i = 0; i < _fs_channel.size(); i++) { + _fs_channel[i].reset(); } + _fs_channel.clear(); LOG(INFO) << "load num:" << LocalSize(); - LOG(INFO) << "SSDSparseTable load success, path from " << file_list[start_idx] - << " to " << file_list[end_idx - 1]; + LOG(INFO) << "SSDSparseTable load success, path from " + << file_list[file_start_idx] << " to " + << file_list[file_start_idx + _real_local_shard_num - 1]; _cache_tk_size = LocalSize() * _config.sparse_table_cache_rate(); return 0; } +int32_t SSDSparseTable::LoadWithBinary(const std::string& path, int param) { + size_t feature_value_size = + _value_accesor->GetAccessorInfo().size / sizeof(float); + size_t mf_value_size = + _value_accesor->GetAccessorInfo().mf_size / sizeof(float); + // task pool _file_num_one_shard default 7 + auto task_pool = std::make_shared<::ThreadPool>(_real_local_shard_num * 7); + auto filelists = _afs_client.list( + paddle::string::format_string("%s/part-%03d*", path.c_str(), _shard_idx)); + // #pragma omp parallel for schedule(dynamic) + std::vector> tasks; + + for (int shard_idx = 0; shard_idx < _real_local_shard_num; shard_idx++) { + // FsChannelConfig channel_config; + // channel_config.converter = _value_accesor->Converter(param).converter; + // channel_config.deconverter = + // _value_accesor->Converter(param).deconverter; + for (auto& filename : filelists) { + std::vector split_filename_string = + paddle::string::split_string(filename, "-"); + int file_split_idx = + atoi(split_filename_string[split_filename_string.size() - 1].c_str()); + int file_shard_idx = + atoi(split_filename_string[split_filename_string.size() - 3].c_str()); + if (file_shard_idx != shard_idx) { + continue; + } + auto future = task_pool->enqueue([this, + feature_value_size, + mf_value_size, + shard_idx, + filename, + file_split_idx, + param]() -> int { + // &channel_config]() -> int { + FsChannelConfig channel_config; + channel_config.converter = _value_accesor->Converter(param).converter; + channel_config.deconverter = + _value_accesor->Converter(param).deconverter; + int err_no = 0; + uint64_t mem_count = 0; + uint64_t mem_mf_count = 0; + uint64_t ssd_count = 0; + uint64_t ssd_mf_count = 0; + + channel_config.path = filename; + auto read_channel = _afs_client.open_r(channel_config, 0, &err_no); + // auto reader = _api_wrapper.open_reader(filename); + auto& shard = _local_shards[shard_idx]; + rocksdb::Options options; + options.comparator = _db->get_comparator(); + rocksdb::BlockBasedTableOptions bbto; + bbto.format_version = 5; + bbto.use_delta_encoding = false; + bbto.block_size = 4 * 1024; + bbto.block_restart_interval = 6; + bbto.cache_index_and_filter_blocks = false; + bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(15, false)); + bbto.whole_key_filtering = true; + options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto)); + options.OptimizeLevelStyleCompaction(); + options.keep_log_file_num = 100; + options.max_log_file_size = 50 * 1024 * 1024; // 50MB + options.create_if_missing = true; + options.use_direct_reads = true; + options.write_buffer_size = 256 * 1024 * 1024; // 256MB + options.max_write_buffer_number = 8; + options.max_bytes_for_level_base = + options.max_write_buffer_number * 
options.write_buffer_size; + options.min_write_buffer_number_to_merge = 1; + options.target_file_size_base = 1024 * 1024 * 1024; // 1024MB + options.memtable_prefix_bloom_size_ratio = 0.02; + options.num_levels = 4; + options.max_open_files = -1; + + options.compression = rocksdb::kNoCompression; + + rocksdb::SstFileWriter sst_writer(rocksdb::EnvOptions(), options); + int use_sst = 0; + if (file_split_idx != 0) { + std::string path = + paddle::string::format_string("%s_%d/part-%03d.sst", + FLAGS_rocksdb_path.c_str(), + shard_idx, + file_split_idx); + rocksdb::Status status = sst_writer.Open(path); + if (!status.ok()) { + VLOG(0) << "sst writer open " << path << "failed"; + abort(); + } + use_sst = 1; + } + uint64_t last_k = 0; + int buf_len = 1024 * 1024 * 10; + char* buf = reinterpret_cast(malloc(buf_len + 10)); + // used for cache converted line + char* convert_buf = reinterpret_cast(malloc(buf_len + 10)); + int ret = 0; + char* cursor = buf; + char* convert_cursor = convert_buf; + int remain = 0; + while (1) { + remain = ret; + cursor = buf + remain; + ret = read_channel->read(cursor, buf_len - remain); + // ret = reader->read(cursor, buf_len - remain); + if (ret <= 0) { + break; + } + cursor = buf; + convert_cursor = convert_buf; + ret += remain; + do { + if (ret >= static_cast(sizeof(uint32_t))) { + uint32_t len = *reinterpret_cast(cursor); + if (ret >= static_cast(len)) { + ret -= sizeof(uint32_t); + len -= sizeof(uint32_t); + cursor += sizeof(uint32_t); + + uint64_t k = *reinterpret_cast(cursor); + cursor += sizeof(uint64_t); + ret -= sizeof(uint64_t); + len -= sizeof(uint64_t); + + float* value = reinterpret_cast(cursor); + size_t dim = len / sizeof(float); + + // copy value to convert_buf + memcpy(convert_cursor, cursor, len); + float* convert_value = reinterpret_cast(convert_cursor); + + if (use_sst) { + if (last_k >= k) { + VLOG(0) << "[last_k: " << last_k << "][k: " << k + << "][shard_idx: " << shard_idx + << "][file_split_idx: " << file_split_idx << "]" + << value[0]; + abort(); + } + last_k = k; + _value_accesor->UpdatePassId(convert_value, 0); + rocksdb::Status status = sst_writer.Put( + rocksdb::Slice(reinterpret_cast(&k), + sizeof(uint64_t)), + rocksdb::Slice(reinterpret_cast(convert_value), + dim * sizeof(float))); + if (!status.ok()) { + VLOG(0) << "fatal in Put file: " << filename; + abort(); + } + ssd_count += 1; + if (dim > feature_value_size - mf_value_size) { + ssd_mf_count++; + } + } else { + auto& feature_value = shard[k]; + _value_accesor->UpdatePassId(convert_value, 0); + feature_value.resize(dim); + memcpy(const_cast(feature_value.data()), + convert_value, + dim * sizeof(float)); + mem_count += 1; + if (dim > feature_value_size - mf_value_size) { + mem_mf_count++; + } + } + cursor += len; + convert_cursor += dim * sizeof(float); + ret -= len; + } else { + memcpy(buf, cursor, ret); + break; + } + } else { + memcpy(buf, cursor, ret); + break; + } + } while (ret); + } + if (use_sst) { + rocksdb::Status status = sst_writer.Finish(); + if (!status.ok()) { + VLOG(0) << "fatal in finish file: " << filename << ", " + << status.getState(); + abort(); + } + } + free(buf); + free(convert_buf); + // read_channel->close(); + // VLOG(0) << "[last_k: " << last_k << "][remain: " << remain + // << "][shard_idx: " << shard_idx + // << "][file_split_idx: " << file_split_idx << "]"; + VLOG(0) << "Table " << filename << " load done. 
ALL[" + << mem_count + ssd_count << "] MEM[" << mem_count << "] MEM_MF[" + << mem_mf_count << "] SSD[" << ssd_count << "] SSD_MF[" + << ssd_mf_count << "]."; + return 0; + }); + tasks.push_back(std::move(future)); + } + } + for (auto& fut : tasks) { + fut.wait(); + } + tasks.clear(); + for (int shard_idx = 0; shard_idx < _real_local_shard_num; shard_idx++) { + auto sst_filelist = _afs_client.list(paddle::string::format_string( + "%s_%d/part-*", FLAGS_rocksdb_path.c_str(), shard_idx)); + if (sst_filelist.size() > 0) { + int ret = _db->ingest_externel_file(shard_idx, sst_filelist); + if (ret) { + VLOG(0) << "ingest file failed"; + abort(); + } + } + } + uint64_t ssd_key_num = 0; + _db->get_estimate_key_num(ssd_key_num); + _cache_tk_size = + (LocalSize() + ssd_key_num) * _config.sparse_table_cache_rate(); + return 0; +} + +std::pair SSDSparseTable::PrintTableStat() { + int64_t feasign_size = LocalSize(); + return {feasign_size, -1}; +} + +int32_t SSDSparseTable::CacheTable(uint16_t pass_id) { + std::lock_guard guard(_table_mutex); + VLOG(0) << "cache_table"; + std::atomic count{0}; + std::vector> tasks; + + double show_threshold = 10000000; + + // 保证cache数据不被淘汰掉 + if (_config.enable_sparse_table_cache()) { + if (_local_show_threshold < show_threshold) { + show_threshold = _local_show_threshold; + } + } + + if (show_threshold < 500) { + show_threshold = 500; + } + VLOG(0) << " show_threshold:" << show_threshold + << " ; local_show_threshold:" << _local_show_threshold; + VLOG(0) << "Table>> origin mem feasign size:" << LocalSize(); + static int cache_table_count = 0; + ++cache_table_count; + for (size_t shard_id = 0; + shard_id < static_cast(_real_local_shard_num); + ++shard_id) { + // from mem to ssd + auto fut = _shards_task_pool[shard_id % _shards_task_pool.size()]->enqueue( + [shard_id, this, &count, show_threshold, pass_id]() -> int { + rocksdb::Options options; + options.comparator = _db->get_comparator(); + rocksdb::BlockBasedTableOptions bbto; + bbto.format_version = 5; + bbto.use_delta_encoding = false; + bbto.block_size = 4 * 1024; + bbto.block_restart_interval = 6; + bbto.cache_index_and_filter_blocks = false; + bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(15, false)); + bbto.whole_key_filtering = true; + options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto)); + options.OptimizeLevelStyleCompaction(); + options.keep_log_file_num = 100; + options.max_log_file_size = 50 * 1024 * 1024; // 50MB + options.create_if_missing = true; + options.use_direct_reads = true; + options.write_buffer_size = 64 * 1024 * 1024; // 256MB + options.max_write_buffer_number = 4; + options.max_bytes_for_level_base = + options.max_write_buffer_number * options.write_buffer_size; + options.min_write_buffer_number_to_merge = 1; + options.target_file_size_base = 1024 * 1024 * 1024; // 1024MB + options.memtable_prefix_bloom_size_ratio = 0.02; + options.num_levels = 4; + options.max_open_files = -1; + + options.compression = rocksdb::kNoCompression; + + auto& shard = _local_shards[shard_id]; + if (1) { + using DataType = shard_type::map_type::iterator; + std::vector datas; + datas.reserve(shard.size() * 0.8); + for (auto it = shard.begin(); it != shard.end(); ++it) { + if (!_value_accesor->SaveMemCache( + it.value().data(), 0, show_threshold, pass_id)) { + datas.emplace_back(it.it); + } + } + count.fetch_add(datas.size(), std::memory_order_relaxed); + VLOG(0) << "datas size: " << datas.size(); + { + // sst文件写入必须有序 + uint64_t show_begin = butil::gettimeofday_ms(); + std::sort(datas.begin(), 
+ datas.end(), + [](const DataType& a, const DataType& b) { + return a->first < b->first; + }); + VLOG(0) << "sort shard " << shard_id << ": " + << butil::gettimeofday_ms() - show_begin + << " ms, num: " << datas.size(); + } + + // 必须做空判断,否则sst_writer.Finish会core掉 + if (datas.size() != 0) { + rocksdb::SstFileWriter sst_writer(rocksdb::EnvOptions(), options); + std::string filename = + paddle::string::format_string("%s_%d/cache-%05d.sst", + FLAGS_rocksdb_path.c_str(), + shard_id, + cache_table_count); + rocksdb::Status status = sst_writer.Open(filename); + if (!status.ok()) { + VLOG(0) << "sst writer open " << filename << "failed" + << ", " << status.getState(); + abort(); + } + VLOG(0) << "sst writer open " << filename; + + uint64_t show_begin = butil::gettimeofday_ms(); + for (auto& data : datas) { + uint64_t tmp_key = data->first; + FixedFeatureValue& tmp_value = + *((FixedFeatureValue*)(void*)(data->second)); // NOLINT + status = sst_writer.Put( + rocksdb::Slice(reinterpret_cast(&(tmp_key)), + sizeof(uint64_t)), + rocksdb::Slice(reinterpret_cast(tmp_value.data()), + tmp_value.size() * sizeof(float))); + if (!status.ok()) { + VLOG(0) << "fatal in Put file: " << filename << ", " + << status.getState(); + abort(); + } + } + status = sst_writer.Finish(); + if (!status.ok()) { + VLOG(0) << "fatal in finish file: " << filename << ", " + << status.getState(); + abort(); + } + VLOG(0) << "write sst_file shard " << shard_id << ": " + << butil::gettimeofday_ms() - show_begin << " ms"; + int ret = _db->ingest_externel_file(shard_id, {filename}); + if (ret) { + VLOG(0) << "ingest file failed" + << ", " << status.getState(); + abort(); + } + } + + for (auto it = shard.begin(); it != shard.end();) { + if (!_value_accesor->SaveMemCache( + it.value().data(), 0, show_threshold, pass_id)) { + it = shard.erase(it); + } else { + ++it; + } + } + } + return 0; + }); + tasks.push_back(std::move(fut)); + } + for (size_t i = 0; i < tasks.size(); ++i) { + tasks[i].wait(); + } + tasks.clear(); + + VLOG(0) << "Table>> cache ssd count: " << count.load(); + VLOG(0) << "Table>> after update, mem feasign size:" << LocalSize(); + return 0; +} + } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/ssd_sparse_table.h b/paddle/fluid/distributed/ps/table/ssd_sparse_table.h index 55a05bbab5ec2..e5561c5e42b99 100644 --- a/paddle/fluid/distributed/ps/table/ssd_sparse_table.h +++ b/paddle/fluid/distributed/ps/table/ssd_sparse_table.h @@ -21,6 +21,41 @@ namespace paddle { namespace distributed { +class MemRegion { + public: + MemRegion() { + _cap = 2 * 1024 * 1024; + _buf = reinterpret_cast(malloc(_cap)); + _cur = 0; + _file_idx = -1; + } + virtual ~MemRegion() { free(_buf); } + bool buff_remain(int len) { + if (_cap - _cur < len) { + return false; + } else { + return true; + } + } + char* acquire(int len) { + if (_cap - _cur < len) { + return nullptr; + } else { + char* ret = _buf + _cur; + _cur += len; + return ret; + } + } + void reset() { + _cur = 0; + _file_idx = -1; + } + int _cap; + int _cur; + int _file_idx; + char* _buf; +}; + class SSDSparseTable : public MemorySparseTable { public: typedef SparseTableShard shard_type; @@ -38,27 +73,34 @@ class SSDSparseTable : public MemorySparseTable { int32_t Push(TableContext& context) override; int32_t PullSparse(float* pull_values, const uint64_t* keys, size_t num); - int32_t PullSparsePtr(char** pull_values, const uint64_t* keys, size_t num); + int32_t PullSparsePtr(int shard_id, + char** pull_values, + const uint64_t* keys, + size_t 
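MemRegion, declared just above, is a small bump allocator: one 2 MB malloc'd buffer, acquire() hands out consecutive slices by advancing _cur, and reset() rewinds the cursor so the buffer can be reused for the next output file without reallocating. A usage sketch follows; it assumes the MemRegion definition above, and the record layout is made up for illustration.

#include <cstring>
#include <vector>

// Compile together with the MemRegion class from ssd_sparse_table.h above.
void PackRecords(MemRegion* region,
                 const std::vector<std::vector<float>>& values) {
  for (const auto& v : values) {
    int len = static_cast<int>(v.size() * sizeof(float));
    if (!region->buff_remain(len)) {
      // A real caller would flush _buf[0.._cur) to the file indexed by
      // _file_idx here before rewinding the region.
      region->reset();
    }
    char* dst = region->acquire(len);
    if (dst == nullptr) {  // record larger than the whole 2 MB region
      continue;
    }
    std::memcpy(dst, v.data(), len);
  }
}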
num, + uint16_t pass_id); int32_t PushSparse(const uint64_t* keys, const float* values, size_t num); int32_t PushSparse(const uint64_t* keys, const float** values, size_t num); int32_t Flush() override { return 0; } - virtual int32_t Shrink(const std::string& param) override; - virtual void Clear() override { + int32_t Shrink(const std::string& param) override; + void Clear() override { for (int i = 0; i < _real_local_shard_num; ++i) { _local_shards[i].clear(); } } - virtual int32_t Save(const std::string& path, - const std::string& param) override; - virtual int32_t SaveCache( + int32_t Save(const std::string& path, const std::string& param) override; + int32_t SaveWithString(const std::string& path, const std::string& param); + int32_t SaveWithStringMultiOutput(const std::string& path, + const std::string& param); + int32_t SaveWithBinary(const std::string& path, const std::string& param); + int32_t SaveCache( const std::string& path, const std::string& param, paddle::framework::Channel>& shuffled_channel) override; - virtual double GetCacheThreshold() override { return _local_show_threshold; } - virtual int64_t CacheShuffle( + double GetCacheThreshold() override { return _local_show_threshold; } + int64_t CacheShuffle( const std::string& path, const std::string& param, double cache_threshold, @@ -67,20 +109,25 @@ class SSDSparseTable : public MemorySparseTable { paddle::framework::Channel>& shuffled_channel, const std::vector& table_ptrs) override; - //加载path目录下数据 - virtual int32_t Load(const std::string& path, - const std::string& param) override; - //加载path目录下数据[start_idx, end_idx) - virtual int32_t Load(size_t start_idx, - size_t end_idx, - const std::vector& file_list, - const std::string& param); + // 加载path目录下数据 + int32_t Load(const std::string& path, const std::string& param) override; + int32_t LoadWithString(size_t file_start_idx, + size_t end_idx, + const std::vector& file_list, + const std::string& param); + int32_t LoadWithBinary(const std::string& path, int param); int64_t LocalSize(); + std::pair PrintTableStat() override; + + int32_t CacheTable(uint16_t pass_id) override; + private: RocksDBHandler* _db; int64_t _cache_tk_size; double _local_show_threshold{0.0}; + std::vector> _fs_channel; + std::mutex _table_mutex; }; } // namespace distributed diff --git a/paddle/fluid/distributed/ps/table/table.cc b/paddle/fluid/distributed/ps/table/table.cc index 3e6d5a9941206..95655c1360e22 100644 --- a/paddle/fluid/distributed/ps/table/table.cc +++ b/paddle/fluid/distributed/ps/table/table.cc @@ -26,16 +26,15 @@ #include "paddle/fluid/distributed/ps/table/sparse_accessor.h" #include "paddle/fluid/distributed/ps/table/ssd_sparse_table.h" #include "paddle/fluid/distributed/ps/table/tensor_accessor.h" -#include "paddle/fluid/distributed/ps/table/tensor_table.h" namespace paddle { namespace distributed { REGISTER_PSCORE_CLASS(Table, GraphTable); REGISTER_PSCORE_CLASS(Table, MemoryDenseTable); REGISTER_PSCORE_CLASS(Table, BarrierTable); -REGISTER_PSCORE_CLASS(Table, TensorTable); -REGISTER_PSCORE_CLASS(Table, DenseTensorTable); -REGISTER_PSCORE_CLASS(Table, GlobalStepTable); +// REGISTER_PSCORE_CLASS(Table, TensorTable); +// REGISTER_PSCORE_CLASS(Table, DenseTensorTable); +// REGISTER_PSCORE_CLASS(Table, GlobalStepTable); REGISTER_PSCORE_CLASS(Table, MemorySparseTable); REGISTER_PSCORE_CLASS(Table, SSDSparseTable); REGISTER_PSCORE_CLASS(Table, MemorySparseGeoTable); diff --git a/paddle/fluid/distributed/ps/table/table.h b/paddle/fluid/distributed/ps/table/table.h index 
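The REGISTER_PSCORE_CLASS(Table, ...) lines in table.cc register each table type under its class name so it can be constructed from the config string at runtime; commenting a registration out, as done for TensorTable above, simply removes that name from the registry. The sketch below shows the generic shape of such a string-to-factory registry; it is not the actual PSCORE macro, and TableBase/DemoSparseTable are illustrative names.

#include <functional>
#include <map>
#include <memory>
#include <string>

struct TableBase {
  virtual ~TableBase() = default;
};

using TableFactory = std::function<std::unique_ptr<TableBase>()>;

// One global name -> factory map, populated at static-initialization time.
std::map<std::string, TableFactory>& Registry() {
  static std::map<std::string, TableFactory> r;
  return r;
}

#define REGISTER_TABLE(name)                                                   \
  static bool registered_##name = [] {                                         \
    Registry()[#name] = [] { return std::unique_ptr<TableBase>(new name()); }; \
    return true;                                                               \
  }()

struct DemoSparseTable : TableBase {};
REGISTER_TABLE(DemoSparseTable);  // later: auto table = Registry()["DemoSparseTable"]();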
aee707712f662..f07a3f2132217 100644 --- a/paddle/fluid/distributed/ps/table/table.h +++ b/paddle/fluid/distributed/ps/table/table.h @@ -62,6 +62,8 @@ struct TableContext { size_t num; bool use_ptr = false; uint32_t trainer_id; // for GEO and global step + int shard_id; // for gpups + uint16_t pass_id; // for gpups ssd }; class Table { @@ -147,6 +149,7 @@ class Table { virtual void *GetShard(size_t shard_idx) = 0; virtual std::pair PrintTableStat() { return {0, 0}; } + virtual int32_t CacheTable(uint16_t pass_id) { return 0; } // for patch model virtual void Revert() {} diff --git a/paddle/fluid/distributed/ps/wrapper/fleet.cc b/paddle/fluid/distributed/ps/wrapper/fleet.cc index a6d233ac6dc4e..077c21e263386 100644 --- a/paddle/fluid/distributed/ps/wrapper/fleet.cc +++ b/paddle/fluid/distributed/ps/wrapper/fleet.cc @@ -747,6 +747,17 @@ void FleetWrapper::PrintTableStat(const uint64_t table_id) { } } +void FleetWrapper::SaveCacheTable(const uint64_t table_id, + uint16_t pass_id, + size_t threshold) { + auto ret = worker_ptr_->SaveCacheTable(table_id, pass_id, threshold); + ret.wait(); + int32_t err_code = ret.get(); + if (err_code == -1) { + LOG(ERROR) << "save cache table stat failed"; + } +} + void FleetWrapper::ShrinkSparseTable(int table_id, int threshold) { auto ret = worker_ptr_->Shrink(table_id, std::to_string(threshold)); ret.wait(); diff --git a/paddle/fluid/distributed/ps/wrapper/fleet.h b/paddle/fluid/distributed/ps/wrapper/fleet.h old mode 100755 new mode 100644 index c7aaededa20ba..acf2c3f72256c --- a/paddle/fluid/distributed/ps/wrapper/fleet.h +++ b/paddle/fluid/distributed/ps/wrapper/fleet.h @@ -242,6 +242,9 @@ class FleetWrapper { void BarrierWithTable(uint32_t barrier_type); void PrintTableStat(const uint64_t table_id); + void SaveCacheTable(const uint64_t table_id, + uint16_t pass_id, + size_t threshold); // mode = 0, load all feature // mode = 1, load delta feature, which means load diff void LoadModel(const std::string& path, const int mode); diff --git a/paddle/fluid/distributed/the_one_ps.proto b/paddle/fluid/distributed/the_one_ps.proto index 5eeba70336007..a3f0707efe72f 100755 --- a/paddle/fluid/distributed/the_one_ps.proto +++ b/paddle/fluid/distributed/the_one_ps.proto @@ -169,6 +169,7 @@ message CtrAccessorParameter { [ default = 1 ]; // threshold to save ssd optional bool show_scale = 10 [ default = true ]; optional bool zero_init = 11 [ default = true ]; + repeated float load_filter_slots = 12; } message TensorAccessorParameter { diff --git a/paddle/fluid/framework/barrier.h b/paddle/fluid/framework/barrier.h new file mode 100644 index 0000000000000..e7aa976cc9eda --- /dev/null +++ b/paddle/fluid/framework/barrier.h @@ -0,0 +1,113 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
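FleetWrapper::SaveCacheTable above follows the same blocking-wrapper shape as the other fleet calls: the PS client returns an asynchronous handle (a std::future-style object), the wrapper waits on it, and a returned -1 is turned into an error log. A minimal stand-alone sketch of that shape is below; std::async stands in for the RPC client, and the function names and error-code convention are illustrative assumptions.

#include <cstdint>
#include <future>
#include <iostream>

// Stand-in for the asynchronous PS client call; a real client would issue an RPC.
std::future<int32_t> AsyncSaveCacheTable(uint64_t table_id,
                                         uint16_t pass_id,
                                         size_t threshold) {
  return std::async(std::launch::async, [=]() -> int32_t {
    // ... send (table_id, pass_id, threshold) to the parameter server ...
    return 0;  // 0 = success, -1 = failure, mirroring the wrapper above
  });
}

void SaveCacheTableBlocking(uint64_t table_id, uint16_t pass_id,
                            size_t threshold) {
  auto ret = AsyncSaveCacheTable(table_id, pass_id, threshold);
  ret.wait();
  if (ret.get() == -1) {
    std::cerr << "save cache table stat failed" << std::endl;
  }
}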
+ +#pragma once + +#ifdef __LINUX__ +#include +#include +#endif +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { +class Barrier { + public: + explicit Barrier(int count = 1) { +#ifdef __LINUX__ + CHECK_GE(count, 1); + CHECK_EQ(pthread_barrier_init(&_barrier, NULL, count), 0); +#endif + } + + ~Barrier() { +#ifdef __LINUX__ + CHECK_EQ(pthread_barrier_destroy(&_barrier), 0); +#endif + } + + void reset(int count) { +#ifdef __LINUX__ + CHECK_GE(count, 1); + CHECK_EQ(pthread_barrier_destroy(&_barrier), 0); + CHECK_EQ(pthread_barrier_init(&_barrier, NULL, count), 0); +#endif + } + + void wait() { +#ifdef __LINUX__ + int err = pthread_barrier_wait(&_barrier); + if (err != 0 && err != PTHREAD_BARRIER_SERIAL_THREAD)) { + CHECK_EQ(1, 0); + } +#endif + } + + private: +#ifdef __LINUX__ + pthread_barrier_t _barrier; +#endif +}; +// Call func(args...). If interrupted by signal, recall the function. +template +auto ignore_signal_call(FUNC &&func, ARGS &&...args) -> + typename std::result_of::type { + for (;;) { + auto err = func(args...); + + if (err < 0 && errno == EINTR) { + LOG(INFO) << "Signal is caught. Ignored."; + continue; + } + return err; + } +} +class Semaphore { + public: + Semaphore() { +#ifdef __LINUX__ + CHECK_EQ(sem_init(&_sem, 0, 0), 0); +#endif + } + ~Semaphore() { +#ifdef __LINUX__ + CHECK_EQ(sem_destroy(&_sem), 0); +#endif + } + void post() { +#ifdef __LINUX__ + CHECK_EQ(sem_post(&_sem), 0); +#endif + } + void wait() { +#ifdef __LINUX__ + CHECK_EQ(ignore_signal_call(sem_wait, &_sem), 0); +#endif + } + bool try_wait() { + int err = 0; +#ifdef __LINUX__ + CHECK((err = ignore_signal_call(sem_trywait, &_sem), + err == 0 || errno == EAGAIN)); +#endif + return err == 0; + } + + private: +#ifdef __LINUX__ + sem_t _sem; +#endif +}; +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/data_feed.cc b/paddle/fluid/framework/data_feed.cc index 660f1838dcf77..f1b7d696a4ec0 100644 --- a/paddle/fluid/framework/data_feed.cc +++ b/paddle/fluid/framework/data_feed.cc @@ -2114,6 +2114,16 @@ void SlotRecordInMemoryDataFeed::Init(const DataFeedDesc& data_feed_desc) { #endif } +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) +void SlotRecordInMemoryDataFeed::InitGraphResource() { + gpu_graph_data_generator_.AllocResource(thread_id_, feed_vec_); +} + +void SlotRecordInMemoryDataFeed::InitGraphTrainResource() { + gpu_graph_data_generator_.AllocTrainResource(thread_id_); +} +#endif + void SlotRecordInMemoryDataFeed::LoadIntoMemory() { VLOG(3) << "SlotRecord LoadIntoMemory() begin, thread_id=" << thread_id_; if (!so_parser_name_.empty()) { @@ -2650,7 +2660,7 @@ bool SlotRecordInMemoryDataFeed::Start() { pack_ = BatchGpuPackMgr().get(this->GetPlace(), used_slots_info_); #endif #if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) - gpu_graph_data_generator_.AllocResource(this->place_, feed_vec_); + gpu_graph_data_generator_.SetFeedVec(feed_vec_); #endif return true; } @@ -2692,6 +2702,12 @@ int SlotRecordInMemoryDataFeed::Next() { #endif } +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) +void SlotRecordInMemoryDataFeed::DoWalkandSage() { + gpu_graph_data_generator_.DoWalkandSage(); +} +#endif + #if defined(PADDLE_WITH_CUDA) && defined(PADDLE_WITH_HETERPS) void SlotRecordInMemoryDataFeed::BuildSlotBatchGPU(const int ins_num) { int offset_cols_size = (ins_num + 1); diff --git a/paddle/fluid/framework/data_feed.cu b/paddle/fluid/framework/data_feed.cu index c761e03b84ff1..aec9fee25573a 100644 
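The Barrier class in barrier.h wraps a POSIX barrier: pthread_barrier_wait returns PTHREAD_BARRIER_SERIAL_THREAD to exactly one waiter and 0 to all others, so both values count as success and anything else is an error. A short, runnable usage sketch (Linux, link with -pthread; the thread count and the printed phases are illustrative):

#include <pthread.h>

#include <cstdio>
#include <thread>
#include <vector>

int main() {
  constexpr int kThreads = 4;
  pthread_barrier_t barrier;
  pthread_barrier_init(&barrier, nullptr, kThreads);

  std::vector<std::thread> workers;
  for (int i = 0; i < kThreads; ++i) {
    workers.emplace_back([&barrier, i] {
      std::printf("thread %d: phase 1 done\n", i);
      // Everyone blocks here until all kThreads threads have arrived.
      int err = pthread_barrier_wait(&barrier);
      // One waiter gets PTHREAD_BARRIER_SERIAL_THREAD, the others get 0.
      if (err != 0 && err != PTHREAD_BARRIER_SERIAL_THREAD) {
        std::printf("barrier wait failed\n");
      }
      std::printf("thread %d: phase 2\n", i);
    });
  }
  for (auto& t : workers) t.join();
  pthread_barrier_destroy(&barrier);
  return 0;
}

The ignore_signal_call helper in the same header applies the standard EINTR convention: retry the wrapped call whenever it returns a negative value with errno == EINTR, which is why sem_wait/sem_trywait are routed through it.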
--- a/paddle/fluid/framework/data_feed.cu +++ b/paddle/fluid/framework/data_feed.cu @@ -24,9 +24,17 @@ limitations under the License. */ #include #include "cub/cub.cuh" #include "paddle/fluid/framework/fleet/heter_ps/gpu_graph_node.h" +#include "paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h" #include "paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h" +#include "paddle/fluid/framework/fleet/heter_ps/hashtable.h" +#include "paddle/fluid/framework/fleet/ps_gpu_wrapper.h" +#include "paddle/phi/kernels/gpu/graph_reindex_funcs.h" +#include "paddle/phi/kernels/graph_reindex_kernel.h" DECLARE_bool(enable_opt_get_features); +DECLARE_bool(graph_metapath_split_opt); +DECLARE_int32(gpugraph_storage_mode); +DECLARE_double(gpugraph_hbm_table_load_factor); namespace paddle { namespace framework { @@ -35,12 +43,234 @@ namespace framework { for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \ i += blockDim.x * gridDim.x) +#define DEBUG_STATE(state) \ + VLOG(2) << "left: " << state->left << " right: " << state->right \ + << " central_word: " << state->central_word \ + << " step: " << state->step << " cursor: " << state->cursor \ + << " len: " << state->len << " row_num: " << state->row_num; \ // CUDA: use 512 threads per block const int CUDA_NUM_THREADS = 512; // CUDA: number of blocks for threads. inline int GET_BLOCKS(const int N) { return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS; } + +template +__global__ void fill_idx(T *idx, size_t len) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + idx[i] = i; + } +} + +/** + * @brief sort cub + */ +template +void cub_sort_pairs(int len, + const K *in_keys, + K *out_keys, + const V *in_vals, + V *out_vals, + cudaStream_t stream, + std::shared_ptr &d_buf_, // NOLINT + const paddle::platform::Place &place_) { + size_t temp_storage_bytes = 0; + CUDA_CHECK(cub::DeviceRadixSort::SortPairs(NULL, + temp_storage_bytes, + in_keys, + out_keys, + in_vals, + out_vals, + len, + 0, + 8 * sizeof(K), + stream, + false)); + if (d_buf_ == NULL || d_buf_->size() < temp_storage_bytes) { + d_buf_ = memory::AllocShared( + place_, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); + } + CUDA_CHECK(cub::DeviceRadixSort::SortPairs(d_buf_->ptr(), + temp_storage_bytes, + in_keys, + out_keys, + in_vals, + out_vals, + len, + 0, + 8 * sizeof(K), + stream, + false)); +} + +/** + * @Brief cub run length encode + */ +template +void cub_runlength_encode(int N, + const K *in_keys, + K *out_keys, + V *out_sizes, + TNum *d_out_len, + cudaStream_t stream, + std::shared_ptr &d_buf_, // NOLINT + const paddle::platform::Place &place_) { + size_t temp_storage_bytes = 0; + CUDA_CHECK(cub::DeviceRunLengthEncode::Encode(NULL, + temp_storage_bytes, + in_keys, + out_keys, + out_sizes, + d_out_len, + N, + stream)); + if (d_buf_ == NULL || d_buf_->size() < temp_storage_bytes) { + d_buf_ = memory::AllocShared( + place_, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); + } + CUDA_CHECK(cub::DeviceRunLengthEncode::Encode(d_buf_->ptr(), + temp_storage_bytes, + in_keys, + out_keys, + out_sizes, + d_out_len, + N, + stream)); +} + +/** + * @brief exclusive sum + */ +template +void cub_exclusivesum(int N, + const K *in, + K *out, + cudaStream_t stream, + std::shared_ptr &d_buf_, // NOLINT + const paddle::platform::Place &place_) { + size_t temp_storage_bytes = 0; + CUDA_CHECK(cub::DeviceScan::ExclusiveSum( + NULL, temp_storage_bytes, in, out, N, stream)); + if (d_buf_ == NULL || d_buf_->size() < temp_storage_bytes) { + d_buf_ = 
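The cub_sort_pairs, cub_runlength_encode, and cub_exclusivesum helpers all follow CUB's two-pass contract: the first call with a null temp-storage pointer only reports the scratch size, the caller allocates (reusing d_buf_ across calls when it is already large enough), and the second call does the real work on the given stream. A minimal CUDA sketch of the same idiom with DeviceScan::ExclusiveSum (array contents and sizes are illustrative, error checking omitted for brevity):

#include <cub/cub.cuh>

#include <cstdio>

int main() {
  const int n = 8;
  int h_in[n] = {1, 2, 3, 4, 5, 6, 7, 8};
  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

  // Pass 1: query how much scratch space the scan needs.
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceScan::ExclusiveSum(nullptr, temp_bytes, d_in, d_out, n);
  cudaMalloc(&d_temp, temp_bytes);

  // Pass 2: run the scan for real with the allocated scratch buffer.
  cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n);

  int h_out[n];
  cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; ++i) std::printf("%d ", h_out[i]);  // 0 1 3 6 10 15 21 28
  std::printf("\n");
  cudaFree(d_in);
  cudaFree(d_out);
  cudaFree(d_temp);
  return 0;
}

Caching the scratch buffer in a shared d_buf_, as the helpers above do, avoids a cudaMalloc/cudaFree pair on every sort, encode, or scan inside the sampling loop.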
memory::AllocShared( + place_, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); + } + CUDA_CHECK(cub::DeviceScan::ExclusiveSum( + d_buf_->ptr(), temp_storage_bytes, in, out, N, stream)); +} + +template +__global__ void kernel_fill_restore_idx(size_t N, + const T *d_sorted_idx, + const T *d_offset, + const T *d_merged_cnts, + T *d_restore_idx) { + CUDA_KERNEL_LOOP(i, N) { + const T &off = d_offset[i]; + const T &num = d_merged_cnts[i]; + for (size_t k = 0; k < num; k++) { + d_restore_idx[d_sorted_idx[off + k]] = i; + } + } +} + +template +__global__ void kernel_fill_restore_idx_by_search(size_t N, + const T *d_sorted_idx, + size_t merge_num, + const T *d_offset, + T *d_restore_idx) { + CUDA_KERNEL_LOOP(i, N) { + if (i < d_offset[1]) { + d_restore_idx[d_sorted_idx[i]] = 0; + continue; + } + int high = merge_num - 1; + int low = 1; + while (low < high) { + int mid = (low + high) / 2; + if (i < d_offset[mid + 1]) { + high = mid; + } else { + low = mid + 1; + } + } + d_restore_idx[d_sorted_idx[i]] = low; + } +} + +// For unique node and inverse id. +int dedup_keys_and_fillidx(int total_nodes_num, + const uint64_t *d_keys, + uint64_t *d_merged_keys, // input + uint64_t *d_sorted_keys, // output + uint32_t *d_restore_idx, // inverse + uint32_t *d_sorted_idx, + uint32_t *d_offset, + uint32_t *d_merged_cnts, + cudaStream_t stream, + std::shared_ptr &d_buf_, // NOLINT + const paddle::platform::Place &place_) { + int merged_size = 0; // Final num + auto d_index_in = + memory::Alloc(place_, + sizeof(uint32_t) * (total_nodes_num + 1), + phi::Stream(reinterpret_cast(stream))); + uint32_t *d_index_in_ptr = reinterpret_cast(d_index_in->ptr()); + int *d_merged_size = + reinterpret_cast(&d_index_in_ptr[total_nodes_num]); + fill_idx<<>>( + d_index_in_ptr, total_nodes_num); + cub_sort_pairs(total_nodes_num, + d_keys, + d_sorted_keys, + d_index_in_ptr, + d_sorted_idx, + stream, + d_buf_, + place_); + cub_runlength_encode(total_nodes_num, + d_sorted_keys, + d_merged_keys, + d_merged_cnts, + d_merged_size, + stream, + d_buf_, + place_); + CUDA_CHECK(cudaMemcpyAsync(&merged_size, + d_merged_size, + sizeof(int), + cudaMemcpyDeviceToHost, + stream)); + CUDA_CHECK(cudaStreamSynchronize(stream)); + cub_exclusivesum( + merged_size, d_merged_cnts, d_offset, stream, d_buf_, place_); + + if (total_nodes_num < merged_size * 2) { + kernel_fill_restore_idx<<>>( + merged_size, d_sorted_idx, d_offset, d_merged_cnts, d_restore_idx); + } else { + // used mid search fill idx when high dedup rate + kernel_fill_restore_idx_by_search<<>>( + total_nodes_num, d_sorted_idx, merged_size, d_offset, d_restore_idx); + } + CUDA_CHECK(cudaStreamSynchronize(stream)); + return merged_size; +} + // fill slot values __global__ void FillSlotValueOffsetKernel(const int ins_num, const int used_slot_num, @@ -207,13 +437,13 @@ __global__ void CopyDuplicateKeys(int64_t *dist_tensor, int GraphDataGenerator::AcquireInstance(BufState *state) { // if (state->GetNextStep()) { - state->Debug(); + DEBUG_STATE(state); return state->len; } else if (state->GetNextCentrolWord()) { - state->Debug(); + DEBUG_STATE(state); return state->len; } else if (state->GetNextBatch()) { - state->Debug(); + DEBUG_STATE(state); return state->len; } return 0; @@ -315,22 +545,31 @@ __global__ void GraphFillSlotKernel(uint64_t *id_tensor, uint64_t *feature_buf, int len, int total_ins, - int slot_num) { + int slot_num, + int *slot_feature_num_map, + int fea_num_per_node, + int *actual_slot_id_map, + int *fea_offset_map) { CUDA_KERNEL_LOOP(idx, len) { - int slot_idx 
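dedup_keys_and_fillidx chains sort-pairs, run-length-encode, exclusive-sum, and one of the two restore-index kernels to produce, for every original key position, the index of its deduplicated key. The following CPU reference implements the same transformation with standard C++ on a tiny input; it is a checking aid, not the GPU path itself.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  std::vector<uint64_t> keys = {42, 7, 42, 13, 7, 7};

  // Sort (key, original index) pairs -- mirrors cub_sort_pairs.
  std::vector<uint32_t> sorted_idx(keys.size());
  std::iota(sorted_idx.begin(), sorted_idx.end(), 0u);
  std::stable_sort(sorted_idx.begin(), sorted_idx.end(),
                   [&](uint32_t a, uint32_t b) { return keys[a] < keys[b]; });

  // Run-length encode the sorted keys -- mirrors cub_runlength_encode.
  std::vector<uint64_t> merged_keys;
  std::vector<uint32_t> counts;
  for (uint32_t i : sorted_idx) {
    if (merged_keys.empty() || merged_keys.back() != keys[i]) {
      merged_keys.push_back(keys[i]);
      counts.push_back(0);
    }
    ++counts.back();
  }

  // Exclusive prefix sum of the counts gives each run's start offset.
  std::vector<uint32_t> offsets(counts.size(), 0);
  for (size_t i = 1; i < counts.size(); ++i) {
    offsets[i] = offsets[i - 1] + counts[i - 1];
  }

  // Restore index: original position -> position of its key in merged_keys.
  std::vector<uint32_t> restore_idx(keys.size());
  for (size_t m = 0; m < merged_keys.size(); ++m) {
    for (uint32_t k = 0; k < counts[m]; ++k) {
      restore_idx[sorted_idx[offsets[m] + k]] = static_cast<uint32_t>(m);
    }
  }
  for (uint32_t r : restore_idx) std::printf("%u ", r);  // 2 0 2 1 0 0
  std::printf("\n");
  return 0;
}

The GPU version picks between the direct fill and the binary-search fill (kernel_fill_restore_idx_by_search) based on the dedup rate, since a high dedup rate makes per-run scatter loops unbalanced.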
= idx / total_ins; + int fea_idx = idx / total_ins; int ins_idx = idx % total_ins; - ((uint64_t *)(id_tensor[slot_idx]))[ins_idx] = // NOLINT - feature_buf[ins_idx * slot_num + slot_idx]; + int actual_slot_id = actual_slot_id_map[fea_idx]; + int fea_offset = fea_offset_map[fea_idx]; + reinterpret_cast(id_tensor[actual_slot_id]) + [ins_idx * slot_feature_num_map[actual_slot_id] + fea_offset] = + feature_buf[ins_idx * fea_num_per_node + fea_idx]; } } __global__ void GraphFillSlotLodKernelOpt(uint64_t *id_tensor, int len, - int total_ins) { + int total_ins, + int *slot_feature_num_map) { CUDA_KERNEL_LOOP(idx, len) { int slot_idx = idx / total_ins; int ins_idx = idx % total_ins; - ((uint64_t *)(id_tensor[slot_idx]))[ins_idx] = ins_idx; // NOLINT + (reinterpret_cast(id_tensor[slot_idx]))[ins_idx] = + ins_idx * slot_feature_num_map[slot_idx]; } } @@ -338,64 +577,189 @@ __global__ void GraphFillSlotLodKernel(int64_t *id_tensor, int len) { CUDA_KERNEL_LOOP(idx, len) { id_tensor[idx] = idx; } } -int GraphDataGenerator::FillInsBuf() { - if (ins_buf_pair_len_ >= batch_size_) { - return batch_size_; +// fill sage neighbor results +__global__ void FillActualNeighbors(int64_t *vals, + int64_t *actual_vals, + int64_t *actual_vals_dst, + int *actual_sample_size, + int *cumsum_actual_sample_size, + int sample_size, + int len, + int mod) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + int offset1 = cumsum_actual_sample_size[i]; + int offset2 = sample_size * i; + int dst_id = i % mod; + for (int j = 0; j < actual_sample_size[i]; j++) { + actual_vals[offset1 + j] = vals[offset2 + j]; + actual_vals_dst[offset1 + j] = dst_id; + } } - int total_instance = AcquireInstance(&buf_state_); +} - VLOG(2) << "total_ins: " << total_instance; - buf_state_.Debug(); +int GraphDataGenerator::FillIdShowClkTensor(int total_instance, + bool gpu_graph_training, + size_t cursor) { + id_tensor_ptr_ = + feed_vec_[0]->mutable_data({total_instance, 1}, this->place_); + show_tensor_ptr_ = + feed_vec_[1]->mutable_data({total_instance}, this->place_); + clk_tensor_ptr_ = + feed_vec_[2]->mutable_data({total_instance}, this->place_); + if (gpu_graph_training) { + uint64_t *ins_cursor, *ins_buf; + ins_buf = reinterpret_cast(d_ins_buf_->ptr()); + ins_cursor = ins_buf + ins_buf_pair_len_ * 2 - total_instance; + cudaMemcpyAsync(id_tensor_ptr_, + ins_cursor, + sizeof(uint64_t) * total_instance, + cudaMemcpyDeviceToDevice, + train_stream_); + } else { + uint64_t *d_type_keys = + reinterpret_cast(d_device_keys_[cursor]->ptr()); + d_type_keys += infer_node_start_; + infer_node_start_ += total_instance / 2; + CopyDuplicateKeys<<>>( + id_tensor_ptr_, d_type_keys, total_instance / 2); + } - if (total_instance == 0) { - int res = FillWalkBuf(d_walk_); - if (!res) { - // graph iterate complete - return -1; - } else { - total_instance = buf_state_.len; - VLOG(2) << "total_ins: " << total_instance; - buf_state_.Debug(); - // if (total_instance == 0) { - // return -1; - //} - } + GraphFillCVMKernel<<>>(show_tensor_ptr_, total_instance); + GraphFillCVMKernel<<>>(clk_tensor_ptr_, total_instance); + return 0; +} - if (!FLAGS_enable_opt_get_features && slot_num_ > 0) { - FillFeatureBuf(d_walk_, d_feature_); - if (debug_mode_) { - int len = buf_size_ > 5000 ? 
5000 : buf_size_; - uint64_t h_walk[len]; // NOLINT - cudaMemcpy(h_walk, - d_walk_->ptr(), - len * sizeof(uint64_t), - cudaMemcpyDeviceToHost); - uint64_t h_feature[len * slot_num_]; // NOLINT - cudaMemcpy(h_feature, - d_feature_->ptr(), - len * slot_num_ * sizeof(uint64_t), - cudaMemcpyDeviceToHost); - for (int i = 0; i < len; ++i) { - std::stringstream ss; - for (int j = 0; j < slot_num_; ++j) { - ss << h_feature[i * slot_num_ + j] << " "; - } - VLOG(2) << "aft FillFeatureBuf, gpu[" << gpuid_ << "] walk[" << i - << "] = " << (uint64_t)h_walk[i] << " feature[" - << i * slot_num_ << ".." << (i + 1) * slot_num_ - << "] = " << ss.str(); - } - } - } +int GraphDataGenerator::FillGraphIdShowClkTensor(int uniq_instance, + int total_instance, + int index) { + id_tensor_ptr_ = + feed_vec_[0]->mutable_data({uniq_instance, 1}, this->place_); + show_tensor_ptr_ = + feed_vec_[1]->mutable_data({uniq_instance}, this->place_); + clk_tensor_ptr_ = + feed_vec_[2]->mutable_data({uniq_instance}, this->place_); + int index_offset = 3 + slot_num_ * 2 + 5 * samples_.size(); + index_tensor_ptr_ = feed_vec_[index_offset]->mutable_data( + {total_instance}, this->place_); + + int len_samples = samples_.size(); + int *num_nodes_tensor_ptr_[len_samples]; + int *next_num_nodes_tensor_ptr_[len_samples]; + int64_t *edges_src_tensor_ptr_[len_samples]; + int64_t *edges_dst_tensor_ptr_[len_samples]; + int *edges_split_tensor_ptr_[len_samples]; + + std::vector> edges_split_num_for_graph = + edges_split_num_vec_[index]; + std::vector> graph_edges = + graph_edges_vec_[index]; + for (int i = 0; i < len_samples; i++) { + int offset = 3 + 2 * slot_num_ + 5 * i; + std::vector edges_split_num = edges_split_num_for_graph[i]; + + int neighbor_len = edges_split_num[edge_to_id_len_ + 2]; + num_nodes_tensor_ptr_[i] = + feed_vec_[offset]->mutable_data({1}, this->place_); + next_num_nodes_tensor_ptr_[i] = + feed_vec_[offset + 1]->mutable_data({1}, this->place_); + edges_src_tensor_ptr_[i] = feed_vec_[offset + 2]->mutable_data( + {neighbor_len, 1}, this->place_); + edges_dst_tensor_ptr_[i] = feed_vec_[offset + 3]->mutable_data( + {neighbor_len, 1}, this->place_); + edges_split_tensor_ptr_[i] = feed_vec_[offset + 4]->mutable_data( + {edge_to_id_len_}, this->place_); + + // [edges_split_num, next_num_nodes, num_nodes, neighbor_len] + cudaMemcpyAsync(next_num_nodes_tensor_ptr_[i], + edges_split_num.data() + edge_to_id_len_, + sizeof(int), + cudaMemcpyHostToDevice, + train_stream_); + cudaMemcpyAsync(num_nodes_tensor_ptr_[i], + edges_split_num.data() + edge_to_id_len_ + 1, + sizeof(int), + cudaMemcpyHostToDevice, + train_stream_); + cudaMemcpyAsync(edges_split_tensor_ptr_[i], + edges_split_num.data(), + sizeof(int) * edge_to_id_len_, + cudaMemcpyHostToDevice, + train_stream_); + cudaMemcpyAsync(edges_src_tensor_ptr_[i], + graph_edges[i * 2]->ptr(), + sizeof(int64_t) * neighbor_len, + cudaMemcpyDeviceToDevice, + train_stream_); + cudaMemcpyAsync(edges_dst_tensor_ptr_[i], + graph_edges[i * 2 + 1]->ptr(), + sizeof(int64_t) * neighbor_len, + cudaMemcpyDeviceToDevice, + train_stream_); + } + + cudaMemcpyAsync(id_tensor_ptr_, + final_sage_nodes_vec_[index]->ptr(), + sizeof(int64_t) * uniq_instance, + cudaMemcpyDeviceToDevice, + train_stream_); + cudaMemcpyAsync(index_tensor_ptr_, + inverse_vec_[index]->ptr(), + sizeof(int) * total_instance, + cudaMemcpyDeviceToDevice, + train_stream_); + GraphFillCVMKernel<<>>(show_tensor_ptr_, uniq_instance); + GraphFillCVMKernel<<>>(clk_tensor_ptr_, uniq_instance); + return 0; +} + +int 
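FillGraphIdShowClkTensor assumes a fixed feed_vec_ layout in SAGE mode: entries 0-2 hold id/show/click, the next 2 * slot_num_ entries hold each slot's value and lod tensors, every sample layer then contributes five graph tensors, and the inverse-index tensor sits after all of them. The small helper below just restates that arithmetic from the code above; the struct and example numbers are illustrative.

#include <cstdio>

// Offsets into feed_vec_ as used by FillGraphIdShowClkTensor above.
struct SageFeedLayout {
  int slot_num;
  int num_sample_layers;  // samples_.size()

  // Per-layer block: [num_nodes, next_num_nodes, edges_src, edges_dst, edges_split]
  int LayerBlockStart(int layer) const { return 3 + 2 * slot_num + 5 * layer; }
  // The inverse-index tensor comes after all slot and graph tensors.
  int IndexOffset() const { return 3 + 2 * slot_num + 5 * num_sample_layers; }
};

int main() {
  SageFeedLayout layout{/*slot_num=*/4, /*num_sample_layers=*/2};
  std::printf("layer 0 starts at feed_vec[%d], layer 1 at feed_vec[%d], "
              "inverse index at feed_vec[%d]\n",
              layout.LayerBlockStart(0), layout.LayerBlockStart(1),
              layout.IndexOffset());  // 11, 16, 21
  return 0;
}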
GraphDataGenerator::FillGraphSlotFeature( + int total_instance, + bool gpu_graph_training, + std::shared_ptr final_sage_nodes) { + uint64_t *ins_cursor, *ins_buf; + if (gpu_graph_training) { + ins_buf = reinterpret_cast(d_ins_buf_->ptr()); + ins_cursor = ins_buf + ins_buf_pair_len_ * 2 - total_instance; + } else { + id_tensor_ptr_ = + feed_vec_[0]->mutable_data({total_instance, 1}, this->place_); + ins_cursor = reinterpret_cast(id_tensor_ptr_); } + if (!sage_mode_) { + return FillSlotFeature(ins_cursor, total_instance); + } else { + uint64_t *sage_nodes_ptr = + reinterpret_cast(final_sage_nodes->ptr()); + return FillSlotFeature(sage_nodes_ptr, total_instance); + } +} + +int GraphDataGenerator::MakeInsPair(cudaStream_t stream) { uint64_t *walk = reinterpret_cast(d_walk_->ptr()); uint64_t *ins_buf = reinterpret_cast(d_ins_buf_->ptr()); int *random_row = reinterpret_cast(d_random_row_->ptr()); int *d_pair_num = reinterpret_cast(d_pair_num_->ptr()); - cudaMemsetAsync(d_pair_num, 0, sizeof(int), stream_); + cudaMemsetAsync(d_pair_num, 0, sizeof(int), stream); int len = buf_state_.len; - GraphFillIdKernel<<>>( + // make pair + GraphFillIdKernel<<>>( ins_buf + ins_buf_pair_len_ * 2, d_pair_num, walk, @@ -406,28 +770,8 @@ int GraphDataGenerator::FillInsBuf() { walk_len_); int h_pair_num; cudaMemcpyAsync( - &h_pair_num, d_pair_num, sizeof(int), cudaMemcpyDeviceToHost, stream_); - if (!FLAGS_enable_opt_get_features && slot_num_ > 0) { - uint64_t *feature_buf = reinterpret_cast(d_feature_buf_->ptr()); - uint64_t *feature = reinterpret_cast(d_feature_->ptr()); - cudaMemsetAsync(d_pair_num, 0, sizeof(int), stream_); - int len = buf_state_.len; - VLOG(2) << "feature_buf start[" << ins_buf_pair_len_ * 2 * slot_num_ - << "] len[" << len << "]"; - GraphFillFeatureKernel<<>>( - feature_buf + ins_buf_pair_len_ * 2 * slot_num_, - d_pair_num, - walk, - feature, - random_row + buf_state_.cursor, - buf_state_.central_word, - window_step_[buf_state_.step], - len, - walk_len_, - slot_num_); - } - - cudaStreamSynchronize(stream_); + &h_pair_num, d_pair_num, sizeof(int), cudaMemcpyDeviceToHost, stream); + cudaStreamSynchronize(stream); ins_buf_pair_len_ += h_pair_num; if (debug_mode_) { @@ -441,213 +785,94 @@ int GraphDataGenerator::FillInsBuf() { for (int xx = 0; xx < 2 * ins_buf_pair_len_; xx++) { VLOG(2) << "h_ins_buf[" << xx << "]: " << h_ins_buf[xx]; } - delete[] h_ins_buf; - - if (!FLAGS_enable_opt_get_features && slot_num_ > 0) { - uint64_t *feature_buf = - reinterpret_cast(d_feature_buf_->ptr()); - uint64_t h_feature_buf[(batch_size_ * 2 * 2) * slot_num_]; // NOLINT - cudaMemcpy(h_feature_buf, - feature_buf, - (batch_size_ * 2 * 2) * slot_num_ * sizeof(uint64_t), - cudaMemcpyDeviceToHost); - for (int xx = 0; xx < (batch_size_ * 2 * 2) * slot_num_; xx++) { - VLOG(2) << "h_feature_buf[" << xx << "]: " << h_feature_buf[xx]; - } - } } return ins_buf_pair_len_; } +int GraphDataGenerator::FillInsBuf(cudaStream_t stream) { + if (ins_buf_pair_len_ >= batch_size_) { + return batch_size_; + } + int total_instance = AcquireInstance(&buf_state_); + + VLOG(2) << "total_ins: " << total_instance; + buf_state_.Debug(); + + if (total_instance == 0) { + return -1; + } + return MakeInsPair(stream); +} + int GraphDataGenerator::GenerateBatch() { int total_instance = 0; platform::CUDADeviceGuard guard(gpuid_); int res = 0; if (!gpu_graph_training_) { - while (cursor_ < h_device_keys_.size()) { - size_t device_key_size = h_device_keys_[cursor_]->size(); - if (infer_node_type_start_[cursor_] >= device_key_size) { - cursor_++; - 
continue; - } - total_instance = - (infer_node_type_start_[cursor_] + batch_size_ <= device_key_size) - ? batch_size_ - : device_key_size - infer_node_type_start_[cursor_]; - uint64_t *d_type_keys = - reinterpret_cast(d_device_keys_[cursor_]->ptr()); - d_type_keys += infer_node_type_start_[cursor_]; - infer_node_type_start_[cursor_] += total_instance; + if (!sage_mode_) { + total_instance = (infer_node_start_ + batch_size_ <= infer_node_end_) + ? batch_size_ + : infer_node_end_ - infer_node_start_; VLOG(1) << "in graph_data generator:batch_size = " << batch_size_ << " instance = " << total_instance; total_instance *= 2; - id_tensor_ptr_ = feed_vec_[0]->mutable_data({total_instance, 1}, - this->place_); - show_tensor_ptr_ = - feed_vec_[1]->mutable_data({total_instance}, this->place_); - clk_tensor_ptr_ = - feed_vec_[2]->mutable_data({total_instance}, this->place_); - CopyDuplicateKeys<<>>( - id_tensor_ptr_, d_type_keys, total_instance / 2); - GraphFillCVMKernel<<>>(show_tensor_ptr_, total_instance); - GraphFillCVMKernel<<>>(clk_tensor_ptr_, total_instance); - break; - } - if (total_instance == 0) { - return 0; - } - } else { - while (ins_buf_pair_len_ < batch_size_) { - res = FillInsBuf(); - if (res == -1) { - if (ins_buf_pair_len_ == 0) { - return 0; - } else { - break; - } + if (total_instance == 0) { + return 0; } + FillIdShowClkTensor(total_instance, gpu_graph_training_, cursor_); + } else { + if (sage_batch_count_ == sage_batch_num_) { + return 0; + } + FillGraphIdShowClkTensor(uniq_instance_vec_[sage_batch_count_], + total_instance_vec_[sage_batch_count_], + sage_batch_count_); } - total_instance = - ins_buf_pair_len_ < batch_size_ ? ins_buf_pair_len_ : batch_size_; - - total_instance *= 2; - id_tensor_ptr_ = - feed_vec_[0]->mutable_data({total_instance, 1}, this->place_); - show_tensor_ptr_ = - feed_vec_[1]->mutable_data({total_instance}, this->place_); - clk_tensor_ptr_ = - feed_vec_[2]->mutable_data({total_instance}, this->place_); - } - - int64_t *slot_tensor_ptr_[slot_num_]; - int64_t *slot_lod_tensor_ptr_[slot_num_]; - if (slot_num_ > 0) { - for (int i = 0; i < slot_num_; ++i) { - slot_tensor_ptr_[i] = feed_vec_[3 + 2 * i]->mutable_data( - {total_instance, 1}, this->place_); - slot_lod_tensor_ptr_[i] = feed_vec_[3 + 2 * i + 1]->mutable_data( - {total_instance + 1}, this->place_); - } - if (FLAGS_enable_opt_get_features || !gpu_graph_training_) { - cudaMemcpyAsync(d_slot_tensor_ptr_->ptr(), - slot_tensor_ptr_, - sizeof(uint64_t *) * slot_num_, - cudaMemcpyHostToDevice, - stream_); - cudaMemcpyAsync(d_slot_lod_tensor_ptr_->ptr(), - slot_lod_tensor_ptr_, - sizeof(uint64_t *) * slot_num_, - cudaMemcpyHostToDevice, - stream_); - } - } - - uint64_t *ins_cursor, *ins_buf; - if (gpu_graph_training_) { - VLOG(2) << "total_instance: " << total_instance - << ", ins_buf_pair_len = " << ins_buf_pair_len_; - // uint64_t *ins_buf = reinterpret_cast(d_ins_buf_->ptr()); - // uint64_t *ins_cursor = ins_buf + ins_buf_pair_len_ * 2 - total_instance; - ins_buf = reinterpret_cast(d_ins_buf_->ptr()); - ins_cursor = ins_buf + ins_buf_pair_len_ * 2 - total_instance; - cudaMemcpyAsync(id_tensor_ptr_, - ins_cursor, - sizeof(uint64_t) * total_instance, - cudaMemcpyDeviceToDevice, - stream_); - - GraphFillCVMKernel<<>>(show_tensor_ptr_, total_instance); - GraphFillCVMKernel<<>>(clk_tensor_ptr_, total_instance); } else { - ins_cursor = (uint64_t *)id_tensor_ptr_; // NOLINT - } - - if (slot_num_ > 0) { - uint64_t *feature_buf = reinterpret_cast(d_feature_buf_->ptr()); - if (FLAGS_enable_opt_get_features || 
!gpu_graph_training_) { - FillFeatureBuf(ins_cursor, feature_buf, total_instance); - // FillFeatureBuf(id_tensor_ptr_, feature_buf, total_instance); - if (debug_mode_) { - uint64_t h_walk[total_instance]; // NOLINT - cudaMemcpy(h_walk, - ins_cursor, - total_instance * sizeof(uint64_t), - cudaMemcpyDeviceToHost); - uint64_t h_feature[total_instance * slot_num_]; // NOLINT - cudaMemcpy(h_feature, - feature_buf, - total_instance * slot_num_ * sizeof(uint64_t), - cudaMemcpyDeviceToHost); - for (int i = 0; i < total_instance; ++i) { - std::stringstream ss; - for (int j = 0; j < slot_num_; ++j) { - ss << h_feature[i * slot_num_ + j] << " "; + if (!sage_mode_) { + while (ins_buf_pair_len_ < batch_size_) { + res = FillInsBuf(train_stream_); + if (res == -1) { + if (ins_buf_pair_len_ == 0) { + return 0; + } else { + break; } - VLOG(2) << "aft FillFeatureBuf, gpu[" << gpuid_ << "] walk[" << i - << "] = " << (uint64_t)h_walk[i] << " feature[" - << i * slot_num_ << ".." << (i + 1) * slot_num_ - << "] = " << ss.str(); } } - - GraphFillSlotKernel<<>>( - (uint64_t *)d_slot_tensor_ptr_->ptr(), // NOLINT - feature_buf, - total_instance * slot_num_, - total_instance, - slot_num_); - GraphFillSlotLodKernelOpt<<>>( - (uint64_t *)d_slot_lod_tensor_ptr_->ptr(), // NOLINT - (total_instance + 1) * slot_num_, - total_instance + 1); + total_instance = + ins_buf_pair_len_ < batch_size_ ? ins_buf_pair_len_ : batch_size_; + total_instance *= 2; + VLOG(2) << "total_instance: " << total_instance + << ", ins_buf_pair_len = " << ins_buf_pair_len_; + FillIdShowClkTensor(total_instance, gpu_graph_training_); } else { - for (int i = 0; i < slot_num_; ++i) { - int feature_buf_offset = - (ins_buf_pair_len_ * 2 - total_instance) * slot_num_ + i * 2; - for (int j = 0; j < total_instance; j += 2) { - VLOG(2) << "slot_tensor[" << i << "][" << j << "] <- feature_buf[" - << feature_buf_offset + j * slot_num_ << "]"; - VLOG(2) << "slot_tensor[" << i << "][" << j + 1 << "] <- feature_buf[" - << feature_buf_offset + j * slot_num_ + 1 << "]"; - cudaMemcpyAsync(slot_tensor_ptr_[i] + j, - &feature_buf[feature_buf_offset + j * slot_num_], - sizeof(uint64_t) * 2, - cudaMemcpyDeviceToDevice, - stream_); - } - GraphFillSlotLodKernel<<>>(slot_lod_tensor_ptr_[i], - total_instance + 1); + if (sage_batch_count_ == sage_batch_num_) { + return 0; } + FillGraphIdShowClkTensor(uniq_instance_vec_[sage_batch_count_], + total_instance_vec_[sage_batch_count_], + sage_batch_count_); } } + if (slot_num_ > 0) { + if (!sage_mode_) { + FillGraphSlotFeature(total_instance, gpu_graph_training_); + } else { + FillGraphSlotFeature(uniq_instance_vec_[sage_batch_count_], + gpu_graph_training_, + final_sage_nodes_vec_[sage_batch_count_]); + } + } offset_.clear(); offset_.push_back(0); - offset_.push_back(total_instance); + if (!sage_mode_) { + offset_.push_back(total_instance); + } else { + offset_.push_back(uniq_instance_vec_[sage_batch_count_]); + sage_batch_count_ += 1; + } LoD lod{offset_}; feed_vec_[0]->set_lod(lod); if (slot_num_ > 0) { @@ -656,35 +881,11 @@ int GraphDataGenerator::GenerateBatch() { } } - cudaStreamSynchronize(stream_); + cudaStreamSynchronize(train_stream_); if (!gpu_graph_training_) return 1; - ins_buf_pair_len_ -= total_instance / 2; - if (debug_mode_) { - uint64_t h_slot_tensor[slot_num_][total_instance]; - uint64_t h_slot_lod_tensor[slot_num_][total_instance + 1]; - for (int i = 0; i < slot_num_; ++i) { - cudaMemcpy(h_slot_tensor[i], - slot_tensor_ptr_[i], - total_instance * sizeof(uint64_t), - cudaMemcpyDeviceToHost); - int len = 
total_instance > 5000 ? 5000 : total_instance; - for (int j = 0; j < len; ++j) { - VLOG(2) << "gpu[" << gpuid_ << "] slot_tensor[" << i << "][" << j - << "] = " << h_slot_tensor[i][j]; - } - - cudaMemcpy(h_slot_lod_tensor[i], - slot_lod_tensor_ptr_[i], - (total_instance + 1) * sizeof(uint64_t), - cudaMemcpyDeviceToHost); - len = total_instance + 1 > 5000 ? 5000 : total_instance + 1; - for (int j = 0; j < len; ++j) { - VLOG(2) << "gpu[" << gpuid_ << "] slot_lod_tensor[" << i << "][" << j - << "] = " << h_slot_lod_tensor[i][j]; - } - } + if (!sage_mode_) { + ins_buf_pair_len_ -= total_instance / 2; } - return 1; } @@ -751,6 +952,123 @@ __global__ void GraphFillFirstStepKernel(int *prefix_sum, } } +__global__ void get_each_ins_info(uint8_t *slot_list, + uint32_t *slot_size_list, + uint32_t *slot_size_prefix, + uint32_t *each_ins_slot_num, + uint32_t *each_ins_slot_num_inner_prefix, + size_t key_num, + int slot_num) { + const size_t i = blockIdx.x * blockDim.y + threadIdx.y; + if (i < key_num) { + uint32_t slot_index = slot_size_prefix[i]; + size_t each_ins_slot_index = i * slot_num; + for (int j = 0; j < slot_size_list[i]; j++) { + each_ins_slot_num[each_ins_slot_index + slot_list[slot_index + j]] += 1; + } + each_ins_slot_num_inner_prefix[each_ins_slot_index] = 0; + for (int j = 1; j < slot_num; j++) { + each_ins_slot_num_inner_prefix[each_ins_slot_index + j] = + each_ins_slot_num[each_ins_slot_index + j - 1] + + each_ins_slot_num_inner_prefix[each_ins_slot_index + j - 1]; + } + } +} + +__global__ void fill_slot_num(uint32_t *d_each_ins_slot_num_ptr, + uint64_t **d_ins_slot_num_vector_ptr, + size_t key_num, + int slot_num) { + const size_t i = blockIdx.x * blockDim.y + threadIdx.y; + if (i < key_num) { + size_t d_each_index = i * slot_num; + for (int j = 0; j < slot_num; j++) { + d_ins_slot_num_vector_ptr[j][i] = + d_each_ins_slot_num_ptr[d_each_index + j]; + } + } +} + +__global__ void fill_slot_tensor(uint64_t *feature_list, + uint32_t *feature_size_prefixsum, + uint32_t *each_ins_slot_num_inner_prefix, + uint64_t *ins_slot_num, + int64_t *slot_lod_tensor, + int64_t *slot_tensor, + int slot, + int slot_num, + size_t node_num) { + const size_t i = blockIdx.x * blockDim.y + threadIdx.y; + if (i < node_num) { + size_t dst_index = slot_lod_tensor[i]; + size_t src_index = feature_size_prefixsum[i] + + each_ins_slot_num_inner_prefix[slot_num * i + slot]; + for (uint64_t j = 0; j < ins_slot_num[i]; j++) { + slot_tensor[dst_index + j] = feature_list[src_index + j]; + } + } +} + +__global__ void GetUniqueFeaNum(uint64_t *d_in, + uint64_t *unique_num, + size_t len) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + __shared__ uint64_t local_num; + if (threadIdx.x == 0) { + local_num = 0; + } + __syncthreads(); + + if (i < len - 1) { + if (d_in[i] != d_in[i + 1]) { + atomicAdd(&local_num, 1); + } + } + if (i == len - 1) { + atomicAdd(&local_num, 1); + } + + __syncthreads(); + if (threadIdx.x == 0) { + atomicAdd(unique_num, local_num); + } +} + +__global__ void UniqueFeature(uint64_t *d_in, + uint64_t *d_out, + uint64_t *unique_num, + size_t len) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + __shared__ uint64_t local_key[CUDA_NUM_THREADS]; + __shared__ uint64_t local_num; + __shared__ uint64_t global_num; + if (threadIdx.x == 0) { + local_num = 0; + } + __syncthreads(); + + if (i < len - 1) { + if (d_in[i] != d_in[i + 1]) { + size_t dst = atomicAdd(&local_num, 1); + local_key[dst] = d_in[i]; + } + } + if (i == len - 1) { + size_t dst = atomicAdd(&local_num, 1); + 
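GetUniqueFeaNum and UniqueFeature both use the same reduction shape: each thread updates a block-local counter in shared memory with atomicAdd, and only thread 0 folds the block's partial result into the global counter, reducing global atomics from one per element to one per block. A stripped-down CUDA kernel with the same shape is shown below; it counts even numbers rather than unique keys, and the sizes are illustrative.

#include <cstdio>

__global__ void CountEven(const int* data, int len,
                          unsigned long long* global_count) {
  __shared__ unsigned long long local_count;
  if (threadIdx.x == 0) local_count = 0;
  __syncthreads();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < len && data[i] % 2 == 0) {
    atomicAdd(&local_count, 1ULL);  // cheap shared-memory atomic
  }
  __syncthreads();

  if (threadIdx.x == 0) {
    atomicAdd(global_count, local_count);  // one global atomic per block
  }
}

int main() {
  const int n = 1 << 16;
  int* d_data = nullptr;
  unsigned long long* d_count = nullptr;
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMalloc(&d_count, sizeof(unsigned long long));
  cudaMemset(d_data, 0, n * sizeof(int));  // all zeros -> every element is even
  cudaMemset(d_count, 0, sizeof(unsigned long long));

  CountEven<<<(n + 255) / 256, 256>>>(d_data, n, d_count);

  unsigned long long h_count = 0;
  cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);
  std::printf("even elements: %llu\n", h_count);  // 65536
  cudaFree(d_data);
  cudaFree(d_count);
  return 0;
}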
local_key[dst] = d_in[i]; + } + + __syncthreads(); + + if (threadIdx.x == 0) { + global_num = atomicAdd(unique_num, local_num); + } + __syncthreads(); + + if (threadIdx.x < local_num) { + d_out[global_num + threadIdx.x] = local_key[threadIdx.x]; + } +} // Fill sample_res to the stepth column of walk void GraphDataGenerator::FillOneStep(uint64_t *d_start_ids, uint64_t *walk, @@ -774,45 +1092,50 @@ void GraphDataGenerator::FillOneStep(uint64_t *d_start_ids, d_actual_sample_size, d_prefix_sum + 1, len, - stream_)); - auto d_temp_storage = memory::Alloc(place_, temp_storage_bytes); + sample_stream_)); + auto d_temp_storage = memory::Alloc( + place_, + temp_storage_bytes, + phi::Stream(reinterpret_cast(sample_stream_))); CUDA_CHECK(cub::DeviceScan::InclusiveSum(d_temp_storage->ptr(), temp_storage_bytes, d_actual_sample_size, d_prefix_sum + 1, len, - stream_)); + sample_stream_)); - cudaStreamSynchronize(stream_); + cudaStreamSynchronize(sample_stream_); if (step == 1) { - GraphFillFirstStepKernel<<>>( - d_prefix_sum, - d_tmp_sampleidx2row, - walk, - d_start_ids, - len, - walk_degree_, - walk_len_, - d_actual_sample_size, - d_neighbors, - d_sample_keys); + GraphFillFirstStepKernel<<>>(d_prefix_sum, + d_tmp_sampleidx2row, + walk, + d_start_ids, + len, + walk_degree_, + walk_len_, + d_actual_sample_size, + d_neighbors, + d_sample_keys); } else { GraphFillSampleKeysKernel<<>>(d_neighbors, - d_sample_keys, - d_prefix_sum, - d_sampleidx2row, - d_tmp_sampleidx2row, - d_actual_sample_size, - cur_degree, - len); - - GraphDoWalkKernel<<>>( + sample_stream_>>>(d_neighbors, + d_sample_keys, + d_prefix_sum, + d_sampleidx2row, + d_tmp_sampleidx2row, + d_actual_sample_size, + cur_degree, + len); + + GraphDoWalkKernel<<>>( d_neighbors, walk, d_prefix_sum, @@ -829,7 +1152,6 @@ void GraphDataGenerator::FillOneStep(uint64_t *d_start_ids, int *h_prefix_sum = new int[len + 1]; int *h_actual_size = new int[len]; int *h_offset2idx = new int[once_max_sample_keynum]; - uint64_t h_sample_keys[once_max_sample_keynum]; // NOLINT cudaMemcpy(h_offset2idx, d_tmp_sampleidx2row, once_max_sample_keynum * sizeof(int), @@ -848,12 +1170,272 @@ void GraphDataGenerator::FillOneStep(uint64_t *d_start_ids, delete[] h_prefix_sum; delete[] h_actual_size; delete[] h_offset2idx; - delete[] h_sample_keys; } - cudaStreamSynchronize(stream_); + cudaStreamSynchronize(sample_stream_); cur_sampleidx2row_ = 1 - cur_sampleidx2row_; } +int GraphDataGenerator::FillSlotFeature(uint64_t *d_walk, size_t key_num) { + platform::CUDADeviceGuard guard(gpuid_); + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + std::shared_ptr d_feature_list; + std::shared_ptr d_slot_list; + + if (sage_mode_) { + size_t temp_storage_bytes = (key_num + 1) * sizeof(uint32_t); + if (d_feature_size_list_buf_ == NULL || + d_feature_size_list_buf_->size() < temp_storage_bytes) { + d_feature_size_list_buf_ = + memory::AllocShared(this->place_, temp_storage_bytes); + } + if (d_feature_size_prefixsum_buf_ == NULL || + d_feature_size_prefixsum_buf_->size() < temp_storage_bytes) { + d_feature_size_prefixsum_buf_ = + memory::AllocShared(this->place_, temp_storage_bytes); + } + } + + uint32_t *d_feature_size_list_ptr = + reinterpret_cast(d_feature_size_list_buf_->ptr()); + uint32_t *d_feature_size_prefixsum_ptr = + reinterpret_cast(d_feature_size_prefixsum_buf_->ptr()); + int fea_num = + gpu_graph_ptr->get_feature_info_of_nodes(gpuid_, + d_walk, + key_num, + d_feature_size_list_ptr, + d_feature_size_prefixsum_ptr, + d_feature_list, + d_slot_list); + int64_t 
*slot_tensor_ptr_[slot_num_]; + int64_t *slot_lod_tensor_ptr_[slot_num_]; + if (fea_num == 0) { + int64_t default_lod = 1; + for (int i = 0; i < slot_num_; ++i) { + slot_lod_tensor_ptr_[i] = feed_vec_[3 + 2 * i + 1]->mutable_data( + {(long)key_num + 1}, this->place_); // NOLINT + slot_tensor_ptr_[i] = + feed_vec_[3 + 2 * i]->mutable_data({1, 1}, this->place_); + CUDA_CHECK(cudaMemsetAsync( + slot_tensor_ptr_[i], 0, sizeof(int64_t), train_stream_)); + CUDA_CHECK(cudaMemsetAsync(slot_lod_tensor_ptr_[i], + 0, + sizeof(int64_t) * key_num, + train_stream_)); + CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(slot_lod_tensor_ptr_[i] + key_num), + &default_lod, + sizeof(int64_t), + cudaMemcpyHostToDevice, + train_stream_)); + } + CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + return 0; + } + + uint64_t *d_feature_list_ptr = + reinterpret_cast(d_feature_list->ptr()); + uint8_t *d_slot_list_ptr = reinterpret_cast(d_slot_list->ptr()); + + std::shared_ptr d_each_ins_slot_num_inner_prefix = + memory::AllocShared(place_, (slot_num_ * key_num) * sizeof(uint32_t)); + std::shared_ptr d_each_ins_slot_num = + memory::AllocShared(place_, (slot_num_ * key_num) * sizeof(uint32_t)); + uint32_t *d_each_ins_slot_num_ptr = + reinterpret_cast(d_each_ins_slot_num->ptr()); + uint32_t *d_each_ins_slot_num_inner_prefix_ptr = + reinterpret_cast(d_each_ins_slot_num_inner_prefix->ptr()); + CUDA_CHECK(cudaMemsetAsync(d_each_ins_slot_num_ptr, + 0, + slot_num_ * key_num * sizeof(uint32_t), + train_stream_)); + + dim3 grid((key_num - 1) / 256 + 1); + dim3 block(1, 256); + + get_each_ins_info<<>>( + d_slot_list_ptr, + d_feature_size_list_ptr, + d_feature_size_prefixsum_ptr, + d_each_ins_slot_num_ptr, + d_each_ins_slot_num_inner_prefix_ptr, + key_num, + slot_num_); + + std::vector> ins_slot_num(slot_num_, + nullptr); + std::vector ins_slot_num_vecotr(slot_num_, NULL); + std::shared_ptr d_ins_slot_num_vector = + memory::AllocShared(place_, (slot_num_) * sizeof(uint64_t *)); + uint64_t **d_ins_slot_num_vector_ptr = + reinterpret_cast(d_ins_slot_num_vector->ptr()); + for (int i = 0; i < slot_num_; i++) { + ins_slot_num[i] = memory::AllocShared(place_, key_num * sizeof(uint64_t)); + ins_slot_num_vecotr[i] = + reinterpret_cast(ins_slot_num[i]->ptr()); + } + CUDA_CHECK( + cudaMemcpyAsync(reinterpret_cast(d_ins_slot_num_vector_ptr), + ins_slot_num_vecotr.data(), + sizeof(uint64_t *) * slot_num_, + cudaMemcpyHostToDevice, + train_stream_)); + fill_slot_num<<>>( + d_each_ins_slot_num_ptr, d_ins_slot_num_vector_ptr, key_num, slot_num_); + CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + + for (int i = 0; i < slot_num_; ++i) { + slot_lod_tensor_ptr_[i] = feed_vec_[3 + 2 * i + 1]->mutable_data( + {(long)key_num + 1}, this->place_); // NOLINT + } + size_t temp_storage_bytes = 0; + CUDA_CHECK(cub::DeviceScan::InclusiveSum(NULL, + temp_storage_bytes, + ins_slot_num_vecotr[0], + slot_lod_tensor_ptr_[0] + 1, + key_num, + train_stream_)); + CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + auto d_temp_storage = memory::Alloc( + this->place_, + temp_storage_bytes, + phi::Stream(reinterpret_cast(train_stream_))); + std::vector each_slot_fea_num(slot_num_, 0); + for (int i = 0; i < slot_num_; ++i) { + CUDA_CHECK(cudaMemsetAsync( + slot_lod_tensor_ptr_[i], 0, sizeof(uint64_t), train_stream_)); + CUDA_CHECK(cub::DeviceScan::InclusiveSum(d_temp_storage->ptr(), + temp_storage_bytes, + ins_slot_num_vecotr[i], + slot_lod_tensor_ptr_[i] + 1, + key_num, + train_stream_)); + CUDA_CHECK(cudaMemcpyAsync(&each_slot_fea_num[i], + slot_lod_tensor_ptr_[i] 
+ key_num, + sizeof(uint64_t), + cudaMemcpyDeviceToHost, + train_stream_)); + } + CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + for (int i = 0; i < slot_num_; ++i) { + slot_tensor_ptr_[i] = feed_vec_[3 + 2 * i]->mutable_data( + {each_slot_fea_num[i], 1}, this->place_); + } + int64_t default_lod = 1; + for (int i = 0; i < slot_num_; ++i) { + fill_slot_tensor<<>>( + d_feature_list_ptr, + d_feature_size_prefixsum_ptr, + d_each_ins_slot_num_inner_prefix_ptr, + ins_slot_num_vecotr[i], + slot_lod_tensor_ptr_[i], + slot_tensor_ptr_[i], + i, + slot_num_, + key_num); + // trick for empty tensor + if (each_slot_fea_num[i] == 0) { + slot_tensor_ptr_[i] = + feed_vec_[3 + 2 * i]->mutable_data({1, 1}, this->place_); + CUDA_CHECK(cudaMemsetAsync( + slot_tensor_ptr_[i], 0, sizeof(uint64_t), train_stream_)); + CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(slot_lod_tensor_ptr_[i] + key_num), + &default_lod, + sizeof(int64_t), + cudaMemcpyHostToDevice, + train_stream_)); + } + } + CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + + if (debug_mode_) { + std::vector h_feature_size_list(key_num, 0); + std::vector h_feature_size_list_prefixsum(key_num, 0); + std::vector node_list(key_num, 0); + std::vector h_feature_list(fea_num, 0); + std::vector h_slot_list(fea_num, 0); + + CUDA_CHECK( + cudaMemcpyAsync(reinterpret_cast(h_feature_size_list.data()), + d_feature_size_list_ptr, + sizeof(uint32_t) * key_num, + cudaMemcpyDeviceToHost, + train_stream_)); + CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(h_feature_size_list_prefixsum.data()), + d_feature_size_prefixsum_ptr, + sizeof(uint32_t) * key_num, + cudaMemcpyDeviceToHost, + train_stream_)); + CUDA_CHECK(cudaMemcpyAsync(reinterpret_cast(node_list.data()), + d_walk, + sizeof(uint64_t) * key_num, + cudaMemcpyDeviceToHost, + train_stream_)); + + CUDA_CHECK(cudaMemcpyAsync(reinterpret_cast(h_feature_list.data()), + d_feature_list_ptr, + sizeof(uint64_t) * fea_num, + cudaMemcpyDeviceToHost, + train_stream_)); + CUDA_CHECK(cudaMemcpyAsync(reinterpret_cast(h_slot_list.data()), + d_slot_list_ptr, + sizeof(uint8_t) * fea_num, + cudaMemcpyDeviceToHost, + train_stream_)); + + CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + for (size_t i = 0; i < key_num; i++) { + std::stringstream ss; + ss << "node_id: " << node_list[i] + << " fea_num: " << h_feature_size_list[i] << " offset " + << h_feature_size_list_prefixsum[i] << " slot: "; + for (uint32_t j = 0; j < h_feature_size_list[i]; j++) { + ss << int(h_slot_list[h_feature_size_list_prefixsum[i] + j]) << " : " + << h_feature_list[h_feature_size_list_prefixsum[i] + j] << " "; + } + VLOG(0) << ss.str(); + } + VLOG(0) << "all fea_num is " << fea_num << " calc fea_num is " + << h_feature_size_list[key_num - 1] + + h_feature_size_list_prefixsum[key_num - 1]; + for (int i = 0; i < slot_num_; ++i) { + std::vector h_slot_lod_tensor(key_num + 1, 0); + CUDA_CHECK( + cudaMemcpyAsync(reinterpret_cast(h_slot_lod_tensor.data()), + slot_lod_tensor_ptr_[i], + sizeof(int64_t) * (key_num + 1), + cudaMemcpyDeviceToHost, + train_stream_)); + CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + std::stringstream ss_lod; + std::stringstream ss_tensor; + ss_lod << " slot " << i << " lod is ["; + for (size_t j = 0; j < key_num + 1; j++) { + ss_lod << h_slot_lod_tensor[j] << ","; + } + ss_lod << "]"; + std::vector h_slot_tensor(h_slot_lod_tensor[key_num], 0); + CUDA_CHECK(cudaMemcpyAsync(reinterpret_cast(h_slot_tensor.data()), + slot_tensor_ptr_[i], + sizeof(int64_t) * h_slot_lod_tensor[key_num], + cudaMemcpyDeviceToHost, + train_stream_)); 
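Because each node carries a variable number of features per slot, FillSlotFeature builds every slot's LoD as an inclusive prefix sum of that slot's per-instance counts (with lod[0] = 0), sizes the value tensor to the final total, and scatters instance i's features to the range [lod[i], lod[i+1]). The CPU reference below reproduces that LoD construction for a single slot with made-up counts.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  // Number of features this slot has for each of 5 instances (nodes).
  std::vector<int64_t> per_ins_count = {2, 0, 3, 1, 0};

  // lod[0] = 0, lod[i + 1] = lod[i] + count[i]  (inclusive scan shifted by one).
  std::vector<int64_t> lod(per_ins_count.size() + 1, 0);
  for (size_t i = 0; i < per_ins_count.size(); ++i) {
    lod[i + 1] = lod[i] + per_ins_count[i];
  }

  // The value tensor holds lod.back() features; instance i owns [lod[i], lod[i+1]).
  std::vector<uint64_t> values(lod.back());
  for (size_t i = 0; i < per_ins_count.size(); ++i) {
    for (int64_t j = 0; j < per_ins_count[i]; ++j) {
      values[lod[i] + j] = 1000 * i + j;  // stand-in for the real feature ids
    }
  }

  for (int64_t v : lod) std::printf("%lld ", static_cast<long long>(v));  // 0 2 2 5 6 6
  std::printf("\n");
  return 0;
}

The "trick for empty tensor" branch above exists because a slot with zero features still needs a 1x1 placeholder value tensor and a valid lod ending in 1, which is what the default_lod copy provides.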
+ CUDA_CHECK(cudaStreamSynchronize(train_stream_)); + + ss_tensor << " tensor is [ "; + for (size_t j = 0; j < h_slot_lod_tensor[key_num]; j++) { + ss_tensor << h_slot_tensor[j] << ","; + } + ss_tensor << "]"; + VLOG(0) << ss_lod.str() << " " << ss_tensor.str(); + } + } + + return 0; +} + int GraphDataGenerator::FillFeatureBuf(uint64_t *d_walk, uint64_t *d_feature, size_t key_num) { @@ -861,7 +1443,13 @@ int GraphDataGenerator::FillFeatureBuf(uint64_t *d_walk, auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); int ret = gpu_graph_ptr->get_feature_of_nodes( - gpuid_, d_walk, d_feature, key_num, slot_num_); + gpuid_, + d_walk, + d_feature, + key_num, + slot_num_, + reinterpret_cast(d_slot_feature_num_map_->ptr()), + fea_num_per_node_); return ret; } @@ -873,14 +1461,628 @@ int GraphDataGenerator::FillFeatureBuf( auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); int ret = gpu_graph_ptr->get_feature_of_nodes( gpuid_, - (uint64_t *)d_walk->ptr(), // NOLINT - (uint64_t *)d_feature->ptr(), // NOLINT + reinterpret_cast(d_walk->ptr()), + reinterpret_cast(d_feature->ptr()), buf_size_, - slot_num_); + slot_num_, + reinterpret_cast(d_slot_feature_num_map_->ptr()), + fea_num_per_node_); return ret; } -int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { +// 对于deepwalk模式,尝试插入table,0表示插入成功,1表示插入失败; +// 对于sage模式,尝试插入table,table数量不够则清空table重新插入,返回值无影响。 +int GraphDataGenerator::InsertTable( + const uint64_t *d_keys, + uint64_t len, + std::shared_ptr d_uniq_node_num) { + // Used under NOT WHOLE_HBM. + uint64_t h_uniq_node_num = 0; + uint64_t *d_uniq_node_num_ptr = + reinterpret_cast(d_uniq_node_num->ptr()); + cudaMemcpyAsync(&h_uniq_node_num, + d_uniq_node_num_ptr, + sizeof(uint64_t), + cudaMemcpyDeviceToHost, + sample_stream_); + cudaStreamSynchronize(sample_stream_); + + if (gpu_graph_training_) { + VLOG(2) << "table capacity: " << train_table_cap_ << ", " << h_uniq_node_num + << " used"; + if (h_uniq_node_num + len >= train_table_cap_) { + if (!sage_mode_) { + return 1; + } else { + // Copy unique nodes first. + uint64_t copy_len = CopyUniqueNodes(); + copy_unique_len_ += copy_len; + table_->clear(sample_stream_); + cudaMemsetAsync( + d_uniq_node_num_ptr, 0, sizeof(uint64_t), sample_stream_); + } + } + } else { + // used only for sage_mode. 
+ if (h_uniq_node_num + len >= infer_table_cap_) { + uint64_t copy_len = CopyUniqueNodes(); + copy_unique_len_ += copy_len; + table_->clear(sample_stream_); + cudaMemsetAsync(d_uniq_node_num_ptr, 0, sizeof(uint64_t), sample_stream_); + } + } + + table_->insert(d_keys, len, d_uniq_node_num_ptr, sample_stream_); + CUDA_CHECK(cudaStreamSynchronize(sample_stream_)); + return 0; +} + +std::vector> +GraphDataGenerator::SampleNeighbors(int64_t *uniq_nodes, + int len, + int sample_size, + std::vector &edges_split_num, + int64_t *neighbor_len) { + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + auto sample_res = gpu_graph_ptr->graph_neighbor_sample_all_edge_type( + gpuid_, + edge_to_id_len_, + reinterpret_cast(uniq_nodes), + sample_size, + len, + edge_type_graph_); + + int *all_sample_count_ptr = + reinterpret_cast(sample_res.actual_sample_size_mem->ptr()); + + auto cumsum_actual_sample_size = memory::Alloc( + place_, + (len * edge_to_id_len_ + 1) * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); + int *cumsum_actual_sample_size_ptr = + reinterpret_cast(cumsum_actual_sample_size->ptr()); + cudaMemsetAsync(cumsum_actual_sample_size_ptr, + 0, + (len * edge_to_id_len_ + 1) * sizeof(int), + sample_stream_); + + size_t temp_storage_bytes = 0; + CUDA_CHECK(cub::DeviceScan::InclusiveSum(NULL, + temp_storage_bytes, + all_sample_count_ptr, + cumsum_actual_sample_size_ptr + 1, + len * edge_to_id_len_, + sample_stream_)); + auto d_temp_storage = memory::Alloc( + place_, + temp_storage_bytes, + phi::Stream(reinterpret_cast(sample_stream_))); + CUDA_CHECK(cub::DeviceScan::InclusiveSum(d_temp_storage->ptr(), + temp_storage_bytes, + all_sample_count_ptr, + cumsum_actual_sample_size_ptr + 1, + len * edge_to_id_len_, + sample_stream_)); + cudaStreamSynchronize(sample_stream_); + + edges_split_num.resize(edge_to_id_len_); + for (int i = 0; i < edge_to_id_len_; i++) { + cudaMemcpyAsync(edges_split_num.data() + i, + cumsum_actual_sample_size_ptr + (i + 1) * len, + sizeof(int), + cudaMemcpyDeviceToHost, + sample_stream_); + } + + CUDA_CHECK(cudaStreamSynchronize(sample_stream_)); + + int all_sample_size = edges_split_num[edge_to_id_len_ - 1]; + auto final_sample_val = memory::AllocShared( + place_, + all_sample_size * sizeof(int64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + auto final_sample_val_dst = memory::AllocShared( + place_, + all_sample_size * sizeof(int64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + int64_t *final_sample_val_ptr = + reinterpret_cast(final_sample_val->ptr()); + int64_t *final_sample_val_dst_ptr = + reinterpret_cast(final_sample_val_dst->ptr()); + int64_t *all_sample_val_ptr = + reinterpret_cast(sample_res.val_mem->ptr()); + FillActualNeighbors<<>>(all_sample_val_ptr, + final_sample_val_ptr, + final_sample_val_dst_ptr, + all_sample_count_ptr, + cumsum_actual_sample_size_ptr, + sample_size, + len * edge_to_id_len_, + len); + *neighbor_len = all_sample_size; + cudaStreamSynchronize(sample_stream_); + + std::vector> sample_results; + sample_results.emplace_back(final_sample_val); + sample_results.emplace_back(final_sample_val_dst); + return sample_results; +} + +std::shared_ptr GraphDataGenerator::FillReindexHashTable( + int64_t *input, + int num_input, + int64_t len_hashtable, + int64_t *keys, + int *values, + int *key_index, + int *final_nodes_len) { + phi::BuildHashTable + <<>>( + input, num_input, len_hashtable, keys, key_index); + + // Get item index count. 
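SampleNeighbors receives a dense buffer with sample_size reserved slots per (edge type, source node) pair plus an actual count per pair; an inclusive prefix sum over the counts yields each pair's write offset, and FillActualNeighbors compacts the valid entries into flat src/dst arrays. The CPU sketch below shows the same compaction for a single edge type with two source nodes and made-up neighbor ids; the real kernel additionally derives the dst node as i % len across edge types.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const int sample_size = 3;
  // Dense sampling output: 3 reserved slots per source node, -1 = unused slot.
  std::vector<int64_t> dense = {101, 102, -1,   // node 0 sampled 2 neighbors
                                201, -1,  -1};  // node 1 sampled 1 neighbor
  std::vector<int> actual = {2, 1};

  // Prefix sum of the actual sizes -> write offsets into the compacted output.
  std::vector<int> offset(actual.size() + 1, 0);
  for (size_t i = 0; i < actual.size(); ++i) offset[i + 1] = offset[i] + actual[i];

  std::vector<int64_t> src(offset.back()), dst(offset.back());
  for (size_t i = 0; i < actual.size(); ++i) {
    for (int j = 0; j < actual[i]; ++j) {
      src[offset[i] + j] = dense[i * sample_size + j];
      dst[offset[i] + j] = static_cast<int64_t>(i);  // edge points back to source node i
    }
  }
  for (size_t e = 0; e < src.size(); ++e) {
    std::printf("edge %zu: %lld -> %lld\n", e,
                static_cast<long long>(src[e]),
                static_cast<long long>(dst[e]));
  }
  return 0;
}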
+ auto item_count = memory::Alloc( + place_, + (num_input + 1) * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); + int *item_count_ptr = reinterpret_cast(item_count->ptr()); + cudaMemsetAsync( + item_count_ptr, 0, sizeof(int) * (num_input + 1), sample_stream_); + phi::GetItemIndexCount + <<>>( + input, item_count_ptr, num_input, len_hashtable, keys, key_index); + + size_t temp_storage_bytes = 0; + cub::DeviceScan::ExclusiveSum(NULL, + temp_storage_bytes, + item_count_ptr, + item_count_ptr, + num_input + 1, + sample_stream_); + auto d_temp_storage = memory::Alloc( + place_, + temp_storage_bytes, + phi::Stream(reinterpret_cast(sample_stream_))); + cub::DeviceScan::ExclusiveSum(d_temp_storage->ptr(), + temp_storage_bytes, + item_count_ptr, + item_count_ptr, + num_input + 1, + sample_stream_); + + int total_unique_items = 0; + cudaMemcpyAsync(&total_unique_items, + item_count_ptr + num_input, + sizeof(int), + cudaMemcpyDeviceToHost, + sample_stream_); + cudaStreamSynchronize(sample_stream_); + + auto unique_items = memory::AllocShared( + place_, + total_unique_items * sizeof(int64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + int64_t *unique_items_ptr = reinterpret_cast(unique_items->ptr()); + *final_nodes_len = total_unique_items; + + // Get unique items + phi::FillUniqueItems + <<>>( + input, + num_input, + len_hashtable, + unique_items_ptr, + item_count_ptr, + keys, + values, + key_index); + cudaStreamSynchronize(sample_stream_); + return unique_items; +} + +std::shared_ptr GraphDataGenerator::GetReindexResult( + int64_t *reindex_src_data, + int64_t *center_nodes, + int *final_nodes_len, + int node_len, + int64_t neighbor_len) { + // Reset reindex table + int64_t *d_reindex_table_key_ptr = + reinterpret_cast(d_reindex_table_key_->ptr()); + int *d_reindex_table_value_ptr = + reinterpret_cast(d_reindex_table_value_->ptr()); + int *d_reindex_table_index_ptr = + reinterpret_cast(d_reindex_table_index_->ptr()); + + // Fill table with -1. 
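// Note: the cudaMemsetAsync(-1) calls below work because setting every byte to
// 0xFF reads back as -1 for both int and int64_t slots, i.e. the "empty" marker.
// Illustrative host-side analogue (not part of the patch) of what
// FillReindexHashTable/GetReindexResult achieve on the GPU: map arbitrary
// int64 node ids to a compact [0, num_unique) range so the sampled subgraph
// fits in dense tensors. ReindexSketch is a hypothetical name.
#include <cstdint>
#include <unordered_map>
#include <vector>

std::vector<int64_t> ReindexSketch(std::vector<int64_t> *edge_endpoints) {
  std::unordered_map<int64_t, int> to_local;
  std::vector<int64_t> unique_nodes;  // local id -> original node id
  for (int64_t &node : *edge_endpoints) {
    auto it = to_local.find(node);
    if (it == to_local.end()) {
      it = to_local.emplace(node, static_cast<int>(unique_nodes.size())).first;
      unique_nodes.push_back(node);
    }
    node = it->second;  // rewrite the endpoint to its local id
  }
  return unique_nodes;
}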
+ cudaMemsetAsync(d_reindex_table_key_ptr, + -1, + reindex_table_size_ * sizeof(int64_t), + sample_stream_); + cudaMemsetAsync(d_reindex_table_value_ptr, + -1, + reindex_table_size_ * sizeof(int), + sample_stream_); + cudaMemsetAsync(d_reindex_table_index_ptr, + -1, + reindex_table_size_ * sizeof(int), + sample_stream_); + + auto all_nodes = memory::AllocShared( + place_, + (node_len + neighbor_len) * sizeof(int64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + int64_t *all_nodes_data = reinterpret_cast(all_nodes->ptr()); + + cudaMemcpyAsync(all_nodes_data, + center_nodes, + sizeof(int64_t) * node_len, + cudaMemcpyDeviceToDevice, + sample_stream_); + cudaMemcpyAsync(all_nodes_data + node_len, + reindex_src_data, + sizeof(int64_t) * neighbor_len, + cudaMemcpyDeviceToDevice, + sample_stream_); + + cudaStreamSynchronize(sample_stream_); + + auto final_nodes = FillReindexHashTable(all_nodes_data, + node_len + neighbor_len, + reindex_table_size_, + d_reindex_table_key_ptr, + d_reindex_table_value_ptr, + d_reindex_table_index_ptr, + final_nodes_len); + + phi::ReindexSrcOutput + <<>>( + reindex_src_data, + neighbor_len, + reindex_table_size_, + d_reindex_table_key_ptr, + d_reindex_table_value_ptr); + return final_nodes; +} + +std::shared_ptr GraphDataGenerator::GenerateSampleGraph( + uint64_t *node_ids, + int len, + int *final_len, + std::shared_ptr &inverse) { + VLOG(2) << "Get Unique Nodes"; + + auto uniq_nodes = memory::Alloc( + place_, + len * sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + int *inverse_ptr = reinterpret_cast(inverse->ptr()); + int64_t *uniq_nodes_data = reinterpret_cast(uniq_nodes->ptr()); + int uniq_len = dedup_keys_and_fillidx( + len, + node_ids, + reinterpret_cast(uniq_nodes_data), + reinterpret_cast(d_sorted_keys_->ptr()), + reinterpret_cast(inverse_ptr), + reinterpret_cast(d_sorted_idx_->ptr()), + reinterpret_cast(d_offset_->ptr()), + reinterpret_cast(d_merged_cnts_->ptr()), + sample_stream_, + d_buf_, + place_); + int len_samples = samples_.size(); + + VLOG(2) << "Sample Neighbors and Reindex"; + std::vector edges_split_num; + std::vector> final_nodes_vec; + std::vector> graph_edges; + std::vector> edges_split_num_for_graph; + std::vector final_nodes_len_vec; + + for (int i = 0; i < len_samples; i++) { + edges_split_num.clear(); + std::shared_ptr neighbors, reindex_dst; + int64_t neighbors_len = 0; + if (i == 0) { + auto sample_results = SampleNeighbors(uniq_nodes_data, + uniq_len, + samples_[i], + edges_split_num, + &neighbors_len); + neighbors = sample_results[0]; + reindex_dst = sample_results[1]; + edges_split_num.push_back(uniq_len); + } else { + int64_t *final_nodes_data = + reinterpret_cast(final_nodes_vec[i - 1]->ptr()); + auto sample_results = SampleNeighbors(final_nodes_data, + final_nodes_len_vec[i - 1], + samples_[i], + edges_split_num, + &neighbors_len); + neighbors = sample_results[0]; + reindex_dst = sample_results[1]; + edges_split_num.push_back(final_nodes_len_vec[i - 1]); + } + + int64_t *reindex_src_data = reinterpret_cast(neighbors->ptr()); + int final_nodes_len = 0; + if (i == 0) { + auto tmp_final_nodes = GetReindexResult(reindex_src_data, + uniq_nodes_data, + &final_nodes_len, + uniq_len, + neighbors_len); + final_nodes_vec.emplace_back(tmp_final_nodes); + final_nodes_len_vec.emplace_back(final_nodes_len); + } else { + int64_t *final_nodes_data = + reinterpret_cast(final_nodes_vec[i - 1]->ptr()); + auto tmp_final_nodes = GetReindexResult(reindex_src_data, + final_nodes_data, + &final_nodes_len, + 
final_nodes_len_vec[i - 1], + neighbors_len); + final_nodes_vec.emplace_back(tmp_final_nodes); + final_nodes_len_vec.emplace_back(final_nodes_len); + } + edges_split_num.emplace_back( + final_nodes_len_vec[i]); // [edges_split_num, next_num_nodes, + // num_nodes] + edges_split_num.emplace_back(neighbors_len); + graph_edges.emplace_back(neighbors); + graph_edges.emplace_back(reindex_dst); + edges_split_num_for_graph.emplace_back(edges_split_num); + } + graph_edges_vec_.emplace_back(graph_edges); + edges_split_num_vec_.emplace_back(edges_split_num_for_graph); + + *final_len = final_nodes_len_vec[len_samples - 1]; + return final_nodes_vec[len_samples - 1]; +} + +uint64_t GraphDataGenerator::CopyUniqueNodes() { + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + uint64_t h_uniq_node_num = 0; + uint64_t *d_uniq_node_num = + reinterpret_cast(d_uniq_node_num_->ptr()); + cudaMemcpyAsync(&h_uniq_node_num, + d_uniq_node_num, + sizeof(uint64_t), + cudaMemcpyDeviceToHost, + sample_stream_); + cudaStreamSynchronize(sample_stream_); + auto d_uniq_node = memory::AllocShared( + place_, + h_uniq_node_num * sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + uint64_t *d_uniq_node_ptr = + reinterpret_cast(d_uniq_node->ptr()); + + auto d_node_cursor = memory::AllocShared( + place_, + sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + + uint64_t *d_node_cursor_ptr = + reinterpret_cast(d_node_cursor->ptr()); + cudaMemsetAsync(d_node_cursor_ptr, 0, sizeof(uint64_t), sample_stream_); + // uint64_t unused_key = std::numeric_limits::max(); + table_->get_keys(d_uniq_node_ptr, d_node_cursor_ptr, sample_stream_); + + cudaStreamSynchronize(sample_stream_); + + host_vec_.resize(h_uniq_node_num + copy_unique_len_); + cudaMemcpyAsync(host_vec_.data() + copy_unique_len_, + d_uniq_node_ptr, + sizeof(uint64_t) * h_uniq_node_num, + cudaMemcpyDeviceToHost, + sample_stream_); + cudaStreamSynchronize(sample_stream_); + return h_uniq_node_num; + } + return 0; +} + +void GraphDataGenerator::DoWalkandSage() { + int device_id = place_.GetDeviceId(); + debug_gpu_memory_info(device_id, "DoWalkandSage start"); + platform::CUDADeviceGuard guard(gpuid_); + if (gpu_graph_training_) { + bool train_flag; + if (FLAGS_graph_metapath_split_opt) { + train_flag = FillWalkBufMultiPath(); + } else { + train_flag = FillWalkBuf(); + } + + if (sage_mode_) { + sage_batch_num_ = 0; + if (train_flag) { + int total_instance = 0, uniq_instance = 0; + bool ins_pair_flag = true; + uint64_t *ins_buf, *ins_cursor; + while (ins_pair_flag) { + int res = 0; + while (ins_buf_pair_len_ < batch_size_) { + res = FillInsBuf(sample_stream_); + if (res == -1) { + if (ins_buf_pair_len_ == 0) { + ins_pair_flag = false; + } + break; + } + } + + if (!ins_pair_flag) { + break; + } + + total_instance = + ins_buf_pair_len_ < batch_size_ ? 
ins_buf_pair_len_ : batch_size_; + total_instance *= 2; + + ins_buf = reinterpret_cast(d_ins_buf_->ptr()); + ins_cursor = ins_buf + ins_buf_pair_len_ * 2 - total_instance; + auto inverse = memory::AllocShared( + place_, + total_instance * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); + auto final_sage_nodes = GenerateSampleGraph( + ins_cursor, total_instance, &uniq_instance, inverse); + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + uint64_t *final_sage_nodes_ptr = + reinterpret_cast(final_sage_nodes->ptr()); + InsertTable(final_sage_nodes_ptr, uniq_instance, d_uniq_node_num_); + } + final_sage_nodes_vec_.emplace_back(final_sage_nodes); + inverse_vec_.emplace_back(inverse); + uniq_instance_vec_.emplace_back(uniq_instance); + total_instance_vec_.emplace_back(total_instance); + ins_buf_pair_len_ -= total_instance / 2; + sage_batch_num_ += 1; + } + uint64_t h_uniq_node_num = CopyUniqueNodes(); + VLOG(0) << "train sage_batch_num: " << sage_batch_num_; + } + } + } else { + bool infer_flag = FillInferBuf(); + if (sage_mode_) { + sage_batch_num_ = 0; + if (infer_flag) { + int total_instance = 0, uniq_instance = 0; + total_instance = (infer_node_start_ + batch_size_ <= infer_node_end_) + ? batch_size_ + : infer_node_end_ - infer_node_start_; + total_instance *= 2; + while (total_instance != 0) { + uint64_t *d_type_keys = + reinterpret_cast(d_device_keys_[cursor_]->ptr()); + d_type_keys += infer_node_start_; + infer_node_start_ += total_instance / 2; + auto node_buf = memory::AllocShared( + place_, + total_instance * sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + int64_t *node_buf_ptr = reinterpret_cast(node_buf->ptr()); + CopyDuplicateKeys<<>>( + node_buf_ptr, d_type_keys, total_instance / 2); + uint64_t *node_buf_ptr_ = + reinterpret_cast(node_buf->ptr()); + auto inverse = memory::AllocShared( + place_, + total_instance * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); + auto final_sage_nodes = GenerateSampleGraph( + node_buf_ptr_, total_instance, &uniq_instance, inverse); + cudaStreamSynchronize(sample_stream_); + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + uint64_t *final_sage_nodes_ptr = + reinterpret_cast(final_sage_nodes->ptr()); + InsertTable(final_sage_nodes_ptr, uniq_instance, d_uniq_node_num_); + } + final_sage_nodes_vec_.emplace_back(final_sage_nodes); + inverse_vec_.emplace_back(inverse); + uniq_instance_vec_.emplace_back(uniq_instance); + total_instance_vec_.emplace_back(total_instance); + sage_batch_num_ += 1; + + total_instance = (infer_node_start_ + batch_size_ <= infer_node_end_) + ? 
batch_size_ + : infer_node_end_ - infer_node_start_; + total_instance *= 2; + } + + uint64_t h_uniq_node_num = CopyUniqueNodes(); + VLOG(0) << "infer sage_batch_num: " << sage_batch_num_; + } + } + } + debug_gpu_memory_info(device_id, "DoWalkandSage end"); +} + +void GraphDataGenerator::clear_gpu_mem() { + d_len_per_row_.reset(); + d_sample_keys_.reset(); + d_prefix_sum_.reset(); + for (size_t i = 0; i < d_sampleidx2rows_.size(); i++) { + d_sampleidx2rows_[i].reset(); + } + delete table_; + if (sage_mode_) { + d_reindex_table_key_.reset(); + d_reindex_table_value_.reset(); + d_reindex_table_index_.reset(); + d_sorted_keys_.reset(); + d_sorted_idx_.reset(); + d_offset_.reset(); + d_merged_cnts_.reset(); + } +} + +int GraphDataGenerator::FillInferBuf() { + platform::CUDADeviceGuard guard(gpuid_); + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + auto &global_infer_node_type_start = + gpu_graph_ptr->global_infer_node_type_start_[gpuid_]; + auto &infer_cursor = gpu_graph_ptr->infer_cursor_[thread_id_]; + total_row_ = 0; + if (infer_cursor < h_device_keys_len_.size()) { + if (global_infer_node_type_start[infer_cursor] >= + h_device_keys_len_[infer_cursor]) { + infer_cursor++; + if (infer_cursor >= h_device_keys_len_.size()) { + return 0; + } + } + size_t device_key_size = h_device_keys_len_[infer_cursor]; + total_row_ = + (global_infer_node_type_start[infer_cursor] + infer_table_cap_ <= + device_key_size) + ? infer_table_cap_ + : device_key_size - global_infer_node_type_start[infer_cursor]; + + uint64_t *d_type_keys = + reinterpret_cast(d_device_keys_[infer_cursor]->ptr()); + if (!sage_mode_) { + host_vec_.resize(total_row_); + cudaMemcpyAsync(host_vec_.data(), + d_type_keys + global_infer_node_type_start[infer_cursor], + sizeof(uint64_t) * total_row_, + cudaMemcpyDeviceToHost, + sample_stream_); + cudaStreamSynchronize(sample_stream_); + } + VLOG(1) << "cursor: " << infer_cursor + << " start: " << global_infer_node_type_start[infer_cursor] + << " num: " << total_row_; + infer_node_start_ = global_infer_node_type_start[infer_cursor]; + global_infer_node_type_start[infer_cursor] += total_row_; + infer_node_end_ = global_infer_node_type_start[infer_cursor]; + cursor_ = infer_cursor; + } + return 1; +} + +void GraphDataGenerator::ClearSampleState() { + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + auto &finish_node_type = gpu_graph_ptr->finish_node_type_[gpuid_]; + auto &node_type_start = gpu_graph_ptr->node_type_start_[gpuid_]; + finish_node_type.clear(); + for (auto iter = node_type_start.begin(); iter != node_type_start.end(); + iter++) { + iter->second = 0; + } +} + +int GraphDataGenerator::FillWalkBuf() { platform::CUDADeviceGuard guard(gpuid_); size_t once_max_sample_keynum = walk_degree_ * once_sample_startid_len_; //////// @@ -898,30 +2100,42 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { } /////// auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); - uint64_t *walk = reinterpret_cast(d_walk->ptr()); + uint64_t *walk = reinterpret_cast(d_walk_->ptr()); int *len_per_row = reinterpret_cast(d_len_per_row_->ptr()); uint64_t *d_sample_keys = reinterpret_cast(d_sample_keys_->ptr()); - cudaMemsetAsync(walk, 0, buf_size_ * sizeof(uint64_t), stream_); - cudaMemsetAsync( - len_per_row, 0, once_max_sample_keynum * sizeof(int), stream_); + cudaMemsetAsync(walk, 0, buf_size_ * sizeof(uint64_t), sample_stream_); + // cudaMemsetAsync( + // len_per_row, 0, once_max_sample_keynum * sizeof(int), sample_stream_); + int sample_times = 0; int i = 0; - int total_row = 0; - 
size_t node_type_len = first_node_type_.size(); + total_row_ = 0; + + // 获取全局采样状态 + auto &first_node_type = gpu_graph_ptr->first_node_type_; + auto &meta_path = gpu_graph_ptr->meta_path_; + auto &node_type_start = gpu_graph_ptr->node_type_start_[gpuid_]; + auto &finish_node_type = gpu_graph_ptr->finish_node_type_[gpuid_]; + auto &type_to_index = gpu_graph_ptr->get_graph_type_to_index(); + auto &cursor = gpu_graph_ptr->cursor_[thread_id_]; + size_t node_type_len = first_node_type.size(); int remain_size = buf_size_ - walk_degree_ * once_sample_startid_len_ * walk_len_; + int total_samples = 0; while (i <= remain_size) { - int cur_node_idx = cursor_ % node_type_len; - int node_type = first_node_type_[cur_node_idx]; - auto &path = meta_path_[cur_node_idx]; - size_t start = node_type_start_[node_type]; + int cur_node_idx = cursor % node_type_len; + int node_type = first_node_type[cur_node_idx]; + auto &path = meta_path[cur_node_idx]; + size_t start = node_type_start[node_type]; + VLOG(2) << "cur_node_idx = " << cur_node_idx + << " meta_path.size = " << meta_path.size(); // auto node_query_result = gpu_graph_ptr->query_node_list( - // gpuid_, node_type, start, once_sample_startid_len_); + // gpuid_, node_type, start, once_sample_startid_len_); // int tmp_len = node_query_result.actual_sample_size; VLOG(2) << "choose start type: " << node_type; - int type_index = type_to_index_[node_type]; - size_t device_key_size = h_device_keys_[type_index]->size(); + int type_index = type_to_index[node_type]; + size_t device_key_size = h_device_keys_len_[type_index]; VLOG(2) << "type: " << node_type << " size: " << device_key_size << " start: " << start; uint64_t *d_type_keys = @@ -929,21 +2143,19 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { int tmp_len = start + once_sample_startid_len_ > device_key_size ? 
device_key_size - start : once_sample_startid_len_; - node_type_start_[node_type] = tmp_len + start; + bool update = true; if (tmp_len == 0) { - finish_node_type_.insert(node_type); - if (finish_node_type_.size() == node_type_start_.size()) { + finish_node_type.insert(node_type); + if (finish_node_type.size() == node_type_start.size()) { + cursor = 0; + epoch_finish_ = true; break; } - cursor_ += 1; + cursor += 1; continue; } - // if (tmp_len == 0) { - // break; - //} - VLOG(2) << "i = " << i << " buf_size_ = " << buf_size_ - << " tmp_len = " << tmp_len << " cursor = " << cursor_ - << " once_max_sample_keynum = " << once_max_sample_keynum; + + VLOG(2) << "gpuid = " << gpuid_ << " path[0] = " << path[0]; uint64_t *cur_walk = walk + i; NeighborSampleQuery q; @@ -952,11 +2164,38 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { (uint64_t)(d_type_keys + start), walk_degree_, tmp_len); - auto sample_res = gpu_graph_ptr->graph_neighbor_sample_v3(q, false); + auto sample_res = gpu_graph_ptr->graph_neighbor_sample_v3(q, false, true); int step = 1; VLOG(2) << "sample edge type: " << path[0] << " step: " << 1; jump_rows_ = sample_res.total_sample_size; + total_samples += sample_res.total_sample_size; + VLOG(2) << "i = " << i << " start = " << start << " tmp_len = " << tmp_len + << " cursor = " << node_type << " cur_node_idx = " << cur_node_idx + << " jump row: " << jump_rows_; + VLOG(2) << "jump_row: " << jump_rows_; + if (jump_rows_ == 0) { + node_type_start[node_type] = tmp_len + start; + cursor += 1; + continue; + } + + if (!sage_mode_) { + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + if (InsertTable(d_type_keys + start, tmp_len, d_uniq_node_num_) != 0) { + VLOG(2) << "in step 0, insert key stage, table is full"; + update = false; + break; + } + if (InsertTable(sample_res.actual_val, + sample_res.total_sample_size, + d_uniq_node_num_) != 0) { + VLOG(2) << "in step 0, insert sample res stage, table is full"; + update = false; + break; + } + } + } FillOneStep(d_type_keys + start, cur_walk, tmp_len, @@ -964,7 +2203,6 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { walk_degree_, step, len_per_row); - VLOG(2) << "jump_row: " << jump_rows_; ///////// if (debug_mode_) { cudaMemcpy( @@ -973,11 +2211,16 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { VLOG(2) << "h_walk[" << xx << "]: " << h_walk[xx]; } } + + VLOG(2) << "sample, step=" << step << " sample_keys=" << tmp_len + << " sample_res_len=" << sample_res.total_sample_size; + ///////// step++; size_t path_len = path.size(); for (; step < walk_len_; step++) { if (sample_res.total_sample_size == 0) { + VLOG(2) << "sample finish, step=" << step; break; } auto sample_key_mem = sample_res.actual_val_mem; @@ -990,11 +2233,247 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { (uint64_t)sample_keys_ptr, 1, sample_res.total_sample_size); - sample_res = gpu_graph_ptr->graph_neighbor_sample_v3(q, false); + int sample_key_len = sample_res.total_sample_size; + sample_res = gpu_graph_ptr->graph_neighbor_sample_v3(q, false, true); + total_samples += sample_res.total_sample_size; + if (!sage_mode_) { + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + if (InsertTable(sample_res.actual_val, + sample_res.total_sample_size, + d_uniq_node_num_) != 0) { + VLOG(2) << "in step: " << step << ", table is full"; + update = false; + break; + } + } + } + FillOneStep(d_type_keys + start, + cur_walk, + sample_key_len, + sample_res, + 1, + step, + len_per_row); + if 
(debug_mode_) { + cudaMemcpy( + h_walk, walk, buf_size_ * sizeof(uint64_t), cudaMemcpyDeviceToHost); + for (int xx = 0; xx < buf_size_; xx++) { + VLOG(2) << "h_walk[" << xx << "]: " << h_walk[xx]; + } + } + VLOG(2) << "sample, step=" << step << " sample_keys=" << sample_key_len + << " sample_res_len=" << sample_res.total_sample_size; + } + // 此时更新全局采样状态 + if (update == true) { + node_type_start[node_type] = tmp_len + start; + i += jump_rows_ * walk_len_; + total_row_ += jump_rows_; + cursor += 1; + sample_times++; + } else { + VLOG(2) << "table is full, not update stat!"; + break; + } + } + buf_state_.Reset(total_row_); + int *d_random_row = reinterpret_cast(d_random_row_->ptr()); + + thrust::random::default_random_engine engine(shuffle_seed_); + const auto &exec_policy = thrust::cuda::par.on(sample_stream_); + thrust::counting_iterator cnt_iter(0); + thrust::shuffle_copy(exec_policy, + cnt_iter, + cnt_iter + total_row_, + thrust::device_pointer_cast(d_random_row), + engine); + + cudaStreamSynchronize(sample_stream_); + shuffle_seed_ = engine(); + + if (debug_mode_) { + int *h_random_row = new int[total_row_ + 10]; + cudaMemcpy(h_random_row, + d_random_row, + total_row_ * sizeof(int), + cudaMemcpyDeviceToHost); + for (int xx = 0; xx < total_row_; xx++) { + VLOG(2) << "h_random_row[" << xx << "]: " << h_random_row[xx]; + } + delete[] h_random_row; + delete[] h_walk; + delete[] h_sample_keys; + delete[] h_offset2idx; + delete[] h_len_per_row; + delete[] h_prefix_sum; + } + + if (!sage_mode_) { + uint64_t h_uniq_node_num = CopyUniqueNodes(); + VLOG(0) << "sample_times:" << sample_times << ", d_walk_size:" << buf_size_ + << ", d_walk_offset:" << i << ", total_rows:" << total_row_ + << ", total_samples:" << total_samples; + } else { + VLOG(0) << "sample_times:" << sample_times << ", d_walk_size:" << buf_size_ + << ", d_walk_offset:" << i << ", total_rows:" << total_row_ + << ", total_samples:" << total_samples; + } + return total_row_ != 0; +} + +int GraphDataGenerator::FillWalkBufMultiPath() { + platform::CUDADeviceGuard guard(gpuid_); + size_t once_max_sample_keynum = walk_degree_ * once_sample_startid_len_; + //////// + uint64_t *h_walk; + uint64_t *h_sample_keys; + int *h_offset2idx; + int *h_len_per_row; + uint64_t *h_prefix_sum; + if (debug_mode_) { + h_walk = new uint64_t[buf_size_]; + h_sample_keys = new uint64_t[once_max_sample_keynum]; + h_offset2idx = new int[once_max_sample_keynum]; + h_len_per_row = new int[once_max_sample_keynum]; + h_prefix_sum = new uint64_t[once_max_sample_keynum + 1]; + } + /////// + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + uint64_t *walk = reinterpret_cast(d_walk_->ptr()); + int *len_per_row = reinterpret_cast(d_len_per_row_->ptr()); + uint64_t *d_sample_keys = reinterpret_cast(d_sample_keys_->ptr()); + cudaMemsetAsync(walk, 0, buf_size_ * sizeof(uint64_t), sample_stream_); + int sample_times = 0; + int i = 0; + total_row_ = 0; + + // 获取全局采样状态 + auto &first_node_type = gpu_graph_ptr->first_node_type_; + auto &cur_metapath = gpu_graph_ptr->cur_metapath_; + auto &meta_path = gpu_graph_ptr->meta_path_; + auto &path = gpu_graph_ptr->cur_parse_metapath_; + auto &cur_metapath_start = gpu_graph_ptr->cur_metapath_start_[gpuid_]; + auto &finish_node_type = gpu_graph_ptr->finish_node_type_[gpuid_]; + auto &type_to_index = gpu_graph_ptr->get_graph_type_to_index(); + size_t node_type_len = first_node_type.size(); + std::string first_node = + paddle::string::split_string(cur_metapath, "2")[0]; + auto it = gpu_graph_ptr->feature_to_id.find(first_node); + auto 
node_type = it->second; + + int remain_size = + buf_size_ - walk_degree_ * once_sample_startid_len_ * walk_len_; + int total_samples = 0; + + while (i <= remain_size) { + size_t start = cur_metapath_start; + size_t device_key_size = h_train_metapath_keys_len_; + VLOG(2) << "type: " << node_type << " size: " << device_key_size + << " start: " << start; + uint64_t *d_type_keys = + reinterpret_cast(d_train_metapath_keys_->ptr()); + int tmp_len = start + once_sample_startid_len_ > device_key_size + ? device_key_size - start + : once_sample_startid_len_; + bool update = true; + if (tmp_len == 0) { + break; + } + + VLOG(2) << "gpuid = " << gpuid_ << " path[0] = " << path[0]; + uint64_t *cur_walk = walk + i; + + NeighborSampleQuery q; + q.initialize(gpuid_, + path[0], + (uint64_t)(d_type_keys + start), + walk_degree_, + tmp_len); + auto sample_res = gpu_graph_ptr->graph_neighbor_sample_v3(q, false, true); + + int step = 1; + VLOG(2) << "sample edge type: " << path[0] << " step: " << 1; + jump_rows_ = sample_res.total_sample_size; + total_samples += sample_res.total_sample_size; + VLOG(2) << "i = " << i << " start = " << start << " tmp_len = " << tmp_len + << "jump row: " << jump_rows_; + if (jump_rows_ == 0) { + cur_metapath_start = tmp_len + start; + continue; + } + + if (!sage_mode_) { + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + if (InsertTable(d_type_keys + start, tmp_len, d_uniq_node_num_) != 0) { + VLOG(2) << "in step 0, insert key stage, table is full"; + update = false; + break; + } + if (InsertTable(sample_res.actual_val, + sample_res.total_sample_size, + d_uniq_node_num_) != 0) { + VLOG(2) << "in step 0, insert sample res stage, table is full"; + update = false; + break; + } + } + } + + FillOneStep(d_type_keys + start, + cur_walk, + tmp_len, + sample_res, + walk_degree_, + step, + len_per_row); + ///////// + if (debug_mode_) { + cudaMemcpy( + h_walk, walk, buf_size_ * sizeof(uint64_t), cudaMemcpyDeviceToHost); + for (int xx = 0; xx < buf_size_; xx++) { + VLOG(2) << "h_walk[" << xx << "]: " << h_walk[xx]; + } + } + + VLOG(2) << "sample, step=" << step << " sample_keys=" << tmp_len + << " sample_res_len=" << sample_res.total_sample_size; + + ///////// + step++; + size_t path_len = path.size(); + for (; step < walk_len_; step++) { + if (sample_res.total_sample_size == 0) { + VLOG(2) << "sample finish, step=" << step; + break; + } + auto sample_key_mem = sample_res.actual_val_mem; + uint64_t *sample_keys_ptr = + reinterpret_cast(sample_key_mem->ptr()); + int edge_type_id = path[(step - 1) % path_len]; + VLOG(2) << "sample edge type: " << edge_type_id << " step: " << step; + q.initialize(gpuid_, + edge_type_id, + (uint64_t)sample_keys_ptr, + 1, + sample_res.total_sample_size); + int sample_key_len = sample_res.total_sample_size; + sample_res = gpu_graph_ptr->graph_neighbor_sample_v3(q, false, true); + total_samples += sample_res.total_sample_size; + if (!sage_mode_) { + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + if (InsertTable(sample_res.actual_val, + sample_res.total_sample_size, + d_uniq_node_num_) != 0) { + VLOG(2) << "in step: " << step << ", table is full"; + update = false; + break; + } + } + } FillOneStep(d_type_keys + start, cur_walk, - sample_res.total_sample_size, + sample_key_len, sample_res, 1, step, @@ -1006,34 +2485,43 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { VLOG(2) << "h_walk[" << xx << "]: " << h_walk[xx]; } } + + VLOG(2) << "sample, step=" << step << " sample_keys=" << sample_key_len + << " 
sample_res_len=" << sample_res.total_sample_size; + } + // 此时更新全局采样状态 + if (update == true) { + cur_metapath_start = tmp_len + start; + i += jump_rows_ * walk_len_; + total_row_ += jump_rows_; + sample_times++; + } else { + VLOG(2) << "table is full, not update stat!"; + break; } - // cursor_ += tmp_len; - i += jump_rows_ * walk_len_; - total_row += jump_rows_; - cursor_ += 1; } - buf_state_.Reset(total_row); + buf_state_.Reset(total_row_); int *d_random_row = reinterpret_cast(d_random_row_->ptr()); thrust::random::default_random_engine engine(shuffle_seed_); - const auto &exec_policy = thrust::cuda::par.on(stream_); + const auto &exec_policy = thrust::cuda::par.on(sample_stream_); thrust::counting_iterator cnt_iter(0); thrust::shuffle_copy(exec_policy, cnt_iter, - cnt_iter + total_row, + cnt_iter + total_row_, thrust::device_pointer_cast(d_random_row), engine); - cudaStreamSynchronize(stream_); + cudaStreamSynchronize(sample_stream_); shuffle_seed_ = engine(); if (debug_mode_) { - int *h_random_row = new int[total_row + 10]; + int *h_random_row = new int[total_row_ + 10]; cudaMemcpy(h_random_row, d_random_row, - total_row * sizeof(int), + total_row_ * sizeof(int), cudaMemcpyDeviceToHost); - for (int xx = 0; xx < total_row; xx++) { + for (int xx = 0; xx < total_row_; xx++) { VLOG(2) << "h_random_row[" << xx << "]: " << h_random_row[xx]; } delete[] h_random_row; @@ -1043,72 +2531,133 @@ int GraphDataGenerator::FillWalkBuf(std::shared_ptr d_walk) { delete[] h_len_per_row; delete[] h_prefix_sum; } - return total_row != 0; + + if (!sage_mode_) { + uint64_t h_uniq_node_num = CopyUniqueNodes(); + VLOG(0) << "sample_times:" << sample_times << ", d_walk_size:" << buf_size_ + << ", d_walk_offset:" << i << ", total_rows:" << total_row_ + << ", h_uniq_node_num:" << h_uniq_node_num + << ", total_samples:" << total_samples; + } else { + VLOG(0) << "sample_times:" << sample_times << ", d_walk_size:" << buf_size_ + << ", d_walk_offset:" << i << ", total_rows:" << total_row_ + << ", total_samples:" << total_samples; + } + + return total_row_ != 0; } -void GraphDataGenerator::AllocResource( - const paddle::platform::Place &place, - std::vector feed_vec) { - place_ = place; - gpuid_ = place_.GetDeviceId(); - VLOG(3) << "gpuid " << gpuid_; - stream_ = dynamic_cast( - platform::DeviceContextPool::Instance().Get(place)) - ->stream(); +void GraphDataGenerator::SetFeedVec(std::vector feed_vec) { feed_vec_ = feed_vec; - slot_num_ = (feed_vec_.size() - 3) / 2; - - // d_device_keys_.resize(h_device_keys_.size()); - VLOG(2) << "h_device_keys size: " << h_device_keys_.size(); - infer_node_type_start_ = std::vector(h_device_keys_.size(), 0); - for (size_t i = 0; i < h_device_keys_.size(); i++) { - for (size_t j = 0; j < h_device_keys_[i]->size(); j++) { - VLOG(3) << "h_device_keys_[" << i << "][" << j - << "] = " << (*(h_device_keys_[i]))[j]; +} + +void GraphDataGenerator::AllocResource( + int thread_id, std::vector feed_vec) { + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + gpuid_ = gpu_graph_ptr->device_id_mapping[thread_id]; + thread_id_ = thread_id; + place_ = platform::CUDAPlace(gpuid_); + debug_gpu_memory_info(gpuid_, "AllocResource start"); + + platform::CUDADeviceGuard guard(gpuid_); + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + if (gpu_graph_training_) { + table_ = new HashTable( + train_table_cap_ / FLAGS_gpugraph_hbm_table_load_factor); + } else { + table_ = new HashTable( + infer_table_cap_ / FLAGS_gpugraph_hbm_table_load_factor); + } + } + VLOG(1) << "AllocResource 
gpuid " << gpuid_ + << " feed_vec.size: " << feed_vec.size() + << " table cap: " << train_table_cap_; + sample_stream_ = gpu_graph_ptr->get_local_stream(gpuid_); + train_stream_ = dynamic_cast( + platform::DeviceContextPool::Instance().Get(place_)) + ->stream(); + // feed_vec_ = feed_vec; + if (!sage_mode_) { + slot_num_ = (feed_vec.size() - 3) / 2; + } else { + slot_num_ = (feed_vec.size() - 4 - samples_.size() * 5) / 2; + } + + // infer_node_type_start_ = std::vector(h_device_keys_.size(), 0); + // for (size_t i = 0; i < h_device_keys_.size(); i++) { + // for (size_t j = 0; j < h_device_keys_[i]->size(); j++) { + // VLOG(3) << "h_device_keys_[" << i << "][" << j + // << "] = " << (*(h_device_keys_[i]))[j]; + // } + // auto buf = memory::AllocShared( + // place_, h_device_keys_[i]->size() * sizeof(uint64_t)); + // d_device_keys_.push_back(buf); + // CUDA_CHECK(cudaMemcpyAsync(buf->ptr(), + // h_device_keys_[i]->data(), + // h_device_keys_[i]->size() * sizeof(uint64_t), + // cudaMemcpyHostToDevice, + // stream_)); + // } + if (gpu_graph_training_ && FLAGS_graph_metapath_split_opt) { + d_train_metapath_keys_ = + gpu_graph_ptr->d_graph_train_total_keys_[thread_id]; + h_train_metapath_keys_len_ = + gpu_graph_ptr->h_graph_train_keys_len_[thread_id]; + VLOG(2) << "h train metapaths key len: " << h_train_metapath_keys_len_; + } else { + auto &d_graph_all_type_keys = gpu_graph_ptr->d_graph_all_type_total_keys_; + auto &h_graph_all_type_keys_len = gpu_graph_ptr->h_graph_all_type_keys_len_; + + for (size_t i = 0; i < d_graph_all_type_keys.size(); i++) { + d_device_keys_.push_back(d_graph_all_type_keys[i][thread_id]); + h_device_keys_len_.push_back(h_graph_all_type_keys_len[i][thread_id]); } - auto buf = memory::AllocShared( - place_, h_device_keys_[i]->size() * sizeof(uint64_t)); - d_device_keys_.push_back(buf); - CUDA_CHECK(cudaMemcpyAsync(buf->ptr(), - h_device_keys_[i]->data(), - h_device_keys_[i]->size() * sizeof(uint64_t), - cudaMemcpyHostToDevice, - stream_)); - } - // h_device_keys_ = h_device_keys; - // device_key_size_ = h_device_keys_->size(); - // d_device_keys_ = - // memory::AllocShared(place_, device_key_size_ * sizeof(int64_t)); - // CUDA_CHECK(cudaMemcpyAsync(d_device_keys_->ptr(), h_device_keys_->data(), - // device_key_size_ * sizeof(int64_t), - // cudaMemcpyHostToDevice, stream_)); + VLOG(2) << "h_device_keys size: " << h_device_keys_len_.size(); + } + size_t once_max_sample_keynum = walk_degree_ * once_sample_startid_len_; - d_prefix_sum_ = - memory::AllocShared(place_, (once_max_sample_keynum + 1) * sizeof(int)); + d_prefix_sum_ = memory::AllocShared( + place_, + (once_max_sample_keynum + 1) * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); int *d_prefix_sum_ptr = reinterpret_cast(d_prefix_sum_->ptr()); - cudaMemsetAsync( - d_prefix_sum_ptr, 0, (once_max_sample_keynum + 1) * sizeof(int), stream_); + cudaMemsetAsync(d_prefix_sum_ptr, + 0, + (once_max_sample_keynum + 1) * sizeof(int), + sample_stream_); cursor_ = 0; jump_rows_ = 0; - d_walk_ = memory::AllocShared(place_, buf_size_ * sizeof(uint64_t)); - cudaMemsetAsync(d_walk_->ptr(), 0, buf_size_ * sizeof(uint64_t), stream_); - if (!FLAGS_enable_opt_get_features && slot_num_ > 0) { - d_feature_ = - memory::AllocShared(place_, buf_size_ * slot_num_ * sizeof(uint64_t)); - cudaMemsetAsync( - d_feature_->ptr(), 0, buf_size_ * sizeof(uint64_t), stream_); - } - d_sample_keys_ = - memory::AllocShared(place_, once_max_sample_keynum * sizeof(uint64_t)); - - d_sampleidx2rows_.push_back( - memory::AllocShared(place_, 
once_max_sample_keynum * sizeof(int))); - d_sampleidx2rows_.push_back( - memory::AllocShared(place_, once_max_sample_keynum * sizeof(int))); + d_uniq_node_num_ = memory::AllocShared( + place_, + sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + cudaMemsetAsync(d_uniq_node_num_->ptr(), 0, sizeof(uint64_t), sample_stream_); + + d_walk_ = memory::AllocShared( + place_, + buf_size_ * sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + cudaMemsetAsync( + d_walk_->ptr(), 0, buf_size_ * sizeof(uint64_t), sample_stream_); + d_sample_keys_ = memory::AllocShared( + place_, + once_max_sample_keynum * sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + + d_sampleidx2rows_.push_back(memory::AllocShared( + place_, + once_max_sample_keynum * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_)))); + d_sampleidx2rows_.push_back(memory::AllocShared( + place_, + once_max_sample_keynum * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_)))); cur_sampleidx2row_ = 0; - d_len_per_row_ = - memory::AllocShared(place_, once_max_sample_keynum * sizeof(int)); + d_len_per_row_ = memory::AllocShared( + place_, + once_max_sample_keynum * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); for (int i = -window_; i < 0; i++) { window_step_.push_back(i); } @@ -1118,25 +2667,92 @@ void GraphDataGenerator::AllocResource( buf_state_.Init(batch_size_, walk_len_, &window_step_); d_random_row_ = memory::AllocShared( place_, - (once_sample_startid_len_ * walk_degree_ * repeat_time_) * sizeof(int)); + (once_sample_startid_len_ * walk_degree_ * repeat_time_) * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); shuffle_seed_ = 0; ins_buf_pair_len_ = 0; - d_ins_buf_ = - memory::AllocShared(place_, (batch_size_ * 2 * 2) * sizeof(uint64_t)); - if (slot_num_ > 0) { - d_feature_buf_ = memory::AllocShared( - place_, (batch_size_ * 2 * 2) * slot_num_ * sizeof(uint64_t)); + if (!sage_mode_) { + d_ins_buf_ = + memory::AllocShared(place_, (batch_size_ * 2 * 2) * sizeof(uint64_t)); + d_pair_num_ = memory::AllocShared(place_, sizeof(int)); + } else { + d_ins_buf_ = memory::AllocShared( + place_, + (batch_size_ * 2 * 2) * sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + d_pair_num_ = memory::AllocShared( + place_, + sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); } - d_pair_num_ = memory::AllocShared(place_, sizeof(int)); - if (FLAGS_enable_opt_get_features && slot_num_ > 0) { - d_slot_tensor_ptr_ = - memory::AllocShared(place_, slot_num_ * sizeof(uint64_t *)); - d_slot_lod_tensor_ptr_ = - memory::AllocShared(place_, slot_num_ * sizeof(uint64_t *)); + + d_slot_tensor_ptr_ = + memory::AllocShared(place_, slot_num_ * sizeof(uint64_t *)); + d_slot_lod_tensor_ptr_ = + memory::AllocShared(place_, slot_num_ * sizeof(uint64_t *)); + + if (sage_mode_) { + reindex_table_size_ = batch_size_ * 2; + // get hashtable size + for (int i = 0; i < samples_.size(); i++) { + reindex_table_size_ *= (samples_[i] * edge_to_id_len_ + 1); + } + int64_t next_pow2 = + 1 << static_cast(1 + std::log2(reindex_table_size_ >> 1)); + reindex_table_size_ = next_pow2 << 1; + + d_reindex_table_key_ = memory::AllocShared( + place_, + reindex_table_size_ * sizeof(int64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + d_reindex_table_value_ = memory::AllocShared( + place_, + reindex_table_size_ * sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); + d_reindex_table_index_ = memory::AllocShared( + place_, + reindex_table_size_ * 
sizeof(int), + phi::Stream(reinterpret_cast(sample_stream_))); + edge_type_graph_ = + gpu_graph_ptr->get_edge_type_graph(gpuid_, edge_to_id_len_); + + d_sorted_keys_ = memory::AllocShared( + place_, + (batch_size_ * 2 * 2) * sizeof(uint64_t), + phi::Stream(reinterpret_cast(sample_stream_))); + d_sorted_idx_ = memory::AllocShared( + place_, + (batch_size_ * 2 * 2) * sizeof(uint32_t), + phi::Stream(reinterpret_cast(sample_stream_))); + d_offset_ = memory::AllocShared( + place_, + (batch_size_ * 2 * 2) * sizeof(uint32_t), + phi::Stream(reinterpret_cast(sample_stream_))); + d_merged_cnts_ = memory::AllocShared( + place_, + (batch_size_ * 2 * 2) * sizeof(uint32_t), + phi::Stream(reinterpret_cast(sample_stream_))); } - cudaStreamSynchronize(stream_); + cudaStreamSynchronize(sample_stream_); + + debug_gpu_memory_info(gpuid_, "AllocResource end"); +} + +void GraphDataGenerator::AllocTrainResource(int thread_id) { + if (slot_num_ > 0) { + platform::CUDADeviceGuard guard(gpuid_); + if (!sage_mode_) { + d_feature_size_list_buf_ = + memory::AllocShared(place_, (batch_size_ * 2) * sizeof(uint32_t)); + d_feature_size_prefixsum_buf_ = + memory::AllocShared(place_, (batch_size_ * 2 + 1) * sizeof(uint32_t)); + } else { + d_feature_size_list_buf_ = NULL; + d_feature_size_prefixsum_buf_ = NULL; + } + } } void GraphDataGenerator::SetConfig( @@ -1156,48 +2772,34 @@ void GraphDataGenerator::SetConfig( repeat_time_ = graph_config.sample_times_one_chunk(); buf_size_ = once_sample_startid_len_ * walk_len_ * walk_degree_ * repeat_time_; - VLOG(2) << "Confirm GraphConfig, walk_degree : " << walk_degree_ + train_table_cap_ = graph_config.train_table_cap(); + infer_table_cap_ = graph_config.infer_table_cap(); + epoch_finish_ = false; + VLOG(1) << "Confirm GraphConfig, walk_degree : " << walk_degree_ << ", walk_len : " << walk_len_ << ", window : " << window_ << ", once_sample_startid_len : " << once_sample_startid_len_ << ", sample_times_one_chunk : " << repeat_time_ - << ", batch_size: " << batch_size_; + << ", batch_size: " << batch_size_ + << ", train_table_cap: " << train_table_cap_ + << ", infer_table_cap: " << infer_table_cap_; std::string first_node_type = graph_config.first_node_type(); std::string meta_path = graph_config.meta_path(); + sage_mode_ = graph_config.sage_mode(); + std::string str_samples = graph_config.samples(); auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + debug_gpu_memory_info("init_conf start"); + gpu_graph_ptr->init_conf(first_node_type, meta_path); + debug_gpu_memory_info("init_conf end"); + auto edge_to_id = gpu_graph_ptr->edge_to_id; - auto node_to_id = gpu_graph_ptr->feature_to_id; - // parse first_node_type - auto node_types = - paddle::string::split_string(first_node_type, ";"); - VLOG(2) << "node_types: " << first_node_type; - finish_node_type_.clear(); - node_type_start_.clear(); - for (auto &type : node_types) { - auto iter = node_to_id.find(type); - PADDLE_ENFORCE_NE( - iter, - node_to_id.end(), - platform::errors::NotFound("(%s) is not found in node_to_id.", type)); - VLOG(2) << "node_to_id[" << type << "] = " << iter->second; - first_node_type_.push_back(iter->second); - node_type_start_[iter->second] = 0; - } - meta_path_.resize(first_node_type_.size()); - auto meta_paths = paddle::string::split_string(meta_path, ";"); - - for (size_t i = 0; i < meta_paths.size(); i++) { - auto path = meta_paths[i]; - auto nodes = paddle::string::split_string(path, "-"); - for (auto &node : nodes) { - auto iter = edge_to_id.find(node); - PADDLE_ENFORCE_NE( - iter, - edge_to_id.end(), - 
platform::errors::NotFound("(%s) is not found in edge_to_id.", node)); - VLOG(2) << "edge_to_id[" << node << "] = " << iter->second; - meta_path_[i].push_back(iter->second); - } + edge_to_id_len_ = edge_to_id.size(); + sage_batch_count_ = 0; + auto samples = paddle::string::split_string(str_samples, ";"); + for (size_t i = 0; i < samples.size(); i++) { + int sample_size = std::stoi(samples[i]); + samples_.emplace_back(sample_size); } + copy_unique_len_ = 0; } } // namespace framework diff --git a/paddle/fluid/framework/data_feed.h b/paddle/fluid/framework/data_feed.h index c8142deb99094..77bce79338161 100644 --- a/paddle/fluid/framework/data_feed.h +++ b/paddle/fluid/framework/data_feed.h @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 +http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, @@ -60,6 +60,8 @@ class Scope; class Variable; class NeighborSampleResult; class NodeQueryResult; +template +class HashTable; } // namespace framework } // namespace paddle @@ -878,6 +880,9 @@ struct BufState { int GetNextBatch() { cursor += len; + if (row_num - cursor < 0) { + return 0; + } int tmp_len = cursor + batch_size > row_num ? row_num - cursor : batch_size; if (tmp_len == 0) { return 0; @@ -895,11 +900,16 @@ class GraphDataGenerator { GraphDataGenerator() {} virtual ~GraphDataGenerator() {} void SetConfig(const paddle::framework::DataFeedDesc& data_feed_desc); - void AllocResource(const paddle::platform::Place& place, - std::vector feed_vec); + void AllocResource(int thread_id, std::vector feed_vec); + void AllocTrainResource(int thread_id); + void SetFeedVec(std::vector feed_vec); int AcquireInstance(BufState* state); int GenerateBatch(); - int FillWalkBuf(std::shared_ptr d_walk); + int FillWalkBuf(); + int FillWalkBufMultiPath(); + int FillInferBuf(); + void DoWalkandSage(); + int FillSlotFeature(uint64_t* d_walk); int FillFeatureBuf(uint64_t* d_walk, uint64_t* d_feature, size_t key_num); int FillFeatureBuf(std::shared_ptr d_walk, std::shared_ptr d_feature); @@ -910,57 +920,129 @@ class GraphDataGenerator { int cur_degree, int step, int* len_per_row); - int FillInsBuf(); + int FillInsBuf(cudaStream_t stream); + int FillIdShowClkTensor(int total_instance, + bool gpu_graph_training, + size_t cursor = 0); + int FillGraphIdShowClkTensor(int uniq_instance, + int total_instance, + int index); + int FillGraphSlotFeature( + int total_instance, + bool gpu_graph_training, + std::shared_ptr final_sage_nodes = nullptr); + int FillSlotFeature(uint64_t* d_walk, size_t key_num); + int MakeInsPair(cudaStream_t stream); + uint64_t CopyUniqueNodes(); + int GetPathNum() { return total_row_; } + void ResetPathNum() { total_row_ = 0; } + void ResetEpochFinish() { epoch_finish_ = false; } + void ClearSampleState(); void SetDeviceKeys(std::vector* device_keys, int type) { - type_to_index_[type] = h_device_keys_.size(); - h_device_keys_.push_back(device_keys); - } + // type_to_index_[type] = h_device_keys_.size(); + // h_device_keys_.push_back(device_keys); + } + + std::vector> SampleNeighbors( + int64_t* uniq_nodes, + int len, + int sample_size, + std::vector& edges_split_num, // NOLINT + int64_t* neighbor_len); + std::shared_ptr FillReindexHashTable(int64_t* input, + int num_input, + int64_t len_hashtable, + 
int64_t* keys, + int* values, + int* key_index, + int* final_nodes_len); + std::shared_ptr GetReindexResult(int64_t* reindex_src_data, + int64_t* center_nodes, + int* final_nodes_len, + int node_len, + int64_t neighbor_len); + std::shared_ptr GenerateSampleGraph( + uint64_t* node_ids, + int len, + int* uniq_len, + std::shared_ptr& inverse); // NOLINT + int InsertTable(const uint64_t* d_keys, + uint64_t len, + std::shared_ptr d_uniq_node_num); + std::vector& GetHostVec() { return host_vec_; } + bool get_epoch_finish() { return epoch_finish_; } + void clear_gpu_mem(); protected: + HashTable* table_; int walk_degree_; int walk_len_; int window_; int once_sample_startid_len_; int gpuid_; - // start ids - // int64_t* device_keys_; - // size_t device_key_size_; - std::vector*> h_device_keys_; - std::unordered_map type_to_index_; - // point to device_keys_ size_t cursor_; + int thread_id_; size_t jump_rows_; + int edge_to_id_len_; int64_t* id_tensor_ptr_; + int* index_tensor_ptr_; int64_t* show_tensor_ptr_; int64_t* clk_tensor_ptr_; - cudaStream_t stream_; + + cudaStream_t train_stream_; + cudaStream_t sample_stream_; paddle::platform::Place place_; std::vector feed_vec_; std::vector offset_; std::shared_ptr d_prefix_sum_; std::vector> d_device_keys_; + std::shared_ptr d_train_metapath_keys_; std::shared_ptr d_walk_; + std::shared_ptr d_feature_list_; std::shared_ptr d_feature_; std::shared_ptr d_len_per_row_; std::shared_ptr d_random_row_; - // + std::shared_ptr d_uniq_node_num_; + std::shared_ptr d_slot_feature_num_map_; + std::shared_ptr d_actual_slot_id_map_; + std::shared_ptr d_fea_offset_map_; + std::vector> d_sampleidx2rows_; int cur_sampleidx2row_; // record the keys to call graph_neighbor_sample std::shared_ptr d_sample_keys_; int sample_keys_len_; - std::set finish_node_type_; - std::unordered_map node_type_start_; - std::vector infer_node_type_start_; - std::shared_ptr d_ins_buf_; - std::shared_ptr d_feature_buf_; + std::shared_ptr d_feature_size_list_buf_; + std::shared_ptr d_feature_size_prefixsum_buf_; std::shared_ptr d_pair_num_; std::shared_ptr d_slot_tensor_ptr_; std::shared_ptr d_slot_lod_tensor_ptr_; + std::shared_ptr d_reindex_table_key_; + std::shared_ptr d_reindex_table_value_; + std::shared_ptr d_reindex_table_index_; + std::vector> edge_type_graph_; + std::shared_ptr d_sorted_keys_; + std::shared_ptr d_sorted_idx_; + std::shared_ptr d_offset_; + std::shared_ptr d_merged_cnts_; + std::shared_ptr d_buf_; + + // sage mode batch data + std::vector> inverse_vec_; + std::vector> final_sage_nodes_vec_; + std::vector uniq_instance_vec_; + std::vector total_instance_vec_; + std::vector>> graph_edges_vec_; + std::vector>> edges_split_num_vec_; + + int64_t reindex_table_size_; + int sage_batch_count_; + int sage_batch_num_; int ins_buf_pair_len_; + // size of a d_walk buf size_t buf_size_; int repeat_time_; @@ -968,11 +1050,23 @@ class GraphDataGenerator { BufState buf_state_; int batch_size_; int slot_num_; + std::vector h_slot_feature_num_map_; + int fea_num_per_node_; int shuffle_seed_; int debug_mode_; - std::vector first_node_type_; - std::vector> meta_path_; bool gpu_graph_training_; + bool sage_mode_; + std::vector samples_; + bool epoch_finish_; + std::vector host_vec_; + std::vector h_device_keys_len_; + uint64_t h_train_metapath_keys_len_; + uint64_t train_table_cap_; + uint64_t infer_table_cap_; + uint64_t copy_unique_len_; + int total_row_; + size_t infer_node_start_; + size_t infer_node_end_; }; class DataFeed { @@ -1037,11 +1131,14 @@ class DataFeed { virtual void 
SetParseLogKey(bool parse_logkey) {} virtual void SetEnablePvMerge(bool enable_pv_merge) {} virtual void SetCurrentPhase(int current_phase) {} - virtual void SetDeviceKeys(std::vector* device_keys, int type) { #if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + virtual void InitGraphResource() {} + virtual void InitGraphTrainResource() {} + virtual void SetDeviceKeys(std::vector* device_keys, int type) { gpu_graph_data_generator_.SetDeviceKeys(device_keys, type); -#endif } +#endif + virtual void SetGpuGraphMode(int gpu_graph_mode) { gpu_graph_mode_ = gpu_graph_mode; } @@ -1058,6 +1155,42 @@ class DataFeed { return ins_content_vec_; } virtual int GetCurBatchSize() { return batch_size_; } + virtual int GetGraphPathNum() { +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + return gpu_graph_data_generator_.GetPathNum(); +#else + return 0; +#endif + } + +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + virtual const std::vector* GetHostVec() { + return &(gpu_graph_data_generator_.GetHostVec()); + } + + virtual void clear_gpu_mem() { gpu_graph_data_generator_.clear_gpu_mem(); } + + virtual bool get_epoch_finish() { + return gpu_graph_data_generator_.get_epoch_finish(); + } + + virtual void ResetPathNum() { gpu_graph_data_generator_.ResetPathNum(); } + + virtual void ClearSampleState() { + gpu_graph_data_generator_.ClearSampleState(); + } + + virtual void ResetEpochFinish() { + gpu_graph_data_generator_.ResetEpochFinish(); + } + + virtual void DoWalkandSage() { + PADDLE_THROW(platform::errors::Unimplemented( + "This function(DoWalkandSage) is not implemented.")); + } +#endif + + virtual bool IsTrainMode() { return train_mode_; } virtual void LoadIntoMemory() { PADDLE_THROW(platform::errors::Unimplemented( "This function(LoadIntoMemory) is not implemented.")); @@ -1132,6 +1265,7 @@ class DataFeed { #if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) GraphDataGenerator gpu_graph_data_generator_; #endif + bool train_mode_; }; // PrivateQueueDataFeed is the base virtual class for ohther DataFeeds. 
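// Illustrative sketch (not Paddle's actual classes; all names hypothetical) of
// the pattern the DataFeed changes above follow: the base class exposes
// graph-related hooks with safe defaults, and only graph-capable feeds
// override them, so non-graph feeds need no changes.
#include <stdexcept>

class DataFeedSketch {
 public:
  virtual ~DataFeedSketch() = default;
  virtual int GetGraphPathNum() { return 0; }        // harmless default
  virtual bool get_epoch_finish() { return false; }
  virtual void DoWalkandSage() {                     // graph feeds must override
    throw std::runtime_error("DoWalkandSage is not implemented.");
  }
};

class GraphFeedSketch : public DataFeedSketch {
 public:
  void DoWalkandSage() override { /* run the walk + sage sampling here */ }
};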
@@ -1669,6 +1803,13 @@ class SlotRecordInMemoryDataFeed : public InMemoryDataFeed { const int float_slot_size, const UsedSlotGpuType* used_slots); #endif + +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + virtual void InitGraphResource(void); + virtual void InitGraphTrainResource(void); + virtual void DoWalkandSage(); +#endif + float sample_rate_ = 1.0f; int use_slot_size_ = 0; int float_use_slot_size_ = 0; diff --git a/paddle/fluid/framework/data_feed.proto b/paddle/fluid/framework/data_feed.proto index 18d4cc7d4dc5c..e7ecf06e1551a 100644 --- a/paddle/fluid/framework/data_feed.proto +++ b/paddle/fluid/framework/data_feed.proto @@ -38,6 +38,10 @@ message GraphConfig { optional string first_node_type = 8; optional string meta_path = 9; optional bool gpu_graph_training = 10 [ default = true ]; + optional bool sage_mode = 11 [ default = false ]; + optional string samples = 12; + optional int64 train_table_cap = 13 [ default = 80000 ]; + optional int64 infer_table_cap = 14 [ default = 80000 ]; } message DataFeedDesc { diff --git a/paddle/fluid/framework/data_set.cc b/paddle/fluid/framework/data_set.cc index 529d532fb15e5..baa12950d0d6c 100644 --- a/paddle/fluid/framework/data_set.cc +++ b/paddle/fluid/framework/data_set.cc @@ -36,7 +36,9 @@ #endif USE_INT_STAT(STAT_total_feasign_num_in_mem); +USE_INT_STAT(STAT_epoch_finish); DECLARE_bool(graph_get_neighbor_id); +DECLARE_int32(gpugraph_storage_mode); namespace paddle { namespace framework { @@ -456,80 +458,56 @@ void DatasetImpl::LoadIntoMemory() { std::vector load_threads; if (gpu_graph_mode_) { VLOG(0) << "in gpu_graph_mode"; -#ifdef PADDLE_WITH_HETERPS - graph_all_type_total_keys_.clear(); - auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); - auto node_to_id = gpu_graph_ptr->feature_to_id; - auto edge_to_id = gpu_graph_ptr->edge_to_id; - graph_all_type_total_keys_.resize(node_to_id.size()); - int cnt = 0; - for (auto& iter : node_to_id) { - int node_idx = iter.second; - std::vector> gpu_graph_device_keys; - gpu_graph_ptr->get_all_id( - 1, node_idx, thread_num_, &gpu_graph_device_keys); - auto& type_total_key = graph_all_type_total_keys_[cnt]; - type_total_key.resize(thread_num_); - for (size_t i = 0; i < gpu_graph_device_keys.size(); i++) { - VLOG(2) << "node type: " << node_idx << ", gpu_graph_device_keys[" << i - << "] = " << gpu_graph_device_keys[i].size(); - for (size_t j = 0; j < gpu_graph_device_keys[i].size(); j++) { - gpu_graph_total_keys_.push_back(gpu_graph_device_keys[i][j]); - type_total_key[i].push_back(gpu_graph_device_keys[i][j]); - } - } +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + for (size_t i = 0; i < readers_.size(); i++) { + readers_[i]->SetGpuGraphMode(gpu_graph_mode_); + } + if (STAT_GET(STAT_epoch_finish) == 1) { + VLOG(0) << "get epoch finish true"; + STAT_RESET(STAT_epoch_finish, 0); for (size_t i = 0; i < readers_.size(); i++) { - readers_[i]->SetDeviceKeys(&type_total_key[i], node_idx); - readers_[i]->SetGpuGraphMode(gpu_graph_mode_); + readers_[i]->ResetPathNum(); + readers_[i]->ResetEpochFinish(); } - cnt++; + return; } - VLOG(2) << "begin add feature_id into gpu_graph_total_keys_ size[" - << gpu_graph_total_keys_.size() << "]"; - for (auto& iter : node_to_id) { - std::vector> gpu_graph_device_keys; - int node_idx = iter.second; - gpu_graph_ptr->get_all_feature_ids( - 1, node_idx, thread_num_, &gpu_graph_device_keys); - for (size_t i = 0; i < gpu_graph_device_keys.size(); i++) { - VLOG(2) << "begin node type: " << node_idx << ", gpu_graph_device_keys[" - << i << 
"] = " << gpu_graph_device_keys[i].size(); - for (size_t j = 0; j < gpu_graph_device_keys[i].size(); j++) { - gpu_graph_total_keys_.push_back(gpu_graph_device_keys[i][j]); - } - VLOG(2) << "end node type: " << node_idx << ", gpu_graph_device_keys[" - << i << "] = " << gpu_graph_device_keys[i].size(); + for (int64_t i = 0; i < thread_num_; ++i) { + load_threads.push_back(std::thread( + &paddle::framework::DataFeed::DoWalkandSage, readers_[i].get())); + } + for (std::thread& t : load_threads) { + t.join(); + } + uint64_t node_num = 0; + for (int i = 0; i < thread_num_; i++) { + auto host_vec = readers_[i]->GetHostVec(); + node_num += host_vec->size(); + } + gpu_graph_total_keys_.reserve(node_num); + for (int i = 0; i < thread_num_; i++) { + auto host_vec = readers_[i]->GetHostVec(); + for (size_t j = 0; j < host_vec->size(); j++) { + gpu_graph_total_keys_.push_back((*host_vec)[j]); } } - VLOG(2) << "end add feature_id into gpu_graph_total_keys_ size[" - << gpu_graph_total_keys_.size() << "]"; - // FIX: trick for iterate edge table - for (auto& iter : edge_to_id) { - int edge_idx = iter.second; - std::vector> gpu_graph_device_keys; - gpu_graph_ptr->get_all_id( - 0, edge_idx, thread_num_, &gpu_graph_device_keys); - for (size_t i = 0; i < gpu_graph_device_keys.size(); i++) { - VLOG(1) << "edge type: " << edge_idx << ", gpu_graph_device_keys[" << i - << "] = " << gpu_graph_device_keys[i].size(); - for (size_t j = 0; j < gpu_graph_device_keys[i].size(); j++) { - gpu_graph_total_keys_.push_back(gpu_graph_device_keys[i][j]); - } + if (GetEpochFinish() == true) { + VLOG(0) << "epoch finish, set stat and clear sample stat!"; + STAT_RESET(STAT_epoch_finish, 1); + for (size_t i = 0; i < readers_.size(); i++) { + readers_[i]->ClearSampleState(); } - if (FLAGS_graph_get_neighbor_id) { - std::vector> gpu_graph_neighbor_keys; - gpu_graph_ptr->get_all_neighbor_id( - 0, edge_idx, thread_num_, &gpu_graph_neighbor_keys); - for (size_t i = 0; i < gpu_graph_neighbor_keys.size(); i++) { - for (size_t k = 0; k < gpu_graph_neighbor_keys[i].size(); k++) { - gpu_graph_total_keys_.push_back(gpu_graph_neighbor_keys[i][k]); - } - } + } + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + for (size_t i = 0; i < readers_.size(); i++) { + readers_[i]->clear_gpu_mem(); } } + + VLOG(2) << "end add edge into gpu_graph_total_keys_ size[" + << gpu_graph_total_keys_.size() << "]"; #endif } else { for (int64_t i = 0; i < thread_num_; ++i) { @@ -1126,7 +1104,30 @@ void DatasetImpl::DestroyPreLoadReaders() { template int64_t DatasetImpl::GetMemoryDataSize() { - return input_channel_->Size(); + if (gpu_graph_mode_) { + int64_t total_path_num = 0; + for (int i = 0; i < thread_num_; i++) { + total_path_num += readers_[i]->GetGraphPathNum(); + } + return total_path_num; + } else { + return input_channel_->Size(); + } +} + +template +bool DatasetImpl::GetEpochFinish() { +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + bool is_epoch_finish = true; + if (gpu_graph_mode_) { + for (int i = 0; i < thread_num_; i++) { + is_epoch_finish = is_epoch_finish && readers_[i]->get_epoch_finish(); + } + } + return is_epoch_finish; +#else + return false; +#endif } template @@ -1783,6 +1784,9 @@ void SlotRecordDataset::CreateReaders() { readers_[i]->SetParseLogKey(parse_logkey_); readers_[i]->SetEnablePvMerge(enable_pv_merge_); readers_[i]->SetCurrentPhase(current_phase_); +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + readers_[i]->InitGraphResource(); +#endif if (input_channel_ != nullptr) 
{ readers_[i]->SetInputChannel(input_channel_.get()); } diff --git a/paddle/fluid/framework/data_set.h b/paddle/fluid/framework/data_set.h index ae84735790aaa..599d40318c7b1 100644 --- a/paddle/fluid/framework/data_set.h +++ b/paddle/fluid/framework/data_set.h @@ -167,6 +167,10 @@ class Dataset { virtual void SetGpuGraphMode(int is_graph_mode) = 0; virtual int GetGpuGraphMode() = 0; + virtual bool GetEpochFinish() = 0; + + virtual void SetPassId(uint32_t pass_id) = 0; + virtual uint32_t GetPassID() = 0; protected: virtual int ReceiveFromClient(int msg_type, @@ -260,11 +264,7 @@ class DatasetImpl : public Dataset { virtual void DynamicAdjustReadersNum(int thread_num); virtual void SetFleetSendSleepSeconds(int seconds); virtual std::vector GetSlots(); - /* for enable_heterps_ - virtual void EnableHeterps(bool enable_heterps) { - enable_heterps_ = enable_heterps; - } - */ + virtual bool GetEpochFinish(); std::vector>& GetMultiOutputChannel() { return multi_output_channel_; @@ -280,7 +280,9 @@ class DatasetImpl : public Dataset { std::vector& GetGpuGraphTotalKeys() { return gpu_graph_total_keys_; } - Channel& GetInputChannelRef() { return input_channel_; } + + virtual void SetPassId(uint32_t pass_id) { pass_id_ = pass_id; } + virtual uint32_t GetPassID() { return pass_id_; } protected: virtual int ReceiveFromClient(int msg_type, @@ -341,9 +343,9 @@ class DatasetImpl : public Dataset { std::vector use_slots_; bool enable_heterps_ = false; int gpu_graph_mode_ = 0; - // std::vector> gpu_graph_device_keys_; - std::vector>> graph_all_type_total_keys_; + std::vector>> gpu_graph_type_keys_; std::vector gpu_graph_total_keys_; + uint32_t pass_id_ = 0; }; // use std::vector or Record as data type diff --git a/paddle/fluid/framework/device_worker.h b/paddle/fluid/framework/device_worker.h index fa54a723adb3c..921df0452c7ae 100644 --- a/paddle/fluid/framework/device_worker.h +++ b/paddle/fluid/framework/device_worker.h @@ -242,6 +242,7 @@ class DeviceWorker { ChannelWriter writer_; const size_t tensor_iterator_thread_num = 16; platform::DeviceContext* dev_ctx_ = nullptr; + int thread_num_; }; class CPUWorkerBase : public DeviceWorker { @@ -289,6 +290,7 @@ class HogwildWorker : public CPUWorkerBase { HogwildWorkerParameter param_; std::vector skip_ops_; std::map stat_var_name_map_; + static std::atomic worker_num_stat_; }; class DownpourWorker : public HogwildWorker { diff --git a/paddle/fluid/framework/dist_multi_trainer_test.cc b/paddle/fluid/framework/dist_multi_trainer_test.cc index 06d84bca1273d..ae88dbab057b6 100644 --- a/paddle/fluid/framework/dist_multi_trainer_test.cc +++ b/paddle/fluid/framework/dist_multi_trainer_test.cc @@ -53,5 +53,34 @@ TEST(DisMultiTrainerTest, test1) { tmp1->Finalize(); #endif } + +TEST(DisMultiTrainerTest, testforgpugraph) { +#ifdef _LINUX + TrainerDesc t; + t.set_class_name("MultiTrainer"); + t.set_device_worker_name("HogwildWorker"); + t.set_thread_num(1); + auto* m = t.mutable_downpour_param()->add_program_config(); + m->set_program_id("123"); + std::string str; + str += "name: \"MultiSlotDataFeed\"\nbatch_size: 2\nmulti_slot_desc {\n"; + str += "slots {\nname: \"words\"\ntype: \"uint64\"\nis_dense: false\n"; + str += "is_used: true\n}\nslots {\nname: \"label\"\ntype: \"uint64\"\n"; + str += "is_dense: false\nis_used: true\n}\n}\n"; + std::shared_ptr dataset = + std::make_shared(); + dataset->SetFileList(std::vector()); + dataset->SetThreadNum(1); + dataset->SetTrainerNum(1); + dataset->SetDataFeedDesc(str); + dataset->CreateReaders(); + dataset->SetGpuGraphMode(true); 
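The header changes above add GetEpochFinish(), SetPassId() and GetPassID() to the Dataset interface, with DatasetImpl AND-reducing the per-reader epoch flags. A toy sketch of that contract, using made-up FakeReader/FakeDataset classes rather than the real Dataset hierarchy:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-in for the per-thread readers; not a Paddle class.
    struct FakeReader {
      bool epoch_finish = false;
      bool get_epoch_finish() const { return epoch_finish; }
    };

    class FakeDataset {
     public:
      void SetPassId(uint32_t pass_id) { pass_id_ = pass_id; }
      uint32_t GetPassID() const { return pass_id_; }
      // An epoch only counts as finished when every reader reports finished.
      bool GetEpochFinish() const {
        bool finished = true;
        for (const auto& r : readers_) finished = finished && r.get_epoch_finish();
        return finished;
      }
      std::vector<FakeReader> readers_;

     private:
      uint32_t pass_id_ = 0;
    };

    int main() {
      FakeDataset ds;
      ds.readers_.resize(2);
      ds.SetPassId(2);
      ds.readers_[0].epoch_finish = true;
      std::cout << ds.GetEpochFinish() << "\n";  // 0: one reader still running
      ds.readers_[1].epoch_finish = true;
      std::cout << ds.GetEpochFinish() << " pass=" << ds.GetPassID() << "\n";  // 1 pass=2
      return 0;
    }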
+ dataset->GetMemoryDataSize(); + dataset->SetPassId(2); + dataset->GetPassID(); + dataset->GetEpochFinish(); +#endif +} + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/distributed_strategy.proto b/paddle/fluid/framework/distributed_strategy.proto index e792d2a38dc7e..52c24fffc7f92 100755 --- a/paddle/fluid/framework/distributed_strategy.proto +++ b/paddle/fluid/framework/distributed_strategy.proto @@ -287,6 +287,7 @@ message CtrAccessorParameter { optional float delete_after_unseen_days = 8 [ default = 30 ]; optional int32 ssd_unseenday_threshold = 9 [ default = 1 ]; optional bool show_scale = 10 [ default = true ]; + repeated float load_filter_slots = 11; } message TableAccessorSaveParameter { diff --git a/paddle/fluid/framework/fleet/CMakeLists.txt b/paddle/fluid/framework/fleet/CMakeLists.txt index 4cf3ab8dc1a67..d4034ee059f27 100644 --- a/paddle/fluid/framework/fleet/CMakeLists.txt +++ b/paddle/fluid/framework/fleet/CMakeLists.txt @@ -29,7 +29,8 @@ if(WITH_HETERPS) nv_library( ps_gpu_wrapper SRCS ps_gpu_wrapper.cu ps_gpu_wrapper.cc - DEPS heter_ps gloo_wrapper ps_framework_proto ${BRPC_DEPS}) + DEPS heter_ps gloo_wrapper ps_framework_proto graph_gpu_wrapper + ${BRPC_DEPS}) else() nv_library( ps_gpu_wrapper diff --git a/paddle/fluid/framework/fleet/gloo_wrapper.cc b/paddle/fluid/framework/fleet/gloo_wrapper.cc index 047e121b4e3ca..e6385a3825275 100644 --- a/paddle/fluid/framework/fleet/gloo_wrapper.cc +++ b/paddle/fluid/framework/fleet/gloo_wrapper.cc @@ -352,7 +352,8 @@ void GlooWrapper::Init() { } #endif is_initialized_ = true; - VLOG(3) << "gloo initialized done."; + VLOG(0) << "gloo initialized done, rank=" << rank_ << ", size=" << size_ + << ", store_type=" << store_type_; } template std::vector GlooWrapper::AllReduce( diff --git a/paddle/fluid/framework/fleet/heter_context.h b/paddle/fluid/framework/fleet/heter_context.h index ef2e73d6dd5b5..68fed7bc78e23 100644 --- a/paddle/fluid/framework/fleet/heter_context.h +++ b/paddle/fluid/framework/fleet/heter_context.h @@ -85,7 +85,9 @@ class HeterContext { std::vector> dim_mutex_; int multi_mf_dim_ = 0; + void* sub_graph_feas = NULL; uint32_t shard_num_ = 37; + uint16_t pass_id_ = 0; uint64_t size() { uint64_t total_size = 0; for (auto& keys : feature_keys_) { diff --git a/paddle/fluid/framework/fleet/heter_ps/cudf/concurrent_unordered_map.cuh.h b/paddle/fluid/framework/fleet/heter_ps/cudf/concurrent_unordered_map.cuh.h index 85bf6bb553b22..b4590548d70fb 100644 --- a/paddle/fluid/framework/fleet/heter_ps/cudf/concurrent_unordered_map.cuh.h +++ b/paddle/fluid/framework/fleet/heter_ps/cudf/concurrent_unordered_map.cuh.h @@ -524,6 +524,7 @@ class concurrent_unordered_map : public managed { __forceinline__ __device__ iterator insert(const value_type& x, aggregation_type op, + uint64_t* local_count = NULL, comparison_type keys_equal = key_equal(), bool precomputed_hash = false, hash_value_type precomputed_hash_value = 0) { @@ -548,7 +549,6 @@ class concurrent_unordered_map : public managed { const key_type insert_key = x.first; bool insert_success = false; - size_type counter = 0; while (false == insert_success) { if (counter++ >= hashtbl_size) { @@ -577,19 +577,20 @@ class concurrent_unordered_map : public managed { if (keys_equal(unused_key, old_key) || keys_equal(insert_key, old_key)) { update_existing_value(existing_value, x, op); insert_success = true; - if (m_enable_collision_stat) { - atomicAdd(&m_insert_times, 1); + if (local_count != NULL && keys_equal(unused_key, old_key)) { + 
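The concurrent_unordered_map change in this hunk threads an optional local_count pointer through insert() and bumps it atomically only when the claimed bucket previously held the unused key, so the counter ends up with the number of genuinely new insertions (collision statistics are now accumulated once per call rather than per probe). A host-side analogue of the count-only-fresh-inserts idea, with a mutex-guarded map standing in for the GPU hash table:

    #include <atomic>
    #include <cstdint>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    int main() {
      std::unordered_map<uint64_t, uint64_t> table;
      std::mutex mu;
      std::atomic<uint64_t> new_keys{0};

      auto insert = [&](uint64_t key, uint64_t val) {
        std::lock_guard<std::mutex> lock(mu);
        auto result = table.emplace(key, val);
        if (result.second) {  // slot was empty before: a genuinely new key
          new_keys.fetch_add(1, std::memory_order_relaxed);
        }
      };

      std::vector<std::thread> workers;
      for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&, t] {
          for (uint64_t k = 0; k < 100; ++k) insert(k % 50, t);  // heavy key overlap
        });
      }
      for (auto& w : workers) w.join();
      std::cout << "unique keys inserted: " << new_keys.load() << "\n";  // 50
      return 0;
    }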
atomicAdd(local_count, 1); } break; } - - if (m_enable_collision_stat) { - atomicAdd(&m_insert_collisions, 1); - } current_index = (current_index + 1) % hashtbl_size; current_hash_bucket = &(hashtbl_values[current_index]); } + if (m_enable_collision_stat) { + atomicAdd(&m_insert_times, 1); + atomicAdd(&m_insert_collisions, uint64_t(counter + 1)); + } + return iterator( m_hashtbl_values, m_hashtbl_values + hashtbl_size, current_hash_bucket); } @@ -675,15 +676,13 @@ x.second ); begin_ptr = m_hashtbl_values + m_hashtbl_size; break; } - if (m_enable_collision_stat) { - atomicAdd(&m_query_collisions, 1); - } hash_tbl_idx = (hash_tbl_idx + 1) % m_hashtbl_size; ++counter; } if (m_enable_collision_stat) { atomicAdd(&m_query_times, 1); + atomicAdd(&m_query_collisions, (uint64_t)(counter + 1)); } return const_iterator( diff --git a/paddle/fluid/framework/fleet/heter_ps/feature_value.cu b/paddle/fluid/framework/fleet/heter_ps/feature_value.cu index 80a827e6ad0e8..d7004a43cd840 100644 --- a/paddle/fluid/framework/fleet/heter_ps/feature_value.cu +++ b/paddle/fluid/framework/fleet/heter_ps/feature_value.cu @@ -60,18 +60,17 @@ __global__ void PullDedupCopy(const size_t N, const int64_t* slot_lens, uint64_t max_val_size, const int* slot_dims, - const int hidden, + const size_t hidden, const int* key2slot, const uint32_t* restore_idx, TAccess accessor) { - CUDA_KERNEL_LOOP(idx, N) { + CUDA_KERNEL_LOOP_TYPE(idx, N, size_t) { int i = idx / hidden; int off = idx % hidden; int x = key2slot[i]; int y = i - slot_lens[x]; - assert(slot_dims[x] == hidden); float* dest_ptr = dest[x] + y * hidden; // 0 key fill zero if (total_keys[i] == 0) { @@ -92,10 +91,11 @@ __global__ void PullDedupCopy(const size_t N, *(dest_ptr + off) = src_ptr[accessor.EmbedWIndex()]; break; default: - if (src_ptr[accessor.MfSizeIndex()] == 0) { + int embedx_id = off - 3; + if (embedx_id >= static_cast(src_ptr[accessor.MfSizeIndex()])) { *(dest_ptr + off) = 0; } else { - *(dest_ptr + off) = src_ptr[accessor.EmbedxWIndex() + off - 3]; + *(dest_ptr + off) = src_ptr[accessor.EmbedxWIndex() + embedx_id]; } break; } @@ -128,9 +128,10 @@ __global__ void PushCopyWithPool(float* dest, float* cur = (float*)((char*)dest + i * grad_value_size); // NOLINT cur[gpu_accessor.common_push_value.SlotIndex()] = - (float)slot_vector[x]; // NOLINT + static_cast(slot_vector[x]); int mf_dim = mf_dim_vector[x]; - cur[gpu_accessor.common_push_value.MfDimIndex()] = mf_dim; + cur[gpu_accessor.common_push_value.MfDimIndex()] = + static_cast(mf_dim); cur[gpu_accessor.common_push_value.ShowIndex()] = *(src[x] + y * (mf_dim + 3)); @@ -159,7 +160,7 @@ __global__ void PushMergeCopyAtomic(const size_t N, const uint32_t* d_restore_idx, size_t grad_value_size, TAccess accessor) { - CUDA_KERNEL_LOOP(idx, N) { + CUDA_KERNEL_LOOP_TYPE(idx, N, size_t) { int i = idx / hidden; int off = idx % hidden; // filter 0 keys @@ -176,8 +177,8 @@ __global__ void PushMergeCopyAtomic(const size_t N, int mf_dim = slot_dims[x] - 3; switch (off) { case 0: - cur[accessor.SlotIndex()] = (float)slot_vector[x]; // NOLINT - cur[accessor.MfDimIndex()] = mf_dim; + cur[accessor.SlotIndex()] = static_cast(slot_vector[x]); + cur[accessor.MfDimIndex()] = static_cast(mf_dim); phi::CudaAtomicAdd(&cur[accessor.ShowIndex()], *(ptr + off)); break; case 1: @@ -189,11 +190,10 @@ __global__ void PushMergeCopyAtomic(const size_t N, break; default: int embedx_idx = off - 3; - if (mf_dim < embedx_idx) { - return; + if (embedx_idx < mf_dim) { + phi::CudaAtomicAdd(&cur[accessor.EmbedxGIndex() + embedx_idx], + *(ptr + off) 
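PullDedupCopy and the push kernels above move to CUDA_KERNEL_LOOP_TYPE(idx, N, size_t): with N = key_num * hidden, a 32-bit induction variable can overflow on large batches. A self-contained grid-stride kernel with a size_t index illustrating the same fix (a sketch, not the Paddle kernel itself):

    #include <cstdint>

    // Grid-stride loop with a size_t induction variable, so the index stays
    // valid even when n exceeds INT_MAX.
    __global__ void scale(float* data, size_t n, float factor) {
      for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
           idx += static_cast<size_t>(gridDim.x) * blockDim.x) {
        data[idx] *= factor;
      }
    }

    int main() {
      const size_t n = 1 << 20;
      float* d = nullptr;
      cudaMalloc(reinterpret_cast<void**>(&d), n * sizeof(float));
      cudaMemset(d, 0, n * sizeof(float));
      scale<<<256, 256>>>(d, n, 2.0f);
      cudaDeviceSynchronize();
      cudaFree(d);
      return 0;
    }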
* -1. * bs); } - phi::CudaAtomicAdd(&cur[accessor.EmbedxGIndex() + embedx_idx], - *(ptr + off) * -1. * bs); break; } } @@ -223,7 +223,7 @@ __global__ void PushMergeCopy(const size_t N, const uint32_t* d_sort_cnt, size_t grad_value_size, TAccess accessor) { - CUDA_KERNEL_LOOP(idx, N) { + CUDA_KERNEL_LOOP_TYPE(idx, N, size_t) { int i = idx / hidden; int off = idx % hidden; // filter 0 keys @@ -232,8 +232,8 @@ __global__ void PushMergeCopy(const size_t N, if (total_keys[i] == 0) { switch (off) { case 0: - cur[accessor.SlotIndex()] = 0; - cur[accessor.MfDimIndex()] = 0; + cur[accessor.SlotIndex()] = static_cast(0); + cur[accessor.MfDimIndex()] = static_cast(0); cur[accessor.ShowIndex()] = 0.0; break; case 1: @@ -261,8 +261,8 @@ __global__ void PushMergeCopy(const size_t N, switch (off) { case 0: - cur[accessor.SlotIndex()] = (float)slot_vector[x]; // NOLINT - cur[accessor.MfDimIndex()] = mf_dim; + cur[accessor.SlotIndex()] = static_cast(slot_vector[x]); + cur[accessor.MfDimIndex()] = static_cast(mf_dim); SUM_GRAD_VALUE cur[accessor.ShowIndex()] = val; break; @@ -276,12 +276,12 @@ __global__ void PushMergeCopy(const size_t N, break; default: int embedx_idx = off - 3; - if (mf_dim < embedx_idx) { + if (embedx_idx < mf_dim) { + SUM_GRAD_VALUE + cur[accessor.EmbedxGIndex() + embedx_idx] = val * -1. * bs; + } else { cur[accessor.EmbedxGIndex() + embedx_idx] = 0.0; - return; } - SUM_GRAD_VALUE - cur[accessor.EmbedxGIndex() + embedx_idx] = val * -1. * bs; break; } } diff --git a/paddle/fluid/framework/fleet/heter_ps/feature_value.h b/paddle/fluid/framework/fleet/heter_ps/feature_value.h index 5150cc5dba717..8c6925e938a88 100644 --- a/paddle/fluid/framework/fleet/heter_ps/feature_value.h +++ b/paddle/fluid/framework/fleet/heter_ps/feature_value.h @@ -65,29 +65,35 @@ class CommonFeatureValueAccessor { __host__ __device__ size_t Size() { return TYPEALIGN(8, Dim() * sizeof(float)); } // cpu_ptr:uint64=2float - __host__ __device__ int EmbedDim() { return embed_sgd_dim; } - __host__ __device__ int EmbedXDim() { return embedx_sgd_dim; } - __host__ __device__ int EmbedWDim() { return embedx_dim; } - __host__ __device__ int CpuPtrIndex() { return 0; } // cpuprt uint64 - __host__ __device__ int DeltaScoreIndex() { return CpuPtrIndex() + 2; } - __host__ __device__ int ShowIndex() { return DeltaScoreIndex() + 1; } - __host__ __device__ int ClickIndex() { return ShowIndex() + 1; } - __host__ __device__ int EmbedWIndex() { return ClickIndex() + 1; } - __host__ __device__ int EmbedG2SumIndex() { return EmbedWIndex() + 1; } - __host__ __device__ int SlotIndex() { + __host__ __device__ int EmbedDim() const { return embed_sgd_dim; } + __host__ __device__ int EmbedXDim() const { return embedx_sgd_dim; } + __host__ __device__ int EmbedWDim() const { return embedx_dim; } + __host__ __device__ int CpuPtrIndex() const { return 0; } // cpuprt uint64 + __host__ __device__ int DeltaScoreIndex() const { + return CpuPtrIndex() + 2; + } + __host__ __device__ int ShowIndex() const { return DeltaScoreIndex() + 1; } + __host__ __device__ int ClickIndex() const { return ShowIndex() + 1; } + __host__ __device__ int EmbedWIndex() const { return ClickIndex() + 1; } + __host__ __device__ int EmbedG2SumIndex() const { + return EmbedWIndex() + 1; + } + __host__ __device__ int SlotIndex() const { return EmbedG2SumIndex() + embed_sgd_dim; } - __host__ __device__ int MfDimIndex() { return SlotIndex() + 1; } - __host__ __device__ int MfSizeIndex() { + __host__ __device__ int MfDimIndex() const { return SlotIndex() + 1; } + __host__ __device__ 
int MfSizeIndex() const { return MfDimIndex() + 1; } // actual mf size (ex. 0) - __host__ __device__ int EmbedxG2SumIndex() { return MfSizeIndex() + 1; } - __host__ __device__ int EmbedxWIndex() { + __host__ __device__ int EmbedxG2SumIndex() const { + return MfSizeIndex() + 1; + } + __host__ __device__ int EmbedxWIndex() const { return EmbedxG2SumIndex() + embedx_sgd_dim; } // 根据mf_dim计算的总长度 - __host__ __device__ int Dim(int& mf_dim) { + __host__ __device__ int Dim(int mf_dim) { int tmp_embedx_sgd_dim = 1; if (optimizer_type_ == 3) { // adam tmp_embedx_sgd_dim = mf_dim * 2 + 2; @@ -98,12 +104,12 @@ class CommonFeatureValueAccessor { } // 根据mf_dim 计算的总byte数 - __host__ __device__ size_t Size(int& mf_dim) { + __host__ __device__ size_t Size(int mf_dim) { return TYPEALIGN(8, Dim(mf_dim) * sizeof(float)); // cpu_ptr:2float } // 根据mf_dim 计算的 mf_size byte数 - __host__ __device__ size_t MFSize(int& mf_dim) { + __host__ __device__ size_t MFSize(int mf_dim) { int tmp_embedx_sgd_dim = 1; if (optimizer_type_ == 3) { // adam tmp_embedx_sgd_dim = mf_dim * 2 + 2; @@ -117,9 +123,9 @@ class CommonFeatureValueAccessor { __host__ __device__ int EmbedxWOffsetIndex(float* val) { // has mf int tmp_embedx_sgd_dim = 1; - if (int(MfSize(val)) > 0) { + if (static_cast(MfSize(val)) > 0) { if (optimizer_type_ == 3) { // adam - tmp_embedx_sgd_dim = int(MfDim(val)) * 2 + 2; + tmp_embedx_sgd_dim = MfDim(val) * 2 + 2; } else if (optimizer_type_ == 4) { // shared_adam tmp_embedx_sgd_dim = 4; } @@ -166,12 +172,10 @@ class CommonFeatureValueAccessor { float mf_size std::vector embedx_w; */ - __host__ __device__ static int Dim(int embedx_dim) { - return 4 + embedx_dim; - } + __host__ __device__ int Dim(int embedx_dim) { return 4 + embedx_dim; } __host__ __device__ int DimSize(size_t dim) { return sizeof(float); } __host__ __device__ int Size(int embedx_dim) { - return TYPEALIGN(8, Dim(embedx_dim) * sizeof(float)); + return Dim(embedx_dim) * sizeof(float); } __host__ __device__ int ShowIndex() { return 0; } __host__ __device__ int ClickIndex() { return 1; } @@ -192,46 +196,46 @@ class CommonFeatureValueAccessor { std::vector embedx_g; */ - __host__ __device__ int Dim(int embedx_dim) { return 5 + embedx_dim; } + __host__ __device__ int Dim(int embedx_dim) const { return 5 + embedx_dim; } - __host__ __device__ int DimSize(int dim, int embedx_dim) { + __host__ __device__ int DimSize(int dim, int embedx_dim) const { return sizeof(float); } - __host__ __device__ int Size(int embedx_dim) { - return TYPEALIGN(8, Dim(embedx_dim) * sizeof(float)); + __host__ __device__ int Size(int embedx_dim) const { + return Dim(embedx_dim) * sizeof(float); } - __host__ __device__ int SlotIndex() { return 0; } - __host__ __device__ int ShowIndex() { + __host__ __device__ int SlotIndex() const { return 0; } + __host__ __device__ int ShowIndex() const { return CommonPushValue::SlotIndex() + 1; } - __host__ __device__ int ClickIndex() { + __host__ __device__ int ClickIndex() const { return CommonPushValue::ShowIndex() + 1; } - __host__ __device__ int MfDimIndex() { + __host__ __device__ int MfDimIndex() const { return CommonPushValue::ClickIndex() + 1; } - __host__ __device__ int EmbedGIndex() { + __host__ __device__ int EmbedGIndex() const { return CommonPushValue::MfDimIndex() + 1; } - __host__ __device__ int EmbedxGIndex() { + __host__ __device__ int EmbedxGIndex() const { return CommonPushValue::EmbedGIndex() + 1; } - __host__ __device__ float& Slot(float* val) { + __host__ __device__ float& Slot(float* val) const { return 
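The CommonFeatureValueAccessor changes in this hunk make the index getters const and pass mf_dim by value; the value layout is a chain of relative offsets (cpu_ptr as 2 floats, delta_score, show, click, embed_w, embed_g2sum, slot, mf_dim, mf_size, embedx_g2sum, embedx_w), and the optional embedding segment is sized from the optimizer type: adam keeps mf_dim * 2 + 2 g2sum floats, shared_adam keeps 4, anything else keeps 1. A simplified host-only model of that layout and sizing rule (field widths, and the assumption that the mf segment is g2sum plus mf_dim weights, follow the copy loops above but are illustrative, not the Paddle struct itself):

    #include <cassert>
    #include <cstdio>

    // Chained relative offsets: widening a field automatically shifts
    // everything that follows, which is why each index is defined in terms
    // of the previous one.
    struct FeatureLayout {
      int embed_sgd_dim = 1;   // g2sum floats for the dense embedding
      int embedx_sgd_dim = 1;  // g2sum floats for the sparse (mf) embedding

      int CpuPtrIndex() const { return 0; }  // uint64 cpu pointer = 2 floats
      int DeltaScoreIndex() const { return CpuPtrIndex() + 2; }
      int ShowIndex() const { return DeltaScoreIndex() + 1; }
      int ClickIndex() const { return ShowIndex() + 1; }
      int EmbedWIndex() const { return ClickIndex() + 1; }
      int EmbedG2SumIndex() const { return EmbedWIndex() + 1; }
      int SlotIndex() const { return EmbedG2SumIndex() + embed_sgd_dim; }
      int MfDimIndex() const { return SlotIndex() + 1; }
      int MfSizeIndex() const { return MfDimIndex() + 1; }
      int EmbedxG2SumIndex() const { return MfSizeIndex() + 1; }
      int EmbedxWIndex() const { return EmbedxG2SumIndex() + embedx_sgd_dim; }
    };

    // Floats in the optional (mf) segment: embedx g2sum + embedx weights,
    // with the g2sum width picked by optimizer type as in the hunk above.
    int MfFloatCount(int mf_dim, int optimizer_type) {
      int embedx_sgd_dim = 1;            // default
      if (optimizer_type == 3) {         // adam
        embedx_sgd_dim = mf_dim * 2 + 2;
      } else if (optimizer_type == 4) {  // shared_adam
        embedx_sgd_dim = 4;
      }
      return embedx_sgd_dim + mf_dim;
    }

    int main() {
      FeatureLayout layout;
      assert(layout.SlotIndex() == 7);
      assert(layout.EmbedxWIndex() == 11);
      std::printf("adam, mf_dim=8 -> %d mf floats\n", MfFloatCount(8, 3));         // 26
      std::printf("shared_adam, mf_dim=8 -> %d mf floats\n", MfFloatCount(8, 4));  // 12
      return 0;
    }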
val[CommonPushValue::SlotIndex()]; } - __host__ __device__ float& Show(float* val) { + __host__ __device__ float& Show(float* val) const { return val[CommonPushValue::ShowIndex()]; } - __host__ __device__ float& Click(float* val) { + __host__ __device__ float& Click(float* val) const { return val[CommonPushValue::ClickIndex()]; } - __host__ __device__ float& MfDim(float* val) { + __host__ __device__ float& MfDim(float* val) const { return val[CommonPushValue::MfDimIndex()]; } - __host__ __device__ float& EmbedG(float* val) { + __host__ __device__ float& EmbedG(float* val) const { return val[CommonPushValue::EmbedGIndex()]; } - __host__ __device__ float* EmbedxG(float* val) { + __host__ __device__ float* EmbedxG(float* val) const { return val + CommonPushValue::EmbedxGIndex(); } }; @@ -242,10 +246,10 @@ class CommonFeatureValueAccessor { __host__ int Initialize() { int optimizer_type = (_config.find("optimizer_type") == _config.end()) ? 1 - : int(_config["optimizer_type"]); + : _config["optimizer_type"]; int sparse_embedx_dim = (_config.find("embedx_dim") == _config.end()) ? 8 - : int(_config["embedx_dim"]); + : _config["embedx_dim"]; if (optimizer_type == 3) { // adam common_feature_value.embed_sgd_dim = 4; common_feature_value.embedx_sgd_dim = sparse_embedx_dim * 2 + 2; @@ -262,7 +266,7 @@ class CommonFeatureValueAccessor { return 0; } - __host__ int Configure(std::unordered_map& config) { + __host__ int Configure(const std::unordered_map& config) { _config = config; Initialize(); return 0; @@ -298,23 +302,24 @@ class CommonFeatureValueAccessor { } *(reinterpret_cast( gpu_val + common_feature_value.CpuPtrIndex())) = (uint64_t)(cpu); - cpu_val[cpu_accessor->common_feature_value.MfDimIndex()] = float(mf_dim); + cpu_val[cpu_accessor->common_feature_value.MfDimIndex()] = + static_cast(mf_dim); gpu_val[common_feature_value.MfDimIndex()] = mf_dim; if (cpu_dim > cpu_accessor->GetAccessorInfo().dim - cpu_accessor->GetAccessorInfo().mf_size / sizeof(float)) { gpu_val[common_feature_value.MfSizeIndex()] = common_feature_value.MFSize(mf_dim) / sizeof(float); - for (int x = 0; - x < int(common_feature_value.MFSize(mf_dim) / sizeof(float)); + for (size_t x = 0; + x < (common_feature_value.MFSize(mf_dim) / sizeof(float)); x++) { gpu_val[common_feature_value.EmbedxG2SumIndex() + x] = cpu_val[cpu_accessor->common_feature_value.EmbedxG2SumIndex() + x]; } } else { gpu_val[common_feature_value.MfSizeIndex()] = 0; - for (int x = common_feature_value.EmbedxG2SumIndex(); - x < int(common_feature_value.Size(mf_dim) / sizeof(float)); + for (size_t x = common_feature_value.EmbedxG2SumIndex(); + x < (common_feature_value.Size(mf_dim) / sizeof(float)); x++) { gpu_val[x] = 0; } @@ -335,9 +340,10 @@ class CommonFeatureValueAccessor { gpu_val + common_feature_value.CpuPtrIndex()))); size_t downpour_value_size = downpour_value->size(); if (gpu_val[common_feature_value.MfSizeIndex()] > 0 && - downpour_value_size == (cpu_accessor->GetAccessorInfo().dim - - int(cpu_accessor->GetAccessorInfo().mf_size / - sizeof(float)))) { // cpu_accessor + downpour_value_size == + (cpu_accessor->GetAccessorInfo().dim - + static_cast(cpu_accessor->GetAccessorInfo().mf_size / + sizeof(float)))) { // cpu_accessor downpour_value->resize(cpu_accessor->common_feature_value.Dim(mf_dim)); } float* cpu_val = downpour_value->data(); @@ -358,8 +364,8 @@ class CommonFeatureValueAccessor { } if (gpu_val[common_feature_value.MfSizeIndex()] > 0) { - for (int x = 0; - x < int(common_feature_value.MFSize(mf_dim) / sizeof(float)); + for (size_t x = 0; + x 
< (common_feature_value.MFSize(mf_dim) / sizeof(float)); x++) { cpu_val[cpu_accessor->common_feature_value.EmbedxG2SumIndex() + x] = gpu_val[common_feature_value.EmbedxG2SumIndex() + x]; @@ -395,8 +401,8 @@ class CommonFeatureValueAccessor { dest_val[common_feature_value.MfSizeIndex()] = src_val[common_feature_value.MfSizeIndex()]; - for (int x = common_feature_value.EmbedxG2SumIndex(); - x < int(common_feature_value.Size(mf_dim) / sizeof(float)); + for (size_t x = common_feature_value.EmbedxG2SumIndex(); + x < (common_feature_value.Size(mf_dim) / sizeof(float)); x++) { dest_val[x] = src_val[x]; } @@ -412,14 +418,19 @@ class CommonFeatureValueAccessor { dest_val[common_pull_value.EmbedWIndex()] = src_val[common_feature_value.EmbedWIndex()]; - int mf_size = int(src_val[common_feature_value.MfSizeIndex()]); + int mf_size = static_cast(src_val[common_feature_value.MfSizeIndex()]); if (mf_size == 0) { dest_val[common_pull_value.MfSizeIndex()] = 0; return; } // set pull value real dim size - int mf_dim = int(src_val[common_feature_value.MfDimIndex()]); + int mf_dim = static_cast(src_val[common_feature_value.MfDimIndex()]); dest_val[common_pull_value.MfSizeIndex()] = mf_dim; + // check + if (mf_dim > mf_size) { + printf("mf_dim[%d] <= mf_size[%d]", mf_dim, mf_size); + return; + } int embedx_off = common_pull_value.EmbedxWIndex(); int value_off = common_feature_value.EmbedxWIndex(); @@ -427,6 +438,13 @@ class CommonFeatureValueAccessor { dest_val[embedx_off + k] = src_val[value_off + k]; } } + // set zero value by infer + __host__ __device__ void PullZeroValue(float* dest_val) { + dest_val[common_pull_value.ShowIndex()] = 0.0; + dest_val[common_pull_value.ClickIndex()] = 0.0; + dest_val[common_pull_value.EmbedWIndex()] = 0.0; + dest_val[common_pull_value.MfSizeIndex()] = 0; + } // dy_mf_fill_shard_grads_kernel,update_one 阶段 gpukernel // 中从src_val赋值给dest_val @@ -443,7 +461,7 @@ class CommonFeatureValueAccessor { dest_val[common_push_value.EmbedGIndex()] = src_val[common_push_value.EmbedGIndex()]; - for (int x = 0; x < int(src_val[common_push_value.MfDimIndex()]); x++) { + for (size_t x = 0; x < (src_val[common_push_value.MfDimIndex()]); x++) { dest_val[common_push_value.EmbedxGIndex() + x] = src_val[common_push_value.EmbedxGIndex() + x]; } @@ -451,7 +469,7 @@ class CommonFeatureValueAccessor { // update_basic 阶段 gpukernel 中从src_val赋值给dest_val __host__ __device__ void PushValueFillBasic(float* dest_val, - const float* src_val) { + const float* src_val) const { dest_val[common_push_value.SlotIndex()] = src_val[common_push_value.SlotIndex()]; dest_val[common_push_value.ShowIndex()] = @@ -473,7 +491,7 @@ class CommonFeatureValueAccessor { src_val[common_push_value.ClickIndex()]; dest_val[common_push_value.EmbedGIndex()] += src_val[common_push_value.EmbedGIndex()]; - for (int j = 0; j < int(dest_val[common_push_value.MfDimIndex()]); j++) { + for (size_t j = 0; j < (dest_val[common_push_value.MfDimIndex()]); j++) { dest_val[common_push_value.EmbedxGIndex() + j] += src_val[common_push_value.EmbedxGIndex() + j]; } @@ -481,7 +499,7 @@ class CommonFeatureValueAccessor { // merge_basic 阶段 gpukernel 中 PushValue 从src_val赋值给dest_val __host__ __device__ void MergePushValueBasic(float* dest_val, - const float* src_val) { + const float* src_val) const { dest_val[common_push_value.ShowIndex()] += src_val[common_push_value.ShowIndex()]; dest_val[common_push_value.ClickIndex()] += @@ -507,10 +525,10 @@ class CommonFeatureValueAccessor { *(dest_val + common_pull_value.EmbedWIndex()) = 
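The pull path above gains PullZeroValue for the all-zero case and a mf_dim > mf_size sanity check before copying mf_dim embedding floats; when there is no mf segment or the key is the padding key 0, the embedding part of the destination is zero-filled. A host-side restatement of that fill rule, with an illustrative 3-float header (show, click, embed_w) ahead of the embedding:

    #include <cstdint>
    #include <cstdio>

    // Fill one destination row: 3 header floats followed by mf_dim embedding
    // floats, zeroed when the source has no mf segment or the key is 0.
    void FillPullRow(float* dest, const float* src_header, const float* src_embedx,
                     int mf_size, int mf_dim, uint64_t key) {
      dest[0] = src_header[0];  // show
      dest[1] = src_header[1];  // click
      dest[2] = src_header[2];  // embed_w
      const bool no_embedding = (mf_size == 0) || (key == 0);
      for (int j = 0; j < mf_dim; ++j) {
        dest[3 + j] = no_embedding ? 0.0f : src_embedx[j];
      }
    }

    int main() {
      float header[3] = {1.0f, 0.0f, 0.5f};
      float embedx[4] = {0.1f, 0.2f, 0.3f, 0.4f};
      float row[7];
      FillPullRow(row, header, embedx, /*mf_size=*/4, /*mf_dim=*/4, /*key=*/42);
      std::printf("%f %f\n", row[2], row[6]);  // 0.500000 0.400000
      return 0;
    }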
src_val[common_feature_value.EmbedWIndex()]; } - - if (src_val[common_feature_value.MfSizeIndex()] == 0 || *key == 0) { + int mf_size = static_cast(src_val[common_feature_value.MfSizeIndex()]); + if (mf_size == 0 || *key == 0) { for (int j = 0; j < mf_dim; j++) { - *(dest_val + common_pull_value.EmbedxWIndex() + j) = 0; + *(dest_val + 3 + j) = 0; } } else { for (int j = 0; j < mf_dim; j++) { @@ -545,7 +563,8 @@ class CommonFeatureValueAccessor { i++) { os << " " << v[i]; } - int mf_dim = int(common_feature_value.MfDim(const_cast(v))); + int mf_dim = + static_cast(common_feature_value.MfDim(const_cast(v))); os << " slot: " << common_feature_value.Slot(const_cast(v)) << " mf_dim: " << mf_dim << " mf_size: " << common_feature_value.MfSize(const_cast(v)) @@ -642,11 +661,11 @@ class VirtualAccessor { public: virtual int Configure(std::unordered_map config) = 0; - virtual size_t GetFeatureValueSize(int& mf_dim) = 0; + virtual size_t GetFeatureValueSize(int& mf_dim) = 0; // NOLINT - virtual size_t GetPushValueSize(int& mf_dim) = 0; + virtual size_t GetPushValueSize(int& mf_dim) = 0; // NOLINT - virtual size_t GetPullValueSize(int& mf_dim) = 0; + virtual size_t GetPullValueSize(int& mf_dim) = 0; // NOLINT virtual void BuildFill(void* gpu_val, void* cpu_val, @@ -687,8 +706,8 @@ class VirtualAccessor { const uint64_t total_length, const int batch_size, size_t grad_value_size, - std::vector& slot_vector, - std::vector& slot_mf_dim_vector) = 0; + std::vector& slot_vector, // NOLINT + std::vector& slot_mf_dim_vector) = 0; // NOLINT // dedup virtual void CopyForPush(const paddle::platform::Place& place, @@ -729,7 +748,7 @@ class VirtualAccessor { template class AccessorWrapper : public VirtualAccessor { public: - explicit AccessorWrapper() {} + AccessorWrapper() {} virtual ~AccessorWrapper() {} AccessorWrapper(const AccessorWrapper&) = delete; AccessorWrapper& operator=(const AccessorWrapper&) = delete; @@ -738,15 +757,15 @@ class AccessorWrapper : public VirtualAccessor { return gpu_accessor_.Configure(config); } - virtual size_t GetFeatureValueSize(int& mf_dim) { + virtual size_t GetFeatureValueSize(int& mf_dim) { // NOLINT return gpu_accessor_.common_feature_value.Size(mf_dim); } - virtual size_t GetPushValueSize(int& mf_dim) { + virtual size_t GetPushValueSize(int& mf_dim) { // NOLINT return gpu_accessor_.common_push_value.Size(mf_dim); } - virtual size_t GetPullValueSize(int& mf_dim) { + virtual size_t GetPullValueSize(int& mf_dim) { // NOLINT return gpu_accessor_.common_pull_value.Size(mf_dim); } @@ -757,7 +776,7 @@ class AccessorWrapper : public VirtualAccessor { paddle::distributed::ValueAccessor* cpu_table_accessor, int mf_dim) { gpu_accessor_.BuildFill( - (float*)(gpu_val), cpu_val, cpu_table_accessor, mf_dim); + reinterpret_cast(gpu_val), cpu_val, cpu_table_accessor, mf_dim); } virtual void DumpFill(float* gpu_val, @@ -819,8 +838,8 @@ class AccessorWrapper : public VirtualAccessor { const uint64_t total_length, const int batch_size, size_t grad_value_size, - std::vector& slot_vector, - std::vector& slot_mf_dim_vector) { + std::vector& slot_vector, // NOLINT + std::vector& slot_mf_dim_vector) { // NOLINT CopyForPushImpl(place, grad_values, total_grad_values_gpu, @@ -914,8 +933,8 @@ class AccessorWrapper : public VirtualAccessor { const uint64_t total_length, const int batch_size, size_t grad_value_size, - std::vector& slot_vector, - std::vector& slot_mf_dim_vector); + std::vector& slot_vector, // NOLINT + std::vector& slot_mf_dim_vector); // NOLINT void CopyForPullDedupImpl(const 
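AccessorWrapper above keeps a concrete GPUAccessor and forwards the virtual size queries to it, so the rest of the framework only sees VirtualAccessor. A stripped-down sketch of that template-behind-a-virtual-interface pattern (DemoAccessor and its layout math are invented for the example):

    #include <cstddef>
    #include <iostream>
    #include <memory>

    // The type-erased interface the rest of the framework talks to.
    class VirtualSizer {
     public:
      virtual ~VirtualSizer() = default;
      virtual size_t FeatureValueSize(int mf_dim) const = 0;
    };

    // A concrete accessor: knows its own layout math.
    struct DemoAccessor {
      size_t Size(int mf_dim) const { return (9 + mf_dim) * sizeof(float); }
    };

    // The wrapper: one template instantiation per accessor type, forwarding
    // every virtual call to the embedded accessor.
    template <typename Accessor>
    class SizerWrapper : public VirtualSizer {
     public:
      size_t FeatureValueSize(int mf_dim) const override {
        return accessor_.Size(mf_dim);
      }

     private:
      Accessor accessor_;
    };

    int main() {
      std::unique_ptr<VirtualSizer> sizer(new SizerWrapper<DemoAccessor>());
      std::cout << sizer->FeatureValueSize(8) << "\n";  // 68 with 4-byte floats
      return 0;
    }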
paddle::platform::Place& place, const uint64_t* total_keys, diff --git a/paddle/fluid/framework/fleet/heter_ps/gpu_graph_node.h b/paddle/fluid/framework/fleet/heter_ps/gpu_graph_node.h index 08a87b6a84688..e67698110b3a3 100644 --- a/paddle/fluid/framework/fleet/heter_ps/gpu_graph_node.h +++ b/paddle/fluid/framework/fleet/heter_ps/gpu_graph_node.h @@ -145,7 +145,7 @@ struct NeighborSampleQuery { int gpu_id, int table_idx, uint64_t src_nodes, int sample_size, int len) { this->table_idx = table_idx; this->gpu_id = gpu_id; - this->src_nodes = (uint64_t *)src_nodes; + this->src_nodes = reinterpret_cast(src_nodes); this->sample_size = sample_size; this->len = len; } @@ -166,10 +166,12 @@ struct NeighborSampleQuery { } }; struct NeighborSampleResult { + // Used in deepwalk. uint64_t *val; uint64_t *actual_val; int *actual_sample_size, sample_size, key_size; int total_sample_size; + cudaStream_t stream = 0; std::shared_ptr val_mem, actual_sample_size_mem; std::shared_ptr actual_val_mem; uint64_t *get_val() { return val; } @@ -179,18 +181,31 @@ struct NeighborSampleResult { int get_key_size() { return key_size; } void set_total_sample_size(int s) { total_sample_size = s; } int get_len() { return total_sample_size; } + void set_stream(cudaStream_t stream_t) { stream = stream_t; } void initialize(int _sample_size, int _key_size, int dev_id) { sample_size = _sample_size; key_size = _key_size; platform::CUDADeviceGuard guard(dev_id); platform::CUDAPlace place = platform::CUDAPlace(dev_id); - val_mem = - memory::AllocShared(place, _sample_size * _key_size * sizeof(uint64_t)); - val = (uint64_t *)val_mem->ptr(); - actual_sample_size_mem = - memory::AllocShared(place, _key_size * sizeof(int)); - actual_sample_size = (int *)actual_sample_size_mem->ptr(); + if (stream != 0) { + val_mem = memory::AllocShared( + place, + _sample_size * _key_size * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + actual_sample_size_mem = memory::AllocShared( + place, + _key_size * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + } else { + val_mem = memory::AllocShared( + place, _sample_size * _key_size * sizeof(uint64_t)); + actual_sample_size_mem = + memory::AllocShared(place, _key_size * sizeof(int)); + } + val = reinterpret_cast(val_mem->ptr()); + actual_sample_size = reinterpret_cast(actual_sample_size_mem->ptr()); } + void display() { VLOG(0) << "in node sample result display ------------------"; int64_t *res = new int64_t[sample_size * key_size]; @@ -232,6 +247,22 @@ struct NeighborSampleResult { delete[] ac_size; VLOG(0) << " ------------------"; } + void display2() { + VLOG(0) << "in node sample result display -----"; + uint64_t *res = new uint64_t[total_sample_size]; + cudaMemcpy(res, + actual_val, + total_sample_size * sizeof(uint64_t), + cudaMemcpyDeviceToHost); + std::string sample_str; + for (int i = 0; i < total_sample_size; i++) { + if (sample_str.size() > 0) sample_str += ";"; + sample_str += std::to_string(res[i]); + } + VLOG(0) << "sample result: " << sample_str; + delete[] res; + } + std::vector get_sampled_graph(NeighborSampleQuery q) { std::vector graph; int64_t *sample_keys = new int64_t[q.len]; @@ -275,10 +306,46 @@ struct NeighborSampleResult { delete[] sample_keys; return graph; } - NeighborSampleResult(){}; + NeighborSampleResult() {} ~NeighborSampleResult() {} }; +struct NeighborSampleResultV2 { + // Used in graphsage. 
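NeighborSampleResult (and the NeighborSampleResultV2 introduced just below) accept an optional stream and, when one is set, allocate through the stream-aware allocator. A rough CUDA analogue of the allocate-on-the-caller's-stream-if-provided choice, using cudaMallocAsync (CUDA 11.2+) instead of Paddle's memory::AllocShared:

    #include <cstdint>

    // Allocate sample buffers on the caller's stream when one was provided,
    // otherwise fall back to the plain synchronous allocator.
    struct SampleBuffers {
      uint64_t* val = nullptr;
      int* actual_size = nullptr;
      cudaStream_t stream = 0;

      void set_stream(cudaStream_t s) { stream = s; }

      void initialize(int sample_size, int key_size) {
        size_t val_bytes =
            static_cast<size_t>(sample_size) * key_size * sizeof(uint64_t);
        size_t size_bytes = static_cast<size_t>(key_size) * sizeof(int);
        if (stream != 0) {
          cudaMallocAsync(reinterpret_cast<void**>(&val), val_bytes, stream);
          cudaMallocAsync(reinterpret_cast<void**>(&actual_size), size_bytes, stream);
        } else {
          cudaMalloc(reinterpret_cast<void**>(&val), val_bytes);
          cudaMalloc(reinterpret_cast<void**>(&actual_size), size_bytes);
        }
      }
    };

    int main() {
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      SampleBuffers buf;
      buf.set_stream(stream);
      buf.initialize(/*sample_size=*/16, /*key_size=*/1024);
      cudaStreamSynchronize(stream);
      cudaFreeAsync(buf.val, stream);
      cudaFreeAsync(buf.actual_size, stream);
      cudaStreamSynchronize(stream);
      cudaStreamDestroy(stream);
      return 0;
    }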
+ uint64_t *val; + int *actual_sample_size; + std::shared_ptr val_mem, actual_sample_size_mem; + cudaStream_t stream = 0; + + void set_stream(cudaStream_t stream_t) { stream = stream_t; } + void initialize(int _sample_size, + int _key_size, + int _edge_to_id_len, + int dev_id) { + platform::CUDADeviceGuard guard(dev_id); + platform::CUDAPlace place = platform::CUDAPlace(dev_id); + if (stream != 0) { + val_mem = memory::AllocShared( + place, + _sample_size * _key_size * _edge_to_id_len * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + actual_sample_size_mem = memory::AllocShared( + place, + _key_size * _edge_to_id_len * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + } else { + val_mem = memory::AllocShared( + place, _sample_size * _key_size * _edge_to_id_len * sizeof(uint64_t)); + actual_sample_size_mem = + memory::AllocShared(place, _key_size * _edge_to_id_len * sizeof(int)); + } + val = reinterpret_cast(val_mem->ptr()); + actual_sample_size = reinterpret_cast(actual_sample_size_mem->ptr()); + } + NeighborSampleResultV2() {} + ~NeighborSampleResultV2() {} +}; + struct NodeQueryResult { uint64_t *val; int actual_sample_size; @@ -289,7 +356,7 @@ struct NodeQueryResult { platform::CUDADeviceGuard guard(dev_id); platform::CUDAPlace place = platform::CUDAPlace(dev_id); val_mem = memory::AllocShared(place, query_size * sizeof(uint64_t)); - val = (uint64_t *)val_mem->ptr(); + val = reinterpret_cast(val_mem->ptr()); actual_sample_size = 0; } void display() { @@ -313,7 +380,7 @@ struct NodeQueryResult { NodeQueryResult() { val = NULL; actual_sample_size = 0; - }; + } ~NodeQueryResult() {} }; // end of struct NodeQueryResult @@ -329,7 +396,7 @@ struct GpuPsCommGraphFea { uint8_t *slot_id_list; // locate on both side GpuPsFeaInfo *fea_info_list; // only locate on host side, the list of fea_info - uint64_t feature_size, node_size; + uint64_t feature_size, node_size, feature_capacity; // the size of feature array and graph_node_list array GpuPsCommGraphFea() : node_list(NULL), @@ -337,7 +404,8 @@ struct GpuPsCommGraphFea { slot_id_list(NULL), fea_info_list(NULL), feature_size(0), - node_size(0) {} + node_size(0), + feature_capacity(0) {} GpuPsCommGraphFea(uint64_t *node_list_, uint64_t *feature_list_, uint8_t *slot_id_list_, diff --git a/paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h b/paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h index 39734cae33fca..4d53ff8a7391a 100644 --- a/paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h +++ b/paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h @@ -17,11 +17,54 @@ #include #include #include +#include +#include +#include +#include #include "paddle/fluid/platform/enforce.h" +DECLARE_bool(gpugraph_debug_gpu_memory); + namespace paddle { namespace framework { +/** + * @brief wrapper of the std::default_random_engine each construction will have + * different seeds. + */ +struct random_engine_wrapper_t { + std::default_random_engine engine; +#if !defined(_WIN32) + random_engine_wrapper_t() { + timespec tp; + clock_gettime(CLOCK_REALTIME, &tp); + static std::atomic x( // NOLINT + static_cast(1)); // NOLINT + std::seed_seq sseq = { + x++, x++, x++, static_cast(tp.tv_sec * 1e9 + tp.tv_nsec)}; + engine.seed(sseq); + } +#endif +}; + +/** + * @brief Get a n-size vector, but its element has unique shuffled int + * value (from 0 to n-1). + * @param n vector size + * @return the shuffled vector. 
+ */ +inline std::vector shuffle_int_vector(int n) { + random_engine_wrapper_t random_engine_wrapper; + std::vector ret(n); + int i = 0; + + for (auto& e : ret) { + e = i++; + } + std::shuffle(ret.begin(), ret.end(), random_engine_wrapper.engine); + return std::move(ret); +} + #define CUDA_CHECK(cmd) \ do { \ cudaError_t e = cmd; \ @@ -39,6 +82,9 @@ class CudaDeviceRestorer { }; inline void debug_gpu_memory_info(int gpu_id, const char* desc) { + if (!FLAGS_gpugraph_debug_gpu_memory) { + return; + } CudaDeviceRestorer r; size_t avail{0}; @@ -52,11 +98,15 @@ inline void debug_gpu_memory_info(int gpu_id, const char* desc) { VLOG(0) << "updatex gpu memory on device " << gpu_id << ", " << "avail=" << avail / 1024.0 / 1024.0 / 1024.0 << "g, " << "total=" << total / 1024.0 / 1024.0 / 1024.0 << "g, " - << "use_rate=" << (total - avail) / double(total) << "%, " + << "use_rate=" << (total - avail) / static_cast(total) + << "%, " << "desc=" << desc; } inline void debug_gpu_memory_info(const char* desc) { + if (!FLAGS_gpugraph_debug_gpu_memory) { + return; + } CudaDeviceRestorer r; int device_num = 0; @@ -78,7 +128,8 @@ inline void debug_gpu_memory_info(const char* desc) { VLOG(0) << "update gpu memory on device " << i << ", " << "avail=" << avail / 1024.0 / 1024.0 / 1024.0 << "g, " << "total=" << total / 1024.0 / 1024.0 / 1024.0 << "g, " - << "use_rate=" << (total - avail) / double(total) << "%, " + << "use_rate=" << (total - avail) / static_cast(total) + << "%, " << "desc=" << desc; } } diff --git a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table.h b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table.h index aa202fe020fe9..50895c2645853 100644 --- a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table.h +++ b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table.h @@ -17,9 +17,9 @@ #include -#include "heter_comm.h" #include "paddle/fluid/distributed/ps/table/common_graph_table.h" #include "paddle/fluid/framework/fleet/heter_ps/gpu_graph_node.h" +#include "paddle/fluid/framework/fleet/heter_ps/heter_comm.h" #include "paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.h" #include "paddle/fluid/platform/enforce.h" #ifdef PADDLE_WITH_HETERPS @@ -38,86 +38,47 @@ class GpuPsGraphTable type_id * graph_table_num_ + idx; } GpuPsGraphTable(std::shared_ptr resource, - int topo_aware, int graph_table_num) : HeterComm( - 1, resource) { + 0, resource) { load_factor_ = FLAGS_gpugraph_hbm_table_load_factor; - VLOG(0) << "load_factor = " << load_factor_; + VLOG(0) << "load_factor = " << load_factor_ + << ", graph_table_num = " << graph_table_num; rw_lock.reset(new pthread_rwlock_t()); this->graph_table_num_ = graph_table_num; this->feature_table_num_ = 1; gpu_num = resource_->total_device(); memset(global_device_map, -1, sizeof(global_device_map)); - for (auto &table : tables_) { - delete table; - table = NULL; - } - int feature_table_num = 1; + tables_ = std::vector( - gpu_num * (graph_table_num + feature_table_num), NULL); + gpu_num * (graph_table_num_ + feature_table_num_), NULL); for (int i = 0; i < gpu_num; i++) { global_device_map[resource_->dev_id(i)] = i; - for (int j = 0; j < graph_table_num; j++) { + for (int j = 0; j < graph_table_num_; j++) { gpu_graph_list_.push_back(GpuPsCommGraph()); } - for (int j = 0; j < feature_table_num; j++) { + for (int j = 0; j < feature_table_num_; j++) { gpu_graph_fea_list_.push_back(GpuPsCommGraphFea()); } } cpu_table_status = -1; - if (topo_aware) { - int total_gpu = resource_->total_device(); - std::map device_map; - for (int i = 0; i < 
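The rebuilt GpuPsGraphTable constructor at the end of this hunk sizes tables_ as a single flat vector of gpu_num * (graph_table_num_ + feature_table_num_) entries and resolves a (gpu, table type, index) triple through get_table_offset. A sketch of that kind of flattened per-GPU indexing; the offset formula below is illustrative, not copied from the Paddle code:

    #include <cassert>
    #include <vector>

    // One flat array holding, per GPU, first the edge tables and then the
    // feature tables; index math replaces a vector-of-vectors.
    struct TableIndex {
      int graph_table_num;    // edge tables per GPU
      int feature_table_num;  // feature tables per GPU

      int tables_per_gpu() const { return graph_table_num + feature_table_num; }

      int edge_offset(int gpu_id, int idx) const {
        return gpu_id * tables_per_gpu() + idx;
      }
      int feature_offset(int gpu_id, int idx) const {
        return gpu_id * tables_per_gpu() + graph_table_num + idx;
      }
    };

    int main() {
      TableIndex index{/*graph_table_num=*/3, /*feature_table_num=*/1};
      std::vector<void*> tables(8 * index.tables_per_gpu(), nullptr);
      assert(index.edge_offset(0, 2) == 2);
      assert(index.feature_offset(1, 0) == 7);
      assert(index.feature_offset(7, 0) < static_cast<int>(tables.size()));
      return 0;
    }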
total_gpu; i++) { - device_map[resource_->dev_id(i)] = i; - VLOG(1) << " device " << resource_->dev_id(i) << " is stored on " << i; - } - path_.clear(); - path_.resize(total_gpu); - VLOG(1) << "topo aware overide"; - for (int i = 0; i < total_gpu; ++i) { - path_[i].resize(total_gpu); - for (int j = 0; j < total_gpu; ++j) { - auto &nodes = path_[i][j].nodes_; - nodes.clear(); - int from = resource_->dev_id(i); - int to = resource_->dev_id(j); - int transfer_id = i; - if (need_transfer(from, to) && - (device_map.find((from + 4) % 8) != device_map.end() || - device_map.find((to + 4) % 8) != device_map.end())) { - transfer_id = (device_map.find((from + 4) % 8) != device_map.end()) - ? ((from + 4) % 8) - : ((to + 4) % 8); - transfer_id = device_map[transfer_id]; - nodes.push_back(Node()); - Node &node = nodes.back(); - node.in_stream = resource_->comm_stream(i, transfer_id); - node.out_stream = resource_->comm_stream(transfer_id, i); - node.key_storage = NULL; - node.val_storage = NULL; - node.sync = 0; - node.dev_num = transfer_id; - } - nodes.push_back(Node()); - Node &node = nodes.back(); - node.in_stream = resource_->comm_stream(i, transfer_id); - node.out_stream = resource_->comm_stream(transfer_id, i); - node.key_storage = NULL; - node.val_storage = NULL; - node.sync = 0; - node.dev_num = j; - } - } + device_mutex_.resize(gpu_num); + for (int i = 0; i < gpu_num; i++) { + device_mutex_[i] = new std::mutex(); + } + } + ~GpuPsGraphTable() { + for (size_t i = 0; i < device_mutex_.size(); ++i) { + delete device_mutex_[i]; } + device_mutex_.clear(); } - ~GpuPsGraphTable() {} void build_graph_on_single_gpu(const GpuPsCommGraph &g, int gpu_id, int idx); void build_graph_fea_on_single_gpu(const GpuPsCommGraphFea &g, int gpu_id); void clear_graph_info(int gpu_id, int index); void clear_graph_info(int index); + void reset_feature_info(int gpu_id, size_t capacity, size_t feature_size); void clear_feature_info(int gpu_id, int index); void clear_feature_info(int index); void build_graph_from_cpu(const std::vector &cpu_node_list, @@ -126,7 +87,8 @@ class GpuPsGraphTable const std::vector &cpu_node_list, int idx); NodeQueryResult graph_node_sample(int gpu_id, int sample_size); NeighborSampleResult graph_neighbor_sample_v3(NeighborSampleQuery q, - bool cpu_switch); + bool cpu_switch, + bool compress); NeighborSampleResult graph_neighbor_sample(int gpu_id, uint64_t *key, int sample_size, @@ -136,10 +98,32 @@ class GpuPsGraphTable uint64_t *key, int sample_size, int len, - bool cpu_query_switch); - - int get_feature_of_nodes( - int gpu_id, uint64_t *d_walk, uint64_t *d_offset, int size, int slot_num); + bool cpu_query_switch, + bool compress); + NeighborSampleResultV2 graph_neighbor_sample_all_edge_type( + int gpu_id, + int edge_type_len, + uint64_t *key, + int sample_size, + int len, + std::vector> edge_type_graphs); + std::vector> get_edge_type_graph( + int gpu_id, int edge_type_len); + int get_feature_of_nodes(int gpu_id, + uint64_t *d_walk, + uint64_t *d_offset, + int size, + int slot_num, + int *d_slot_feature_num_map, + int fea_num_per_node); + int get_feature_info_of_nodes( + int gpu_id, + uint64_t *d_nodes, + int node_num, + uint32_t *size_list, + uint32_t *size_list_prefix_sum, + std::shared_ptr &feature_list, // NOLINT + std::shared_ptr &slot_list); // NOLINT NodeQueryResult query_node_list(int gpu_id, int idx, @@ -153,7 +137,29 @@ class GpuPsGraphTable int *h_right, uint64_t *src_sample_res, int *actual_sample_size); - int init_cpu_table(const paddle::distributed::GraphParameter &graph); + void 
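The constructor and destructor above manage one heap-allocated std::mutex per device, since std::mutex is neither copyable nor movable and cannot live directly in a resized vector. For comparison only, the same per-device locking with unique_ptr handling the cleanup (an alternative style, not what the patch does):

    #include <memory>
    #include <mutex>
    #include <vector>

    int main() {
      const int gpu_num = 8;
      // One mutex per device; unique_ptr frees them without a hand-written dtor.
      std::vector<std::unique_ptr<std::mutex>> device_mutex;
      device_mutex.reserve(gpu_num);
      for (int i = 0; i < gpu_num; ++i) {
        device_mutex.emplace_back(new std::mutex());
      }
      {
        std::lock_guard<std::mutex> guard(*device_mutex[3]);  // serialize work on GPU 3
      }
      return 0;
    }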
move_result_to_source_gpu(int start_index, + int gpu_num, + int *h_left, + int *h_right, + int *fea_left, + uint32_t *fea_num_list, + uint32_t *actual_feature_size, + uint64_t *feature_list, + uint8_t *slot_list); + void move_result_to_source_gpu_all_edge_type(int gpu_id, + int gpu_num, + int sample_size, + int *h_left, + int *h_right, + uint64_t *src_sample_res, + int *actual_sample_size, + int edge_type_len, + int len); + int init_cpu_table(const paddle::distributed::GraphParameter &graph, + int gpu_num = 8); + gpuStream_t get_local_stream(int gpu_id) { + return resource_->local_stream(gpu_id, 0); + } int gpu_num; int graph_table_num_, feature_table_num_; @@ -165,10 +171,11 @@ class GpuPsGraphTable std::shared_ptr cpu_graph_table_; std::shared_ptr rw_lock; mutable std::mutex mutex_; + std::vector device_mutex_; std::condition_variable cv_; int cpu_table_status; }; -} // namespace framework + +}; // namespace framework }; // namespace paddle -//#include "paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table_inl.h" #endif diff --git a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table_inl.cu b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table_inl.cu index 3693277a75d39..f8684af98f203 100644 --- a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table_inl.cu +++ b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table_inl.cu @@ -15,8 +15,8 @@ #include #include #include - #include +#include "cub/cub.cuh" #pragma once #ifdef PADDLE_WITH_HETERPS #include "paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h" @@ -78,51 +78,96 @@ __global__ void copy_buffer_ac_to_final_place(uint64_t* gpu_buffer, } } +__global__ void get_features_size(GpuPsFeaInfo* fea_info_array, + uint32_t* feature_size, + int n) { + int idx = blockIdx.x * blockDim.y + threadIdx.y; + if (idx < n) { + feature_size[idx] = fea_info_array[idx].feature_size; + } +} + +__global__ void get_features_kernel(GpuPsCommGraphFea graph, + GpuPsFeaInfo* fea_info_array, + uint32_t* fea_size_prefix_sum, + uint64_t* feature_array, + uint8_t* slot_array, + int n) { + int idx = blockIdx.x * blockDim.y + threadIdx.y; + if (idx < n) { + uint32_t feature_size = fea_info_array[idx].feature_size; + if (feature_size == 0) { + return; + } + uint32_t src_offset = fea_info_array[idx].feature_offset; + uint32_t dst_offset = fea_size_prefix_sum[idx]; + for (uint32_t j = 0; j < feature_size; ++j) { + feature_array[dst_offset + j] = graph.feature_list[src_offset + j]; + slot_array[dst_offset + j] = graph.slot_id_list[src_offset + j]; + } + } +} + __global__ void get_features_kernel(GpuPsCommGraphFea graph, GpuPsFeaInfo* fea_info_array, int* actual_size, uint64_t* feature, + int* slot_feature_num_map, int slot_num, - int n) { + int n, + int fea_num_per_node) { int idx = blockIdx.x * blockDim.y + threadIdx.y; if (idx < n) { int feature_size = fea_info_array[idx].feature_size; - int offset = idx * slot_num; + int src_offset = fea_info_array[idx].feature_offset; + int dst_offset = idx * fea_num_per_node; + uint64_t* dst_feature = &feature[dst_offset]; if (feature_size == 0) { - for (int k = 0; k < slot_num; ++k) { - feature[offset + k] = 0; + for (int k = 0; k < fea_num_per_node; ++k) { + dst_feature[k] = 0; } - actual_size[idx] = slot_num; + actual_size[idx] = fea_num_per_node; return; } - uint64_t* feature_start = - &(graph.feature_list[fea_info_array[idx].feature_offset]); - uint8_t* slot_id_start = - &(graph.slot_id_list[fea_info_array[idx].feature_offset]); - int m = 0; - for (int k = 0; k < slot_num; ++k) { - if (m >= 
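get_features_size plus the prefix-sum variant of get_features_kernel above implement a two-phase gather: write each node's feature count, take an exclusive prefix sum to get start offsets, then copy every node's features and slot ids into one packed output. A host-side restatement of the pattern:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      // Per-node variable-length features (what fea_info/feature_list describe).
      std::vector<std::vector<uint64_t>> node_features = {
          {11, 12}, {}, {31}, {41, 42, 43}};

      // Phase 1: per-node sizes.
      std::vector<uint32_t> sizes(node_features.size());
      for (size_t i = 0; i < node_features.size(); ++i) {
        sizes[i] = static_cast<uint32_t>(node_features[i].size());
      }

      // Exclusive prefix sum: start offset of every node in the packed output.
      std::vector<uint32_t> offsets(sizes.size(), 0);
      std::partial_sum(sizes.begin(), sizes.end() - 1, offsets.begin() + 1);

      // Phase 2: packed copy, each node writing at its own offset.
      std::vector<uint64_t> packed(offsets.back() + sizes.back());
      for (size_t i = 0; i < node_features.size(); ++i) {
        std::copy(node_features[i].begin(), node_features[i].end(),
                  packed.begin() + offsets[i]);
      }

      for (uint64_t f : packed) std::cout << f << " ";  // 11 12 31 41 42 43
      std::cout << "\n";
      return 0;
    }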
fea_info_array[idx].feature_size || k < slot_id_start[m]) { - feature[offset + k] = 0; - } else if (k == slot_id_start[m]) { - feature[offset + k] = feature_start[m]; - ++m; + uint64_t* feature_start = &(graph.feature_list[src_offset]); + uint8_t* slot_id_start = &(graph.slot_id_list[src_offset]); + for (int slot_id = 0, dst_fea_idx = 0, src_fea_idx = 0; slot_id < slot_num; + slot_id++) { + int feature_num = slot_feature_num_map[slot_id]; + if (src_fea_idx >= feature_size || slot_id < slot_id_start[src_fea_idx]) { + for (int j = 0; j < feature_num; ++j, ++dst_fea_idx) { + dst_feature[dst_fea_idx] = 0; + } + } else if (slot_id == slot_id_start[src_fea_idx]) { + for (int j = 0; j < feature_num; ++j, ++dst_fea_idx) { + if (slot_id == slot_id_start[src_fea_idx]) { + dst_feature[dst_fea_idx] = feature_start[src_fea_idx++]; + } else { + dst_feature[dst_fea_idx] = 0; + } + } } else { assert(0); } } - actual_size[idx] = slot_num; + actual_size[idx] = fea_num_per_node; } } template -__global__ void neighbor_sample_kernel(GpuPsCommGraph graph, - GpuPsNodeInfo* node_info_list, - int* actual_size, - uint64_t* res, - int sample_len, - int n, - int default_value) { +__global__ void neighbor_sample_kernel_walking(GpuPsCommGraph graph, + GpuPsNodeInfo* node_info_list, + int* actual_size, + uint64_t* res, + int sample_len, + int n, + int default_value) { + // graph: The corresponding edge table. + // node_info_list: The input node query, duplicate nodes allowed. + // actual_size: The actual sample size of the input nodes. + // res: The output sample neighbors of the input nodes. + // sample_len: The fix sample size. assert(blockDim.x == WARP_SIZE); assert(blockDim.y == BLOCK_WARPS); @@ -136,7 +181,7 @@ __global__ void neighbor_sample_kernel(GpuPsCommGraph graph, i += BLOCK_WARPS; continue; } - int neighbor_len = (int)node_info_list[i].neighbor_size; + int neighbor_len = node_info_list[i].neighbor_size; uint32_t data_offset = node_info_list[i].neighbor_offset; int offset = i * sample_len; uint64_t* data = graph.neighbor_list; @@ -168,10 +213,76 @@ __global__ void neighbor_sample_kernel(GpuPsCommGraph graph, } } +__global__ void neighbor_sample_kernel_all_edge_type( + GpuPsCommGraph* graphs, + GpuPsNodeInfo* node_info_base, + int* actual_size_base, + uint64_t* sample_array_base, + int sample_len, + int n, // edge_type * shard_len + int default_value, + int shard_len) { + // graph: All edge tables. + // node_info_list: The input node query, must be unique, otherwise the + // randomness gets worse. actual_size_base: The begin position of actual + // sample size of the input nodes. sample_array_base: The begin position of + // sample neighbors of the input nodes. sample_len: The fix sample size. 
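The rewritten per-slot loop above expands a node's sparse (slot_id, feature) list into a fixed-width row in which each slot owns slot_feature_num_map[slot] positions, zero-filling absent slots, for a total of fea_num_per_node entries. A simplified host-side sketch of that expansion (features beyond a slot's width are skipped here rather than asserted on):

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Expand sparse (slot_id, feature) pairs, sorted by slot, into a dense row
    // where slot s owns slot_width[s] consecutive positions (zero padded).
    std::vector<uint64_t> ExpandRow(const std::vector<uint8_t>& slot_ids,
                                    const std::vector<uint64_t>& features,
                                    const std::vector<int>& slot_width) {
      std::vector<uint64_t> row;
      size_t src = 0;
      for (size_t slot = 0; slot < slot_width.size(); ++slot) {
        for (int j = 0; j < slot_width[slot]; ++j) {
          if (src < slot_ids.size() && slot_ids[src] == slot) {
            row.push_back(features[src++]);  // next feature of this slot
          } else {
            row.push_back(0);                // slot absent or already exhausted
          }
        }
        // Features beyond this slot's width are skipped.
        while (src < slot_ids.size() && slot_ids[src] == slot) ++src;
      }
      return row;
    }

    int main() {
      // Node has slot 0 -> {101}, slot 2 -> {301, 302}; slot widths are 1, 2, 2.
      std::vector<uint64_t> row = ExpandRow({0, 2, 2}, {101, 301, 302}, {1, 2, 2});
      for (uint64_t v : row) std::cout << v << " ";  // 101 0 0 301 302
      std::cout << "\n";
      return 0;
    }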
+ curandState rng; + curand_init(blockIdx.x, threadIdx.x, 0, &rng); + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + + if (i < n) { + int edge_idx = i / shard_len, node_i = i % shard_len; + + GpuPsNodeInfo* node_info_list = node_info_base + edge_idx * shard_len; + int* actual_size_array = actual_size_base + edge_idx * shard_len; + + if (node_info_list[node_i].neighbor_size == 0) { + actual_size_array[node_i] = default_value; + } else { + uint64_t* sample_array = + sample_array_base + edge_idx * shard_len * sample_len; + int neighbor_len = node_info_list[node_i].neighbor_size; + uint32_t data_offset = node_info_list[node_i].neighbor_offset; + int offset = node_i * sample_len; + uint64_t* data = graphs[edge_idx].neighbor_list; + uint64_t tmp; + int split, begin; + if (neighbor_len <= sample_len) { + actual_size_array[node_i] = neighbor_len; + for (int j = 0; j < neighbor_len; j++) { + sample_array[offset + j] = data[data_offset + j]; + } + } else { + actual_size_array[node_i] = sample_len; + if (neighbor_len < 2 * sample_len) { + split = sample_len; + begin = 0; + } else { + split = neighbor_len - sample_len; + begin = neighbor_len - sample_len; + } + for (int idx = split; idx <= neighbor_len - 1; idx++) { + const int num = curand(&rng) % (idx + 1); + data[data_offset + idx] = atomicExch( + reinterpret_cast(data + // NOLINT + data_offset + num), + static_cast( // NOLINT + data[data_offset + idx])); + } + for (int idx = 0; idx < sample_len; idx++) { + sample_array[offset + idx] = data[data_offset + begin + idx]; + } + } + } + } +} + int GpuPsGraphTable::init_cpu_table( - const paddle::distributed::GraphParameter& graph) { + const paddle::distributed::GraphParameter& graph, int gpu_num) { cpu_graph_table_.reset(new paddle::distributed::GraphTable); cpu_table_status = cpu_graph_table_->Initialize(graph); + cpu_graph_table_->init_worker_poll(gpu_num); // if (cpu_table_status != 0) return cpu_table_status; // std::function&)> callback = // [this](std::vector& res) { @@ -210,7 +321,7 @@ void GpuPsGraphTable::display_sample_res(void* key, void* val, int len, int sample_len) { - char key_buffer[len * sizeof(uint64_t)]; + char key_buffer[len * sizeof(uint64_t)]; // NOLINT char val_buffer[sample_len * sizeof(int64_t) * len + (len + len % 2) * sizeof(int) + len * sizeof(uint64_t)]; cudaMemcpy(key_buffer, key, sizeof(uint64_t) * len, cudaMemcpyDeviceToHost); @@ -219,13 +330,15 @@ void GpuPsGraphTable::display_sample_res(void* key, sample_len * sizeof(int64_t) * len + (len + len % 2) * sizeof(int) + len * sizeof(uint64_t), cudaMemcpyDeviceToHost); - uint64_t* sample_val = - (uint64_t*)(val_buffer + (len + len % 2) * sizeof(int) + - len * sizeof(int64_t)); + uint64_t* sample_val = reinterpret_cast( + val_buffer + (len + len % 2) * sizeof(int) + len * sizeof(int64_t)); for (int i = 0; i < len; i++) { - printf("key %llu\n", *(int64_t*)(key_buffer + i * sizeof(uint64_t))); - printf("index %llu\n", *(int64_t*)(val_buffer + i * sizeof(uint64_t))); - int ac_size = *(int*)(val_buffer + i * sizeof(int) + len * sizeof(int64_t)); + printf("key %llu\n", + *reinterpret_cast(key_buffer + i * sizeof(uint64_t))); + printf("index %llu\n", + *reinterpret_cast(val_buffer + i * sizeof(uint64_t))); + int ac_size = *reinterpret_cast(val_buffer + i * sizeof(int) + + len * sizeof(int64_t)); printf("sampled %d neigbhors\n", ac_size); for (int j = 0; j < ac_size; j++) { printf("%llu ", sample_val[i * sample_len + j]); @@ -234,6 +347,67 @@ void GpuPsGraphTable::display_sample_res(void* key, } } +void 
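neighbor_sample_kernel_all_edge_type above samples sample_len of neighbor_len neighbors in place by running a partial Fisher-Yates pass over the tail of the adjacency segment and then copying a contiguous window, with curand supplying per-thread randomness. A host-side restatement of the same split/begin selection, using std::mt19937 in place of curand:

    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <vector>

    // Pick sample_len items without replacement by shuffling only the tail of
    // the array, mirroring the kernel's split/begin bookkeeping.
    std::vector<uint64_t> SampleNeighbors(std::vector<uint64_t> data,
                                          int sample_len, std::mt19937* rng) {
      int n = static_cast<int>(data.size());
      if (n <= sample_len) return data;  // keep everything
      int split, begin;
      if (n < 2 * sample_len) {
        split = sample_len;
        begin = 0;
      } else {
        split = n - sample_len;
        begin = n - sample_len;
      }
      for (int idx = split; idx <= n - 1; ++idx) {
        int num = std::uniform_int_distribution<int>(0, idx)(*rng);
        std::swap(data[idx], data[num]);
      }
      return std::vector<uint64_t>(data.begin() + begin,
                                   data.begin() + begin + sample_len);
    }

    int main() {
      std::mt19937 rng(7);
      std::vector<uint64_t> neighbors = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
      for (uint64_t v : SampleNeighbors(neighbors, 3, &rng)) {
        std::cout << v << " ";  // prints 3 distinct neighbors
      }
      std::cout << "\n";
      return 0;
    }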
GpuPsGraphTable::move_result_to_source_gpu(int start_index, + int gpu_num, + int* h_left, + int* h_right, + int* fea_left, + uint32_t* fea_num_list, + uint32_t* actual_feature_size, + uint64_t* feature_list, + uint8_t* slot_list) { + int shard_len[gpu_num]; // NOLINT + for (int i = 0; i < gpu_num; i++) { + if (h_left[i] == -1 || h_right[i] == -1) { + continue; + } + shard_len[i] = h_right[i] - h_left[i] + 1; + int cur_step = path_[start_index][i].nodes_.size() - 1; + for (int j = cur_step; j > 0; j--) { + CUDA_CHECK( + cudaMemcpyAsync(path_[start_index][i].nodes_[j - 1].val_storage, + path_[start_index][i].nodes_[j].val_storage, + path_[start_index][i].nodes_[j - 1].val_bytes_len, + cudaMemcpyDefault, + path_[start_index][i].nodes_[j - 1].out_stream)); + } + auto& node = path_[start_index][i].nodes_.front(); + + if (fea_num_list[i] > 0) { + CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(feature_list + fea_left[i]), + node.val_storage + + sizeof(uint32_t) * (shard_len[i] + shard_len[i] % 2), + sizeof(uint64_t) * fea_num_list[i], + cudaMemcpyDefault, + node.out_stream)); + CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(slot_list + fea_left[i]), + node.val_storage + + sizeof(uint32_t) * (shard_len[i] + shard_len[i] % 2) + + sizeof(uint64_t) * fea_num_list[i], + sizeof(uint8_t) * fea_num_list[i], + cudaMemcpyDefault, + node.out_stream)); + } + if (shard_len[i] > 0) { + CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(actual_feature_size + h_left[i]), + node.val_storage, + sizeof(uint32_t) * shard_len[i], + cudaMemcpyDefault, + node.out_stream)); + } + } + for (int i = 0; i < gpu_num; ++i) { + if (h_left[i] == -1 || h_right[i] == -1) { + continue; + } + auto& node = path_[start_index][i].nodes_.front(); + CUDA_CHECK(cudaStreamSynchronize(node.out_stream)); + } +} + void GpuPsGraphTable::move_result_to_source_gpu(int start_index, int gpu_num, int sample_size, @@ -241,13 +415,13 @@ void GpuPsGraphTable::move_result_to_source_gpu(int start_index, int* h_right, uint64_t* src_sample_res, int* actual_sample_size) { - int shard_len[gpu_num]; + int shard_len[gpu_num]; // NOLINT for (int i = 0; i < gpu_num; i++) { if (h_left[i] == -1 || h_right[i] == -1) { continue; } shard_len[i] = h_right[i] - h_left[i] + 1; - int cur_step = (int)path_[start_index][i].nodes_.size() - 1; + int cur_step = path_[start_index][i].nodes_.size() - 1; for (int j = cur_step; j > 0; j--) { CUDA_CHECK( cudaMemcpyAsync(path_[start_index][i].nodes_[j - 1].val_storage, @@ -281,6 +455,99 @@ void GpuPsGraphTable::move_result_to_source_gpu(int start_index, } } +void GpuPsGraphTable::move_result_to_source_gpu_all_edge_type( + int start_index, + int gpu_num, + int sample_size, + int* h_left, + int* h_right, + uint64_t* src_sample_res, + int* actual_sample_size, + int edge_type_len, + int len) { + int shard_len[gpu_num]; // NOLINT + + for (int i = 0; i < gpu_num; i++) { + if (h_left[i] == -1 || h_right[i] == -1) { + continue; + } + shard_len[i] = h_right[i] - h_left[i] + 1; + int cur_step = path_[start_index][i].nodes_.size() - 1; + for (int j = cur_step; j > 0; j--) { + CUDA_CHECK( + cudaMemcpyAsync(path_[start_index][i].nodes_[j - 1].val_storage, + path_[start_index][i].nodes_[j].val_storage, + path_[start_index][i].nodes_[j - 1].val_bytes_len, + cudaMemcpyDefault, + path_[start_index][i].nodes_[j - 1].out_stream)); + } + } + + for (int i = 0; i < edge_type_len; i++) { + for (int j = 0; j < gpu_num; j++) { + if (h_left[j] == -1 || h_right[j] == -1) { + continue; + } + auto& node = path_[start_index][j].nodes_.front(); + 
CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(src_sample_res + i * len * sample_size + + h_left[j] * sample_size), + node.val_storage + sizeof(int64_t) * shard_len[j] * edge_type_len + + sizeof(int) * (shard_len[j] * edge_type_len + + (shard_len[j] * edge_type_len) % 2) + + sizeof(uint64_t) * i * shard_len[j] * sample_size, + sizeof(uint64_t) * shard_len[j] * sample_size, + cudaMemcpyDefault, + node.out_stream)); + CUDA_CHECK(cudaMemcpyAsync( + reinterpret_cast(actual_sample_size + i * len + h_left[j]), + node.val_storage + sizeof(int64_t) * shard_len[j] * edge_type_len + + sizeof(int) * i * shard_len[j], + sizeof(int) * shard_len[j], + cudaMemcpyDefault, + node.out_stream)); + } + } + + for (int i = 0; i < gpu_num; i++) { + if (h_left[i] == -1 || h_right[i] == -1) { + continue; + } + auto& node = path_[start_index][i].nodes_.front(); + CUDA_CHECK(cudaStreamSynchronize(node.out_stream)); + } +} + +__global__ void fill_size(uint32_t* d_actual_size_list, + uint32_t* d_shard_size_list, + int* idx, + int len) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + d_actual_size_list[idx[i]] = d_shard_size_list[i]; + } +} + +__global__ void fill_feature_and_slot(uint64_t* dst_feature_list, + uint8_t* dst_slot_list, + uint32_t* dst_size_prefix_sum_list, + uint64_t* src_feature_list, + uint8_t* src_slot_list, + uint32_t* src_size_prefix_sum_list, + uint32_t* src_size_list, + int* idx, + int len) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + uint32_t dst_index = dst_size_prefix_sum_list[idx[i]]; + uint32_t src_index = src_size_prefix_sum_list[i]; + for (uint32_t j = 0; j < src_size_list[i]; j++) { + dst_feature_list[dst_index + j] = src_feature_list[src_index + j]; + dst_slot_list[dst_index + j] = src_slot_list[src_index + j]; + } + } +} + /* TODO: how to optimize it to eliminate the for loop @@ -295,8 +562,30 @@ __global__ void fill_dvalues(uint64_t* d_shard_vals, const size_t i = blockIdx.x * blockDim.x + threadIdx.x; if (i < len) { d_actual_sample_size[idx[i]] = d_shard_actual_sample_size[i]; - for (int j = 0; j < sample_size; j++) { - d_vals[idx[i] * sample_size + j] = d_shard_vals[i * sample_size + j]; + size_t offset1 = idx[i] * sample_size; + size_t offset2 = i * sample_size; + for (int j = 0; j < d_shard_actual_sample_size[i]; j++) { + d_vals[offset1 + j] = d_shard_vals[offset2 + j]; + } + } +} + +__global__ void fill_dvalues_with_edge_type(uint64_t* d_shard_vals, + uint64_t* d_vals, + int* d_shard_actual_sample_size, + int* d_actual_sample_size, + int* idx, + int sample_size, + int len, // len * edge_type_len + int mod) { // len + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + int a = i % mod, b = i - i % mod; + d_actual_sample_size[b + idx[a]] = d_shard_actual_sample_size[i]; + size_t offset1 = (b + idx[a]) * sample_size; + size_t offset2 = i * sample_size; + for (int j = 0; j < d_shard_actual_sample_size[i]; j++) { + d_vals[offset1 + j] = d_shard_vals[offset2 + j]; } } } @@ -323,8 +612,10 @@ __global__ void fill_actual_vals(uint64_t* vals, int len) { const size_t i = blockIdx.x * blockDim.x + threadIdx.x; if (i < len) { + int offset1 = cumsum_actual_sample_size[i]; + int offset2 = sample_size * i; for (int j = 0; j < actual_sample_size[i]; j++) { - actual_vals[cumsum_actual_sample_size[i] + j] = vals[sample_size * i + j]; + actual_vals[offset1 + j] = vals[offset2 + j]; } } } @@ -362,6 +653,43 @@ void GpuPsGraphTable::clear_feature_info(int gpu_id) { cudaFree(graph.slot_id_list); graph.slot_id_list = 
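fill_dvalues and fill_dvalues_with_edge_type above scatter the shard-ordered results back to the caller's original key order through the idx permutation, and now copy only the actual_sample_size prefix per key instead of the full sample_size stride. A host-side sketch of that index-mapped scatter:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
      const int sample_size = 3;
      // Results laid out in shard order; idx[i] is the original position of
      // shard entry i.
      std::vector<uint64_t> shard_vals = {11, 12, 0, 21, 0, 0};
      std::vector<int> shard_actual_size = {2, 1};
      std::vector<int> idx = {1, 0};  // shard entry 0 belongs to original key 1

      std::vector<uint64_t> vals(shard_vals.size(), 0);
      std::vector<int> actual_size(shard_actual_size.size(), 0);
      for (size_t i = 0; i < idx.size(); ++i) {
        actual_size[idx[i]] = shard_actual_size[i];
        size_t dst = static_cast<size_t>(idx[i]) * sample_size;
        size_t src = i * sample_size;
        // Only the valid prefix needs to move.
        for (int j = 0; j < shard_actual_size[i]; ++j) {
          vals[dst + j] = shard_vals[src + j];
        }
      }
      for (uint64_t v : vals) std::cout << v << " ";  // 21 0 0 11 12 0
      std::cout << "\n";
      return 0;
    }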
NULL; } + graph.feature_capacity = 0; +} + +void GpuPsGraphTable::reset_feature_info(int gpu_id, + size_t capacity, + size_t feature_size) { + int idx = 0; + if (idx >= feature_table_num_) return; + int offset = get_table_offset(gpu_id, GraphTableType::FEATURE_TABLE, idx); + if (offset < tables_.size()) { + delete tables_[offset]; + tables_[offset] = new Table(capacity); + } + int graph_fea_idx = gpu_id * feature_table_num_ + idx; + if (graph_fea_idx >= gpu_graph_fea_list_.size()) { + return; + } + auto& graph = gpu_graph_fea_list_[graph_fea_idx]; + graph.node_list = NULL; + if (graph.feature_list == NULL) { + CUDA_CHECK( + cudaMalloc(&graph.feature_list, feature_size * sizeof(uint64_t))); + CUDA_CHECK(cudaMalloc(&graph.slot_id_list, feature_size * sizeof(uint8_t))); + graph.feature_capacity = feature_size; + } else if (graph.feature_capacity < feature_size) { + cudaFree(graph.feature_list); + cudaFree(graph.slot_id_list); + CUDA_CHECK( + cudaMalloc(&graph.feature_list, feature_size * sizeof(uint64_t))); + CUDA_CHECK(cudaMalloc(&graph.slot_id_list, feature_size * sizeof(uint8_t))); + graph.feature_capacity = feature_size; + } else { + CUDA_CHECK( + cudaMemset(graph.feature_list, 0, feature_size * sizeof(uint64_t))); + CUDA_CHECK( + cudaMemset(graph.slot_id_list, 0, feature_size * sizeof(uint8_t))); + } } void GpuPsGraphTable::clear_graph_info(int gpu_id, int idx) { @@ -394,72 +722,38 @@ gpu i saves the ith graph from cpu_graph_list */ void GpuPsGraphTable::build_graph_fea_on_single_gpu(const GpuPsCommGraphFea& g, int gpu_id) { - clear_feature_info(gpu_id); - int ntype_id = 0; - platform::CUDADeviceGuard guard(resource_->dev_id(gpu_id)); - + size_t capacity = std::max((uint64_t)1, g.node_size) / load_factor_; + reset_feature_info(gpu_id, capacity, g.feature_size); + int ntype_id = 0; int offset = gpu_id * feature_table_num_ + ntype_id; - gpu_graph_fea_list_[offset] = GpuPsCommGraphFea(); - int table_offset = get_table_offset(gpu_id, GraphTableType::FEATURE_TABLE, ntype_id); - - size_t capacity = std::max((uint64_t)1, g.node_size) / load_factor_; - tables_[table_offset] = new Table(capacity); if (g.node_size > 0) { build_ps(gpu_id, g.node_list, - (uint64_t*)g.fea_info_list, + reinterpret_cast(g.fea_info_list), g.node_size, 1024, 8, table_offset); - gpu_graph_fea_list_[offset].node_list = NULL; gpu_graph_fea_list_[offset].node_size = g.node_size; } else { build_ps(gpu_id, NULL, NULL, 0, 1024, 8, table_offset); - gpu_graph_fea_list_[offset].node_list = NULL; gpu_graph_fea_list_[offset].node_size = 0; } if (g.feature_size) { - // TODO - cudaError_t cudaStatus = - cudaMalloc((void**)&gpu_graph_fea_list_[offset].feature_list, - g.feature_size * sizeof(uint64_t)); - PADDLE_ENFORCE_EQ( - cudaStatus, - cudaSuccess, - platform::errors::InvalidArgument( - "ailed to allocate memory for graph-feature on gpu ")); - VLOG(0) << "sucessfully allocate " << g.feature_size * sizeof(uint64_t) - << " bytes of memory for graph-feature on gpu " - << resource_->dev_id(gpu_id); CUDA_CHECK(cudaMemcpy(gpu_graph_fea_list_[offset].feature_list, g.feature_list, g.feature_size * sizeof(uint64_t), cudaMemcpyHostToDevice)); - - // TODO - cudaStatus = cudaMalloc((void**)&gpu_graph_fea_list_[offset].slot_id_list, - g.feature_size * sizeof(uint8_t)); - PADDLE_ENFORCE_EQ( - cudaStatus, - cudaSuccess, - platform::errors::InvalidArgument( - "ailed to allocate memory for graph-feature on gpu ")); - VLOG(0) << "sucessfully allocate " << g.feature_size * sizeof(uint8_t) - << " bytes of memory for graph-feature on gpu " - << 
resource_->dev_id(gpu_id); - cudaMemcpy(gpu_graph_fea_list_[offset].slot_id_list, - g.slot_id_list, - g.feature_size * sizeof(uint8_t), - cudaMemcpyHostToDevice); + CUDA_CHECK(cudaMemcpy(gpu_graph_fea_list_[offset].slot_id_list, + g.slot_id_list, + g.feature_size * sizeof(uint8_t), + cudaMemcpyHostToDevice)); gpu_graph_fea_list_[offset].feature_size = g.feature_size; } else { - gpu_graph_fea_list_[offset].feature_list = NULL; - gpu_graph_fea_list_[offset].slot_id_list = NULL; gpu_graph_fea_list_[offset].feature_size = 0; } VLOG(0) << "gpu node_feature info card :" << gpu_id << " ,node_size is " @@ -467,6 +761,38 @@ void GpuPsGraphTable::build_graph_fea_on_single_gpu(const GpuPsCommGraphFea& g, << gpu_graph_fea_list_[offset].feature_size; } +std::vector> +GpuPsGraphTable::get_edge_type_graph(int gpu_id, int edge_type_len) { + int total_gpu = resource_->total_device(); + auto stream = resource_->local_stream(gpu_id, 0); + + platform::CUDAPlace place = platform::CUDAPlace(resource_->dev_id(gpu_id)); + platform::CUDADeviceGuard guard(resource_->dev_id(gpu_id)); + + std::vector> graphs_vec; + for (int i = 0; i < total_gpu; i++) { + GpuPsCommGraph graphs[edge_type_len]; // NOLINT + for (int idx = 0; idx < edge_type_len; idx++) { + int table_offset = get_table_offset(i, GraphTableType::EDGE_TABLE, idx); + int offset = i * graph_table_num_ + idx; + graphs[idx] = gpu_graph_list_[offset]; + } + auto d_commgraph_mem = memory::AllocShared( + place, + edge_type_len * sizeof(GpuPsCommGraph), + phi::Stream(reinterpret_cast(stream))); + GpuPsCommGraph* d_commgraph_ptr = + reinterpret_cast(d_commgraph_mem->ptr()); + CUDA_CHECK(cudaMemcpy(d_commgraph_ptr, + graphs, + sizeof(GpuPsCommGraph) * edge_type_len, + cudaMemcpyHostToDevice)); + graphs_vec.emplace_back(d_commgraph_mem); + } + + return graphs_vec; +} + /* the parameter std::vector cpu_graph_list is generated by cpu. it saves the graph to be saved on each gpu. 
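// A minimal usage sketch of the per-edge-type sampling path added above
// (illustrative, not part of the patch). The helper name and the d_keys
// buffer are assumptions; get_edge_type_graph() and
// graph_neighbor_sample_all_edge_type() are the functions introduced in this
// file, and the result layout follows fill_dvalues_with_edge_type.
void sample_all_edge_types_sketch(GpuPsGraphTable* table,
                                  int gpu_id,
                                  int edge_type_len,
                                  uint64_t* d_keys,  // `len` node ids resident on gpu_id
                                  int sample_size,
                                  int len) {
  // One device allocation per GPU, each holding edge_type_len GpuPsCommGraph copies.
  auto edge_type_graphs = table->get_edge_type_graph(gpu_id, edge_type_len);
  NeighborSampleResultV2 res = table->graph_neighbor_sample_all_edge_type(
      gpu_id, edge_type_len, d_keys, sample_size, len, edge_type_graphs);
  // res.val is laid out as [edge_type][query_node][sample_size];
  // res.actual_sample_size as [edge_type][query_node].
}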
@@ -487,7 +813,7 @@ void GpuPsGraphTable::build_graph_on_single_gpu(const GpuPsCommGraph& g, tables_[table_offset] = new Table(capacity); if (g.node_size > 0) { if (FLAGS_gpugraph_load_node_list_into_hbm) { - CUDA_CHECK(cudaMalloc((void**)&gpu_graph_list_[offset].node_list, + CUDA_CHECK(cudaMalloc(&gpu_graph_list_[offset].node_list, g.node_size * sizeof(uint64_t))); CUDA_CHECK(cudaMemcpy(gpu_graph_list_[offset].node_list, g.node_list, @@ -497,7 +823,7 @@ void GpuPsGraphTable::build_graph_on_single_gpu(const GpuPsCommGraph& g, build_ps(i, g.node_list, - (uint64_t*)(g.node_info_list), + reinterpret_cast(g.node_info_list), g.node_size, 1024, 8, @@ -509,9 +835,8 @@ void GpuPsGraphTable::build_graph_on_single_gpu(const GpuPsCommGraph& g, gpu_graph_list_[offset].node_size = 0; } if (g.neighbor_size) { - cudaError_t cudaStatus = - cudaMalloc((void**)&gpu_graph_list_[offset].neighbor_list, - g.neighbor_size * sizeof(uint64_t)); + cudaError_t cudaStatus = cudaMalloc(&gpu_graph_list_[offset].neighbor_list, + g.neighbor_size * sizeof(uint64_t)); PADDLE_ENFORCE_EQ(cudaStatus, cudaSuccess, platform::errors::InvalidArgument( @@ -553,7 +878,7 @@ void GpuPsGraphTable::build_graph_fea_from_cpu( if (cpu_graph_fea_list[i].node_size > 0) { build_ps(i, cpu_graph_fea_list[i].node_list, - (uint64_t*)cpu_graph_fea_list[i].fea_info_list, + reinterpret_cast(cpu_graph_fea_list[i].fea_info_list), cpu_graph_fea_list[i].node_size, 1024, 8, @@ -567,7 +892,7 @@ void GpuPsGraphTable::build_graph_fea_from_cpu( if (cpu_graph_fea_list[i].feature_size) { // TODO CUDA_CHECK( - cudaMalloc((void**)&gpu_graph_fea_list_[offset].feature_list, + cudaMalloc(&gpu_graph_fea_list_[offset].feature_list, cpu_graph_fea_list[i].feature_size * sizeof(uint64_t))); CUDA_CHECK( @@ -578,7 +903,7 @@ void GpuPsGraphTable::build_graph_fea_from_cpu( // TODO CUDA_CHECK( - cudaMalloc((void**)&gpu_graph_fea_list_[offset].slot_id_list, + cudaMalloc(&gpu_graph_fea_list_[offset].slot_id_list, cpu_graph_fea_list[i].feature_size * sizeof(uint8_t))); CUDA_CHECK( @@ -617,7 +942,7 @@ void GpuPsGraphTable::build_graph_from_cpu( new Table(std::max((uint64_t)1, (uint64_t)cpu_graph_list[i].node_size) / load_factor_); if (cpu_graph_list[i].node_size > 0) { - CUDA_CHECK(cudaMalloc((void**)&gpu_graph_list_[offset].node_list, + CUDA_CHECK(cudaMalloc(&gpu_graph_list_[offset].node_list, cpu_graph_list[i].node_size * sizeof(uint64_t))); CUDA_CHECK(cudaMemcpy(gpu_graph_list_[offset].node_list, cpu_graph_list[i].node_list, @@ -625,7 +950,7 @@ void GpuPsGraphTable::build_graph_from_cpu( cudaMemcpyHostToDevice)); build_ps(i, cpu_graph_list[i].node_list, - (uint64_t*)(cpu_graph_list[i].node_info_list), + reinterpret_cast(cpu_graph_list[i].node_info_list), cpu_graph_list[i].node_size, 1024, 8, @@ -638,7 +963,7 @@ void GpuPsGraphTable::build_graph_from_cpu( } if (cpu_graph_list[i].neighbor_size) { CUDA_CHECK( - cudaMalloc((void**)&gpu_graph_list_[offset].neighbor_list, + cudaMalloc(&gpu_graph_list_[offset].neighbor_list, cpu_graph_list[i].neighbor_size * sizeof(uint64_t))); CUDA_CHECK(cudaMemcpy(gpu_graph_list_[offset].neighbor_list, @@ -655,20 +980,22 @@ void GpuPsGraphTable::build_graph_from_cpu( } NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v3( - NeighborSampleQuery q, bool cpu_switch) { + NeighborSampleQuery q, bool cpu_switch, bool compress = true) { return graph_neighbor_sample_v2(global_device_map[q.gpu_id], q.table_idx, q.src_nodes, q.sample_size, q.len, - cpu_switch); + cpu_switch, + compress); } NeighborSampleResult 
GpuPsGraphTable::graph_neighbor_sample(int gpu_id, uint64_t* key, int sample_size, int len) { - return graph_neighbor_sample_v2(gpu_id, 0, key, sample_size, len, false); + return graph_neighbor_sample_v2( + gpu_id, 0, key, sample_size, len, false, true); } NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( @@ -677,8 +1004,11 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( uint64_t* key, int sample_size, int len, - bool cpu_query_switch) { + bool cpu_query_switch, + bool compress) { NeighborSampleResult result; + auto stream = resource_->local_stream(gpu_id, 0); + result.set_stream(stream); result.initialize(sample_size, len, resource_->dev_id(gpu_id)); if (len == 0) { @@ -691,15 +1021,20 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( int* actual_sample_size = result.actual_sample_size; uint64_t* val = result.val; int total_gpu = resource_->total_device(); - auto stream = resource_->local_stream(gpu_id, 0); int grid_size = (len - 1) / block_size_ + 1; int h_left[total_gpu]; // NOLINT int h_right[total_gpu]; // NOLINT - auto d_left = memory::Alloc(place, total_gpu * sizeof(int)); - auto d_right = memory::Alloc(place, total_gpu * sizeof(int)); + auto d_left = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + auto d_right = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); int* d_left_ptr = reinterpret_cast(d_left->ptr()); int* d_right_ptr = reinterpret_cast(d_right->ptr()); int default_value = 0; @@ -710,30 +1045,53 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( CUDA_CHECK(cudaMemsetAsync(d_left_ptr, -1, total_gpu * sizeof(int), stream)); CUDA_CHECK(cudaMemsetAsync(d_right_ptr, -1, total_gpu * sizeof(int), stream)); // - auto d_idx = memory::Alloc(place, len * sizeof(int)); + auto d_idx = + memory::Alloc(place, + len * sizeof(int), + phi::Stream(reinterpret_cast(stream))); int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); - auto d_shard_keys = memory::Alloc(place, len * sizeof(uint64_t)); + auto d_shard_keys = + memory::Alloc(place, + len * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); uint64_t* d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); auto d_shard_vals = - memory::Alloc(place, sample_size * len * sizeof(uint64_t)); + memory::Alloc(place, + sample_size * len * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); uint64_t* d_shard_vals_ptr = reinterpret_cast(d_shard_vals->ptr()); - auto d_shard_actual_sample_size = memory::Alloc(place, len * sizeof(int)); + auto d_shard_actual_sample_size = + memory::Alloc(place, + len * sizeof(int), + phi::Stream(reinterpret_cast(stream))); int* d_shard_actual_sample_size_ptr = reinterpret_cast(d_shard_actual_sample_size->ptr()); - split_input_to_shard( - (uint64_t*)(key), d_idx_ptr, len, d_left_ptr, d_right_ptr, gpu_id); + split_input_to_shard(reinterpret_cast(key), + d_idx_ptr, + len, + d_left_ptr, + d_right_ptr, + gpu_id); heter_comm_kernel_->fill_shard_key( d_shard_keys_ptr, key, d_idx_ptr, len, stream); CUDA_CHECK(cudaStreamSynchronize(stream)); - CUDA_CHECK(cudaMemcpy( - h_left, d_left_ptr, total_gpu * sizeof(int), cudaMemcpyDeviceToHost)); - CUDA_CHECK(cudaMemcpy( - h_right, d_right_ptr, total_gpu * sizeof(int), cudaMemcpyDeviceToHost)); + CUDA_CHECK(cudaMemcpyAsync(h_left, + d_left_ptr, + total_gpu * sizeof(int), + cudaMemcpyDeviceToHost, + stream)); + CUDA_CHECK(cudaMemcpyAsync(h_right, + d_right_ptr, + total_gpu * sizeof(int), + cudaMemcpyDeviceToHost, + 
stream)); + CUDA_CHECK(cudaStreamSynchronize(stream)); + device_mutex_[gpu_id]->lock(); for (int i = 0; i < total_gpu; ++i) { int shard_len = h_left[i] == -1 ? 0 : h_right[i] - h_left[i] + 1; if (shard_len == 0) { @@ -746,8 +1104,12 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( shard_len * sizeof(uint64_t) + sizeof(int) * (shard_len + shard_len % 2)); } - walk_to_dest( - gpu_id, total_gpu, h_left, h_right, (uint64_t*)(d_shard_keys_ptr), NULL); + walk_to_dest(gpu_id, + total_gpu, + h_left, + h_right, + reinterpret_cast(d_shard_keys_ptr), + NULL); for (int i = 0; i < total_gpu; ++i) { if (h_left[i] == -1) { @@ -757,7 +1119,7 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( auto& node = path_[gpu_id][i].nodes_.back(); CUDA_CHECK(cudaMemsetAsync( - node.val_storage, 0, shard_len * sizeof(int64_t), node.in_stream)); + node.val_storage, 0, shard_len * sizeof(uint64_t), node.in_stream)); CUDA_CHECK(cudaStreamSynchronize(node.in_stream)); platform::CUDADeviceGuard guard(resource_->dev_id(i)); // If not found, val is -1. @@ -765,22 +1127,22 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( int offset = i * graph_table_num_ + idx; tables_[table_offset]->get(reinterpret_cast(node.key_storage), reinterpret_cast(node.val_storage), - (size_t)(h_right[i] - h_left[i] + 1), + static_cast(h_right[i] - h_left[i] + 1), resource_->remote_stream(i, gpu_id)); auto graph = gpu_graph_list_[offset]; GpuPsNodeInfo* node_info_list = reinterpret_cast(node.val_storage); - int* actual_size_array = (int*)(node_info_list + shard_len); - uint64_t* sample_array = - (uint64_t*)(actual_size_array + shard_len + shard_len % 2); + int* actual_size_array = reinterpret_cast(node_info_list + shard_len); + uint64_t* sample_array = reinterpret_cast( + actual_size_array + shard_len + shard_len % 2); + constexpr int WARP_SIZE = 32; constexpr int BLOCK_WARPS = 128 / WARP_SIZE; constexpr int TILE_SIZE = BLOCK_WARPS * 16; const dim3 block(WARP_SIZE, BLOCK_WARPS); const dim3 grid((shard_len + TILE_SIZE - 1) / TILE_SIZE); - - neighbor_sample_kernel + neighbor_sample_kernel_walking <<remote_stream(i, gpu_id)>>>( graph, node_info_list, @@ -804,6 +1166,16 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( h_right, d_shard_vals_ptr, d_shard_actual_sample_size_ptr); + + for (int i = 0; i < total_gpu; ++i) { + int shard_len = h_left[i] == -1 ? 
0 : h_right[i] - h_left[i] + 1; + if (shard_len == 0) { + continue; + } + destroy_storage(gpu_id, i); + } + device_mutex_[gpu_id]->unlock(); + fill_dvalues<<>>( d_shard_vals_ptr, val, @@ -854,7 +1226,9 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( uint64_t* merge_buffers = new uint64_t[total_cpu_sample_size]; int start = 0; for (int j = 0; j < number_on_cpu; j++) { - memcpy(merge_buffers + start, (uint64_t*)(buffers[j].get()), ac[j]); + memcpy(merge_buffers + start, + reinterpret_cast(buffers[j].get()), + ac[j]); start += ac[j] / sizeof(uint64_t); } @@ -905,46 +1279,261 @@ NeighborSampleResult GpuPsGraphTable::graph_neighbor_sample_v2( delete[] merge_buffers; delete[] cpu_keys; } + CUDA_CHECK(cudaStreamSynchronize(stream)); } - { + if (compress) { CUDA_CHECK(cudaStreamSynchronize(stream)); platform::CUDAPlace place = platform::CUDAPlace(resource_->dev_id(gpu_id)); platform::CUDADeviceGuard guard(resource_->dev_id(gpu_id)); - - thrust::device_vector t_actual_sample_size(len); - thrust::copy(actual_sample_size, - actual_sample_size + len, - t_actual_sample_size.begin()); - int total_sample_size = thrust::reduce(t_actual_sample_size.begin(), - t_actual_sample_size.end()); - - result.actual_val_mem = - memory::AllocShared(place, total_sample_size * sizeof(uint64_t)); - result.actual_val = (uint64_t*)(result.actual_val_mem)->ptr(); + size_t temp_storage_bytes = 0; + int total_sample_size = 0; + auto cumsum_actual_sample_size = + memory::Alloc(place, + (len + 1) * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + int* cumsum_actual_sample_size_p = + reinterpret_cast(cumsum_actual_sample_size->ptr()); + CUDA_CHECK( + cudaMemsetAsync(cumsum_actual_sample_size_p, 0, sizeof(int), stream)); + CUDA_CHECK(cub::DeviceScan::InclusiveSum(NULL, + temp_storage_bytes, + actual_sample_size, + cumsum_actual_sample_size_p + 1, + len, + stream)); + auto d_temp_storage = + memory::Alloc(place, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); + CUDA_CHECK(cub::DeviceScan::InclusiveSum(d_temp_storage->ptr(), + temp_storage_bytes, + actual_sample_size, + cumsum_actual_sample_size_p + 1, + len, + stream)); + CUDA_CHECK(cudaMemcpyAsync(&total_sample_size, + cumsum_actual_sample_size_p + len, + sizeof(int), + cudaMemcpyDeviceToHost, + stream)); + CUDA_CHECK(cudaStreamSynchronize(stream)); + result.actual_val_mem = memory::AllocShared( + place, + total_sample_size * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + result.actual_val = + reinterpret_cast((result.actual_val_mem)->ptr()); result.set_total_sample_size(total_sample_size); - thrust::device_vector cumsum_actual_sample_size(len); - thrust::exclusive_scan(t_actual_sample_size.begin(), - t_actual_sample_size.end(), - cumsum_actual_sample_size.begin(), - 0); fill_actual_vals<<>>( val, result.actual_val, actual_sample_size, - thrust::raw_pointer_cast(cumsum_actual_sample_size.data()), + cumsum_actual_sample_size_p, sample_size, len); + CUDA_CHECK(cudaStreamSynchronize(stream)); // hbm safe + } + + cudaStreamSynchronize(stream); + return result; +} + +NeighborSampleResultV2 GpuPsGraphTable::graph_neighbor_sample_all_edge_type( + int gpu_id, + int edge_type_len, + uint64_t* key, + int sample_size, + int len, + std::vector> edge_type_graphs) { + NeighborSampleResultV2 result; + auto stream = resource_->local_stream(gpu_id, 0); + result.set_stream(stream); + result.initialize(sample_size, len, edge_type_len, resource_->dev_id(gpu_id)); + if (len == 0) { + return result; + } + + platform::CUDAPlace place = 
platform::CUDAPlace(resource_->dev_id(gpu_id)); + platform::CUDADeviceGuard guard(resource_->dev_id(gpu_id)); + + int* actual_sample_size = result.actual_sample_size; + uint64_t* val = result.val; + int total_gpu = resource_->total_device(); + + int grid_size = (len - 1) / block_size_ + 1; + int h_left[total_gpu]; // NOLINT + int h_right[total_gpu]; // NOLINT + auto d_left = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + auto d_right = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + int* d_left_ptr = reinterpret_cast(d_left->ptr()); + int* d_right_ptr = reinterpret_cast(d_right->ptr()); + int default_value = 0; + CUDA_CHECK(cudaMemsetAsync(d_left_ptr, -1, total_gpu * sizeof(int), stream)); + CUDA_CHECK(cudaMemsetAsync(d_right_ptr, -1, total_gpu * sizeof(int), stream)); + + auto d_idx = + memory::Alloc(place, + len * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); + auto d_shard_keys = + memory::Alloc(place, + len * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + uint64_t* d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); + auto d_shard_vals = + memory::Alloc(place, + sample_size * len * edge_type_len * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + uint64_t* d_shard_vals_ptr = reinterpret_cast(d_shard_vals->ptr()); + auto d_shard_actual_sample_size = + memory::Alloc(place, + len * edge_type_len * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + int* d_shard_actual_sample_size_ptr = + reinterpret_cast(d_shard_actual_sample_size->ptr()); + + split_input_to_shard(reinterpret_cast(key), + d_idx_ptr, + len, + d_left_ptr, + d_right_ptr, + gpu_id); + + heter_comm_kernel_->fill_shard_key( + d_shard_keys_ptr, key, d_idx_ptr, len, stream); + + CUDA_CHECK(cudaStreamSynchronize(stream)); + + CUDA_CHECK(cudaMemcpyAsync(h_left, + d_left_ptr, + total_gpu * sizeof(int), + cudaMemcpyDeviceToHost, + stream)); + CUDA_CHECK(cudaMemcpyAsync(h_right, + d_right_ptr, + total_gpu * sizeof(int), + cudaMemcpyDeviceToHost, + stream)); + CUDA_CHECK(cudaStreamSynchronize(stream)); + + device_mutex_[gpu_id]->lock(); + for (int i = 0; i < total_gpu; ++i) { + int shard_len = h_left[i] == -1 ? 0 : h_right[i] - h_left[i] + 1; + if (shard_len == 0) { + continue; + } + create_storage( + gpu_id, + i, + shard_len * sizeof(uint64_t), + shard_len * sizeof(uint64_t) * edge_type_len + // key + (shard_len * sample_size * sizeof(uint64_t)) * + edge_type_len + // sample + shard_len * sizeof(int) * edge_type_len + // actual sample size + ((shard_len * edge_type_len) % 2) * sizeof(int)); // align } + walk_to_dest(gpu_id, + total_gpu, + h_left, + h_right, + reinterpret_cast(d_shard_keys_ptr), + NULL); + for (int i = 0; i < total_gpu; ++i) { + if (h_left[i] == -1) { + continue; + } + int shard_len = h_left[i] == -1 ? 
0 : h_right[i] - h_left[i] + 1; + auto& node = path_[gpu_id][i].nodes_.back(); + CUDA_CHECK(cudaMemsetAsync(node.val_storage, + 0, + shard_len * edge_type_len * sizeof(uint64_t), + node.in_stream)); + CUDA_CHECK(cudaStreamSynchronize(node.in_stream)); + platform::CUDADeviceGuard guard(resource_->dev_id(i)); + + GpuPsNodeInfo* node_info_base = + reinterpret_cast(node.val_storage); + for (int idx = 0; idx < edge_type_len; idx++) { + int table_offset = get_table_offset(i, GraphTableType::EDGE_TABLE, idx); + int offset = i * graph_table_num_ + idx; + tables_[table_offset]->get( + reinterpret_cast(node.key_storage), + reinterpret_cast(node_info_base + idx * shard_len), + static_cast(shard_len), + resource_->remote_stream(i, gpu_id)); + } + + auto d_commgraph_mem = edge_type_graphs[i]; + GpuPsCommGraph* d_commgraph_ptr = + reinterpret_cast(d_commgraph_mem->ptr()); + int* actual_size_base = + reinterpret_cast(node_info_base + shard_len * edge_type_len); + uint64_t* sample_array_base = reinterpret_cast( + actual_size_base + shard_len * edge_type_len + + (shard_len * edge_type_len) % 2); + int grid_size_ = (shard_len * edge_type_len - 1) / block_size_ + 1; + neighbor_sample_kernel_all_edge_type<<remote_stream(i, + gpu_id)>>>( + d_commgraph_ptr, + node_info_base, + actual_size_base, + sample_array_base, + sample_size, + shard_len * edge_type_len, + default_value, + shard_len); + } + + for (int i = 0; i < total_gpu; ++i) { + if (h_left[i] == -1) { + continue; + } + CUDA_CHECK(cudaStreamSynchronize(resource_->remote_stream(i, gpu_id))); + } + + move_result_to_source_gpu_all_edge_type(gpu_id, + total_gpu, + sample_size, + h_left, + h_right, + d_shard_vals_ptr, + d_shard_actual_sample_size_ptr, + edge_type_len, + len); + + int grid_size_e = (len * edge_type_len - 1) / block_size_ + 1; + fill_dvalues_with_edge_type<<>>( + d_shard_vals_ptr, + val, + d_shard_actual_sample_size_ptr, + actual_sample_size, + d_idx_ptr, + sample_size, + len * edge_type_len, + len); + CUDA_CHECK(cudaStreamSynchronize(stream)); + + for (int i = 0; i < total_gpu; i++) { int shard_len = h_left[i] == -1 ? 
0 : h_right[i] - h_left[i] + 1; if (shard_len == 0) { continue; } destroy_storage(gpu_id, i); } - cudaStreamSynchronize(stream); + device_mutex_[gpu_id]->unlock(); return result; } @@ -992,16 +1581,338 @@ NodeQueryResult GpuPsGraphTable::query_node_list(int gpu_id, block_size_, 0, resource_->remote_stream(gpu_id, gpu_id)>>>( - graph, x2, len, (uint64_t*)val); + graph, x2, len, reinterpret_cast(val)); CUDA_CHECK(cudaStreamSynchronize(resource_->remote_stream(gpu_id, gpu_id))); return result; } +int GpuPsGraphTable::get_feature_info_of_nodes( + int gpu_id, + uint64_t* d_nodes, + int node_num, + uint32_t* size_list, + uint32_t* size_list_prefix_sum, + std::shared_ptr& feature_list, + std::shared_ptr& slot_list) { + if (node_num == 0) { + return 0; + } + platform::CUDAPlace place = platform::CUDAPlace(resource_->dev_id(gpu_id)); + platform::CUDADeviceGuard guard(resource_->dev_id(gpu_id)); + int total_gpu = resource_->total_device(); + auto stream = resource_->local_stream(gpu_id, 0); + + auto d_left = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + auto d_right = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + int* d_left_ptr = reinterpret_cast(d_left->ptr()); + int* d_right_ptr = reinterpret_cast(d_right->ptr()); + + CUDA_CHECK(cudaMemsetAsync(d_left_ptr, -1, total_gpu * sizeof(int), stream)); + CUDA_CHECK(cudaMemsetAsync(d_right_ptr, -1, total_gpu * sizeof(int), stream)); + auto d_idx = + memory::Alloc(place, + node_num * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); + + auto d_shard_keys = + memory::Alloc(place, + node_num * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + uint64_t* d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); + split_input_to_shard( + d_nodes, d_idx_ptr, node_num, d_left_ptr, d_right_ptr, gpu_id); + + heter_comm_kernel_->fill_shard_key( + d_shard_keys_ptr, d_nodes, d_idx_ptr, node_num, stream); + CUDA_CHECK(cudaStreamSynchronize(stream)); + + std::vector d_fea_info(total_gpu, NULL); + std::vector d_fea_size(total_gpu, NULL); + std::vector d_fea_size_prefix_sum(total_gpu, NULL); + std::vector fea_num_list(total_gpu, 0); + std::vector fea_left(total_gpu, -1); + + int h_left[total_gpu]; // NOLINT + CUDA_CHECK(cudaMemcpy( + h_left, d_left_ptr, total_gpu * sizeof(int), cudaMemcpyDeviceToHost)); + int h_right[total_gpu]; // NOLINT + CUDA_CHECK(cudaMemcpy( + h_right, d_right_ptr, total_gpu * sizeof(int), cudaMemcpyDeviceToHost)); + device_mutex_[gpu_id]->lock(); + int shard_len[total_gpu]; // NOLINT + void* d_temp_storage[total_gpu]; + std::vector temp_storage_bytes(total_gpu, 0); + + for (int i = 0; i < total_gpu; ++i) { + shard_len[i] = h_left[i] == -1 ? 
0 : h_right[i] - h_left[i] + 1; + d_temp_storage[i] = NULL; + if (h_left[i] == -1) { + continue; + } + create_storage(gpu_id, i, shard_len[i] * sizeof(uint64_t), 0); + platform::CUDADeviceGuard guard(resource_->dev_id(i)); + auto& node = path_[gpu_id][i].nodes_.back(); + create_tmp_storage( + d_fea_info[i], gpu_id, i, shard_len[i] * sizeof(uint64_t)); + CUDA_CHECK(cudaMemsetAsync( + d_fea_info[i], 0, shard_len[i] * sizeof(uint64_t), node.in_stream)); + create_tmp_storage( + d_fea_size[i], gpu_id, i, shard_len[i] * sizeof(uint32_t)); + create_tmp_storage(d_fea_size_prefix_sum[i], + gpu_id, + i, + (shard_len[i] + 1) * sizeof(uint32_t)); + CUDA_CHECK(cub::DeviceScan::InclusiveSum( + NULL, + temp_storage_bytes[i], + reinterpret_cast(d_fea_size[i]), + reinterpret_cast(d_fea_size_prefix_sum[i] + 1), + shard_len[i], + resource_->remote_stream(i, gpu_id))); + } + + for (int i = 0; i < total_gpu; ++i) { + if (h_left[i] == -1) { + continue; + } + platform::CUDADeviceGuard guard(resource_->dev_id(i)); + CUDA_CHECK(cudaStreamSynchronize(resource_->remote_stream( + i, gpu_id))); // wait for calc temp_storage_bytes + create_tmp_storage(d_temp_storage[i], gpu_id, i, temp_storage_bytes[i]); + } + walk_to_dest(gpu_id, + total_gpu, + h_left, + h_right, + reinterpret_cast(d_shard_keys_ptr), + NULL); + + // no sync so 8 card can parallel execute + for (int i = 0; i < total_gpu; ++i) { + if (h_left[i] == -1) { + continue; + } + platform::CUDADeviceGuard guard(resource_->dev_id(i)); + auto& node = path_[gpu_id][i].nodes_.back(); + // If not found, val is -1. + int table_offset = get_table_offset(i, GraphTableType::FEATURE_TABLE, 0); + CUDA_CHECK(cudaStreamSynchronize( + node.in_stream)); // wait for walk_to_dest and memset + tables_[table_offset]->get(reinterpret_cast(node.key_storage), + reinterpret_cast(d_fea_info[i]), + static_cast(h_right[i] - h_left[i] + 1), + resource_->remote_stream(i, gpu_id)); + dim3 grid((shard_len[i] - 1) / dim_y + 1); + dim3 block(1, dim_y); + + get_features_size<<remote_stream(i, gpu_id)>>>( + reinterpret_cast(d_fea_info[i]), + reinterpret_cast(d_fea_size[i]), + shard_len[i]); + CUDA_CHECK(cudaMemsetAsync(d_fea_size_prefix_sum[i], + 0, + sizeof(uint32_t), + resource_->remote_stream(i, gpu_id))); + CUDA_CHECK(cub::DeviceScan::InclusiveSum( + d_temp_storage[i], + temp_storage_bytes[i], + reinterpret_cast(d_fea_size[i]), + reinterpret_cast(d_fea_size_prefix_sum[i]) + 1, + shard_len[i], + resource_->remote_stream(i, gpu_id))); + } + + // wait for fea_num_list + for (int i = 0; i < total_gpu; ++i) { + platform::CUDADeviceGuard guard(resource_->dev_id(i)); + if (h_left[i] == -1) { + continue; + } + auto& node = path_[gpu_id][i].nodes_.back(); + CUDA_CHECK(cudaMemcpyAsync( + &fea_num_list[i], + reinterpret_cast(d_fea_size_prefix_sum[i]) + shard_len[i], + sizeof(uint32_t), + cudaMemcpyDeviceToHost, + resource_->remote_stream(i, gpu_id))); + + CUDA_CHECK(cudaStreamSynchronize( + resource_->remote_stream(i, gpu_id))); // wait for fea_num_list + + create_storage(gpu_id, + i, + 0, + (shard_len[i] + shard_len[i] % 2) * sizeof(uint32_t) + + fea_num_list[i] * sizeof(uint64_t) + + fea_num_list[i] * sizeof(uint8_t)); + uint32_t* actual_size_array = reinterpret_cast(node.val_storage); + CUDA_CHECK(cudaMemcpyAsync(actual_size_array, + d_fea_size[i], + sizeof(uint32_t) * shard_len[i], + cudaMemcpyDeviceToDevice, + resource_->remote_stream(i, gpu_id))); + int offset = i * feature_table_num_; + auto graph = gpu_graph_fea_list_[offset]; + + uint64_t* feature_array = reinterpret_cast( + actual_size_array 
+ shard_len[i] + shard_len[i] % 2); + uint8_t* slot_array = + reinterpret_cast(feature_array + fea_num_list[i]); + dim3 grid((shard_len[i] - 1) / dim_y + 1); + dim3 block(1, dim_y); + get_features_kernel<<remote_stream(i, gpu_id)>>>( + graph, + reinterpret_cast(d_fea_info[i]), + reinterpret_cast(d_fea_size_prefix_sum[i]), + feature_array, + slot_array, + shard_len[i]); + } + + for (int i = 0; i < total_gpu; ++i) { + if (h_left[i] == -1) { + continue; + } + CUDA_CHECK(cudaStreamSynchronize(resource_->remote_stream(i, gpu_id))); + } + + uint32_t all_fea_num = 0; + for (int i = 0; i < total_gpu; ++i) { + fea_left[i] = all_fea_num; + all_fea_num += fea_num_list[i]; + } + + auto feature_list_tmp = + memory::Alloc(place, + all_fea_num * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + uint64_t* d_feature_list_ptr = + reinterpret_cast(feature_list_tmp->ptr()); + + auto slot_list_tmp = + memory::Alloc(place, + all_fea_num * sizeof(uint8_t), + phi::Stream(reinterpret_cast(stream))); + uint8_t* d_slot_list_ptr = reinterpret_cast(slot_list_tmp->ptr()); + + auto size_list_tmp = + memory::Alloc(place, + node_num * sizeof(uint32_t), + phi::Stream(reinterpret_cast(stream))); + uint32_t* d_size_list_ptr = reinterpret_cast(size_list_tmp->ptr()); + + move_result_to_source_gpu(gpu_id, + total_gpu, + h_left, + h_right, + fea_left.data(), + fea_num_list.data(), + d_size_list_ptr, + d_feature_list_ptr, + d_slot_list_ptr); + + for (int i = 0; i < total_gpu; ++i) { + if (shard_len[i] == 0) { + continue; + } + destroy_storage(gpu_id, i); + if (d_fea_info[i] != NULL) { + destroy_tmp_storage(d_fea_info[i], gpu_id, i); + } + if (d_fea_size[i] != NULL) { + destroy_tmp_storage(d_fea_size[i], gpu_id, i); + } + if (d_fea_size_prefix_sum[i] != NULL) { + destroy_tmp_storage(d_fea_size_prefix_sum[i], gpu_id, i); + } + if (d_temp_storage[i] != NULL) { + destroy_tmp_storage(d_temp_storage[i], gpu_id, i); + } + } + + d_fea_info.clear(); + d_fea_size.clear(); + d_fea_size_prefix_sum.clear(); + device_mutex_[gpu_id]->unlock(); + feature_list = + memory::Alloc(place, + all_fea_num * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); + + uint64_t* d_res_feature_list_ptr = + reinterpret_cast(feature_list->ptr()); + + slot_list = + memory::Alloc(place, + all_fea_num * sizeof(uint8_t), + phi::Stream(reinterpret_cast(stream))); + + uint8_t* d_res_slot_list_ptr = reinterpret_cast(slot_list->ptr()); + + int grid_size = (node_num - 1) / block_size_ + 1; + fill_size<<>>( + size_list, d_size_list_ptr, d_idx_ptr, node_num); + size_t storage_bytes = 0; + auto src_fea_size_prefix_sum = + memory::Alloc(place, + node_num * sizeof(uint32_t), + phi::Stream(reinterpret_cast(stream))); + + uint32_t* src_fea_size_prefix_sum_ptr = + reinterpret_cast(src_fea_size_prefix_sum->ptr()); + CUDA_CHECK(cudaStreamSynchronize(stream)); + CUDA_CHECK(cub::DeviceScan::ExclusiveSum( + NULL, storage_bytes, size_list, size_list_prefix_sum, node_num, stream)); + CUDA_CHECK(cudaStreamSynchronize(stream)); + auto d_temp_storage_tmp = + memory::Alloc(place, + storage_bytes, + phi::Stream(reinterpret_cast(stream))); + CUDA_CHECK(cub::DeviceScan::ExclusiveSum(d_temp_storage_tmp->ptr(), + storage_bytes, + size_list, + size_list_prefix_sum, + node_num, + stream)); + + CUDA_CHECK(cub::DeviceScan::ExclusiveSum(d_temp_storage_tmp->ptr(), + storage_bytes, + d_size_list_ptr, + src_fea_size_prefix_sum_ptr, + node_num, + stream)); + fill_feature_and_slot<<>>( + d_res_feature_list_ptr, + d_res_slot_list_ptr, + size_list_prefix_sum, + d_feature_list_ptr, + 
d_slot_list_ptr, + src_fea_size_prefix_sum_ptr, + d_size_list_ptr, + d_idx_ptr, + node_num); + + CUDA_CHECK(cudaStreamSynchronize(stream)); + return all_fea_num; +} + int GpuPsGraphTable::get_feature_of_nodes(int gpu_id, uint64_t* d_nodes, uint64_t* d_feature, int node_num, - int slot_num) { + int slot_num, + int* d_slot_feature_num_map, + int fea_num_per_node) { if (node_num == 0) { return -1; } @@ -1011,23 +1922,40 @@ int GpuPsGraphTable::get_feature_of_nodes(int gpu_id, int total_gpu = resource_->total_device(); auto stream = resource_->local_stream(gpu_id, 0); - auto d_left = memory::Alloc(place, total_gpu * sizeof(int)); - auto d_right = memory::Alloc(place, total_gpu * sizeof(int)); + auto d_left = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + auto d_right = + memory::Alloc(place, + total_gpu * sizeof(int), + phi::Stream(reinterpret_cast(stream))); int* d_left_ptr = reinterpret_cast(d_left->ptr()); int* d_right_ptr = reinterpret_cast(d_right->ptr()); CUDA_CHECK(cudaMemsetAsync(d_left_ptr, -1, total_gpu * sizeof(int), stream)); CUDA_CHECK(cudaMemsetAsync(d_right_ptr, -1, total_gpu * sizeof(int), stream)); // - auto d_idx = memory::Alloc(place, node_num * sizeof(int)); + auto d_idx = + memory::Alloc(place, + node_num * sizeof(int), + phi::Stream(reinterpret_cast(stream))); int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); - auto d_shard_keys = memory::Alloc(place, node_num * sizeof(uint64_t)); + auto d_shard_keys = + memory::Alloc(place, + node_num * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); uint64_t* d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); auto d_shard_vals = - memory::Alloc(place, slot_num * node_num * sizeof(uint64_t)); + memory::Alloc(place, + fea_num_per_node * node_num * sizeof(uint64_t), + phi::Stream(reinterpret_cast(stream))); uint64_t* d_shard_vals_ptr = reinterpret_cast(d_shard_vals->ptr()); - auto d_shard_actual_size = memory::Alloc(place, node_num * sizeof(int)); + auto d_shard_actual_size = + memory::Alloc(place, + node_num * sizeof(int), + phi::Stream(reinterpret_cast(stream))); int* d_shard_actual_size_ptr = reinterpret_cast(d_shard_actual_size->ptr()); @@ -1044,6 +1972,7 @@ int GpuPsGraphTable::get_feature_of_nodes(int gpu_id, int h_right[total_gpu]; // NOLINT CUDA_CHECK(cudaMemcpy( h_right, d_right_ptr, total_gpu * sizeof(int), cudaMemcpyDeviceToHost)); + device_mutex_[gpu_id]->lock(); for (int i = 0; i < total_gpu; ++i) { int shard_len = h_left[i] == -1 ? 
0 : h_right[i] - h_left[i] + 1; if (shard_len == 0) { @@ -1052,13 +1981,17 @@ int GpuPsGraphTable::get_feature_of_nodes(int gpu_id, create_storage(gpu_id, i, shard_len * sizeof(uint64_t), - shard_len * slot_num * sizeof(uint64_t) + + shard_len * fea_num_per_node * sizeof(uint64_t) + shard_len * sizeof(uint64_t) + sizeof(int) * (shard_len + shard_len % 2)); } - walk_to_dest( - gpu_id, total_gpu, h_left, h_right, (uint64_t*)(d_shard_keys_ptr), NULL); + walk_to_dest(gpu_id, + total_gpu, + h_left, + h_right, + reinterpret_cast(d_shard_keys_ptr), + NULL); for (int i = 0; i < total_gpu; ++i) { if (h_left[i] == -1) { @@ -1075,16 +2008,16 @@ int GpuPsGraphTable::get_feature_of_nodes(int gpu_id, int table_offset = get_table_offset(i, GraphTableType::FEATURE_TABLE, 0); tables_[table_offset]->get(reinterpret_cast(node.key_storage), reinterpret_cast(node.val_storage), - (size_t)(h_right[i] - h_left[i] + 1), + static_cast(h_right[i] - h_left[i] + 1), resource_->remote_stream(i, gpu_id)); int offset = i * feature_table_num_; auto graph = gpu_graph_fea_list_[offset]; GpuPsFeaInfo* val_array = reinterpret_cast(node.val_storage); - int* actual_size_array = (int*)(val_array + shard_len); - uint64_t* feature_array = - (uint64_t*)(actual_size_array + shard_len + shard_len % 2); + int* actual_size_array = reinterpret_cast(val_array + shard_len); + uint64_t* feature_array = reinterpret_cast( + actual_size_array + shard_len + shard_len % 2); dim3 grid((shard_len - 1) / dim_y + 1); dim3 block(1, dim_y); get_features_kernel<<>>(d_shard_vals_ptr, - d_feature, - d_shard_actual_size_ptr, - d_idx_ptr, - slot_num, - node_num); - for (int i = 0; i < total_gpu; ++i) { int shard_len = h_left[i] == -1 ? 0 : h_right[i] - h_left[i] + 1; if (shard_len == 0) { @@ -1129,11 +2055,21 @@ int GpuPsGraphTable::get_feature_of_nodes(int gpu_id, } destroy_storage(gpu_id, i); } + device_mutex_[gpu_id]->unlock(); + + int grid_size = (node_num - 1) / block_size_ + 1; + fill_dvalues<<>>(d_shard_vals_ptr, + d_feature, + d_shard_actual_size_ptr, + d_idx_ptr, + fea_num_per_node, + node_num); CUDA_CHECK(cudaStreamSynchronize(stream)); return 0; } -} // namespace framework + +}; // namespace framework }; // namespace paddle #endif diff --git a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.cu b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.cu index fafb5ef26698e..40d69e13d57b7 100644 --- a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.cu +++ b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.cu @@ -14,9 +14,12 @@ #include "paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h" #include +#include "paddle/fluid/framework/fleet/fleet_wrapper.h" #include "paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h" #include "paddle/fluid/framework/fleet/heter_ps/graph_gpu_ps_table.h" #include "paddle/fluid/framework/fleet/heter_ps/heter_resource.h" +DECLARE_int32(gpugraph_storage_mode); +DECLARE_bool(graph_metapath_split_opt); namespace paddle { namespace framework { #ifdef PADDLE_WITH_HETERPS @@ -28,16 +31,246 @@ void GraphGpuWrapper::set_device(std::vector ids) { } } +void GraphGpuWrapper::init_conf(const std::string &first_node_type, + const std::string &meta_path) { + static std::mutex mutex; + { + std::lock_guard lock(mutex); + if (conf_initialized_) { + return; + } + VLOG(2) << "init path config"; + conf_initialized_ = true; + auto node_types = + paddle::string::split_string(first_node_type, ";"); + VLOG(2) << "node_types: " << first_node_type; + for (auto &type : node_types) { + auto iter = 
feature_to_id.find(type); + PADDLE_ENFORCE_NE(iter, + feature_to_id.end(), + platform::errors::NotFound( + "(%s) is not found in feature_to_id.", type)); + VLOG(2) << "feature_to_id[" << type << "] = " << iter->second; + first_node_type_.push_back(iter->second); + } + meta_path_.resize(first_node_type_.size()); + auto meta_paths = paddle::string::split_string(meta_path, ";"); + + for (size_t i = 0; i < meta_paths.size(); i++) { + auto path = meta_paths[i]; + auto nodes = paddle::string::split_string(path, "-"); + for (auto &node : nodes) { + auto iter = edge_to_id.find(node); + PADDLE_ENFORCE_NE(iter, + edge_to_id.end(), + platform::errors::NotFound( + "(%s) is not found in edge_to_id.", node)); + VLOG(2) << "edge_to_id[" << node << "] = " << iter->second; + meta_path_[i].push_back(iter->second); + } + } + int max_dev_id = 0; + for (size_t i = 0; i < device_id_mapping.size(); i++) { + if (device_id_mapping[i] > max_dev_id) { + max_dev_id = device_id_mapping[i]; + } + } + finish_node_type_.resize(max_dev_id + 1); + node_type_start_.resize(max_dev_id + 1); + global_infer_node_type_start_.resize(max_dev_id + 1); + for (size_t i = 0; i < device_id_mapping.size(); i++) { + int dev_id = device_id_mapping[i]; + auto &node_type_start = node_type_start_[i]; + auto &infer_node_type_start = global_infer_node_type_start_[i]; + auto &finish_node_type = finish_node_type_[i]; + finish_node_type.clear(); + + for (size_t idx = 0; idx < feature_to_id.size(); idx++) { + infer_node_type_start[idx] = 0; + } + for (auto &type : node_types) { + auto iter = feature_to_id.find(type); + node_type_start[iter->second] = 0; + infer_node_type_start[iter->second] = 0; + } + infer_cursor_.push_back(0); + cursor_.push_back(0); + } + init_type_keys(); + } +} + +void GraphGpuWrapper::init_type_keys() { + size_t thread_num = device_id_mapping.size(); + int cnt = 0; + + auto &graph_all_type_total_keys = get_graph_type_keys(); + auto &type_to_index = get_graph_type_to_index(); + std::vector> tmp_keys; + tmp_keys.resize(thread_num); + int first_node_idx; + d_graph_all_type_total_keys_.resize(graph_all_type_total_keys.size()); + h_graph_all_type_keys_len_.resize(graph_all_type_total_keys.size()); + for (size_t f_idx = 0; f_idx < graph_all_type_total_keys.size(); f_idx++) { + for (size_t j = 0; j < tmp_keys.size(); j++) { + tmp_keys[j].clear(); + } + d_graph_all_type_total_keys_[f_idx].resize(thread_num); + auto &type_total_key = graph_all_type_total_keys[f_idx]; + for (size_t j = 0; j < type_total_key.size(); j++) { + uint64_t shard = type_total_key[j] % thread_num; + tmp_keys[shard].push_back(type_total_key[j]); + } + for (size_t j = 0; j < thread_num; j++) { + h_graph_all_type_keys_len_[f_idx].push_back(tmp_keys[j].size()); + VLOG(1) << "node type: " << type_to_index[f_idx] + << ", gpu_graph_device_keys[" << j + << "] = " << tmp_keys[j].size(); + } + for (size_t j = 0; j < thread_num; j++) { + auto stream = get_local_stream(j); + int gpuid = device_id_mapping[j]; + auto place = platform::CUDAPlace(gpuid); + platform::CUDADeviceGuard guard(gpuid); + d_graph_all_type_total_keys_[f_idx][j] = + memory::AllocShared(place, tmp_keys[j].size() * sizeof(uint64_t)); + cudaMemcpyAsync(d_graph_all_type_total_keys_[f_idx][j]->ptr(), + tmp_keys[j].data(), + sizeof(uint64_t) * tmp_keys[j].size(), + cudaMemcpyHostToDevice, + stream); + } + } + for (int i = 0; i < thread_num; i++) { + auto stream = get_local_stream(i); + cudaStreamSynchronize(stream); + } +} + +void GraphGpuWrapper::init_metapath(std::string cur_metapath, + int 
cur_metapath_index, + int cur_metapath_len) { + cur_metapath_ = cur_metapath; + cur_metapath_index_ = cur_metapath_index; + cur_metapath_len_ = cur_metapath_len; + + auto nodes = paddle::string::split_string(cur_metapath_, "-"); + cur_parse_metapath_.clear(); + cur_parse_reverse_metapath_.clear(); + for (auto &node : nodes) { + VLOG(2) << "node: " << node << " , in metapath: " << cur_metapath_; + auto iter = edge_to_id.find(node); + PADDLE_ENFORCE_NE( + iter, + edge_to_id.end(), + platform::errors::NotFound("(%s) is not found in edge_to_id.", node)); + cur_parse_metapath_.push_back(iter->second); + auto etype_split = paddle::string::split_string(node, "2"); + std::string reverse_type = etype_split[1] + "2" + etype_split[0]; + iter = edge_to_id.find(reverse_type); + PADDLE_ENFORCE_NE(iter, + edge_to_id.end(), + platform::errors::NotFound( + "(%s) is not found in edge_to_id.", reverse_type)); + cur_parse_reverse_metapath_.push_back(iter->second); + } + + size_t thread_num = device_id_mapping.size(); + cur_metapath_start_.resize(thread_num); + for (size_t i = 0; i < thread_num; i++) { + cur_metapath_start_[i] = 0; + } + + auto &graph_all_type_total_keys = get_graph_type_keys(); + auto &type_to_index = get_graph_type_to_index(); + std::vector> tmp_keys; + tmp_keys.resize(thread_num); + int first_node_idx; + std::string first_node = + paddle::string::split_string(cur_metapath_, "2")[0]; + auto it = feature_to_id.find(first_node); + first_node_idx = it->second; + d_graph_train_total_keys_.resize(thread_num); + h_graph_train_keys_len_.resize(thread_num); + + for (size_t j = 0; j < tmp_keys.size(); j++) { + tmp_keys[j].clear(); + } + size_t f_idx = type_to_index[first_node_idx]; + auto &type_total_key = graph_all_type_total_keys[f_idx]; + + VLOG(2) << "first node type:" << first_node_idx + << ", node start size:" << type_total_key.size(); + + for (size_t j = 0; j < type_total_key.size(); j++) { + uint64_t shard = type_total_key[j] % thread_num; + tmp_keys[shard].push_back(type_total_key[j]); + } + auto fleet_ptr = framework::FleetWrapper::GetInstance(); + std::shuffle( + tmp_keys.begin(), tmp_keys.end(), fleet_ptr->LocalRandomEngine()); + + for (size_t j = 0; j < thread_num; j++) { + h_graph_train_keys_len_[j] = tmp_keys[j].size(); + VLOG(2) << j << " th card, graph train keys len: " << tmp_keys[j].size(); + } + + for (size_t j = 0; j < thread_num; j++) { + auto stream = get_local_stream(j); + int gpuid = device_id_mapping[j]; + auto place = platform::CUDAPlace(gpuid); + platform::CUDADeviceGuard guard(gpuid); + d_graph_train_total_keys_[j] = + memory::AllocShared(place, tmp_keys[j].size() * sizeof(uint64_t)); + cudaMemcpyAsync(d_graph_train_total_keys_[j]->ptr(), + tmp_keys[j].data(), + sizeof(uint64_t) * tmp_keys[j].size(), + cudaMemcpyHostToDevice, + stream); + } +} + +void GraphGpuWrapper::clear_metapath_state() { + size_t thread_num = device_id_mapping.size(); + for (size_t j = 0; j < thread_num; j++) { + cur_metapath_start_[j] = 0; + h_graph_train_keys_len_[j] = 0; + d_graph_train_total_keys_[j].reset(); + for (size_t k = 0; k < cur_parse_metapath_.size(); k++) { + reinterpret_cast(graph_table) + ->clear_graph_info(j, cur_parse_metapath_[k]); + } + } + std::vector clear_etype; + for (size_t j = 0; j < cur_parse_metapath_.size(); j++) { + if (find(clear_etype.begin(), clear_etype.end(), cur_parse_metapath_[j]) == + clear_etype.end()) { + clear_etype.push_back(cur_parse_metapath_[j]); + } + } + for (size_t j = 0; j < cur_parse_reverse_metapath_.size(); j++) { + if (find(clear_etype.begin(), + 
clear_etype.end(), + cur_parse_reverse_metapath_[j]) == clear_etype.end()) { + clear_etype.push_back(cur_parse_reverse_metapath_[j]); + } + } + for (size_t j = 0; j < clear_etype.size(); j++) { + reinterpret_cast(graph_table) + ->cpu_graph_table_->clear_graph(clear_etype[j]); + } +} + int GraphGpuWrapper::get_all_id(int type, int slice_num, std::vector> *output) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->get_all_id(type, slice_num, output); } int GraphGpuWrapper::get_all_neighbor_id( int type, int slice_num, std::vector> *output) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->get_all_neighbor_id(type, slice_num, output); } @@ -45,7 +278,7 @@ int GraphGpuWrapper::get_all_id(int type, int idx, int slice_num, std::vector> *output) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->get_all_id(type, idx, slice_num, output); } @@ -54,7 +287,7 @@ int GraphGpuWrapper::get_all_neighbor_id( int idx, int slice_num, std::vector> *output) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->get_all_neighbor_id(type, idx, slice_num, output); } @@ -63,12 +296,12 @@ int GraphGpuWrapper::get_all_feature_ids( int idx, int slice_num, std::vector> *output) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->get_all_feature_ids(type, idx, slice_num, output); } -void GraphGpuWrapper::set_up_types(std::vector &edge_types, - std::vector &node_types) { +void GraphGpuWrapper::set_up_types(const std::vector &edge_types, + const std::vector &node_types) { id_to_edge = edge_types; for (size_t table_id = 0; table_id < edge_types.size(); table_id++) { int res = edge_to_id.size(); @@ -88,36 +321,45 @@ void GraphGpuWrapper::set_up_types(std::vector &edge_types, void GraphGpuWrapper::set_feature_separator(std::string ch) { feature_separator_ = ch; if (graph_table != nullptr) { - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->cpu_graph_table_->set_feature_separator(feature_separator_); } } +void GraphGpuWrapper::set_slot_feature_separator(std::string ch) { + slot_feature_separator_ = ch; + if (graph_table != nullptr) { + reinterpret_cast(graph_table) + ->cpu_graph_table_->set_slot_feature_separator(slot_feature_separator_); + } +} + void GraphGpuWrapper::make_partitions(int idx, int64_t byte_size, int device_len) { - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->cpu_graph_table_->make_partitions(idx, byte_size, device_len); } int32_t GraphGpuWrapper::load_next_partition(int idx) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->load_next_partition(idx); } void GraphGpuWrapper::set_search_level(int level) { - ((GpuPsGraphTable *)graph_table)->cpu_graph_table_->set_search_level(level); + reinterpret_cast(graph_table) + ->cpu_graph_table_->set_search_level(level); } std::vector GraphGpuWrapper::get_partition(int idx, int num) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->get_partition(idx, num); } int32_t GraphGpuWrapper::get_partition_num(int idx) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->get_partition_num(idx); } void GraphGpuWrapper::make_complementary_graph(int idx, int64_t byte_size) { - ((GpuPsGraphTable *)graph_table) + 
reinterpret_cast(graph_table) ->cpu_graph_table_->make_complementary_graph(idx, byte_size); } void GraphGpuWrapper::load_edge_file(std::string name, @@ -133,31 +375,47 @@ void GraphGpuWrapper::load_edge_file(std::string name, params += ">" + name; } if (edge_to_id.find(name) != edge_to_id.end()) { - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->cpu_graph_table_->Load(std::string(filepath), params); } } +void GraphGpuWrapper::load_edge_file(std::string etype2files, + std::string graph_data_local_path, + int part_num, + bool reverse) { + reinterpret_cast(graph_table) + ->cpu_graph_table_->parse_edge_and_load( + etype2files, graph_data_local_path, part_num, reverse); +} + void GraphGpuWrapper::load_node_file(std::string name, std::string filepath) { // 'n' means load nodes and 'node_type' follows std::string params = "n" + name; if (feature_to_id.find(name) != feature_to_id.end()) { - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->cpu_graph_table_->Load(std::string(filepath), params); } } -void GraphGpuWrapper::load_node_and_edge(std::string etype, - std::string ntype, - std::string epath, - std::string npath, +void GraphGpuWrapper::load_node_file(std::string ntype2files, + std::string graph_data_local_path, + int part_num) { + reinterpret_cast(graph_table) + ->cpu_graph_table_->parse_node_and_load( + ntype2files, graph_data_local_path, part_num); +} + +void GraphGpuWrapper::load_node_and_edge(std::string etype2files, + std::string ntype2files, + std::string graph_data_local_path, int part_num, bool reverse) { - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->cpu_graph_table_->load_node_and_edge_file( - etype, ntype, epath, npath, part_num, reverse); + etype2files, ntype2files, graph_data_local_path, part_num, reverse); } void GraphGpuWrapper::add_table_feat_conf(std::string table_name, @@ -168,7 +426,7 @@ void GraphGpuWrapper::add_table_feat_conf(std::string table_name, int idx = feature_to_id[table_name]; if (table_feat_mapping[idx].find(feat_name) == table_feat_mapping[idx].end()) { - int res = (int)table_feat_mapping[idx].size(); + int res = table_feat_mapping[idx].size(); table_feat_mapping[idx][feat_name] = res; } int feat_idx = table_feat_mapping[idx][feat_name]; @@ -190,8 +448,13 @@ void GraphGpuWrapper::add_table_feat_conf(std::string table_name, } void GraphGpuWrapper::init_search_level(int level) { search_level = level; } +gpuStream_t GraphGpuWrapper::get_local_stream(int gpuid) { + return reinterpret_cast(graph_table) + ->get_local_stream(gpuid); +} + void GraphGpuWrapper::init_service() { - table_proto.set_task_pool_size(24); + table_proto.set_task_pool_size(64); table_proto.set_shard_num(1000); table_proto.set_build_sampler_on_cpu(false); table_proto.set_search_level(search_level); @@ -212,15 +475,18 @@ void GraphGpuWrapper::init_service() { std::shared_ptr resource = std::make_shared(device_id_mapping); resource->enable_p2p(); - GpuPsGraphTable *g = new GpuPsGraphTable(resource, 1, id_to_edge.size()); - g->init_cpu_table(table_proto); + GpuPsGraphTable *g = new GpuPsGraphTable(resource, id_to_edge.size()); + size_t gpu_num = device_id_mapping.size(); + g->init_cpu_table(table_proto, gpu_num); g->cpu_graph_table_->set_feature_separator(feature_separator_); - graph_table = (char *)g; + g->cpu_graph_table_->set_slot_feature_separator(slot_feature_separator_); + graph_table = reinterpret_cast(g); + upload_num = gpu_num; upload_task_pool.reset(new ::ThreadPool(upload_num)); } void GraphGpuWrapper::finalize() { - 
((GpuPsGraphTable *)graph_table)->show_table_collisions(); + reinterpret_cast(graph_table)->show_table_collisions(); } void GraphGpuWrapper::upload_batch(int type, @@ -228,11 +494,14 @@ void GraphGpuWrapper::upload_batch(int type, int slice_num, const std::string &edge_type) { VLOG(0) << "begin upload edge, type[" << edge_type << "]"; + auto iter = edge_to_id.find(edge_type); + idx = iter->second; + VLOG(2) << "cur edge: " << edge_type << ",idx: " << idx; std::vector> ids; - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->cpu_graph_table_->get_all_id(type, idx, slice_num, &ids); debug_gpu_memory_info("upload_batch node start"); - GpuPsGraphTable *g = (GpuPsGraphTable *)graph_table; + GpuPsGraphTable *g = reinterpret_cast(graph_table); std::vector> tasks; for (int i = 0; i < ids.size(); i++) { @@ -253,18 +522,25 @@ void GraphGpuWrapper::upload_batch(int type, // feature table void GraphGpuWrapper::upload_batch(int type, int slice_num, int slot_num) { + if (type == 1 && + (FLAGS_gpugraph_storage_mode == paddle::framework::GpuGraphStorageMode:: + MEM_EMB_FEATURE_AND_GPU_GRAPH || + FLAGS_gpugraph_storage_mode == paddle::framework::GpuGraphStorageMode:: + SSD_EMB_AND_MEM_FEATURE_GPU_GRAPH)) { + return; + } std::vector> node_ids; - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->cpu_graph_table_->get_all_id(type, slice_num, &node_ids); debug_gpu_memory_info("upload_batch feature start"); - GpuPsGraphTable *g = (GpuPsGraphTable *)graph_table; + GpuPsGraphTable *g = reinterpret_cast(graph_table); std::vector> tasks; for (int i = 0; i < node_ids.size(); i++) { tasks.push_back(upload_task_pool->enqueue([&, i, this]() -> int { VLOG(0) << "begin make_gpu_ps_graph_fea, node_ids[" << i << "]_size[" << node_ids[i].size() << "]"; GpuPsCommGraphFea sub_graph = - g->cpu_graph_table_->make_gpu_ps_graph_fea(node_ids[i], slot_num); + g->cpu_graph_table_->make_gpu_ps_graph_fea(i, node_ids[i], slot_num); // sub_graph.display_on_cpu(); VLOG(0) << "begin build_graph_fea_on_single_gpu, node_ids[" << i << "]_size[" << node_ids[i].size() << "]"; @@ -279,30 +555,106 @@ void GraphGpuWrapper::upload_batch(int type, int slice_num, int slot_num) { debug_gpu_memory_info("upload_batch feature end"); } +// get sub_graph_fea +std::vector GraphGpuWrapper::get_sub_graph_fea( + std::vector> &node_ids, int slot_num) { + GpuPsGraphTable *g = reinterpret_cast(graph_table); + std::vector> tasks; + std::vector sub_graph_feas(node_ids.size()); + for (int i = 0; i < node_ids.size(); i++) { + tasks.push_back(upload_task_pool->enqueue([&, i, this]() -> int { + GpuPsGraphTable *g = reinterpret_cast(graph_table); + sub_graph_feas[i] = + g->cpu_graph_table_->make_gpu_ps_graph_fea(i, node_ids[i], slot_num); + return 0; + })); + } + for (size_t i = 0; i < tasks.size(); i++) tasks[i].get(); + return sub_graph_feas; +} + +// build_gpu_graph_fea +void GraphGpuWrapper::build_gpu_graph_fea(GpuPsCommGraphFea &sub_graph_fea, + int i) { + GpuPsGraphTable *g = reinterpret_cast(graph_table); + g->build_graph_fea_on_single_gpu(sub_graph_fea, i); + sub_graph_fea.release_on_cpu(); + VLOG(0) << "sub graph fea on gpu " << i << " is built"; + return; +} + NeighborSampleResult GraphGpuWrapper::graph_neighbor_sample_v3( - NeighborSampleQuery q, bool cpu_switch) { - return ((GpuPsGraphTable *)graph_table) - ->graph_neighbor_sample_v3(q, cpu_switch); + NeighborSampleQuery q, bool cpu_switch, bool compress = true) { + return reinterpret_cast(graph_table) + ->graph_neighbor_sample_v3(q, cpu_switch, compress); +} + 
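// A minimal sketch of the new `compress` flag (illustrative, not part of the
// patch): when true, graph_neighbor_sample_v2 runs the cub::DeviceScan pass
// and packs the ragged neighbor lists into result.actual_val. The helper name
// and the d_keys buffer are assumptions; the NeighborSampleQuery fields are
// used exactly as graph_neighbor_sample_v3 reads them above.
NeighborSampleResult sample_compressed_sketch(GraphGpuWrapper* wrapper,
                                              int gpu_id,
                                              uint64_t* d_keys,
                                              int sample_size,
                                              int len) {
  NeighborSampleQuery q;
  q.gpu_id = gpu_id;
  q.table_idx = 0;       // first edge table
  q.src_nodes = d_keys;  // device pointer to `len` source node ids
  q.sample_size = sample_size;
  q.len = len;
  NeighborSampleResult res = wrapper->graph_neighbor_sample_v3(
      q, /*cpu_switch=*/false, /*compress=*/true);
  // res.val keeps the fixed-stride [len][sample_size] layout; res.actual_val
  // holds the packed neighbors, whose length is recorded via
  // set_total_sample_size() inside the call.
  return res;
}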
+NeighborSampleResultV2 GraphGpuWrapper::graph_neighbor_sample_all_edge_type( + int gpu_id, + int edge_type_len, + uint64_t *key, + int sample_size, + int len, + std::vector> edge_type_graphs) { + return reinterpret_cast(graph_table) + ->graph_neighbor_sample_all_edge_type( + gpu_id, edge_type_len, key, sample_size, len, edge_type_graphs); +} + +std::vector> +GraphGpuWrapper::get_edge_type_graph(int gpu_id, int edge_type_len) { + return reinterpret_cast(graph_table) + ->get_edge_type_graph(gpu_id, edge_type_len); +} + +int GraphGpuWrapper::get_feature_info_of_nodes( + int gpu_id, + uint64_t *d_nodes, + int node_num, + uint32_t *size_list, + uint32_t *size_list_prefix_sum, + std::shared_ptr &feature_list, + std::shared_ptr &slot_list) { + platform::CUDADeviceGuard guard(gpu_id); + PADDLE_ENFORCE_NOT_NULL(graph_table, + paddle::platform::errors::InvalidArgument( + "graph_table should not be null")); + return reinterpret_cast(graph_table) + ->get_feature_info_of_nodes(gpu_id, + d_nodes, + node_num, + size_list, + size_list_prefix_sum, + feature_list, + slot_list); } int GraphGpuWrapper::get_feature_of_nodes(int gpu_id, uint64_t *d_walk, uint64_t *d_offset, uint32_t size, - int slot_num) { + int slot_num, + int *d_slot_feature_num_map, + int fea_num_per_node) { platform::CUDADeviceGuard guard(gpu_id); PADDLE_ENFORCE_NOT_NULL(graph_table, paddle::platform::errors::InvalidArgument( "graph_table should not be null")); - return ((GpuPsGraphTable *)graph_table) - ->get_feature_of_nodes(gpu_id, d_walk, d_offset, size, slot_num); + return reinterpret_cast(graph_table) + ->get_feature_of_nodes(gpu_id, + d_walk, + d_offset, + size, + slot_num, + d_slot_feature_num_map, + fea_num_per_node); } NeighborSampleResult GraphGpuWrapper::graph_neighbor_sample( int gpu_id, uint64_t *device_keys, int walk_degree, int len) { platform::CUDADeviceGuard guard(gpu_id); auto neighbor_sample_res = - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->graph_neighbor_sample(gpu_id, device_keys, walk_degree, len); return neighbor_sample_res; @@ -325,9 +677,9 @@ std::vector GraphGpuWrapper::graph_neighbor_sample( cudaMemcpyHostToDevice); VLOG(0) << "key_size: " << key.size(); auto neighbor_sample_res = - ((GpuPsGraphTable *)graph_table) + reinterpret_cast(graph_table) ->graph_neighbor_sample_v2( - gpu_id, idx, cuda_key, sample_size, key.size(), false); + gpu_id, idx, cuda_key, sample_size, key.size(), false, true); int *actual_sample_size = new int[key.size()]; cudaMemcpy(actual_sample_size, neighbor_sample_res.actual_sample_size, @@ -351,9 +703,6 @@ std::vector GraphGpuWrapper::graph_neighbor_sample( res.push_back(cpu_key[i * sample_size + j]); } } - /* for(int i = 0;i < res.size();i ++) { */ - /* VLOG(0) << i << " " << res[i]; */ - /* } */ delete[] actual_sample_size; cudaFree(cuda_key); return res; @@ -368,18 +717,88 @@ NodeQueryResult GraphGpuWrapper::query_node_list(int gpu_id, paddle::platform::errors::PreconditionNotMet( "when use query_node_list should set " "gpugraph_load_node_list_into_hbm true")); - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->query_node_list(gpu_id, idx, start, query_size); } void GraphGpuWrapper::load_node_weight(int type_id, int idx, std::string path) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->load_node_weight(type_id, idx, path); } +std::vector GraphGpuWrapper::slot_feature_num_map() const { + return reinterpret_cast(graph_table) + ->cpu_graph_table_->slot_feature_num_map(); +} + 
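The std::vector overload of graph_neighbor_sample above copies the keys to the device, samples with a fixed width, then copies back a padded result buffer plus a per-key actual_sample_size array and flattens the two on the host. A host-only sketch of that flattening step; the CUDA allocations and copies are elided and the names here are illustrative.

#include <cstddef>
#include <cstdint>
#include <vector>

// Flatten a padded sample buffer: row i holds up to sample_size neighbours,
// but only actual_sample_size[i] of them are valid.
std::vector<std::uint64_t> flatten_samples(
    const std::vector<std::uint64_t>& padded,    // key_count * sample_size
    const std::vector<int>& actual_sample_size,  // valid count per key
    int sample_size) {
  std::vector<std::uint64_t> res;
  for (std::size_t i = 0; i < actual_sample_size.size(); ++i) {
    for (int j = 0; j < actual_sample_size[i]; ++j) {
      res.push_back(padded[i * sample_size + j]);
    }
  }
  return res;
}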
void GraphGpuWrapper::export_partition_files(int idx, std::string file_path) { - return ((GpuPsGraphTable *)graph_table) + return reinterpret_cast(graph_table) ->cpu_graph_table_->export_partition_files(idx, file_path); } + +void GraphGpuWrapper::release_graph() { + return reinterpret_cast(graph_table) + ->cpu_graph_table_->release_graph(); +} + +void GraphGpuWrapper::release_graph_edge() { + return reinterpret_cast(graph_table) + ->cpu_graph_table_->release_graph_edge(); +} + +void GraphGpuWrapper::release_graph_node() { + return reinterpret_cast(graph_table) + ->cpu_graph_table_->release_graph_node(); +} + +std::vector &GraphGpuWrapper::get_graph_total_keys() { + return reinterpret_cast(graph_table) + ->cpu_graph_table_->graph_total_keys_; +} + +std::vector> &GraphGpuWrapper::get_graph_type_keys() { + return reinterpret_cast(graph_table) + ->cpu_graph_table_->graph_type_keys_; +} + +std::unordered_map &GraphGpuWrapper::get_graph_type_to_index() { + return reinterpret_cast(graph_table) + ->cpu_graph_table_->type_to_index_; +} + +std::string &GraphGpuWrapper::get_node_type_size(std::string first_node_type) { + auto node_types = + paddle::string::split_string(first_node_type, ";"); + for (auto &type : node_types) { + uniq_first_node_.insert(type); + } + + auto &graph_all_type_total_keys = get_graph_type_keys(); + auto &type_to_index = get_graph_type_to_index(); + std::vector node_type_size; + for (auto node : uniq_first_node_) { + auto it = feature_to_id.find(node); + auto first_node_idx = it->second; + size_t f_idx = type_to_index[first_node_idx]; + int type_total_key_size = graph_all_type_total_keys[f_idx].size(); + std::string node_type_str = + node + ":" + std::to_string(type_total_key_size); + node_type_size.push_back(node_type_str); + } + std::string delim = ";"; + node_type_size_str_ = paddle::string::join_strings(node_type_size, delim); + + return node_type_size_str_; +} + +std::string &GraphGpuWrapper::get_edge_type_size() { + auto edge_type_size = reinterpret_cast(graph_table) + ->cpu_graph_table_->edge_type_size; + std::string delim = ";"; + edge_type_size_str_ = paddle::string::join_strings(edge_type_size, delim); + std::cout << "edge_type_size_str: " << edge_type_size_str_ << std::endl; + return edge_type_size_str_; +} + #endif } // namespace framework }; // namespace paddle diff --git a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h index 9f3448714653e..52d7132a0e460 100644 --- a/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h +++ b/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h @@ -13,6 +13,7 @@ // limitations under the License. 
#pragma once +#include #include #include #include @@ -22,6 +23,14 @@ namespace paddle { namespace framework { #ifdef PADDLE_WITH_HETERPS + +enum GpuGraphStorageMode { + WHOLE_HBM = 1, + MEM_EMB_AND_GPU_GRAPH, + MEM_EMB_FEATURE_AND_GPU_GRAPH, + SSD_EMB_AND_MEM_FEATURE_GPU_GRAPH +}; + class GraphGpuWrapper { public: static std::shared_ptr GetInstance() { @@ -31,27 +40,39 @@ class GraphGpuWrapper { return s_instance_; } static std::shared_ptr s_instance_; + void init_conf(const std::string& first_node_type, + const std::string& meta_path); void initialize(); void finalize(); void set_device(std::vector ids); void init_service(); - void set_up_types(std::vector& edge_type, - std::vector& node_type); + void set_up_types(const std::vector& edge_type, + const std::vector& node_type); void upload_batch(int type, int idx, int slice_num, const std::string& edge_type); void upload_batch(int type, int slice_num, int slot_num); + std::vector get_sub_graph_fea( + std::vector>& node_ids, int slot_num); // NOLINT + void build_gpu_graph_fea(GpuPsCommGraphFea& sub_graph_fea, int i); // NOLINT void add_table_feat_conf(std::string table_name, std::string feat_name, std::string feat_dtype, int feat_shape); void load_edge_file(std::string name, std::string filepath, bool reverse); + void load_edge_file(std::string etype2files, + std::string graph_data_local_path, + int part_num, + bool reverse); + void load_node_file(std::string name, std::string filepath); - void load_node_and_edge(std::string etype, - std::string ntype, - std::string epath, - std::string npath, + void load_node_file(std::string ntype2files, + std::string graph_data_local_path, + int part_num); + void load_node_and_edge(std::string etype2files, + std::string ntype2files, + std::string graph_data_local_path, int part_num, bool reverse); int32_t load_next_partition(int idx); @@ -86,21 +107,58 @@ class GraphGpuWrapper { int start, int query_size); NeighborSampleResult graph_neighbor_sample_v3(NeighborSampleQuery q, - bool cpu_switch); + bool cpu_switch, + bool compress); NeighborSampleResult graph_neighbor_sample(int gpu_id, uint64_t* device_keys, int walk_degree, int len); - std::vector graph_neighbor_sample(int gpu_id, - int idx, - std::vector& key, - int sample_size); + NeighborSampleResultV2 graph_neighbor_sample_all_edge_type( + int gpu_id, + int edge_type_len, + uint64_t* key, + int sample_size, + int len, + std::vector> edge_type_graphs); + gpuStream_t get_local_stream(int gpuid); + std::vector graph_neighbor_sample( + int gpu_id, + int idx, + std::vector& key, // NOLINT + int sample_size); + std::vector> get_edge_type_graph( + int gpu_id, int edge_type_len); + std::vector slot_feature_num_map() const; void set_feature_separator(std::string ch); + void set_slot_feature_separator(std::string ch); int get_feature_of_nodes(int gpu_id, uint64_t* d_walk, uint64_t* d_offset, uint32_t size, - int slot_num); + int slot_num, + int* d_slot_feature_num_map, + int fea_num_per_node); + int get_feature_info_of_nodes( + int gpu_id, + uint64_t* d_nodes, + int node_num, + uint32_t* size_list, + uint32_t* size_list_prefix_sum, + std::shared_ptr& feature_list, // NOLINT + std::shared_ptr& slot_list); // NOLINT + void init_metapath(std::string cur_metapath, + int cur_metapath_index, + int cur_metapath_len); + void clear_metapath_state(); + void release_graph(); + void release_graph_edge(); + void release_graph_node(); + void init_type_keys(); + std::vector& get_graph_total_keys(); + std::vector>& get_graph_type_keys(); + std::unordered_map& 
get_graph_type_to_index(); + std::string& get_node_type_size(std::string first_node_type); + std::string& get_edge_type_size(); std::unordered_map edge_to_id, feature_to_id; std::vector id_to_feature, id_to_edge; @@ -115,6 +173,31 @@ class GraphGpuWrapper { int upload_num = 8; std::shared_ptr<::ThreadPool> upload_task_pool; std::string feature_separator_ = std::string(" "); + bool conf_initialized_ = false; + std::vector first_node_type_; + std::vector> meta_path_; + + std::vector> finish_node_type_; + std::vector> node_type_start_; + std::vector cur_metapath_start_; + std::vector> global_infer_node_type_start_; + std::vector infer_cursor_; + std::vector cursor_; + std::vector> d_graph_train_total_keys_; + std::vector h_graph_train_keys_len_; + std::vector>> + d_graph_all_type_total_keys_; + std::vector> h_graph_all_type_keys_len_; + std::string slot_feature_separator_ = std::string(" "); + + std::string cur_metapath_; + std::vector cur_parse_metapath_; + std::vector cur_parse_reverse_metapath_; + int cur_metapath_index_; + int cur_metapath_len_; + std::set uniq_first_node_; + std::string node_type_size_str_; + std::string edge_type_size_str_; }; #endif } // namespace framework diff --git a/paddle/fluid/framework/fleet/heter_ps/hashtable.h b/paddle/fluid/framework/fleet/heter_ps/hashtable.h index 18fb2eca5b752..f4afb974d035b 100644 --- a/paddle/fluid/framework/fleet/heter_ps/hashtable.h +++ b/paddle/fluid/framework/fleet/heter_ps/hashtable.h @@ -42,8 +42,8 @@ limitations under the License. */ #include "xpu/kernel/math.h" #include "xpu/kernel/simd.h" #endif - #include "paddle/fluid/framework/fleet/heter_ps/optimizer_conf.h" +#include "paddle/phi/core/enforce.h" namespace paddle { namespace framework { @@ -65,7 +65,7 @@ class TableContainer template class XPUCacheArray { public: - explicit XPUCacheArray(long long capacity) : capacity_(capacity), size_(0) { + explicit XPUCacheArray(int64_t capacity) : capacity_(capacity), size_(0) { xpu_malloc(reinterpret_cast(&keys), capacity_ * sizeof(KeyType)); xpu_malloc(reinterpret_cast(&vals), capacity_ * sizeof(ValType)); } @@ -103,8 +103,8 @@ class XPUCacheArray { size_t size() { return size_; } private: - long long capacity_; - long long size_; + int64_t capacity_; + int64_t size_; KeyType* keys; ValType* vals; }; @@ -124,6 +124,12 @@ class HashTable { size_t len, StreamType stream); + template + void insert(const KeyType* d_keys, + size_t len, + uint64_t* global_num, + StreamType stream); + template void insert(const KeyType* d_keys, size_t len, @@ -143,7 +149,7 @@ class HashTable { char* d_vals, size_t len, StreamType stream, - GPUAccessor& fv_accessor); + const GPUAccessor& fv_accessor); void show(); @@ -153,6 +159,9 @@ class HashTable { template void dump_to_cpu(int devid, StreamType stream); + template + void get_keys(KeyType* d_out, uint64_t* global_cursor, StreamType stream); + #if defined(PADDLE_WITH_CUDA) template @@ -185,7 +194,7 @@ class HashTable { #endif int size() { return container_->size(); } - + thrust::pair* data() { return container_->data(); } void set_feature_value_size(size_t pull_feature_value_size, size_t push_grad_value_size) { pull_feature_value_size_ = pull_feature_value_size; @@ -194,7 +203,15 @@ class HashTable { << " push value size: " << push_grad_value_size_; } + int prefetch(const int dev_id, cudaStream_t stream = 0) { + return container_->prefetch(dev_id, stream); + } + + void clear(cudaStream_t stream = 0) { container_->clear_async(stream); } + void show_collision(int id) { return container_->print_collision(id); } 
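The get_keys and counting insert overloads declared above compact the live keys of the table into a flat output array while tracking the total through a shared cursor; the matching CUDA kernels appear later in hashtable_kernel.cu. A CPU-only sketch of the same idea, assuming a plain slot array with a sentinel unused_key for empty slots; this is not the real concurrent hash map layout, only the extraction logic.

#include <atomic>
#include <cstdint>
#include <limits>
#include <vector>

// Compact every live key out of a fixed-capacity slot array.
// d_out must be pre-sized to at least slots.size();
// global_cursor counts how many keys were extracted.
void get_keys_cpu(const std::vector<std::uint64_t>& slots,
                  std::vector<std::uint64_t>* d_out,
                  std::atomic<std::uint64_t>* global_cursor) {
  const std::uint64_t unused_key = std::numeric_limits<std::uint64_t>::max();
  for (std::uint64_t key : slots) {
    if (key == unused_key) continue;                    // skip empty slots
    std::uint64_t dst = global_cursor->fetch_add(1);    // claim an output slot
    (*d_out)[dst] = key;
  }
}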
+ // infer mode + void set_mode(bool infer_mode) { infer_mode_ = infer_mode; } std::unique_ptr rwlock_{nullptr}; @@ -213,6 +230,7 @@ class HashTable { size_t max_mf_dim_ = 8; size_t pull_feature_value_size_; size_t push_grad_value_size_; + bool infer_mode_ = false; }; } // end namespace framework } // end namespace paddle diff --git a/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.cu b/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.cu index 1fda5a586a2e8..a81a70d7e0ef6 100644 --- a/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.cu +++ b/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.cu @@ -31,6 +31,35 @@ struct ReplaceOp { } }; +template +__global__ void insert_kernel(Table* table, + const typename Table::key_type* const keys, + size_t len, + uint64_t* global_num) { + ReplaceOp op; + thrust::pair kv; + + __shared__ uint64_t local_num; + + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (threadIdx.x == 0) { + local_num = 0; + } + __syncthreads(); + + if (i < len) { + kv.first = keys[i]; + kv.second = 1; // fake value + auto it = table->insert(kv, op, &local_num); + assert(it != table->end() && "error: insert fails: table is full"); + } + __syncthreads(); + + if (threadIdx.x == 0) { + atomicAdd(global_num, local_num); + } +} + template __global__ void insert_kernel(Table* table, const typename Table::key_type* const keys, @@ -38,7 +67,6 @@ __global__ void insert_kernel(Table* table, size_t len) { ReplaceOp op; thrust::pair kv; - const size_t i = blockIdx.x * blockDim.x + threadIdx.x; if (i < len) { kv.first = keys[i]; @@ -65,7 +93,9 @@ __global__ void insert_kernel(Table* table, uint64_t offset = uint64_t(start_index + i) * feature_value_size; kv.second = (Table::mapped_type)(pool + offset); auto it = table->insert(kv, op); - assert(it != table->end() && "error: insert fails: table is full"); + if (it == table->end()) { + printf("error: insert fails: table is full"); + } } } @@ -83,6 +113,29 @@ __global__ void search_kernel(Table* table, } } +template +__global__ void dy_mf_search_kernel_fill( + Table* table, + const typename Table::key_type* const keys, + char* vals, + size_t len, + size_t pull_feature_value_size, + GPUAccessor gpu_accessor) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + auto it = table->find(keys[i]); + if (it != table->end()) { + uint64_t offset = i * pull_feature_value_size; + float* cur = reinterpret_cast(vals + offset); + float* input = it->second; + gpu_accessor.PullValueFill(cur, input); + } else { + float* cur = reinterpret_cast(&vals[i * pull_feature_value_size]); + gpu_accessor.PullZeroValue(cur); + } + } +} + template __global__ void dy_mf_search_kernel(Table* table, const typename Table::key_type* const keys, @@ -91,14 +144,15 @@ __global__ void dy_mf_search_kernel(Table* table, size_t pull_feature_value_size, GPUAccessor gpu_accessor) { const size_t i = blockIdx.x * blockDim.x + threadIdx.x; - // return; if (i < len) { auto it = table->find(keys[i]); if (it != table->end()) { uint64_t offset = i * pull_feature_value_size; - float* cur = (float*)(vals + offset); + float* cur = reinterpret_cast(vals + offset); float* input = it->second; gpu_accessor.PullValueFill(cur, input); + } else { + printf("warning: pull miss key: %lu", keys[i]); } } } @@ -131,7 +185,8 @@ __global__ void dy_mf_update_kernel(Table* table, if (i < len) { auto it = table->find(keys[i]); if (it != table->end()) { - float* cur = (float*)(grads + i * grad_value_size); + const float* cur = + reinterpret_cast(grads + i * 
grad_value_size); sgd.dy_mf_update_value(optimizer_config, (it.getter())->second, cur); } else { printf("warning: push miss key: %lu", keys[i]); @@ -139,12 +194,46 @@ __global__ void dy_mf_update_kernel(Table* table, } } +template +__global__ void get_keys_kernel(Table* table, + typename Table::key_type* d_out, + uint64_t* global_cursor, + uint64_t unused_key) { + extern __shared__ typename Table::key_type local_key[]; + __shared__ uint64_t local_num; + __shared__ uint64_t global_num; + + size_t idx = blockIdx.x * blockDim.x + threadIdx.x; + if (threadIdx.x == 0) { + local_num = 0; + } + __syncthreads(); + uint64_t len = table->size(); + if (idx < len) { + typename Table::value_type val = *(table->data() + idx); + if (val.first != unused_key) { + uint64_t dst = atomicAdd(&local_num, 1); + local_key[dst] = val.first; + } + } + + __syncthreads(); + + if (threadIdx.x == 0) { + global_num = atomicAdd(global_cursor, local_num); + } + __syncthreads(); + + if (threadIdx.x < local_num) { + d_out[global_num + threadIdx.x] = local_key[threadIdx.x]; + } +} + template HashTable::HashTable(size_t capacity) { container_ = new TableContainer(capacity); - CUDA_RT_CALL( - cudaMalloc((void**)&device_optimizer_config_, sizeof(OptimizerConfig))); - CUDA_RT_CALL(cudaMemcpy((void*)device_optimizer_config_, + CUDA_RT_CALL(cudaMalloc(&device_optimizer_config_, sizeof(OptimizerConfig))); + CUDA_RT_CALL(cudaMemcpy(device_optimizer_config_, &host_optimizer_config_, sizeof(OptimizerConfig), cudaMemcpyHostToDevice)); @@ -161,7 +250,7 @@ template void HashTable::set_sparse_sgd( const OptimizerConfig& optimizer_config) { host_optimizer_config_.set_sparse_sgd(optimizer_config); - cudaMemcpy((void*)device_optimizer_config_, + cudaMemcpy(device_optimizer_config_, &host_optimizer_config_, sizeof(OptimizerConfig), cudaMemcpyHostToDevice); @@ -171,7 +260,7 @@ template void HashTable::set_embedx_sgd( const OptimizerConfig& optimizer_config) { host_optimizer_config_.set_embedx_sgd(optimizer_config); - cudaMemcpy((void*)device_optimizer_config_, + cudaMemcpy(device_optimizer_config_, &host_optimizer_config_, sizeof(OptimizerConfig), cudaMemcpyHostToDevice); @@ -202,13 +291,33 @@ void HashTable::get(const KeyType* d_keys, char* d_vals, size_t len, StreamType stream, - GPUAccessor& fv_accessor) { + const GPUAccessor& fv_accessor) { if (len == 0) { return; } const int grid_size = (len - 1) / BLOCK_SIZE_ + 1; - dy_mf_search_kernel<<>>( - container_, d_keys, d_vals, len, pull_feature_value_size_, fv_accessor); + // infer need zero fill + if (infer_mode_) { + dy_mf_search_kernel_fill<<>>( + container_, d_keys, d_vals, len, pull_feature_value_size_, fv_accessor); + } else { + dy_mf_search_kernel<<>>( + container_, d_keys, d_vals, len, pull_feature_value_size_, fv_accessor); + } +} + +template +template +void HashTable::insert(const KeyType* d_keys, + size_t len, + uint64_t* global_num, + StreamType stream) { + if (len == 0) { + return; + } + const int grid_size = (len - 1) / BLOCK_SIZE_ + 1; + insert_kernel<<>>( + container_, d_keys, len, global_num); } template @@ -225,6 +334,19 @@ void HashTable::insert(const KeyType* d_keys, container_, d_keys, d_vals, len); } +template +template +void HashTable::get_keys(KeyType* d_out, + uint64_t* global_cursor, + StreamType stream) { + size_t len = container_->size(); + const int grid_size = (len - 1) / BLOCK_SIZE_ + 1; + KeyType unuse_key = std::numeric_limits::max(); + size_t shared_mem_size = sizeof(KeyType) * BLOCK_SIZE_; + get_keys_kernel<<>>( + container_, d_out, global_cursor, unuse_key); 
+} + template template void HashTable::insert(const KeyType* d_keys, @@ -336,142 +458,142 @@ void HashTable::update(const KeyType* d_keys, push_grad_value_size_); } -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; -template class HashTable; - -template void HashTable::get( - const unsigned long* d_keys, - float* d_vals, - size_t len, - cudaStream_t stream); +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; +template class HashTable; + +template void HashTable::get( + const uint64_t* d_keys, float* d_vals, size_t len, cudaStream_t stream); template void -HashTable::get( - const unsigned long* d_keys, +HashTable::get( + const uint64_t* d_keys, char* d_vals, size_t len, cudaStream_t stream, - CommonFeatureValueAccessor& fv_accessor); - -template void HashTable::get(const long* d_keys, - int* d_vals, - size_t len, - cudaStream_t stream); - -template void HashTable::get( - const unsigned long* d_keys, int* d_vals, size_t len, cudaStream_t stream); -template void HashTable::get( - const unsigned long* d_keys, - unsigned long* d_vals, + const CommonFeatureValueAccessor& fv_accessor); + +template void HashTable::get(const int64_t* d_keys, + int* d_vals, + size_t len, + cudaStream_t stream); + +template void HashTable::get( + const uint64_t* d_keys, int* d_vals, size_t len, cudaStream_t stream); +template void HashTable::get( + const uint64_t* d_keys, uint64_t* d_vals, size_t len, cudaStream_t stream); +template void HashTable::get( + const uint64_t* d_keys, int64_t* d_vals, size_t len, cudaStream_t stream); +template void HashTable::get( + const int64_t* d_keys, uint64_t* d_vals, size_t len, cudaStream_t stream); +template void HashTable::get( + const int64_t* d_keys, int64_t* d_vals, size_t len, cudaStream_t stream); +template void HashTable::get( + const int64_t* d_keys, + unsigned int* d_vals, size_t len, cudaStream_t stream); -template void HashTable::get( - const unsigned long* d_keys, long* d_vals, size_t len, cudaStream_t stream); -template void HashTable::get( - const long* d_keys, unsigned long* d_vals, size_t len, cudaStream_t stream); -template void HashTable::get(const long* d_keys, - long* d_vals, - size_t len, - cudaStream_t stream); -template void HashTable::get( - const long* d_keys, unsigned int* d_vals, size_t len, cudaStream_t stream); // template void -// HashTable::get( -// const unsigned long* d_keys, char* d_vals, size_t len, cudaStream_t +// HashTable::get( +// const uint64_t* d_keys, char* d_vals, size_t len, cudaStream_t // stream); -template void HashTable::insert( - const unsigned long* d_keys, +template void HashTable::insert( + const uint64_t* d_keys, const float* d_vals, size_t len, cudaStream_t stream); -template void HashTable::insert( - const unsigned long* d_keys, +template void HashTable::insert( + const uint64_t* d_keys, size_t len, char* pool, size_t feature_value_size, size_t start_index, cudaStream_t stream); -template void HashTable::insert(const long* d_keys, - const int* d_vals, - size_t len, - cudaStream_t stream); -template void HashTable::insert(const long* d_keys, - const long* d_vals, - size_t len, - cudaStream_t 
stream); - -template void HashTable::insert( - const unsigned long* d_keys, - const int* d_vals, +template void HashTable::insert( + const int64_t* d_keys, const int* d_vals, size_t len, cudaStream_t stream); +template void HashTable::insert( + const int64_t* d_keys, + const int64_t* d_vals, size_t len, cudaStream_t stream); -template void HashTable::insert( - const unsigned long* d_keys, - const long* d_vals, +template void HashTable::insert( + const uint64_t* d_keys, const int* d_vals, size_t len, cudaStream_t stream); + +template void HashTable::insert( + const uint64_t* d_keys, + const int64_t* d_vals, size_t len, cudaStream_t stream); -template void HashTable::insert( - const long* d_keys, - const unsigned long* d_vals, +template void HashTable::insert( + const int64_t* d_keys, + const uint64_t* d_vals, size_t len, cudaStream_t stream); -template void HashTable::insert( - const long* d_keys, +template void HashTable::insert( + const int64_t* d_keys, const unsigned int* d_vals, size_t len, cudaStream_t stream); -template void HashTable::insert( - const unsigned long* d_keys, - const unsigned long* d_vals, +template void HashTable::get_keys( + uint64_t* d_out, uint64_t* global_cursor, cudaStream_t stream); + +template void HashTable::insert( + const uint64_t* d_keys, + uint64_t len, + uint64_t* global_num, + cudaStream_t stream); + +template void HashTable::insert( + const uint64_t* d_keys, + const uint64_t* d_vals, size_t len, cudaStream_t stream); -template void HashTable::dump_to_cpu( +template void HashTable::dump_to_cpu( int devid, cudaStream_t stream); -template void HashTable::update< +template void HashTable::update< SparseAdagradOptimizer, - cudaStream_t>(const unsigned long* d_keys, + cudaStream_t>(const uint64_t* d_keys, const char* d_grads, size_t len, SparseAdagradOptimizer sgd, cudaStream_t stream); -template void HashTable::update< +template void HashTable::update< SparseAdamOptimizer, - cudaStream_t>(const unsigned long* d_keys, + cudaStream_t>(const uint64_t* d_keys, const char* d_grads, size_t len, SparseAdamOptimizer sgd, cudaStream_t stream); -template void HashTable::update< +template void HashTable::update< SparseAdamSharedOptimizer, - cudaStream_t>(const unsigned long* d_keys, + cudaStream_t>(const uint64_t* d_keys, const char* d_grads, size_t len, SparseAdamSharedOptimizer sgd, cudaStream_t stream); -// template void HashTable::update< // Optimizer, -// cudaStream_t>(const unsigned long* d_keys, const char* d_grads, size_t +// cudaStream_t>(const uint64_t* d_keys, const char* d_grads, size_t // len, // Optimizer diff --git a/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.kps b/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.kps index 79c5f3d757781..7d581935008a5 100644 --- a/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.kps +++ b/paddle/fluid/framework/fleet/heter_ps/hashtable_kernel.kps @@ -21,7 +21,8 @@ namespace framework { #if defined(PADDLE_WITH_XPU_KP) -__device__ void update_lr(OptimizerConfig& optimizer_config, float& w, +__device__ void update_lr(OptimizerConfig& optimizer_config, + float& w, float& g2sum, float g, // NOLINT float scale) { @@ -45,8 +46,12 @@ __device__ void update_lr(OptimizerConfig& optimizer_config, float& w, g2sum += add_g2sum; } -__device__ void update_mf(OptimizerConfig& optimizer_config, int n, float* w, - float& g2sum, const float* g, float scale) { +__device__ void update_mf(OptimizerConfig& optimizer_config, + int n, + float* w, + float& g2sum, + const float* g, + float scale) { float 
local_mf_learning_rate = optimizer_config.mf_learning_rate; float local_mf_initial_g2sum = optimizer_config.mf_initial_g2sum; float local_mf_min_bound = optimizer_config.mf_min_bound; @@ -71,7 +76,8 @@ __device__ void update_mf(OptimizerConfig& optimizer_config, int n, float* w, __device__ float xpu_rand_uniform() { return 0.1; } template -__device__ void update_value(OptimizerConfig& optimizer_config, ValType& val, +__device__ void update_value(OptimizerConfig& optimizer_config, + ValType& val, const GradType& grad) { // NOLINT val.slot = grad.slot; val.show += grad.show; @@ -99,14 +105,16 @@ __device__ void update_value(OptimizerConfig& optimizer_config, ValType& val, } } } else { - update_mf(optimizer_config, MF_DIM, &val.mf[1], val.mf[0], grad.mf_g, - grad.show); + update_mf( + optimizer_config, MF_DIM, &val.mf[1], val.mf[0], grad.mf_g, grad.show); } } template -__global__ void insert_kernel(Table& table, const KeyType* const keys, - const ValType* const vals, long long len) { +__global__ void insert_kernel(Table& table, + const KeyType* const keys, + const ValType* const vals, + long long len) { int cid = core_id(); int ncores = core_num(); if (cid >= ncores) { @@ -133,8 +141,10 @@ __global__ void insert_kernel(Table& table, const KeyType* const keys, } template -__global__ void search_kernel(Table& table, const KeyType* const keys, - ValType* const vals, long long len) { +__global__ void search_kernel(Table& table, + const KeyType* const keys, + ValType* const vals, + long long len) { int cid = core_id(); int ncores = core_num(); if (cid >= ncores) { @@ -163,9 +173,11 @@ __global__ void search_kernel(Table& table, const KeyType* const keys, } template -__global__ void update_kernel(Table& table, OptimizerConfig& optimizer_config, +__global__ void update_kernel(Table& table, + OptimizerConfig& optimizer_config, const KeyType* const keys, - const GradType* const grads, long long len) { + const GradType* const grads, + long long len) { int cid = core_id(); int ncores = core_num(); if (cid >= ncores) { @@ -200,12 +212,16 @@ HashTable::HashTable(size_t capacity) { auto tmp_container = XPUCacheArray(capacity); xpu_malloc(reinterpret_cast(&container_), sizeof(XPUCacheArray)); - xpu_memcpy((void*)container_, &tmp_container, - sizeof(XPUCacheArray), XPU_HOST_TO_DEVICE); + xpu_memcpy((void*)container_, + &tmp_container, + sizeof(XPUCacheArray), + XPU_HOST_TO_DEVICE); xpu_malloc(reinterpret_cast(&device_optimizer_config_), sizeof(OptimizerConfig)); - xpu_memcpy((void*)device_optimizer_config_, &host_optimizer_config_, - sizeof(OptimizerConfig), XPU_HOST_TO_DEVICE); + xpu_memcpy((void*)device_optimizer_config_, + &host_optimizer_config_, + sizeof(OptimizerConfig), + XPU_HOST_TO_DEVICE); rwlock_.reset(new phi::RWLock); } @@ -225,35 +241,42 @@ template void HashTable::set_sparse_sgd( const OptimizerConfig& optimizer_config) { host_optimizer_config_.set_sparse_sgd(optimizer_config); - xpu_memcpy((void*)device_optimizer_config_, &host_optimizer_config_, - sizeof(OptimizerConfig), XPU_HOST_TO_DEVICE); + xpu_memcpy((void*)device_optimizer_config_, + &host_optimizer_config_, + sizeof(OptimizerConfig), + XPU_HOST_TO_DEVICE); } template void HashTable::set_embedx_sgd( const OptimizerConfig& optimizer_config) { host_optimizer_config_.set_embedx_sgd(optimizer_config); - xpu_memcpy((void*)device_optimizer_config_, &host_optimizer_config_, - sizeof(OptimizerConfig), XPU_HOST_TO_DEVICE); + xpu_memcpy((void*)device_optimizer_config_, + &host_optimizer_config_, + sizeof(OptimizerConfig), + XPU_HOST_TO_DEVICE); 
} template template -void HashTable::get(const KeyType* d_keys, ValType* d_vals, - size_t len, StreamType stream) { +void HashTable::get(const KeyType* d_keys, + ValType* d_vals, + size_t len, + StreamType stream) { if (len == 0) { return; } long long c_len = (long long)len; - search_kernel><<<4, 64, stream>>>( - *container_, d_keys, d_vals, c_len); + search_kernel> + <<<4, 64, stream>>>(*container_, d_keys, d_vals, c_len); } template template -void HashTable::get(const KeyType* d_keys, char* d_vals, - size_t len, StreamType stream) { +void HashTable::get(const KeyType* d_keys, + char* d_vals, + size_t len, + StreamType stream) { if (len == 0) { return; } @@ -263,15 +286,15 @@ void HashTable::get(const KeyType* d_keys, char* d_vals, template template void HashTable::insert(const KeyType* d_keys, - const ValType* d_vals, size_t len, + const ValType* d_vals, + size_t len, StreamType stream) { if (len == 0) { return; } long long c_len = (long long)len; - insert_kernel><<<4, 64, stream>>>( - *container_, d_keys, d_vals, c_len); + insert_kernel> + <<<4, 64, stream>>>(*container_, d_keys, d_vals, c_len); } template @@ -283,21 +306,23 @@ void HashTable::dump_to_cpu(int devid, StreamType stream) { template template void HashTable::update(const KeyType* d_keys, - const GradType* d_grads, size_t len, + const GradType* d_grads, + size_t len, StreamType stream) { if (len == 0) { return; } long long c_len = (long long)len; - update_kernel, - GradType><<<4, 64, stream>>>( - *container_, *device_optimizer_config_, d_keys, d_grads, c_len); + update_kernel, GradType> + <<<4, 64, stream>>>( + *container_, *device_optimizer_config_, d_keys, d_grads, c_len); } template template void HashTable::update(const KeyType* d_keys, - const char* d_grads, size_t len, + const char* d_grads, + size_t len, StreamType stream) { if (len == 0) { return; @@ -309,7 +334,8 @@ template class HashTable; template void HashTable::get< XPUStream>(const unsigned long* d_keys, - paddle::framework::FeatureValue* d_vals, size_t len, + paddle::framework::FeatureValue* d_vals, + size_t len, XPUStream stream); // template void @@ -318,7 +344,8 @@ template void HashTable::get< template void HashTable::insert< XPUStream>(const unsigned long* d_keys, - const paddle::framework::FeatureValue* d_vals, size_t len, + const paddle::framework::FeatureValue* d_vals, + size_t len, XPUStream stream); // template void HashTable:: dump_to_cpu(int devid, XPUStream stream); template void HashTable::update< - paddle::framework::FeaturePushValue, XPUStream>( - const unsigned long* d_keys, - const paddle::framework::FeaturePushValue* d_grads, size_t len, - XPUStream stream); + paddle::framework::FeaturePushValue, + XPUStream>(const unsigned long* d_keys, + const paddle::framework::FeaturePushValue* d_grads, + size_t len, + XPUStream stream); // template void HashTable::update< diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_comm.h b/paddle/fluid/framework/fleet/heter_ps/heter_comm.h index 01bd6962ddbe1..c7a09fb4288fd 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_comm.h +++ b/paddle/fluid/framework/fleet/heter_ps/heter_comm.h @@ -30,6 +30,7 @@ limitations under the License. 
*/ #include "paddle/fluid/platform/device/xpu/enforce_xpu.h" #endif +#include "paddle/fluid/framework/barrier.h" #include "paddle/fluid/framework/fleet/heter_ps/hashtable.h" #include "paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.h" #include "paddle/fluid/framework/fleet/heter_ps/heter_resource.h" @@ -50,28 +51,40 @@ template class HeterComm { + using HeterCommType = HeterComm; + static const int COPY_KEY = 0x01; + static const int COPY_VAL = 0x02; + static const int COPY_ALL = COPY_KEY | COPY_VAL; + public: HeterComm(size_t capacity, std::shared_ptr resource); HeterComm(size_t capacity, std::shared_ptr resource, - GPUAccessor& gpu_accessor); + const GPUAccessor& gpu_accessor); virtual ~HeterComm(); HeterComm(const HeterComm&) = delete; HeterComm& operator=(const HeterComm&) = delete; - - void merge_keys(int gpu_num, - const KeyType* d_keys, - size_t len, - KeyType* d_sorted_keys, - KeyType* d_merged_keys, - uint32_t* d_restore_idx, - size_t& uniq_len); + // reset table + void reset_table(const int dev_id, + size_t capacity, + const OptimizerConfig& sgd_config, + const OptimizerConfig& embedx_config, + bool infer_mode); + void set_mode(bool infer_mode); + template + size_t merge_keys(const int gpu_num, + const KeyType* d_keys, + const size_t& len, + KeyType* d_sorted_keys, + KeyType* d_merged_keys, + uint32_t* d_restore_idx, + StreamType stream); void dynamic_merge_grad(int gpu_num, KeyType* d_keys, float* d_grads, size_t len, - int& uniq_len, - size_t& segment_len, + int& uniq_len, // NOLINT + size_t& segment_len, // NOLINT bool enable_segment_merge_grad); void segment_merge_grad(int gpu_num, KeyType* d_keys, @@ -80,7 +93,7 @@ class HeterComm { size_t len, const uint32_t* d_fea_num_info, size_t uniq_len, - size_t& segment_len); + size_t& segment_len); // NOLINT void build_ps(int num, KeyType* h_keys, ValType* h_vals, @@ -99,8 +112,11 @@ class HeterComm { GradType* d_grads, size_t len, int& uniq_len); // NOLINT - void dynamic_merge_grad( - int gpu_num, KeyType* d_keys, float* d_grads, size_t len, int& uniq_len); + void dynamic_merge_grad(int gpu_num, + KeyType* d_keys, + float* d_grads, + size_t len, + int& uniq_len); // NOLINT void pull_sparse(int num, KeyType* d_keys, float* d_vals, size_t len); void build_ps(int num, KeyType* h_keys, @@ -165,10 +181,12 @@ class HeterComm { void set_nccl_comm_and_size(const std::vector& inner_comms, const std::vector& inter_comms, - int comm_size) { + int comm_size, + int rank_id) { nccl_inner_comms_ = inner_comms; nccl_inter_comms_ = inter_comms; node_size_ = comm_size; + rank_id_ = rank_id; } void set_multi_mf_dim(int multi_mf_dim, int max_mf_dim) { @@ -179,12 +197,13 @@ class HeterComm { #endif bool need_transfer(int send_id, int receive_id) { - return ((send_id / 4 != receive_id / 4) && (send_id + 4) % 8 != receive_id); + return ((send_id / 4 != receive_id / 4) && + (send_id + 4) % device_num_ != receive_id); } // void dump_to_cpu(int index); - int get_transfer_devid(int send_id) { return (send_id + 4) % 8; } + int get_transfer_devid(int send_id) { return (send_id + 4) % device_num_; } void end_pass(); #if defined(PADDLE_WITH_CUDA) @@ -198,8 +217,17 @@ class HeterComm { uint32_t* d_sorted_idx, uint32_t* d_offset, uint32_t* d_merged_cnts, - bool filter_zero); + bool filter_zero, + cudaStream_t stream = 0); #endif + template + void split_idx_to_shard(KeyType* d_keys, + T* d_idx_ptr, + size_t len, + T* left, + T* right, + int gpu_num, + StreamType stream); struct Node { ppStream in_stream; @@ -221,31 +249,124 @@ class HeterComm { int step; 
CopyTask(Path* path_, int step_) : path(path_), step(step_) {} }; + // inner card + struct InnerResource { + uint32_t* d_idx = nullptr; + size_t* h_part_sizes = nullptr; + std::vector h_offsets; + uint32_t* d_offset_ptr = nullptr; + + KeyType* d_keys_parted = nullptr; + char* d_vals_parted = nullptr; + std::vector d_remote_keys; + std::vector d_remote_vals; + KeyType* d_trans_keys = nullptr; + char* d_trans_vals = nullptr; + + // resize vector + void resize(const int num_gpu) { + h_offsets.resize(num_gpu); + d_remote_keys.resize(num_gpu); + d_remote_vals.resize(num_gpu); + } + }; + // Resource for partition shard Key by nodes + struct ShardResource { + uint32_t* d_local_idx_parted = nullptr; // uint32_t for multisplit + std::vector h_local_part_sizes; + std::vector h_local_part_offsets; + std::vector h_remote_part_sizes; + std::vector h_remote_part_offsets; + uint32_t* d_node_size_ptr = nullptr; + std::vector h_push_fea_sizes; + // shard part + void resize_part_size(const int node_size) { + if (h_local_part_sizes.size() >= static_cast(node_size)) { + return; + } + h_local_part_sizes.resize(node_size); + h_local_part_offsets.resize(node_size + 1); + h_remote_part_sizes.resize(node_size); + h_remote_part_offsets.resize(node_size + 1); + h_push_fea_sizes.resize(node_size * node_size); + } + }; + // pull parition shard key by devices + struct PullResource { + size_t h_recv_fea_num = 0; + uint32_t* d_restore_keys_idx = nullptr; + }; struct LocalStorage { - LocalStorage() {} - void init(int size, int dev_id) { + LocalStorage() { sem_wait = std::make_unique(); } + void init(int device_num, int dev_id) { place_ = platform::CUDAPlace(dev_id); - alloc(size, true); + h_recv_offsets.resize(device_num); + h_fea_sizes.resize(device_num); } - - void alloc(size_t size, bool force = false) { - if (force || size > all_keys_mem->size()) { - all_keys_mem.reset(); - all_grads_mem.reset(); - all_keys_mem = memory::Alloc(place_, size * sizeof(KeyType)); - all_grads_mem = memory::Alloc(place_, size * sizeof(GradType)); - all_keys = reinterpret_cast(all_keys_mem->ptr()); - all_grads = reinterpret_cast(all_grads_mem->ptr()); - } - if (force || size > local_keys_mem->size()) { - local_keys_mem.reset(); - local_grads_mem.reset(); - local_keys_mem = memory::Alloc(place_, size * sizeof(KeyType)); - local_grads_mem = memory::Alloc(place_, size * sizeof(GradType)); - local_keys = reinterpret_cast(local_keys_mem->ptr()); - local_grads = reinterpret_cast(local_grads_mem->ptr()); + template + T* alloc_cache(const size_t& len, + std::shared_ptr& alloc, // NOLINT + bool need_copy = false) { + size_t need_mem = len * sizeof(T); + if (alloc.get() == nullptr) { + alloc = memory::Alloc(place_, need_mem); + } else if (need_mem > alloc->size()) { + if (need_copy) { + std::shared_ptr tmp = + memory::Alloc(place_, need_mem); + cudaMemcpy(tmp->ptr(), + alloc->ptr(), + alloc->size(), + cudaMemcpyDeviceToDevice); + alloc.reset(); + alloc = tmp; + } else { + alloc.reset(); + alloc = memory::Alloc(place_, need_mem); + } } + return reinterpret_cast(alloc->ptr()); + } + void alloc(const size_t& len, + const size_t& value_bytes = sizeof(GradType), + const int copy_mode = 0) { + all_keys = + alloc_cache(len, all_keys_mem, (copy_mode & COPY_KEY)); + all_grads = alloc_cache( + len * value_bytes, all_grads_mem, (copy_mode & COPY_VAL)); + local_keys = + alloc_cache(len, local_keys_mem, (copy_mode & COPY_KEY)); + local_grads = alloc_cache( + len * value_bytes, local_grads_mem, (copy_mode & COPY_VAL)); + d_merged_keys = all_keys; + 
d_merged_push_keys = local_keys; + d_merged_vals = all_grads; + d_merged_push_vals = local_grads; + } + void init_pull(const size_t& len) { + pull_res.h_recv_fea_num = len; + pull_res.d_restore_keys_idx = alloc_cache(len, local_pull_idx); + } + void init_shard(const size_t& len, const size_t& node_size) { + shard_res.d_local_idx_parted = + alloc_cache(len, local_shard_idx); + shard_res.d_node_size_ptr = + alloc_cache(node_size * node_size, d_node_size_buf); + shard_res.resize_part_size(node_size); + } + void init_inner(const size_t& len, const int& device_num) { + inner_res.d_idx = alloc_cache(len, local_inner_idx); + inner_res.d_offset_ptr = + alloc_cache(device_num * 2, inner_offset); + inner_res.resize(device_num); + } + void init_trans(const size_t& fea_num, const size_t& value_bytes) { + d_merged_trans_keys = alloc_cache(fea_num * 2, trans_keys_buff); + d_merged_push_trans_keys = &d_merged_trans_keys[fea_num]; + d_merged_trans_vals = + alloc_cache(fea_num * 2 * value_bytes, trans_vals_buff); + d_merged_push_trans_vals = &d_merged_trans_vals[fea_num * value_bytes]; } #if defined(PADDLE_WITH_CUDA) @@ -254,16 +375,52 @@ class HeterComm { #elif defined(PADDLE_WITH_XPU_KP) platform::XPUPlace place_; #endif - std::shared_ptr all_keys_mem; - std::shared_ptr all_grads_mem; + std::shared_ptr all_keys_mem = nullptr; + std::shared_ptr all_grads_mem = nullptr; KeyType* all_keys; - GradType* all_grads; + char* all_grads; - std::shared_ptr local_keys_mem; - std::shared_ptr local_grads_mem; + std::shared_ptr local_keys_mem = nullptr; + std::shared_ptr local_grads_mem = nullptr; KeyType* local_keys; - GradType* local_grads; + char* local_grads; + + // all2all + std::shared_ptr local_inner_idx = nullptr; + std::shared_ptr local_pull_idx = nullptr; + std::shared_ptr local_shard_idx = nullptr; + std::shared_ptr inner_offset = nullptr; + std::shared_ptr d_node_size_buf = nullptr; + + InnerResource inner_res; + ShardResource shard_res; + PullResource pull_res; + + KeyType* d_merged_keys = nullptr; + char* d_merged_vals = nullptr; + KeyType* d_merged_push_keys = nullptr; + char* d_merged_push_vals = nullptr; + std::vector h_recv_offsets; + std::vector h_fea_sizes; + // inner trans comm and stream buffer + size_t h_trans_size; + size_t h_trans_offset; + + // node trans comm and stream buffer + std::unique_ptr sem_wait; + std::shared_ptr trans_keys_buff = nullptr; + std::shared_ptr trans_vals_buff = nullptr; + KeyType* d_merged_trans_keys = nullptr; + char* d_merged_trans_vals = nullptr; + KeyType* d_merged_push_trans_keys = nullptr; + char* d_merged_push_trans_vals = nullptr; + + platform::Timer all2all_span_; + platform::Timer inner_span_; + platform::Timer inner_barrier_; + platform::Timer node_span_; + platform::Timer node_barrier_; }; void init_path(); @@ -299,6 +456,11 @@ class HeterComm { int end_index, size_t keylen, size_t vallen); + void create_tmp_storage(void*& dest, // NOLINT + int start_index, + int end_index, + size_t vallen); + void destroy_tmp_storage(void*& p, int start_index, int end_index); // NOLINT void destroy_storage(int start_index, int end_index); void walk_to_dest(int start_index, int gpu_num, @@ -326,8 +488,160 @@ class HeterComm { size_t val_size); protected: - void pull_merge_sparse(int num, KeyType* d_keys, float* d_vals, size_t len); - void pull_normal_sparse(int num, KeyType* d_keys, float* d_vals, size_t len); + void pull_merge_sparse(const int gpu_id, + KeyType* d_keys, + float* d_vals, + size_t len); + void pull_normal_sparse(const int gpu_id, + KeyType* d_keys, + float* 
d_vals, + size_t len); + void pull_one_table(const int gpu_id, + KeyType* d_keys, + float* d_vals, + const size_t& len, + const cudaStream_t& stream); + + // node all2all pull + void pull_sparse_all2all(const int& gpu_id, + KeyType* d_keys, + float* d_vals, + const size_t& len); + + template + void push_normal_sparse(int num, + KeyType* d_keys, + float* d_grads, + size_t len, + Sgd& sgd); // NOLINT + + void shard_inner_keys(const size_t& total_fea_num, + const KeyType* d_keys, + const int& gpu_id, + const int& gpu_num, + InnerResource* res, + const cudaStream_t& stream); + void gather_inner_keys_p2p(const size_t& total_fea_num, + const KeyType* d_keys, + InnerResource& res, // NOLINT + const int& gpu_id, + const int& gpu_num, + const int& trans_id, + const cudaStream_t& stream); + size_t gather_inter_keys_by_copy(const int& gpu_id, + const size_t& fea_size, + const KeyType* d_keys, + const cudaStream_t& stream); + void partition_shard_keys(const int& gpu_id, + const size_t& total_fea_num, + const KeyType* d_keys, + uint32_t* d_idx_parted, + KeyType* d_keys_parted, + size_t* h_part_sizes, + const int& shard_num, + const cudaStream_t& stream); + size_t send_data_by_all2all(const int& gpu_id, + const int& nccl_node_size, + const int& nccl_rank_id, + const int& value_bytes, + const size_t* h_send_part_sizes, + const size_t* h_send_part_offsets, + const size_t* h_recv_part_sizes, + const size_t* h_recv_part_offsets, + const char* d_send_buff, + char* d_rev_buff, + const cudaStream_t& stream); + size_t gather_sparse_keys_by_all2all(const int& gpu_id, + const size_t& fea_size, + const KeyType* d_in_keys, + KeyType* d_out_keys, + KeyType* d_tmp_keys, + const cudaStream_t& stream); + void scatter_sparse_vals_by_all2all(const int& gpu_id, + const size_t& fea_size, + const char* d_in_vals, + void* d_out_vals, + const size_t& value_bytes, + void* d_tmp_vals, + const cudaStream_t& stream); + void scatter_inner_vals_p2p(const size_t& total_fea_num, + void* d_out_vals, + InnerResource& res, // NOLINT + const int& gpu_id, + const int& gpu_num, + const int& trans_id, + const size_t& value_bytes, + const cudaStream_t& stream); + void scatter_inter_vals_by_copy(const int& gpu_id, + const size_t& fea_size, + const char* d_in_vals, + void* d_out_vals, + const size_t& value_bytes, + const cudaStream_t& stream); + void gather_inner_data_p2p(const size_t& total_fea_num, + const KeyType* d_keys, + const void* d_vals, + InnerResource& res, // NOLINT + const int& gpu_id, + const int& gpu_num, + const int& trans_id, + const size_t& value_bytes, + const cudaStream_t& stream); + template + void push_sparse_all2all(const int& gpu_id, + KeyType* d_keys, + float* d_grads, + const size_t& len, + Sgd& sgd); // NOLINT + size_t merge_grad(const int& gpu_id, + const size_t& len, + const KeyType* d_in_keys, + KeyType* d_out_keys, + const void* d_in_grads, + void* d_out_grads, + const cudaStream_t& stream); + size_t gather_inter_gradient_by_copy(const int& gpu_id, + const size_t& push_size, + KeyType* d_keys, + void* d_push_vals, + const size_t& value_bytes, + const cudaStream_t& stream); + size_t gather_sparse_gradient_by_all2all(const int& gpu_id, + const size_t& push_size, + const KeyType* d_keys, + const char* d_push_vals, + const size_t& value_bytes, + KeyType* d_out_keys, + KeyType* d_tmp_keys, + char* d_out_vals, + char* d_tmp_vals, + const cudaStream_t& stream); + size_t send_keys_by_all2all_trans(const int& gpu_id, + const int& rank_id, + const int& node_size, + const size_t& fea_size, + const KeyType* d_in_keys, + 
KeyType* d_out_keys, + const cudaStream_t& stream); + size_t send_vals_by_all2all_trans(const int& gpu_id, + const int& rank_id, + const int& node_size, + const char* d_in_vals, + char* d_out_vals, + const size_t& value_bytes, + const cudaStream_t& stream); + size_t send_gradient_by_all2all_trans(const int& gpu_id, + const int& rank_id, + const int& node_size, + const size_t& fea_size, + const KeyType* d_keys, + const char* d_push_vals, + const size_t& value_bytes, + KeyType* d_out_keys, + char* d_out_vals, + const cudaStream_t& stream); + // debug time + void print_debug_time(const int& gpu_id, bool force = false); using Table = HashTable; using PtrTable = HashTable; @@ -341,21 +655,34 @@ class HeterComm { GPUAccessor gpu_accessor_; - private: + protected: int topo_aware_{0}; std::vector storage_; DynamicGradMerger merger_; - int feanum_{1800 * 2048}; + int device_num_ = 8; int multi_node_{0}; - int node_size_; + int rank_id_ = 0; + int node_size_ = 1; + // inner sync barrier + Barrier barrier_; + size_t val_type_size_; + size_t pull_type_size_; + size_t grad_type_size_; + size_t max_type_size_; + bool enable_gpu_direct_access_ = false; + // set compress bound + float max_value_bound_ = 10.0; + float max_grad_bound_ = 10.0; #if defined(PADDLE_WITH_CUDA) + GpuRDMAChecker* rdma_checker_ = nullptr; std::vector nccl_inner_comms_; std::vector nccl_inter_comms_; int multi_mf_dim_{8}; int max_mf_dim_ = 8; std::vector> allocators_; #endif + int64_t start_time_ = 0; }; } // end namespace framework diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_comm_inl.h b/paddle/fluid/framework/fleet/heter_ps/heter_comm_inl.h index eb55209f856aa..7cd5123b24e34 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_comm_inl.h +++ b/paddle/fluid/framework/fleet/heter_ps/heter_comm_inl.h @@ -28,10 +28,17 @@ DECLARE_bool(gpugraph_enable_gpu_direct_access); DECLARE_bool(gpugraph_enable_segment_merge_grads); DECLARE_uint64(gpugraph_merge_grads_segment_size); DECLARE_int32(gpugraph_dedup_pull_push_mode); +DECLARE_bool(enable_tracker_all2all); +DECLARE_bool(enable_all2all_use_fp16); +DECLARE_bool(enable_sparse_inner_gather); namespace paddle { namespace framework { - +inline int64_t tick_usec() { + struct timeval tm; + gettimeofday(&tm, NULL); + return tm.tv_sec * 1000 * 1000L + tm.tv_usec; +} template ::HeterComm( size_t capacity, std::shared_ptr resource) { VLOG(1) << "Construct new HeterComm"; resource_ = resource; - storage_.resize(resource_->total_device()); + device_num_ = resource_->total_device(); + storage_.resize(device_num_); multi_mf_dim_ = resource->multi_mf(); load_factor_ = FLAGS_gpugraph_hbm_table_load_factor; - VLOG(0) << "load_factor = " << load_factor_; - for (int i = 0; i < resource_->total_device(); ++i) { + multi_node_ = resource_->multi_node(); +#if defined(PADDLE_WITH_CUDA) + rdma_checker_ = GpuRDMAChecker::get(device_num_); + topo_aware_ = rdma_checker_->topo_aware(); +#endif + enable_gpu_direct_access_ = + (topo_aware_) ? 
false : FLAGS_gpugraph_enable_gpu_direct_access; + VLOG(0) << "device_num = " << device_num_ << ", multi_node = " << multi_node_ + << ", multi_mf_dim = " << multi_mf_dim_ + << ", topo_aware = " << topo_aware_ + << ", enable_gpu_direct_access = " << enable_gpu_direct_access_ + << ", load_factor = " << load_factor_; + if (multi_mf_dim_) { + max_mf_dim_ = resource_->max_mf_dim(); + auto accessor_wrapper_ptr = + GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); + val_type_size_ = accessor_wrapper_ptr->GetFeatureValueSize(max_mf_dim_); + grad_type_size_ = accessor_wrapper_ptr->GetPushValueSize(max_mf_dim_); + pull_type_size_ = accessor_wrapper_ptr->GetPullValueSize(max_mf_dim_); + VLOG(0) << " HeterComm init, max_mf_dim: " << max_mf_dim_ + << ", max feature_value_size:" << val_type_size_ + << ", feature_value_push_size:" << grad_type_size_ + << ", feature_pull_type_size:" << pull_type_size_; + } else { + val_type_size_ = sizeof(ValType); + pull_type_size_ = sizeof(ValType); + grad_type_size_ = sizeof(GradType); + } + max_type_size_ = std::max(pull_type_size_, grad_type_size_); + + for (int i = 0; i < device_num_; ++i) { #if defined(PADDLE_WITH_CUDA) platform::CUDADeviceGuard guard(resource_->dev_id(i)); allocators_.push_back(std::make_shared( 8, 1, (unsigned int)-1, (size_t)-1, false, false)); // NOLINT #endif if (!multi_mf_dim_) { - auto table = new Table(capacity / load_factor_); - tables_.push_back(table); + if (capacity > 0) { + auto table = new Table(capacity / load_factor_); + tables_.push_back(table); + } } else { - max_mf_dim_ = resource_->max_mf_dim(); - auto accessor_wrapper_ptr = - GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); - size_t val_type_size = - accessor_wrapper_ptr->GetFeatureValueSize(max_mf_dim_); - size_t grad_type_size = - accessor_wrapper_ptr->GetPushValueSize(max_mf_dim_); - size_t pull_type_size = - accessor_wrapper_ptr->GetPullValueSize(max_mf_dim_); - - VLOG(0) << " HeterComm init, max feature_value_size:" << val_type_size - << ", feature_value_push_size:" << grad_type_size - << ", feature_pull_type_size:" << pull_type_size; auto ptr_table = new PtrTable(capacity / load_factor_); - ptr_table->set_feature_value_size(pull_type_size, grad_type_size); + ptr_table->set_feature_value_size(pull_type_size_, grad_type_size_); ptr_tables_.push_back(ptr_table); } if (multi_node_) { - storage_[i].init(feanum_, resource_->dev_id(i)); + storage_[i].init(device_num_, resource_->dev_id(i)); } } + barrier_.reset(device_num_); heter_comm_kernel_ = std::make_unique(block_size_); init_path(); } @@ -86,15 +113,46 @@ template ::HeterComm( size_t capacity, std::shared_ptr resource, - GPUAccessor& gpu_accessor) { + const GPUAccessor &gpu_accessor) { VLOG(1) << "Construct new HeterComm"; resource_ = resource; - storage_.resize(resource_->total_device()); + device_num_ = resource_->total_device(); + storage_.resize(device_num_); multi_mf_dim_ = resource->multi_mf(); gpu_accessor_ = gpu_accessor; load_factor_ = FLAGS_gpugraph_hbm_table_load_factor; - VLOG(0) << "load_factor = " << load_factor_; - for (int i = 0; i < resource_->total_device(); ++i) { + multi_node_ = resource_->multi_node(); +#if defined(PADDLE_WITH_CUDA) + rdma_checker_ = GpuRDMAChecker::get(device_num_); + topo_aware_ = rdma_checker_->topo_aware(); +#endif + enable_gpu_direct_access_ = + (topo_aware_) ? 
false : FLAGS_gpugraph_enable_gpu_direct_access; + VLOG(0) << "gpu access device_num = " << device_num_ + << ", multi_node = " << multi_node_ + << ", multi_mf_dim = " << multi_mf_dim_ + << ", topo_aware = " << topo_aware_ + << ", enable_gpu_direct_access = " << enable_gpu_direct_access_ + << ", load_factor = " << load_factor_; + if (multi_mf_dim_) { + max_mf_dim_ = resource_->max_mf_dim(); + auto accessor_wrapper_ptr = + GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); + val_type_size_ = accessor_wrapper_ptr->GetFeatureValueSize(max_mf_dim_); + grad_type_size_ = accessor_wrapper_ptr->GetPushValueSize(max_mf_dim_); + pull_type_size_ = accessor_wrapper_ptr->GetPullValueSize(max_mf_dim_); + VLOG(0) << " HeterComm init, max_mf_dim: " << max_mf_dim_ + << ", max feature_value_size:" << val_type_size_ + << ", feature_value_push_size:" << grad_type_size_ + << ", feature_pull_type_size:" << pull_type_size_; + } else { + val_type_size_ = sizeof(ValType); + pull_type_size_ = sizeof(ValType); + grad_type_size_ = sizeof(GradType); + } + max_type_size_ = std::max(pull_type_size_, grad_type_size_); + + for (int i = 0; i < device_num_; ++i) { #if defined(PADDLE_WITH_CUDA) platform::CUDADeviceGuard guard(resource_->dev_id(i)); allocators_.push_back(std::make_shared( @@ -104,27 +162,15 @@ HeterComm::HeterComm( auto table = new Table(capacity / load_factor_); tables_.push_back(table); } else { - max_mf_dim_ = resource_->max_mf_dim(); - auto accessor_wrapper_ptr = - GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); - size_t val_type_size = - accessor_wrapper_ptr->GetFeatureValueSize(max_mf_dim_); - size_t grad_type_size = - accessor_wrapper_ptr->GetPushValueSize(max_mf_dim_); - size_t pull_type_size = - accessor_wrapper_ptr->GetPullValueSize(max_mf_dim_); - - VLOG(0) << " HeterComm init, max feature_value_size:" << val_type_size - << ", feature_value_push_size:" << grad_type_size - << ", feature_pull_type_size:" << pull_type_size; auto ptr_table = new PtrTable(capacity / load_factor_); - ptr_table->set_feature_value_size(pull_type_size, grad_type_size); + ptr_table->set_feature_value_size(pull_type_size_, grad_type_size_); ptr_tables_.push_back(ptr_table); } if (multi_node_) { - storage_[i].init(feanum_, resource_->dev_id(i)); + storage_[i].init(device_num_, resource_->dev_id(i)); } } + barrier_.reset(device_num_); heter_comm_kernel_ = std::make_unique(block_size_); init_path(); } @@ -141,10 +187,10 @@ void HeterComm::init_path() { for (int i = 0; i < total_device; ++i) { path_[i].resize(total_device); for (int j = 0; j < total_device; ++j) { - auto& nodes = path_[i][j].nodes_; + auto &nodes = path_[i][j].nodes_; nodes.resize(1); - nodes[0].in_stream = resource_->comm_stream(i, j); - nodes[0].out_stream = resource_->comm_stream(i, j); + nodes[0].in_stream = resource_->remote_stream(i, j); + nodes[0].out_stream = resource_->remote_stream(i, j); nodes[0].key_storage = NULL; nodes[0].val_storage = NULL; nodes[0].sync = 0; @@ -156,7 +202,7 @@ void HeterComm::init_path() { for (int i = 0; i < total_device; ++i) { path_[i].resize(total_device); for (int j = 0; j < total_device; ++j) { - auto& nodes = path_[i][j].nodes_; + auto &nodes = path_[i][j].nodes_; int from = resource_->dev_id(i); int to = resource_->dev_id(j); @@ -164,18 +210,18 @@ void HeterComm::init_path() { if (need_transfer(from, to)) { transfer_id = resource_->get_index_by_devid(get_transfer_devid(from)); nodes.push_back(Node()); - Node& node = nodes.back(); - node.in_stream = resource_->comm_stream(i, transfer_id); - 
node.out_stream = resource_->comm_stream(transfer_id, i); + Node &node = nodes.back(); + node.in_stream = resource_->remote_stream(i, transfer_id); + node.out_stream = resource_->remote_stream(transfer_id, i); node.key_storage = NULL; node.val_storage = NULL; node.sync = 1; node.dev_num = transfer_id; } nodes.push_back(Node()); - Node& node = nodes.back(); - node.in_stream = resource_->comm_stream(i, transfer_id); - node.out_stream = resource_->comm_stream(transfer_id, i); + Node &node = nodes.back(); + node.in_stream = resource_->remote_stream(i, transfer_id); + node.out_stream = resource_->remote_stream(transfer_id, i); node.key_storage = NULL; node.val_storage = NULL; node.sync = 0; @@ -183,6 +229,96 @@ void HeterComm::init_path() { } } } + start_time_ = tick_usec(); +} +template +void HeterComm::reset_table( + const int dev_id, + size_t capacity, + const OptimizerConfig &sgd_config, + const OptimizerConfig &embedx_config, + bool infer_mode) { + PADDLE_ENFORCE_LT( + dev_id, + device_num_, + paddle::platform::errors::InvalidArgument( + "dev id %d more than device num %d", dev_id, device_num_)); +#if defined(PADDLE_WITH_CUDA) + platform::CUDADeviceGuard guard(resource_->dev_id(dev_id)); +#endif + size_t need_capacity = capacity / load_factor_; + if (!multi_mf_dim_) { + auto table = tables_[dev_id]; + if (static_cast(table->size()) < need_capacity) { + delete table; + table = new Table(need_capacity); + table->set_sparse_sgd(sgd_config); + table->set_embedx_sgd(sgd_config); + tables_[dev_id] = table; + } else { + table->clear(); + } + table->set_mode(infer_mode); + } else { + auto table = ptr_tables_[dev_id]; + if (static_cast(table->size()) < need_capacity) { + delete table; + table = new PtrTable(need_capacity); + table->set_feature_value_size(pull_type_size_, grad_type_size_); + table->set_sparse_sgd(sgd_config); + table->set_embedx_sgd(sgd_config); + ptr_tables_[dev_id] = table; + } else { + table->clear(); + } + table->set_mode(infer_mode); + } +} +template +void HeterComm::set_mode( + bool infer_mode) { + if (!multi_mf_dim_) { + for (auto &table : tables_) { + table->set_mode(infer_mode); + } + } else { + for (auto &table : ptr_tables_) { + table->set_mode(infer_mode); + } + } +} +// debug time +template +void HeterComm::print_debug_time( + const int &gpu_id, bool force) { + if (!multi_node_) { + return; + } + static int64_t count_ = 0; + if (count_++ % 5000 != 0) { + return; + } + auto &cc = storage_[gpu_id]; + printf( + "gpu id=%d, total span: %lf, " + "all2all: %lf, node: %lf, barrier: %lf, " + "inner: %lf, barrier: %lf\n", + gpu_id, + (tick_usec() - start_time_) / 1000000.0, + cc.all2all_span_.ElapsedSec(), + cc.node_span_.ElapsedSec(), + cc.node_barrier_.ElapsedSec(), + cc.inner_span_.ElapsedSec(), + cc.inner_barrier_.ElapsedSec()); } template void HeterComm::memory_copy( DstPlace dst_place, - void* dst, + void *dst, SrcPlace src_place, - const void* src, + const void *src, size_t count, StreamType stream) { #if defined(PADDLE_WITH_CUDA) @@ -214,38 +350,69 @@ template ::create_storage( int start_index, int end_index, size_t keylen, size_t vallen) { #if defined(PADDLE_WITH_CUDA) - auto& allocator = allocators_[start_index]; - auto& nodes = path_[start_index][end_index].nodes_; + auto &allocator = allocators_[start_index]; + auto &nodes = path_[start_index][end_index].nodes_; for (size_t i = 0; i < nodes.size(); ++i) { platform::CUDADeviceGuard guard(resource_->dev_id(nodes[i].dev_num)); - PADDLE_ENFORCE_GPU_SUCCESS(allocator->DeviceAllocate( - resource_->dev_id(nodes[i].dev_num), 
- (void**)&(nodes[i].key_storage), // NOLINT - keylen, - resource_->remote_stream(nodes[i].dev_num, start_index))); - PADDLE_ENFORCE_GPU_SUCCESS(allocator->DeviceAllocate( - resource_->dev_id(nodes[i].dev_num), - (void**)&(nodes[i].val_storage), // NOLINT - vallen, - resource_->remote_stream(nodes[i].dev_num, start_index))); - nodes[i].key_bytes_len = keylen; - nodes[i].val_bytes_len = vallen; + if (keylen > 0) { + PADDLE_ENFORCE_GPU_SUCCESS(allocator->DeviceAllocate( + resource_->dev_id(nodes[i].dev_num), + (void **)&(nodes[i].key_storage), // NOLINT + keylen, + resource_->remote_stream(nodes[i].dev_num, start_index))); + nodes[i].key_bytes_len = keylen; + } + if (vallen > 0) { + PADDLE_ENFORCE_GPU_SUCCESS(allocator->DeviceAllocate( + resource_->dev_id(nodes[i].dev_num), + (void **)&(nodes[i].val_storage), // NOLINT + vallen, + resource_->remote_stream(nodes[i].dev_num, start_index))); + nodes[i].val_bytes_len = vallen; + } } #elif defined(PADDLE_WITH_XPU_KP) - auto& nodes = path_[start_index][end_index].nodes_; + auto &nodes = path_[start_index][end_index].nodes_; for (size_t i = 0; i < nodes.size(); ++i) { platform::XPUDeviceGuard guard(resource_->dev_id(nodes[i].dev_num)); auto place = DevPlace(resource_->dev_id(nodes[i].dev_num)); - auto node_keys_mem = memory::Alloc(place, keylen); - nodes[i].key_storage = reinterpret_cast(node_keys_mem->ptr()); - auto node_vals_mem = memory::Alloc(place, vallen); - nodes[i].val_storage = reinterpret_cast(node_vals_mem->ptr()); - nodes[i].key_bytes_len = keylen; - nodes[i].val_bytes_len = vallen; + if (keylen > 0) { + auto node_keys_mem = memory::Alloc(place, keylen); + nodes[i].key_storage = reinterpret_cast(node_keys_mem->ptr()); + nodes[i].key_bytes_len = keylen; + } + if (vallen > 0) { + auto node_vals_mem = memory::Alloc(place, vallen); + nodes[i].val_storage = reinterpret_cast(node_vals_mem->ptr()); + nodes[i].val_bytes_len = vallen; + } } #endif } +template +void HeterComm::create_tmp_storage( + void *&dest, int start_index, int end_index, size_t vallen) { // NOLINT +#if defined(PADDLE_WITH_CUDA) + auto &allocator = allocators_[start_index]; + platform::CUDADeviceGuard guard(resource_->dev_id(end_index)); + PADDLE_ENFORCE_GPU_SUCCESS(allocator->DeviceAllocate( + resource_->dev_id(end_index), + (void **)&(dest), // NOLINT + vallen, + resource_->remote_stream(end_index, start_index))); + +#elif defined(PADDLE_WITH_XPU_KP) + platform::XPUDeviceGuard guard(resource_->dev_id(end_index)); + auto place = DevPlace(resource_->dev_id(end_index)); + auto node_vals_mem = memory::Alloc(place, vallen); + dest = reinterpret_cast(node_vals_mem->ptr()); +#endif +} + template ::destroy_storage( int start_index, int end_index) { #if defined(PADDLE_WITH_CUDA) - auto& allocator = allocators_[start_index]; - auto& nodes = path_[start_index][end_index].nodes_; + auto &allocator = allocators_[start_index]; + auto &nodes = path_[start_index][end_index].nodes_; for (size_t i = 0; i < nodes.size(); ++i) { platform::CUDADeviceGuard guard(resource_->dev_id(nodes[i].dev_num)); @@ -266,6 +433,20 @@ void HeterComm::destroy_storage( #endif } +template +void HeterComm::destroy_tmp_storage( + void *&p, int start_index, int end_index) { // NOLINT +#if defined(PADDLE_WITH_CUDA) + auto &allocator = allocators_[start_index]; + platform::CUDADeviceGuard guard(resource_->dev_id(end_index)); + PADDLE_ENFORCE_GPU_SUCCESS( + allocator->DeviceFree(resource_->dev_id(end_index), p)); +#endif +} + template ::walk_to_dest( int start_index, int num, - int* h_left, - int* h_right, - KeyType* 
src_key, - GradType* src_val) { + int *h_left, + int *h_right, + KeyType *src_key, + GradType *src_val) { int need_copy_val = 0; if (src_val) { need_copy_val = 1; @@ -287,7 +468,7 @@ void HeterComm::walk_to_dest( continue; } // int size = path_[start_index][i].nodes_.size(); - auto& node = path_[start_index][i].nodes_[0]; + auto &node = path_[start_index][i].nodes_[0]; CopyTask t(&path_[start_index][i], 0); que.push(t); @@ -299,7 +480,7 @@ void HeterComm::walk_to_dest( memory_copy(dst_place, node.key_storage, src_place, - reinterpret_cast(src_key + h_left[i]), + reinterpret_cast(src_key + h_left[i]), node.key_bytes_len, node.in_stream); // #if defined(PADDLE_WITH_CUDA) // adapt for gpu-graph @@ -311,13 +492,13 @@ void HeterComm::walk_to_dest( memory_copy(dst_place, node.val_storage, src_place, - reinterpret_cast(src_val + h_left[i]), + reinterpret_cast(src_val + h_left[i]), node.val_bytes_len, node.in_stream); } } while (!que.empty()) { - CopyTask& cur_task = que.front(); + CopyTask &cur_task = que.front(); que.pop(); if (cur_task.path->nodes_[cur_task.step].sync) { sync_stream(cur_task.path->nodes_[cur_task.step].in_stream); @@ -360,10 +541,10 @@ template ::walk_to_dest( int start_index, int gpu_num, - int* h_left, - int* h_right, - KeyType* src_key, - char* src_val, + int *h_left, + int *h_right, + KeyType *src_key, + char *src_val, size_t val_size) { int need_copy_val = 0; if (src_val) { @@ -375,11 +556,11 @@ void HeterComm::walk_to_dest( continue; } int size = path_[start_index][i].nodes_.size(); - auto& node = path_[start_index][i].nodes_[0]; + auto &node = path_[start_index][i].nodes_[0]; CopyTask t(&path_[start_index][i], 0); que.push(t); CUDA_CHECK(cudaMemcpyAsync(node.key_storage, - reinterpret_cast(src_key + h_left[i]), + reinterpret_cast(src_key + h_left[i]), node.key_bytes_len, cudaMemcpyDefault, node.in_stream)); @@ -393,7 +574,7 @@ void HeterComm::walk_to_dest( } } while (!que.empty()) { - CopyTask& cur_task = que.front(); + CopyTask &cur_task = que.front(); que.pop(); if (cur_task.path->nodes_[cur_task.step].sync) { CUDA_CHECK(cudaStreamSynchronize( @@ -428,9 +609,9 @@ template ::walk_to_src( int start_index, int gpu_num, - int* h_left, - int* h_right, - char* src_val, + int *h_left, + int *h_right, + char *src_val, size_t val_size) { std::queue que; for (int i = 0; i < gpu_num; i++) { @@ -438,7 +619,7 @@ void HeterComm::walk_to_src( continue; } int cur_step = path_[start_index][i].nodes_.size() - 1; - auto& node = path_[start_index][i].nodes_[cur_step]; + auto &node = path_[start_index][i].nodes_[cur_step]; if (cur_step == 0) { CUDA_CHECK(cudaMemcpyAsync(src_val + uint64_t(h_left[i]) * val_size, node.val_storage, @@ -457,7 +638,7 @@ void HeterComm::walk_to_src( } } while (!que.empty()) { - CopyTask& cur_task = que.front(); + CopyTask &cur_task = que.front(); que.pop(); int cur_step = cur_task.step; if (cur_task.path->nodes_[cur_step].sync) { @@ -489,17 +670,20 @@ template HeterComm::~HeterComm() { + for (int i = 0; i < device_num_; ++i) { + print_debug_time(i, true); + } if (!multi_mf_dim_) { - for (auto& table : tables_) { + for (auto &table : tables_) { delete table; table = nullptr; } } else { - for (auto& table : ptr_tables_) { + for (auto &table : ptr_tables_) { delete table; table = nullptr; } - for (auto& table : tables_) { + for (auto &table : tables_) { delete table; table = nullptr; } @@ -524,13 +708,13 @@ template :: show_table_collisions() { size_t idx = 0; - for (auto& table : tables_) { + for (auto &table : tables_) { if (table != nullptr) { 
table->show_collision(idx++); } } idx = 0; - for (auto& table : ptr_tables_) { + for (auto &table : ptr_tables_) { if (table != nullptr) { table->show_collision(idx++); } @@ -563,7 +747,7 @@ template void HeterComm::set_sparse_sgd( - const OptimizerConfig& optimizer_config) { + const OptimizerConfig &optimizer_config) { for (int i = 0; i < resource_->total_device(); ++i) { AnyDeviceGuard guard(resource_->dev_id(i)); if (!multi_mf_dim_) { @@ -579,7 +763,7 @@ template void HeterComm::set_embedx_sgd( - const OptimizerConfig& optimizer_config) { + const OptimizerConfig &optimizer_config) { for (int i = 0; i < resource_->total_device(); ++i) { AnyDeviceGuard guard(resource_->dev_id(i)); if (!multi_mf_dim_) { @@ -596,8 +780,8 @@ template void HeterComm::build_ps( int dev_num, - KeyType* h_keys, - ValType* h_vals, + KeyType *h_keys, + ValType *h_vals, size_t len, size_t chunk_size, int stream_num, @@ -637,22 +821,22 @@ void HeterComm::build_ps( auto src_place = platform::CPUPlace(); memory_copy(dst_place, - reinterpret_cast(d_key_bufs[cur_stream]->ptr()), + reinterpret_cast(d_key_bufs[cur_stream]->ptr()), src_place, h_keys + cur_len, sizeof(KeyType) * tmp_len, cur_use_stream); memory_copy(dst_place, - reinterpret_cast(d_val_bufs[cur_stream]->ptr()), + reinterpret_cast(d_val_bufs[cur_stream]->ptr()), src_place, h_vals + cur_len, sizeof(ValType) * tmp_len, cur_use_stream); if (offset == -1) offset = dev_num; tables_[offset]->insert( - reinterpret_cast(d_key_bufs[cur_stream]->ptr()), - reinterpret_cast(d_val_bufs[cur_stream]->ptr()), - (size_t)tmp_len, + reinterpret_cast(d_key_bufs[cur_stream]->ptr()), + reinterpret_cast(d_val_bufs[cur_stream]->ptr()), + static_cast(tmp_len), cur_use_stream); cur_stream += 1; @@ -670,8 +854,8 @@ template void HeterComm::build_ps( int num, - KeyType* h_keys, - char* pool, + KeyType *h_keys, + char *pool, size_t len, size_t feature_value_size, size_t chunk_size, @@ -709,13 +893,13 @@ void HeterComm::build_ps( auto src_place = platform::CPUPlace(); memory_copy(dst_place, - reinterpret_cast(d_key_bufs[cur_stream]->ptr()), + reinterpret_cast(d_key_bufs[cur_stream]->ptr()), src_place, h_keys + cur_len, sizeof(KeyType) * tmp_len, cur_use_stream); ptr_tables_[num]->insert( - reinterpret_cast(d_key_bufs[cur_stream]->ptr()), + reinterpret_cast(d_key_bufs[cur_stream]->ptr()), tmp_len, pool, feature_value_size, @@ -736,20 +920,20 @@ template void HeterComm::merge_grad( int dev_num, - KeyType* d_keys, - GradType* d_grads, + KeyType *d_keys, + GradType *d_grads, size_t len, - int& uniq_len) { // NOLINT + int &uniq_len) { // NOLINT int dev_id = resource_->dev_id(dev_num); DevPlace place = DevPlace(dev_id); AnyDeviceGuard guard(dev_id); auto stream = resource_->local_stream(dev_num, 0); size_t temp_storage_bytes; auto d_merge_keys = memory::Alloc(place, len * sizeof(KeyType)); - KeyType* d_merge_keys_ptr = reinterpret_cast(d_merge_keys->ptr()); + KeyType *d_merge_keys_ptr = reinterpret_cast(d_merge_keys->ptr()); auto d_merge_grads = memory::Alloc(place, len * sizeof(GradType)); - GradType* d_merge_grads_ptr = - reinterpret_cast(d_merge_grads->ptr()); + GradType *d_merge_grads_ptr = + reinterpret_cast(d_merge_grads->ptr()); heter_comm_kernel_->sort_pairs(NULL, temp_storage_bytes, d_keys, @@ -775,7 +959,7 @@ void HeterComm::merge_grad( false); temp_storage_bytes = 0; auto d_num_runs_out_mem = memory::Alloc(place, sizeof(int)); - int* d_num_runs_out = reinterpret_cast(d_num_runs_out_mem->ptr()); + int *d_num_runs_out = reinterpret_cast(d_num_runs_out_mem->ptr()); 
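Note on the CUB usage in merge_grad just above and below: sort_pairs and reduce_by_key follow CUB's two-pass calling convention, where a first call with a null workspace pointer only reports the required temp_storage_bytes and a second identical call does the actual work. A minimal, self-contained sketch of that pattern (hypothetical function, illustrative only, not part of this patch):

#include <cstdint>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

void sort_pairs_two_pass(const uint64_t *d_keys_in, uint64_t *d_keys_out,
                         const uint32_t *d_vals_in, uint32_t *d_vals_out,
                         int num_items, cudaStream_t stream) {
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // Pass 1: with a null workspace pointer CUB only fills temp_storage_bytes.
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_keys_in, d_keys_out, d_vals_in, d_vals_out,
                                  num_items, 0, 8 * sizeof(uint64_t), stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Pass 2: same arguments with a real workspace; the sort actually runs here.
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_keys_in, d_keys_out, d_vals_in, d_vals_out,
                                  num_items, 0, 8 * sizeof(uint64_t), stream);
  cudaStreamSynchronize(stream);
  cudaFree(d_temp_storage);
}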
heter_comm_kernel_->reduce_by_key(NULL, temp_storage_bytes, d_merge_keys_ptr, @@ -813,11 +997,11 @@ template void HeterComm::dynamic_merge_grad( int gpu_num, - KeyType* d_keys, - float* d_grads, + KeyType *d_keys, + float *d_grads, size_t len, - int& uniq_len, - size_t& segment_len, + int &uniq_len, // NOLINT + size_t &segment_len, // NOLINT bool enable_segment_merge_grad) { int dev_id = resource_->dev_id(gpu_num); platform::CUDAPlace place = platform::CUDAPlace(dev_id); @@ -831,13 +1015,13 @@ void HeterComm::dynamic_merge_grad( size_t grad_value_size = accessor_wrapper_ptr->GetPushValueSize(max_mf_dim_); auto d_merge_keys = memory::Alloc(place, len * sizeof(KeyType)); - KeyType* d_merge_keys_ptr = reinterpret_cast(d_merge_keys->ptr()); + KeyType *d_merge_keys_ptr = reinterpret_cast(d_merge_keys->ptr()); auto d_fea_num_info = memory::Alloc(place, sizeof(uint32_t) * (len * 3 + 1)); - uint32_t* d_fea_num_info_ptr = - reinterpret_cast(d_fea_num_info->ptr()); - uint32_t* d_index = (uint32_t*)&d_fea_num_info_ptr[len]; - uint32_t* d_idx = (uint32_t*)&d_index[len]; - int* d_merged_size = (int*)&d_idx[len]; + uint32_t *d_fea_num_info_ptr = + reinterpret_cast(d_fea_num_info->ptr()); + uint32_t *d_index = static_cast(&d_fea_num_info_ptr[len]); + uint32_t *d_idx = reinterpret_cast(&d_index[len]); + int *d_merged_size = reinterpret_cast(&d_idx[len]); heter_comm_kernel_->fill_idx(d_idx, len, stream); PADDLE_ENFORCE_GPU_SUCCESS( @@ -889,15 +1073,12 @@ void HeterComm::dynamic_merge_grad( len, stream)); - cudaMemcpyAsync((void*)&uniq_len, - d_merged_size, - sizeof(int), - cudaMemcpyDeviceToHost, - stream); + cudaMemcpyAsync( + &uniq_len, d_merged_size, sizeof(int), cudaMemcpyDeviceToHost, stream); PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); assert(d_merged_size > 0); - uint32_t* d_offset = (uint32_t*)&d_index[len]; + uint32_t *d_offset = reinterpret_cast(&d_index[len]); temp_storage_bytes = 0; PADDLE_ENFORCE_GPU_SUCCESS(cub::DeviceScan::ExclusiveSum(NULL, temp_storage_bytes, @@ -935,20 +1116,26 @@ void HeterComm::dynamic_merge_grad( PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); } else { auto d_merge_grads = memory::Alloc(place, len * grad_value_size); - float* d_merge_grads_ptr = reinterpret_cast(d_merge_grads->ptr()); - - heter_comm_kernel_->merge_gradient(d_keys, - d_offset, - d_fea_num_info_ptr, - d_index, - (char*)d_grads, - (char*)d_merge_grads_ptr, - uniq_len, - grad_dim, - grad_value_size, - merger_, - stream, - gpu_accessor_); + float *d_merge_grads_ptr = reinterpret_cast(d_merge_grads->ptr()); + // copy merge keys to d_keys + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(d_keys, + d_merge_keys_ptr, + sizeof(KeyType) * segment_len, + cudaMemcpyDeviceToDevice, + stream)); + heter_comm_kernel_->merge_gradient( + d_keys, + d_offset, + d_fea_num_info_ptr, + d_index, + reinterpret_cast(d_grads), + reinterpret_cast(d_merge_grads_ptr), + uniq_len, + grad_dim, + grad_value_size, + merger_, + stream, + gpu_accessor_); PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(d_grads, d_merge_grads_ptr, grad_value_size * uniq_len, @@ -964,16 +1151,16 @@ template void HeterComm::segment_merge_grad( int gpu_num, // the device number - KeyType* - d_keys, // the sorted keys list, which will be modified after merged - float* d_grads, // the raw grads list, which will be modified after merged - const uint32_t* - d_index, // the storage position of d_keys, its length is len. 
- size_t len, // the number of raw input keys - const uint32_t* - d_fea_num_info, // prefix sum array, its length is uniq_len+1 + KeyType + *d_keys, // the sorted keys list, which will be modified after merged + float *d_grads, // the raw grads list, which will be modified after merged + const uint32_t + *d_index, // the storage position of d_keys, its length is len. + size_t len, // the number of raw input keys + const uint32_t + *d_fea_num_info, // prefix sum array, its length is uniq_len+1 size_t uniq_len, // the number of unique keys - size_t& segments_num) { // the number of segment merged keys + size_t &segments_num) { // the number of segment merged keys // NOLINT int dev_id = resource_->dev_id(gpu_num); platform::CUDAPlace place = platform::CUDAPlace(dev_id); @@ -986,16 +1173,16 @@ void HeterComm::segment_merge_grad( size_t grad_value_size = accessor_wrapper_ptr->GetPushValueSize(max_mf_dim_); auto d_buffer1 = memory::Alloc(place, sizeof(uint32_t) * len); - auto d_segments = reinterpret_cast(d_buffer1->ptr()); + auto d_segments = reinterpret_cast(d_buffer1->ptr()); auto d_buffer2 = memory::Alloc(place, sizeof(uint32_t) * len); - auto d_segments_offset = reinterpret_cast(d_buffer2->ptr()); + auto d_segments_offset = reinterpret_cast(d_buffer2->ptr()); auto d_buffer3 = memory::Alloc(place, sizeof(uint32_t) * len); - auto d_segments_fea_num_info = reinterpret_cast(d_buffer3->ptr()); + auto d_segments_fea_num_info = reinterpret_cast(d_buffer3->ptr()); auto d_buffer4 = memory::Alloc(place, sizeof(uint32_t) * len); auto d_segments_fea_num_offset = - reinterpret_cast(d_buffer4->ptr()); + reinterpret_cast(d_buffer4->ptr()); auto d_buffer5 = memory::Alloc(place, sizeof(uint32_t)); - auto d_segments_num = reinterpret_cast(d_buffer5->ptr()); + auto d_segments_num = reinterpret_cast(d_buffer5->ptr()); CUDA_CHECK(cudaMemsetAsync(d_segments_num, 0, sizeof(uint32_t), stream)); uint32_t segment_size = FLAGS_gpugraph_merge_grads_segment_size; @@ -1073,7 +1260,8 @@ void HeterComm::segment_merge_grad( PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); auto d_segments_keys = memory::Alloc(place, sizeof(KeyType) * segments_num); - auto d_segments_keys_ptr = reinterpret_cast(d_segments_keys->ptr()); + auto d_segments_keys_ptr = + reinterpret_cast(d_segments_keys->ptr()); heter_comm_kernel_->shrink_keys(d_keys, d_segments_fea_num_offset, d_segments_keys_ptr, @@ -1082,19 +1270,20 @@ void HeterComm::segment_merge_grad( PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); auto d_segment_grads = memory::Alloc(place, segments_num * grad_value_size); - auto d_segment_grads_ptr = reinterpret_cast(d_segment_grads->ptr()); - heter_comm_kernel_->merge_gradient(d_segments_keys_ptr, - d_segments_fea_num_offset, - d_segments_fea_num_info, - d_index, - (char*)d_grads, - (char*)d_segment_grads_ptr, - segments_num, - grad_dim, - grad_value_size, - merger_, - stream, - gpu_accessor_); + auto d_segment_grads_ptr = reinterpret_cast(d_segment_grads->ptr()); + heter_comm_kernel_->merge_gradient( + d_segments_keys_ptr, + d_segments_fea_num_offset, + d_segments_fea_num_info, + d_index, + reinterpret_cast(d_grads), + reinterpret_cast(d_segment_grads_ptr), + segments_num, + grad_dim, + grad_value_size, + merger_, + stream, + gpu_accessor_); PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(d_keys, @@ -1115,26 +1304,39 @@ template void HeterComm::split_input_to_shard( - KeyType* d_keys, - int* d_idx_ptr, + KeyType *d_keys, + int *d_idx_ptr, size_t len, - int* left, 
- int* right, + int *left, + int *right, int dev_num) { + auto stream = resource_->local_stream(dev_num, 0); + split_idx_to_shard(d_keys, d_idx_ptr, len, left, right, dev_num, stream); +} +template +template +void HeterComm::split_idx_to_shard( + KeyType *d_keys, + T *d_idx_ptr, + size_t len, + T *left, + T *right, + int gpu_num, + StreamType stream) { int total_device = resource_->total_device(); - int dev_id = resource_->dev_id(dev_num); + int dev_id = resource_->dev_id(gpu_num); DevPlace place = DevPlace(dev_id); AnyDeviceGuard guard(dev_id); - auto stream = resource_->local_stream(dev_num, 0); - - auto d_idx_tmp = memory::Alloc(place, len * sizeof(int)); - int* d_idx_tmp_ptr = reinterpret_cast(d_idx_tmp->ptr()); - - auto d_shard_index = memory::Alloc(place, len * sizeof(int)); - int* d_shard_index_ptr = reinterpret_cast(d_shard_index->ptr()); - - auto d_shard_index_tmp = memory::Alloc(place, len * sizeof(int)); - int* d_shard_index_tmp_ptr = reinterpret_cast(d_shard_index_tmp->ptr()); + auto d_idx_tmp = + memory::Alloc(place, + 3 * len * sizeof(T), + phi::Stream(reinterpret_cast(stream))); + T *d_idx_tmp_ptr = reinterpret_cast(d_idx_tmp->ptr()); + T *d_shard_index_ptr = reinterpret_cast(&d_idx_tmp_ptr[len]); + T *d_shard_index_tmp_ptr = reinterpret_cast(&d_shard_index_ptr[len]); heter_comm_kernel_->fill_idx(d_idx_tmp_ptr, len, stream); heter_comm_kernel_->calc_shard_index( @@ -1153,7 +1355,10 @@ void HeterComm::split_input_to_shard( num_bits, stream); - auto d_temp_storage = memory::Alloc(place, temp_storage_bytes); + auto d_temp_storage = + memory::Alloc(place, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); heter_comm_kernel_->sort_pairs(d_temp_storage->ptr(), temp_storage_bytes, d_shard_index_tmp_ptr, @@ -1169,124 +1374,45 @@ void HeterComm::split_input_to_shard( d_shard_index_ptr, left, right, len, total_device, stream); sync_stream(stream); } - template -void HeterComm::merge_keys( - int gpu_num, - const KeyType* d_keys, - size_t len, // input - KeyType* d_sorted_keys, // output - KeyType* d_merged_keys, // output - uint32_t* d_restore_idx, // output - size_t& uniq_len) { // output +template +size_t HeterComm::merge_keys( + const int gpu_num, + const KeyType *d_keys, + const size_t &len, // input + KeyType *d_sorted_keys, // output + KeyType *d_merged_keys, // output + uint32_t *d_restore_idx, + StreamType stream) { +#if defined(PADDLE_WITH_CUDA) int dev_id = resource_->dev_id(gpu_num); platform::CUDAPlace place = platform::CUDAPlace(dev_id); - platform::CUDADeviceGuard guard(dev_id); - auto stream = resource_->local_stream(gpu_num, 0); - - size_t grad_dim = max_mf_dim_; - auto accessor_wrapper_ptr = - GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); - size_t grad_value_size = accessor_wrapper_ptr->GetPushValueSize(max_mf_dim_); - - auto d_fea_num_info = memory::Alloc(place, sizeof(uint32_t) * (len * 4 + 1)); - uint32_t* d_fea_num_info_ptr = - reinterpret_cast(d_fea_num_info->ptr()); - uint32_t* d_idx = (uint32_t*)&d_fea_num_info_ptr[len]; - uint32_t* d_index = (uint32_t*)&d_idx[len]; - uint32_t* d_offset = (uint32_t*)&d_index[len]; - uint32_t* d_merged_size = (uint32_t*)&d_offset[len]; - heter_comm_kernel_->fill_idx(d_idx, len, stream); - - size_t temp_storage_bytes; - PADDLE_ENFORCE_GPU_SUCCESS( - cub::DeviceRadixSort::SortPairs(NULL, - temp_storage_bytes, - d_keys, - d_sorted_keys, - d_idx, - d_index, - len, - 0, - 8 * sizeof(KeyType), - stream)); - auto d_temp_storage = memory::Alloc(place, temp_storage_bytes); - PADDLE_ENFORCE_GPU_SUCCESS( - 
cub::DeviceRadixSort::SortPairs(d_temp_storage->ptr(), - temp_storage_bytes, - d_keys, - d_sorted_keys, - d_idx, - d_index, - len, - 0, - 8 * sizeof(KeyType), - stream)); - PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); - - temp_storage_bytes = 0; - PADDLE_ENFORCE_GPU_SUCCESS( - cub::DeviceRunLengthEncode::Encode(NULL, - temp_storage_bytes, - d_sorted_keys, - d_merged_keys, - d_fea_num_info_ptr, - d_merged_size, - len, - stream)); - if (d_temp_storage->size() < temp_storage_bytes) { - d_temp_storage = NULL; - d_temp_storage = memory::Alloc(place, temp_storage_bytes); - } - PADDLE_ENFORCE_GPU_SUCCESS( - cub::DeviceRunLengthEncode::Encode(d_temp_storage->ptr(), - temp_storage_bytes, - d_sorted_keys, - d_merged_keys, - d_fea_num_info_ptr, - d_merged_size, - len, - stream)); - cudaMemcpyAsync((void*)&uniq_len, - d_merged_size, - sizeof(int), - cudaMemcpyDeviceToHost, - stream); - PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); - - temp_storage_bytes = 0; - PADDLE_ENFORCE_GPU_SUCCESS(cub::DeviceScan::ExclusiveSum(NULL, - temp_storage_bytes, - d_fea_num_info_ptr, - d_offset, - uniq_len, - stream)); - if (d_temp_storage->size() < temp_storage_bytes) { - d_temp_storage = NULL; - d_temp_storage = memory::Alloc(place, temp_storage_bytes); - } - PADDLE_ENFORCE_GPU_SUCCESS( - cub::DeviceScan::ExclusiveSum(d_temp_storage->ptr(), - temp_storage_bytes, - d_fea_num_info_ptr, - d_offset, - uniq_len, - stream)); - PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); - heter_comm_kernel_->fill_restore_idx(true, - len, - uniq_len, - d_merged_keys, - d_index, - d_offset, - d_fea_num_info_ptr, - d_restore_idx, - stream); - PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + auto d_fea_num_info = + memory::Alloc(place, + sizeof(uint32_t) * (len * 3), + phi::Stream(reinterpret_cast(stream))); + uint32_t *d_offset = reinterpret_cast(d_fea_num_info->ptr()); + uint32_t *d_merged_cnts = reinterpret_cast(&d_offset[len]); + uint32_t *d_sorted_idx = reinterpret_cast(&d_merged_cnts[len]); + + return dedup_keys_and_fillidx(gpu_num, + len, + d_keys, // input + d_merged_keys, // output + d_sorted_keys, + d_restore_idx, + d_sorted_idx, + d_offset, + d_merged_cnts, + true, + stream); +#else + return 0; +#endif } template void HeterComm::pull_merge_sparse( - int num, KeyType* d_keys, float* d_vals, size_t len) { + const int num, KeyType *d_keys, float *d_vals, size_t len) { int total_device = resource_->total_device(); int dev_id = resource_->dev_id(num); DevPlace place = DevPlace(dev_id); @@ -1306,8 +1432,8 @@ void HeterComm::pull_merge_sparse( auto d_left = memory::Alloc(place, total_device * sizeof(int)); auto d_right = memory::Alloc(place, total_device * sizeof(int)); - int* d_left_ptr = reinterpret_cast(d_left->ptr()); - int* d_right_ptr = reinterpret_cast(d_right->ptr()); + int *d_left_ptr = reinterpret_cast(d_left->ptr()); + int *d_right_ptr = reinterpret_cast(d_right->ptr()); #if defined(PADDLE_WITH_CUDA) cudaMemsetAsync(d_left_ptr, -1, total_device * sizeof(int), stream); @@ -1339,30 +1465,34 @@ void HeterComm::pull_merge_sparse( size_t val_type_size = accessor_wrapper_ptr->GetPullValueSize(max_mf_dim_); VLOG(3) << "pull_sparse len:" << len << " val_type_size: " << val_type_size; auto d_sorted_keys = memory::Alloc(place, len * sizeof(KeyType)); - auto d_sorted_keys_ptr = reinterpret_cast(d_sorted_keys->ptr()); + auto d_sorted_keys_ptr = reinterpret_cast(d_sorted_keys->ptr()); auto d_merged_keys = memory::Alloc(place, len * sizeof(KeyType)); - auto d_merged_keys_ptr = 
reinterpret_cast(d_merged_keys->ptr()); + auto d_merged_keys_ptr = reinterpret_cast(d_merged_keys->ptr()); auto d_restore_idx = memory::Alloc(place, len * sizeof(uint32_t)); - auto d_restore_idx_ptr = reinterpret_cast(d_restore_idx->ptr()); + auto d_restore_idx_ptr = reinterpret_cast(d_restore_idx->ptr()); auto d_shard_keys = memory::Alloc(place, len * sizeof(KeyType)); - auto d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); + auto d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); auto d_shard_vals = memory::Alloc(place, len * val_type_size); - auto d_shard_vals_ptr = reinterpret_cast(d_shard_vals->ptr()); - - size_t uniq_len = 0; - merge_keys(num, - d_keys, - len, - d_sorted_keys_ptr, - d_merged_keys_ptr, - d_restore_idx_ptr, - uniq_len); + auto d_shard_vals_ptr = reinterpret_cast(d_shard_vals->ptr()); + + size_t uniq_len = merge_keys(num, + d_keys, + len, + d_sorted_keys_ptr, + d_merged_keys_ptr, + d_restore_idx_ptr, + stream); sync_stream(stream); auto d_idx = memory::Alloc(place, uniq_len * sizeof(int)); - auto d_idx_ptr = reinterpret_cast(d_idx->ptr()); - split_input_to_shard( - d_merged_keys_ptr, d_idx_ptr, uniq_len, d_left_ptr, d_right_ptr, num); + auto d_idx_ptr = reinterpret_cast(d_idx->ptr()); + split_idx_to_shard(d_merged_keys_ptr, + d_idx_ptr, + uniq_len, + d_left_ptr, + d_right_ptr, + num, + stream); heter_comm_kernel_->fill_shard_key( d_shard_keys_ptr, d_merged_keys_ptr, d_idx_ptr, uniq_len, stream); sync_stream(stream); @@ -1383,7 +1513,7 @@ void HeterComm::pull_merge_sparse( total_device * sizeof(int), stream); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { for (int i = 0; i < total_device; ++i) { int shard_len = h_right[i] - h_left[i] + 1; if (h_left[i] == -1 || h_right[i] == -1) { @@ -1399,25 +1529,25 @@ void HeterComm::pull_merge_sparse( if (h_left[i] == -1) { continue; } - auto& node = path_[num][i].nodes_.back(); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + auto &node = path_[num][i].nodes_.back(); + if (!enable_gpu_direct_access_) { sync_stream(node.in_stream); } AnyDeviceGuard guard(resource_->dev_id(i)); ptr_tables_[i]->rwlock_->RDLock(); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { - ptr_tables_[i]->get(reinterpret_cast(node.key_storage), + if (!enable_gpu_direct_access_) { + ptr_tables_[i]->get(reinterpret_cast(node.key_storage), node.val_storage, h_right[i] - h_left[i] + 1, resource_->remote_stream(i, num), gpu_accessor_); } else { - ptr_tables_[i]->get( - d_shard_keys_ptr + h_left[i], - reinterpret_cast(d_shard_vals_ptr) + h_left[i] * val_type_size, - h_right[i] - h_left[i] + 1, - resource_->remote_stream(i, num), - gpu_accessor_); + ptr_tables_[i]->get(d_shard_keys_ptr + h_left[i], + reinterpret_cast(d_shard_vals_ptr) + + h_left[i] * val_type_size, + h_right[i] - h_left[i] + 1, + resource_->remote_stream(i, num), + gpu_accessor_); } } @@ -1429,21 +1559,21 @@ void HeterComm::pull_merge_sparse( ptr_tables_[i]->rwlock_->UNLock(); } - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { walk_to_src(num, total_device, h_left, h_right, - reinterpret_cast(d_shard_vals_ptr), + reinterpret_cast(d_shard_vals_ptr), val_type_size); for (int i = 0; i < total_device; ++i) { - auto& node = path_[num][i].nodes_.front(); + auto &node = path_[num][i].nodes_.front(); sync_stream(node.out_stream); } } auto d_merged_vals = memory::Alloc(place, uniq_len * val_type_size); - auto d_merged_vals_ptr = reinterpret_cast(d_merged_vals->ptr()); + auto d_merged_vals_ptr = 
reinterpret_cast(d_merged_vals->ptr()); heter_comm_kernel_->dy_mf_fill_dvals(d_shard_vals_ptr, d_merged_vals_ptr, d_idx_ptr, @@ -1461,7 +1591,7 @@ void HeterComm::pull_merge_sparse( stream); sync_stream(stream); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { for (int i = 0; i < total_device; ++i) { if (h_left[i] == -1 || h_right[i] == -1) { continue; @@ -1475,7 +1605,7 @@ template void HeterComm::pull_normal_sparse( - int num, KeyType* d_keys, float* d_vals, size_t len) { + const int num, KeyType *d_keys, float *d_vals, size_t len) { int total_device = resource_->total_device(); int dev_id = resource_->dev_id(num); DevPlace place = DevPlace(dev_id); @@ -1487,8 +1617,8 @@ void HeterComm::pull_normal_sparse( auto d_left = memory::Alloc(place, total_device * sizeof(int)); auto d_right = memory::Alloc(place, total_device * sizeof(int)); - int* d_left_ptr = reinterpret_cast(d_left->ptr()); - int* d_right_ptr = reinterpret_cast(d_right->ptr()); + int *d_left_ptr = reinterpret_cast(d_left->ptr()); + int *d_right_ptr = reinterpret_cast(d_right->ptr()); #if defined(PADDLE_WITH_CUDA) cudaMemsetAsync(d_left_ptr, -1, total_device * sizeof(int), stream); @@ -1516,18 +1646,19 @@ void HeterComm::pull_normal_sparse( #endif auto d_idx = memory::Alloc(place, len * sizeof(int)); - int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); + int *d_idx_ptr = reinterpret_cast(d_idx->ptr()); auto accessor_wrapper_ptr = GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); size_t val_type_size = accessor_wrapper_ptr->GetPullValueSize(max_mf_dim_); VLOG(3) << "pull_sparse len:" << len << " val_type_size: " << val_type_size; auto d_shard_keys = memory::Alloc(place, len * sizeof(KeyType)); - KeyType* d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); + KeyType *d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); auto d_shard_vals = memory::Alloc(place, len * val_type_size); - float* d_shard_vals_ptr = reinterpret_cast(d_shard_vals->ptr()); + float *d_shard_vals_ptr = reinterpret_cast(d_shard_vals->ptr()); - split_input_to_shard(d_keys, d_idx_ptr, len, d_left_ptr, d_right_ptr, num); + split_idx_to_shard( + d_keys, d_idx_ptr, len, d_left_ptr, d_right_ptr, num, stream); heter_comm_kernel_->fill_shard_key( d_shard_keys_ptr, d_keys, d_idx_ptr, len, stream); @@ -1550,7 +1681,7 @@ void HeterComm::pull_normal_sparse( total_device * sizeof(int), stream); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { for (int i = 0; i < total_device; ++i) { int shard_len = h_right[i] - h_left[i] + 1; if (h_left[i] == -1 || h_right[i] == -1) { @@ -1565,25 +1696,25 @@ void HeterComm::pull_normal_sparse( if (h_left[i] == -1) { continue; } - auto& node = path_[num][i].nodes_.back(); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + auto &node = path_[num][i].nodes_.back(); + if (!enable_gpu_direct_access_) { sync_stream(node.in_stream); } AnyDeviceGuard guard(resource_->dev_id(i)); ptr_tables_[i]->rwlock_->RDLock(); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { - ptr_tables_[i]->get(reinterpret_cast(node.key_storage), + if (!enable_gpu_direct_access_) { + ptr_tables_[i]->get(reinterpret_cast(node.key_storage), node.val_storage, h_right[i] - h_left[i] + 1, resource_->remote_stream(i, num), gpu_accessor_); } else { - ptr_tables_[i]->get( - d_shard_keys_ptr + h_left[i], - reinterpret_cast(d_shard_vals_ptr) + h_left[i] * val_type_size, - h_right[i] - h_left[i] + 1, - resource_->remote_stream(i, num), - gpu_accessor_); + ptr_tables_[i]->get(d_shard_keys_ptr + 
h_left[i], + reinterpret_cast(d_shard_vals_ptr) + + h_left[i] * val_type_size, + h_right[i] - h_left[i] + 1, + resource_->remote_stream(i, num), + gpu_accessor_); } } @@ -1594,15 +1725,15 @@ void HeterComm::pull_normal_sparse( } ptr_tables_[i]->rwlock_->UNLock(); } - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { walk_to_src(num, total_device, h_left, h_right, - reinterpret_cast(d_shard_vals_ptr), + reinterpret_cast(d_shard_vals_ptr), val_type_size); for (int i = 0; i < total_device; ++i) { - auto& node = path_[num][i].nodes_.front(); + auto &node = path_[num][i].nodes_.front(); sync_stream(node.out_stream); } } @@ -1611,7 +1742,7 @@ void HeterComm::pull_normal_sparse( sync_stream(stream); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { for (int i = 0; i < total_device; ++i) { if (h_left[i] == -1 || h_right[i] == -1) { continue; @@ -1626,14 +1757,18 @@ template void HeterComm::pull_sparse( - int num, KeyType* d_keys, float* d_vals, size_t len) { + int num, KeyType *d_keys, float *d_vals, size_t len) { if (len == 0) { return; } - if (!FLAGS_gpugraph_dedup_pull_push_mode) { - pull_merge_sparse(num, d_keys, d_vals, len); + if (multi_node_) { + pull_sparse_all2all(num, d_keys, d_vals, len); } else { - pull_normal_sparse(num, d_keys, d_vals, len); + if (!FLAGS_gpugraph_dedup_pull_push_mode) { + pull_merge_sparse(num, d_keys, d_vals, len); + } else { + pull_normal_sparse(num, d_keys, d_vals, len); + } } } @@ -1645,10 +1780,27 @@ template void HeterComm::push_sparse( int dev_num, - KeyType* d_keys, - float* d_grads, + KeyType *d_keys, + float *d_grads, size_t len, - Sgd& sgd) { // NOLINT + Sgd &sgd) { // NOLINT + if (multi_node_) { + push_sparse_all2all(dev_num, d_keys, d_grads, len, sgd); + } else { + push_normal_sparse(dev_num, d_keys, d_grads, len, sgd); + } +} +template +template +void HeterComm::push_normal_sparse( + int dev_num, + KeyType *d_keys, + float *d_grads, + size_t len, + Sgd &sgd) { // NOLINT if (len == 0) { return; } @@ -1668,8 +1820,8 @@ void HeterComm::push_sparse( auto d_left = memory::Alloc(place, total_device * sizeof(int)); auto d_right = memory::Alloc(place, total_device * sizeof(int)); - int* d_left_ptr = reinterpret_cast(d_left->ptr()); - int* d_right_ptr = reinterpret_cast(d_right->ptr()); + int *d_left_ptr = reinterpret_cast(d_left->ptr()); + int *d_right_ptr = reinterpret_cast(d_right->ptr()); #if defined(PADDLE_WITH_CUDA) cudaMemsetAsync(d_left_ptr, -1, total_device * sizeof(int), stream); @@ -1697,14 +1849,14 @@ void HeterComm::push_sparse( #endif auto d_idx = memory::Alloc(place, len * sizeof(int)); - int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); + int *d_idx_ptr = reinterpret_cast(d_idx->ptr()); auto d_shard_keys = memory::Alloc(place, len * sizeof(KeyType)); - KeyType* d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); + KeyType *d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); - float* d_shard_grads_ptr; + float *d_shard_grads_ptr; auto d_shard_grads = memory::Alloc(place, len * grad_value_size); - d_shard_grads_ptr = reinterpret_cast(d_shard_grads->ptr()); + d_shard_grads_ptr = reinterpret_cast(d_shard_grads->ptr()); int uniq_len = len; if (!FLAGS_gpugraph_dedup_pull_push_mode) { @@ -1727,8 +1879,8 @@ void HeterComm::push_sparse( } } - split_input_to_shard( - d_keys, d_idx_ptr, uniq_len, d_left_ptr, d_right_ptr, dev_num); + split_idx_to_shard( + d_keys, d_idx_ptr, uniq_len, d_left_ptr, d_right_ptr, dev_num, stream); 
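For intuition about the shard split used just above: split_idx_to_shard produces a permutation d_idx that groups key positions by destination device, plus per-device [left, right] ranges. A hypothetical host-side equivalent, assuming the shard id is key % total_device as in calc_shard_index (illustrative only, not part of this patch):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

void split_idx_to_shard_cpu(const std::vector<uint64_t> &keys,
                            int total_device, std::vector<int> *idx,
                            std::vector<int> *left, std::vector<int> *right) {
  const int n = static_cast<int>(keys.size());
  idx->resize(n);
  std::iota(idx->begin(), idx->end(), 0);
  // Group key positions by destination device, as the radix sort over the
  // computed shard index does on the GPU.
  std::stable_sort(idx->begin(), idx->end(), [&](int a, int b) {
    return keys[a] % total_device < keys[b] % total_device;
  });
  left->assign(total_device, -1);
  right->assign(total_device, -1);
  for (int i = 0; i < n; ++i) {
    const int shard = static_cast<int>(keys[(*idx)[i]] % total_device);
    if ((*left)[shard] == -1) (*left)[shard] = i;
    (*right)[shard] = i;  // ranges are contiguous after grouping
  }
}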
heter_comm_kernel_->dy_mf_fill_shard_grads(d_shard_keys_ptr, d_keys, @@ -1757,7 +1909,7 @@ void HeterComm::push_sparse( total_device * sizeof(int), stream); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { for (int i = 0; i < total_device; ++i) { int shard_len = h_right[i] - h_left[i] + 1; if (h_left[i] == -1 || h_right[i] == -1) { @@ -1772,7 +1924,7 @@ void HeterComm::push_sparse( h_left, h_right, d_shard_keys_ptr, - reinterpret_cast(d_shard_grads_ptr), + reinterpret_cast(d_shard_grads_ptr), grad_value_size); } @@ -1780,22 +1932,22 @@ void HeterComm::push_sparse( if (h_left[i] == -1 || h_right[i] == -1) { continue; } - auto& node = path_[dev_num][i].nodes_.back(); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + auto &node = path_[dev_num][i].nodes_.back(); + if (!enable_gpu_direct_access_) { sync_stream(node.in_stream); } AnyDeviceGuard guard(resource_->dev_id(i)); ptr_tables_[i]->rwlock_->WRLock(); - if (!FLAGS_gpugraph_enable_gpu_direct_access) { - ptr_tables_[i]->update(reinterpret_cast(node.key_storage), + if (!enable_gpu_direct_access_) { + ptr_tables_[i]->update(reinterpret_cast(node.key_storage), node.val_storage, h_right[i] - h_left[i] + 1, sgd, resource_->remote_stream(i, dev_num)); } else { ptr_tables_[i]->update(d_shard_keys_ptr + h_left[i], - reinterpret_cast(d_shard_grads_ptr) + + reinterpret_cast(d_shard_grads_ptr) + grad_value_size * h_left[i], h_right[i] - h_left[i] + 1, sgd, @@ -1806,15 +1958,11 @@ void HeterComm::push_sparse( for (int i = 0; i < total_device; ++i) { sync_stream(resource_->remote_stream(i, dev_num)); if (h_left[i] != -1) { - if (!multi_mf_dim_) { - tables_[i]->rwlock_->UNLock(); - } else { - ptr_tables_[i]->rwlock_->UNLock(); - } + ptr_tables_[i]->rwlock_->UNLock(); } } - if (!FLAGS_gpugraph_enable_gpu_direct_access) { + if (!enable_gpu_direct_access_) { for (int i = 0; i < total_device; ++i) { if (h_left[i] == -1 || h_right[i] == -1) { continue; @@ -1830,7 +1978,7 @@ template void HeterComm::push_sparse( - int dev_num, KeyType* d_keys, GradType* d_grads, size_t len) { + int dev_num, KeyType *d_keys, GradType *d_grads, size_t len) { if (len == 0) { return; } @@ -1847,8 +1995,8 @@ void HeterComm::push_sparse( auto d_left = memory::Alloc(place, total_device * sizeof(int)); auto d_right = memory::Alloc(place, total_device * sizeof(int)); - int* d_left_ptr = reinterpret_cast(d_left->ptr()); - int* d_right_ptr = reinterpret_cast(d_right->ptr()); + int *d_left_ptr = reinterpret_cast(d_left->ptr()); + int *d_right_ptr = reinterpret_cast(d_right->ptr()); #if defined(PADDLE_WITH_CUDA) cudaMemsetAsync(d_left_ptr, -1, total_device * sizeof(int), stream); @@ -1876,26 +2024,26 @@ void HeterComm::push_sparse( #endif auto d_idx = memory::Alloc(place, len * sizeof(int)); - int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); + int *d_idx_ptr = reinterpret_cast(d_idx->ptr()); auto d_shard_keys = memory::Alloc(place, len * sizeof(KeyType)); - KeyType* d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); + KeyType *d_shard_keys_ptr = reinterpret_cast(d_shard_keys->ptr()); auto d_shard_grads = memory::Alloc(place, len * sizeof(GradType)); - GradType* d_shard_grads_ptr = - reinterpret_cast(d_shard_grads->ptr()); + GradType *d_shard_grads_ptr = + reinterpret_cast(d_shard_grads->ptr()); int uniq_len = len; merge_grad(dev_num, d_keys, d_grads, len, uniq_len); - split_input_to_shard( - d_keys, d_idx_ptr, uniq_len, d_left_ptr, d_right_ptr, dev_num); + split_idx_to_shard( + d_keys, d_idx_ptr, uniq_len, d_left_ptr, d_right_ptr, dev_num, 
stream); heter_comm_kernel_->fill_shard_grads(d_shard_keys_ptr, d_keys, d_shard_grads_ptr, d_grads, d_idx_ptr, - (long long)uniq_len, + (int64_t)uniq_len, stream); sync_stream(stream); @@ -1935,13 +2083,13 @@ void HeterComm::push_sparse( if (h_left[i] == -1 || h_right[i] == -1) { continue; } - auto& node = path_[dev_num][i].nodes_.back(); + auto &node = path_[dev_num][i].nodes_.back(); sync_stream(node.in_stream); AnyDeviceGuard guard(resource_->dev_id(i)); tables_[i]->rwlock_->WRLock(); - tables_[i]->update(reinterpret_cast(node.key_storage), - reinterpret_cast(node.val_storage), + tables_[i]->update(reinterpret_cast(node.key_storage), + reinterpret_cast(node.val_storage), h_right[i] - h_left[i] + 1, resource_->remote_stream(i, dev_num)); } @@ -1970,22 +2118,31 @@ template template void HeterComm::update_one_table( - int gpu_num, - KeyType* d_keys, - GradType* d_grads, + int gpu_id, + KeyType *d_keys, + GradType *d_grads, size_t len, - Sgd& sgd) { // NOLINT + Sgd &sgd) { // NOLINT if (len == 0) { return; } - int dev_id = resource_->dev_id(gpu_num); + int dev_id = resource_->dev_id(gpu_id); platform::CUDADeviceGuard guard(dev_id); - tables_[gpu_num]->rwlock_->WRLock(); - tables_[gpu_num]->update( - d_keys, d_grads, len, sgd, resource_->remote_stream(gpu_num, gpu_num)); - tables_[gpu_num]->rwlock_->UNLock(); - cudaStreamSynchronize(resource_->remote_stream(gpu_num, gpu_num)); + auto stream = resource_->local_stream(gpu_id, 0); + // no mf dim + if (!multi_mf_dim_) { + auto &table = tables_[gpu_id]; + table->rwlock_->WRLock(); + table->update(d_keys, (const char *)d_grads, len, sgd, stream); + table->rwlock_->UNLock(); + } else { + auto &table = ptr_tables_[gpu_id]; + table->rwlock_->WRLock(); + table->update(d_keys, (const char *)d_grads, len, sgd, stream); + table->rwlock_->UNLock(); + } + cudaStreamSynchronize(stream); } template void HeterComm::push_sparse_multi_node( int gpu_num, - KeyType* d_keys, - GradType* d_grads, + KeyType *d_keys, + GradType *d_grads, size_t len, - Sgd& sgd) { // NOLINT + Sgd &sgd) { // NOLINT if (len == 0) { return; } @@ -2008,14 +2165,15 @@ void HeterComm::push_sparse_multi_node( uniq_len = gather_one_node_grad(gpu_num, d_keys, d_grads, uniq_len); - uniq_len = gather_multi_node_grad(gpu_num, - storage_[gpu_num].local_keys, - storage_[gpu_num].local_grads, - uniq_len); + uniq_len = gather_multi_node_grad( + gpu_num, + storage_[gpu_num].local_keys, + reinterpret_cast(storage_[gpu_num].local_grads), + uniq_len); update_one_table(gpu_num, storage_[gpu_num].local_keys, - storage_[gpu_num].local_grads, + reinterpret_cast(storage_[gpu_num].local_grads), uniq_len, sgd); } @@ -2025,10 +2183,10 @@ template int HeterComm::gather_one_node_grad( - int gpu_num, KeyType* d_keys, GradType* d_grads, int len) { + int gpu_num, KeyType *d_keys, GradType *d_grads, int len) { int total_gpu = resource_->total_device(); int dev_id = resource_->dev_id(gpu_num); - auto& storage = storage_[gpu_num]; + auto &storage = storage_[gpu_num]; platform::CUDAPlace place = platform::CUDAPlace(dev_id); platform::CUDADeviceGuard guard(dev_id); auto stream = resource_->local_stream(gpu_num, 0); @@ -2038,7 +2196,7 @@ int HeterComm::gather_one_node_grad( // alloc for size int h_node_len[total_gpu]; // NOLINT auto d_node_len_mem = memory::Alloc(place, total_gpu * sizeof(int)); - int* d_node_len = reinterpret_cast(d_node_len_mem->ptr()); + int *d_node_len = reinterpret_cast(d_node_len_mem->ptr()); h_node_len[gpu_num] = len; cudaMemcpy(d_node_len + gpu_num, @@ -2049,8 +2207,8 @@ int 
HeterComm::gather_one_node_grad( // allgather grad len PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclGroupStart()); PADDLE_ENFORCE_GPU_SUCCESS( - platform::dynload::ncclAllGather((const void*)(d_node_len + gpu_num), - (void*)d_node_len, + platform::dynload::ncclAllGather(d_node_len + gpu_num, + d_node_len, 1, // NOLINT ncclInt, // NOLINT nccl_inner_comm, @@ -2072,13 +2230,13 @@ int HeterComm::gather_one_node_grad( PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclAllGather( d_keys, storage.all_keys, max_size, ncclUint64, nccl_inner_comm, stream)); - PADDLE_ENFORCE_GPU_SUCCESS( - platform::dynload::ncclAllGather(d_grads, - storage.all_grads, - max_size * sizeof(GradType), - ncclUint8, - nccl_inner_comm, - stream)); + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclAllGather( + d_grads, + reinterpret_cast(storage.all_grads), + max_size * sizeof(GradType), + ncclUint8, + nccl_inner_comm, + stream)); PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclGroupEnd()); PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); @@ -2086,41 +2244,47 @@ int HeterComm::gather_one_node_grad( int h_right[total_gpu]; // NOLINT auto d_left = memory::Alloc(place, total_gpu * sizeof(int)); auto d_right = memory::Alloc(place, total_gpu * sizeof(int)); - int* d_left_ptr = reinterpret_cast(d_left->ptr()); - int* d_right_ptr = reinterpret_cast(d_right->ptr()); + int *d_left_ptr = reinterpret_cast(d_left->ptr()); + int *d_right_ptr = reinterpret_cast(d_right->ptr()); int merge_num = 0; for (int i = 0; i < total_gpu; ++i) { int index = i * max_size; auto d_idx = memory::Alloc(place, h_node_len[i] * sizeof(int)); - int* d_idx_ptr = reinterpret_cast(d_idx->ptr()); + int *d_idx_ptr = reinterpret_cast(d_idx->ptr()); cudaMemset(d_left_ptr, -1, total_gpu * sizeof(int)); cudaMemset(d_right_ptr, -1, total_gpu * sizeof(int)); - split_input_to_shard(storage.all_keys + index, - d_idx_ptr, - h_node_len[i], - d_left_ptr, - d_right_ptr, - gpu_num); + split_idx_to_shard(storage.all_keys + index, + d_idx_ptr, + h_node_len[i], + d_left_ptr, + d_right_ptr, + gpu_num, + stream); cudaMemcpy( h_left, d_left_ptr, total_gpu * sizeof(int), cudaMemcpyDeviceToHost); cudaMemcpy( h_right, d_right_ptr, total_gpu * sizeof(int), cudaMemcpyDeviceToHost); - heter_comm_kernel_->fill_shard_grads(storage.local_keys + merge_num, - storage.all_keys + index, - storage.local_grads + merge_num, - storage.all_grads + index, - d_idx_ptr + h_left[gpu_num], - h_right[gpu_num] - h_left[gpu_num] + 1, - stream); + heter_comm_kernel_->fill_shard_grads( + storage.local_keys + merge_num, + storage.all_keys + index, + reinterpret_cast(storage.local_grads) + merge_num, + reinterpret_cast(storage.all_grads) + index, + d_idx_ptr + h_left[gpu_num], + h_right[gpu_num] - h_left[gpu_num] + 1, + stream); merge_num = merge_num + h_right[gpu_num] - h_left[gpu_num] + 1; } int ret = merge_num; - merge_grad(gpu_num, storage.local_keys, storage.local_grads, merge_num, ret); + merge_grad(gpu_num, + storage.local_keys, + reinterpret_cast(storage.local_grads), + merge_num, + ret); return ret; } @@ -2129,9 +2293,9 @@ template int HeterComm::gather_multi_node_grad( - int gpu_num, KeyType* d_keys, GradType* d_grads, int len) { + int gpu_num, KeyType *d_keys, GradType *d_grads, int len) { int dev_id = resource_->dev_id(gpu_num); - auto& storage = storage_[gpu_num]; + auto &storage = storage_[gpu_num]; platform::CUDAPlace place = platform::CUDAPlace(dev_id); platform::CUDADeviceGuard guard(dev_id); auto stream = resource_->local_stream(gpu_num, 0); @@ -2140,7 +2304,7 @@ int 
HeterComm::gather_multi_node_grad( // alloc for size int h_node_len[node_size_]; // NOLINT auto d_node_len_mem = memory::Alloc(place, node_size_ * sizeof(int)); - int* d_node_len = reinterpret_cast(d_node_len_mem->ptr()); + int *d_node_len = reinterpret_cast(d_node_len_mem->ptr()); h_node_len[0] = len; cudaMemcpy(d_node_len, h_node_len, sizeof(int), cudaMemcpyHostToDevice); @@ -2166,13 +2330,13 @@ int HeterComm::gather_multi_node_grad( PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclAllGather( d_keys, storage.all_keys, max_size, ncclUint64, nccl_inter_comm, stream)); - PADDLE_ENFORCE_GPU_SUCCESS( - platform::dynload::ncclAllGather(d_grads, - storage.all_grads, - max_size * sizeof(GradType), - ncclUint8, - nccl_inter_comm, - stream)); + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclAllGather( + d_grads, + reinterpret_cast(storage.all_grads), + max_size * sizeof(GradType), + ncclUint8, + nccl_inter_comm, + stream)); PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclGroupEnd()); PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); @@ -2184,16 +2348,21 @@ int HeterComm::gather_multi_node_grad( h_node_len[i], cudaMemcpyDefault, stream); - cudaMemcpyAsync(storage.local_grads + merge_num, - storage.all_grads + index, - h_node_len[i], - cudaMemcpyDefault, - stream); + cudaMemcpyAsync( + reinterpret_cast(storage.local_grads) + merge_num, + reinterpret_cast(storage.all_grads) + index, + h_node_len[i], + cudaMemcpyDefault, + stream); merge_num += h_node_len[i]; } int ret = merge_num; - merge_grad(gpu_num, storage.local_keys, storage.local_grads, merge_num, ret); + merge_grad(gpu_num, + storage.local_keys, + reinterpret_cast(storage.local_grads), + merge_num, + ret); return ret; } #endif @@ -2217,13 +2386,12 @@ void HeterComm::end_pass() { for (int i = 0; i < total_device; ++i) { threads.push_back(std::thread(dump_to_cpu_func, i)); } - for (auto& t : threads) { + for (auto &t : threads) { t.join(); } } } -#if defined(PADDLE_WITH_CUDA) template ::dedup_keys_and_fillidx( const int gpu_id, const int total_fea_num, - const KeyType* d_keys, // input - KeyType* d_merged_keys, // output - KeyType* d_sorted_keys, - uint32_t* d_restore_idx, - uint32_t* d_sorted_idx, - uint32_t* d_offset, - uint32_t* d_merged_cnts, - bool filter_zero) { - int dev_id = resource_->dev_id(gpu_id); - platform::CUDAPlace place = platform::CUDAPlace(dev_id); - platform::CUDADeviceGuard guard(dev_id); - auto stream = resource_->local_stream(gpu_id, 0); + const KeyType *d_keys, // input + KeyType *d_merged_keys, // output + KeyType *d_sorted_keys, + uint32_t *d_restore_idx, + uint32_t *d_sorted_idx, + uint32_t *d_offset, + uint32_t *d_merged_cnts, + bool filter_zero, + cudaStream_t stream) { + platform::CUDAPlace place = platform::CUDAPlace(gpu_id); + platform::CUDADeviceGuard guard(gpu_id); + if (stream == 0) { + stream = resource_->local_stream(gpu_id, 0); + } assert(total_fea_num > 0); int merged_size = 0; size_t byte_size = sizeof(uint32_t) * (total_fea_num + 1); - auto d_index_ptr = memory::Alloc(place, byte_size); - uint32_t* d_index_in = reinterpret_cast(d_index_ptr->ptr()); - int* d_merged_size = reinterpret_cast(&d_index_in[total_fea_num]); + auto d_index_ptr = memory::Alloc( + place, byte_size, phi::Stream(reinterpret_cast(stream))); + uint32_t *d_index_in = reinterpret_cast(d_index_ptr->ptr()); + int *d_merged_size = reinterpret_cast(&d_index_in[total_fea_num]); heter_comm_kernel_->fill_idx(d_index_in, total_fea_num, stream); - void* d_buf = NULL; + void *d_buf = NULL; size_t temp_storage_bytes = 0; 
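The dedup_keys_and_fillidx body that follows is a sort / run-length-encode / prefix-sum pipeline: sort (key, original index) pairs, encode the sorted keys into unique keys with counts, prefix-sum the counts into offsets, and build d_restore_idx so every original position can find its merged slot (optionally filtering zero keys via filter_zero). A hypothetical host-side equivalent, for intuition only and not part of this patch:

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

size_t dedup_keys_cpu(const std::vector<uint64_t> &keys,
                      std::vector<uint64_t> *merged_keys,
                      std::vector<uint32_t> *restore_idx) {
  const size_t n = keys.size();
  std::vector<uint32_t> order(n);
  std::iota(order.begin(), order.end(), 0);
  // Sort original positions by key, mirroring cub::DeviceRadixSort::SortPairs.
  std::stable_sort(order.begin(), order.end(),
                   [&](uint32_t a, uint32_t b) { return keys[a] < keys[b]; });
  merged_keys->clear();
  restore_idx->assign(n, 0);
  for (size_t i = 0; i < n; ++i) {
    const uint64_t k = keys[order[i]];
    if (merged_keys->empty() || merged_keys->back() != k) {
      merged_keys->push_back(k);  // run-length encode: keep one copy per key
    }
    // Every duplicate of k maps back to the same merged slot.
    (*restore_idx)[order[i]] = static_cast<uint32_t>(merged_keys->size() - 1);
  }
  return merged_keys->size();  // merged_size / uniq_len
}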
PADDLE_ENFORCE_GPU_SUCCESS( cub::DeviceRadixSort::SortPairs(NULL, @@ -2268,8 +2439,11 @@ int HeterComm::dedup_keys_and_fillidx( 8 * sizeof(KeyType), stream, false)); - auto d_cache_ptr = memory::Alloc(place, temp_storage_bytes); - d_buf = reinterpret_cast(d_cache_ptr->ptr()); + auto d_cache_ptr = + memory::Alloc(place, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); + d_buf = reinterpret_cast(d_cache_ptr->ptr()); PADDLE_ENFORCE_GPU_SUCCESS( cub::DeviceRadixSort::SortPairs(d_buf, temp_storage_bytes, @@ -2294,9 +2468,12 @@ int HeterComm::dedup_keys_and_fillidx( stream)); if (d_cache_ptr->size() < temp_storage_bytes) { d_cache_ptr = NULL; - d_cache_ptr = memory::Alloc(place, temp_storage_bytes); + d_cache_ptr = + memory::Alloc(place, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); } - d_buf = reinterpret_cast(d_cache_ptr->ptr()); + d_buf = reinterpret_cast(d_cache_ptr->ptr()); PADDLE_ENFORCE_GPU_SUCCESS( cub::DeviceRunLengthEncode::Encode(d_buf, temp_storage_bytes, @@ -2306,8 +2483,8 @@ int HeterComm::dedup_keys_and_fillidx( d_merged_size, total_fea_num, stream)); - PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync((void*)&merged_size, - (void*)d_merged_size, + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(&merged_size, + d_merged_size, sizeof(int), cudaMemcpyDeviceToHost, stream)); @@ -2317,9 +2494,12 @@ int HeterComm::dedup_keys_and_fillidx( NULL, temp_storage_bytes, d_merged_cnts, d_offset, merged_size, stream)); if (d_cache_ptr->size() < temp_storage_bytes) { d_cache_ptr = NULL; - d_cache_ptr = memory::Alloc(place, temp_storage_bytes); + d_cache_ptr = + memory::Alloc(place, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); } - d_buf = reinterpret_cast(d_cache_ptr->ptr()); + d_buf = reinterpret_cast(d_cache_ptr->ptr()); PADDLE_ENFORCE_GPU_SUCCESS(cub::DeviceScan::ExclusiveSum( d_buf, temp_storage_bytes, d_merged_cnts, d_offset, merged_size, stream)); @@ -2341,14 +2521,1716 @@ int HeterComm::dedup_keys_and_fillidx( return merged_size; } -#endif -// template -// void HeterComm::dump_to_cpu(int index) { -// auto stream = resource_->local_stream(index, 0); -// int dev_id = resource_->dev_id(index); -// platform::CUDADeviceGuard guard(dev_id); -// tables_[index]->dump_to_cpu(dev_id, stream); -//} +template +void HeterComm::pull_one_table( + const int gpu_id, + KeyType *d_keys, + float *d_vals, + const size_t &len, + const cudaStream_t &stream) { + // tracker need zero + if (FLAGS_enable_tracker_all2all) { + cudaMemsetAsync(d_vals, 0, len * pull_type_size_, stream); + } + + ptr_tables_[gpu_id]->rwlock_->RDLock(); + ptr_tables_[gpu_id]->get( + d_keys, reinterpret_cast(d_vals), len, stream, gpu_accessor_); + ptr_tables_[gpu_id]->rwlock_->UNLock(); + + // tracker + if (FLAGS_enable_tracker_all2all) { + // check pull values + heter_comm_kernel_->check_valid_values(0, + len, + d_keys, + (const char *)d_vals, + pull_type_size_, + stream, + (gpu_id == 0)); + } +} +template +void HeterComm::pull_sparse_all2all( + const int &gpu_id, KeyType *d_keys, float *d_vals, const size_t &fea_num) { + AnyDeviceGuard guard(gpu_id); + auto &loc = storage_[gpu_id]; + // get from local table + auto stream = resource_->comm_stream(gpu_id, 0); + + size_t gather_inner_size = 0; + size_t pull_size = 0; + size_t value_bytes = pull_type_size_; + loc.all2all_span_.Resume(); + // enable inner gather + if (FLAGS_enable_sparse_inner_gather) { + loc.inner_span_.Resume(); + // gather keys of all gpu and select shard key + gather_inner_size = + gather_inter_keys_by_copy(gpu_id, fea_num, d_keys, 
stream); + loc.inner_span_.Pause(); + + loc.node_span_.Resume(); + // all2all mode begins. init resource, partition keys, pull vals by all2all + pull_size = gather_sparse_keys_by_all2all(gpu_id, + gather_inner_size, + loc.d_merged_keys, + loc.d_merged_keys, + loc.d_merged_push_keys, + stream); + loc.node_span_.Pause(); + + // pull one table + pull_one_table(gpu_id, + loc.d_merged_keys, + reinterpret_cast(loc.d_merged_vals), + pull_size, + stream); + + // all2all + loc.node_span_.Resume(); + // fp16 + if (FLAGS_enable_all2all_use_fp16) { + value_bytes = heter_comm_kernel_->compress_values( + pull_size, + (const char *)loc.d_merged_vals, + reinterpret_cast(loc.d_merged_push_vals), + pull_type_size_, + max_mf_dim_, + max_value_bound_, + stream); + + scatter_sparse_vals_by_all2all(gpu_id, + gather_inner_size, + loc.d_merged_push_vals, + loc.d_merged_push_vals, + value_bytes, + loc.d_merged_vals, + stream); + // unzip fp16 + heter_comm_kernel_->uncompress_values( + gather_inner_size, + (const char *)loc.d_merged_push_vals, + reinterpret_cast(loc.d_merged_vals), + pull_type_size_, + max_mf_dim_, + max_value_bound_, + stream); + + // pull + if (FLAGS_enable_tracker_all2all) { + heter_comm_kernel_->check_valid_values( + 4, + gather_inner_size, + loc.d_merged_push_keys, + (const char *)(loc.d_merged_vals), + pull_type_size_, + stream, + (gpu_id == 0)); + } + + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + } else { + scatter_sparse_vals_by_all2all(gpu_id, + gather_inner_size, + loc.d_merged_vals, + loc.d_merged_vals, + pull_type_size_, + loc.d_merged_push_vals, + stream); + } + loc.node_span_.Pause(); + + // innter scatter + loc.inner_span_.Resume(); + scatter_inter_vals_by_copy( + gpu_id, fea_num, loc.d_merged_vals, d_vals, pull_type_size_, stream); + loc.inner_span_.Pause(); + } else { + loc.alloc(fea_num, max_type_size_); + loc.node_span_.Resume(); + // all2all mode begins. 
init resource, partition keys, pull vals by all2all + pull_size = gather_sparse_keys_by_all2all(gpu_id, + fea_num, + d_keys, + loc.d_merged_keys, + loc.d_merged_push_keys, + stream); + loc.node_span_.Pause(); + // get all tables + pull_normal_sparse(gpu_id, + loc.d_merged_keys, + reinterpret_cast(loc.d_merged_vals), + pull_size); + // all2all + loc.node_span_.Resume(); + // fp16 + if (FLAGS_enable_all2all_use_fp16) { + value_bytes = heter_comm_kernel_->compress_values( + pull_size, + (const char *)loc.d_merged_vals, + reinterpret_cast(loc.d_merged_push_vals), + pull_type_size_, + max_mf_dim_, + max_value_bound_, + stream); + scatter_sparse_vals_by_all2all(gpu_id, + pull_size, + loc.d_merged_push_vals, + loc.d_merged_push_vals, + value_bytes, + loc.d_merged_vals, + stream); + heter_comm_kernel_->uncompress_values( + gather_inner_size, + (const char *)loc.d_merged_push_vals, + reinterpret_cast(loc.d_merged_vals), + pull_type_size_, + max_mf_dim_, + max_value_bound_, + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + } else { + scatter_sparse_vals_by_all2all(gpu_id, + pull_size, + loc.d_merged_vals, + d_vals, + pull_type_size_, + loc.d_merged_push_vals, + stream); + } + loc.node_span_.Pause(); + } + loc.all2all_span_.Pause(); + + // pull + if (FLAGS_enable_tracker_all2all) { + heter_comm_kernel_->check_valid_values(1, + fea_num, + d_keys, + (const char *)(d_vals), + pull_type_size_, + stream, + (gpu_id == 0)); + VLOG(0) << "pull gpu id=" << gpu_id << ", fea num=" << fea_num + << ", inner=" << gather_inner_size << ", node=" << pull_size + << ", fp16=" << FLAGS_enable_all2all_use_fp16 + << ", compress=" << value_bytes + << ", pull bytes=" << pull_type_size_; + } +} +template +void HeterComm::shard_inner_keys( + const size_t &total_fea_num, + const KeyType *d_keys, + const int &gpu_id, + const int &gpu_num, + HeterCommType::InnerResource *res, + const cudaStream_t &stream) { + std::vector h_offsets(gpu_num * 2); // NOLINT + uint32_t *d_left_ptr = res->d_offset_ptr; + cudaMemsetAsync(d_left_ptr, -1, gpu_num * 2 * sizeof(int), stream); + + uint32_t *d_right_ptr = &d_left_ptr[gpu_num]; + split_idx_to_shard(const_cast(d_keys), + res->d_idx, + total_fea_num, + d_left_ptr, + d_right_ptr, + gpu_id, + stream); + + cudaMemcpyAsync(&h_offsets[0], + d_left_ptr, + gpu_num * 2 * sizeof(uint32_t), + cudaMemcpyDeviceToHost, + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + for (int i = 0; i < gpu_num; ++i) { + uint32_t &h_right = h_offsets[gpu_num + i]; + uint32_t &h_left = h_offsets[i]; + if (static_cast(h_right) == -1 || static_cast(h_left) == -1) { + res->h_part_sizes[i] = 0; + } else { + res->h_part_sizes[i] = h_right - h_left + 1; + } + } +} +template +void HeterComm::gather_inner_keys_p2p( + const size_t &total_fea_num, + const KeyType *d_keys, + HeterCommType::InnerResource &res, // NOLINT + const int &gpu_id, + const int &gpu_num, + const int &trans_id, + const cudaStream_t &stream) { + // gather all datas + heter_comm_kernel_->gather_keys( + res.d_keys_parted, d_keys, res.d_idx, total_fea_num, stream); + if (trans_id < 0) { + // not need transfer + for (int i = 0; i < gpu_num; ++i) { + size_t &data_len = res.h_part_sizes[i]; + if (data_len == 0) { + continue; + } + size_t &offset = res.h_offsets[i]; + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_remote_keys[i], + i, + &res.d_keys_parted[offset], + gpu_id, + data_len * sizeof(KeyType), + stream)); + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + return; + } + // need transfer + 
for (int i = 0; i < gpu_num; ++i) { + size_t data_len = res.h_part_sizes[i]; + if (data_len == 0) { + continue; + } + size_t &offset = res.h_offsets[i]; + // printf("[%d->%d->%d]send keys offset: %ld, len: %ld\n", gpu_id, + // trans_id, i, offset, data_len); + if (!need_transfer(gpu_id, i)) { + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_remote_keys[i], + i, + &res.d_keys_parted[offset], + gpu_id, + data_len * sizeof(KeyType), + stream)); + continue; + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_trans_keys, + trans_id, + &res.d_keys_parted[offset], + gpu_id, + data_len * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_remote_keys[i], + i, + res.d_trans_keys, + trans_id, + data_len * sizeof(KeyType), + stream)); + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); +} +template +size_t +HeterComm::gather_inter_keys_by_copy( + const int &gpu_id, + const size_t &fea_size, + const KeyType *d_keys, + const cudaStream_t &stream) { + auto &my_cache = storage_[gpu_id]; + auto &res = my_cache.inner_res; + my_cache.init_inner(fea_size, device_num_); + res.h_part_sizes = &my_cache.h_fea_sizes[0]; + // shard inner keys + shard_inner_keys(fea_size, d_keys, gpu_id, device_num_, &res, stream); + + my_cache.inner_barrier_.Resume(); + // barrier wait all gpu done + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + + size_t max_part_size = 0; + size_t shard_recv_offset = 0; + size_t shard_send_offset = 0; + for (int i = 0; i < device_num_; ++i) { + auto &cache = storage_[i]; + my_cache.h_recv_offsets[i] = shard_recv_offset; + shard_recv_offset += cache.h_fea_sizes[gpu_id]; + res.h_offsets[i] = shard_send_offset; + shard_send_offset += res.h_part_sizes[i]; + if (max_part_size < res.h_part_sizes[i]) { + max_part_size = res.h_part_sizes[i]; + } + } + CHECK(shard_send_offset == static_cast(fea_size)); + + size_t trans_need_size = + std::max(shard_recv_offset, static_cast(fea_size)); + int trans_id = -1; + if (topo_aware_ && device_num_ > 4) { + trans_id = get_transfer_devid(gpu_id); + storage_[trans_id].h_trans_size = max_part_size; + // barrier wait all set trans length [0-4, 1-5, 3-7, 2-6] + barrier_.wait(); + my_cache.h_trans_offset = trans_need_size; + trans_need_size += my_cache.h_trans_size; + } + my_cache.alloc(trans_need_size, max_type_size_); + + my_cache.inner_barrier_.Resume(); + // barrier wait all hbm malloc size + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + + for (int i = 0; i < device_num_; ++i) { + auto &cache = storage_[i]; + size_t &recv_offset = cache.h_recv_offsets[gpu_id]; + res.d_remote_keys[i] = &cache.d_merged_keys[recv_offset]; + if (trans_id >= 0) { + // set transfer buffer + auto &trans_cache = storage_[trans_id]; + res.d_trans_keys = &trans_cache.d_merged_keys[trans_cache.h_trans_offset]; + } + } + res.d_keys_parted = my_cache.d_merged_push_keys; + my_cache.inner_barrier_.Resume(); + // barrier wait set buffer ptr + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + gather_inner_keys_p2p( + fea_size, d_keys, res, gpu_id, device_num_, trans_id, stream); + // barrier wait all gpu aync memcpy data + my_cache.inner_barrier_.Resume(); + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + + my_cache.init_pull(shard_recv_offset); + + size_t uniq_len = merge_keys(gpu_id, + my_cache.d_merged_keys, // in keys + shard_recv_offset, + my_cache.d_merged_push_keys, // sort keys + my_cache.d_merged_keys, // out merge keys + my_cache.pull_res.d_restore_keys_idx, + stream); + + return uniq_len; +} +template +void 
HeterComm::partition_shard_keys( + const int &gpu_id, + const size_t &len, + const KeyType *d_keys, + uint32_t *d_idx_parted, + KeyType *d_keys_parted, + size_t *h_part_sizes, + const int &shard_num, + const cudaStream_t &stream) { + DevPlace place = DevPlace(gpu_id); + AnyDeviceGuard guard(gpu_id); + + std::vector h_offsets(shard_num * 2); + auto d_offset_tmp = + memory::Alloc(place, + (len * 3 + shard_num * 2) * sizeof(int), + phi::Stream(reinterpret_cast(stream))); + uint32_t *d_left = reinterpret_cast(d_offset_tmp->ptr()); + uint32_t *d_right = &d_left[shard_num]; + // init + cudaMemsetAsync(d_left, -1, shard_num * 2 * sizeof(int), stream); + + uint32_t *d_idx_tmp_ptr = reinterpret_cast(&d_right[shard_num]); + uint32_t *d_shard_index_ptr = &d_idx_tmp_ptr[len]; + uint32_t *d_shard_index_tmp_ptr = &d_shard_index_ptr[len]; + + heter_comm_kernel_->fill_idx(d_idx_tmp_ptr, len, stream); + heter_comm_kernel_->calc_node_shard_index( + d_keys, len, d_shard_index_tmp_ptr, device_num_, shard_num, stream); + + size_t temp_storage_bytes; + const int num_bits = 1 + log2i(shard_num); + heter_comm_kernel_->sort_pairs(NULL, + temp_storage_bytes, + d_shard_index_tmp_ptr, + d_shard_index_ptr, + d_idx_tmp_ptr, + d_idx_parted, + len, + 0, + num_bits, + stream); + + auto d_temp_storage = + memory::Alloc(place, + temp_storage_bytes, + phi::Stream(reinterpret_cast(stream))); + heter_comm_kernel_->sort_pairs(d_temp_storage->ptr(), + temp_storage_bytes, + d_shard_index_tmp_ptr, + d_shard_index_ptr, + d_idx_tmp_ptr, + d_idx_parted, + len, + 0, + num_bits, + stream); + + heter_comm_kernel_->calc_shard_offset( + d_shard_index_ptr, d_left, d_right, len, shard_num, stream); + heter_comm_kernel_->gather_keys( + d_keys_parted, d_keys, d_idx_parted, len, stream); + + cudaMemcpyAsync(&h_offsets[0], + d_left, + shard_num * 2 * sizeof(int), + cudaMemcpyDeviceToHost, + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + for (int i = 0; i < shard_num; ++i) { + uint32_t &h_right = h_offsets[shard_num + i]; + uint32_t &h_left = h_offsets[i]; + if (static_cast(h_right) == -1 || static_cast(h_left) == -1) { + h_part_sizes[i] = 0; + } else { + h_part_sizes[i] = h_right - h_left + 1; + } + } +} +template +size_t HeterComm::send_data_by_all2all( + const int &gpu_id, + const int &nccl_node_size, + const int &nccl_rank_id, + const int &value_bytes, + const size_t *h_send_part_sizes, + const size_t *h_send_part_offsets, + const size_t *h_recv_part_sizes, + const size_t *h_recv_part_offsets, + const char *d_send_buff, + char *d_rev_buff, + const cudaStream_t &stream) { + auto &comm = nccl_inter_comms_[gpu_id]; + const size_t &send_size = h_send_part_sizes[nccl_rank_id]; + size_t send_offset = h_send_part_offsets[nccl_rank_id] * value_bytes; + size_t recv_offset = h_recv_part_offsets[nccl_rank_id] * value_bytes; + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyAsync(&d_rev_buff[recv_offset], // output + &d_send_buff[send_offset], + send_size * value_bytes, + cudaMemcpyDeviceToDevice, + stream)); + CHECK(send_size == h_recv_part_sizes[nccl_rank_id]) + << "gpu id=" << gpu_id << ", rank_id=" << nccl_rank_id + << ", node_size=" << nccl_node_size << ", send_size=" << send_size + << ", recv_size=" << h_recv_part_sizes[nccl_rank_id]; + + size_t total_fea_num = 0; + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclGroupStart()); + for (int i = 0; i < nccl_node_size; i++) { + if (i == nccl_rank_id) { + continue; + } + const size_t &send_size = h_send_part_sizes[i]; + if (send_size > 0) { + send_offset = h_send_part_offsets[i] * 
value_bytes; + PADDLE_ENFORCE_GPU_SUCCESS( + platform::dynload::ncclSend(&d_send_buff[send_offset], + send_size * value_bytes, + ncclInt8, + i, + comm, + stream)); + total_fea_num += send_size; + } + const size_t &recv_size = h_recv_part_sizes[i]; + if (recv_size > 0) { + recv_offset = h_recv_part_offsets[i] * value_bytes; + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclRecv( + reinterpret_cast(&d_rev_buff[recv_offset]), + recv_size * value_bytes, + ncclInt8, + i, + comm, + stream)); + total_fea_num += recv_size; + } + } + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclGroupEnd()); + + return total_fea_num; +} +template +size_t HeterComm:: + gather_sparse_keys_by_all2all(const int &gpu_id, + const size_t &fea_size, + const KeyType *d_in_keys, + KeyType *d_out_keys, + KeyType *d_tmp_keys, + const cudaStream_t &stream) { + auto &cache = storage_[gpu_id]; + cache.init_shard(fea_size, node_size_); + auto &res = cache.shard_res; + + size_t *h_local_part_sizes = res.h_local_part_sizes.data(); + size_t *h_local_part_offsets = res.h_local_part_offsets.data(); + uint32_t *h_push_fea_sizes = res.h_push_fea_sizes.data(); + // partition keys + partition_shard_keys(gpu_id, + fea_size, + d_in_keys, + res.d_local_idx_parted, + d_tmp_keys, + h_local_part_sizes, + node_size_, + stream); + + int all_shard_part_size = node_size_ * node_size_; + int rank_offset = rank_id_ * node_size_; + h_local_part_offsets[0] = 0; + for (int i = 0; i < node_size_; ++i) { + h_push_fea_sizes[rank_offset + i] = h_local_part_sizes[i]; + h_local_part_offsets[i + 1] = + h_local_part_offsets[i] + h_local_part_sizes[i]; + } + CHECK(fea_size == h_local_part_offsets[node_size_]) + << "gpu id=" << gpu_id << ", fea_size=" << fea_size + << ", offset size=" << h_local_part_offsets[node_size_]; + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(&res.d_node_size_ptr[rank_offset], + &h_push_fea_sizes[rank_offset], + node_size_ * sizeof(int), + cudaMemcpyHostToDevice, + stream)); + cache.node_barrier_.Resume(); + auto &comm = nccl_inter_comms_[gpu_id]; + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclAllGather( + &res.d_node_size_ptr[rank_offset], + reinterpret_cast(res.d_node_size_ptr), + node_size_, + ncclInt, + comm, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(&h_push_fea_sizes[0], + res.d_node_size_ptr, + all_shard_part_size * sizeof(int), + cudaMemcpyDeviceToHost, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + cache.node_barrier_.Pause(); + + size_t *h_remote_part_sizes = res.h_remote_part_sizes.data(); + size_t *h_remote_part_offsets = res.h_remote_part_offsets.data(); + h_remote_part_offsets[0] = 0; + for (int i = 0; i < node_size_; i++) { + int offset = node_size_ * i + rank_id_; + h_remote_part_sizes[i] = h_push_fea_sizes[offset]; + h_remote_part_offsets[i + 1] = + h_remote_part_offsets[i] + h_remote_part_sizes[i]; + } + size_t &remote_size = h_remote_part_offsets[node_size_]; + cache.alloc(remote_size, max_type_size_, HeterCommType::COPY_KEY); + + size_t total_fea_num = 0; + if (rdma_checker_->need_rdma_trans()) { + total_fea_num = send_keys_by_all2all_trans( + gpu_id, rank_id_, node_size_, fea_size, d_tmp_keys, d_out_keys, stream); + } else { + total_fea_num = send_data_by_all2all(gpu_id, + node_size_, + rank_id_, + sizeof(KeyType), + h_local_part_sizes, + h_local_part_offsets, + h_remote_part_sizes, + h_remote_part_offsets, + (const char *)(d_tmp_keys), + reinterpret_cast(d_out_keys), + stream); + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + return remote_size; +} 
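The exchange above hinges on one piece of host-side bookkeeping that is easy to lose in the diff: each rank first publishes its per-destination key counts via ncclAllGather into a flattened node_size x node_size matrix, then reads its own column of that matrix to learn how many keys every peer will send it. A minimal, standalone sketch of that offset construction is given below; the helper name build_all2all_offsets and its parameters are illustrative only and are not part of this patch.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Build per-peer send/recv offsets for one rank from the gathered size
    // matrix. all_sizes is the flattened node_size x node_size matrix produced
    // by the allgather; entry [sender * node_size + receiver] holds the number
    // of keys `sender` ships to `receiver`.
    inline void build_all2all_offsets(const std::vector<uint32_t>& all_sizes,
                                      int node_size,
                                      int rank_id,
                                      std::vector<size_t>* send_offsets,
                                      std::vector<size_t>* recv_offsets) {
      send_offsets->assign(node_size + 1, 0);
      recv_offsets->assign(node_size + 1, 0);
      for (int i = 0; i < node_size; ++i) {
        // row rank_id: keys this rank sends to peer i
        (*send_offsets)[i + 1] =
            (*send_offsets)[i] + all_sizes[rank_id * node_size + i];
        // column rank_id: keys peer i sends to this rank
        (*recv_offsets)[i + 1] =
            (*recv_offsets)[i] + all_sizes[i * node_size + rank_id];
      }
    }

In the patch code, send_offsets corresponds to h_local_part_offsets, recv_offsets to h_remote_part_offsets, and recv_offsets->back() to the remote_size that is returned and used to size the receive buffer before send_data_by_all2all runs.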
+template +void HeterComm:: + scatter_sparse_vals_by_all2all(const int &gpu_id, + const size_t &fea_size, + const char *d_in_vals, + void *d_out_vals, + const size_t &value_bytes, + void *d_tmp_vals, + const cudaStream_t &stream) { + auto &cache = storage_[gpu_id]; + auto &res = cache.shard_res; + auto h_local_part_sizes = res.h_local_part_sizes.data(); + auto h_local_part_offsets = res.h_local_part_offsets.data(); + auto h_remote_part_sizes = res.h_remote_part_sizes.data(); + auto h_remote_part_offsets = res.h_remote_part_offsets.data(); + + size_t total_fea_num = 0; + if (rdma_checker_->need_rdma_trans()) { + total_fea_num = + send_vals_by_all2all_trans(gpu_id, + rank_id_, + node_size_, + d_in_vals, + reinterpret_cast(d_tmp_vals), + value_bytes, + stream); + } else { + // send local device + total_fea_num = send_data_by_all2all(gpu_id, + node_size_, + rank_id_, + value_bytes, + h_remote_part_sizes, + h_remote_part_offsets, + h_local_part_sizes, + h_local_part_offsets, + d_in_vals, + reinterpret_cast(d_tmp_vals), + stream); + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + // fill vals + heter_comm_kernel_->scatter_vals( + (const float *)(d_tmp_vals), // in + reinterpret_cast(d_out_vals), // out + res.d_local_idx_parted, + fea_size, + value_bytes, + stream); + CUDA_CHECK(cudaStreamSynchronize(stream)); +} +template +void HeterComm::scatter_inner_vals_p2p( + const size_t &total_fea_num, + void *d_out_vals, + HeterCommType::InnerResource &res, // NOLINT + const int &gpu_id, + const int &gpu_num, + const int &trans_id, + const size_t &value_bytes, + const cudaStream_t &stream) { + if (trans_id < 0) { + // not need transfer + for (int i = 0; i < gpu_num; ++i) { + size_t &data_len = res.h_part_sizes[i]; + if (data_len == 0) { + continue; + } + size_t &offset = res.h_offsets[i]; + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(&res.d_vals_parted[offset * value_bytes], + gpu_id, + res.d_remote_vals[i], + i, + data_len * value_bytes, + stream)); + } + } else { + // need transfer + for (int i = 0; i < gpu_num; ++i) { + size_t data_len = res.h_part_sizes[i]; + if (data_len == 0) { + continue; + } + size_t &offset = res.h_offsets[i]; + // printf("[%d<-%d<-%d]recv vals offset: %ld, len: %ld\n", gpu_id, + // trans_id, i, offset, data_len); + if (!need_transfer(gpu_id, i)) { + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(&res.d_vals_parted[offset * value_bytes], + gpu_id, + res.d_remote_vals[i], + i, + data_len * value_bytes, + stream)); + continue; + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_trans_vals, + trans_id, + res.d_remote_vals[i], + i, + data_len * value_bytes, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(&res.d_vals_parted[offset * value_bytes], + gpu_id, + res.d_trans_vals, + trans_id, + data_len * value_bytes, + stream)); + } + } + // restore vals + heter_comm_kernel_->scatter_vals( + (const float *)(res.d_vals_parted), // in + reinterpret_cast(d_out_vals), // out + res.d_idx, + total_fea_num, + value_bytes, + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); +} +template +void HeterComm:: + scatter_inter_vals_by_copy(const int &gpu_id, + const size_t &fea_size, + const char *d_in_vals, + void *d_out_vals, + const size_t &value_bytes, + const cudaStream_t &stream) { + auto &my_cache = storage_[gpu_id]; + // restore vals + heter_comm_kernel_->scatter_vals( + (const float *)(d_in_vals), // in + reinterpret_cast(my_cache.d_merged_push_vals), // out + my_cache.pull_res.d_restore_keys_idx, + 
my_cache.pull_res.h_recv_fea_num, + value_bytes, + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + auto &res = my_cache.inner_res; + int trans_id = -1; + if (topo_aware_ && device_num_ > 4) { + trans_id = get_transfer_devid(gpu_id); + } + my_cache.inner_barrier_.Resume(); + // barrier wait set buffer ptr + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + + for (int i = 0; i < device_num_; ++i) { + auto &cache = storage_[i]; + size_t &recv_offset = cache.h_recv_offsets[gpu_id]; + res.d_remote_vals[i] = &cache.d_merged_push_vals[recv_offset * value_bytes]; + if (trans_id >= 0) { + // set transfer buffer + auto &trans_cache = storage_[trans_id]; + res.d_trans_vals = + &trans_cache + .d_merged_push_vals[trans_cache.h_trans_offset * value_bytes]; + } + } + res.d_vals_parted = my_cache.d_merged_vals; + my_cache.inner_barrier_.Resume(); + // barrier wait set buffer ptr + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + // recv all pull sparse vals + scatter_inner_vals_p2p(fea_size, + d_out_vals, // out + res, + gpu_id, + device_num_, + trans_id, + value_bytes, + stream); +} +template +void HeterComm::gather_inner_data_p2p( + const size_t &total_fea_num, + const KeyType *d_keys, + const void *d_vals, + HeterCommType::InnerResource &res, // NOLINT + const int &gpu_id, + const int &gpu_num, + const int &trans_id, + const size_t &value_bytes, + const cudaStream_t &stream) { + // gather all datas + heter_comm_kernel_->gather_keys( + res.d_keys_parted, d_keys, res.d_idx, total_fea_num, stream); + heter_comm_kernel_->gather_vals(reinterpret_cast(res.d_vals_parted), + (const float *)(d_vals), + res.d_idx, + total_fea_num, + value_bytes, + stream); + // p2p copy key and values + if (trans_id < 0) { + // not need transfer + for (int i = 0; i < gpu_num; ++i) { + size_t &data_len = res.h_part_sizes[i]; + if (data_len == 0) { + continue; + } + size_t &offset = res.h_offsets[i]; + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_remote_keys[i], + i, + &res.d_keys_parted[offset], + gpu_id, + data_len * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(res.d_remote_vals[i], + i, + &res.d_vals_parted[offset * value_bytes], + gpu_id, + data_len * value_bytes, + stream)); + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + return; + } + // need transfer + for (int i = 0; i < gpu_num; ++i) { + size_t data_len = res.h_part_sizes[i]; + if (data_len == 0) { + continue; + } + size_t &offset = res.h_offsets[i]; + if (!need_transfer(gpu_id, i)) { + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_remote_keys[i], + i, + &res.d_keys_parted[offset], + gpu_id, + data_len * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(res.d_remote_vals[i], + i, + &res.d_vals_parted[offset * value_bytes], + gpu_id, + data_len * value_bytes, + stream)); + continue; + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_trans_keys, + trans_id, + &res.d_keys_parted[offset], + gpu_id, + data_len * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_remote_keys[i], + i, + res.d_trans_keys, + trans_id, + data_len * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(res.d_trans_vals, + trans_id, + &res.d_vals_parted[offset * value_bytes], + gpu_id, + data_len * value_bytes, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(res.d_remote_vals[i], + i, + res.d_trans_vals, + trans_id, + data_len * value_bytes, + stream)); + } + 
PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); +} +template +template +void HeterComm::push_sparse_all2all( + const int &gpu_id, + KeyType *d_keys, + float *d_grads, + const size_t &len, + Sgd &sgd) { // NOLINT + if (len == 0) { + return; + } + auto &my_cache = storage_[gpu_id]; + my_cache.all2all_span_.Resume(); + auto stream = resource_->comm_stream(gpu_id, 0); + // tracker + if (FLAGS_enable_tracker_all2all) { + // check push grads + heter_comm_kernel_->check_valid_values(10, + len, + d_keys, + (const char *)(d_grads), + grad_type_size_, + stream, + (gpu_id == 0)); + } + // scale grad + heter_comm_kernel_->scale_grad(len, + reinterpret_cast(d_grads), + grad_type_size_, + max_mf_dim_, + stream, + gpu_accessor_); + + size_t inter_push_len = 0; + size_t node_push_len = 0; + size_t value_bytes = grad_type_size_; + // enable inner gather + if (FLAGS_enable_sparse_inner_gather) { + my_cache.inner_span_.Resume(); + inter_push_len = + gather_inter_gradient_by_copy(gpu_id, + len, + d_keys, + reinterpret_cast(d_grads), + grad_type_size_, + stream); + my_cache.inner_span_.Pause(); + + my_cache.node_span_.Resume(); + + if (FLAGS_enable_all2all_use_fp16) { // use fp16 + value_bytes = heter_comm_kernel_->compress_values( + inter_push_len, + (const char *)my_cache.d_merged_push_vals, + reinterpret_cast(my_cache.d_merged_vals), + grad_type_size_, + max_mf_dim_, + max_grad_bound_, + stream); + node_push_len = + gather_sparse_gradient_by_all2all(gpu_id, + inter_push_len, + my_cache.d_merged_push_keys, + my_cache.d_merged_vals, + value_bytes, + my_cache.d_merged_push_keys, + my_cache.d_merged_keys, + my_cache.d_merged_vals, + my_cache.d_merged_push_vals, + stream); + heter_comm_kernel_->uncompress_values( + node_push_len, + (const char *)my_cache.d_merged_vals, + reinterpret_cast(my_cache.d_merged_push_vals), + grad_type_size_, + max_mf_dim_, + max_grad_bound_, + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + } else { + node_push_len = + gather_sparse_gradient_by_all2all(gpu_id, + inter_push_len, + my_cache.d_merged_push_keys, + my_cache.d_merged_push_vals, + value_bytes, + my_cache.d_merged_push_keys, + my_cache.d_merged_keys, + my_cache.d_merged_push_vals, + my_cache.d_merged_vals, + stream); + } + my_cache.node_span_.Pause(); + } else { // only node all2all + my_cache.node_span_.Resume(); + if (FLAGS_enable_all2all_use_fp16) { // use fp16 + value_bytes = heter_comm_kernel_->compress_values( + len, + (const char *)d_grads, + reinterpret_cast(my_cache.d_merged_vals), + grad_type_size_, + max_mf_dim_, + max_grad_bound_, + stream); + node_push_len = + gather_sparse_gradient_by_all2all(gpu_id, + len, + d_keys, + my_cache.d_merged_vals, + value_bytes, + my_cache.d_merged_push_keys, + my_cache.d_merged_keys, + my_cache.d_merged_vals, + my_cache.d_merged_push_vals, + stream); + heter_comm_kernel_->uncompress_values( + node_push_len, + (const char *)my_cache.d_merged_vals, + reinterpret_cast(my_cache.d_merged_push_vals), + grad_type_size_, + max_mf_dim_, + max_grad_bound_, + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + } else { + node_push_len = + gather_sparse_gradient_by_all2all(gpu_id, + len, + d_keys, // in + (const char *)d_grads, // in + value_bytes, + my_cache.d_merged_push_keys, // out + my_cache.d_merged_keys, // tmp + my_cache.d_merged_push_vals, // out + my_cache.d_merged_vals, // tmp + stream); + } + my_cache.node_span_.Pause(); + } + if (FLAGS_enable_tracker_all2all) { + VLOG(0) << "push gpu id=" << gpu_id + << ", 
gather_sparse_gradient_by_all2all len=" << node_push_len; + } + // all embedx merge + size_t uniq_len = merge_grad(gpu_id, + node_push_len, + my_cache.d_merged_push_keys, // in + my_cache.d_merged_keys, // out + my_cache.d_merged_push_vals, // in + my_cache.d_merged_vals, + stream); // out + if (FLAGS_enable_tracker_all2all) { + // check all2ll merge grads + heter_comm_kernel_->check_valid_values( + 11, + uniq_len, + my_cache.d_merged_keys, + (const char *)(my_cache.d_merged_vals), + grad_type_size_, + stream, + (gpu_id == 0)); + } + if (FLAGS_enable_sparse_inner_gather) { + // update all grad + update_one_table(gpu_id, + my_cache.d_merged_keys, + reinterpret_cast(my_cache.d_merged_vals), + uniq_len, + sgd); + } else { + // update all tables + push_normal_sparse(gpu_id, + my_cache.d_merged_keys, + reinterpret_cast(my_cache.d_merged_vals), + uniq_len, + sgd); + } + my_cache.all2all_span_.Pause(); + // push + if (FLAGS_enable_tracker_all2all) { + VLOG(0) << "push gpu id=" << gpu_id << ", push len=" << len + << ", inner=" << inter_push_len << ", node=" << node_push_len + << ", update=" << uniq_len << ", compress bytes=" << value_bytes + << ", grad_type_size=" << grad_type_size_; + } + print_debug_time(gpu_id); +} +template +size_t HeterComm::merge_grad( + const int &gpu_id, + const size_t &len, + const KeyType *d_in_keys, + KeyType *d_out_keys, + const void *d_in_grads, + void *d_out_grads, + const cudaStream_t &stream) { + platform::CUDADeviceGuard guard(gpu_id); + auto place = platform::CUDAPlace(gpu_id); + auto d_fea_num_info = + memory::Alloc(place, + sizeof(uint32_t) * len * 4, + phi::Stream(reinterpret_cast(stream))); + uint32_t *d_offset = reinterpret_cast(d_fea_num_info->ptr()); + uint32_t *d_sorted_idx = &d_offset[len]; + uint32_t *d_restore_idx = &d_sorted_idx[len]; + uint32_t *d_merged_cnts = &d_restore_idx[len]; + + auto d_sort_keys_ptr = + memory::Alloc(place, + sizeof(KeyType) * len, + phi::Stream(reinterpret_cast(stream))); + KeyType *d_sorted_keys = reinterpret_cast(d_sort_keys_ptr->ptr()); + + size_t merge_size = dedup_keys_and_fillidx(gpu_id, + len, + d_in_keys, // input + d_out_keys, // output + d_sorted_keys, + d_restore_idx, + d_sorted_idx, + d_offset, + d_merged_cnts, + false, + stream); + + heter_comm_kernel_->merge_gradient(d_out_keys, + d_offset, + d_merged_cnts, + d_sorted_idx, + reinterpret_cast(d_in_grads), + reinterpret_cast(d_out_grads), + static_cast(merge_size), + max_mf_dim_, + grad_type_size_, + merger_, + stream, + gpu_accessor_); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + return merge_size; +} +template +size_t HeterComm:: + gather_inter_gradient_by_copy(const int &gpu_id, + const size_t &push_size, + KeyType *d_keys, + void *d_push_vals, + const size_t &value_bytes, + const cudaStream_t &stream) { + auto &my_cache = storage_[gpu_id]; + + my_cache.init_inner(push_size, device_num_); + auto &res = my_cache.inner_res; + res.h_part_sizes = &my_cache.h_fea_sizes[0]; + // shard data + shard_inner_keys(push_size, d_keys, gpu_id, device_num_, &res, stream); + my_cache.inner_barrier_.Resume(); + // barrier wait all gpu done + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + + size_t max_part_size = 0; + size_t shard_recv_offset = 0; + size_t shard_send_offset = 0; + for (int i = 0; i < device_num_; ++i) { + auto &cache = storage_[i]; + my_cache.h_recv_offsets[i] = shard_recv_offset; + shard_recv_offset += cache.h_fea_sizes[gpu_id]; + res.h_offsets[i] = shard_send_offset; + shard_send_offset += res.h_part_sizes[i]; + if 
(res.h_part_sizes[i] > max_part_size) { + max_part_size = res.h_part_sizes[i]; + } + } + + size_t trans_need_size = std::max(shard_recv_offset, push_size); + int trans_id = -1; + if (topo_aware_ && device_num_ > 4) { + trans_id = get_transfer_devid(gpu_id); + storage_[trans_id].h_trans_size = max_part_size; + // barrier wait all set trans length [0-4, 1-5, 3-7, 2-6] + barrier_.wait(); + my_cache.h_trans_offset = trans_need_size; + trans_need_size += my_cache.h_trans_size; + } + my_cache.alloc(trans_need_size, max_type_size_); + my_cache.inner_barrier_.Resume(); + // barrier wait all hbm malloc size + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + + for (int i = 0; i < device_num_; ++i) { + auto &cache = storage_[i]; + size_t &recv_offset = cache.h_recv_offsets[gpu_id]; + res.d_remote_keys[i] = &cache.d_merged_keys[recv_offset]; + res.d_remote_vals[i] = &cache.d_merged_vals[recv_offset * value_bytes]; + if (trans_id >= 0) { + // set transfer buffer + auto &trans_cache = storage_[trans_id]; + res.d_trans_keys = &trans_cache.d_merged_keys[trans_cache.h_trans_offset]; + res.d_trans_vals = + &trans_cache.d_merged_vals[trans_cache.h_trans_offset * value_bytes]; + } + } + res.d_keys_parted = my_cache.d_merged_push_keys; + res.d_vals_parted = my_cache.d_merged_push_vals; + my_cache.inner_barrier_.Resume(); + // barrier wait set buffer ptr + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + gather_inner_data_p2p(push_size, + d_keys, + d_push_vals, + res, + gpu_id, + device_num_, + trans_id, + value_bytes, + stream); + // barrier wait all gpu aync memcpy data + my_cache.inner_barrier_.Resume(); + barrier_.wait(); + my_cache.inner_barrier_.Pause(); + // all embedx merge + size_t total_push_size = merge_grad(gpu_id, + shard_recv_offset, + my_cache.d_merged_keys, + my_cache.d_merged_push_keys, + my_cache.d_merged_vals, + my_cache.d_merged_push_vals, + stream); + + return total_push_size; +} +template +size_t HeterComm:: + gather_sparse_gradient_by_all2all(const int &gpu_id, + const size_t &fea_size, + const KeyType *d_keys, + const char *d_push_vals, + const size_t &value_bytes, + KeyType *d_out_keys, + KeyType *d_tmp_keys, + char *d_out_vals, + char *d_tmp_vals, + const cudaStream_t &stream) { + auto &my_cache = storage_[gpu_id]; + my_cache.init_shard(fea_size, node_size_); + auto &res = my_cache.shard_res; + + size_t *h_local_part_sizes = res.h_local_part_sizes.data(); + size_t *h_local_part_offsets = res.h_local_part_offsets.data(); + uint32_t *h_push_fea_sizes = res.h_push_fea_sizes.data(); + + partition_shard_keys(gpu_id, + fea_size, + d_keys, + res.d_local_idx_parted, + d_tmp_keys, + h_local_part_sizes, + node_size_, + stream); + + int all_shard_part_size = node_size_ * node_size_; + h_local_part_offsets[0] = 0; + for (int i = 0; i < node_size_; i++) { + int offset = rank_id_ * node_size_ + i; + h_push_fea_sizes[offset] = h_local_part_sizes[i]; + h_local_part_offsets[i + 1] = + h_local_part_offsets[i] + h_local_part_sizes[i]; + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(res.d_node_size_ptr, + &h_push_fea_sizes[0], + all_shard_part_size * sizeof(int), + cudaMemcpyHostToDevice, + stream)); + + my_cache.node_barrier_.Resume(); + auto &comm = nccl_inter_comms_[gpu_id]; + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclAllGather( + (&res.d_node_size_ptr[rank_id_ * node_size_]), + reinterpret_cast(res.d_node_size_ptr), + node_size_, + ncclInt, + comm, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyAsync(&h_push_fea_sizes[0], + res.d_node_size_ptr, + all_shard_part_size * 
sizeof(int), + cudaMemcpyDeviceToHost, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + my_cache.node_barrier_.Pause(); + + size_t *h_remote_part_sizes = res.h_remote_part_sizes.data(); + size_t *h_remote_part_offsets = res.h_remote_part_offsets.data(); + + h_remote_part_offsets[0] = 0; + for (int i = 0; i < node_size_; i++) { + int offset = node_size_ * i + rank_id_; + int recv_num = h_push_fea_sizes[offset]; + h_remote_part_sizes[i] = recv_num; + h_remote_part_offsets[i + 1] = h_remote_part_offsets[i] + recv_num; + } + size_t total_recv_fea_num = h_remote_part_offsets[node_size_]; + my_cache.alloc(total_recv_fea_num, max_type_size_, HeterCommType::COPY_ALL); + // fill shard vals + heter_comm_kernel_->gather_vals( + reinterpret_cast(d_tmp_vals), // out + reinterpret_cast(d_push_vals), // in + res.d_local_idx_parted, + fea_size, + value_bytes, + stream); + + size_t total_send_recv = 0; + if (rdma_checker_->need_rdma_trans()) { + total_send_recv = send_gradient_by_all2all_trans(gpu_id, + rank_id_, + node_size_, + fea_size, + d_tmp_keys, + d_tmp_vals, + value_bytes, + d_out_keys, + d_out_vals, + stream); + } else { + // send local device + total_send_recv = send_data_by_all2all(gpu_id, + node_size_, + rank_id_, + sizeof(KeyType), + h_local_part_sizes, + h_local_part_offsets, + h_remote_part_sizes, + h_remote_part_offsets, + (const char *)(d_tmp_keys), + reinterpret_cast(d_out_keys), + stream); + send_data_by_all2all(gpu_id, + node_size_, + rank_id_, + value_bytes, + h_local_part_sizes, + h_local_part_offsets, + h_remote_part_sizes, + h_remote_part_offsets, + (const char *)(d_tmp_vals), + reinterpret_cast(d_out_vals), + stream); + } + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + return total_recv_fea_num; +} +template +size_t +HeterComm::send_keys_by_all2all_trans( + const int &gpu_id, + const int &nccl_rank_id, + const int &nccl_node_size, + const size_t &fea_size, + const KeyType *d_in_keys, + KeyType *d_out_keys, + const cudaStream_t &stream) { + size_t total_fea_num = 0; + auto &my_cache = storage_[gpu_id]; + if (!rdma_checker_->is_device_support_rdma(gpu_id)) { + int trans_id = get_transfer_devid(gpu_id); + auto &trans = storage_[trans_id]; + // wait node alloc hbm + trans.sem_wait->post(); + my_cache.sem_wait->wait(); + + const size_t &recv_size = + my_cache.shard_res.h_remote_part_offsets[nccl_node_size]; + size_t need_len = std::max(fea_size, recv_size); + CHECK(trans.trans_keys_buff->size() >= need_len * sizeof(KeyType) * 2); + + // p2p copy + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(trans.d_merged_trans_keys, + trans_id, + d_in_keys, + gpu_id, + fea_size * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + // wait node data ok + trans.sem_wait->post(); + my_cache.sem_wait->wait(); + + // p2p copy + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(d_out_keys, + gpu_id, + trans.d_merged_push_trans_keys, + trans_id, + recv_size * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + } else { + my_cache.sem_wait->wait(); + int trans_id = get_transfer_devid(gpu_id); + auto &trans = storage_[trans_id]; + + // alloc trans mem + size_t trans_len = + std::max(trans.shard_res.h_local_part_offsets[nccl_node_size], + trans.shard_res.h_remote_part_offsets[nccl_node_size]); + my_cache.init_trans(trans_len, max_type_size_); + + trans.sem_wait->post(); + my_cache.sem_wait->wait(); + + // send local device + total_fea_num = + send_data_by_all2all(gpu_id, + 
nccl_node_size, + nccl_rank_id, + sizeof(KeyType), + my_cache.shard_res.h_local_part_sizes.data(), + my_cache.shard_res.h_local_part_offsets.data(), + my_cache.shard_res.h_remote_part_sizes.data(), + my_cache.shard_res.h_remote_part_offsets.data(), + (const char *)d_in_keys, + reinterpret_cast(d_out_keys), + stream); + // send trans device + total_fea_num += send_data_by_all2all( + gpu_id, + nccl_node_size, + nccl_rank_id, + sizeof(KeyType), + trans.shard_res.h_local_part_sizes.data(), + trans.shard_res.h_local_part_offsets.data(), + trans.shard_res.h_remote_part_sizes.data(), + trans.shard_res.h_remote_part_offsets.data(), + (const char *)my_cache.d_merged_trans_keys, + reinterpret_cast(my_cache.d_merged_push_trans_keys), + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + trans.sem_wait->post(); + } + return total_fea_num; +} +template +size_t +HeterComm::send_vals_by_all2all_trans( + const int &gpu_id, + const int &nccl_rank_id, + const int &nccl_node_size, + const char *d_in_vals, + char *d_out_vals, + const size_t &value_bytes, + const cudaStream_t &stream) { + auto &my_cache = storage_[gpu_id]; + auto h_local_part_sizes = my_cache.shard_res.h_local_part_sizes.data(); + auto h_local_part_offsets = my_cache.shard_res.h_local_part_offsets.data(); + auto h_remote_part_sizes = my_cache.shard_res.h_remote_part_sizes.data(); + auto h_remote_part_offsets = my_cache.shard_res.h_remote_part_offsets.data(); + + size_t total_fea_num = 0; + if (!rdma_checker_->is_device_support_rdma(gpu_id)) { + int trans_id = get_transfer_devid(gpu_id); + auto &trans = storage_[trans_id]; + + const size_t &send_size = h_remote_part_offsets[nccl_node_size]; + // p2p copy + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(trans.d_merged_trans_vals, + trans_id, + d_in_vals, + gpu_id, + send_size * value_bytes, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + // wait node data ok + trans.sem_wait->post(); + my_cache.sem_wait->wait(); + + const size_t &recv_size = h_local_part_offsets[nccl_node_size]; + // p2p copy + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(d_out_vals, + gpu_id, + trans.d_merged_push_trans_vals, + trans_id, + recv_size * value_bytes, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + } else { + my_cache.sem_wait->wait(); + int trans_id = get_transfer_devid(gpu_id); + auto &trans = storage_[trans_id]; + + // send local device + total_fea_num = send_data_by_all2all(gpu_id, + nccl_node_size, + nccl_rank_id, + value_bytes, + h_remote_part_sizes, + h_remote_part_offsets, + h_local_part_sizes, + h_local_part_offsets, + (const char *)d_in_vals, + reinterpret_cast(d_out_vals), + stream); + // send trans device + total_fea_num += send_data_by_all2all( + gpu_id, + nccl_node_size, + nccl_rank_id, + value_bytes, + trans.shard_res.h_remote_part_sizes.data(), + trans.shard_res.h_remote_part_offsets.data(), + trans.shard_res.h_local_part_sizes.data(), + trans.shard_res.h_local_part_offsets.data(), + (const char *)my_cache.d_merged_trans_vals, + reinterpret_cast(my_cache.d_merged_push_trans_vals), + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + trans.sem_wait->post(); + } + return total_fea_num; +} +template +size_t HeterComm:: + send_gradient_by_all2all_trans(const int &gpu_id, + const int &nccl_rank_id, + const int &nccl_node_size, + const size_t &fea_size, + const KeyType *d_in_keys, + const char *d_in_vals, + const size_t &value_bytes, + KeyType *d_out_keys, + char *d_out_vals, + const cudaStream_t 
&stream) { + auto &my_cache = storage_[gpu_id]; + size_t total_send_recv = 0; + if (!rdma_checker_->is_device_support_rdma(gpu_id)) { + int trans_id = get_transfer_devid(gpu_id); + auto &trans = storage_[trans_id]; + + // wait node alloc hbm + // trans.sem_wait->post(); + // my_cache.sem_wait->wait(); + const size_t &recv_total_size = + my_cache.shard_res.h_remote_part_offsets[nccl_node_size]; + size_t need_len = std::max(fea_size, recv_total_size); + CHECK(trans.trans_keys_buff->size() >= need_len * sizeof(KeyType) * 2); + + // p2p copy + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(trans.d_merged_trans_keys, + trans_id, + d_in_keys, + gpu_id, + fea_size * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaMemcpyPeerAsync(trans.d_merged_trans_vals, + trans_id, + d_in_vals, + gpu_id, + fea_size * value_bytes, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + + // wait node data ok + trans.sem_wait->post(); + my_cache.sem_wait->wait(); + + // p2p copy + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(d_out_keys, + gpu_id, + trans.d_merged_push_trans_keys, + trans_id, + recv_total_size * sizeof(KeyType), + stream)); + PADDLE_ENFORCE_GPU_SUCCESS( + cudaMemcpyPeerAsync(d_out_vals, + gpu_id, + trans.d_merged_push_trans_vals, + trans_id, + recv_total_size * value_bytes, + stream)); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + } else { + // copy local rank id data + my_cache.sem_wait->wait(); + int trans_id = get_transfer_devid(gpu_id); + auto &trans = storage_[trans_id]; + + // size_t trans_len = + // std::max(trans.shard_res.h_local_part_offsets[nccl_node_size], + // trans.shard_res.h_remote_part_offsets[nccl_node_size]); + // // alloc mem + // my_cache.init_trans(trans_len, value_bytes); + // + // trans.sem_wait->post(); + // my_cache.sem_wait->wait(); + + // send local device + total_send_recv = + send_data_by_all2all(gpu_id, + nccl_node_size, + nccl_rank_id, + sizeof(KeyType), + my_cache.shard_res.h_local_part_sizes.data(), + my_cache.shard_res.h_local_part_offsets.data(), + my_cache.shard_res.h_remote_part_sizes.data(), + my_cache.shard_res.h_remote_part_offsets.data(), + (const char *)d_in_keys, + reinterpret_cast(d_out_keys), + stream); + send_data_by_all2all(gpu_id, + nccl_node_size, + nccl_rank_id, + value_bytes, + my_cache.shard_res.h_local_part_sizes.data(), + my_cache.shard_res.h_local_part_offsets.data(), + my_cache.shard_res.h_remote_part_sizes.data(), + my_cache.shard_res.h_remote_part_offsets.data(), + (const char *)d_in_vals, + reinterpret_cast(d_out_vals), + stream); + // send trans device + total_send_recv += send_data_by_all2all( + gpu_id, + nccl_node_size, + nccl_rank_id, + sizeof(KeyType), + trans.shard_res.h_local_part_sizes.data(), + trans.shard_res.h_local_part_offsets.data(), + trans.shard_res.h_remote_part_sizes.data(), + trans.shard_res.h_remote_part_offsets.data(), + (const char *)my_cache.d_merged_trans_keys, + reinterpret_cast(my_cache.d_merged_push_trans_keys), + stream); + send_data_by_all2all( + gpu_id, + nccl_node_size, + nccl_rank_id, + value_bytes, + trans.shard_res.h_local_part_sizes.data(), + trans.shard_res.h_local_part_offsets.data(), + trans.shard_res.h_remote_part_sizes.data(), + trans.shard_res.h_remote_part_offsets.data(), + (const char *)my_cache.d_merged_trans_vals, + reinterpret_cast(my_cache.d_merged_push_trans_vals), + stream); + PADDLE_ENFORCE_GPU_SUCCESS(cudaStreamSynchronize(stream)); + trans.sem_wait->post(); + } + return total_send_recv; +} } // end namespace framework } // end 
namespace paddle #endif diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.cu b/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.cu index c885b77d2e1da..c884628da72f7 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.cu +++ b/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.cu @@ -16,6 +16,7 @@ limitations under the License. */ #ifdef PADDLE_WITH_HETERPS #include "paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.h" +#include "paddle/fluid/platform/float16.h" namespace paddle { namespace framework { @@ -92,6 +93,18 @@ __global__ void calc_shard_index_kernel(KeyType* d_keys, } } +template +__global__ void calc_node_shard_index_kernel(KeyType* d_keys, + const size_t len, + T* shard_index, + const int total_gpu, + const int node_num) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + shard_index[i] = (d_keys[i] / total_gpu) % node_num; + } +} + template __global__ void fill_shard_key_kernel(KeyType* d_shard_keys, KeyType* d_keys, @@ -137,22 +150,23 @@ __global__ void merge_gradients_basic_kernel(const KeyType* d_keys, char* output, int n, size_t grad_value_size, - DynamicGradMerger& merger, - GPUAccessor& gpu_accessor) { + const DynamicGradMerger& merger, + const GPUAccessor& gpu_accessor) { const size_t i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { uint32_t start = offset[i]; uint32_t num = fea_num[i]; int ori_index = index[start]; - float* out = (float*)(output + i * grad_value_size); - float* in = (float*)(input + size_t(ori_index) * grad_value_size); + float* out = reinterpret_cast(output + i * grad_value_size); + const float* in = reinterpret_cast( + input + size_t(ori_index) * grad_value_size); merger.update_basic(out, in, gpu_accessor); KeyType key = d_keys[i]; if (key != 0) { for (int j = 1; j < num; ++j) { ori_index = index[start + j]; - in = (float*)(input + size_t(ori_index) * grad_value_size); + in = (float*)(input + size_t(ori_index) * grad_value_size); // NOLINT merger.merge_basic(out, in, gpu_accessor); } } @@ -169,27 +183,23 @@ __global__ void merge_gradients_embedx_kernel(const KeyType* d_keys, int n, size_t grad_dim, size_t grad_value_size, - DynamicGradMerger& merger, - GPUAccessor& gpu_accessor) { + const DynamicGradMerger& merger, + const GPUAccessor& gpu_accessor) { const size_t i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { size_t value_idx = i / grad_dim; - size_t field_idx = i % grad_dim; - uint32_t start = offset[value_idx]; - uint32_t num = fea_num[value_idx]; - int ori_index = index[start]; - float* in = (float*)(input + size_t(ori_index) * grad_value_size); - float* out = (float*)(output + value_idx * grad_value_size); - merger.update_embedx(out, in, field_idx, gpu_accessor); - KeyType key = d_keys[value_idx]; - if (key != 0) { - for (int j = 1; j < num; ++j) { - int ori_index = index[start + j]; - float* in = (float*)(input + size_t(ori_index) * grad_value_size); - merger.merge_embedx(out, in, field_idx, gpu_accessor); - } + const uint32_t& start = offset[value_idx]; + const uint32_t& num = fea_num[value_idx]; + + double val = 0; + uint32_t off = + gpu_accessor.common_push_value.EmbedxGIndex() + (i % grad_dim); + for (uint32_t j = 0; j < num; ++j) { + val += ((float*)(&input[size_t(index[start + j]) * // NOLINT + grad_value_size]))[off]; } + (reinterpret_cast(&output[value_idx * grad_value_size]))[off] = val; } } @@ -266,9 +276,11 @@ __global__ void unpack_merged_vals_kernel(const KeyType* d_keys, } uint64_t dst_offset = uint64_t(tx) * val_size; - float* dst = 
(float*)((char*)d_out + dst_offset); - float* src_val = - (float*)((char*)d_merged_vals + uint64_t(src_val_idx) * val_size); + float* dst = + reinterpret_cast(reinterpret_cast(d_out) + dst_offset); + const float* src_val = reinterpret_cast( + reinterpret_cast(d_merged_vals) + + uint64_t(src_val_idx) * val_size); size_t n_float = val_size / sizeof(float); for (size_t k = 0; k < n_float; ++k) { @@ -277,15 +289,23 @@ __global__ void unpack_merged_vals_kernel(const KeyType* d_keys, } template -__global__ void scatter_dvals_by_unit_kernel(TUnit* d_dest_vals, - const TUnit* d_src_vals, - T* idx, - size_t len, - size_t val_size_unit) { +__global__ void gather_keys_kernel(TUnit* d_dest_vals, + const TUnit* d_src_vals, + T* idx, + size_t len) { const size_t i = blockIdx.x * blockDim.x + threadIdx.x; if (i < len) { - size_t pos = idx[i / val_size_unit] * val_size_unit + (i % val_size_unit); - d_dest_vals[i] = d_src_vals[pos]; + d_dest_vals[i] = d_src_vals[idx[i]]; + } +} +template +__global__ void scatter_keys_kernel(TUnit* d_dest_vals, + const TUnit* d_src_vals, + T* idx, + size_t len) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + d_dest_vals[idx[i]] = d_src_vals[i]; } } @@ -296,6 +316,19 @@ __global__ void gather_dvals_by_unit_kernel(TUnit* d_dest_vals, size_t len, const size_t val_size_unit) { const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < len) { + size_t pos = idx[i / val_size_unit] * val_size_unit + (i % val_size_unit); + d_dest_vals[i] = d_src_vals[pos]; + } +} + +template +__global__ void scatter_dvals_by_unit_kernel(TUnit* d_dest_vals, + const TUnit* d_src_vals, + T* idx, + size_t len, + const size_t val_size_unit) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; if (i < len) { size_t pos = idx[i / val_size_unit] * val_size_unit + (i % val_size_unit); d_dest_vals[pos] = d_src_vals[i]; @@ -304,11 +337,9 @@ __global__ void gather_dvals_by_unit_kernel(TUnit* d_dest_vals, // cuda implemention of heter_comm_kernel.h template -void HeterCommKernel::fill_idx(T* idx, - long long len, - const StreamType& stream) { +void HeterCommKernel::fill_idx(T* idx, int64_t len, const StreamType& stream) { int grid_size = (len - 1) / block_size_ + 1; - size_t c_len = (size_t)len; + size_t c_len = static_cast(len); fill_idx_kernel<<>>(idx, c_len); } @@ -316,35 +347,47 @@ template void HeterCommKernel::calc_shard_offset(T* idx, T* left, T* right, - long long len, + int64_t len, int total_devs, const StreamType& stream) { int grid_size = (len - 1) / block_size_ + 1; - size_t c_len = (size_t)len; + size_t c_len = static_cast(len); calc_shard_offset_kernel<<>>( idx, left, right, c_len); } template void HeterCommKernel::calc_shard_index(KeyType* d_keys, - long long len, + int64_t len, T* shard_index, int total_gpu, const StreamType& stream) { int grid_size = (len - 1) / block_size_ + 1; - size_t c_len = (size_t)len; + size_t c_len = static_cast(len); calc_shard_index_kernel<<>>( d_keys, c_len, shard_index, total_gpu); } +template +void HeterCommKernel::calc_node_shard_index(const KeyType* d_keys, + int64_t len, + T* shard_index, + const int& total_devs, + const int& node_num, + const StreamType& stream) { + int grid_size = (len - 1) / block_size_ + 1; + size_t c_len = static_cast(len); + calc_node_shard_index_kernel<<>>( + d_keys, c_len, shard_index, total_devs, node_num); +} template void HeterCommKernel::fill_shard_key(KeyType* d_shard_keys, KeyType* d_keys, T* idx, - long long len, + int64_t len, const StreamType& stream) { int grid_size = (len - 1) / 
block_size_ + 1; - size_t c_len = (size_t)len; + size_t c_len = static_cast(len); fill_shard_key_kernel<<>>( d_shard_keys, d_keys, idx, c_len); } @@ -355,10 +398,10 @@ void HeterCommKernel::fill_shard_grads(KeyType* d_shard_keys, GradType* d_shard_grads, GradType* d_grads, T* idx, - long long len, + int64_t len, const StreamType& stream) { int grid_size = (len - 1) / block_size_ + 1; - size_t c_len = (size_t)len; + size_t c_len = static_cast(len); fill_shard_grads_kernel<<>>( d_shard_keys, d_keys, d_shard_grads, d_grads, idx, c_len); } @@ -367,10 +410,10 @@ template void HeterCommKernel::fill_dvals(ValType* d_shard_vals, ValType* d_vals, T* idx, - long long len, + int64_t len, const StreamType& stream) { int grid_size = (len - 1) / block_size_ + 1; - size_t c_len = (size_t)len; + size_t c_len = static_cast(len); fill_dvals_kernel<<>>( d_shard_vals, d_vals, idx, c_len); } @@ -429,7 +472,6 @@ void HeterCommKernel::reduce_by_key(void* d_temp_storage, stream, debug_synchronous)); } - template (len); - const size_t grad_value_size_float = grad_value_size / sizeof(float); + const size_t grad_value_size_float = size_t(grad_value_size / sizeof(float)); // d_keys to d_shard_keys fill_shard_key_kernel<<>>( d_shard_keys, d_keys, idx, c_len); - CHECK((grad_value_size % sizeof(float)) == 0); + CHECK_EQ(grad_value_size % sizeof(float), 0); size_t N = len * grad_value_size_float; grid_size = (N - 1) / block_size_ + 1; - scatter_dvals_by_unit_kernel<<>>( + gather_dvals_by_unit_kernel<<>>( d_shard_grads, d_grads, idx, N, grad_value_size_float); } @@ -468,9 +510,9 @@ void HeterCommKernel::merge_gradient(const KeyType* d_keys, int n, size_t grad_dim, size_t grad_value_size, - DynamicGradMerger& merger, + const DynamicGradMerger& merger, const StreamType& stream, - GPUAccessor& gpu_accessor) { + const GPUAccessor& gpu_accessor) { int grid_size1 = (n - 1) / block_size_ + 1; merge_gradients_basic_kernel<<>>( d_keys, @@ -504,15 +546,15 @@ template void HeterCommKernel::dy_mf_fill_dvals(float* d_shard_vals, float* d_vals, T* idx, - long long len, + int64_t len, size_t val_size, const StreamType& stream) { const size_t val_size_float = val_size / sizeof(float); - CHECK((val_size % sizeof(float)) == 0); + CHECK_EQ(val_size % sizeof(float), 0); size_t N = len * val_size_float; const int grid_size = (N - 1) / block_size_ + 1; // fill by float, d_shard_vals to d_vals - gather_dvals_by_unit_kernel<<>>( + scatter_dvals_by_unit_kernel<<>>( d_vals, d_shard_vals, idx, N, val_size_float); } @@ -659,62 +701,458 @@ void HeterCommKernel::unpack_merged_vals(size_t n, int grid_size = (n - 1) / block_size_ + 1; unpack_merged_vals_kernel<<>>( d_keys, - (const float*)d_merged_vals, + reinterpret_cast(d_merged_vals), d_restore_idx, - (float*)d_vals, + reinterpret_cast(d_vals), val_size, n); } +template +void HeterCommKernel::gather_keys(KeyType* d_shard_keys, + const KeyType* d_keys, + T* idx, + int64_t len, + const StreamType& stream) { + size_t N = len; + int grid_size = (N - 1) / block_size_ + 1; + // d_keys -> d_shard_keys + gather_keys_kernel<<>>( + d_shard_keys, d_keys, idx, N); +} +template +void HeterCommKernel::scatter_keys(const KeyType* d_shard_keys, + KeyType* d_keys, + T* idx, + int64_t len, + const StreamType& stream) { + size_t N = len; + int grid_size = (N - 1) / block_size_ + 1; + // d_shard_keys -> d_keys + scatter_keys_kernel<<>>( + d_keys, d_shard_keys, idx, N); +} +template +void HeterCommKernel::gather_vals(float* d_shard_vals, + const float* d_vals, + T* idx, + int64_t len, + size_t value_bytes, + const 
StreamType& stream) { + const size_t value_size_float = size_t(value_bytes / sizeof(float)); + size_t N = len * value_size_float; + int grid_size = (N - 1) / block_size_ + 1; + // d_vals -> d_shard_vals + gather_dvals_by_unit_kernel<<>>( + d_shard_vals, d_vals, idx, N, value_size_float); +} +template +void HeterCommKernel::scatter_vals(const float* d_shard_vals, + float* d_vals, + T* idx, + int64_t len, + size_t value_bytes, + const StreamType& stream) { + const size_t val_size_float = size_t(value_bytes / sizeof(float)); + CHECK_EQ(value_bytes % sizeof(float), 0); + size_t N = len * val_size_float; + const int grid_size = (N - 1) / block_size_ + 1; + // fill by float, d_shard_vals to d_vals + scatter_dvals_by_unit_kernel<<>>( + d_vals, d_shard_vals, idx, N, val_size_float); +} +template +__global__ void check_valid_values_kernel(const int type, + const size_t N, + const KeyType* keys, + const char* input, + const size_t value_bytes, + const int num, + bool debug) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < N) { + const float* val = (const float*)(input + i * value_bytes); + if (debug && (i == 0 || i == (N - 1))) { + if (keys != nullptr) { + printf( + "type=%d, id=%lu, bytes=%lu, key=%lu, " + "values=[%f,%f,%f,%f,%f,%f,%f,%f]\n", + type, + i, + value_bytes, + uint64_t(keys[i]), + val[0], + val[1], + val[2], + val[3], + val[4], + val[5], + val[6], + val[7]); + } else { + printf("type=%d, id=%lu, bytes=%lu, values=[%f,%f,%f,%f,%f,%f,%f,%f]\n", + type, + i, + value_bytes, + val[0], + val[1], + val[2], + val[3], + val[4], + val[5], + val[6], + val[7]); + } + } + for (int k = 0; k < num; ++k) { + auto& c = val[k]; + if (isnan(c)) { + if (keys != nullptr) { + printf( + "nan type %d, id=%lu, offset=%d, float=%f, key=%lu, " + "values=[%f,%f,%f,%f,%f,%f,%f,%f]\n", + type, + i, + k, + c, + uint64_t(keys[i]), + val[0], + val[1], + val[2], + val[3], + val[4], + val[5], + val[6], + val[7]); + } else { + printf("nan type %d, id=%lu, offset=%d, float=%f\n", type, i, k, c); + } + } else if (isinf(c)) { + if (keys != nullptr) { + printf( + "inf type %d, id=%lu, offset=%d, float=%f, key=%lu, " + "values=[%f,%f,%f,%f,%f,%f,%f,%f]\n", + type, + i, + k, + c, + uint64_t(keys[i]), + val[0], + val[1], + val[2], + val[3], + val[4], + val[5], + val[6], + val[7]); + } else { + printf("inf type %d, id=%lu, offset=%d, float=%f\n", type, i, k, c); + } + } else if (static_cast(c) > 1e+30 || + static_cast(c) < -(1e+30)) { + if (keys != nullptr) { + printf( + "err type %d, id=%lu, offset=%d, float=%f, key=%lu, " + "values=[%f,%f,%f,%f,%f,%f,%f,%f]\n", + type, + i, + k, + c, + uint64_t(keys[i]), + val[0], + val[1], + val[2], + val[3], + val[4], + val[5], + val[6], + val[7]); + } else { + printf("err type %d, id=%lu, offset=%d, float=%f, int=%d\n", + type, + i, + k, + c, + static_cast(c)); + } + } + } + } +} +template +void HeterCommKernel::check_valid_values(const int& type, + const size_t& N, + const KeyType* keys, + const char* input, + const size_t& value_bytes, + const StreamType& stream, + bool debug) { + CHECK_EQ(value_bytes % sizeof(float), 0); + const int grid_size = (N - 1) / block_size_ + 1; + const int num = static_cast(value_bytes / sizeof(float)); + check_valid_values_kernel<<>>( + type, N, keys, input, value_bytes, num, debug); +} + +template +__global__ void scale_grad_kernel(const size_t N, + char* grads, + const size_t value_bytes, + const size_t grad_dim, + const GPUAccessor& accessor) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < N) { + size_t idx = i / 
grad_dim; + size_t field_id = i % grad_dim; + + float* vals = reinterpret_cast(&grads[idx * value_bytes]); + float& show = vals[accessor.common_push_value.ShowIndex()]; + if (show > 0.0) { + vals[accessor.common_push_value.EmbedGIndex() + field_id] /= show; + } + } +} + +template +void HeterCommKernel::scale_grad(const size_t& len, + char* grads, + const size_t& value_bytes, + const size_t& max_mif_dim, + const StreamType& stream, + const GPUAccessor& gpu_accessor) { + const size_t grad_dim = (max_mif_dim + 1); + const size_t N = len * grad_dim; + const int grid_size = (N - 1) / block_size_ + 1; + scale_grad_kernel<<>>( + N, grads, value_bytes, grad_dim, gpu_accessor); +} +__device__ __forceinline__ int16_t float_int16(const float& val, + const float& bound) { + if (val >= bound) { + return 32767; + } else if (val <= -bound) { + return -32767; + } + if (val > 0.0) { + return int16_t((val * 32767.0 / bound) + 0.5); + } + return int16_t((val * 32767.0 / bound) - 0.5); +} +__global__ void compress_kernel(const size_t N, + const float* in, + const size_t float_num, + const size_t head_off, + char* out, + const size_t new_bytes, + const float bound) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < N) { + size_t idx = i / float_num; + size_t off = i % float_num; + + if (off < head_off) { + *(reinterpret_cast(&out[idx * new_bytes + off * sizeof(float)])) = + in[i]; + } else { + int16_t* dest = reinterpret_cast( + &out[idx * new_bytes + head_off * sizeof(float)]); + dest[off - head_off] = float_int16(in[i], bound); + } + } +} +// compress +template +size_t HeterCommKernel::compress_values(const size_t& len, + const char* in_vals, + char* out_vals, + const size_t& value_bytes, + const size_t& embedx_dim, + const float& max_bound, + const StreamType& stream) { + const size_t new_bytes = value_bytes - sizeof(int16_t) * embedx_dim; + const size_t float_num = size_t(value_bytes / sizeof(float)); + const size_t head_off = float_num - embedx_dim; + const size_t N = len * float_num; + const int grid_size = (N - 1) / block_size_ + 1; + compress_kernel<<>>(N, + (const float*)in_vals, + float_num, + head_off, + out_vals, + new_bytes, + max_bound); + return new_bytes; +} +__device__ __forceinline__ float int16_float(const int16_t& val, + const float& bound) { + return static_cast(val * bound / 32767.0); +} +__global__ void uncompress_kernel(const size_t N, + const char* in, + const size_t new_bytes, + float* out, + const size_t float_num, + const size_t head_off, + const float bound) { + const size_t i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < N) { + size_t idx = i / float_num; + size_t off = i % float_num; + + if (off < head_off) { + out[i] = *((const float*)&in[idx * new_bytes + off * sizeof(float)]); + } else { + const int16_t* src = + (const int16_t*)(&in[idx * new_bytes + head_off * sizeof(float)]); + out[i] = int16_float(src[off - head_off], bound); + } + } +} +// uncompress +template +void HeterCommKernel::uncompress_values(const size_t& len, + const char* in_vals, + char* out_vals, + const size_t& value_bytes, + const size_t& embedx_dim, + const float& max_bound, + const StreamType& stream) { + const size_t new_bytes = value_bytes - sizeof(int16_t) * embedx_dim; + const size_t float_num = size_t(value_bytes / sizeof(float)); + const size_t head_off = float_num - embedx_dim; + const size_t N = len * float_num; + const int grid_size = (N - 1) / block_size_ + 1; + uncompress_kernel<<>>( + N, + in_vals, + new_bytes, + reinterpret_cast(out_vals), + float_num, + head_off, + 
max_bound); +} template void HeterCommKernel::fill_idx( - int* idx, long long len, const cudaStream_t& stream); + int* idx, int64_t len, const cudaStream_t& stream); template void HeterCommKernel::fill_idx( - uint32_t* idx, long long len, const cudaStream_t& stream); + uint32_t* idx, int64_t len, const cudaStream_t& stream); template void HeterCommKernel::calc_shard_offset( int* idx, int* left, int* right, - long long len, + int64_t len, int total_devs, const cudaStream_t& stream); -template void -HeterCommKernel::calc_shard_index( - unsigned long* d_keys, - long long len, +template void HeterCommKernel::calc_shard_offset( + uint32_t* idx, + uint32_t* left, + uint32_t* right, + int64_t len, + int total_devs, + const cudaStream_t& stream); + +template void HeterCommKernel::calc_shard_index( + uint64_t* d_keys, + int64_t len, int* shard_index, int total_devs, const cudaStream_t& stream); -template void HeterCommKernel::calc_shard_index( - long* d_keys, - long long len, +template void HeterCommKernel::calc_shard_index( + int64_t* d_keys, + int64_t len, int* shard_index, int total_devs, const cudaStream_t& stream); -template void HeterCommKernel::fill_shard_key( - long* d_shard_keys, - long* d_keys, +template void +HeterCommKernel::calc_shard_index( + uint64_t* d_keys, + int64_t len, + uint32_t* shard_index, + int total_devs, + const cudaStream_t& stream); + +template void +HeterCommKernel::calc_shard_index( + int64_t* d_keys, + int64_t len, + uint32_t* shard_index, + int total_devs, + const cudaStream_t& stream); + +template void +HeterCommKernel::calc_node_shard_index( + const uint64_t* d_keys, + int64_t len, + int* shard_index, + const int& total_devs, + const int& node_num, + const cudaStream_t& stream); + +template void +HeterCommKernel::calc_node_shard_index( + const int64_t* d_keys, + int64_t len, + int* shard_index, + const int& total_devs, + const int& node_num, + const cudaStream_t& stream); + +template void +HeterCommKernel::calc_node_shard_index( + const uint64_t* d_keys, + int64_t len, + uint32_t* shard_index, + const int& total_devs, + const int& node_num, + const cudaStream_t& stream); + +template void +HeterCommKernel::calc_node_shard_index( + const int64_t* d_keys, + int64_t len, + uint32_t* shard_index, + const int& total_devs, + const int& node_num, + const cudaStream_t& stream); + +template void HeterCommKernel::fill_shard_key( + int64_t* d_shard_keys, + int64_t* d_keys, int* idx, - long long len, + int64_t len, const cudaStream_t& stream); -template void HeterCommKernel::fill_shard_key( - unsigned long* d_shard_keys, - unsigned long* d_keys, +template void HeterCommKernel::fill_shard_key( + uint64_t* d_shard_keys, + uint64_t* d_keys, int* idx, - long long len, + int64_t len, + const cudaStream_t& stream); + +template void HeterCommKernel::fill_shard_key( + int64_t* d_shard_keys, + int64_t* d_keys, + uint32_t* idx, + int64_t len, + const cudaStream_t& stream); + +template void HeterCommKernel::fill_shard_key( + uint64_t* d_shard_keys, + uint64_t* d_keys, + uint32_t* idx, + int64_t len, const cudaStream_t& stream); template void -HeterCommKernel::fill_shard_grads( - unsigned long* d_shard_keys, - unsigned long* d_keys, +HeterCommKernel::fill_shard_grads( + uint32_t* d_shard_keys, + uint32_t* d_keys, float* d_shard_grads, float* d_grads, int* idx, - long long len, + int64_t len, const cudaStream_t& stream); template void @@ -722,31 +1160,43 @@ HeterCommKernel::fill_dvals( paddle::framework::FeatureValue* d_shard_vals, paddle::framework::FeatureValue* d_vals, int* idx, - long 
long len, + int64_t len, const cudaStream_t& stream); -template void HeterCommKernel::sort_pairs( +template void HeterCommKernel:: + sort_pairs( + void* d_temp_storage, + size_t& temp_storage_bytes, // NOLINT + const uint32_t* d_keys_in, // NOLINT + uint32_t* d_keys_out, + const paddle::framework::FeaturePushValue* d_values_in, + paddle::framework::FeaturePushValue* d_values_out, + int num_items, + int begin_bit, + int end_bit, + cudaStream_t stream, + bool debug_synchronous); + +template void HeterCommKernel::sort_pairs( void* d_temp_storage, - size_t& temp_storage_bytes, // NOLINT - const unsigned long* d_keys_in, // NOLINT - unsigned long* d_keys_out, - const paddle::framework::FeaturePushValue* d_values_in, - paddle::framework::FeaturePushValue* d_values_out, + size_t& temp_storage_bytes, // NOLINT + const int* d_keys_in, // NOLINT + int* d_keys_out, + const int* d_values_in, + int* d_values_out, int num_items, int begin_bit, int end_bit, cudaStream_t stream, bool debug_synchronous); -template void HeterCommKernel::sort_pairs( +template void HeterCommKernel::sort_pairs( void* d_temp_storage, size_t& temp_storage_bytes, // NOLINT - const int* d_keys_in, // NOLINT - int* d_keys_out, - const int* d_values_in, - int* d_values_out, + const uint32_t* d_keys_in, // NOLINT + uint32_t* d_keys_out, + const uint32_t* d_values_in, + uint32_t* d_values_out, int num_items, int begin_bit, int end_bit, @@ -754,15 +1204,15 @@ template void HeterCommKernel::sort_pairs( bool debug_synchronous); template void HeterCommKernel::reduce_by_key< - unsigned long*, - unsigned long*, + uint32_t*, + uint32_t*, paddle::framework::FeaturePushValue*, paddle::framework::FeaturePushValue*, int*, cudaStream_t>(void* d_temp_storage, size_t& temp_storage_bytes, // NOLINT - unsigned long* d_keys_in, - unsigned long* d_unique_out, + uint32_t* d_keys_in, + uint32_t* d_unique_out, paddle::framework::FeaturePushValue* d_values_in, paddle::framework::FeaturePushValue* d_aggregates_out, int* d_num_runs_out, @@ -771,18 +1221,18 @@ template void HeterCommKernel::reduce_by_key< bool debug_synchronous); template void HeterCommKernel::dy_mf_fill_shard_grads< - unsigned long, + uint64_t, int, cudaStream_t, - CommonFeatureValueAccessor>(unsigned long* d_shard_keys, - unsigned long* d_keys, + CommonFeatureValueAccessor>(uint64_t* d_shard_keys, + uint64_t* d_keys, float* d_shard_grads, float* d_grads, int* idx, - long long len, + int64_t len, size_t grad_value_size, const cudaStream_t& stream, - CommonFeatureValueAccessor& gpu_accessor); + const CommonFeatureValueAccessor& gpu_accessor); template void HeterCommKernel:: merge_gradient( @@ -795,9 +1245,9 @@ template void HeterCommKernel:: int n, size_t grad_dim, size_t grad_value_size, - DynamicGradMerger& merger_, + const DynamicGradMerger& merger_, const cudaStream_t& stream, - CommonFeatureValueAccessor& gpu_accessor); + const CommonFeatureValueAccessor& gpu_accessor); template void HeterCommKernel:: merge_gradient( @@ -810,15 +1260,23 @@ template void HeterCommKernel:: int n, size_t grad_dim, size_t grad_value_size, - DynamicGradMerger& merger_, + const DynamicGradMerger& merger_, const cudaStream_t& stream, - CommonFeatureValueAccessor& gpu_accessor); + const CommonFeatureValueAccessor& gpu_accessor); template void HeterCommKernel::dy_mf_fill_dvals( float* d_shard_vals, float* d_vals, int* idx, - long long len, + int64_t len, + size_t val_size, + const cudaStream_t& stream); + +template void HeterCommKernel::dy_mf_fill_dvals( + float* d_shard_vals, + float* d_vals, + uint32_t* idx, + 
int64_t len, size_t val_size, const cudaStream_t& stream); @@ -891,6 +1349,126 @@ template void HeterCommKernel::unpack_merged_vals( void* d_vals, size_t val_size, const cudaStream_t& stream); + +template void HeterCommKernel::gather_keys( + uint64_t* d_shard_keys, + const uint64_t* d_keys, + int* idx, + int64_t len, + const cudaStream_t& stream); +template void HeterCommKernel::gather_keys( + int32_t* d_shard_keys, + const int32_t* d_keys, + int* idx, + int64_t len, + const cudaStream_t& stream); +template void HeterCommKernel::gather_keys( + uint64_t* d_shard_keys, + const uint64_t* d_keys, + uint32_t* idx, + int64_t len, + const cudaStream_t& stream); +template void HeterCommKernel::gather_keys( + int32_t* d_shard_keys, + const int32_t* d_keys, + uint32_t* idx, + int64_t len, + const cudaStream_t& stream); +template void HeterCommKernel::scatter_keys( + const uint64_t* d_shard_keys, + uint64_t* d_keys, + int* idx, + int64_t len, + const cudaStream_t& stream); + +template void HeterCommKernel::scatter_keys( + const int32_t* d_shard_keys, + int32_t* d_keys, + int* idx, + int64_t len, + const cudaStream_t& stream); +template void HeterCommKernel::scatter_keys( + const uint64_t* d_shard_keys, + uint64_t* d_keys, + uint32_t* idx, + int64_t len, + const cudaStream_t& stream); +template void HeterCommKernel::scatter_keys( + const int32_t* d_shard_keys, + int32_t* d_keys, + uint32_t* idx, + int64_t len, + const cudaStream_t& stream); +template void HeterCommKernel::gather_vals( + float* d_shard_vals, + const float* d_vals, + int* idx, + int64_t len, + size_t value_bytes, + const cudaStream_t& stream); +template void HeterCommKernel::gather_vals( + float* d_shard_vals, + const float* d_vals, + uint32_t* idx, + int64_t len, + size_t value_bytes, + const cudaStream_t& stream); +template void HeterCommKernel::scatter_vals( + const float* d_shard_vals, + float* d_vals, + int* idx, + int64_t len, + size_t value_bytes, + const cudaStream_t& stream); +template void HeterCommKernel::scatter_vals( + const float* d_shard_vals, + float* d_vals, + uint32_t* idx, + int64_t len, + size_t value_bytes, + const cudaStream_t& stream); +template void HeterCommKernel::check_valid_values( + const int& type, + const size_t& N, + const int32_t* keys, + const char* input, + const size_t& value_bytes, + const cudaStream_t& stream, + bool debug); +template void HeterCommKernel::check_valid_values( + const int& type, + const size_t& N, + const uint64_t* keys, + const char* input, + const size_t& value_bytes, + const cudaStream_t& stream, + bool debug); +template void +HeterCommKernel::scale_grad( + const size_t& len, + char* grads, + const size_t& value_bytes, + const size_t& grad_dim, + const cudaStream_t& stream, + const CommonFeatureValueAccessor& gpu_accessor); +// compress +template size_t HeterCommKernel::compress_values( + const size_t& len, + const char* in_vals, + char* out_vals, + const size_t& value_bytes, + const size_t& embedx_dim, + const float& max_bound, + const cudaStream_t& stream); +// uncompress +template void HeterCommKernel::uncompress_values( + const size_t& len, + const char* in_vals, + char* out_vals, + const size_t& value_bytes, + const size_t& embedx_dim, + const float& max_bound, + const cudaStream_t& stream); #endif } // namespace framework diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.h b/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.h index affde16713c62..f9e1a98e5cc2d 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.h +++ 
b/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.h @@ -42,50 +42,54 @@ struct DynamicGradMerger { } template - __device__ __forceinline__ void update_one(float* output, - const float* input, - GPUAccessor& gpu_accessor) { + __device__ __forceinline__ void update_one( + float* output, + const float* input, + const GPUAccessor& gpu_accessor) const { gpu_accessor.PushValueFill(output, input); } template - __device__ __forceinline__ void merge_one(float* output, - const float* input, - GPUAccessor& gpu_accessor) { + __device__ __forceinline__ void merge_one( + float* output, + const float* input, + const GPUAccessor& gpu_accessor) const { gpu_accessor.MergePushValue(output, input); } template - __device__ __forceinline__ void update_basic(float* output, - const float* input, - GPUAccessor& fv_accessor) { + __device__ __forceinline__ void update_basic( + float* output, const float* input, const GPUAccessor& fv_accessor) const { fv_accessor.PushValueFillBasic(output, input); } template - __device__ __forceinline__ void merge_basic(float* output, - const float* input, - GPUAccessor& fv_accessor) { + __device__ __forceinline__ void merge_basic( + float* output, const float* input, const GPUAccessor& fv_accessor) const { fv_accessor.MergePushValueBasic(output, input); } template - __device__ __forceinline__ void update_embedx(float* output, - const float* input, - size_t embedx_idx, - GPUAccessor& fv_accessor) { - if (embedx_idx < output[fv_accessor.common_push_value.MfDimIndex()]) { + __device__ __forceinline__ void update_embedx( + float* output, + const float* input, + const int embedx_idx, + const GPUAccessor& fv_accessor) const { + if (embedx_idx < + static_cast(output[fv_accessor.common_push_value.MfDimIndex()])) { output[fv_accessor.common_push_value.EmbedxGIndex() + embedx_idx] = input[fv_accessor.common_push_value.EmbedxGIndex() + embedx_idx]; } } template - __device__ __forceinline__ void merge_embedx(float* output, - const float* input, - size_t embedx_idx, - GPUAccessor& fv_accessor) { - if (embedx_idx < output[fv_accessor.common_push_value.MfDimIndex()]) { + __device__ __forceinline__ void merge_embedx( + float* output, + const float* input, + const int embedx_idx, + const GPUAccessor& fv_accessor) const { + if (embedx_idx < + static_cast(output[fv_accessor.common_push_value.MfDimIndex()])) { output[fv_accessor.common_push_value.EmbedxGIndex() + embedx_idx] += input[fv_accessor.common_push_value.EmbedxGIndex() + embedx_idx]; } @@ -98,19 +102,19 @@ class HeterCommKernel { explicit HeterCommKernel(const int block_size) : block_size_(block_size) {} template - void fill_idx(T* idx, long long len, const StreamType& stream); + void fill_idx(T* idx, int64_t len, const StreamType& stream); template void calc_shard_offset(T* idx, T* left, T* right, - long long len, + int64_t len, int total_devs, const StreamType& stream); template void calc_shard_index(KeyType* d_keys, - long long len, + int64_t len, T* shard_index, int total_devs, @@ -120,7 +124,7 @@ class HeterCommKernel { void fill_shard_key(KeyType* d_shard_keys, KeyType* d_keys, T* idx, - long long len, + int64_t len, const StreamType& stream); template void fill_dvals(ValType* d_shard_vals, ValType* d_vals, T* idx, - long long len, + int64_t len, const StreamType& stream); template @@ -183,10 +187,10 @@ class HeterCommKernel { float* d_shard_grads, float* d_grads, T* idx, - long long len, + int64_t len, size_t grad_value_size, const StreamType& stream, - GPUAccessor& gpu_accessor); + const GPUAccessor& gpu_accessor); template 
void merge_gradient(const KeyType* d_shard_keys, @@ -198,15 +202,15 @@ class HeterCommKernel { int n, size_t grad_dim, size_t grad_value_size, - DynamicGradMerger& merger, + const DynamicGradMerger& merger, const StreamType& stream, - GPUAccessor& gpu_accessor); + const GPUAccessor& gpu_accessor); template void dy_mf_fill_dvals(float* d_shard_vals, float* d_vals, T* idx, - long long len, + int64_t len, size_t val_size, const StreamType& stream); @@ -253,6 +257,76 @@ class HeterCommKernel { size_t val_size, const StreamType& stream); + template + void calc_node_shard_index(const KeyType* d_keys, + int64_t len, + T* shard_index, + const int& total_devs, + const int& node_num, + const StreamType& stream); + + template + void gather_keys(KeyType* d_shard_keys, + const KeyType* d_keys, + T* idx, + int64_t len, + const StreamType& stream); + template + void scatter_keys(const KeyType* d_shard_keys, + KeyType* d_keys, + T* idx, + int64_t len, + const StreamType& stream); + template + void gather_vals(float* d_shard_vals, + const float* d_vals, + T* idx, + int64_t len, + size_t value_bytes, + const StreamType& stream); + template + void scatter_vals(const float* d_shard_vals, + float* d_vals, + T* idx, + int64_t len, + size_t value_bytes, + const StreamType& stream); + // scale grad values + template + void scale_grad(const size_t& len, + char* grads, + const size_t& value_bytes, + const size_t& grad_dim, + const StreamType& stream, + const GPUAccessor& gpu_accessor); + + template + void check_valid_values(const int& type, + const size_t& N, + const KeyType* keys, + const char* input, + const size_t& value_bytes, + const StreamType& stream, + bool debug = false); + // compress + template + size_t compress_values(const size_t& len, + const char* in_vals, + char* out_vals, + const size_t& value_bytes, + const size_t& embedx_dim, + const float& max_bound, + const StreamType& stream); + // uncompress + template + void uncompress_values(const size_t& len, + const char* in_vals, + char* out_vals, + const size_t& value_bytes, + const size_t& embedx_dim, + const float& max_bound, + const StreamType& stream); + private: int block_size_{256}; }; diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.kps b/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.kps index b44ea1807fd65..7849816ce5fc9 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.kps +++ b/paddle/fluid/framework/fleet/heter_ps/heter_comm_kernel.kps @@ -67,8 +67,8 @@ __global__ void fill_idx_kernel(T* idx, long long len) { } template -__global__ void calc_shard_offset_kernel(T* idx, T* left, T* right, - long long len, const int total_xpu) { +__global__ void calc_shard_offset_kernel( + T* idx, T* left, T* right, long long len, const int total_xpu) { int cid = core_id(); int ncores = core_num(); if (cid >= ncores) { @@ -115,8 +115,10 @@ __global__ void calc_shard_offset_kernel(T* idx, T* left, T* right, } template -__global__ void calc_shard_index_kernel(KeyType* d_keys, long long len, - T* shard_index, int total_xpu) { +__global__ void calc_shard_index_kernel(KeyType* d_keys, + long long len, + T* shard_index, + int total_xpu) { int cid = core_id(); int ncores = core_num(); if (cid >= ncores) { @@ -141,8 +143,10 @@ __global__ void calc_shard_index_kernel(KeyType* d_keys, long long len, } template -__global__ void fill_shard_key_kernel(KeyType* d_shard_keys, KeyType* d_keys, - T* idx, long long len) { +__global__ void fill_shard_key_kernel(KeyType* d_shard_keys, + KeyType* d_keys, + T* idx, + long long len) { 
int cid = core_id(); int ncores = core_num(); if (cid >= ncores) { @@ -171,9 +175,11 @@ __global__ void fill_shard_key_kernel(KeyType* d_shard_keys, KeyType* d_keys, // local mem too large, cause compile error template -__global__ void fill_shard_grads_kernel(KeyType* d_shard_keys, KeyType* d_keys, +__global__ void fill_shard_grads_kernel(KeyType* d_shard_keys, + KeyType* d_keys, GradType* d_shard_grads, - GradType* d_grads, T* idx, + GradType* d_grads, + T* idx, long long len) { int cid = core_id(); int ncores = core_num(); @@ -200,8 +206,8 @@ __global__ void fill_shard_grads_kernel(KeyType* d_shard_keys, KeyType* d_keys, GM2LM(idx + i, local_idx, read_len * sizeof(T)); for (int k = 0; k < read_len; k++) { GM2LM(d_keys + local_idx[k], &local_shard_keys[k], 1 * sizeof(KeyType)); - GM2LM(d_grads + local_idx[k], &local_shard_grads[k], - 1 * sizeof(GradType)); + GM2LM( + d_grads + local_idx[k], &local_shard_grads[k], 1 * sizeof(GradType)); // local_shard_keys[k] = local_keys[local_idx[k]]; // local_shard_grads[k] = local_grads[local_idx[k]]; } @@ -211,8 +217,10 @@ __global__ void fill_shard_grads_kernel(KeyType* d_shard_keys, KeyType* d_keys, } template -__global__ void fill_dvals_kernel(ValType* d_shard_vals, ValType* d_vals, - T* idx, long long len) { +__global__ void fill_dvals_kernel(ValType* d_shard_vals, + ValType* d_vals, + T* idx, + long long len) { int cid = core_id(); int ncores = core_num(); if (cid >= ncores) { @@ -240,63 +248,84 @@ __global__ void fill_dvals_kernel(ValType* d_shard_vals, ValType* d_vals, } template -void HeterCommKernel::fill_idx(T* idx, long long len, +void HeterCommKernel::fill_idx(T* idx, + long long len, const StreamType& stream) { fill_idx_kernel<<<4, 64, stream>>>(idx, len); } template -void HeterCommKernel::calc_shard_offset(T* idx, T* left, T* right, - long long len, int total_devs, +void HeterCommKernel::calc_shard_offset(T* idx, + T* left, + T* right, + long long len, + int total_devs, const StreamType& stream) { - calc_shard_offset_kernel<<<4, 64, stream>>>(idx, left, right, len, - total_devs); + calc_shard_offset_kernel + <<<4, 64, stream>>>(idx, left, right, len, total_devs); } template -void HeterCommKernel::calc_shard_index(KeyType* d_keys, long long len, - T* shard_index, int total_devs, +void HeterCommKernel::calc_shard_index(KeyType* d_keys, + long long len, + T* shard_index, + int total_devs, const StreamType& stream) { - calc_shard_index_kernel<<<4, 64, stream>>>( - d_keys, len, shard_index, total_devs); + calc_shard_index_kernel + <<<4, 64, stream>>>(d_keys, len, shard_index, total_devs); } template -void HeterCommKernel::fill_shard_key(KeyType* d_shard_keys, KeyType* d_keys, - T* idx, long long len, +void HeterCommKernel::fill_shard_key(KeyType* d_shard_keys, + KeyType* d_keys, + T* idx, + long long len, const StreamType& stream) { - fill_shard_key_kernel<<<4, 64, stream>>>(d_shard_keys, d_keys, - idx, len); + fill_shard_key_kernel + <<<4, 64, stream>>>(d_shard_keys, d_keys, idx, len); } template -void HeterCommKernel::fill_shard_grads(KeyType* d_shard_keys, KeyType* d_keys, +void HeterCommKernel::fill_shard_grads(KeyType* d_shard_keys, + KeyType* d_keys, GradType* d_shard_grads, - GradType* d_grads, T* idx, long long len, + GradType* d_grads, + T* idx, + long long len, const StreamType& stream) { fill_shard_grads_kernel<<<4, 64, stream>>>( d_shard_keys, d_keys, d_shard_grads, d_grads, idx, len); } template -void HeterCommKernel::fill_dvals(ValType* d_shard_vals, ValType* d_vals, T* idx, - long long len, const StreamType& stream) { - 
fill_dvals_kernel<<<4, 64, stream>>>(d_shard_vals, d_vals, idx, - len); +void HeterCommKernel::fill_dvals(ValType* d_shard_vals, + ValType* d_vals, + T* idx, + long long len, + const StreamType& stream) { + fill_dvals_kernel + <<<4, 64, stream>>>(d_shard_vals, d_vals, idx, len); } template void HeterCommKernel::sort_pairs(void* d_temp_storage, size_t& temp_storage_bytes, // NOLINT const KeyT* d_keys_in, // NOLINT - KeyT* d_keys_out, const ValueT* d_values_in, - ValueT* d_values_out, int num_items, - int begin_bit, int end_bit, StreamType stream, + KeyT* d_keys_out, + const ValueT* d_values_in, + ValueT* d_values_out, + int num_items, + int begin_bit, + int end_bit, + StreamType stream, bool debug_synchronous) {} -template +template void HeterCommKernel::reduce_by_key(void* d_temp_storage, size_t& temp_storage_bytes, // NOLINT KeysInputIteratorT d_keys_in, @@ -304,62 +333,97 @@ void HeterCommKernel::reduce_by_key(void* d_temp_storage, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, - int num_items, StreamType stream, + int num_items, + StreamType stream, bool debug_synchronous) {} template void HeterCommKernel::fill_idx( int* idx, long long len, const XPUStream& stream); template void HeterCommKernel::calc_shard_offset( - int* idx, int* left, int* right, long long len, int total_devs, + int* idx, + int* left, + int* right, + long long len, + int total_devs, const XPUStream& stream); template void HeterCommKernel::calc_shard_index( - unsigned long* d_keys, long long len, int* shard_index, int total_devs, + unsigned long* d_keys, + long long len, + int* shard_index, + int total_devs, const XPUStream& stream); template void HeterCommKernel::fill_shard_key( - unsigned long* d_shard_keys, unsigned long* d_keys, int* idx, long long len, + unsigned long* d_shard_keys, + unsigned long* d_keys, + int* idx, + long long len, const XPUStream& stream); template void HeterCommKernel::fill_shard_grads< - unsigned long, paddle::framework::FeaturePushValue, int, XPUStream>( - unsigned long* d_shard_keys, unsigned long* d_keys, - paddle::framework::FeaturePushValue* d_shard_grads, - paddle::framework::FeaturePushValue* d_grads, int* idx, long long len, - const XPUStream& stream); + unsigned long, + paddle::framework::FeaturePushValue, + int, + XPUStream>(unsigned long* d_shard_keys, + unsigned long* d_keys, + paddle::framework::FeaturePushValue* d_shard_grads, + paddle::framework::FeaturePushValue* d_grads, + int* idx, + long long len, + const XPUStream& stream); template void HeterCommKernel::fill_dvals( paddle::framework::FeatureValue* d_shard_vals, - paddle::framework::FeatureValue* d_vals, int* idx, long long len, + paddle::framework::FeatureValue* d_vals, + int* idx, + long long len, const XPUStream& stream); -template void HeterCommKernel::sort_pairs< - unsigned long, paddle::framework::FeaturePushValue, XPUStream>( - void* d_temp_storage, - size_t& temp_storage_bytes, // NOLINT - const unsigned long* d_keys_in, // NOLINT - unsigned long* d_keys_out, - const paddle::framework::FeaturePushValue* d_values_in, - paddle::framework::FeaturePushValue* d_values_out, int num_items, - int begin_bit, int end_bit, XPUStream stream, bool debug_synchronous); +template void HeterCommKernel:: + sort_pairs( + void* d_temp_storage, + size_t& temp_storage_bytes, // NOLINT + const unsigned long* d_keys_in, // NOLINT + unsigned long* d_keys_out, + const paddle::framework::FeaturePushValue* d_values_in, + paddle::framework::FeaturePushValue* d_values_out, 
+ int num_items, + int begin_bit, + int end_bit, + XPUStream stream, + bool debug_synchronous); template void HeterCommKernel::sort_pairs( void* d_temp_storage, size_t& temp_storage_bytes, // NOLINT const int* d_keys_in, // NOLINT - int* d_keys_out, const int* d_values_in, int* d_values_out, int num_items, - int begin_bit, int end_bit, XPUStream stream, bool debug_synchronous); + int* d_keys_out, + const int* d_values_in, + int* d_values_out, + int num_items, + int begin_bit, + int end_bit, + XPUStream stream, + bool debug_synchronous); template void HeterCommKernel::reduce_by_key< - unsigned long*, unsigned long*, paddle::framework::FeaturePushValue*, - paddle::framework::FeaturePushValue*, int*, XPUStream>( - void* d_temp_storage, - size_t& temp_storage_bytes, // NOLINT - unsigned long* d_keys_in, unsigned long* d_unique_out, - paddle::framework::FeaturePushValue* d_values_in, - paddle::framework::FeaturePushValue* d_aggregates_out, int* d_num_runs_out, - int num_items, XPUStream stream, bool debug_synchronous); + unsigned long*, + unsigned long*, + paddle::framework::FeaturePushValue*, + paddle::framework::FeaturePushValue*, + int*, + XPUStream>(void* d_temp_storage, + size_t& temp_storage_bytes, // NOLINT + unsigned long* d_keys_in, + unsigned long* d_unique_out, + paddle::framework::FeaturePushValue* d_values_in, + paddle::framework::FeaturePushValue* d_aggregates_out, + int* d_num_runs_out, + int num_items, + XPUStream stream, + bool debug_synchronous); #endif diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_ps.cc b/paddle/fluid/framework/fleet/heter_ps/heter_ps.cc index 59c31c5bc2735..dc886256360e5 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_ps.cc +++ b/paddle/fluid/framework/fleet/heter_ps/heter_ps.cc @@ -55,7 +55,7 @@ template class GPUOptimizer> HeterPs::HeterPs( size_t capacity, std::shared_ptr resource, - GPUAccessor& gpu_accessor) { + const GPUAccessor& gpu_accessor) { comm_ = std::make_shared>( capacity, resource); opt_ = GPUOptimizer(gpu_accessor); diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_ps.cu b/paddle/fluid/framework/fleet/heter_ps/heter_ps.cu index 92934e961f149..000ded733899b 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_ps.cu +++ b/paddle/fluid/framework/fleet/heter_ps/heter_ps.cu @@ -54,7 +54,7 @@ template class GPUOptimizer> HeterPs::HeterPs( size_t capacity, std::shared_ptr resource, - GPUAccessor& gpu_accessor) { + const GPUAccessor& gpu_accessor) { comm_ = std::make_shared>( capacity, resource, gpu_accessor); opt_ = GPUOptimizer(gpu_accessor); @@ -122,8 +122,9 @@ template class GPUOptimizer> void HeterPs::set_nccl_comm_and_size( const std::vector& inner_comms, const std::vector& inter_comms, - int comm_size) { - comm_->set_nccl_comm_and_size(inner_comms, inter_comms, comm_size); + int comm_size, + int rank_id) { + comm_->set_nccl_comm_and_size(inner_comms, inter_comms, comm_size, rank_id); } template class GPUOptimizer> diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_ps.h b/paddle/fluid/framework/fleet/heter_ps/heter_ps.h index 20292a4df3633..dcafde10b9782 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_ps.h +++ b/paddle/fluid/framework/fleet/heter_ps/heter_ps.h @@ -32,7 +32,7 @@ class HeterPs : public HeterPsBase { HeterPs() {} HeterPs(size_t capacity, std::shared_ptr resource, - GPUAccessor& gpu_accessor); + const GPUAccessor& gpu_accessor); virtual ~HeterPs(); HeterPs(const HeterPs&) = delete; HeterPs& operator=(const HeterPs&) = delete; @@ -41,8 +41,6 @@ class HeterPs : public HeterPsBase { 
FeatureKey* d_keys, float* d_vals, size_t len) override; - // void build_ps(int num, FeatureKey* h_keys, float* h_vals, size_t len, - // size_t chunk_size, int stream_num) override; void build_ps(int num, FeatureKey* h_keys, char* pool, @@ -53,7 +51,8 @@ class HeterPs : public HeterPsBase { #if defined(PADDLE_WITH_CUDA) void set_nccl_comm_and_size(const std::vector& inner_comms, const std::vector& inter_comms, - int comm_size) override; + int comm_size, + int rank_id) override; void set_multi_mf_dim(int multi_mf_dim, int max_mf_dim) override; #endif @@ -79,6 +78,16 @@ class HeterPs : public HeterPsBase { uint32_t* d_merged_cnts, bool filter_zero); #endif + // reset table + void reset_table(const int dev_id, + size_t capacity, + const OptimizerConfig& sgd_config, + const OptimizerConfig& embedx_config, + bool infer_mode) { + comm_->reset_table(dev_id, capacity, sgd_config, embedx_config, infer_mode); + } + void set_mode(bool infer_mode) { comm_->set_mode(infer_mode); } + private: std::shared_ptr> comm_; #if defined(PADDLE_WITH_CUDA) diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_ps_base.h b/paddle/fluid/framework/fleet/heter_ps/heter_ps_base.h index af1a1261d7341..8624425d8bfbd 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_ps_base.h +++ b/paddle/fluid/framework/fleet/heter_ps/heter_ps_base.h @@ -48,7 +48,8 @@ class HeterPsBase { virtual void set_nccl_comm_and_size( const std::vector& inner_comms, const std::vector& inter_comms, - int comm_size) = 0; + int comm_size, + int rank_id) = 0; virtual void set_multi_mf_dim(int multi_mf_dim, int max_mf_dim) = 0; #endif @@ -82,6 +83,12 @@ class HeterPsBase { uint32_t* d_merged_cnts, bool filter_zero) = 0; #endif + virtual void reset_table(const int dev_id, + size_t capacity, + const OptimizerConfig& sgd_config, + const OptimizerConfig& embedx_config, + bool infer_mode) = 0; + virtual void set_mode(bool infer_mode) = 0; }; } // end namespace framework diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_resource.cc b/paddle/fluid/framework/fleet/heter_ps/heter_resource.cc index b330c9bb9f5ef..fa72b5c99d599 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_resource.cc +++ b/paddle/fluid/framework/fleet/heter_ps/heter_resource.cc @@ -23,12 +23,16 @@ limitations under the License. 
*/ #include "paddle/fluid/platform/device/xpu/enforce_xpu.h" #include "paddle/fluid/platform/device/xpu/xpu_info.h" #endif +#include "paddle/utils/string/string_helper.h" + +DECLARE_bool(enable_auto_detect_gpu_topo); +DECLARE_bool(enable_auto_rdma_trans); namespace paddle { namespace framework { #if defined(PADDLE_WITH_CUDA) -GPUResource::GPUResource(std::vector& dev_ids, int index) { +GPUResource::GPUResource(std::vector &dev_ids, int index) { index_ = index; dev_ids_ = dev_ids; dev_id_ = dev_ids_[index]; @@ -62,7 +66,7 @@ GPUResource::~GPUResource() { } #elif defined(PADDLE_WITH_XPU_KP) -XPUResource::XPUResource(std::vector& dev_ids, int index) { +XPUResource::XPUResource(std::vector &dev_ids, int index) { index_ = index; dev_ids_ = dev_ids; dev_id_ = dev_ids_[index]; @@ -119,8 +123,124 @@ void HeterPsResource::enable_p2p() { } #endif } +static std::string excute_cmd_result(const std::string &cmd) { + FILE *fp = popen(cmd.c_str(), "r"); + if (fp == NULL) { + fprintf(stderr, "cmd %s open failed\n", cmd.c_str()); + return ""; + } + + std::string out; + size_t ret = 0; + char szline[1024] = {0}; + while ((ret = fread(szline, sizeof(char), sizeof(szline), fp)) > 0) { + out.append(szline, ret); + } + pclose(fp); + fprintf(stderr, "cmd: %s, ret:\n%s\n", cmd.c_str(), out.c_str()); + return paddle::string::trim_spaces(out); +} +#if defined(PADDLE_WITH_CUDA) +static std::shared_ptr g_checker = nullptr; +GpuRDMAChecker *GpuRDMAChecker::get(int device_num) { + if (g_checker == nullptr) { + g_checker = std::make_shared(device_num); + } + // check gpu num + CHECK(device_num == g_checker->device_num()); + return g_checker.get(); +} +GpuRDMAChecker::GpuRDMAChecker(int device_num) { + device_num_ = device_num; + rdma_trans_ = check_device_status(device_num, &rdma_status_); +} +bool GpuRDMAChecker::need_rdma_trans(void) { + return (FLAGS_enable_auto_rdma_trans && rdma_trans_); +} +bool GpuRDMAChecker::is_device_support_rdma(int devid) { + if (rdma_status_.empty()) { + return true; + } + return rdma_status_[devid]; +} +bool GpuRDMAChecker::check_device_status(const int &device_count, + std::vector *gpu_status) { + // not need auto detect gpu topo aware + if (!FLAGS_enable_auto_detect_gpu_topo) { + return false; + } + // a100 + std::string str = excute_cmd_result("source ~/.bashrc && nvidia-smi topo -m"); + if (str.empty()) { // a100 auto gpu card rdma status + return false; + } + // mlx5_0 PXB PXB SYS SYS SYS SYS SYS SYS X SYS SYS + // mlx5_2 SYS SYS PXB PXB SYS SYS SYS SYS SYS NODE X + std::vector lines = paddle::string::split_string(str, "\n"); + if (lines.empty()) { + fprintf(stdout, "%s\n", str.c_str()); + return false; + } + std::vector gpu_mlxs; + gpu_status->resize(device_count, 0); + gpu_mlxs.resize(device_count); + for (auto line : lines) { + std::vector tags = paddle::string::split_string(line); + if (tags.size() < static_cast(device_count + 1)) { + continue; + } + std::string &card_name = tags[0]; + if (strncmp(card_name.c_str(), "GPU0", 4) == 0) { + // check topo_aware + topo_aware_ = false; + for (int j = 1; j < device_count; ++j) { + std::string &tag = tags[j + 1]; + if (strncmp(tag.c_str(), "NV", 2) == 0) { + continue; + } + topo_aware_ = true; + } + continue; + } + if (strncmp(card_name.c_str(), "mlx5", 4) != 0) { + continue; + } + for (int j = 0; j < device_count; ++j) { + std::string &tag = tags[j + 1]; + if (strcmp(tag.c_str(), "PXB") != 0 && strcmp(tag.c_str(), "PIX") != 0) { + continue; + } + (*gpu_status)[j] = 1; + if (!gpu_mlxs[j].empty()) { + gpu_mlxs[j].append(","); + } + 
gpu_mlxs[j].append(card_name); + } + } + int not_trans_cnt = 0; + int need_trans_cnt = 0; + // check all rdma + for (int j = 0; j < device_count; ++j) { + if ((*gpu_status)[j] > 0) { + fprintf( + stdout, "GPU%d: rdma check ok, used %s\n", j, gpu_mlxs[j].c_str()); + continue; + } + int trans_id = (j + device_count / 2) % device_count; + if ((*gpu_status)[trans_id] > 0) { + fprintf( + stdout, "GPU%d: rdma check pcie, used trans id %d\n", j, trans_id); + ++need_trans_cnt; + } else { + ++not_trans_cnt; + } + } + // need trans device all connect to other device + return (need_trans_cnt > 0 && not_trans_cnt == 0); +} +#endif -HeterPsResource::HeterPsResource(const std::vector& dev_ids) { +HeterPsResource::HeterPsResource(const std::vector &dev_ids) { dev_ids_ = dev_ids; for (size_t i = 0; i < dev_ids_.size(); ++i) { std::shared_ptr resource = diff --git a/paddle/fluid/framework/fleet/heter_ps/heter_resource.h b/paddle/fluid/framework/fleet/heter_ps/heter_resource.h index 087877818f5fb..1a624dc5224a4 100644 --- a/paddle/fluid/framework/fleet/heter_ps/heter_resource.h +++ b/paddle/fluid/framework/fleet/heter_ps/heter_resource.h @@ -97,6 +97,34 @@ using DevPlace = platform::XPUPlace; using AnyDeviceGuard = platform::XPUDeviceGuard; #endif +#if defined(PADDLE_WITH_CUDA) +class GpuRDMAChecker { + public: + static GpuRDMAChecker* get(int device_num); + + public: + explicit GpuRDMAChecker(int device_num); + // rdma + bool need_rdma_trans(void); + bool is_device_support_rdma(int devid); + // device num + int device_num(void) { return device_num_; } + // topo_aware + bool topo_aware(void) { return topo_aware_; } + + private: + bool check_device_status(const int& device_count, + std::vector* gpu_status); + + private: + int device_num_ = 0; + bool topo_aware_ = false; + // rdma + bool rdma_trans_ = false; + std::vector rdma_status_; +}; +#endif + class HeterPsResource { public: explicit HeterPsResource(const std::vector& dev_ids); @@ -114,12 +142,17 @@ class HeterPsResource { ppStream local_stream(int dev_num, int stream_num); ppStream remote_stream(int dev_num, int stream_num); ppStream comm_stream(int dev_num, int stream_num); + // node + bool multi_node(void) { return multi_node_; } + void set_multi_node(bool multi_node) { multi_node_ = multi_node; } std::vector> resources_; std::vector dev_ids_; std::map devid_2_index_; int multi_mf_dim_{0}; int max_mf_dim_{0}; + // multi node + bool multi_node_ = false; }; } // end namespace framework diff --git a/paddle/fluid/framework/fleet/heter_ps/mem_pool.h b/paddle/fluid/framework/fleet/heter_ps/mem_pool.h index 4696a7cc91b5a..1574a2f98ebd1 100644 --- a/paddle/fluid/framework/fleet/heter_ps/mem_pool.h +++ b/paddle/fluid/framework/fleet/heter_ps/mem_pool.h @@ -15,8 +15,6 @@ limitations under the License. 
*/ #pragma once #ifdef PADDLE_WITH_HETERPS -// #include -// "paddle/fluid/framework/fleet/heter_ps/cudf/concurrent_unordered_map.cuh.h" #include #ifdef PADDLE_WITH_CUDA #include "paddle/fluid/framework/fleet/heter_ps/cudf/managed.cuh" @@ -31,7 +29,7 @@ class MemoryPool { : capacity_(capacity), block_size_(block_size) { VLOG(3) << "mem_pool init with block_size: " << block_size << " capacity: " << capacity; - mem_ = (char*)malloc(block_size * capacity_); + mem_ = reinterpret_cast(malloc(block_size * capacity_)); } ~MemoryPool() { VLOG(3) << "mem pool delete"; @@ -42,9 +40,7 @@ class MemoryPool { size_t capacity() { return capacity_; } size_t byte_size() { return capacity_ * block_size_; } - void* mem_address(const uint32_t& idx) { - return (void*)&mem_[(idx)*block_size_]; - } + void* mem_address(const uint32_t& idx) { return &mem_[(idx)*block_size_]; } private: char* mem_ = NULL; @@ -52,11 +48,12 @@ class MemoryPool { size_t block_size_; }; +// Derived from managed, alloced managed hbm class HBMMemoryPool : public managed { public: HBMMemoryPool(size_t capacity, size_t block_size) : capacity_(capacity), block_size_(block_size) {} - HBMMemoryPool(MemoryPool* mem_pool) { + explicit HBMMemoryPool(MemoryPool* mem_pool) { capacity_ = mem_pool->capacity(); block_size_ = mem_pool->block_size(); VLOG(3) << "hbm memory pool with capacity" << capacity_ @@ -87,13 +84,60 @@ class HBMMemoryPool : public managed { size_t capacity() { return capacity_; } __forceinline__ __device__ void* mem_address(const uint32_t& idx) { - return (void*)&mem_[(idx)*block_size_]; + return &mem_[(idx)*block_size_]; + } + + private: + char* mem_ = NULL; + size_t capacity_; + size_t block_size_; +}; + +class HBMMemoryPoolFix : public managed { + public: + HBMMemoryPoolFix() { + capacity_ = 0; + size_ = 0; + block_size_ = 0; + max_byte_capacity_ = 0; + } + + ~HBMMemoryPoolFix() { + VLOG(3) << "delete hbm memory pool"; + cudaFree(mem_); + } + + size_t block_size() { return block_size_; } + + void clear(void) { cudaMemset(mem_, 0, block_size_ * capacity_); } + + void reset(size_t capacity, size_t block_size) { + if (max_byte_capacity_ < capacity * block_size) { + if (mem_ != NULL) { + cudaFree(mem_); + } + max_byte_capacity_ = (block_size * capacity / 8 + 1) * 8; + CUDA_CHECK(cudaMalloc(&mem_, max_byte_capacity_)); + } + size_ = capacity; + block_size_ = block_size; + capacity_ = max_byte_capacity_ / block_size; + } + + char* mem() { return mem_; } + + size_t capacity() { return capacity_; } + size_t size() { return size_; } + __forceinline__ __device__ void* mem_address(const uint32_t& idx) { + return &mem_[(idx)*block_size_]; } private: char* mem_ = NULL; size_t capacity_; + size_t size_; size_t block_size_; + size_t max_byte_capacity_; }; } // end namespace framework diff --git a/paddle/fluid/framework/fleet/heter_ps/optimizer.cuh.h b/paddle/fluid/framework/fleet/heter_ps/optimizer.cuh.h index 1e95284869856..ef633528af176 100644 --- a/paddle/fluid/framework/fleet/heter_ps/optimizer.cuh.h +++ b/paddle/fluid/framework/fleet/heter_ps/optimizer.cuh.h @@ -91,14 +91,15 @@ class SparseAdagradOptimizer { optimizer_config.nonclk_coeff * (g_show - g_click) + optimizer_config.clk_coeff * g_click; float slot = ptr[gpu_accessor_.common_feature_value.SlotIndex()]; - + // dual box + float scale = (optimizer_config.multi_node) ? 
1.0 : g_show; update_value_work( optimizer_config, 1, ptr + gpu_accessor_.common_feature_value.EmbedWIndex(), ptr + gpu_accessor_.common_feature_value.EmbedG2SumIndex(), grad + gpu_accessor_.common_push_value.EmbedGIndex(), - g_show, + scale, slot); int mf_dim = int(ptr[gpu_accessor_.common_feature_value.MfDimIndex()]); @@ -127,7 +128,7 @@ class SparseAdagradOptimizer { ptr + gpu_accessor_.common_feature_value.EmbedxWIndex(), ptr + gpu_accessor_.common_feature_value.EmbedxG2SumIndex(), grad + gpu_accessor_.common_push_value.EmbedxGIndex(), - g_show, + scale, slot); } } diff --git a/paddle/fluid/framework/fleet/heter_ps/optimizer_conf.h b/paddle/fluid/framework/fleet/heter_ps/optimizer_conf.h index ba76f2ff914b2..a50fe55d7d696 100644 --- a/paddle/fluid/framework/fleet/heter_ps/optimizer_conf.h +++ b/paddle/fluid/framework/fleet/heter_ps/optimizer_conf.h @@ -43,6 +43,8 @@ class OptimizerConfig { float nodeid_slot = 9008; float feature_learning_rate = 0.05; + // multi node + bool multi_node = false; void set_sparse_sgd(float nonclk_coeff, float clk_coeff, @@ -115,8 +117,8 @@ class OptimizerConfig { this->mf_beta2_decay_rate = optimizer_config.mf_beta2_decay_rate; this->mf_ada_epsilon = optimizer_config.mf_ada_epsilon; - this->nodeid_slot = nodeid_slot; - this->feature_learning_rate = feature_learning_rate; + this->nodeid_slot = optimizer_config.nodeid_slot; + this->feature_learning_rate = optimizer_config.feature_learning_rate; } }; diff --git a/paddle/fluid/framework/fleet/heter_ps/test_cpu_query.cu b/paddle/fluid/framework/fleet/heter_ps/test_cpu_query.cu index c4e77d65203be..78297ce292c41 100644 --- a/paddle/fluid/framework/fleet/heter_ps/test_cpu_query.cu +++ b/paddle/fluid/framework/fleet/heter_ps/test_cpu_query.cu @@ -25,10 +25,11 @@ #include "paddle/fluid/framework/fleet/heter_ps/optimizer.cuh.h" #include "paddle/fluid/platform/cuda_device_guard.h" -using namespace paddle::framework; +using namespace paddle::framework; // NOLINT namespace platform = paddle::platform; std::string edges[] = { + // NOLINT std::string("0\t1"), std::string("0\t9"), std::string("1\t2"), @@ -48,7 +49,7 @@ std::string edges[] = { }; char edge_file_name[] = "edges1.txt"; -std::string nodes[] = { +std::string nodes[] = { // NOLINT std::string("user\t37\ta 0.34\tb 13 14\tc hello\td abc"), std::string("user\t96\ta 0.31\tb 15 10\tc 96hello\td abcd"), std::string("user\t59\ta 0.11\tb 11 14"), @@ -118,7 +119,7 @@ TEST(TEST_FLEET, test_cpu_cache) { std::make_shared(device_id_mapping); resource->enable_p2p(); int use_nv = 1; - GpuPsGraphTable g(resource, 1, 2); + GpuPsGraphTable g(resource, 2); g.init_cpu_table(table_proto); g.cpu_graph_table_->Load(node_file_name, "nuser"); g.cpu_graph_table_->Load(node_file_name, "nitem"); @@ -160,9 +161,9 @@ TEST(TEST_FLEET, test_cpu_cache) { for (int i = 0; i < 2; i++) { // platform::CUDADeviceGuard guard(i); LOG(0) << "query on card " << i; - //{1,9} or {9,1} is expected for key 0 - //{0,2} or {2,0} is expected for key 1 - //{1,3} or {3,1} is expected for key 2 + // {1,9} or {9,1} is expected for key 0 + // {0,2} or {2,0} is expected for key 1 + // {1,3} or {3,1} is expected for key 2 int step = 2; int cur = 0; while (true) { @@ -177,7 +178,7 @@ TEST(TEST_FLEET, test_cpu_cache) { query.initialize( i, 0, node_query_res.get_val(), 1, node_query_res.get_len()); query.display(); - auto c = g.graph_neighbor_sample_v3(query, false); + auto c = g.graph_neighbor_sample_v3(query, false, true); c.display(); } } @@ -219,7 +220,7 @@ TEST(TEST_FLEET, test_cpu_cache) { query.initialize(i, 0, 
node_query_res.get_val(), 4, node_query_res.get_len()); query.display(); - auto c = g.graph_neighbor_sample_v3(query, true); + auto c = g.graph_neighbor_sample_v3(query, true, true); c.display(); platform::CUDADeviceGuard guard(i); uint64_t *key; @@ -229,7 +230,7 @@ TEST(TEST_FLEET, test_cpu_cache) { uint64_t t_key = 1; cudaMemcpy(key, &t_key, sizeof(uint64_t), cudaMemcpyHostToDevice); q1.initialize(i, 0, (uint64_t)key, 2, 1); - auto d = g.graph_neighbor_sample_v3(q1, true); + auto d = g.graph_neighbor_sample_v3(q1, true, true); d.display(); cudaFree(key); g.cpu_graph_table_->set_search_level(1); diff --git a/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc b/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc index 2cf1714aaa063..87704f077904b 100644 --- a/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc +++ b/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc @@ -31,15 +31,18 @@ limitations under the License. */ #include #include +#include #include "paddle/fluid/framework/data_set.h" #include "paddle/fluid/framework/fleet/heter_ps/gpu_graph_utils.h" +#include "paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h" #include "paddle/fluid/platform/timer.h" #if defined(PADDLE_WITH_PSCORE) #include "paddle/fluid/distributed/ps/table/depends/feature_value.h" #endif DECLARE_int32(gpugraph_dedup_pull_push_mode); +DECLARE_int32(gpugraph_storage_mode); namespace paddle { namespace framework { @@ -111,6 +114,110 @@ void PSGPUWrapper::InitAfsApi(const std::string& fs_name, use_afs_api_ = 1; } #endif + +void PSGPUWrapper::add_key_to_local(const std::vector& vec_data) { + size_t total_len = vec_data.size(); + size_t len_per_thread = total_len / thread_keys_thread_num_; + size_t begin = 0; + std::vector threads; + + int remain = total_len % thread_keys_thread_num_; + auto gen_graph_data_func = [this](const std::vector& total_data, + int begin_index, + int end_index, + int i) { + for (auto iter = total_data.begin() + begin_index; + iter != total_data.begin() + end_index; + iter++) { + uint64_t cur_key = *iter; + int shard_id = cur_key % thread_keys_shard_num_; + this->thread_keys_[i][shard_id].insert(cur_key); + } + }; + auto gen_graph_dynamic_mf_func = [this]( + const std::vector& total_data, + int begin_index, + int end_index, + int i) { + for (auto iter = total_data.begin() + begin_index; + iter != total_data.begin() + end_index; + iter++) { + uint64_t cur_key = *iter; + int shard_id = cur_key % thread_keys_shard_num_; + // TODO: feasign <-> slot <-> multi_dim + this->thread_dim_keys_[i][shard_id][0].insert(cur_key); + } + }; + for (int i = 0; i < thread_keys_thread_num_; i++) { + if (!multi_mf_dim_) { + threads.push_back( + std::thread(gen_graph_data_func, + std::ref(vec_data), + begin, + begin + len_per_thread + (i < remain ? 1 : 0), + i)); + } else { + threads.push_back( + std::thread(gen_graph_dynamic_mf_func, + std::ref(vec_data), + begin, + begin + len_per_thread + (i < remain ? 1 : 0), + i)); + } + begin += len_per_thread + (i < remain ? 
1 : 0); + } + for (std::thread& t : threads) { + t.join(); + } +} + +void PSGPUWrapper::add_key_to_gputask(std::shared_ptr gpu_task) { + std::vector threads; + platform::Timer timeline; + timeline.Start(); + // merge thread_keys to shard_keys + auto merge_ins_dynamic_mf_func = [this, gpu_task](int shard_num, int dim_id) { + for (int i = 0; i < thread_keys_thread_num_; ++i) { + gpu_task->batch_add_keys( + shard_num, dim_id, thread_dim_keys_[i][shard_num][dim_id]); + thread_dim_keys_[i][shard_num][dim_id].clear(); + } + }; + for (int i = 0; i < thread_keys_shard_num_; ++i) { + for (int j = 0; j < multi_mf_dim_; j++) { + threads.push_back(std::thread(merge_ins_dynamic_mf_func, i, j)); + } + } + for (auto& t : threads) { + t.join(); + } + timeline.Pause(); + + VLOG(0) << "GpuPs task add keys cost " << timeline.ElapsedSec() + << " seconds."; + timeline.Start(); + size_t slot_num = slot_vector_.size() - 1; + // no slot_fea mode and whole_hbm mode, only keep one unique_sort action + if (slot_num > 0 && FLAGS_gpugraph_storage_mode != + paddle::framework::GpuGraphStorageMode::WHOLE_HBM) { + gpu_task->UniqueKeys(); + } + timeline.Pause(); + VLOG(0) << "GpuPs task unique cost " << timeline.ElapsedSec() << " seconds."; +} + +void PSGPUWrapper::resize_gputask(std::shared_ptr gpu_task) { + for (int i = 0; i < thread_keys_shard_num_; i++) { + for (int j = 0; j < multi_mf_dim_; j++) { + if (i == 0 && j == multi_mf_dim_ - 1) { + gpu_task->feature_dim_keys_[i][j].push_back(0); + } + gpu_task->value_dim_ptr_[i][j].resize( + gpu_task->feature_dim_keys_[i][j].size()); + } + } +} + void PSGPUWrapper::PreBuildTask(std::shared_ptr gpu_task) { VLOG(3) << "PSGPUWrapper::BuildGPUPSTask begin"; platform::Timer timeline; @@ -136,7 +243,7 @@ void PSGPUWrapper::PreBuildTask(std::shared_ptr gpu_task) { std::string data_set_name = std::string(typeid(*dataset_).name()); - VLOG(0) << "gpu_graph_mode_:" << gpu_graph_mode_; + VLOG(1) << "gpu_graph_mode_:" << gpu_graph_mode_; if (!gpu_graph_mode_) { if (data_set_name.find("SlotRecordDataset") != std::string::npos) { VLOG(0) << "ps_gpu_wrapper use SlotRecordDataset"; @@ -234,121 +341,329 @@ void PSGPUWrapper::PreBuildTask(std::shared_ptr gpu_task) { << " seconds."; } } else { - VLOG(0) << "PreBuild in GpuGraph mode"; - SlotRecordDataset* dataset = (SlotRecordDataset*)(dataset_); // NOLINT + SlotRecordDataset* dataset = reinterpret_cast(dataset_); const std::vector& vec_data = dataset->GetGpuGraphTotalKeys(); + timeline.Start(); + add_key_to_local(vec_data); + timeline.Pause(); + VLOG(0) << "GpuGraphTotalKeys: " << vec_data.size() + << ", add_key_to_local cost " << timeline.ElapsedSec() + << " seconds."; + } - total_len = vec_data.size(); - len_per_thread = total_len / thread_keys_thread_num_; - VLOG(0) << "GpuGraphTotalKeys: " << total_len; - remain = total_len % thread_keys_thread_num_; - auto gen_graph_data_func = [this](const std::vector& total_data, - int begin_index, - int end_index, - int i) { - for (auto iter = total_data.begin() + begin_index; - iter != total_data.begin() + end_index; - iter++) { - uint64_t cur_key = *iter; - int shard_id = cur_key % thread_keys_shard_num_; - this->thread_keys_[i][shard_id].insert(cur_key); - } - }; - auto gen_graph_dynamic_mf_func = - [this](const std::vector& total_data, - int begin_index, - int end_index, - int i) { - for (auto iter = total_data.begin() + begin_index; - iter != total_data.begin() + end_index; - iter++) { - uint64_t cur_key = *iter; - int shard_id = cur_key % thread_keys_shard_num_; - // TODO(fengdanlei): feasign <-> 
slot <-> multi_dim - this->thread_dim_keys_[i][shard_id][0].insert(cur_key); + add_key_to_gputask(gpu_task); +} + +void PSGPUWrapper::add_slot_feature(std::shared_ptr gpu_task) { + platform::Timer timeline; + platform::Timer time_stage; + timeline.Start(); + // 8卡数据分片 + size_t device_num = heter_devices_.size(); + std::vector threads; + size_t slot_num = slot_vector_.size() - 1; // node slot 9008 in slot_vector + auto& local_dim_keys = gpu_task->feature_dim_keys_; // [shard_num, 0, keys]] + double divide_nodeid_cost = 0; + double get_feature_id_cost = 0; + double add_feature_to_set_cost = 0; + double add_feature_to_key_cost = 0; + + std::vector> node_ids(device_num); + size_t node_num = 0; + for (int i = 0; i < thread_keys_shard_num_; i++) { + for (int j = 0; j < multi_mf_dim_; j++) { + node_num += local_dim_keys[i][j].size(); + } + } + for (auto& node_id_vector : node_ids) { + node_id_vector.reserve(node_num * 1.2 / device_num); + } + + auto& device_dim_mutex = gpu_task->dim_mutex_; + + auto divide_nodeid_to_device = + [this, device_num, &local_dim_keys, &node_ids, &device_dim_mutex](int i, + int j) { + std::vector> task_keys(device_num); + size_t batch = 10000; + for (size_t k = 0; k < device_num; k++) { + task_keys[k].reserve(batch * 1.2 / device_num); + } + std::vector shuffle_device = shuffle_int_vector(device_num); + size_t start = 0; + while (start < local_dim_keys[i][j].size()) { + if (batch + start > local_dim_keys[i][j].size()) { + batch = local_dim_keys[i][j].size() - start; } - }; - for (int i = 0; i < thread_keys_thread_num_; i++) { - if (!multi_mf_dim_) { - VLOG(1) << "psgpu graph wrapper genfunc"; - threads.push_back( - std::thread(gen_graph_data_func, - std::ref(vec_data), - begin, - begin + len_per_thread + (i < remain ? 1 : 0), - i)); - } else { - VLOG(1) << "psgpu graph wrapper genfunc with dynamic mf"; - threads.push_back( - std::thread(gen_graph_dynamic_mf_func, - std::ref(vec_data), - begin, - begin + len_per_thread + (i < remain ? 
1 : 0), - i)); + for (size_t k = start; k < (start + batch); k++) { + int shard = local_dim_keys[i][j][k] % device_num; + task_keys[shard].push_back(local_dim_keys[i][j][k]); + } + // allocate local keys to devices + for (auto dev : shuffle_device) { + device_dim_mutex[dev][0]->lock(); + int len = task_keys[dev].size(); + for (int k = 0; k < len; ++k) { + node_ids[dev].push_back(task_keys[dev][k]); + } + device_dim_mutex[dev][0]->unlock(); + task_keys[dev].clear(); + } + start += batch; + } + }; + threads.resize(thread_keys_shard_num_ * multi_mf_dim_); + time_stage.Start(); + + for (int i = 0; i < thread_keys_shard_num_; i++) { + for (int j = 0; j < multi_mf_dim_; j++) { + threads[i * multi_mf_dim_ + j] = + std::thread(divide_nodeid_to_device, i, j); + } + } + for (std::thread& t : threads) { + t.join(); + } + threads.clear(); + time_stage.Pause(); + divide_nodeid_cost = time_stage.ElapsedSec(); + gpu_task->sub_graph_feas = new std::vector; + std::vector& sub_graph_feas = + *((std::vector*)gpu_task->sub_graph_feas); + std::vector> feature_ids(device_num); + std::vector feature_list(device_num); + std::vector feature_list_size(device_num); + size_t batch = 40000; + + time_stage.Start(); + if (FLAGS_gpugraph_storage_mode == + paddle::framework::GpuGraphStorageMode::MEM_EMB_AND_GPU_GRAPH) { + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + auto h_slot_feature_num_map = gpu_graph_ptr->slot_feature_num_map(); + int fea_num_per_node = 0; + for (size_t i = 0; i < slot_num; ++i) { + fea_num_per_node += h_slot_feature_num_map[i]; + } + + auto get_feature_id = [this, + slot_num, + batch, + fea_num_per_node, + &h_slot_feature_num_map, + &node_ids, + &feature_ids](int i) { + platform::CUDADeviceGuard guard(resource_->dev_id(i)); + int* d_slot_feature_num_map; + uint64_t* d_node_list_ptr; + uint64_t* d_feature_list_ptr; + CUDA_CHECK(cudaMalloc(&d_slot_feature_num_map, slot_num * sizeof(int))); + CUDA_CHECK(cudaMemcpy(d_slot_feature_num_map, + h_slot_feature_num_map.data(), + sizeof(int) * slot_num, + cudaMemcpyHostToDevice)); + CUDA_CHECK(cudaMalloc(&d_node_list_ptr, batch * sizeof(uint64_t))); + CUDA_CHECK(cudaMalloc(&d_feature_list_ptr, + batch * fea_num_per_node * sizeof(uint64_t))); + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + uint64_t pos = 0; + size_t real_batch = 0; + feature_ids[i].resize(node_ids[i].size() * fea_num_per_node); + while (pos < node_ids[i].size()) { + real_batch = (pos + batch) <= node_ids[i].size() + ? batch + : node_ids[i].size() - pos; + CUDA_CHECK(cudaMemcpy(d_node_list_ptr, + node_ids[i].data() + pos, + real_batch * sizeof(uint64_t), + cudaMemcpyHostToDevice)); + int ret = gpu_graph_ptr->get_feature_of_nodes(i, + d_node_list_ptr, + d_feature_list_ptr, + real_batch, + slot_num, + d_slot_feature_num_map, + fea_num_per_node); + PADDLE_ENFORCE_EQ( + ret, + 0, + platform::errors::PreconditionNotMet("get_feature_of_nodes error")); + + CUDA_CHECK(cudaMemcpy(feature_ids[i].data() + pos * fea_num_per_node, + d_feature_list_ptr, + real_batch * fea_num_per_node * sizeof(uint64_t), + cudaMemcpyDeviceToHost)); + pos += real_batch; } - begin += len_per_thread + (i < remain ? 
1 : 0); + cudaFree(d_slot_feature_num_map); + cudaFree(d_node_list_ptr); + cudaFree(d_feature_list_ptr); + }; + + threads.resize(device_num); + for (size_t i = 0; i < device_num; i++) { + threads[i] = std::thread(get_feature_id, i); } for (std::thread& t : threads) { t.join(); } + threads.clear(); + for (size_t i = 0; i < device_num; i++) { + feature_list[i] = feature_ids[i].data(); + feature_list_size[i] = feature_ids[i].size(); + } + } else if (FLAGS_gpugraph_storage_mode == + paddle::framework::GpuGraphStorageMode:: + MEM_EMB_FEATURE_AND_GPU_GRAPH || + FLAGS_gpugraph_storage_mode == + paddle::framework::GpuGraphStorageMode:: + SSD_EMB_AND_MEM_FEATURE_GPU_GRAPH) { + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + sub_graph_feas = gpu_graph_ptr->get_sub_graph_fea(node_ids, slot_num); + for (size_t i = 0; i < device_num; i++) { + feature_list[i] = sub_graph_feas[i].feature_list; + feature_list_size[i] = sub_graph_feas[i].feature_size; + } + } else { + VLOG(0) << "FLAGS_gpugraph_storage_mode is not adaptived"; + } + time_stage.Pause(); + get_feature_id_cost = time_stage.ElapsedSec(); + size_t feature_num = 0; + for (size_t i = 0; i < device_num; i++) { + feature_num += feature_list_size[i]; + } + VLOG(0) << "feature_num is " << feature_num << " node_num num is " + << node_num; + + size_t set_num = thread_keys_shard_num_; + std::vector> feature_id_set(set_num); + std::vector set_mutex(set_num); + + auto add_feature_to_set = + [this, set_num, &feature_list, &feature_id_set, &set_mutex]( + int dev, size_t start, size_t end) { + size_t batch = 10000 * set_num; + std::vector> feature_list_tmp(set_num); + for (size_t i = 0; i < set_num; i++) { + feature_list_tmp[i].reserve((batch * 1.2) / set_num); + } + std::vector shuffle_set_index = shuffle_int_vector(set_num); + size_t pos = start; + size_t real_batch = 0; + while (pos < end) { + real_batch = (pos + batch <= end) ? 
batch : end - pos; + for (size_t i = pos; i < pos + real_batch; i++) { + if (feature_list[dev][i] == 0) { + continue; + } + int shard_num = feature_list[dev][i] % set_num; + feature_list_tmp[shard_num].push_back(feature_list[dev][i]); + } + // uniq in local + for (size_t i = 0; i < set_num; i++) { + std::sort(feature_list_tmp[i].begin(), feature_list_tmp[i].end()); + size_t idx = 0; + size_t total = feature_list_tmp[i].size(); + for (size_t j = 0; j < total; j++) { + auto& k = feature_list_tmp[i][j]; + if (idx > 0 && feature_list_tmp[i][idx - 1] == k) { + continue; + } + feature_list_tmp[i][idx] = k; + ++idx; + } + feature_list_tmp[i].resize(idx); + } + // uniq in global + for (auto set_index : shuffle_set_index) { + set_mutex[set_index].lock(); + for (auto feature_id : feature_list_tmp[set_index]) { + feature_id_set[set_index].insert(feature_id); + } + set_mutex[set_index].unlock(); + feature_list_tmp[set_index].clear(); + } + pos += real_batch; + } + }; + size_t device_thread_num = 8; + threads.resize(device_num * device_thread_num); + time_stage.Start(); + for (size_t i = 0; i < device_num; i++) { + size_t start = 0; + for (size_t j = 0; j < device_thread_num; j++) { + size_t batch = feature_list_size[i] / device_thread_num; + if (j < feature_list_size[i] % device_thread_num) { + batch += 1; + } + threads[i * device_thread_num + j] = + std::thread(add_feature_to_set, i, start, start + batch); + start += batch; + } + } + for (std::thread& t : threads) { + t.join(); } - - timeline.Start(); - threads.clear(); - // merge thread_keys to shard_keys - auto merge_ins_dynamic_mf_func = [this, gpu_task](int shard_num, int dim_id) { - for (int i = 0; i < thread_keys_thread_num_; ++i) { - gpu_task->batch_add_keys( - shard_num, dim_id, thread_dim_keys_[i][shard_num][dim_id]); - thread_dim_keys_[i][shard_num][dim_id].clear(); + time_stage.Pause(); + add_feature_to_set_cost = time_stage.ElapsedSec(); + auto add_feature_to_key = [this, + device_num, + &feature_id_set, + &local_dim_keys, + set_num](int shard_num, int j) { + local_dim_keys[shard_num][j].reserve(local_dim_keys[shard_num][j].size() + + feature_id_set[shard_num].size()); + for (auto it = feature_id_set[shard_num].begin(); + it != feature_id_set[shard_num].end(); + it++) { + local_dim_keys[shard_num][j].push_back(*it); } + feature_id_set[shard_num].clear(); }; - for (int i = 0; i < thread_keys_shard_num_; ++i) { + time_stage.Start(); + threads.resize(thread_keys_shard_num_ * multi_mf_dim_); + for (int i = 0; i < thread_keys_shard_num_; i++) { for (int j = 0; j < multi_mf_dim_; j++) { - threads.push_back(std::thread(merge_ins_dynamic_mf_func, i, j)); + threads[i * multi_mf_dim_ + j] = std::thread(add_feature_to_key, i, j); } } - for (auto& t : threads) { + for (std::thread& t : threads) { t.join(); } + time_stage.Pause(); + add_feature_to_key_cost = time_stage.ElapsedSec(); + threads.clear(); timeline.Pause(); - - VLOG(0) << "GpuPs task add keys cost " << timeline.ElapsedSec() - << " seconds."; - timeline.Start(); - gpu_task->UniqueKeys(); - timeline.Pause(); - - VLOG(0) << "GpuPs task unique cost " << timeline.ElapsedSec() << " seconds."; - for (int i = 0; i < thread_keys_shard_num_; i++) { - for (int j = 0; j < multi_mf_dim_; j++) { - if (i == 0 && j == multi_mf_dim_ - 1) { - gpu_task->feature_dim_keys_[i][j].push_back(0); - } - VLOG(0) << "GpuPs shard: " << i << "mf dim: " << index_dim_vec_[j] - << " key len: " << gpu_task->feature_dim_keys_[i][j].size(); - gpu_task->value_dim_ptr_[i][j].resize( - gpu_task->feature_dim_keys_[i][j].size()); 
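Note: the add_feature_to_set lambda above dedups feature ids in two stages: each worker buckets its slice by shard and does a local sort/unique, then merges into the global per-shard sets while taking the shard mutexes in a shuffled order to spread out lock contention. A minimal standalone sketch of that pattern (function and variable names here are illustrative, not from the patch):

#include <algorithm>
#include <cstdint>
#include <mutex>
#include <random>
#include <set>
#include <vector>

void merge_into_shards(const std::vector<uint64_t>& local_ids,
                       std::vector<std::set<uint64_t>>& shard_sets,
                       std::vector<std::mutex>& shard_mutex) {
  const size_t set_num = shard_sets.size();
  // Stage 1: bucket ids by shard and unique them locally, with no locks held.
  std::vector<std::vector<uint64_t>> tmp(set_num);
  for (uint64_t id : local_ids) {
    if (id == 0) continue;  // 0 is treated as an invalid feature id
    tmp[id % set_num].push_back(id);
  }
  for (auto& bucket : tmp) {
    std::sort(bucket.begin(), bucket.end());
    bucket.erase(std::unique(bucket.begin(), bucket.end()), bucket.end());
  }
  // Stage 2: merge into the shared per-shard sets, visiting shards in a random
  // order so concurrent workers rarely queue up on the same mutex.
  std::vector<size_t> order(set_num);
  for (size_t i = 0; i < set_num; ++i) order[i] = i;
  std::shuffle(order.begin(), order.end(),
               std::mt19937{std::random_device{}()});
  for (size_t s : order) {
    std::lock_guard<std::mutex> guard(shard_mutex[s]);
    shard_sets[s].insert(tmp[s].begin(), tmp[s].end());
  }
}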
- } - } + VLOG(0) << " add_slot_feature costs: " << timeline.ElapsedSec() << " s." + << " divide_nodeid_cost " << divide_nodeid_cost + << " get_feature_id_cost " << get_feature_id_cost + << " add_feature_to_set_cost " << add_feature_to_set_cost + << " add_feature_to_key_cost " << add_feature_to_key_cost; } void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { platform::Timer timeline; - std::vector> task_futures; - int device_num = heter_devices_.size(); - auto& local_keys = gpu_task->feature_keys_; - auto& local_ptr = gpu_task->value_ptr_; + size_t slot_num = slot_vector_.size() - 1; // node slot 9008 in slot_vector + if (slot_num > 0 && FLAGS_gpugraph_storage_mode != + paddle::framework::GpuGraphStorageMode::WHOLE_HBM) { + add_slot_feature(gpu_task); + } + + resize_gputask(gpu_task); + + platform::Timer time_stage; + time_stage.Start(); + gpu_task->UniqueKeys(); + time_stage.Pause(); + VLOG(0) << "BuildPull slot feature uniq and sort cost time: " + << time_stage.ElapsedSec(); auto& local_dim_keys = gpu_task->feature_dim_keys_; auto& local_dim_ptr = gpu_task->value_dim_ptr_; - auto& device_keys = gpu_task->device_keys_; - auto& device_vals = gpu_task->device_values_; auto& device_dim_keys = gpu_task->device_dim_keys_; auto& device_dim_ptr = gpu_task->device_dim_ptr_; - auto& device_dim_mutex = gpu_task->dim_mutex_; for (size_t dev = 0; dev < device_dim_keys.size(); dev++) { device_dim_keys[dev].resize(multi_mf_dim_); @@ -380,7 +695,8 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { timeline.Start(); auto ptl_dynamic_mf_func = - [this, &local_dim_keys, &local_dim_ptr, &fleet_ptr](int i, int j) { + [this, &local_dim_keys, &local_dim_ptr, &fleet_ptr, &gpu_task](int i, + int j) { size_t key_size = local_dim_keys[i][j].size(); int32_t status = -1; int32_t cnt = 0; @@ -421,10 +737,12 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { #ifdef PADDLE_WITH_PSCORE while (true) { auto tt = fleet_ptr->worker_ptr_->PullSparsePtr( + i, reinterpret_cast(local_dim_ptr[i][j].data()), this->table_id_, local_dim_keys[i][j].data(), - key_size); + key_size, + gpu_task->pass_id_); bool flag = true; tt.wait(); @@ -456,16 +774,20 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { sleep(300); exit(-1); } else { - VLOG(0) << "FleetWrapper Pull sparse to local done with table size: " + VLOG(1) << "FleetWrapper Pull sparse to local done with table size: " << local_dim_keys[i][j].size(); } }; threads.resize(thread_keys_shard_num_ * multi_mf_dim_); + + uint64_t total_key = 0; + std::vector> task_futures; for (int i = 0; i < thread_keys_shard_num_; i++) { for (int j = 0; j < multi_mf_dim_; j++) { task_futures.emplace_back( pull_thread_pool_[i]->enqueue(ptl_dynamic_mf_func, i, j)); + total_key += local_dim_keys[i][j].size(); } } for (auto& f : task_futures) { @@ -473,8 +795,8 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { } task_futures.clear(); timeline.Pause(); - VLOG(0) << "pull sparse from CpuPS into GpuPS cost " << timeline.ElapsedSec() - << " seconds."; + VLOG(0) << "pull sparse from CpuPS into GpuPS total keys " << total_key + << ", cost " << timeline.ElapsedSec() << " seconds."; if (multi_node_) { auto gloo_wrapper = paddle::framework::GlooWrapper::GetInstance(); if (!gloo_wrapper->IsInitialized()) { @@ -483,13 +805,29 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { } gloo_wrapper->Barrier(); } +} - timeline.Start(); - std::vector>> pass_values; +void PSGPUWrapper::divide_to_device(std::shared_ptr gpu_task) { + platform::Timer timeline; + int device_num = 
heter_devices_.size(); + std::vector threads; + std::vector> task_futures; + auto& local_dim_keys = gpu_task->feature_dim_keys_; + auto& local_dim_ptr = gpu_task->value_dim_ptr_; - bool record_status = false; - auto& device_task_keys = gpu_task->device_task_keys_; - auto& device_task_ptrs = gpu_task->device_task_ptr_; + auto& device_dim_keys = gpu_task->device_dim_keys_; + auto& device_dim_ptr = gpu_task->device_dim_ptr_; + auto& device_dim_mutex = gpu_task->dim_mutex_; + // auto& device_mutex = gpu_task->mutex_; + + if (multi_mf_dim_) { + for (size_t dev = 0; dev < device_dim_keys.size(); dev++) { + device_dim_keys[dev].resize(multi_mf_dim_); + device_dim_ptr[dev].resize(multi_mf_dim_); + } + } + + timeline.Start(); auto build_pull_dynamic_mf_func = [this, device_num, &local_dim_keys, @@ -513,7 +851,8 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { task_ptrs[shard].push_back(local_dim_ptr[i][j][k]); } // allocate local keys to devices - for (int dev = 0; dev < device_num; dev++) { + std::vector shuffle_device = shuffle_int_vector(device_num); + for (auto dev : shuffle_device) { device_dim_mutex[dev][j]->lock(); int len = task_keys[dev].size(); int cur = device_dim_keys[dev][j].size(); @@ -526,6 +865,43 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { device_dim_mutex[dev][j]->unlock(); } }; + + if (multi_mf_dim_) { + threads.resize(thread_keys_shard_num_ * multi_mf_dim_); + for (int i = 0; i < thread_keys_shard_num_; i++) { + for (int j = 0; j < multi_mf_dim_; j++) { + threads[i * multi_mf_dim_ + j] = + std::thread(build_pull_dynamic_mf_func, i, j); + } + } + for (std::thread& t : threads) { + t.join(); + } + } + timeline.Pause(); + VLOG(0) << "GpuPs prepare for build hbm cost " << timeline.ElapsedSec() + << " seconds."; +} + +void PSGPUWrapper::PrepareGPUTask(std::shared_ptr gpu_task) { + platform::Timer timeline; + int device_num = heter_devices_.size(); + std::vector threads; + std::vector> task_futures; + auto& local_keys = gpu_task->feature_keys_; + auto& local_ptr = gpu_task->value_ptr_; + + auto& device_keys = gpu_task->device_keys_; + auto& device_vals = gpu_task->device_values_; + // auto& device_mutex = gpu_task->mutex_; + + timeline.Start(); + std::vector>> pass_values; + + bool record_status = false; + auto& device_task_keys = gpu_task->device_task_keys_; + auto& device_task_ptrs = gpu_task->device_task_ptr_; + auto build_func = [device_num, record_status, &pass_values, @@ -612,18 +988,9 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { &device_vals, &device_task_keys, &device_task_ptrs](int dev, int shard_id) { - // auto& task_keys = device_task_keys[shard_id]; #ifdef PADDLE_WITH_PSLIB auto& task_ptrs = device_task_ptrs[shard_id]; -#endif - - // #ifdef PADDLE_WITH_PSCORE - // auto& task_ptrs = device_task_ptrs[shard_id]; - // #endif - // int len = prefix_sum[dev][shard_id + 1] - prefix_sum[dev][shard_id]; - // int cur = prefix_sum[dev][shard_id]; -#ifdef PADDLE_WITH_PSLIB for (int j = 0; j < len; ++j) { device_keys[dev][cur + j] = task_keys[dev][j]; float* ptr_val = task_ptrs[dev][j]->data(); @@ -653,18 +1020,7 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { #endif VLOG(3) << "GpuPs build hbmps done"; }; - - if (multi_mf_dim_) { - for (int i = 0; i < thread_keys_shard_num_; i++) { - for (int j = 0; j < multi_mf_dim_; j++) { - threads[i * multi_mf_dim_ + j] = - std::thread(build_pull_dynamic_mf_func, i, j); - } - } - for (std::thread& t : threads) { - t.join(); - } - } else { + if (!multi_mf_dim_) { for (int i = 0; i < 
thread_keys_shard_num_; i++) { for (int j = 0; j < device_num; j++) { task_futures.emplace_back( @@ -683,8 +1039,8 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { void PSGPUWrapper::BuildGPUTask(std::shared_ptr gpu_task) { int device_num = heter_devices_.size(); - platform::Timer timeline; - timeline.Start(); + platform::Timer stagetime; + stagetime.Start(); std::vector feature_keys_count(device_num); size_t size_max = 0; @@ -696,15 +1052,10 @@ void PSGPUWrapper::BuildGPUTask(std::shared_ptr gpu_task) { << " dim index: " << j << " contains feasign nums: " << gpu_task->device_dim_ptr_[i][j].size(); } - VLOG(1) << i << " card with dynamic mf contains feasign nums total: " + VLOG(0) << i << " card with dynamic mf contains feasign nums total: " << feature_keys_count[i]; size_max = std::max(size_max, feature_keys_count[i]); } - - if (HeterPs_) { - delete HeterPs_; - HeterPs_ = nullptr; - } if (size_max <= 0) { VLOG(0) << "Skip build gpu ps cause feasign nums = " << size_max; return; @@ -712,94 +1063,37 @@ void PSGPUWrapper::BuildGPUTask(std::shared_ptr gpu_task) { std::vector threads(device_num); auto accessor_wrapper_ptr = GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); - HeterPs_ = HeterPsBase::get_instance( - size_max, resource_, fleet_config_, accessor_class_, optimizer_type_); + if (HeterPs_ == NULL) { + HeterPs_ = HeterPsBase::get_instance( + size_max, resource_, fleet_config_, accessor_class_, optimizer_type_); #ifdef PADDLE_WITH_CUDA - HeterPs_->set_nccl_comm_and_size(inner_comms_, inter_comms_, node_size_); - HeterPs_->set_sparse_sgd(optimizer_config_); - HeterPs_->set_embedx_sgd(optimizer_config_); + HeterPs_->set_nccl_comm_and_size( + inner_comms_, inter_comms_, node_size_, rank_id_); + HeterPs_->set_sparse_sgd(optimizer_config_); + HeterPs_->set_embedx_sgd(optimizer_config_); #endif + } + stagetime.Pause(); + VLOG(0) << "card: " + << " BuildGPUTask create HeterPs_ costs: " << stagetime.ElapsedSec() + << " s."; + stagetime.Start(); - auto build_dymf_mem_pool = [this, &gpu_task, &accessor_wrapper_ptr](int i, - int j) { - this->HeterPs_->set_multi_mf_dim(multi_mf_dim_, max_mf_dim_); - int mf_dim = this->index_dim_vec_[j]; - VLOG(0) << "building table: " << i << "with mf dim: " << mf_dim - << " feature_value_size:" - << accessor_wrapper_ptr->GetFeatureValueSize(mf_dim); - size_t feature_value_size = - accessor_wrapper_ptr->GetFeatureValueSize(mf_dim); - auto& device_dim_keys = gpu_task->device_dim_keys_[i][j]; + auto build_dynamic_mf_func = [this, &gpu_task, &accessor_wrapper_ptr]( + int i, int j, size_t start, size_t end) { + // this->HeterPs_->set_multi_mf_dim(multi_mf_dim_, max_mf_dim_); auto& device_dim_ptrs = gpu_task->device_dim_ptr_[i][j]; - size_t len = device_dim_keys.size(); - CHECK(len == device_dim_ptrs.size()); - this->mem_pools_[i * this->multi_mf_dim_ + j] = - new MemoryPool(len, feature_value_size); - }; - auto build_dymf_hbm_pool = [this, &gpu_task, &accessor_wrapper_ptr](int i, - int j) { - auto& device_dim_keys = gpu_task->device_dim_keys_[i][j]; - size_t len = device_dim_keys.size(); int mf_dim = this->index_dim_vec_[j]; size_t feature_value_size = accessor_wrapper_ptr->GetFeatureValueSize(mf_dim); - - auto& mem_pool = this->mem_pools_[i * this->multi_mf_dim_ + j]; - platform::CUDADeviceGuard guard(resource_->dev_id(i)); - this->hbm_pools_[i * this->multi_mf_dim_ + j] = new HBMMemoryPool(mem_pool); - auto& cur_pool = this->hbm_pools_[i * this->multi_mf_dim_ + j]; - - this->HeterPs_->build_ps(i, - device_dim_keys.data(), - cur_pool->mem(), - 
len, - feature_value_size, - 500000, - 2); - if (device_dim_keys.size() > 0) { - VLOG(3) << "show table: " << i - << " table kv size: " << device_dim_keys.size() - << "dim: " << mf_dim << " len: " << len; - HeterPs_->show_one_table(i); - } - delete mem_pool; - }; - int thread_num = 16; - auto build_dynamic_mf_func = [this, - &gpu_task, - thread_num, - &accessor_wrapper_ptr](int i, int j, int z) { - // this->HeterPs_->set_multi_mf_dim(multi_mf_dim_, max_mf_dim_); - int mf_dim = this->index_dim_vec_[j]; - VLOG(0) << "building table: " << i << "with mf dim: " << mf_dim; - auto& device_dim_keys = gpu_task->device_dim_keys_[i][j]; - auto& device_dim_ptrs = gpu_task->device_dim_ptr_[i][j]; - size_t len = device_dim_keys.size(); - CHECK(len == device_dim_ptrs.size()); - // this->mem_pools_[i * this->multi_mf_dim_ + j] = - // new MemoryPool(len, feature_value_size); - auto& mem_pool = this->mem_pools_[i * this->multi_mf_dim_ + j]; - - // ============ add for multi-thread ================ - size_t len_per_thread = len / thread_num; - size_t remain = len % thread_num; - size_t left = 0, right = 0; - - size_t real_len = len_per_thread; - if ((size_t)z < remain) real_len++; // NOLINT - - if ((size_t)z < remain) { // NOLINT - left = z * (len_per_thread + 1); - right = left + real_len; - } else { - left = remain * (len_per_thread + 1) + (z - remain) * len_per_thread; - right = left + real_len; - } - // ============ add for multi-thread ================ - - for (size_t k = left; k < right; k++) { + size_t real_len = end - start; + std::shared_ptr build_values(new char[feature_value_size * real_len], + [](char* p) { delete[] p; }); + char* test_build_values = build_values.get(); + for (size_t k = start; k < end; k++) { #ifdef PADDLE_WITH_PSLIB - float* val = (float*)(mem_pool->mem_address(k)); // NOLINT + float* val = reinterpret_cast(test_build_values + + (k - start) * feature_value_size); float* ptr_val = device_dim_ptrs[k]->data(); size_t dim = device_dim_ptrs[k]->size(); val->delta_score = @@ -831,55 +1125,165 @@ void PSGPUWrapper::BuildGPUTask(std::shared_ptr gpu_task) { val->mf[x] = 0; } } + VLOG(5) << "build " << k << " : " + << feature_value_accessor_.ParseToString( + val, + feature_value_accessor_.common_feature_value.Dim(mf_dim)); #endif #ifdef PADDLE_WITH_PSCORE - void* val = mem_pool->mem_address(k); + void* val = reinterpret_cast(test_build_values + + (k - start) * feature_value_size); accessor_wrapper_ptr->BuildFill( val, device_dim_ptrs[k], cpu_table_accessor_, mf_dim); #endif } + task_info task; + task.build_values = build_values; + task.offset = start; + task.device_id = i; + task.multi_mf_dim = j; + task.start = 0; + task.end = real_len; + cpu_reday_channels_[i]->Put(task); }; - threads.resize(device_num * multi_mf_dim_); - for (int i = 0; i < device_num; i++) { + auto build_dymf_hbm_pool = [this, + &gpu_task, + &accessor_wrapper_ptr, + &feature_keys_count](int i) { + platform::CUDADeviceGuard guard(resource_->dev_id(i)); + // reset table + this->HeterPs_->reset_table(i, + feature_keys_count[i], + optimizer_config_, + optimizer_config_, + infer_mode_); + // insert hbm table + std::vector threads(multi_mf_dim_); for (int j = 0; j < multi_mf_dim_; j++) { - threads[i + j * device_num] = std::thread(build_dymf_mem_pool, i, j); + auto& device_dim_keys = gpu_task->device_dim_keys_[i][j]; + size_t len = device_dim_keys.size(); + int mf_dim = this->index_dim_vec_[j]; + size_t feature_value_size = + accessor_wrapper_ptr->GetFeatureValueSize(mf_dim); + this->hbm_pools_[i * this->multi_mf_dim_ + 
j]->reset(len, + feature_value_size); + + auto build_ps_thread = + [this, &gpu_task]( + int i, int j, size_t len, size_t feature_value_size) { + auto& device_dim_keys = gpu_task->device_dim_keys_[i][j]; + this->HeterPs_->build_ps( + i, + device_dim_keys.data(), + this->hbm_pools_[i * this->multi_mf_dim_ + j]->mem(), + len, + feature_value_size, + 500000, + 2); + if (device_dim_keys.size() > 0) { + VLOG(3) << "show table: " << i + << " table kv size: " << device_dim_keys.size() + << "dim: " << this->index_dim_vec_[j] << " len: " << len; + HeterPs_->show_one_table(i); + } + }; + threads[j] = std::thread(build_ps_thread, i, j, len, feature_value_size); + } + // build feature table + size_t slot_num = slot_vector_.size() - 1; // node slot 9008 in slot_vector + if (slot_num > 0 && + (FLAGS_gpugraph_storage_mode == paddle::framework::GpuGraphStorageMode:: + MEM_EMB_FEATURE_AND_GPU_GRAPH || + FLAGS_gpugraph_storage_mode == + paddle::framework::GpuGraphStorageMode:: + SSD_EMB_AND_MEM_FEATURE_GPU_GRAPH)) { + auto build_feature_table = [this, &gpu_task](int i) { + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + std::vector* tmp = + (std::vector*)gpu_task->sub_graph_feas; + gpu_graph_ptr->build_gpu_graph_fea((*tmp)[i], i); + }; + threads.push_back(std::thread(build_feature_table, i)); } - } - for (std::thread& t : threads) { - t.join(); - } - threads.clear(); + struct task_info task; + while (cpu_reday_channels_[i]->Get(task)) { + auto hbm = this->hbm_pools_[task.device_id * this->multi_mf_dim_ + + task.multi_mf_dim] + ->mem(); + int mf_dim = this->index_dim_vec_[task.multi_mf_dim]; + size_t feature_value_size = + accessor_wrapper_ptr->GetFeatureValueSize(mf_dim); + auto hbm_start = hbm + task.offset * feature_value_size; + CUDA_CHECK( + cudaMemcpy(hbm_start, + task.build_values.get() + task.start * feature_value_size, + (task.end - task.start) * feature_value_size, + cudaMemcpyHostToDevice)); + } + platform::Timer stagetime; + stagetime.Start(); + for (std::thread& t : threads) { + t.join(); + } + stagetime.Pause(); + VLOG(0) << "card: " << i + << " BuildGPUTask build_ps async costs: " << stagetime.ElapsedSec() + << " s."; + }; - // multi-thread process - threads.resize(device_num * multi_mf_dim_ * thread_num); + std::vector> cpu_task_futures; + std::vector> gpu_task_futures; + + int once_gpu_copy = 64 * 1024; + threads.resize(device_num * multi_mf_dim_); for (int i = 0; i < device_num; i++) { + cpu_reday_channels_[i]->Open(); + gpu_task_futures.emplace_back( + hbm_thread_pool_[i]->enqueue(build_dymf_hbm_pool, i)); for (int j = 0; j < multi_mf_dim_; j++) { - for (int k = 0; k < thread_num; k++) { - threads[(i + j * device_num) * thread_num + k] = - std::thread(build_dynamic_mf_func, i, j, k); + auto& device_dim_keys = gpu_task->device_dim_keys_[i][j]; + size_t len = device_dim_keys.size(); + size_t start = 0; + size_t end = 0; + while (end < len) { + start = end; + end = end + once_gpu_copy < len ? 
(end + once_gpu_copy) : len; + cpu_task_futures.emplace_back(cpu_work_pool_[i]->enqueue( + build_dynamic_mf_func, i, j, start, end)); } } } - for (std::thread& t : threads) { - t.join(); + + stagetime.Start(); + for (auto& f : cpu_task_futures) { + f.wait(); } - threads.clear(); - threads.resize(device_num * multi_mf_dim_); + cpu_task_futures.clear(); + stagetime.Pause(); + VLOG(0) << " BuildGPUTask build_dynamic_mf_func " + << " cost " << stagetime.ElapsedSec() << " s."; for (int i = 0; i < device_num; i++) { - for (int j = 0; j < multi_mf_dim_; j++) { - threads[i + j * device_num] = std::thread(build_dymf_hbm_pool, i, j); - } + cpu_reday_channels_[i]->Close(); } - for (std::thread& t : threads) { - t.join(); + stagetime.Start(); + for (auto& f : gpu_task_futures) { + f.wait(); } - threads.clear(); - - timeline.Pause(); - VLOG(0) << "GpuPs build table total costs: " << timeline.ElapsedSec() - << " s."; + gpu_task_futures.clear(); + if (FLAGS_gpugraph_storage_mode == paddle::framework::GpuGraphStorageMode:: + MEM_EMB_FEATURE_AND_GPU_GRAPH || + FLAGS_gpugraph_storage_mode == paddle::framework::GpuGraphStorageMode:: + SSD_EMB_AND_MEM_FEATURE_GPU_GRAPH) { + std::vector* tmp = + (std::vector*)gpu_task->sub_graph_feas; + delete tmp; + gpu_task->sub_graph_feas = NULL; + } + stagetime.Pause(); + VLOG(0) << " build_dymf_hbm_pool " + << " cost " << stagetime.ElapsedSec() << " s."; } void PSGPUWrapper::LoadIntoMemory(bool is_shuffle) { @@ -889,17 +1293,25 @@ void PSGPUWrapper::LoadIntoMemory(bool is_shuffle) { dataset_->LoadIntoMemory(); timer.Pause(); VLOG(0) << "LoadIntoMemory cost: " << timer.ElapsedSec() << "s"; - + gpu_graph_mode_ = dataset_->GetGpuGraphMode(); + if (dataset_->GetMemoryDataSize() == 0) { + VLOG(0) << "GetMemoryDataSize == 0"; + return; + } // local shuffle if (is_shuffle) { dataset_->LocalShuffle(); } - InitSlotInfo(); - gpu_graph_mode_ = dataset_->GetGpuGraphMode(); - std::shared_ptr gpu_task = gpu_task_pool_.Get(); - gpu_task->Reset(); - data_ready_channel_->Put(gpu_task); + InitSlotInfo(); + if (FLAGS_gpugraph_storage_mode != GpuGraphStorageMode::WHOLE_HBM) { + std::shared_ptr gpu_task = gpu_task_pool_.Get(); + gpu_task->Reset(); + gpu_task->pass_id_ = (uint16_t)(dataset_->GetPassID()); + data_ready_channel_->Put(gpu_task); + } else if (hbm_sparse_table_initialized_ == false) { + SparseTableToHbm(); + } VLOG(3) << "End LoadIntoMemory(), dataset[" << dataset_ << "]"; } @@ -908,6 +1320,7 @@ void PSGPUWrapper::start_build_thread() { running_ = true; VLOG(3) << "start build CPU ps thread."; pre_build_threads_ = std::thread([this] { pre_build_thread(); }); + buildpull_threads_ = std::thread([this] { build_pull_thread(); }); } void PSGPUWrapper::pre_build_thread() { @@ -930,6 +1343,27 @@ void PSGPUWrapper::pre_build_thread() { VLOG(3) << "build cpu thread end"; } +void PSGPUWrapper::build_pull_thread() { + while (running_) { + std::shared_ptr gpu_task = nullptr; + if (!buildcpu_ready_channel_->Get(gpu_task)) { + continue; + } + VLOG(3) << "thread build pull start."; + platform::Timer timer; + timer.Start(); + // build cpu ps data process + BuildPull(gpu_task); + if (multi_mf_dim_) { + divide_to_device(gpu_task); + } + timer.Pause(); + VLOG(1) << "thread BuildPull end, cost time: " << timer.ElapsedSec() << "s"; + buildpull_ready_channel_->Put(gpu_task); + } + VLOG(3) << "build cpu thread end"; +} + void PSGPUWrapper::build_task() { // build_task: build_pull + build_gputask std::shared_ptr gpu_task = nullptr; @@ -938,24 +1372,29 @@ void PSGPUWrapper::build_task() { return; } // ins 
and pre_build end - if (!buildcpu_ready_channel_->Get(gpu_task)) { + if (!buildpull_ready_channel_->Get(gpu_task)) { return; } - VLOG(0) << "BuildPull start."; + VLOG(0) << "PrepareGPUTask start."; platform::Timer timer; timer.Start(); - BuildPull(gpu_task); + if (!multi_mf_dim_) { + PrepareGPUTask(gpu_task); + } BuildGPUTask(gpu_task); timer.Pause(); - VLOG(0) << "BuildPull + BuildGPUTask end, cost time: " << timer.ElapsedSec() - << "s"; + VLOG(0) << "PrepareGPUTask + BuildGPUTask end, cost time: " + << timer.ElapsedSec() << "s"; current_task_ = gpu_task; } void PSGPUWrapper::BeginPass() { platform::Timer timer; + if (FLAGS_gpugraph_storage_mode == GpuGraphStorageMode::WHOLE_HBM) { + return; + } timer.Start(); if (current_task_) { PADDLE_THROW( @@ -981,12 +1420,65 @@ void PSGPUWrapper::BeginPass() { } void PSGPUWrapper::EndPass() { + if (FLAGS_gpugraph_storage_mode == GpuGraphStorageMode::WHOLE_HBM) { + return; + } + platform::Timer stagetime; + stagetime.Start(); + HbmToSparseTable(); + stagetime.Pause(); + VLOG(0) << "EndPass HbmToSparseTable cost time: " << stagetime.ElapsedSec() + << "s"; + + gpu_task_pool_.Push(current_task_); + current_task_ = nullptr; + gpu_free_channel_->Put(current_task_); + // fleet_ptr->pslib_ptr_->_worker_ptr->release_table_mutex(this->table_id_); +} + +void PSGPUWrapper::SparseTableToHbm() { + std::shared_ptr gpu_task = gpu_task_pool_.Get(); + gpu_task->Reset(); + size_t device_num = heter_devices_.size(); + gpu_task->init(thread_keys_shard_num_, device_num, multi_mf_dim_); + gpu_task->pass_id_ = (uint16_t)(dataset_->GetPassID()); + auto gpu_graph_ptr = GraphGpuWrapper::GetInstance(); + auto node_to_id = gpu_graph_ptr->feature_to_id; + auto edge_to_id = gpu_graph_ptr->edge_to_id; + std::vector vec_data = gpu_graph_ptr->get_graph_total_keys(); + + thread_dim_keys_.resize(thread_keys_thread_num_); + for (int i = 0; i < thread_keys_thread_num_; i++) { + thread_dim_keys_[i].resize(thread_keys_shard_num_); + for (int j = 0; j < thread_keys_shard_num_; j++) { + thread_dim_keys_[i][j].resize(multi_mf_dim_); + } + } + + add_key_to_local(vec_data); + add_key_to_gputask(gpu_task); + BuildPull(gpu_task); + if (!multi_mf_dim_) { + PrepareGPUTask(gpu_task); + } else { + divide_to_device(gpu_task); + } + BuildGPUTask(gpu_task); + current_task_ = gpu_task; + hbm_sparse_table_initialized_ = true; +} + +void PSGPUWrapper::HbmToSparseTable() { + // hbm no update not need dump + if (grad_push_count_ == 0) { + return; + } + grad_push_count_ = 0; + if (!current_task_) { PADDLE_THROW( platform::errors::Fatal("[EndPass] current task has been ended.")); } - platform::Timer timer; - timer.Start(); size_t keysize_max = 0; // in case of feasign_num = 0, skip dump_to_cpu @@ -996,87 +1488,124 @@ void PSGPUWrapper::EndPass() { std::max(keysize_max, current_task_->device_dim_keys_[i][j].size()); } } - int thread_num = 8; auto accessor_wrapper_ptr = GlobalAccessorFactory::GetInstance().GetAccessorWrapper(); - auto dump_pool_to_cpu_func = [this, thread_num, &accessor_wrapper_ptr]( - int i, int j, int z) { + + int once_cpu_num = 16 * 1024; + int once_gpu_copy = 8 * once_cpu_num; + + auto dump_pool_to_cpu_func = [this, &accessor_wrapper_ptr, once_cpu_num]( + int i, int j, size_t start, size_t end) { PADDLE_ENFORCE_GPU_SUCCESS(cudaSetDevice(this->resource_->dev_id(i))); auto& hbm_pool = this->hbm_pools_[i * this->multi_mf_dim_ + j]; - auto& device_keys = this->current_task_->device_dim_keys_[i][j]; - size_t len = device_keys.size(); - // ====== multi-thread process feasign================ - int 
len_per_thread = len / thread_num; - int remain = len % thread_num; - int left = -1, right = -1; - int real_len = len_per_thread; - if (z < remain) real_len++; - if (z < remain) { - left = z * (len_per_thread + 1); - right = left + real_len; - } else { - left = remain * (len_per_thread + 1) + (z - remain) * len_per_thread; - right = left + real_len; - } + size_t real_len = end - start; // ============ multi-thread process feasign============ int mf_dim = this->index_dim_vec_[j]; size_t feature_value_size = accessor_wrapper_ptr->GetFeatureValueSize(mf_dim); - VLOG(0) << "dump pool to cpu table: " << i << "with mf dim: " << mf_dim - << " key_len :" << len - << " feature_value_size:" << feature_value_size; - char* test_build_values = - (char*)malloc(feature_value_size * real_len); // NOLINT - uint64_t offset = left * feature_value_size; + + std::shared_ptr build_values(new char[feature_value_size * real_len], + [](char* p) { delete[] p; }); + uint64_t offset = start * feature_value_size; + char* test_build_values = build_values.get(); + cudaMemcpy(test_build_values, hbm_pool->mem() + offset, feature_value_size * real_len, cudaMemcpyDeviceToHost); - CHECK(len == hbm_pool->capacity()); - uint64_t unuse_key = std::numeric_limits::max(); - for (int i = left; i < right; ++i) { - if (device_keys[i] == unuse_key) { - continue; - } - size_t local_offset = (i - left) * feature_value_size; - float* gpu_val = (float*)(test_build_values + local_offset); // NOLINT + for (size_t k = 0; k * once_cpu_num < real_len; k++) { + struct task_info task; + task.build_values = build_values; + task.offset = start; + task.device_id = i; + task.multi_mf_dim = j; + task.start = k * once_cpu_num; + task.end = (k + 1) * once_cpu_num < real_len ? ((k + 1) * once_cpu_num) + : (real_len); + cpu_reday_channels_[i]->Put(task); + } + }; + auto cpu_func = [this, &accessor_wrapper_ptr](int j) { + struct task_info task; + while (cpu_reday_channels_[j]->Get(task)) { + auto& device_keys = + this->current_task_ + ->device_dim_keys_[task.device_id][task.multi_mf_dim]; + char* test_build_values = task.build_values.get(); + int mf_dim = this->index_dim_vec_[task.multi_mf_dim]; + size_t feature_value_size = + accessor_wrapper_ptr->GetFeatureValueSize(mf_dim); + uint64_t unuse_key = std::numeric_limits::max(); + for (int i = task.start; i < task.end; ++i) { + if (device_keys[i + task.offset] == unuse_key) { + continue; + } + size_t local_offset = i * feature_value_size; + float* gpu_val = + reinterpret_cast(test_build_values + local_offset); #ifdef PADDLE_WITH_PSLIB - // TODO(fengdanlei): PSLIB DumpFill + // TODO: PSLIB DumpFill #endif #ifdef PADDLE_WITH_PSCORE - accessor_wrapper_ptr->DumpFill(gpu_val, cpu_table_accessor_, mf_dim); + accessor_wrapper_ptr->DumpFill(gpu_val, cpu_table_accessor_, mf_dim); #endif + } } - free(test_build_values); }; + platform::Timer timer; + timer.Start(); + std::vector> cpu_task_futures; + std::vector> gpu_task_futures; + size_t thread_num = 16; + size_t device_num = heter_devices_.size(); if (multi_mf_dim_) { VLOG(0) << "psgpu wrapper dump pool: multi_mf_dim_: " << multi_mf_dim_; - size_t device_num = heter_devices_.size(); - std::vector threads(device_num * multi_mf_dim_ * thread_num); for (size_t i = 0; i < device_num; i++) { + cpu_reday_channels_[i]->Open(); for (int j = 0; j < multi_mf_dim_; j++) { - for (int k = 0; k < thread_num; k++) { - threads[(i + j * device_num) * thread_num + k] = - std::thread(dump_pool_to_cpu_func, i, j, k); + auto& device_keys = this->current_task_->device_dim_keys_[i][j]; + 
size_t len = device_keys.size(); + size_t start = 0; + size_t end = 0; + while (end < len) { + start = end; + end = end + once_gpu_copy < len ? (end + once_gpu_copy) : len; + gpu_task_futures.emplace_back(hbm_thread_pool_[i]->enqueue( + dump_pool_to_cpu_func, i, j, start, end)); } } + for (size_t j = 0; j < thread_num; j++) { + cpu_task_futures.emplace_back(cpu_work_pool_[i]->enqueue(cpu_func, i)); + } } - for (std::thread& t : threads) { - t.join(); - } } + for (auto& f : gpu_task_futures) { + f.wait(); + } + timer.Pause(); + VLOG(0) << " EndPass dump_pool_to_cpu_func " + << " cost " << timer.ElapsedSec() << " s."; + for (size_t i = 0; i < device_num; i++) { + cpu_reday_channels_[i]->Close(); + } + gpu_task_futures.clear(); + timer.Start(); + for (auto& f : cpu_task_futures) { + f.wait(); + } + cpu_task_futures.clear(); + timer.Pause(); + VLOG(0) << " EndPass cpu_func " + << " cost " << timer.ElapsedSec() << " s."; if (keysize_max != 0) { HeterPs_->end_pass(); } +} - for (size_t i = 0; i < hbm_pools_.size(); i++) { - delete hbm_pools_[i]; +void PSGPUWrapper::DumpToMem() { + if (FLAGS_gpugraph_storage_mode == GpuGraphStorageMode::WHOLE_HBM) { + this->HbmToSparseTable(); } - gpu_task_pool_.Push(current_task_); - current_task_ = nullptr; - gpu_free_channel_->Put(current_task_); - timer.Pause(); - VLOG(1) << "EndPass end, cost time: " << timer.ElapsedSec() << "s"; } void PSGPUWrapper::PullSparse(const paddle::platform::Place& place, @@ -1380,6 +1909,7 @@ void PSGPUWrapper::PushSparseGrad(const paddle::platform::Place& place, const std::vector& slot_lengths, const int hidden_size, const int batch_size) { + ++grad_push_count_; platform::Timer all_timer; platform::Timer push_gpups_timer; all_timer.Start(); diff --git a/paddle/fluid/framework/fleet/ps_gpu_wrapper.h b/paddle/fluid/framework/fleet/ps_gpu_wrapper.h index 96dc86cea3c58..4a452f4d36cea 100644 --- a/paddle/fluid/framework/fleet/ps_gpu_wrapper.h +++ b/paddle/fluid/framework/fleet/ps_gpu_wrapper.h @@ -34,6 +34,7 @@ limitations under the License. */ #include "paddle/fluid/distributed/ps/thirdparty/round_robin.h" #include "paddle/fluid/framework/channel.h" #include "paddle/fluid/framework/fleet/heter_context.h" +#include "paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.h" #include "paddle/fluid/framework/fleet/heter_ps/heter_ps_base.h" #include "paddle/fluid/framework/fleet/heter_ps/heter_resource.h" #include "paddle/fluid/framework/heter_util.h" @@ -63,6 +64,7 @@ limitations under the License. 
*/ #include "downpour_accessor.h" // NOLINT #endif #include "paddle/fluid/framework/fleet/heter_ps/log_patch.h" +DECLARE_int32(gpugraph_storage_mode); namespace paddle { namespace framework { @@ -96,6 +98,15 @@ class AfsWrapper { }; #endif +struct task_info { + std::shared_ptr build_values; + size_t offset; + int device_id; + int multi_mf_dim; + int start; + int end; +}; + class PSGPUWrapper { class DCacheBuffer { public: @@ -188,30 +199,61 @@ class PSGPUWrapper { int total_len, int* key2slot); + void divide_to_device(std::shared_ptr gpu_task); + void add_slot_feature(std::shared_ptr gpu_task); void BuildGPUTask(std::shared_ptr gpu_task); void PreBuildTask(std::shared_ptr gpu_task); void BuildPull(std::shared_ptr gpu_task); + void PrepareGPUTask(std::shared_ptr gpu_task); void LoadIntoMemory(bool is_shuffle); void BeginPass(); void EndPass(); + void add_key_to_local(const std::vector& keys); + void add_key_to_gputask(std::shared_ptr gpu_task); + void resize_gputask(std::shared_ptr gpu_task); + void SparseTableToHbm(); + void HbmToSparseTable(); void start_build_thread(); void pre_build_thread(); + void build_pull_thread(); void build_task(); + void DumpToMem(); + // set mode + void SetMode(bool infer_mode) { + infer_mode_ = infer_mode; + if (HeterPs_ != NULL) { + HeterPs_->set_mode(infer_mode); + } + VLOG(0) << "set infer mode=" << infer_mode; + } void Finalize() { VLOG(3) << "PSGPUWrapper Begin Finalize."; if (s_instance_ == nullptr) { return; } + if (FLAGS_gpugraph_storage_mode == GpuGraphStorageMode::WHOLE_HBM) { + this->EndPass(); + } + for (size_t i = 0; i < hbm_pools_.size(); i++) { + delete hbm_pools_[i]; + } data_ready_channel_->Close(); buildcpu_ready_channel_->Close(); + buildpull_ready_channel_->Close(); gpu_free_channel_->Close(); running_ = false; VLOG(3) << "begin stop pre_build_threads_"; pre_build_threads_.join(); + VLOG(3) << "begin stop buildpull_threads_"; + buildpull_threads_.join(); s_instance_ = nullptr; VLOG(3) << "PSGPUWrapper Finalize Finished."; HeterPs_->show_table_collisions(); + if (HeterPs_ != NULL) { + delete HeterPs_; + HeterPs_ = NULL; + } if (device_caches_ != nullptr) { delete[] device_caches_; device_caches_ = nullptr; @@ -230,6 +272,12 @@ class PSGPUWrapper { auto gloo = paddle::framework::GlooWrapper::GetInstance(); if (gloo->Size() > 1) { multi_node_ = 1; + resource_->set_multi_node(multi_node_); + optimizer_config_.multi_node = true; + VLOG(0) << "init multi node gpu server"; + } else { + optimizer_config_.multi_node = false; + VLOG(0) << "init single node gpu server"; } #else PADDLE_THROW( @@ -262,10 +310,15 @@ class PSGPUWrapper { opts.setRoot(0); gloo::broadcast(opts); + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclGroupStart()); for (int i = 0; i < dev_size; ++i) { + platform::CUDADeviceGuard guard(dev_ids[i]); platform::dynload::ncclCommInitRank( &inter_comms_[i], gloo->Size(), inter_ncclids_[i], gloo->Rank()); } + PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::ncclGroupEnd()); + + rank_id_ = gloo->Rank(); node_size_ = gloo->Size(); #else PADDLE_THROW( @@ -278,9 +331,17 @@ class PSGPUWrapper { data_ready_channel_->SetCapacity(3); buildcpu_ready_channel_->Open(); buildcpu_ready_channel_->SetCapacity(3); + buildpull_ready_channel_->Open(); + buildpull_ready_channel_->SetCapacity(1); gpu_free_channel_->Open(); gpu_free_channel_->SetCapacity(1); + cpu_reday_channels_.resize(dev_ids.size()); + for (size_t i = 0; i < dev_ids.size(); i++) { + cpu_reday_channels_[i] = paddle::framework::MakeChannel(); + cpu_reday_channels_[i]->SetCapacity(16); + } + 
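Note: cpu_reday_channels_ above is the bounded queue that hands task_info chunks from the CPU-side builders to the per-card HBM consumers (and back again in EndPass). A rough equivalent of that Put/Get/Close contract, sketched with standard-library primitives rather than paddle::framework::Channel:

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BoundedChannel {
 public:
  explicit BoundedChannel(size_t cap) : cap_(cap) {}
  void Put(T item) {
    std::unique_lock<std::mutex> lk(mu_);
    not_full_.wait(lk, [&] { return q_.size() < cap_ || closed_; });
    if (closed_) return;  // items offered after Close() are dropped
    q_.push(std::move(item));
    not_empty_.notify_one();
  }
  bool Get(T* out) {  // returns false once the channel is closed and drained
    std::unique_lock<std::mutex> lk(mu_);
    not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
    if (q_.empty()) return false;
    *out = std::move(q_.front());
    q_.pop();
    not_full_.notify_one();
    return true;
  }
  void Close() {
    std::lock_guard<std::mutex> lk(mu_);
    closed_ = true;
    not_empty_.notify_all();
    not_full_.notify_all();
  }

 private:
  size_t cap_;
  bool closed_ = false;
  std::queue<T> q_;
  std::mutex mu_;
  std::condition_variable not_empty_, not_full_;
};

The SetCapacity(16) call in the patch plays the role of cap_ here, so producers throttle once the per-card copy consumer falls behind.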
current_task_ = nullptr; gpu_free_channel_->Put(current_task_); @@ -378,6 +439,11 @@ class PSGPUWrapper { hbm_thread_pool_[i].reset(new ::ThreadPool(1)); } + cpu_work_pool_.resize(thread_keys_shard_num_); + for (size_t i = 0; i < hbm_thread_pool_.size(); i++) { + cpu_work_pool_[i].reset(new ::ThreadPool(16)); + } + auto sparse_table_accessor = sparse_table.accessor(); auto sparse_table_accessor_parameter = sparse_table_accessor.ctr_accessor_param(); @@ -535,6 +601,7 @@ class PSGPUWrapper { } void SetSlotVector(const std::vector& slot_vector) { slot_vector_ = slot_vector; + VLOG(0) << "slot_vector size is " << slot_vector_.size(); } void SetSlotOffsetVector(const std::vector& slot_offset_vector) { @@ -589,6 +656,10 @@ class PSGPUWrapper { dim_index_map[index_dim_vec_[i]] = i; } hbm_pools_.resize(resource_->total_device() * num_of_dim); + for (size_t i = 0; i < hbm_pools_.size(); i++) { + hbm_pools_[i] = new HBMMemoryPoolFix(); + } + mem_pools_.resize(resource_->total_device() * num_of_dim); max_mf_dim_ = index_dim_vec_.back(); multi_mf_dim_ = (dim_index_map.size() >= 1) ? dim_index_map.size() : 0; @@ -648,7 +719,8 @@ class PSGPUWrapper { uint64_t, std::vector>>> local_tables_; - HeterPsBase* HeterPs_; + HeterPsBase* HeterPs_ = NULL; + // std::vector keys_tensor; // Cache for pull_sparse std::vector keys_tensor; // Cache for pull_sparse std::shared_ptr resource_; int32_t sleep_seconds_before_fail_exit_; @@ -670,6 +742,7 @@ class PSGPUWrapper { double time_4 = 0.0; int multi_node_{0}; + int rank_id_; int node_size_; uint64_t table_id_; int gpu_graph_mode_ = 0; @@ -691,6 +764,7 @@ class PSGPUWrapper { int month_; int day_; bool slot_info_initialized_ = false; + bool hbm_sparse_table_initialized_ = false; int use_afs_api_ = 0; int optimizer_type_ = 1; std::string accessor_class_; @@ -701,8 +775,8 @@ class PSGPUWrapper { #ifdef PADDLE_WITH_CUDA std::vector mem_pools_; - std::vector hbm_pools_; // in multi mfdim, one table need hbm - // pools of totol dims number + std::vector hbm_pools_; // in multi mfdim, one table need + // hbm pools of totol dims number #endif std::shared_ptr< @@ -717,12 +791,24 @@ class PSGPUWrapper { paddle::framework::ChannelObject>> gpu_free_channel_ = paddle::framework::MakeChannel>(); + std::shared_ptr< + paddle::framework::ChannelObject>> + buildpull_ready_channel_ = + paddle::framework::MakeChannel>(); + std::vector>> + cpu_reday_channels_; std::shared_ptr current_task_ = nullptr; std::thread pre_build_threads_; + std::thread buildpull_threads_; bool running_ = false; std::vector> pull_thread_pool_; std::vector> hbm_thread_pool_; + std::vector> cpu_work_pool_; OptimizerConfig optimizer_config_; + // gradient push count + uint64_t grad_push_count_ = 0; + // infer mode + bool infer_mode_ = false; protected: static bool is_initialized_; diff --git a/paddle/fluid/framework/hogwild_worker.cc b/paddle/fluid/framework/hogwild_worker.cc index 57da94da86e55..ef2ef8f8596c9 100644 --- a/paddle/fluid/framework/hogwild_worker.cc +++ b/paddle/fluid/framework/hogwild_worker.cc @@ -14,6 +14,7 @@ limitations under the License. */ #include +#include "paddle/fluid/framework/barrier.h" #include "paddle/fluid/framework/convert_utils.h" #include "paddle/fluid/framework/data_type.h" #include "paddle/fluid/framework/device_worker.h" @@ -25,9 +26,14 @@ limitations under the License. 
*/ #include "paddle/fluid/distributed/ps/service/communicator/communicator.h" #endif +DECLARE_bool(enable_exit_when_partial_worker); + namespace paddle { namespace framework { +std::atomic HogwildWorker::worker_num_stat_(0); +Barrier g_barrier; + void HogwildWorker::Initialize(const TrainerDesc &desc) { fetch_config_ = desc.fetch_config(); param_ = desc.hogwild_param(); @@ -141,9 +147,30 @@ void HogwildWorker::TrainFilesWithProfiler() { double read_time = 0.0; int cur_batch; int batch_cnt = 0; + if (thread_id_ == 0) { + worker_num_stat_.store(0); + } + g_barrier.wait(); + bool train_mode = device_reader_->IsTrainMode(); timeline.Start(); uint64_t total_inst = 0; - while ((cur_batch = device_reader_->Next()) > 0) { +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + device_reader_->InitGraphTrainResource(); +#endif + while (1) { + cur_batch = device_reader_->Next(); + if (FLAGS_enable_exit_when_partial_worker && train_mode) { + if (cur_batch > 0) { + worker_num_stat_.fetch_add(1, std::memory_order_relaxed); + } + g_barrier.wait(); + if (worker_num_stat_.load(std::memory_order_relaxed) % thread_num_ != 0) { + break; + } + } + if (cur_batch <= 0) { + break; + } VLOG(3) << "read a batch in thread " << thread_id_; timeline.Pause(); read_time += timeline.ElapsedSec(); @@ -237,8 +264,33 @@ void HogwildWorker::TrainFiles() { device_reader_->Start(); int cur_batch; int batch_cnt = 0; + if (thread_id_ == 0) { + worker_num_stat_.store(0); + } + g_barrier.wait(); - while ((cur_batch = device_reader_->Next()) > 0) { +#if defined(PADDLE_WITH_HETERPS) && defined(PADDLE_WITH_CUDA) + platform::SetDeviceId(thread_id_); +#endif + // while ((cur_batch = device_reader_->Next()) > 0) { + bool train_mode = device_reader_->IsTrainMode(); +#if defined(PADDLE_WITH_GPU_GRAPH) && defined(PADDLE_WITH_HETERPS) + device_reader_->InitGraphTrainResource(); +#endif + while (1) { + cur_batch = device_reader_->Next(); + if (FLAGS_enable_exit_when_partial_worker && train_mode) { + if (cur_batch > 0) { + worker_num_stat_.fetch_add(1, std::memory_order_relaxed); + } + g_barrier.wait(); + if (worker_num_stat_.load(std::memory_order_relaxed) % thread_num_ != 0) { + break; + } + } + if (cur_batch <= 0) { + break; + } for (auto &op : ops_) { bool need_skip = false; for (auto t = 0u; t < skip_ops_.size(); ++t) { diff --git a/paddle/fluid/framework/multi_trainer.cc b/paddle/fluid/framework/multi_trainer.cc index 2a101e8f00aeb..258da58ba7885 100644 --- a/paddle/fluid/framework/multi_trainer.cc +++ b/paddle/fluid/framework/multi_trainer.cc @@ -46,6 +46,7 @@ void MultiTrainer::Initialize(const TrainerDesc& trainer_desc, places_.push_back(place); } #endif + user_define_dump_filename_ = trainer_desc.user_define_dump_filename(); // get filelist from trainer_desc here const std::vector readers = dataset->GetReaders(); diff --git a/paddle/fluid/inference/CMakeLists.txt b/paddle/fluid/inference/CMakeLists.txt index 33e564f097f0b..b8f60d252137f 100644 --- a/paddle/fluid/inference/CMakeLists.txt +++ b/paddle/fluid/inference/CMakeLists.txt @@ -117,8 +117,7 @@ if(WITH_CRYPTO) endif() if(WITH_PSCORE) - set(SHARED_INFERENCE_DEPS ${SHARED_INFERENCE_DEPS} fleet ps_service - tensor_table) + set(SHARED_INFERENCE_DEPS ${SHARED_INFERENCE_DEPS} fleet ps_service) endif() if(WITH_INFERENCE_NVTX AND NOT WIN32) diff --git a/paddle/fluid/inference/tensorrt/helper.h b/paddle/fluid/inference/tensorrt/helper.h index 9d48024c61190..3d6bb6da3d758 100644 --- a/paddle/fluid/inference/tensorrt/helper.h +++ b/paddle/fluid/inference/tensorrt/helper.h 
@@ -22,6 +22,7 @@ #include #include +#include "paddle/fluid/framework/framework.pb.h" #include "paddle/fluid/platform/dynload/tensorrt.h" #include "paddle/fluid/platform/enforce.h" #include "paddle/phi/common/data_type.h" diff --git a/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc b/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc index d91e8c54fb344..26f988f43af3c 100644 --- a/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc +++ b/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc @@ -49,7 +49,12 @@ AutoGrowthBestFitAllocator::AutoGrowthBestFitAllocator( : underlying_allocator_(underlying_allocator), alignment_(alignment), chunk_size_(std::max(AlignedSize(chunk_size, alignment), alignment)), - allow_free_idle_chunk_(allow_free_idle_chunk) {} + allow_free_idle_chunk_(allow_free_idle_chunk) { + total_alloc_times_ = 0; + total_alloc_size_ = 0; + total_free_times_ = 0; + total_free_size_ = 0; +} phi::Allocation *AutoGrowthBestFitAllocator::AllocateImpl( size_t unaligned_size) { @@ -112,6 +117,8 @@ phi::Allocation *AutoGrowthBestFitAllocator::AllocateImpl( VLOG(2) << "Not found and reallocate " << realloc_size << "(" << static_cast(p) << "), and remaining " << remaining_size; } + ++total_alloc_times_; + total_alloc_size_ += size; VLOG(10) << "Alloc " << block_it->size_ << " bytes, ptr = " << block_it->ptr_; return new BlockAllocation(block_it); } @@ -126,6 +133,9 @@ void AutoGrowthBestFitAllocator::FreeImpl(phi::Allocation *allocation) { auto block_it = static_cast(allocation)->block_it_; auto &blocks = block_it->chunk_->blocks_; + total_free_times_ += 1; + total_free_size_ += block_it->size_; + block_it->is_free_ = true; if (block_it != blocks.begin()) { @@ -176,9 +186,30 @@ uint64_t AutoGrowthBestFitAllocator::FreeIdleChunks() { ++chunk_it; } } + + Trace(); return bytes; } +void AutoGrowthBestFitAllocator::Trace() const { + size_t cur_idle_bytes = 0; + auto it = free_blocks_.begin(); + for (; it != free_blocks_.end(); ++it) { + cur_idle_bytes += it->second->size_; + } + + VLOG(1) << "alloc:" << total_alloc_size_ / static_cast(1024 * 1024) + << "m free:" << total_free_size_ / static_cast(1024 * 1024) + << "m busy:" + << (total_alloc_size_ - total_free_size_) / + static_cast(1024 * 1024) + << "m idle:" << cur_idle_bytes / static_cast(1024 * 1024) + << "m alloc_times:" << total_alloc_times_ + << " free_times:" << total_free_times_ + << " free_blocks_num:" << free_blocks_.size() + << " curr_chunks_num:" << chunks_.size(); +} + } // namespace allocation } // namespace memory } // namespace paddle diff --git a/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.h b/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.h index dadf751bdfa41..138f4a98c4db5 100644 --- a/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.h +++ b/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.h @@ -49,6 +49,7 @@ class AutoGrowthBestFitAllocator : public Allocator { private: uint64_t FreeIdleChunks(); + void Trace() const; template using List = std::list; @@ -93,6 +94,12 @@ class AutoGrowthBestFitAllocator : public Allocator { size_t chunk_size_; bool allow_free_idle_chunk_; + // stat info + size_t total_alloc_times_; + size_t total_alloc_size_; + size_t total_free_times_; + size_t total_free_size_; + SpinLock spinlock_; }; diff --git a/paddle/fluid/operators/shuffle_batch_op.cu b/paddle/fluid/operators/shuffle_batch_op.cu index 4ab4868bfb5b2..863e056ac4d3e 100644 --- a/paddle/fluid/operators/shuffle_batch_op.cu 
+++ b/paddle/fluid/operators/shuffle_batch_op.cu @@ -27,6 +27,37 @@ namespace paddle { namespace operators { +struct CacheAllocator { + typedef char value_type; + explicit CacheAllocator(platform::Place place) { + VLOG(2) << "construct allocator"; + place_ = place; + } + + ~CacheAllocator() { VLOG(2) << "destory allocator"; } + + char *allocate(std::ptrdiff_t num_bytes) { + VLOG(2) << "allocate " << num_bytes << " bytes"; + auto storage = memory::AllocShared(place_, num_bytes); + char *ptr = reinterpret_cast(storage->ptr()); + busy_allocation_.emplace(std::make_pair(ptr, storage)); + return ptr; + } + + void deallocate(char *ptr, size_t) { + VLOG(2) << "deallocate "; + allocation_map_type::iterator iter = busy_allocation_.find(ptr); + CHECK(iter != busy_allocation_.end()); + busy_allocation_.erase(iter); + } + + private: + typedef std::unordered_map> + allocation_map_type; + allocation_map_type busy_allocation_; + platform::Place place_; +}; + template struct ReorderFunctor { ReorderFunctor(const T *x, const int64_t *shuffle_idx, T *y, int64_t stride) @@ -90,7 +121,8 @@ class ShuffleBatchCUDAKernel : public framework::OpKernel { auto &dev_ctx = ctx.template device_context(); #ifdef PADDLE_WITH_CUDA - const auto &exec_policy = thrust::cuda::par.on(dev_ctx.stream()); + CacheAllocator allocator(ctx.GetPlace()); + const auto &exec_policy = thrust::cuda::par(allocator).on(dev_ctx.stream()); #else const auto &exec_policy = thrust::hip::par.on(dev_ctx.stream()); #endif diff --git a/paddle/fluid/platform/monitor.cc b/paddle/fluid/platform/monitor.cc index ea6240b649cad..dd38ce7956309 100644 --- a/paddle/fluid/platform/monitor.cc +++ b/paddle/fluid/platform/monitor.cc @@ -19,6 +19,7 @@ namespace platform {} // namespace platform } // namespace paddle DEFINE_INT_STATUS(STAT_total_feasign_num_in_mem) +DEFINE_INT_STATUS(STAT_epoch_finish) DEFINE_INT_STATUS(STAT_gpu0_mem_size) DEFINE_INT_STATUS(STAT_gpu1_mem_size) DEFINE_INT_STATUS(STAT_gpu2_mem_size) diff --git a/paddle/fluid/platform/profiler.proto b/paddle/fluid/platform/profiler.proto index 31193534a00be..824f2ee43861f 100644 --- a/paddle/fluid/platform/profiler.proto +++ b/paddle/fluid/platform/profiler.proto @@ -58,4 +58,4 @@ message Profile { optional uint64 start_ns = 2; optional uint64 end_ns = 3; repeated MemEvent mem_events = 4; -} \ No newline at end of file +} diff --git a/paddle/fluid/pybind/data_set_py.cc b/paddle/fluid/pybind/data_set_py.cc index f717b97fb5b44..4ed5f32ff3088 100644 --- a/paddle/fluid/pybind/data_set_py.cc +++ b/paddle/fluid/pybind/data_set_py.cc @@ -368,6 +368,12 @@ void BindDataset(py::module *m) { py::call_guard()) .def("set_gpu_graph_mode", &framework::Dataset::SetGpuGraphMode, + py::call_guard()) + .def("set_pass_id", + &framework::Dataset::SetPassId, + py::call_guard()) + .def("get_pass_id", + &framework::Dataset::GetPassID, py::call_guard()); py::class_(*m, "IterableDatasetWrapper") diff --git a/paddle/fluid/pybind/fleet_py.cc b/paddle/fluid/pybind/fleet_py.cc index ff83c7a23bb1a..cea9a4c4bbc59 100644 --- a/paddle/fluid/pybind/fleet_py.cc +++ b/paddle/fluid/pybind/fleet_py.cc @@ -64,6 +64,7 @@ void BindDistFleetWrapper(py::module* m) { .def("save_one_model", &FleetWrapper::SaveModelOneTable) .def("recv_and_save_model", &FleetWrapper::RecvAndSaveTable) .def("sparse_table_stat", &FleetWrapper::PrintTableStat) + .def("save_cache_table", &FleetWrapper::SaveCacheTable) .def("stop_server", &FleetWrapper::StopServer) .def("stop_worker", &FleetWrapper::FinalizeWorker) .def("barrier", &FleetWrapper::BarrierWithTable) @@ 
-370,11 +371,18 @@ void BindGraphGpuWrapper(py::module* m) { &GraphGpuWrapper::graph_neighbor_sample)) .def("set_device", &GraphGpuWrapper::set_device) .def("set_feature_separator", &GraphGpuWrapper::set_feature_separator) + .def("set_slot_feature_separator", + &GraphGpuWrapper::set_slot_feature_separator) .def("init_service", &GraphGpuWrapper::init_service) .def("set_up_types", &GraphGpuWrapper::set_up_types) .def("query_node_list", &GraphGpuWrapper::query_node_list) .def("add_table_feat_conf", &GraphGpuWrapper::add_table_feat_conf) - .def("load_edge_file", &GraphGpuWrapper::load_edge_file) + .def("load_edge_file", + py::overload_cast( + &GraphGpuWrapper::load_edge_file)) + .def("load_edge_file", + py::overload_cast( + &GraphGpuWrapper::load_edge_file)) .def("load_node_and_edge", &GraphGpuWrapper::load_node_and_edge) .def("upload_batch", py::overload_cast( @@ -388,6 +396,10 @@ void BindGraphGpuWrapper(py::module* m) { .def("get_all_id", py::overload_cast>*>( &GraphGpuWrapper::get_all_id)) + .def("init_metapath", &GraphGpuWrapper::init_metapath) + .def("get_node_type_size", &GraphGpuWrapper::get_node_type_size) + .def("get_edge_type_size", &GraphGpuWrapper::get_edge_type_size) + .def("clear_metapath_state", &GraphGpuWrapper::clear_metapath_state) .def("load_next_partition", &GraphGpuWrapper::load_next_partition) .def("make_partitions", &GraphGpuWrapper::make_partitions) .def("make_complementary_graph", @@ -398,7 +410,15 @@ void BindGraphGpuWrapper(py::module* m) { .def("get_partition", &GraphGpuWrapper::get_partition) .def("load_node_weight", &GraphGpuWrapper::load_node_weight) .def("export_partition_files", &GraphGpuWrapper::export_partition_files) - .def("load_node_file", &GraphGpuWrapper::load_node_file) + .def("load_node_file", + py::overload_cast( + &GraphGpuWrapper::load_node_file)) + .def("load_node_file", + py::overload_cast( + &GraphGpuWrapper::load_node_file)) + .def("release_graph", &GraphGpuWrapper::release_graph) + .def("release_graph_edge", &GraphGpuWrapper::release_graph_edge) + .def("release_graph_node", &GraphGpuWrapper::release_graph_node) .def("finalize", &GraphGpuWrapper::finalize); } #endif diff --git a/paddle/fluid/pybind/gloo_wrapper_py.cc b/paddle/fluid/pybind/gloo_wrapper_py.cc index e570333d091b4..f112e5178b18f 100644 --- a/paddle/fluid/pybind/gloo_wrapper_py.cc +++ b/paddle/fluid/pybind/gloo_wrapper_py.cc @@ -31,8 +31,14 @@ namespace py = pybind11; namespace paddle { namespace pybind { void BindGlooWrapper(py::module* m) { +#if defined(PADDLE_WITH_GPU_GRAPH) + py::class_>( + *m, "Gloo") + .def(py::init([]() { return framework::GlooWrapper::GetInstance(); })) +#else py::class_(*m, "Gloo") .def(py::init()) +#endif .def("init", &framework::GlooWrapper::Init) .def("rank", &framework::GlooWrapper::Rank) .def("size", &framework::GlooWrapper::Size) diff --git a/paddle/fluid/pybind/ps_gpu_wrapper_py.cc b/paddle/fluid/pybind/ps_gpu_wrapper_py.cc index e9c993d3ee128..7c02a02aff775 100644 --- a/paddle/fluid/pybind/ps_gpu_wrapper_py.cc +++ b/paddle/fluid/pybind/ps_gpu_wrapper_py.cc @@ -64,6 +64,9 @@ void BindPSGPUWrapper(py::module* m) { .def("begin_pass", &framework::PSGPUWrapper::BeginPass, py::call_guard()) + .def("dump_to_mem", + &framework::PSGPUWrapper::DumpToMem, + py::call_guard()) .def("load_into_memory", &framework::PSGPUWrapper::LoadIntoMemory, py::call_guard()) @@ -74,6 +77,9 @@ void BindPSGPUWrapper(py::module* m) { #endif .def("finalize", &framework::PSGPUWrapper::Finalize, + py::call_guard()) + .def("set_mode", + &framework::PSGPUWrapper::SetMode, 
py::call_guard()); } // end PSGPUWrapper #ifdef PADDLE_WITH_PSLIB diff --git a/paddle/phi/core/flags.cc b/paddle/phi/core/flags.cc index cdcf67f245fb2..29c9c63e7fd22 100644 --- a/paddle/phi/core/flags.cc +++ b/paddle/phi/core/flags.cc @@ -843,6 +843,20 @@ PADDLE_DEFINE_EXPORTED_bool(graph_load_in_parallel, "It controls whether load graph node and edge with " "mutli threads parallely."); +/** + * Distributed related FLAG + * Name: FLAGS_graph_metapath_split_opt + * Since Version: 2.2.0 + * Value Range: bool, default=false + * Example: + * Note: Control whether load graph node and edge with multi threads parallely + * If it is not set, load graph data with one thread + */ +PADDLE_DEFINE_EXPORTED_bool(graph_metapath_split_opt, + false, + "It controls whether load graph node and edge with " + "mutli threads parallely."); + /** * Distributed related FLAG * Name: FLAGS_graph_get_neighbor_id @@ -857,6 +871,32 @@ PADDLE_DEFINE_EXPORTED_bool( false, "It controls get all neighbor id when running sub part graph."); +/** + * Distributed related FLAG + * Name: enable_exit_when_partial_worker + * Since Version: 2.2.0 + * Value Range: bool, default=false + * Example: + * Note: Control whether exit trainer when an worker has no ins. + * If it is not set, trainer will exit until all worker finish train. + */ +PADDLE_DEFINE_EXPORTED_bool( + enable_exit_when_partial_worker, + false, + "It controls whether exit trainer when an worker has no ins."); + +/** + * Distributed related FLAG + * Name: enable_exit_when_partial_worker + * Since Version: 2.2.0 + * Value Range: bool, default=false + * Example: + * Note: represent gpugraph storage mode, 1 for full hbm, 2 for hbm + mem + ssd. + */ +PADDLE_DEFINE_EXPORTED_int32(gpugraph_storage_mode, + 1, + "gpugraph storage mode, default 1"); + /** * KP kernel related FLAG * Name: FLAGS_run_kp_kernel @@ -985,6 +1025,9 @@ PADDLE_DEFINE_EXPORTED_uint64( gpugraph_merge_grads_segment_size, 128, "segment size with segment gradient merge, default 128"); +PADDLE_DEFINE_EXPORTED_uint64(gpugraph_slot_feasign_max_num, + 5, + "max feasign number in one slot, default 5"); PADDLE_DEFINE_EXPORTED_int32( gpugraph_dedup_pull_push_mode, 0, @@ -992,7 +1035,27 @@ PADDLE_DEFINE_EXPORTED_int32( PADDLE_DEFINE_EXPORTED_bool(gpugraph_load_node_list_into_hbm, true, "enable load_node_list_into_hbm, default true"); - +PADDLE_DEFINE_EXPORTED_int32(gpugraph_sparse_table_storage_mode, + 0, + "parse_table_storage_mode, default 0"); +PADDLE_DEFINE_EXPORTED_bool(enable_auto_detect_gpu_topo, + true, + "enable auto detect gpu topo, default true"); +PADDLE_DEFINE_EXPORTED_bool(enable_auto_rdma_trans, + true, + "enable auto gpu rdma trans, default true"); +PADDLE_DEFINE_EXPORTED_bool(enable_tracker_all2all, + false, + "enable tracker all2all log, default false"); +PADDLE_DEFINE_EXPORTED_bool(enable_all2all_use_fp16, + false, + "enable all2all use fp16, default false"); +PADDLE_DEFINE_EXPORTED_bool(enable_sparse_inner_gather, + false, + "enable sparse inner gather, default false"); +PADDLE_DEFINE_EXPORTED_bool(gpugraph_debug_gpu_memory, + false, + "enable debug gpu memory, default false"); /** * ProcessGroupNCCL related FLAG * Name: nccl_blocking_wait diff --git a/paddle/phi/kernels/gpu/graph_reindex_funcs.h b/paddle/phi/kernels/gpu/graph_reindex_funcs.h index aee6e5c4d46ce..2a5479e076e1d 100644 --- a/paddle/phi/kernels/gpu/graph_reindex_funcs.h +++ b/paddle/phi/kernels/gpu/graph_reindex_funcs.h @@ -23,7 +23,7 @@ namespace phi { template inline __device__ size_t Hash(T id, int64_t size) { - return id % size; + 
return static_cast(id) % size; // NOLINT } template @@ -169,7 +169,7 @@ __global__ void FillUniqueItems(const T* items, template __global__ void ReindexSrcOutput(T* src_output, - int num_items, + int64_t num_items, int64_t size, const T* keys, const int* values) { diff --git a/paddle/phi/kernels/gpu/graph_reindex_kernel.cu b/paddle/phi/kernels/gpu/graph_reindex_kernel.cu index 10a5eec5b1ecf..9b12adb13bcaa 100644 --- a/paddle/phi/kernels/gpu/graph_reindex_kernel.cu +++ b/paddle/phi/kernels/gpu/graph_reindex_kernel.cu @@ -20,6 +20,15 @@ #include #include +#ifdef __NVCC__ +#include +#endif +#ifdef __HIPCC__ +#include +namespace cub = hipcub; +#endif + +#include "paddle/fluid/memory/memory.h" #include "paddle/phi/backends/gpu/gpu_context.h" #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/gpu/graph_reindex_funcs.h" @@ -27,16 +36,27 @@ namespace phi { constexpr int WARP_SIZE = 32; +const int CUDA_NUM_THREADS = 512; +inline int GET_BLOCKS(const int N) { + return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS; +} + +template +__global__ void InitializeHashTable(T* tensor, int len) { + CUDA_KERNEL_LOOP(idx, len) { tensor[idx] = -1; } +} template -void FillHashTable(const Context& dev_ctx, - const T* input, - int num_input, - int64_t len_hashtable, - thrust::device_vector* unique_items, - T* keys, - int* values, - int* key_index) { +std::shared_ptr FillHashTable(const Context& dev_ctx, + const T* input, + int num_input, + int64_t len_hashtable, + T* keys, + int* values, + int* key_index, + int* final_nodes_len) { + const auto place = dev_ctx.GetPlace(); + #ifdef PADDLE_WITH_HIP int block = 256; #else @@ -50,30 +70,53 @@ void FillHashTable(const Context& dev_ctx, input, num_input, len_hashtable, keys, key_index); // Get item index count. 
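// Hedged sketch of the counting scheme used below (assumed semantics):
// item_count has num_input + 1 slots; slot i becomes 1 when input[i] is the
// first occurrence that claims its hash bucket, otherwise 0, and the extra
// trailing slot stays 0. The exclusive prefix sum then turns the counts into
// write offsets for FillUniqueItems, and the entry at index num_input is the
// total number of unique items, e.g. for num_input = 4:
//   item_count    = [1, 0, 1, 1, 0]
//   exclusive sum = [0, 1, 1, 2, 3]  -> total_unique_items = 3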
- thrust::device_vector item_count(num_input + 1, 0); + auto item_count = paddle::memory::Alloc(place, (num_input + 1) * sizeof(int)); + int* item_count_ptr = reinterpret_cast(item_count->ptr()); +#ifdef PADDLE_WITH_HIP + hipMemset(item_count_ptr, 0, sizeof(int) * (num_input + 1)); +#else + cudaMemset(item_count_ptr, 0, sizeof(int) * (num_input + 1)); +#endif GetItemIndexCount<<>>( - input, - thrust::raw_pointer_cast(item_count.data()), - num_input, - len_hashtable, - keys, - key_index); + input, item_count_ptr, num_input, len_hashtable, keys, key_index); + + size_t temp_storage_bytes = 0; + cub::DeviceScan::ExclusiveSum( + NULL, temp_storage_bytes, item_count_ptr, item_count_ptr, num_input + 1); + auto d_temp_storage = paddle::memory::Alloc(place, temp_storage_bytes); + cub::DeviceScan::ExclusiveSum(d_temp_storage->ptr(), + temp_storage_bytes, + item_count_ptr, + item_count_ptr, + num_input + 1); + int total_unique_items = 0; +#ifdef PADDLE_WITH_HIP + hipMemcpy(&total_unique_items, + item_count_ptr + num_input, + sizeof(int), + hipMemcpyDeviceToHost); +#else + cudaMemcpy(&total_unique_items, + item_count_ptr + num_input, + sizeof(int), + cudaMemcpyDeviceToHost); +#endif - thrust::exclusive_scan( - item_count.begin(), item_count.end(), item_count.begin()); - size_t total_unique_items = item_count[num_input]; - unique_items->resize(total_unique_items); + auto unique_items = + paddle::memory::AllocShared(place, total_unique_items * sizeof(T)); + T* unique_items_data = reinterpret_cast(unique_items->ptr()); + *final_nodes_len = total_unique_items; // Get unique items - FillUniqueItems<<>>( - input, - num_input, - len_hashtable, - thrust::raw_pointer_cast(unique_items->data()), - thrust::raw_pointer_cast(item_count.data()), - keys, - values, - key_index); + FillUniqueItems<<>>(input, + num_input, + len_hashtable, + unique_items_data, + item_count_ptr, + keys, + values, + key_index); + return unique_items; } template @@ -137,6 +180,26 @@ void ResetBufferHashTable(const Context& dev_ctx, values); } +template +void ReindexSrc(const Context& dev_ctx, + T* edges_src, + T* keys, + int* values, + int64_t num_edges, + int64_t table_size) { +// Fill outputs with reindex result. +#ifdef PADDLE_WITH_HIP + int block = 256; +#else + int block = 1024; +#endif + int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; + int grid_tmp = (num_edges + block - 1) / block; + int grid = grid_tmp < max_grid_dimx ? 
grid_tmp : max_grid_dimx; + ReindexSrcOutput<<>>( + edges_src, num_edges, table_size, keys, values); +} + template void Reindex(const Context& dev_ctx, const T* inputs, @@ -148,67 +211,53 @@ void Reindex(const Context& dev_ctx, thrust::copy(inputs, inputs + num_inputs, out_nodes->begin()); thrust::copy( src_outputs, src_outputs + num_edges, out_nodes->begin() + num_inputs); - thrust::device_vector unique_nodes; - unique_nodes.clear(); // Fill hash table int64_t num = out_nodes->size(); int64_t log_num = 1 << static_cast(1 + std::log2(num >> 1)); int64_t table_size = log_num << 1; - T* keys; - int *values, *key_index; - -#ifdef PADDLE_WITH_HIP - hipMalloc(&keys, table_size * sizeof(T)); - hipMalloc(&values, table_size * sizeof(int)); - hipMalloc(&key_index, table_size * sizeof(int)); - hipMemset(keys, -1, table_size * sizeof(T)); - hipMemset(values, -1, table_size * sizeof(int)); - hipMemset(key_index, -1, table_size * sizeof(int)); -#else - cudaMalloc(&keys, table_size * sizeof(T)); - cudaMalloc(&values, table_size * sizeof(int)); - cudaMalloc(&key_index, table_size * sizeof(int)); - cudaMemset(keys, -1, table_size * sizeof(T)); - cudaMemset(values, -1, table_size * sizeof(int)); - cudaMemset(key_index, -1, table_size * sizeof(int)); -#endif - - FillHashTable(dev_ctx, - thrust::raw_pointer_cast(out_nodes->data()), - out_nodes->size(), - table_size, - &unique_nodes, - keys, - values, - key_index); - out_nodes->resize(unique_nodes.size()); - thrust::copy(unique_nodes.begin(), unique_nodes.end(), out_nodes->begin()); -// Fill outputs with reindex result. -#ifdef PADDLE_WITH_HIP - int block = 256; -#else - int block = 1024; -#endif - int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; - int grid_tmp = (num_edges + block - 1) / block; - int grid = grid_tmp < max_grid_dimx ? 
grid_tmp : max_grid_dimx; - ReindexSrcOutput<<>>( - thrust::raw_pointer_cast(src_outputs), - num_edges, - table_size, - keys, - values); -#ifdef PADDLE_WITH_HIP - hipFree(keys); - hipFree(values); - hipFree(key_index); -#else - cudaFree(keys); - cudaFree(values); - cudaFree(key_index); -#endif + auto keys = paddle::memory::Alloc(dev_ctx.GetPlace(), table_size * sizeof(T)); + auto values = + paddle::memory::Alloc(dev_ctx.GetPlace(), table_size * sizeof(int)); + auto key_index = + paddle::memory::Alloc(dev_ctx.GetPlace(), table_size * sizeof(int)); + T* keys_ptr = reinterpret_cast(keys->ptr()); + int* values_ptr = reinterpret_cast(values->ptr()); + int* key_index_ptr = reinterpret_cast(key_index->ptr()); + + InitializeHashTable + <<>>( + keys_ptr, table_size); + InitializeHashTable + <<>>( + values_ptr, table_size); + InitializeHashTable + <<>>( + key_index_ptr, table_size); + + int unique_len = 0; + std::shared_ptr unique_items = + FillHashTable(dev_ctx, + thrust::raw_pointer_cast(out_nodes->data()), + out_nodes->size(), + table_size, + keys_ptr, + values_ptr, + key_index_ptr, + &unique_len); + out_nodes->resize(unique_len); + T* unique_items_data = reinterpret_cast(unique_items->ptr()); + thrust::copy(thrust::device_pointer_cast(unique_items_data), + thrust::device_pointer_cast(unique_items_data) + unique_len, + out_nodes->begin()); + + ReindexSrc(dev_ctx, + thrust::raw_pointer_cast(src_outputs), + keys_ptr, + values_ptr, + num_edges, + table_size); } template @@ -281,6 +330,47 @@ __global__ void GetDstEdgeCUDAKernel(const int64_t num_rows, } } +template +void ReindexDst(const Context& dev_ctx, + T* reindex_dst_data, + int* scan_dst_data, + const int* count_data, + int num_edge_types, + int node_len) { + constexpr int BLOCK_WARPS = 128 / WARP_SIZE; + constexpr int TILE_SIZE = BLOCK_WARPS * 16; + const dim3 block(WARP_SIZE, BLOCK_WARPS); + const dim3 grid((node_len + TILE_SIZE - 1) / TILE_SIZE); + + int begin = 0, count_i = 0; + thrust::device_vector dst_ptr(node_len + 1, 0); + for (int i = 0; i < num_edge_types; i++) { + thrust::inclusive_scan( + thrust::device_pointer_cast(count_data) + i * node_len, + thrust::device_pointer_cast(count_data) + (i + 1) * node_len, + dst_ptr.begin() + 1); + GetDstEdgeCUDAKernel + <<>>( + node_len, + scan_dst_data, + count_data + i * node_len, + thrust::raw_pointer_cast(dst_ptr.data()), + reindex_dst_data + begin); +#ifdef PADDLE_WITH_HIP + hipMemcpy(&count_i, + thrust::raw_pointer_cast(dst_ptr.data()) + node_len, + sizeof(int), + hipMemcpyDeviceToHost); +#else + cudaMemcpy(&count_i, + thrust::raw_pointer_cast(dst_ptr.data()) + node_len, + sizeof(int), + cudaMemcpyDeviceToHost); +#endif + begin += count_i; + } +} + template void GraphReindexKernel(const Context& dev_ctx, const DenseTensor& x, @@ -336,32 +426,15 @@ void GraphReindexKernel(const Context& dev_ctx, int num_edge_types = num_ac_count / bs; thrust::device_vector unique_dst_reindex(bs); thrust::sequence(unique_dst_reindex.begin(), unique_dst_reindex.end()); - constexpr int BLOCK_WARPS = 128 / WARP_SIZE; - constexpr int TILE_SIZE = BLOCK_WARPS * 16; - const dim3 block(WARP_SIZE, BLOCK_WARPS); - const dim3 grid((bs + TILE_SIZE - 1) / TILE_SIZE); reindex_dst->Resize({num_edges}); T* reindex_dst_data = dev_ctx.template Alloc(reindex_dst); - int begin = 0; - for (int i = 0; i < num_edge_types; i++) { - thrust::device_vector dst_ptr(bs); - thrust::exclusive_scan( - count_data + i * bs, count_data + (i + 1) * bs, dst_ptr.begin()); - - GetDstEdgeCUDAKernel - <<>>( - bs, - 
thrust::raw_pointer_cast(unique_dst_reindex.data()), - count_data + i * bs, - thrust::raw_pointer_cast(dst_ptr.data()), - reindex_dst_data + begin); - - int count_i = - thrust::reduce(thrust::device_pointer_cast(count_data) + i * bs, - thrust::device_pointer_cast(count_data) + (i + 1) * bs); - begin += count_i; - } + ReindexDst(dev_ctx, + reindex_dst_data, + thrust::raw_pointer_cast(unique_dst_reindex.data()), + count_data, + num_edge_types, + bs); out_nodes->Resize({static_cast(unique_nodes.size())}); T* out_nodes_data = dev_ctx.template Alloc(out_nodes); thrust::copy(unique_nodes.begin(), unique_nodes.end(), out_nodes_data); diff --git a/paddle/phi/kernels/graph_reindex_kernel.h b/paddle/phi/kernels/graph_reindex_kernel.h index 12a742006ee73..044227fd7a7a7 100644 --- a/paddle/phi/kernels/graph_reindex_kernel.h +++ b/paddle/phi/kernels/graph_reindex_kernel.h @@ -18,6 +18,32 @@ namespace phi { +template +std::shared_ptr FillHashTable(const Context& dev_ctx, + const T* input, + int num_input, + int64_t len_hashtable, + T* keys, + int* values, + int* key_index, + int* final_nodes_len); + +template +void ReindexSrc(const Context& dev_ctx, + T* edges_src, + T* keys, + int* values, + int64_t num_edges, + int64_t table_size); + +template +void ReindexDst(const Context& dev_ctx, + T* reindex_dst_data, + int* scan_dst_data, + const int* count_data, + int num_edge_types, + int node_len); + template void GraphReindexKernel(const Context& dev_ctx, const DenseTensor& x, diff --git a/paddle/phi/kernels/send_u_recv_grad_kernel.h b/paddle/phi/kernels/send_u_recv_grad_kernel.h index 1acb3bd7f14c4..7dfb19c30d2b7 100644 --- a/paddle/phi/kernels/send_u_recv_grad_kernel.h +++ b/paddle/phi/kernels/send_u_recv_grad_kernel.h @@ -31,4 +31,5 @@ void SendURecvGradKernel(const Context& ctx, const DenseTensor& out_grad, const std::string& reduce_op, DenseTensor* x_grad); + } // namespace phi diff --git a/paddle/scripts/paddle_build.sh b/paddle/scripts/paddle_build.sh index 73fa2181076cc..e54039c85f0ac 100755 --- a/paddle/scripts/paddle_build.sh +++ b/paddle/scripts/paddle_build.sh @@ -3473,7 +3473,6 @@ function run_setup(){ # Delete previous built paddle cache rm -rf ${PADDLE_ROOT}/build/python/paddle 2>/dev/null || true startTime_s=`date +%s` - SYSTEM=`uname -s` if [ "$SYSTEM" == "Darwin" ]; then echo "Using python abi: $1" diff --git a/paddle/utils/string/string_helper.h b/paddle/utils/string/string_helper.h index b84c7fa75209d..029eb9eb59dc6 100644 --- a/paddle/utils/string/string_helper.h +++ b/paddle/utils/string/string_helper.h @@ -334,6 +334,42 @@ inline int split_string_ptr(const char* str, return num; } +inline int split_string_ptr(const char* str, + size_t len, + char delim, + std::vector* values, + int max_num) { + if (len <= 0) { + return 0; + } + + int num = 0; + const char* p = str; + const char* end = str + len; + const char* last = str; + while (p < end) { + if (*p != delim) { + ++p; + continue; + } + values->emplace_back(last, (size_t)(p - last)); + ++num; + ++p; + if (num >= max_num) { + return num; + } + // skip continue delim + while (*p == delim) { + ++p; + } + last = p; + } + if (p > last) { + values->emplace_back(last, (size_t)(p - last)); + ++num; + } + return num; +} // A helper class for reading lines from file. A line buffer is maintained. It // doesn't need to know the maximum possible length of a line. 
diff --git a/paddle/utils/string/string_helper_test.cc b/paddle/utils/string/string_helper_test.cc index e0789e9a545dd..68382e692d6cf 100644 --- a/paddle/utils/string/string_helper_test.cc +++ b/paddle/utils/string/string_helper_test.cc @@ -63,3 +63,23 @@ TEST(StringHelper, JoinStringsWithConversion) { paddle::string::join_strings(v, ",", [](int x) { return x * x; }); EXPECT_EQ(result, "4,9"); } + +TEST(StringHelper, SplitString) { + std::string line = "hello world my world"; + std::vector vals; + int num = 0; + num = + paddle::string::split_string_ptr(line.c_str(), line.length(), ' ', &vals); + EXPECT_EQ(num, 4); + + num = paddle::string::split_string_ptr( + line.c_str(), line.length(), ' ', &vals, 3); + EXPECT_EQ(num, 3); + + num = paddle::string::split_string_ptr( + line.c_str(), line.length(), ' ', &vals, 10); + EXPECT_EQ(num, 4); + + num = paddle::string::split_string_ptr(line.c_str(), -1, ' ', &vals, 3); + EXPECT_EQ(num, 0); +} diff --git a/python/paddle/distributed/fleet/__init__.py b/python/paddle/distributed/fleet/__init__.py index aebedefeaafcb..0cda5198ab3c9 100755 --- a/python/paddle/distributed/fleet/__init__.py +++ b/python/paddle/distributed/fleet/__init__.py @@ -102,4 +102,5 @@ set_log_level = log_util.set_log_level get_log_level_code = log_util.get_log_level_code get_log_level_name = log_util.get_log_level_name +save_cache_table = fleet.save_cache_table from .. import auto_parallel as auto diff --git a/python/paddle/distributed/fleet/base/distributed_strategy.py b/python/paddle/distributed/fleet/base/distributed_strategy.py index 68d2d7a5b0e75..efa94862b5246 100755 --- a/python/paddle/distributed/fleet/base/distributed_strategy.py +++ b/python/paddle/distributed/fleet/base/distributed_strategy.py @@ -608,8 +608,12 @@ def fleet_desc_configs(self, configs): 'embedx_sparse_beta2_decay_rate', 'feature_learning_rate', 'nodeid_slot', + 'sparse_load_filter_slots', + ] + support_sparse_table_class = [ + 'DownpourSparseTable', + 'DownpourSparseSSDTable', ] - support_sparse_table_class = ['DownpourSparseTable'] support_sparse_accessor_class = [ 'DownpourSparseValueAccessor', 'DownpourCtrAccessor', @@ -729,10 +733,13 @@ def set_sparse_table_config(table_data, config): ) if table_class not in support_sparse_table_class: raise ValueError( - "support sparse_table_class: ['DownpourSparseTable'], but actual %s" + "support sparse_table_class: ['DownpourSparseTable, DownpourSparseSSDTable'], but actual %s" % (table_class) ) - table_data.table_class = 'MemorySparseTable' + if table_class == "DownpourSparseSSDTable": + table_data.table_class = 'SSDSparseTable' + else: + table_data.table_class = 'MemorySparseTable' table_data.shard_num = config.get('sparse_shard_num', 1000) table_data.enable_sparse_table_cache = config.get( 'sparse_enable_cache', True @@ -801,6 +808,10 @@ def set_sparse_table_config(table_data, config): table_data.accessor.ctr_accessor_param.ssd_unseenday_threshold = ( config.get('sparse_ssd_unseenday_threshold', 1) ) + load_filter_slots = config.get('sparse_load_filter_slots', []) + table_data.accessor.ctr_accessor_param.load_filter_slots.extend( + load_filter_slots + ) converter = config.get('sparse_converter', "") deconverter = config.get('sparse_deconverter', "") diff --git a/python/paddle/distributed/fleet/base/runtime_factory.py b/python/paddle/distributed/fleet/base/runtime_factory.py index 1bc6eef1404fa..d4acee8474e32 100644 --- a/python/paddle/distributed/fleet/base/runtime_factory.py +++ b/python/paddle/distributed/fleet/base/runtime_factory.py @@ -17,18 +17,22 @@ 
__all__ = [] -class RuntimeFactory: +class RuntimeFactory(object): def __init__(self): pass def _create_runtime(self, context): + # add collective && pslib mode + if "use_fleet_ps" in context and context["use_fleet_ps"]: + ps_runtime = TheOnePSRuntime() + ps_runtime._set_basic_info(context) + return ps_runtime if context["role_maker"]._is_collective: collective_runtime = CollectiveRuntime() collective_runtime._set_basic_info(context) return collective_runtime k_steps = context["valid_strategy"].a_sync_configs["k_steps"] - if not context["role_maker"]._is_collective and k_steps >= 0: ps_runtime = TheOnePSRuntime() ps_runtime._set_basic_info(context) diff --git a/python/paddle/distributed/fleet/fleet.py b/python/paddle/distributed/fleet/fleet.py index 644b9fabf7428..6efcadbb51a5d 100755 --- a/python/paddle/distributed/fleet/fleet.py +++ b/python/paddle/distributed/fleet/fleet.py @@ -969,6 +969,15 @@ def save_cache_model(self, dirname, **configs): def check_save_pre_patch_done(self): return self._runtime_handle._check_save_pre_patch_done() + @is_non_distributed_check + @inited_runtime_handler + def save_cache_table( + self, table_id, pass_id, mem_cache_key_threshold=4000000000 + ): + return self._runtime_handle._save_cache_table( + table_id, pass_id, mem_cache_key_threshold + ) + @is_non_distributed_check @inited_runtime_handler def save_one_table(self, table_id, path, mode): @@ -1026,6 +1035,8 @@ def save_dense_params( executor, dirname, scope, program, var_names ) + @is_non_distributed_check + @inited_runtime_handler def shrink(self, threshold=None): self._runtime_handle._shrink(threshold) @@ -1318,57 +1329,87 @@ def _minimize_impl( return optimize_ops, params_grads, dist_startup_prog, dist_main_prog - # compile time - distributed_optimizer_list = ( - MetaOptimizerFactory()._get_valid_meta_optimizers( - self.user_defined_optimizer - ) - ) - context["user_defined_strategy"] = copy.deepcopy( self._user_defined_strategy ) copy_user_defined_strategy = copy.deepcopy(self._user_defined_strategy) - # trigger the auto-parallel in very strict condition - # strategy = DistributedStrategy() - # strategy.auto = True - # optimizer = paddle.optimizer.SGD(learning_rate=0.1) - # optimizer = fleet.distributed_optimizer(optimizer, strategy) - if copy_user_defined_strategy._is_strict_auto(): - # turn on all the strategy for each optimizer - for opt in distributed_optimizer_list: - opt._enable_strategy(copy_user_defined_strategy, context) - - valid_optimizer_list = [] - valid_graph_optimizer_list = [] can_not_apply_optimizer_list = [] - # recall meta optimizers for ranking - for opt in distributed_optimizer_list: - opt._set_basic_info( + # fix set collective and fleet ps gpu error + if ( + self._is_collective + and len(self._user_defined_strategy.sparse_table_configs) > 0 + ): + context["use_fleet_ps"] = True + from .meta_optimizers import ParameterServerOptimizer + + meta_optimizer = ParameterServerOptimizer( + self.user_defined_optimizer + ) + meta_optimizer._set_basic_info( loss, self._role_maker, self.user_defined_optimizer, copy_user_defined_strategy, ) - if opt._can_apply() and not opt._is_graph_out(): - valid_optimizer_list.append(opt) - elif opt._can_apply() and opt._is_graph_out(): - valid_graph_optimizer_list.append(opt) - else: - can_not_apply_optimizer_list.append(opt) - # combine recalled meta optimizers to be a valid meta optimizer - ( - meta_optimizer, - graph_optimizer, - ) = self.strategy_compiler.generate_optimizer( - loss, - self._role_maker, - self.user_defined_optimizer, - 
copy_user_defined_strategy, - valid_optimizer_list, - valid_graph_optimizer_list, - ) + can_not_apply_optimizer_list.append(meta_optimizer) + from .meta_optimizers import ParameterServerGraphOptimizer + + graph_optimizer = ParameterServerGraphOptimizer( + self.user_defined_optimizer + ) + graph_optimizer._set_basic_info( + loss, + self._role_maker, + self.user_defined_optimizer, + copy_user_defined_strategy, + ) + can_not_apply_optimizer_list.append(graph_optimizer) + else: + # compile time + distributed_optimizer_list = ( + MetaOptimizerFactory()._get_valid_meta_optimizers( + self.user_defined_optimizer + ) + ) + # trigger the auto-parallel in very strict condition + # strategy = DistributedStrategy() + # strategy.auto = True + # optimizer = paddle.optimizer.SGD(learning_rate=0.1) + # optimizer = fleet.distributed_optimizer(optimizer, strategy) + if copy_user_defined_strategy._is_strict_auto(): + # turn on all the strategy for each optimizer + for opt in distributed_optimizer_list: + opt._enable_strategy(copy_user_defined_strategy, context) + + valid_optimizer_list = [] + valid_graph_optimizer_list = [] + # recall meta optimizers for ranking + for opt in distributed_optimizer_list: + opt._set_basic_info( + loss, + self._role_maker, + self.user_defined_optimizer, + copy_user_defined_strategy, + ) + if opt._can_apply() and not opt._is_graph_out(): + valid_optimizer_list.append(opt) + elif opt._can_apply() and opt._is_graph_out(): + valid_graph_optimizer_list.append(opt) + else: + can_not_apply_optimizer_list.append(opt) + # combine recalled meta optimizers to be a valid meta optimizer + ( + meta_optimizer, + graph_optimizer, + ) = self.strategy_compiler.generate_optimizer( + loss, + self._role_maker, + self.user_defined_optimizer, + copy_user_defined_strategy, + valid_optimizer_list, + valid_graph_optimizer_list, + ) valid_strategy = self.strategy_compiler._get_valid_strategy( copy_user_defined_strategy, can_not_apply_optimizer_list diff --git a/python/paddle/distributed/ps/the_one_ps.py b/python/paddle/distributed/ps/the_one_ps.py index ce725e6f3717a..05d6ce9b78a0e 100755 --- a/python/paddle/distributed/ps/the_one_ps.py +++ b/python/paddle/distributed/ps/the_one_ps.py @@ -1631,11 +1631,17 @@ def _save_cache_model(self, dirname, **kwargs): fleet.util.barrier() return feasign_num + def _save_cache_table(self, table_id, pass_id, mem_cache_key_threshold): + if self.role_maker._is_first_worker(): + self._worker.save_cache_table( + table_id, pass_id, mem_cache_key_threshold + ) + fleet.util.barrier() + def _check_save_pre_patch_done(self): fleet.util.barrier() if self.role_maker._is_first_worker(): self._worker.check_save_pre_patch_done() - fleet.util.barrier() def _load_sparse_params(self, dirname, context, main_program, mode): distributed_varnames = get_sparse_tablenames( diff --git a/python/paddle/fluid/dataset.py b/python/paddle/fluid/dataset.py index b21550bcc3ad3..650cf3756dabb 100644 --- a/python/paddle/fluid/dataset.py +++ b/python/paddle/fluid/dataset.py @@ -389,6 +389,7 @@ def __init__(self): self.merge_by_lineid = False self.fleet_send_sleep_seconds = None self.trainer_num = -1 + self.pass_id = 0 @deprecated( since="2.0.0", @@ -1112,8 +1113,50 @@ def set_graph_config(self, config): self.proto_desc.graph_config.gpu_graph_training = config.get( "gpu_graph_training", True ) + self.proto_desc.graph_config.sage_mode = config.get("sage_mode", False) + self.proto_desc.graph_config.samples = config.get("samples", "") + self.proto_desc.graph_config.train_table_cap = config.get( + 
"train_table_cap", 800000 + ) + self.proto_desc.graph_config.infer_table_cap = config.get( + "infer_table_cap", 800000 + ) self.dataset.set_gpu_graph_mode(True) + def set_pass_id(self, pass_id): + """ + Set pass id, user can set pass id in gpu graph mode. + + Args: + pass_id: pass id. + + Examples: + .. code-block:: python + + import paddle.fluid as fluid + pass_id = 0 + dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset") + dataset.set_pass_id(pass_id) + """ + self.pass_id = pass_id + self.dataset.set_pass_id(pass_id) + + def get_pass_id(self): + """ + Get pass id, user can set pass id in gpu graph mode. + + Returns: + The pass id. + + Examples: + .. code-block:: python + + import paddle.fluid as fluid + dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset") + pass_id = dataset.get_pass_id() + """ + return self.pass_id + class QueueDataset(DatasetBase): """ diff --git a/python/paddle/fluid/tests/unittests/CMakeLists.txt b/python/paddle/fluid/tests/unittests/CMakeLists.txt index 7e24d51b64a00..25c545861d39e 100755 --- a/python/paddle/fluid/tests/unittests/CMakeLists.txt +++ b/python/paddle/fluid/tests/unittests/CMakeLists.txt @@ -625,6 +625,11 @@ if(WITH_DISTRIBUTE) list(REMOVE_ITEM DIST_TEST_OPS "test_dist_fleet_gloo") + if(NOT WITH_GPU) + list(REMOVE_ITEM DIST_TEST_OPS "test_dist_fleet_spmt") + list(REMOVE_ITEM DIST_TEST_OPS "test_dist_fleet_minimize") + endif() + if(NOT WITH_HETERPS) list(REMOVE_ITEM DIST_TEST_OPS "test_communicator_ps_gpu") list(REMOVE_ITEM DIST_TEST_OPS "test_dist_fleet_ps11") diff --git a/python/paddle/fluid/tests/unittests/collective/fleet/test_fleet_rolemaker_new.py b/python/paddle/fluid/tests/unittests/collective/fleet/test_fleet_rolemaker_new.py index 96e84251011fe..d37ee5d5af13e 100644 --- a/python/paddle/fluid/tests/unittests/collective/fleet/test_fleet_rolemaker_new.py +++ b/python/paddle/fluid/tests/unittests/collective/fleet/test_fleet_rolemaker_new.py @@ -14,12 +14,8 @@ """Test cloud role maker.""" import os -import platform -import shutil -import tempfile import unittest -import paddle import paddle.distributed.fleet.base.role_maker as role_maker @@ -165,6 +161,7 @@ def test_tr_rolemaker(self): self.assertEqual(ro._role_id(), 0) +""" class TestGlooWithCloudRoleMaker(unittest.TestCase): def setUp(self): os.environ["PADDLE_TRAINERS_NUM"] = "1" @@ -478,6 +475,7 @@ def net(): self.assertEqual(1, all_reduce) self.clean(tmp) +""" if __name__ == "__main__": diff --git a/python/paddle/fluid/tests/unittests/dist_fleet_ctr.py b/python/paddle/fluid/tests/unittests/dist_fleet_ctr.py index 59f8f67aca833..8129d6104d0c9 100644 --- a/python/paddle/fluid/tests/unittests/dist_fleet_ctr.py +++ b/python/paddle/fluid/tests/unittests/dist_fleet_ctr.py @@ -394,6 +394,10 @@ def do_dataset_training(self, fleet): fleet.save_persistables(exe, patch_dirname, None, 5) fleet.check_save_pre_patch_done() + # add for gpugrahp + fleet.save_cache_table(0, 0) + fleet.shrink() + if __name__ == "__main__": runtime_main(TestDistCTR2x2) diff --git a/python/paddle/fluid/tests/unittests/test_dataset.py b/python/paddle/fluid/tests/unittests/test_dataset.py index ab126c4378114..f3c1300aac412 100644 --- a/python/paddle/fluid/tests/unittests/test_dataset.py +++ b/python/paddle/fluid/tests/unittests/test_dataset.py @@ -293,6 +293,27 @@ def test_in_memory_dataset_run(self): temp_dir.cleanup() + def test_in_memory_dataset_gpugraph_mode(self): + """ + Testcase for InMemoryDataset in gpugraph mode. 
+ """ + dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset") + dataset.set_feed_type("SlotRecordInMemoryDataFeed") + graph_config = { + "walk_len": 24, + "walk_degree": 10, + "once_sample_startid_len": 80000, + "sample_times_one_chunk": 5, + "window": 3, + "debug_mode": 0, + "batch_size": 800, + "meta_path": "cuid2clk-clk2cuid;cuid2conv-conv2cuid;clk2cuid-cuid2clk;clk2cuid-cuid2conv", + "gpu_graph_training": 1, + } + dataset.set_graph_config(graph_config) + dataset.set_pass_id(0) + dataset.get_pass_id() + def test_in_memory_dataset_masterpatch(self): """ Testcase for InMemoryDataset from create to run. @@ -824,6 +845,77 @@ def test_run_with_inmemory_dataset_train_debug_mode(self): temp_dir.cleanup() + def test_cuda_in_memory_dataset_run(self): + """ + Testcase for cuda inmemory dataset hogwild_worker train to run(barrier). + """ + temp_dir = tempfile.TemporaryDirectory() + filename1 = os.path.join( + temp_dir.name, "test_in_memory_dataset_run_a.txt" + ) + filename2 = os.path.join( + temp_dir.name, "test_in_memory_dataset_run_b.txt" + ) + + with open(filename1, "w") as f: + data = "1 1 2 3 3 4 5 5 5 5 1 1\n" + data += "1 2 2 3 4 4 6 6 6 6 1 2\n" + data += "1 3 2 3 5 4 7 7 7 7 1 3\n" + f.write(data) + with open(filename2, "w") as f: + data = "1 4 2 3 3 4 5 5 5 5 1 4\n" + data += "1 5 2 3 4 4 6 6 6 6 1 5\n" + data += "1 6 2 3 5 4 7 7 7 7 1 6\n" + data += "1 7 2 3 6 4 8 8 8 8 1 7\n" + f.write(data) + + slots = ["slot1", "slot2", "slot3", "slot4"] + slots_vars = [] + for slot in slots: + var = fluid.layers.data( + name=slot, shape=[1], dtype="int64", lod_level=1 + ) + slots_vars.append(var) + + dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset") + dataset.set_feed_type("SlotRecordInMemoryDataFeed") + dataset.set_batch_size(1) + dataset.set_pipe_command("cat") + dataset.set_use_var(slots_vars) + dataset.set_filelist([filename1, filename2]) + + graph_config = { + "walk_len": 24, + "walk_degree": 10, + "once_sample_startid_len": 80000, + "sample_times_one_chunk": 5, + "window": 3, + "debug_mode": 0, + "batch_size": 800, + "meta_path": "cuid2clk-clk2cuid;cuid2conv-conv2cuid;clk2cuid-cuid2clk;clk2cuid-cuid2conv", + "gpu_graph_training": 1, + } + dataset.set_graph_config(graph_config) + dataset.set_pass_id(2) + pass_id = dataset.get_pass_id() + + dataset.load_into_memory() + + dataset.get_memory_data_size() + + exe = fluid.Executor( + fluid.CPUPlace() + if not core.is_compiled_with_cuda() + else fluid.CUDAPlace(0) + ) + exe.run(fluid.default_startup_program()) + for i in range(self.epoch_num): + try: + exe.train_from_dataset(fluid.default_main_program(), dataset) + except Exception as e: + self.assertTrue(False) + temp_dir.cleanup() + class TestDatasetWithDataLoader(TestDataset): """ diff --git a/python/paddle/fluid/tests/unittests/test_dist_fleet_minimize.py b/python/paddle/fluid/tests/unittests/test_dist_fleet_minimize.py new file mode 100644 index 0000000000000..9fe6e27c27fc4 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_dist_fleet_minimize.py @@ -0,0 +1,249 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import unittest + +import paddle +import paddle.distributed.fleet as fleet +import paddle.distributed.fleet.base.role_maker as role_maker +import paddle.fluid as fluid + +paddle.enable_static() + +# For Net +base_lr = 0.2 +emb_lr = base_lr * 3 +dict_dim = 1500 +emb_dim = 128 +hid_dim = 128 +margin = 0.1 +sample_rate = 1 +batch_size = 4 + + +class TestPSMinimize(unittest.TestCase): + def net(self): + def get_acc(cos_q_nt, cos_q_pt, batch_size): + cond = paddle.less_than(cos_q_nt, cos_q_pt) + cond = fluid.layers.cast(cond, dtype='float64') + cond_3 = paddle.sum(cond) + acc = paddle.divide( + cond_3, + fluid.layers.fill_constant( + shape=[1], value=batch_size * 1.0, dtype='float64' + ), + name="simnet_acc", + ) + return acc + + def get_loss(cos_q_pt, cos_q_nt): + loss_op1 = paddle.subtract( + fluid.layers.fill_constant_batch_size_like( + input=cos_q_pt, shape=[-1, 1], value=margin, dtype='float32' + ), + cos_q_pt, + ) + loss_op2 = paddle.add(loss_op1, cos_q_nt) + loss_op3 = paddle.maximum( + fluid.layers.fill_constant_batch_size_like( + input=loss_op2, shape=[-1, 1], value=0.0, dtype='float32' + ), + loss_op2, + ) + avg_cost = paddle.mean(loss_op3) + return avg_cost + + is_distributed = False + is_sparse = True + + # query + q = fluid.layers.data(name="1", shape=[1], dtype="int64", lod_level=1) + # embedding + q_emb = fluid.contrib.layers.sparse_embedding( + input=q, + size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__emb__", + learning_rate=emb_lr, + ), + ) + q_emb = paddle.reshape(q_emb, [-1, emb_dim]) + # vsum + q_sum = fluid.layers.sequence_pool(input=q_emb, pool_type='sum') + q_ss = paddle.nn.functional.softsign(q_sum) + # fc layer after conv + q_fc = fluid.layers.fc( + input=q_ss, + size=hid_dim, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__q_fc__", + learning_rate=base_lr, + ), + ) + # label data + label = fluid.layers.data(name="label", shape=[1], dtype="int64") + # pt + pt = fluid.layers.data(name="2", shape=[1], dtype="int64", lod_level=1) + # embedding + pt_emb = fluid.contrib.layers.sparse_embedding( + input=pt, + size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__emb__", + learning_rate=emb_lr, + ), + ) + pt_emb = paddle.reshape(pt_emb, [-1, emb_dim]) + # vsum + pt_sum = fluid.layers.sequence_pool(input=pt_emb, pool_type='sum') + pt_ss = paddle.nn.functional.softsign(pt_sum) + # fc layer + pt_fc = fluid.layers.fc( + input=pt_ss, + size=hid_dim, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__fc__", + learning_rate=base_lr, + ), + bias_attr=fluid.ParamAttr(name="__fc_b__"), + ) + # nt + nt = fluid.layers.data(name="3", shape=[1], dtype="int64", lod_level=1) + # embedding + nt_emb = fluid.contrib.layers.sparse_embedding( + input=nt, + size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__emb__", + learning_rate=emb_lr, + ), + ) + nt_emb = paddle.reshape(nt_emb, [-1, 
emb_dim]) + # vsum + nt_sum = fluid.layers.sequence_pool(input=nt_emb, pool_type='sum') + nt_ss = paddle.nn.functional.softsign(nt_sum) + # fc layer + nt_fc = fluid.layers.fc( + input=nt_ss, + size=hid_dim, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__fc__", + learning_rate=base_lr, + ), + bias_attr=fluid.ParamAttr(name="__fc_b__"), + ) + cos_q_pt = paddle.nn.functional.cosine_similarity(q_fc, pt_fc) + cos_q_nt = paddle.nn.functional.cosine_similarity(q_fc, nt_fc) + # loss + avg_cost = get_loss(cos_q_pt, cos_q_nt) + # acc + acc = get_acc(cos_q_nt, cos_q_pt, batch_size) + return [avg_cost, acc, cos_q_pt] + + def gen_sparse_config(self): + """ + gen sparse config + """ + sparse_config = dict() + # sparse_config['sparse_table_class'] = "DownpourSparseSSDTable" + sparse_config['sparse_table_class'] = "DownpourSparseTable" + sparse_config['sparse_compress_in_save'] = True + sparse_config['sparse_shard_num'] = 67 + # sparse_config['sparse_accessor_class'] = "DownpourCtrAccessor" + sparse_config[ + 'sparse_accessor_class' + ] = "DownpourCtrDymfAccessor" # for variable embedding + sparse_config['sparse_learning_rate'] = 0.05 # sparse_lr + sparse_config['sparse_initial_g2sum'] = 3 + sparse_config['sparse_initial_range'] = 0.02 # init_range + sparse_config['sparse_weight_bounds'] = [-10.0, 10.0] + sparse_config['sparse_embedx_dim'] = 8 # emb_size + sparse_config['sparse_embedx_threshold'] = 10 + sparse_config['sparse_nonclk_coeff'] = 0.1 + sparse_config['sparse_click_coeff'] = 1.0 + sparse_config['sparse_base_threshold'] = 0 + sparse_config['sparse_delta_threshold'] = 0.25 + sparse_config['sparse_delta_keep_days'] = 16.0 + sparse_config['sparse_show_click_decay_rate'] = 0.98 + sparse_config['sparse_delete_threshold'] = 0.8 + sparse_config['sparse_delete_after_unseen_days'] = 30 + + sparse_config['embed_sparse_optimizer'] = "adagrad" # op_type + sparse_config['embed_sparse_learning_rate'] = 0.05 # sparse_lr + sparse_config['embed_sparse_initial_range'] = 0 + sparse_config[ + 'embed_sparse_beta1_decay_rate' + ] = 0.9 # args.beta1_decay_rate + sparse_config[ + 'embed_sparse_beta2_decay_rate' + ] = 0.999 # args.beta2_decay_rate + sparse_config['embed_sparse_weight_bounds'] = [-10.0, 10.0] + + sparse_config['embedx_sparse_optimizer'] = "adagrad" # op_type + sparse_config['embedx_sparse_learning_rate'] = 0.05 # sparse_lr + sparse_config['embedx_sparse_initial_range'] = 0.02 # init_range + sparse_config[ + 'embedx_sparse_beta1_decay_rate' + ] = 0.9 # args.beta1_decay_rate + sparse_config[ + 'embedx_sparse_beta2_decay_rate' + ] = 0.999 # args.beta2_decay_rate + sparse_config['embedx_sparse_weight_bounds'] = [-10.0, 10.0] + # sparse_config['nodeid_slot'] = nodeid_slot + # sparse_config['feature_learning_rate'] = feature_lr + return sparse_config + + def test(self): + os.environ["PADDLE_PSERVER_NUMS"] = "2" + os.environ["PADDLE_TRAINERS_NUM"] = "2" + os.environ["POD_IP"] = "127.0.0.1" + os.environ["PADDLE_PORT"] = "36001" + os.environ["PADDLE_TRAINER_ID"] = "0" + os.environ["PADDLE_TRAINERS_NUM"] = "2" + os.environ[ + "PADDLE_TRAINER_ENDPOINTS" + ] = "127.0.0.1:36001,127.0.0.2:36001" + os.environ[ + "PADDLE_PSERVERS_IP_PORT_LIST" + ] = "127.0.0.1:36002,127.0.0.2:36002" + os.environ["TRAINING_ROLE"] = "TRAINER" + os.environ["FLAGS_selected_gpus"] = "0" + + role = role_maker.PaddleCloudRoleMaker() + fleet.init(is_collective=True) + loss, acc, _ = self.net() + + strategy = paddle.distributed.fleet.DistributedStrategy() + configs = {"use_ps_gpu": 0, 
"launch_barrier": False} + strategy.a_sync_configs = configs + strategy.a_sync = True + + sparse_config = dict() + sparse_config['embedding'] = self.gen_sparse_config() + strategy.fleet_desc_configs = sparse_config + + optimizer = paddle.fluid.optimizer.Adam(learning_rate=0.01) + optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy) + optimizer.minimize(loss) + + +if __name__ == '__main__': + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_dist_fleet_spmt.py b/python/paddle/fluid/tests/unittests/test_dist_fleet_spmt.py new file mode 100644 index 0000000000000..446c70ae87d11 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_dist_fleet_spmt.py @@ -0,0 +1,251 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import unittest + +import paddle +import paddle.fluid as fluid + +paddle.enable_static() + +# For Net +base_lr = 0.2 +emb_lr = base_lr * 3 +dict_dim = 1500 +emb_dim = 128 +hid_dim = 128 +margin = 0.1 +sample_rate = 1 +batch_size = 4 + + +class TestSPMT(unittest.TestCase): + def net(self): + def get_acc(cos_q_nt, cos_q_pt, batch_size): + cond = paddle.less_than(cos_q_nt, cos_q_pt) + cond = fluid.layers.cast(cond, dtype='float64') + cond_3 = paddle.sum(cond) + acc = paddle.divide( + cond_3, + fluid.layers.fill_constant( + shape=[1], value=batch_size * 1.0, dtype='float64' + ), + name="simnet_acc", + ) + return acc + + def get_loss(cos_q_pt, cos_q_nt): + loss_op1 = paddle.subtract( + fluid.layers.fill_constant_batch_size_like( + input=cos_q_pt, shape=[-1, 1], value=margin, dtype='float32' + ), + cos_q_pt, + ) + loss_op2 = paddle.add(loss_op1, cos_q_nt) + loss_op3 = paddle.maximum( + fluid.layers.fill_constant_batch_size_like( + input=loss_op2, shape=[-1, 1], value=0.0, dtype='float32' + ), + loss_op2, + ) + avg_cost = paddle.mean(loss_op3) + return avg_cost + + is_distributed = False + is_sparse = True + + # query + q = fluid.layers.data(name="1", shape=[1], dtype="int64", lod_level=1) + # embedding + q_emb = fluid.contrib.layers.sparse_embedding( + input=q, + size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__emb__", + learning_rate=emb_lr, + ), + ) + q_emb = paddle.reshape(q_emb, [-1, emb_dim]) + # vsum + q_sum = fluid.layers.sequence_pool(input=q_emb, pool_type='sum') + q_ss = paddle.nn.functional.softsign(q_sum) + # fc layer after conv + q_fc = fluid.layers.fc( + input=q_ss, + size=hid_dim, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__q_fc__", + learning_rate=base_lr, + ), + ) + # label data + label = fluid.layers.data(name="label", shape=[1], dtype="int64") + # pt + pt = fluid.layers.data(name="2", shape=[1], dtype="int64", lod_level=1) + # embedding + pt_emb = fluid.contrib.layers.sparse_embedding( + input=pt, + size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__emb__", + 
learning_rate=emb_lr, + ), + ) + pt_emb = paddle.reshape(pt_emb, [-1, emb_dim]) + # vsum + pt_sum = fluid.layers.sequence_pool(input=pt_emb, pool_type='sum') + pt_ss = paddle.nn.functional.softsign(pt_sum) + # fc layer + pt_fc = fluid.layers.fc( + input=pt_ss, + size=hid_dim, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__fc__", + learning_rate=base_lr, + ), + bias_attr=fluid.ParamAttr(name="__fc_b__"), + ) + # nt + nt = fluid.layers.data(name="3", shape=[1], dtype="int64", lod_level=1) + # embedding + nt_emb = fluid.contrib.layers.sparse_embedding( + input=nt, + size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__emb__", + learning_rate=emb_lr, + ), + ) + nt_emb = paddle.reshape(nt_emb, [-1, emb_dim]) + # vsum + nt_sum = fluid.layers.sequence_pool(input=nt_emb, pool_type='sum') + nt_ss = paddle.nn.functional.softsign(nt_sum) + # fc layer + nt_fc = fluid.layers.fc( + input=nt_ss, + size=hid_dim, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(value=0.01), + name="__fc__", + learning_rate=base_lr, + ), + bias_attr=fluid.ParamAttr(name="__fc_b__"), + ) + cos_q_pt = paddle.nn.functional.cosine_similarity(q_fc, pt_fc) + cos_q_nt = paddle.nn.functional.cosine_similarity(q_fc, nt_fc) + # loss + avg_cost = get_loss(cos_q_pt, cos_q_nt) + # acc + acc = get_acc(cos_q_nt, cos_q_pt, batch_size) + return [avg_cost, acc, cos_q_pt] + + # def test(self): + # os.environ["PADDLE_PSERVER_NUMS"] = "2" + # os.environ["PADDLE_TRAINERS_NUM"] = "2" + # os.environ["POD_IP"] = "127.0.0.1" + # os.environ["PADDLE_PORT"] = "36001" + # os.environ["PADDLE_TRAINER_ID"] = "0" + # os.environ["PADDLE_TRAINERS_NUM"] = "2" + # os.environ[ + # "PADDLE_TRAINER_ENDPOINTS" + # ] = "127.0.0.1:36001,127.0.0.2:36001" + # os.environ[ + # "PADDLE_PSERVERS_IP_PORT_LIST" + # ] = "127.0.0.1:36002,127.0.0.2:36002" + # os.environ["TRAINING_ROLE"] = "TRAINER" + # os.environ["FLAGS_selected_gpus"] = "0" + # role = role_maker.PaddleCloudRoleMaker() + # fleet.init(role) + # loss, acc, _ = self.net() + # + # strategy = paddle.distributed.fleet.DistributedStrategy() + # configs = {"use_ps_gpu": 1, "launch_barrier": False} + # strategy.a_sync_configs = configs + # strategy.a_sync = True + # optimizer = paddle.fluid.optimizer.Adam(learning_rate=0.01) + # optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy) + # optimizer.minimize(loss) + + def get_dist_env(self): + trainer_id = int(os.getenv('PADDLE_TRAINER_ID', '0')) + trainer_endpoints = '' + current_endpoint = '' + num_trainers = 0 + if os.getenv('PADDLE_TRAINER_ENDPOINTS'): + trainer_endpoints = os.getenv('PADDLE_TRAINER_ENDPOINTS') + current_endpoint = trainer_endpoints.split(',')[trainer_id] + num_trainers = len(trainer_endpoints.split(',')) + + return { + 'trainer_id': trainer_id, + 'num_trainers': num_trainers, + 'current_endpoint': current_endpoint, + 'trainer_endpoints': trainer_endpoints, + } + + def test_SingleProcessMultiThread(self): + """ + Testcase for SingleProcessMultiThread + """ + os.environ["PADDLE_PSERVER_NUMS"] = "2" + os.environ["PADDLE_TRAINERS_NUM"] = "2" + os.environ["POD_IP"] = "127.0.0.1" + os.environ["PADDLE_PORT"] = "36001" + os.environ["PADDLE_TRAINER_ID"] = "0" + os.environ["PADDLE_TRAINERS_NUM"] = "2" + os.environ[ + "PADDLE_TRAINER_ENDPOINTS" + ] = "127.0.0.1:36001,127.0.0.2:36001" + os.environ[ + "PADDLE_PSERVERS_IP_PORT_LIST" + ] = "127.0.0.1:36002,127.0.0.2:36002" + os.environ["TRAINING_ROLE"] = "TRAINER" + 
os.environ["FLAGS_selected_gpus"] = "0" + os.environ["PADDLE_FUSE_ALLREDUCE"] = "1" + os.environ["PADDLE_LOSS_SCALE"] = "1" + + startup_program = fluid.Program() + main_program = fluid.Program() + with fluid.program_guard(main_program, startup_program): + with fluid.unique_name.guard(): + loss, acc, _ = self.net() + optimizer = paddle.fluid.optimizer.Adam(learning_rate=0.01) + optimizer.minimize(loss) + print("===main_program====") + print(main_program) + print("===main_program====") + from paddle.fluid.transpiler.collective import SingleProcessMultiThread + + t = SingleProcessMultiThread() + env = self.get_dist_env() + t.transpile( + startup_program=startup_program, + main_program=main_program, + rank=env["trainer_id"], + endpoints=env["trainer_endpoints"], + current_endpoint=env['current_endpoint'], + wait_port=False, + ) + param_cnt = t._get_update_param_count() + print("param_cnt:", param_cnt) + + +if __name__ == '__main__': + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_downpoursgd.py b/python/paddle/fluid/tests/unittests/test_downpoursgd.py index 2e15d059db5f3..ce93813dd438d 100644 --- a/python/paddle/fluid/tests/unittests/test_downpoursgd.py +++ b/python/paddle/fluid/tests/unittests/test_downpoursgd.py @@ -218,6 +218,7 @@ def test_downpour_opt_work(self): opt_info["scale_datanorm"] = -1 opt_info["dump_slot"] = False opt_info["stat_var_names"] = [] + opt_info["user_define_dump_filename"] = "./dump_filename/dump.txt" worker = DownpourWorker(None) worker.get_desc().CopyFrom(ps_param.trainer_param[0]) opt_info["program_id_to_worker"] = {program_id: worker} diff --git a/python/paddle/fluid/trainer_factory.py b/python/paddle/fluid/trainer_factory.py index 281fbd8693c2d..8fa675ffcc7ad 100644 --- a/python/paddle/fluid/trainer_factory.py +++ b/python/paddle/fluid/trainer_factory.py @@ -91,6 +91,13 @@ def _create_trainer(self, opt_info=None): and len(opt_info.get("dump_fields_path")) != 0 ): trainer._set_dump_fields_path(opt_info["dump_fields_path"]) + if ( + opt_info.get("user_define_dump_filename") is not None + and len(opt_info.get("user_define_dump_filename")) != 0 + ): + trainer._set_user_define_dump_filename( + opt_info["user_define_dump_filename"] + ) if opt_info.get("dump_file_num") is not None: trainer._set_dump_file_num(opt_info["dump_file_num"]) if opt_info.get("dump_converter") is not None: diff --git a/python/paddle/fluid/transpiler/collective.py b/python/paddle/fluid/transpiler/collective.py index c22b1746966f1..870efa0968d72 100644 --- a/python/paddle/fluid/transpiler/collective.py +++ b/python/paddle/fluid/transpiler/collective.py @@ -473,15 +473,190 @@ def _transpile_main_program(self): class SingleProcessMultiThread(GradAllReduce): - ''' ''' + """ + single process multi thread mode + """ def __init__(self): GradAllReduce.__init__(self, 1) self.mode = "single_process_multi_thread" + self.fuse_allreduce = int(os.getenv("PADDLE_FUSE_ALLREDUCE", "1")) + self.loss_scale = int(os.getenv("PADDLE_LOSS_SCALE", "1")) + self.gpu_nums = len( + os.getenv("FLAGS_selected_gpus", "0,1,2,3,4,5,6,7").split(",") + ) def _transpile_startup_program(self): - block = self.startup_program.global_block() - block.append_op(type='c_comm_init_all', attrs={'ring_id': 0}) + nodes_num = 0 + if len(self.endpoints) > 1: + nodes_num = len(set([x.split(':')[0] for x in self.endpoints])) + # diffent ip num is multi node + if nodes_num > 1: + self.nranks = nodes_num + print("begin to _transpile_startup_program for multi-node") + print("current_endpoint: ", self.current_endpoint) + 
print("total endpoints: ", self.endpoints) + print("rank: %d, ring_id: %d" % (self.rank, self.nrings)) + for ring_id in range(self.nrings): + self._init_communicator( + self.startup_program, + self.current_endpoint, + self.endpoints, + self.rank, + ring_id, + self.wait_port, + True, + ) + else: + self.nranks = 1 + print("begin to _transpile_startup_program for single-node") + block = self.startup_program.global_block() + block.append_op(type='c_comm_init_all', attrs={'ring_id': 0}) + + def _transpile_main_program(self): + # not need loss scale and no dense param + param_cnt = self._get_update_param_count() + if self.loss_scale is 0 and param_cnt is 0: + return + # scale loss + self._insert_scale_loss_grad_ops() + # no param + if param_cnt is 0: + return + # fuse allreduce + if self.fuse_allreduce > 0: + print("begin used fuse_allreduce param count = %s" % (param_cnt)) + # use fuse allreduce + self._insert_fuse_allreduce_ops() + else: + self._insert_allreduce_ops() + + def _get_update_param_count(self): + """ + get need update param count + """ + param_count = 0 + block = self.main_program.global_block() + for idx, op in reversed(list(enumerate(block.ops))): + if not self._is_backward_op(op): + continue + if not self.op_role_var_key in op.attr_names: + continue + op_role_var = op.all_attrs()[self.op_role_var_key] + if len(op_role_var) == 0: + continue + + assert len(op_role_var) % 2 == 0 + for i in range(0, len(op_role_var), 2): + param = block.vars[op_role_var[i]] + if param.is_distributed: + continue + param_count = param_count + 1 + + return param_count + + def _insert_scale_loss_grad_ops(self): + ''' + In order to keep the learning rate consistent in different numbers of + training workers, we scale the loss grad by the number of workers + ''' + scale = 1.0 / self.nranks / self.gpu_nums + print("begin _insert_scale_loss_grad_ops scale = %s" % (scale)) + block = self.main_program.global_block() + for idx, op in reversed(list(enumerate(block.ops))): + if not self._is_loss_grad_op(op): + continue + loss_grad_var = block.vars[op.output_arg_names[0]] + block._insert_op( + idx + 1, + type='scale', + inputs={'X': loss_grad_var}, + outputs={'Out': loss_grad_var}, + attrs={'scale': scale, self.op_role_key: OpRole.Backward}, + ) + + def _insert_fuse_allreduce_ops(self): + """ + insert coalesce_tensor and all reduce ops + """ + block = self.main_program.global_block() + ring_id = -1 + grad = None + input_grads = [] + global_offset = 0 # find insert offset of fuse tensor, after the max dense grad offset + for idx, op in reversed(list(enumerate(block.ops))): + if ( + self._is_backward_op(op) + and self.op_role_var_key in op.attr_names + ): + op_role_var = op.all_attrs()[self.op_role_var_key] + if len(op_role_var) == 0: + continue + assert len(op_role_var) % 2 == 0 + offset = idx + for i in range(0, len(op_role_var), 2): + param = block.vars[op_role_var[i]] + grad = block.vars[op_role_var[i + 1]] + if param.is_distributed: + continue + if offset == idx: + input_grads.append(grad) + global_offset = max(global_offset, offset + 1) + if grad is None: + return + + # init output_grads + output_grads = input_grads + # init fused_output with temp shape, it will calculate real shape depend on inputs + fused_output = block.create_var( + name="fused_output", + shape=[1], + persistable=False, + dtype=core.VarDesc.VarType.FP32, + stop_gradient=True, + ) + # fuse all grad tensors + coalesce_tensor_attrs = { + "copy_data": True, + "set_constant": False, + "dtype": core.VarDesc.VarType.FP32, + } + block._insert_op( 
+ global_offset, + type='coalesce_tensor', + inputs={'Input': input_grads}, + outputs={'Output': output_grads, 'FusedOutput': fused_output}, + attrs=coalesce_tensor_attrs, + ) + global_offset += 1 + # grads aggregation of multi-gpus + block._insert_op( + global_offset, + type='c_sync_calc_stream', + inputs={'X': fused_output}, + outputs={'Out': fused_output}, + attrs={self.op_role_key: OpRole.Backward}, + ) + global_offset += 1 + ring_id = (ring_id + 1) % self.nrings + block._insert_op( + global_offset, + type='c_allreduce_sum', + inputs={'X': fused_output}, + outputs={'Out': fused_output}, + attrs={'ring_id': ring_id, self.op_role_key: OpRole.Backward}, + ) + global_offset += 1 + + # sync before adam + block._insert_op( + global_offset, + type='c_sync_comm_stream', + inputs={'X': fused_output}, + outputs={'Out': fused_output}, + attrs={'ring_id': ring_id, self.op_role_key: OpRole.Backward}, + ) + global_offset += 1 class MultiThread(GradAllReduce):