Simulator end run for all clients #2514

Merged 13 commits on Apr 19, 2024

Conversation

@yhwen (Collaborator) commented Apr 17, 2024

Fixes #

Addresses the performance issue when the simulator runs a large number of clients.

Description

By default, the simulator runs END_RUN event handling only for the clients that are actively running when the workflow finishes the run. Running it for swapped-out clients requires swapping each of them back in and re-creating it, which is slow when there are many clients. For most applications, END_RUN event handling is not needed for clients whose processes are not active. This PR therefore adds an "end_run_for_all" option to the simulator. The default is false; END_RUN event handling runs for all clients only when the option is explicitly set.
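
As a rough illustration of the option described above, the flag could be wired into an argument parser along these lines. This is a minimal sketch, not the simulator's actual parser: the `build_parser` helper, the description, and the help text are assumptions made for illustration. Only the flag name and its default of false come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the flag name and its default come from the PR."""
    parser = argparse.ArgumentParser(description="simulator option sketch")
    # store_true is available on Python 3.8 and leaves the flag False unless given.
    parser.add_argument(
        "--end_run_for_all",
        action="store_true",
        default=False,
        help="run END_RUN event handling for all clients, not only the active ones",
    )
    return parser


# Without the flag the option stays False; passing it turns the mode on.
default_args = build_parser().parse_args([])
explicit_args = build_parser().parse_args(["--end_run_for_all"])
```

Using a plain `store_true` action keeps the opt-in explicit: users who never pass the flag keep the faster default behavior.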

This PR also changes the END_RUN event handling to run in multiple threads when needed.
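
The multi-threaded END_RUN handling can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in this PR: `FakeClient`, `fire_end_run`, `end_run_clients`, and `max_workers` are hypothetical names, and the real simulator would swap in and re-create inactive clients before firing the event.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Iterable, List


class FakeClient:
    """Hypothetical stand-in for a simulated client."""

    def __init__(self, name: str, active: bool) -> None:
        self.name = name
        self.active = active

    def fire_end_run(self) -> str:
        # A real inactive client would be swapped in and re-created first.
        return self.name


def end_run_clients(
    clients: Iterable[FakeClient], end_run_for_all: bool, max_workers: int = 4
) -> List[str]:
    """Fire END_RUN on active clients, or on every client when requested."""
    targets = [c for c in clients if end_run_for_all or c.active]
    finished: List[str] = []
    lock = Lock()  # guards the shared list of finished clients

    def _run(client: FakeClient) -> None:
        name = client.fire_end_run()
        with lock:
            finished.append(name)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so any worker exception is raised here.
        list(pool.map(_run, targets))
    return sorted(finished)
```

With three clients of which one is swapped out, `end_run_clients(clients, False)` would only touch the two active ones, while `end_run_clients(clients, True)` would cover all three.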

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@yhwen changed the title from "Simulator end run all" to "Simulator end run for all clients" on Apr 18, 2024
chesterxgchen previously approved these changes Apr 18, 2024
@yanchengnv (Collaborator)

Do we really need these changes?
@YuanTingHsieh what would happen if we don't make them?

@YuanTingHsieh (Collaborator)

If we do not fire END_RUN events in the simulator, then based on the components that currently use this event, we will have the following issues when running in the simulator:

  1. Metrics streaming results could be missing. However, issue #2477 ("How to avoids lost tracking messages and lost events") reports that the current implementation also has this problem, so I guess the current END_RUN mechanism does not help here.
  2. Some components will not shut down gracefully. For example, the external file pipe will just time out and print "PEER_GONE". This may produce warning or error messages, but it does not affect the correctness of the job.

We could warn users that the simulator will not fire the END_RUN / ABOUT_TO_END_RUN / CHECK_FOR_END_RUN_READINESS events.
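
To make the graceful-shutdown point concrete, here is a small stand-alone sketch of a component that relies on END_RUN to close cleanly. It does not use the NVFlare API: `FilePipeLike`, `handle_event`, `final_status`, and the `END_RUN` constant are hypothetical stand-ins chosen for illustration.

```python
END_RUN = "_end_run"  # stand-in event name for this sketch only


class FilePipeLike:
    """Hypothetical component that needs END_RUN to shut down cleanly."""

    def __init__(self) -> None:
        self.closed = False

    def handle_event(self, event_type: str) -> None:
        if event_type == END_RUN:
            # Graceful path: the component closes itself when the run ends.
            self.closed = True

    def final_status(self) -> str:
        # Without END_RUN the peer only notices the end after a timeout.
        return "closed" if self.closed else "PEER_GONE"


pipe = FilePipeLike()
status_without_end_run = pipe.final_status()
pipe.handle_event(END_RUN)
status_with_end_run = pipe.final_status()
```

In this sketch the job result is the same either way; only the shutdown path, and therefore the log output, differs.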

@yhwen (Collaborator, Author) commented Apr 19, 2024

The "--end_run_for_all" option lets the user run END_RUN event handling for all clients. Users can choose this mode if their application needs it. By default it is False.

@yanchengnv (Collaborator)

Then this PR is not worth it.

@SYangster (Collaborator)

/build

@YuanTingHsieh (Collaborator) left a comment

LGTM

@yhwen (Collaborator, Author) commented Apr 19, 2024

/build

@yhwen yhwen merged commit b93fe3b into NVIDIA:main Apr 19, 2024
16 checks passed
holgerroth pushed a commit to zhijinl/NVFlare that referenced this pull request May 2, 2024
…son.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for the job status changing. (NVIDIA#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (NVIDIA#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (NVIDIA#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (NVIDIA#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (NVIDIA#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (NVIDIA#2521)

Upgrade dependencies (NVIDIA#2516)

Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)

Multiple bug fixes from 2.4 (NVIDIA#2518)

* [2.4] Support client custom code in simulator (NVIDIA#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (NVIDIA#2457)

* Fix sub_worker_process shutdown (NVIDIA#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)

Pythonic job creation (NVIDIA#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (NVIDIA#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (NVIDIA#2519)

* Starts heartbeat after task is pull and before task execution (NVIDIA#2415)

* Starts pipe handler heartbeat send/check after task is pull before task execution (NVIDIA#2442)

* [2.4] Improve cell pipe timeout handling (NVIDIA#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (NVIDIA#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (NVIDIA#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)

* Fix metric relay pipe handler timeout (NVIDIA#2496)

* Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (NVIDIA#2520)

* Update github actions (NVIDIA#2450)

* Fix premerge (NVIDIA#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (NVIDIA#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

use controller name for stats (NVIDIA#2522)

Simulator workspace re-design (NVIDIA#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR views.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (NVIDIA#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Secure XGBoost Integration (NVIDIA#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (NVIDIA#2535)

Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace the implementation of in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement for users to have a main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo
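
The switch to runpy.run_path() mentioned in the commit above can be illustrated with the standard library alone. This is a sketch of the general technique, not the TaskScriptRunner code: `run_script` and the throwaway `train.py` script are hypothetical.

```python
import os
import runpy
import tempfile


def run_script(path: str) -> dict:
    """Execute a script as __main__ without requiring a main() function."""
    # run_path returns the globals of the executed script as a dict.
    return runpy.run_path(path, run_name="__main__")


# Write a tiny throwaway script that has no main() function at all.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "train.py")
    with open(script, "w") as f:
        f.write("result = 6 * 7\n")
    script_globals = run_script(script)
```

Because the script runs top to bottom as `__main__`, user code does not need to expose any particular entry point.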

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (NVIDIA#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLflow, make tracking_uri default to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with Tensorboard and b) avoids job output overwriting the 1st job

* 1. Remove the default code to use configuration
2. fix some broken notebook

* rollback changes
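
The per-job default location described in the commit above amounts to a simple path computation. The following is a sketch only: `default_tracking_dir` and the example paths are hypothetical, and how the resulting path is handed to MLflow is left out.

```python
from pathlib import Path


def default_tracking_dir(workspace_dir: str, job_id: str) -> Path:
    """Per-job mlruns directory, so one job does not overwrite another."""
    return Path(workspace_dir) / job_id / "mlruns"


first = default_tracking_dir("/tmp/workspace", "job-1")
second = default_tracking_dir("/tmp/workspace", "job-2")
```

Keying the directory on the job id is what keeps the outputs of consecutive jobs apart.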

Fix decorator issue (NVIDIA#2542)

Remove line number in code link.

FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>
holgerroth added a commit that referenced this pull request May 2, 2024
* Implement federated logistic regression with second-order newton raphson.

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
nvidianz pushed a commit to nvidianz/NVFlare that referenced this pull request May 6, 2024
* Implement federated logistic regression with second-order newton raphson.

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
holgerroth pushed a commit to holgerroth/NVFlare that referenced this pull request May 6, 2024
generate the job app config.

fully functional pythonic job creation.

Added simulator_run for pythonic API.

reformat.

Added filters support for pythonic job creation.

handled the direct import case in fed_job.

refactor.

Added the resource_spec set function for FedJob.

refactored.

Moved the ClientApp and ServerApp into fed_app.py.

Refactored: removed the _FilterDef class.

refactored.

Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

Enable obj in the constructor as parameter.

Added support for the launcher script.

refactored.

reformat.

Update the comment.

re-arrange the package location.

Added add_ext_script() for BaseAppConfig.

codestyle fix.

Removed the client-api-pt example.

removed unused import.

fixed the in_time_accumulate_weighted_aggregator_test.py

Added Enum parameter support.

Added docstring.

Fix typo (NVIDIA#2432)

Enable StreamCell for all application channels (NVIDIA#2407)

Add back request header (NVIDIA#2440)

Check wandb login (NVIDIA#2445)

* check wandb login

* Use default wandb offline mode

* add mode online check

Add note about delay in workspace creation for larger jobs (NVIDIA#2454)

Client API Update: Job Templates, examples to reflect different type of Client API (NVIDIA#2456)

* 1. Update README
2. fix bugs on in-proc client API
3. update examples to use the in-proc client API where it makes sense

* 1. update documentation

* 1. update job template description
2. update in-process API to allow users to keep the existing configuration
3. update notebooks for step-by-step sag

* update README.md

* remove task_fn_args argument in the executor

* remove task_fn_args argument in the executor

add controller interface (NVIDIA#2451)

Update README.md (NVIDIA#2460)

fix typo

improve reliable msg (NVIDIA#2459)

CC block byoc jobs  (NVIDIA#2403)

* WIP: tdx_cc integration.

* fixed token_file read.

* WIP: added info for CC add client tokens.

* Fixed an error when client does not have CC token reported.

* Added handle for client does not have CC_INFO.

* Added CLIENT_QUIT event for CCManager to remove client token.

* Added _add_client_token client token logging info.

* Added peer_ctx for client quit.

* set_peer_context for client quit.

* Changed the AUTHORIZATION_REASON set_prop sticky to False.

* WIP: TokenPundit interface change.

* WIP: added cc_authorizer_ids config.

* Added cc_issuer_id for CCManager.

* renamed the TokenPundit to CCAuthorizer.

* Added CC token adding through client heartbeat.

* Added function to stop current running job if CC verify fail.

* if CC failed to get token, don't allow the system to start.

* Added exceptions None check.

* Address the client side CC check before job scheduled.

* fixed the PEER_FL_CONTEXT error.

* Added CCManager support to have multiple cc_issuers.

* optimized CCManager.

* updated the _verify_participants() logic.

* set up the proper fl_ctx for admin send_requests().

* Add proper fl_ctx.

* Refactor the CCManager.

* Refactor the CCManager and TDX_authorizer.

* Added TOKEN_EXPIRATION for each cc_issuer in CCManager.

* Fixed CC TOKEN_EXPIRATION error.

* refactor the CCManager _prepare_cc_info()

* Refactor.

* refactor the cc tokens periodic verification.

* added critical_level for CCManager.

* codestyle fix.

* removed unused import.

* removed unused import.

* Fixed the unit test.

* Added CCManager unit tests.

* Added CCTokenGenerateError and CCTokenVerifyError. Updated CCAuthorizer interface.

* WIP: CC block byoc job.

* block BYOC job for CC.

* Addressed some PR reviews.

* Added exception catch for TDXAuthorizer.

* codestyle fix.

* renamed some events.

* renamed event names.

* renamed event names.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Fixed the authz and site_security check for check_resource command. (NVIDIA#2462)

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

add garbage collect at ends of round-based workflows (NVIDIA#2463)

add WFController (NVIDIA#2468)

Add warning when the same admin in project.yml has different role

Add custom order and early termination to CyclicController (NVIDIA#2387)

* Add custom order and early termination to CyclicController and add tests

* Add more error handling

Add IPC agent and exchanger (NVIDIA#2435)

* support av ipc agent

* removed unused import

* address PR comments

fix typo (NVIDIA#2473)

Refactor WFController and ModelController (NVIDIA#2475)

* refactor wf and model controller

* clarify persistor_id

Add example for multiparty Kaplan-Meier analysis with HE (NVIDIA#2259)

* add example for multiparty Kaplan-Meier analysis with HE

* update requirements

* update baseline script, remove complex settings and keep basic only

* add readme with details

* add readme with details

* add curves, modify saving functions (curve and km details)

* job name update

* remove redundant print

* move data preparation part out of local code

* move HE context part out of FL process to better accommodate the transition to real application

* update to use new controller interface

* change to send_model_and_wait

* format

* updated readme

* fix merge conflict

* update readme

* update readme

* update readme

* update readme

* move to job template

---------

Co-authored-by: Sean Yang <[email protected]>

remove old task_fn_args (NVIDIA#2479)

Enable simulator to run HE (NVIDIA#2339)

* Enable simulator to run HE.

* fixed the unittest.

* Created startup folder for simulator run if not exist.

* Changed to use setup and teardown for pytest.

* extract common code into init_security_content_service().

* removed unused import.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

not creating Workspace object (NVIDIA#2489)

Fix xgboost integration tests (NVIDIA#2486)

* change to use path

* update finance and vertical xgboost

Added ability to handle parameters from base class.

Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

Added params_exchange_format for PTInProcessClientAPIExecutor.

codestyle fix.

Fixed a custom code folder structure issue.

work for sub-folder custom files.

backed to handle parameters from base classes.

Support folder structure job config.

Added support for flat folder from '.XXX' import.

codestyle fix.

refactored and add docstring.

Add FedBPT research example (NVIDIA#2465)

* Add FedBPT research example

initial fedbpt files

add roberta model and run FL

move send to end

upgrade to 2.4.1rc and run experiment with 10 clients

move init to top

debug using pickle

record successful setting

use custom decomposer

clean code

add summary writer

add result figure

formatting

fix broken links

remove debug messages

update readme with system resources

use decomposer widget on server

* address comments; enable selection of evaluation client

* use new FedAvg api

* exclude dir from license test

* only exclude file for license check

fix xgboost test setup (NVIDIA#2494)

add Client API documentation (NVIDIA#2497)

* add Client API documentation

* add Client API documentation

Added more logging for the job status changing. (NVIDIA#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (NVIDIA#2508)

* check workflow id before updating client status

* change order of checks

Address some of the PR reviews.

Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

run demo

run demo

set gpus and external scripts

move FedJob api

change folder structure

xval example

xval example

reuse code

add filter example

minor updates

update job dir

refactor Controller/ExecutorApps

hide ControllerApp/ExecutorApp

fix doubled deploy call

handle filters

handle cross-site val

add swarm example (wip)

Add user guide on how to deploy to EKS (NVIDIA#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (NVIDIA#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (NVIDIA#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (NVIDIA#2521)

Upgrade dependencies (NVIDIA#2516)

Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)

Multiple bug fixes from 2.4 (NVIDIA#2518)

* [2.4] Support client custom code in simulator (NVIDIA#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (NVIDIA#2457)

* Fix sub_worker_process shutdown (NVIDIA#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)

Pythonic job creation (NVIDIA#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (NVIDIA#2519)

* Starts heartbeat after task is pulled and before task execution (NVIDIA#2415)

* Starts pipe handler heartbeat send/check after task is pulled and before task execution (NVIDIA#2442)

* [2.4] Improve cell pipe timeout handling (NVIDIA#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (NVIDIA#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (NVIDIA#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)

* Fix metric relay pipe handler timeout (NVIDIA#2496)

* Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (NVIDIA#2520)

* Update github actions (NVIDIA#2450)

* Fix premerge (NVIDIA#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (NVIDIA#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

WIP: constructed the FedJob.

WIP: server_app json export.

generate the job app config.

fully functional pythonic job creation.

Added simulator_run for pythonic API.

reformat.

Added filters support for pythonic job creation.

handled the direct import case in fed_job.

refactor.

Added the resource_spec set function for FedJob.

refactored.

Moved the ClientApp and ServerApp into fed_app.py.

Refactored: removed the _FilterDef class.

refactored.

Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

Enable obj in the constructor as parameter.

Added support for the launcher script.

refactored.

reformat.

Update the comment.

re-arrange the package location.

Added add_ext_script() for BaseAppConfig.

codestyle fix.

Removed the client-api-pt example.

Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

run demo

set gpus and external scripts

move FedJob api

change folder structure

xval example

xval example

reuse code

add filter example

minor updates

update job dir

refactor Controller/ExecutorApps

hide ControllerApp/ExecutorApp

fix doubled deploy call

handle filters

handle cross-site val

add swarm example (wip)

make FedJob2 default FedJob

use ScriptExecutor

test swarm learning

add cyclic workflow

add todo

update swarm learning

make FedJob2 default again

use controller name for stats (NVIDIA#2522)

Simulator workspace re-design (NVIDIA#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR reviews.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (NVIDIA#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with Python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.
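
The behaviour these commits describe (END_RUN handled only for active clients unless an option asks for all of them, with the handling spread over threads) can be sketched roughly as follows. This is a minimal illustration, not the simulator's actual `_end_run_clients()`: the function name `end_run_clients`, the `fire_end_run` callback, and the parameters `end_run_for_all` and `num_threads` are all hypothetical names chosen for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def end_run_clients(all_clients, active_clients, fire_end_run,
                    end_run_for_all=False, num_threads=4):
    """Fire END_RUN handling for the selected clients using a thread pool.

    All names here are hypothetical; this only illustrates the idea.
    """
    # By default only the active clients are handled; the flag widens the set.
    targets = list(all_clients) if end_run_for_all else list(active_clients)

    finished = []
    lock = threading.Lock()  # protects the shared 'finished' list

    def worker(client_name):
        fire_end_run(client_name)
        with lock:
            finished.append(client_name)

    # The 'with' block waits for all submitted work before returning.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, targets))

    return sorted(finished)
```

With the flag left at its default, swapped-out clients are skipped, which is the cheaper path when many clients are simulated.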

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Secure XGBoost Integration (NVIDIA#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

use ScriptExecutor

add kmeans example

simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

test kmeans, use latest main

fix kmeans

some redesign

address comments

rename source dir

Add missing client api test jobs (NVIDIA#2535)

Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace implementation of the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user scripts define a main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo
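
As an illustration of the approach named in this change (running a user script with `runpy.run_path()` rather than requiring a `main()` function, and sending `print()` output to a logger), here is a minimal standalone sketch. It is not the TaskScriptRunner implementation; `run_script`, `ListHandler`, and `demo` are hypothetical helpers written only for this example.

```python
import builtins
import logging
import runpy
import tempfile
from pathlib import Path


def run_script(script_path, logger):
    """Run a script file as __main__ and route its print() calls to logger.info()."""
    original_print = builtins.print

    def print_to_logger(*args, **kwargs):
        logger.info(" ".join(str(a) for a in args))

    builtins.print = print_to_logger
    try:
        # run_path executes the file and returns its resulting globals dict.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        builtins.print = original_print  # always restore the real print


class ListHandler(logging.Handler):
    """Collects log messages in a list so the demo can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def demo():
    logger = logging.getLogger("task_script_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            script = Path(tmp) / "train.py"
            # The script has no main(); it is plain top-level code.
            script.write_text("result = 6 * 7\nprint('result is', result)\n")
            script_globals = run_script(str(script), logger)
    finally:
        logger.removeHandler(handler)
    return script_globals["result"], handler.messages
```

Calling `demo()` runs the temporary script and returns its `result` variable together with the messages captured by the logger.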

remove use of uuid4

handle ids of built-in components

expose aggregate_fn to users for overwriting (NVIDIA#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting the first job's

* 1. Remove the default code to use configuration
2. fix some broken notebooks

* rollback changes

Fix decorator issue (NVIDIA#2542)

FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

handle cases where the script with relative path in Script Runner (NVIDIA#2543)

* handle cases where the script with relative path

* handle cases where the script with relative path

* add more unit test cases and change the file search logics

* code format

* add more unit test cases and change the file search logics

Lr newton raphson (NVIDIA#2529)

* Implement federated logistic regression with second-order newton raphson.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for the job status changing. (NVIDIA#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (NVIDIA#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (NVIDIA#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (NVIDIA#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (NVIDIA#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (NVIDIA#2521)

Upgrade dependencies (NVIDIA#2516)

Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)

Multiple bug fixes from 2.4 (NVIDIA#2518)

* [2.4] Support client custom code in simulator (NVIDIA#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (NVIDIA#2457)

* Fix sub_worker_process shutdown (NVIDIA#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)

Pythonic job creation (NVIDIA#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (NVIDIA#2519)

* Starts heartbeat after task is pulled and before task execution (NVIDIA#2415)

* Starts pipe handler heartbeat send/check after task is pulled and before task execution (NVIDIA#2442)

* [2.4] Improve cell pipe timeout handling (NVIDIA#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (NVIDIA#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (NVIDIA#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)

* Fix metric relay pipe handler timeout (NVIDIA#2496)

* Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (NVIDIA#2520)

* Update github actions (NVIDIA#2450)

* Fix premerge (NVIDIA#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (NVIDIA#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

use controller name for stats (NVIDIA#2522)

Simulator workspace re-design (NVIDIA#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR reviews.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (NVIDIA#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with Python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Secure XGBoost Integration (NVIDIA#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (NVIDIA#2535)

Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace implementation of the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user scripts define a main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (NVIDIA#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting the first job's

* 1. Remove the default code to use configuration
2. fix some broken notebooks

* rollback changes

Fix decorator issue (NVIDIA#2542)

Remove line number in code link.

FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>

handle ids

minor updates

rename folder

use default ids

update kmeans

add lightning example

handle multiple GPUs

make model selection metric configurable

make model selection metric configurable

add docstrings

Add information about dig (bind9-dnsutils) in the document

Update monai readme to remove logging.conf (NVIDIA#2552)

MONAI mednist example (NVIDIA#2532)

* add monai notebook

* add training script

* update example

* update notebook

* use job template

* call init later

* switch back

* add gitignore

* update notebooks

* add readmes

* send received model to GPU

* use monai tb stats handler

* formatting

Improve AWS cloud launch script

restore files

reset file. Add docstring

formatting
MinghuiChen43 pushed a commit to MinghuiChen43/NVFlare that referenced this pull request May 10, 2024
* Implement federated logistic regression with second-order newton raphson.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for the job status changing. (NVIDIA#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (NVIDIA#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (NVIDIA#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (NVIDIA#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (NVIDIA#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (NVIDIA#2521)

Upgrade dependencies (NVIDIA#2516)

Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)

Multiple bug fixes from 2.4 (NVIDIA#2518)

* [2.4] Support client custom code in simulator (NVIDIA#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (NVIDIA#2457)

* Fix sub_worker_process shutdown (NVIDIA#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)

Pythonic job creation (NVIDIA#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (NVIDIA#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (NVIDIA#2519)

* Starts heartbeat after task is pulled and before task execution (NVIDIA#2415)

* Starts pipe handler heartbeat send/check after task is pulled and before task execution (NVIDIA#2442)

* [2.4] Improve cell pipe timeout handling (NVIDIA#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (NVIDIA#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (NVIDIA#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)

* Fix metric relay pipe handler timeout (NVIDIA#2496)

* Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (NVIDIA#2520)

* Update github actions (NVIDIA#2450)

* Fix premerge (NVIDIA#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (NVIDIA#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

use controller name for stats (NVIDIA#2522)

Simulator workspace re-design (NVIDIA#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR reviews.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (NVIDIA#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with Python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
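
The option and the threaded END_RUN handling described in this commit can be pictured roughly as follows. This is only a sketch under stated assumptions, not the simulator's actual code: the flag name `--end_run_for_all` is taken from the PR description, while `build_parser`, `end_run_clients`, and the `handler` callback are hypothetical names introduced for illustration.

```python
import argparse
import threading
from concurrent.futures import ThreadPoolExecutor


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing an opt-in flag that defaults to False."""
    parser = argparse.ArgumentParser(prog="simulator_sketch")
    # action="store_true" is available on Python 3.8;
    # argparse.BooleanOptionalAction only exists from Python 3.9 onward.
    parser.add_argument(
        "--end_run_for_all",
        action="store_true",
        help="run END_RUN handling for all clients, not only the active ones",
    )
    return parser


def end_run_clients(all_clients, active_clients, end_run_for_all, handler, max_workers=4):
    """Run `handler` for the selected clients in worker threads.

    Returns the sorted names of the clients whose handler finished.
    """
    targets = all_clients if end_run_for_all else active_clients
    finished = []
    lock = threading.Lock()  # guards the shared `finished` list

    def _run(name):
        handler(name)
        with lock:
            finished.append(name)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any handler exception.
        list(pool.map(_run, targets))
    return sorted(finished)
```

With the flag left unset, only the active clients are passed to the handler, which matches the default described in the PR.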

Secure XGBoost Integration (NVIDIA#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (NVIDIA#2535)

Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. Rename the ExeTaskFnWrapper class to TaskScriptRunner.
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user scripts define main().
3. Redirect print() to logger.info().

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo
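
As a rough illustration of the two ideas in this commit (running a user script with `runpy.run_path()` rather than importing it and calling `main()`, and sending `print()` output to a logger), here is a minimal sketch. It is not the TaskScriptRunner implementation; `LoggerWriter` and `run_script` are hypothetical names used only for this example.

```python
import contextlib
import io
import logging
import runpy


class LoggerWriter(io.TextIOBase):
    """File-like object that forwards each completed line to logger.info()."""

    def __init__(self, logger: logging.Logger):
        super().__init__()
        self._logger = logger
        self._buf = ""

    def write(self, text: str) -> int:
        # print() may call write() several times per line, so buffer until "\n".
        self._buf += text
        while "\n" in self._buf:
            line, self._buf = self._buf.split("\n", 1)
            if line:
                self._logger.info(line)
        return len(text)


def run_script(path: str, logger: logging.Logger) -> dict:
    """Execute the script at `path` as __main__ and return its globals.

    The script needs no main() function; its top-level code runs directly.
    """
    with contextlib.redirect_stdout(LoggerWriter(logger)):
        return runpy.run_path(path, run_name="__main__")
```

Because the script runs with `run_name="__main__"`, an existing `if __name__ == "__main__":` guard in a user script still fires.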

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (NVIDIA#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) Fix the mlruns and tb_events dirs after the workspace directory changes.
2) For MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) prevents later job output from overwriting the first job's output.

* 1. Remove the default code to use configuration
2. Fix some broken notebooks

* rollback changes
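
The per-job default described above amounts to a small path computation. The sketch below assumes a hypothetical helper, `resolve_tracking_uri`, that is not part of NVFlare or MLflow; it only illustrates why a per-job directory keeps one job's output from overwriting another's.

```python
from pathlib import Path
from typing import Optional


def resolve_tracking_uri(workspace_dir: str, job_id: str, tracking_uri: Optional[str] = None) -> str:
    """Return the explicit tracking URI if given, else a per-job mlruns path."""
    if tracking_uri:
        # An explicitly configured URI always wins.
        return tracking_uri
    # Assumed default layout: <workspace_dir>/<job_id>/mlruns
    return str(Path(workspace_dir) / job_id / "mlruns")
```

Two jobs sharing one workspace then resolve to different directories, whereas a single `<workspace_dir>/mlruns` default would be shared by both.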

Fix decorator issue (NVIDIA#2542)

Remove line number in code link.

FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
YuanTingHsieh added a commit that referenced this pull request May 10, 2024
…ormalization federated learning method (#2524)

* add research/fedbn

* delete redundant controller and correct figs requirements

* update plot_requirements

* rewrite fedbn

* update jobs

* remove workspace

* update README

* simplify job simulator_run to take only one workspace parameter. (#2528)

* Add missing client api test jobs (#2535)

* Fixed the simulator server workspace root dir (#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

* Improve InProcessClientAPIExecutor  (#2536)

* 1. Rename the ExeTaskFnWrapper class to TaskScriptRunner.
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user scripts define main().
3. Redirect print() to logger.info().

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

* FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537)

* 1) Fix the mlruns and tb_events dirs after the workspace directory changes.
2) For MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) prevents later job output from overwriting the first job's output.

* 1. Remove the default code to use configuration
2. Fix some broken notebooks

* rollback changes

* Fix decorator issue (#2542)

* update create and run job script

* FLModel summary (#2544)

* add FLModel Summary

* format

* remove jobs folder

* expose aggregate_fn to users for overwriting (#2539)

* handle cases where the script with relative path in Script Runner (#2543)

* handle cases where the script with relative path

* handle cases where the script with relative path

* add more unit test cases and change the file search logics

* code format

* add more unit test cases and change the file search logics

* Lr newton raphson (#2529)

* Implement federated logistic regression with second-order newton raphson.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for the job status changing. (#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (#2521)

Upgrade dependencies (#2516)

Use full path for PSI components (#2437) (#2517)

Multiple bug fixes from 2.4 (#2518)

* [2.4] Support client custom code in simulator (#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (#2457)

* Fix sub_worker_process shutdown (#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (#2474)

Pythonic job creation (#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (#2519)

* Starts heartbeat after task is pulled and before task execution (#2415)

* Starts pipe handler heartbeat send/check after task is pulled and before task execution (#2442)

* [2.4] Improve cell pipe timeout handling (#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (#2495)

* Fix metric relay pipe handler timeout (#2496)

* Rely on launcher check_run_status to pause/resume hb (#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (#2520)

* Update github actions (#2450)

* Fix premerge (#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

use controller name for stats (#2522)

Simulator workspace re-design (#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR reviews.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with Python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Secure XGBoost Integration (#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (#2535)

Fixed the simulator server workspace root dir (#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (#2536)

* 1. Rename the ExeTaskFnWrapper class to TaskScriptRunner.
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user scripts define main().
3. Redirect print() to logger.info().

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537)

* 1) Fix the mlruns and tb_events dirs after the workspace directory changes.
2) For MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) prevents later job output from overwriting the first job's output.

* 1. Remove the default code to use configuration
2. Fix some broken notebooks

* rollback changes

Fix decorator issue (#2542)

Remove line number in code link.

FLModel summary (#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>

* Add information about dig (bind9-dnsutils) in the document

* format update

* Update KM example, add 2-stage solution without HE (#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* Update monai readme to remove logging.conf (#2552)

* MONAI mednist example (#2532)

* add monai notebook

* add training script

* update example

* update notebook

* use job template

* call init later

* switch back

* add gitignore

* update notebooks

* add readmes

* send received model to GPU

* use monai tb stats handler

* formatting

* Improve AWS cloud launch script

* Add in process client api tests (#2549)

* Add in process client api tests

* Fix headers

* Fix comments

* Add client controller executor (#2530)

* add client controller executor

* address comments

* enhance abort, set peer props

* remove asserts

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* Add option in dashboard cli for AWS vpc and subnet

* add note on README visualization

* update README

* update readme

* update readme

* update readme

* [2.5] Clean up to allow creation of nvflare light (#2573)

* clean up to allow creation of nvflare light

* move defs to cellnet

* Enable patch and build for nvflight (#2574)

* verified commit

---------

Co-authored-by: Yuhong Wen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Zhijin <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yan Cheng <[email protected]>
nvidianz pushed a commit to nvidianz/NVFlare that referenced this pull request May 14, 2024
…ormalization federated learning method (NVIDIA#2524)

* add research/fedbn

* delete redudant controller and correct figs requirements

* update plot_requirements

* rewrite fedbn

* update jobs

* remove workspace

* update README

* simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

* Add missing client api test jobs (NVIDIA#2535)

* Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

* Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function
3. redirect print() to logger.info()

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

* FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job

* 1. Remove the default code to use configuration
2. fix some broken notebook

* rollback changes

* Fix decorator issue (NVIDIA#2542)

* update create and run job script

* FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

* remove jobs folder

* expose aggregate_fn to users for overwriting (NVIDIA#2539)

* handle cases where the script with relative path in Script Runner (NVIDIA#2543)

* handle cases where the script with relative path

* handle cases where the script with relative path

* add more unit test cases and change the file search logics

* code format

* add more unit test cases and change the file search logics

* Lr newton raphson (NVIDIA#2529)

* Implement federated logistic regression with second-order newton raphson.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for the job status changing. (NVIDIA#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (NVIDIA#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (NVIDIA#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (NVIDIA#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (NVIDIA#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (NVIDIA#2521)

Upgrade dependencies (NVIDIA#2516)

Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)

Multiple bug fixes from 2.4 (NVIDIA#2518)

* [2.4] Support client custom code in simulator (NVIDIA#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (NVIDIA#2457)

* Fix sub_worker_process shutdown (NVIDIA#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)

Pythonic job creation (NVIDIA#2483)

* WIP: constructed the FedJob.

* WIP: server_app josn export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (NVIDIA#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as paramenter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed no used import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (NVIDIA#2519)

* Starts heartbeat after task is pull and before task execution (NVIDIA#2415)

* Starts pipe handler heartbeat send/check after task is pull before task execution (NVIDIA#2442)

* [2.4] Improve cell pipe timeout handling (NVIDIA#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (NVIDIA#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (NVIDIA#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)

* Fix metric relay pipe handler timeout (NVIDIA#2496)

* Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (NVIDIA#2520)

* Update github actions (NVIDIA#2450)

* Fix premerge (NVIDIA#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (NVIDIA#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

use controller name for stats (NVIDIA#2522)

Simulator workspace re-design (NVIDIA#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspacce structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimzed.

* adjust the stats pool location.

* Addressed the PR views.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (NVIDIA#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed a add_argument type, added help message.

* Changed to use add_argument(() compatible with python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Secure XGBoost Integration (NVIDIA#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (NVIDIA#2535)

Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function
3. redirect print() to logger.info()

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (NVIDIA#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job

* 1. Remove the default code to use configuration
2. fix some broken notebook

* rollback changes

Fix decorator issue (NVIDIA#2542)

Remove line number in code link.

FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>

* Add information about dig (bind9-dnsutils) in the document

* format update

* Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* Update monai readme to remove logging.conf (NVIDIA#2552)

* MONAI mednist example (NVIDIA#2532)

* add monai notebook

* add training script

* update example

* update notebook

* use job template

* call init later

* switch back

* add gitignore

* update notebooks

* add readmes

* send received model to GPU

* use monai tb stats handler

* formatting

* Improve AWS cloud launch script

* Add in process client api tests (NVIDIA#2549)

* Add in process client api tests

* Fix headers

* Fix comments

* Add client controller executor (NVIDIA#2530)

* add client controller executor

* address comments

* enhance abort, set peer props

* remove asserts

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* Add option in dashboard cli for AWS vpc and subnet

* add note on README visualization

* update README

* update readme

* update readme

* update readme

* [2.5] Clean up to allow creation of nvflare light (NVIDIA#2573)

* clean up to allow creation of nvflare light

* move defs to cellnet

* Enable patch and build for nvflight (NVIDIA#2574)

* verified commit

---------

Co-authored-by: Yuhong Wen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Zhijin <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yan Cheng <[email protected]>
nvidianz added a commit that referenced this pull request May 16, 2024
* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Implemented horizontal calls in nvflare plugin

* Added support for horizontal secure XGBoost

* Fixed a few horizontal issues

* Added reliable message

* Added ReliableMessage parameters

* Added log for debugging empty rcv_buf

* Added finally block to finish duplicate seq

* Removed debug statements

* format change

* Add in process client api tests (#2549)

* Add in process client api tests

* Fix headers

* Fix comments

* Add client controller executor (#2530)

* add client controller executor

* address comments

* enhance abort, set peer props

* remove asserts

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* Add option in dashboard cli for AWS vpc and subnet

* [2.5] Clean up to allow creation of nvflare light (#2573)

* clean up to allow creation of nvflare light

* move defs to cellnet

* Enable patch and build for nvflight (#2574)

* add FedBN Implementation on NVFlare research folder - a local batch normalization federated learning method  (#2524)

* add research/fedbn

* delete redundant controller and correct figs requirements

* update plot_requirements

* rewrite fedbn

* update jobs

* remove workspace

* update README

* simplify job simulator_run to take only one workspace parameter. (#2528)

* Add missing client api test jobs (#2535)

* Fixed the simulator server workspace root dir (#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

* Improve InProcessClientAPIExecutor  (#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement for users to define a main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

* FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting that of the first job

* 1. Remove the default code to use configuration
2. fix some broken notebook

* rollback changes

* Fix decorator issue (#2542)

* update create and run job script

* FLModel summary (#2544)

* add FLModel Summary

* format

* remove jobs folder

* expose aggregate_fn to users for overwriting (#2539)

* handle cases where the script with relative path in Script Runner (#2543)

* handle cases where the script with relative path

* handle cases where the script with relative path

* add more unit test cases and change the file search logics

* code format

* add more unit test cases and change the file search logics

* Lr newton raphson (#2529)

* Implement federated logistic regression with second-order newton raphson.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for the job status changing. (#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (#2521)

Upgrade dependencies (#2516)

Use full path for PSI components (#2437) (#2517)

Multiple bug fixes from 2.4 (#2518)

* [2.4] Support client custom code in simulator (#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (#2457)

* Fix sub_worker_process shutdown (#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (#2474)

Pythonic job creation (#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (#2519)

* Starts heartbeat after task is pulled and before task execution (#2415)

* Starts pipe handler heartbeat send/check after task is pulled and before task execution (#2442)

* [2.4] Improve cell pipe timeout handling (#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (#2495)

* Fix metric relay pipe handler timeout (#2496)

* Rely on launcher check_run_status to pause/resume hb (#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (#2520)

* Update github actions (#2450)

* Fix premerge (#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

use controller name for stats (#2522)

Simulator workspace re-design (#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR reviews.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Secure XGBoost Integration (#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (#2535)

Fixed the simulator server workspace root dir (#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement for users to define a main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting that of the first job

* 1. Remove the default code to use configuration
2. fix some broken notebook

* rollback changes

Fix decorator issue (#2542)

Remove line number in code link.

FLModel summary (#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>

* Add information about dig (bind9-dnsutils) in the document

* format update

* Update KM example, add 2-stage solution without HE (#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* Update monai readme to remove logging.conf (#2552)

* MONAI mednist example (#2532)

* add monai notebook

* add training script

* update example

* update notebook

* use job template

* call init later

* switch back

* add gitignore

* update notebooks

* add readmes

* send received model to GPU

* use monai tb stats handler

* formatting

* Improve AWS cloud launch script

* Add in process client api tests (#2549)

* Add in process client api tests

* Fix headers

* Fix comments

* Add client controller executor (#2530)

* add client controller executor

* address comments

* enhance abort, set peer props

* remove asserts

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* Add option in dashboard cli for AWS vpc and subnet

* add note on README visualization

* update README

* update readme

* update readme

* update readme

* [2.5] Clean up to allow creation of nvflare light (#2573)

* clean up to allow creation of nvflare light

* move defs to cellnet

* Enable patch and build for nvflight (#2574)

* verified commit

---------

Co-authored-by: Yuhong Wen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Zhijin <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yan Cheng <[email protected]>

* fix MLFLOW example (#2575)

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* BugFix: InProcessClientAPIExecutor's TaskScriptRunner (#2558)

* 1) find script full path to indicate which site script to avoid loading run script
2) make sure a failed task script causes the client to return a failure status, which triggers job stop rather than waiting forever
3) add different unit tests

* sort key in unit test

* add logic to improve error message

* style format

* add more tests and logics

* code format

* code format

* fix steps error

* fix global steps

* rollback some changes and split it into another PR

* rollback some changes and split it into another PR

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* update client_api.png (#2577)

* Fix the simulator worker sys path (#2561)

* Fixed the simulator worker sys path.

* fixed the get_new_sys_path() logic, added in unit test.

* fixed isort.

* Changed the _get_new_sys_path() implementation.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* ReliableMessage register is changed to register aux message. Added support for Mac with vertical

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Minghui Chen <[email protected]>
Co-authored-by: Yuhong Wen <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Zhijin <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
nvidianz added a commit to nvidianz/NVFlare that referenced this pull request May 22, 2024
* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Implemented horizontal calls in nvflare plugin

* Added support for horizontal secure XGBoost

* Fixed a few horizontal issues

* Added reliable message

* Added ReliableMessage parameters

* Added log for debugging empty rcv_buf

* Added finally block to finish duplicate seq

* Removed debug statements

* format change

* Add in process client api tests (NVIDIA#2549)

* Add in process client api tests

* Fix headers

* Fix comments

* Add client controller executor (NVIDIA#2530)

* add client controller executor

* address comments

* enhance abort, set peer props

* remove asserts

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* Add option in dashboard cli for AWS vpc and subnet

* [2.5] Clean up to allow creation of nvflare light (NVIDIA#2573)

* clean up to allow creation of nvflare light

* move defs to cellnet

* Enable patch and build for nvflight (NVIDIA#2574)

* add FedBN Implementation on NVFlare research folder - a local batch normalization federated learning method  (NVIDIA#2524)

* add research/fedbn

* delete redundant controller and correct figs requirements

* update plot_requirements

* rewrite fedbn

* update jobs

* remove workspace

* update README

* simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

* Add missing client api test jobs (NVIDIA#2535)

* Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

* Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement for users to define a main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

* FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting that of the first job

* 1. Remove the default code to use configuration
2. fix some broken notebook

* rollback changes

* Fix decorator issue (NVIDIA#2542)

* update create and run job script

* FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

* remove jobs folder

* expose aggregate_fn to users for overwriting (NVIDIA#2539)

* handle cases where the script with relative path in Script Runner (NVIDIA#2543)

* handle cases where the script with relative path

* handle cases where the script with relative path

* add more unit test cases and change the file search logics

* code format

* add more unit test cases and change the file search logics

* Lr newton raphson (NVIDIA#2529)

* Implement federated logistic regression with second-order newton raphson.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for the job status changing. (NVIDIA#2480)

* Added more logging for the job status changing.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix update client status (NVIDIA#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (NVIDIA#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (NVIDIA#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <[email protected]>

Enhance WFController (NVIDIA#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (NVIDIA#2521)

Upgrade dependencies (NVIDIA#2516)

Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)

Multiple bug fixes from 2.4 (NVIDIA#2518)

* [2.4] Support client custom code in simulator (NVIDIA#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (NVIDIA#2457)

* Fix sub_worker_process shutdown (NVIDIA#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)

Pythonic job creation (NVIDIA#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (NVIDIA#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* backed to handle parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Enhancements from 2.4 (NVIDIA#2519)

* Starts heartbeat after task is pulled and before task execution (NVIDIA#2415)

* Starts pipe handler heartbeat send/check after task is pulled and before task execution (NVIDIA#2442)

* [2.4] Improve cell pipe timeout handling (NVIDIA#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* [2.4] Enhance launcher executor (NVIDIA#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (NVIDIA#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)

* Fix metric relay pipe handler timeout (NVIDIA#2496)

* Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)

Co-authored-by: Chester Chen <[email protected]>

---------

Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

Update ci cd from 2.4 (NVIDIA#2520)

* Update github actions (NVIDIA#2450)

* Fix premerge (NVIDIA#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (NVIDIA#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Sean Yang <[email protected]>

use controller name for stats (NVIDIA#2522)

Simulator workspace re-design (NVIDIA#2492)

* Redesign simulator workspace structure.

* working, needs clean.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR reviews.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Simulator end run for all clients (NVIDIA#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Secure XGBoost Integration (NVIDIA#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (NVIDIA#2535)

Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <[email protected]>

Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement for users to define a main() function
3. redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (NVIDIA#2539)

FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)

* 1) fix mlruns and tb_events dirs due to workspace directory changes
2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting that of the first job

* 1. Remove the default code to use configuration
2. fix some broken notebook

* rollback changes

Fix decorator issue (NVIDIA#2542)

Remove line number in code link.

FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* update license

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Holger Roth <[email protected]>

* Add information about dig (bind9-dnsutils) in the document

* format update

* Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <[email protected]>

* Update monai readme to remove logging.conf (NVIDIA#2552)

* MONAI mednist example (NVIDIA#2532)

* add monai notebook

* add training script

* update example

* update notebook

* use job template

* call init later

* switch back

* add gitignore

* update notebooks

* add readmes

* send received model to GPU

* use monai tb stats handler

* formatting

* Improve AWS cloud launch script

* Add in process client api tests (NVIDIA#2549)

* Add in process client api tests

* Fix headers

* Fix comments

* Add client controller executor (NVIDIA#2530)

* add client controller executor

* address comments

* enhance abort, set peer props

* remove asserts

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* Add option in dashboard cli for AWS vpc and subnet

* add note on README visualization

* update README

* update readme

* update readme

* update readme

* [2.5] Clean up to allow creation of nvflare light (NVIDIA#2573)

* clean up to allow creation of nvflare light

* move defs to cellnet

* Enable patch and build for nvflight (NVIDIA#2574)

* verified commit

---------

Co-authored-by: Yuhong Wen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Zhijin <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Yan Cheng <[email protected]>

* fix MLFLOW example (NVIDIA#2575)

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* BugFix: InProcessClientAPIExecutor's TaskScriptRunner (NVIDIA#2558)

* 1) find script full path to indicate which site script to avoid loading run script
2) make sure a failed task script causes the client to return a failure status, which triggers a job stop rather than waiting forever
3) add different unit tests

* sort key in unit test

* add logic to improve error message

* style format

* add more tests and logics

* code format

* code format

* fix steps error

* fix global steps

* rollback some changes and split it into another PR

* rollback some changes and split it into another PR

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* update client_api.png (NVIDIA#2577)

* Fix the simulator worker sys path (NVIDIA#2561)

* Fixed the simulator worker sys path.

* fixed the get_new_sys_path() logic, added in unit test.

* fixed isort.

* Changed the _get_new_sys_path() implementation.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

* ReliableMessage register is changed to register aux message. Added support for Mac with vertical

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>
Co-authored-by: Sean Yang <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>
Co-authored-by: Yan Cheng <[email protected]>
Co-authored-by: Minghui Chen <[email protected]>
Co-authored-by: Yuhong Wen <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Zhijin <[email protected]>
Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>
Co-authored-by: Holger Roth <[email protected]>