add user experience related metrics #617
Merged
Pull Request Test Coverage Report for Build 1620
💛 - Coveralls
Anbang-Hu
reviewed
Nov 1, 2019
self.submit_time = None

class LRUDefatulDict(object):
We might want to leverage a third-party caching package, e.g. https://github.com/tkem/cachetools/tree/c530924cdec86855be6322d3e4dd979bfc9250e4
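For context on what is being compared, here is a minimal sketch of a bounded default dict with least-recently-used eviction, written with only the standard library. It is not the code from this PR: the class name `LRUDefaultDict`, the `default_factory` and `maxsize` parameters, and the `job-*` keys are all assumptions made for illustration. The linked cachetools package ships an `LRUCache` class that could likely take over most of this bookkeeping.

```python
from collections import OrderedDict


class LRUDefaultDict:
    """Hypothetical bounded mapping: creates missing values, evicts the least recently used key."""

    def __init__(self, default_factory, maxsize=1000):
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self._factory = default_factory
        self._maxsize = maxsize
        self._data = OrderedDict()  # oldest entry first, newest last

    def __getitem__(self, key):
        if key in self._data:
            # A read counts as a use, so move the key to the "newest" end.
            self._data.move_to_end(key)
            return self._data[key]
        # Missing key: build a default value and store it, like defaultdict.
        value = self._factory()
        self[key] = value
        return value

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Drop the oldest entries until the size bound holds again.
        while len(self._data) > self._maxsize:
            self._data.popitem(last=False)

    def __contains__(self, key):
        # Membership checks do not refresh recency.
        return key in self._data

    def __len__(self):
        return len(self._data)


cache = LRUDefaultDict(dict, maxsize=2)
cache["job-a"]["submit_time"] = 1
cache["job-b"]["submit_time"] = 2
cache["job-a"]  # touch job-a so job-b becomes the oldest entry
cache["job-c"]  # inserting a third key evicts job-b
```

The sketch is not thread-safe. If several threads update metrics concurrently, access would need to be guarded by a lock, which is one more argument for reusing a maintained package.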
YinYangOfDao pushed a commit to YinYangOfDao/DLWorkspace that referenced this pull request on Nov 5, 2019
YinYangOfDao added a commit that referenced this pull request on Jan 29, 2020
* Add nfs storage info in gpu card page , fix the issue of blank load in job page, issue of prreemtible job total and training job submission error (#613) * Add unit test framework * Add coverage report * Add test config * Disable logger in test env * Travis * Use npm to run scripts Ref: istanbuljs/nyc#1181 * Remove console logs * authenticate unit tests * Externalize User.generateToken for test use * bootstrap.js: fix content type * Make Service#context not enumerable * bootstrap.js unit tests * Make user#_token not enumerable * Fix log out test * Use `resolves()` instead of `returns(Promise.resolve())` * add cluster config export test * correct some mistakes in exportClusterConfig.js * adjust test structure * add GET user test * correct the schema for job * add test for jobs.post.js * continue to adjust the test structure * jobs.post.js change post format to make it neat * add test for controllers/user.js * team.js: adjust code structure and fix some mistakes * logger.js: set app.silent as true during test * teams.js: change the test info format * add test for getting job detail * Introduce coveralls * Fix nycrc path * add test for job posting behavior * Add addGroupLink to unit test * add test for job status putting * add test for job priority putting * add test for controllers/team/cluster.js * add test for getting template * add test for updating template * add test for template deletion * add test for controllers/team/jobs.js * Remove debug log * Fill wiki link to test config * Ignore the main module detection in cov test * Ignore another main module detection * controllers/team/jobs.js: add specific user case * controllers/team/jobs.js: add null jobTime case * postJob.js: correct test message * update test for endpoint posting * update test for priority setting * Add badge * Fix badge link * Change appbar to normal format (#616) * add user experience related metrics (#617) * Fix VC storage storage containerPath (#621) * Support custom storage mountpoints * 
Add extra logging * Add TTLCache for ListVCs (#619) * Lower case custom mountpoint name (#623) * Add default team key into request body (#620) * Add lock for cache * Bump twisted from 19.2.1 to 19.7.0 in /src/ClusterManager (#618) Bumps [twisted](https://github.com/twisted/twisted) from 19.2.1 to 19.7.0. - [Release notes](https://github.com/twisted/twisted/releases) - [Changelog](https://github.com/twisted/twisted/blob/trunk/NEWS.rst) - [Commits](twisted/twisted@twisted-19.2.1...twisted-19.7.0) Signed-off-by: dependabot[bot] <[email protected]> * Add support for ./deploy.py connect samba (#625) * Add support for ./deploy.py connect samba * Add samba to all roles * update * Add password to user profile to copy (#635) * add job time from scheduling to running metric (#633) * fix bug in update job metrics (#636) * use redis to save job status info (#637) * avoid override running metric (#641) * Move pure CPU jobs to CPU machines (#640) * Dashboard: upgrade node.js version to erbium * Dashboard API: increase max-http-header-size to 64KB * accelerate jobmanager loop (#642) * azure blobfuse plugin * fix breaks of the deployment pipeline for on premise machines and updated kernel (#639) * Inference job: add soft podAffinity to deployment (#638) * Web Portal: Add userName field in add-endpoints call. 
* support blob array * profile bootstrap (#649) * Dashboard/https support (#647) * repair manager initial code check-in (#644) * ECC Repair Manager Initial Check-In * remove test rule from config * pr feedback * remove test_rule.py * fix double download * Use k8s API to create and delete secrets * Fix typos and add delete_job for succeed and failed cases * let prometheus ignore redis port (#653) * A few fixes * Fix distributed jobs * Ignore invalid strings * Refactor * Add tests for job * Add local fast storage * Support cluster wide local fast storage * tmppath format * Dashboard UI: fix enableJobPath binding (#652) * Fix issue of template save and delete not working with azure blob and optimize the user experience of template (#655) * pull shorter logs (#654) * Add a flag to enable Azure blobfuse (#657) * Add enable_blobfuse flag * Fix default params location * profile more (#656) * Fix job status check * Improve the performance of cluster status & jobs load efficiecncy (#659) * Revert tail=3000 * Change free size to avail size for filesystem (#665) * fix broken deployment * fix init script to support default ssh config in docker image (#666) * add the mount options in azure blob (#671) * support password login (#668) * support password login * fix bug * dockerize and k8s service - repair manager (#663) * dockerize repair manager * make kubernetes service * add repairmanager to params.py * Dashboard backend: set current user in job.post (#674) * Dashboard backend: set current user in job.post * Unit test * Revert lint * Revert "Add script to set up network GC to prevent docker network issue" (#678) * based on worker/nfs PR, refactored the code to load config and create clusterID (#670) * cloudinit for worker node and nfs mount refactorization * add mkdir_and_cp.sh for worker cloud-init * add binaries and copies for mount service * refactor deploy.py and az_tools.py to get rid of global vars and support python3 * refactor code to create clusterID, and read configs 
* minor format/naming bugs * fix formatting issue, add deleted deprecated function back * Init Adding NFS Storage Manager (#672) * Add storage monitor * Add support for expiry * Add subtree atime * Fix tests * Refactor storage monitor * Add a loop in main * Add kubelabels * Add Dockerfile and service yaml file * Modify some code * Create utility servers if specified * Make allowalltcp source range configurable * Fix creating nsg and private ip for utility servers * Fix typo * Fix genconfig vmSize * get nodes by role 'utility' * Add updateutility * Add utility_node in get_node_lists_for_service * Add utility nodes to get_nodes * AAdd docker build for storage manager * Add deploy utility configs and storage configuration file * Fix typo * Temporarily make gpu_type='None' for utility node deployment * Add storage manager mapping * Fix docker image and bugs * Fix typop * mount scanpoints upon service startup * Update to deploy storage manager on nfs node * Modify deployment rendering and mounts * Utility -> NFS * NfS template rendering * Clean rendered target directory for nfs config rendering * Allow master to access nfs * Change nfs_allow_master nsg name * support custom_nfs_nsg_names * Fix typo in custom_nfs_nsg_names * Fix * Mount /data/share for storagemanager * storage_monitor -> storage_manager * Do not include nfs server in get_nodes * Make mountoptions configurable per blobfuse per job (#673) * Make mountoptions configurable per job * Use invalid_entry to check mount_options * Add an additional regular expression check * Dashboard: reduce over-detailed logs (#683) * remove sudo in endpoint manager (#682) * Handle terminating pods when machines are taken away for k8s > 1.13 + expose more job info (#681) * Mark pod with deletion_timestamp as Unknown * Ignore None user_sign_token * fix * Log node_name, host_ip, pod_ip * Add requested and available resource info for queued jobs * Fix resource order * Use literal * email alerts + refactoring (#684) * add 
repairmanager to params.py * Refactoring + Email Alerts * support db pool (#675) * Add GetAllACL API * Change user_sign_token to master_token (#690) * Dashboard deployment (#680) * fix init script to support default ssh config in docker image * deploy dashboard * Fix bad decoding of jobStatusDetail * use init container to copy sshd and openssl command (#662) * Add SKU meta section to support scheduling on CPU machines (#676) * Init machine SKU * Update * A few fixes * Refactor * Adding comments to methods and renaming variables * Fix interface break * Fix * Add comments for command * User synchronizer (#687) * Add user-synchronizer * Fix * add pymysql * Use mysqlclient instead of pymysql * fix * Add cronjob * fix dockerfile * Revert "fix dockerfile" This reverts commit 74abee9. * use prebuild.sh * typo * Usey synchronizer: restfulapi version * Fix * Fix deployment * Add tolerations * typo * Use onPremisesSecurityIdentifier to calculate id * deploy * Issue fix * Add lint sript * Add default group to groups * comfigmap * label * labels * Add NCCL_IB_DISABLE=1 to disable IB usage (#694) * Support per-job configurable docker registry secret (#689) * Allow using custom docker registry * Add flag to config * Change variable name * lowercase for k8s name * Fix the issue of azure blob input remains after chaning template (#696) * Remove kvp file for Network Direct for Infiniband (#699) * repair manager: more refactoring (#697) * add repairmanager to params.py * Refactoring + Email Alerts * repair manager: more refactoring * PR feedback * VC node hard assignment (#698) * VC node hard assignment * try catch invalid cpu and memory spec * Add comma * Update * Set gpuType=None in pod description for CPU jobs * Fix logic for cpu jobs * Add command explanation * sync and service discovery using k8s (#695) * use sync.py to do distributed job sync and service discovery * add sshd check * fix params syntax bug (#701) * use deepscale sshd config (#700) * repair manager email fixes 
(#702) * Install Azure blobfuse at deployment (#705) * Dashboard: ignore frontend build directory * [Temp] Hide the data storage when vc is MMBellevue (#712) * Fix readonly detection * Add kill button to job details * Dashboard frontend: remove requests other than Grafana (#704) * Replace prometheus request with grafana api * Proxy gpu_reporter to GetVC API * Dashboard frontend: Use proxied GPU reporter data. * Support restfulapi w/o gpu_idle proxy * Use batch delete secrets (#706) * Use batch delete secrets * Update Training.tsx Revert * MySQL server deployment (#707) * MYSQL server deployment * Add mysql in allroles * New private ip for mysql * genconfig mysql * deploy.py connect mysql * updatemysql * mysql deployment yaml * get node lists for service - mysql * Take the first element * Update * repair manager - add more details to alert emails (#709) * add repairmanager to params.py * Refactoring + Email Alerts * repair manager: more refactoring * PR feedback * update time between rules * email config fixes * more descriptive email alert * nit * "fixing output error message" * try/catch for prometheus request * email multiple recipients, configurable * add functionaity to email job owners * use logger instead of logging (#710) * dashboard/new-bootstrap-schema (#714) * Refactor: add config to bootstrap param * Support new bootstrap schema in frontend * MySQL server node deployment and support mountOptions list for blobfuse (#720) * Use a unique tmppath for each blobfuse mount and support mount option list * Handle single mysql_node * mysql -> mysqlserver * Convert to string before checking invalid * Make tmppath of format $root_tmppath/$jobId/$podName/$blobfuse_name * Hide all credentials in REST call returns * A few fixes * Fix dashboard deployment * Dashboard backend: adjust some logs to debug level * fix app.silent * Remove default mountOptions for blobfuse (#721) * Add install-blobfuse.sh and docker push init-container to deployment script (#723) * Add 
./deploy.py docker push init-container to deployment script * Execute install-blobfuse.sh at deployment * restrict port range for ssh (#724) * Dashboard: Clarify password and token. (#722) Password: The only term user should use, which is the string-typed user credential for dashboard API use. User should pass `email` as well as `password` as queries in API call to get access of the dashboard resource. For backward compatibility, `token` is also available in query, which is deprecated. Token: Internal used in dashboard **backend**, which stands for the Buffer-typed password. Backend always does not store the string-typed password for security reasons. IdToken: jwt typed token from Azure Active Directory CookieToken: `token` field value in cookie, jwt typed. Should be plain since koa already provided a signed cookie approach. * fix ssh problem (#725) * Mapping the actions into job detail page and adjust layout of appbar in homepage (#726) * Enable PermitUserEnvironment and propagate variables containing NCCL|PATH|DLWS|DLTS (#727) * do not generate new port in host network (#728) * clean up configmap of last retry (#729) * make dry run configurable (#731) * exit on failed to get enough configmap (#732) * longer wait time (#733) * Send email alert for overused storage paths (#730) * Fix typo in storage manager (#734) * add a hidden feature for deploy.py to separate code and config (#736) * fix previous bug (#737) * Revert "add a hidden feature for deploy.py to separate code and config (#736)" This reverts commit bd891b7. * Revert "fix previous bug (#737)" This reverts commit cdb176c. 
* fix yaml load warning * Remove config_dir from unsuccessful reverts and conflict resolutions (#739) * remove yaml load warning (#740) * repair manager - fix email alerting bugs (#741) * perf optimization for job list and detail * optimize authorize cache; update by comment * improve Endpoint API; improve VC list cache * move getAlias to utils; fix typo * join priority when getting my job list * Send to CC list and refactor storagemanager (#743) * Add separated GetJobLog API in restfulapi service * use environment variable to pass user's command (#742) * wait forever in setting up the ssh (#745) * Dashboard backend: remove winbind dependency Lint files Add lint to CI Dashboard frontent: Remove uid dependency Workaround bootstrap unit test * Dashboard: add v2 API View and Manage Jobs V2 Priority snackbar WIP Load MyJobs / AllJobs on demand Use notistack Bump dependencies Lint Lint Reduce dependencies Fix warning Layout Add empty view in AllJobs Fix Fix icon Use clusterId in RouteParams Job Details v2 Fix add job log api to dashboard Use error notistack instead of Error component Implement Console Add Helmet Restructure useConfirm Strict route Issue fix Fix key Add job status change notification detail v2: Use job name as title Container width support action Fix support Compress & cache frontend files usePrevious instead of useChange Leverage priority from job details v2 wider Fix PriorityField * User Synchronizer: use host network (#748) * fix endpoint extract (#746) * fix get all acl * Job table: use Link instead of onRowClick ...to support functionalities of anchor * Fix GPU rendering * User synchronizer: filter out subgroups (#752) * special temporary code to be generalized (#753) * print selected environment variables (#754) * fix env problem (#756) * change cluster_manager, restfulapi and deploy.py to python3 (#750) * Add tooltip to job status * Capitalize job details title * Refine notification notice * typo: Preemptible * typo * Fix job status detail * 
Status tooltip: only show first details * Fix work path * Auth succ: use JS redirect instead of HTTP 302 It seems we hit https://bugs.chromium.org/p/chromium/issues/detail?id=696204 * Fix redirection * job status: place to right * fix scp could not found ssh (#759) * JobV2: fix crash when job data comes early than cluster config (#760) * fix byte/str conversion error (#763) * Install python3 pip3 in prerequisites (#762) * fix ACL isDeny default value (#765) * Refactor authorization.py (#761) * autopep8 some python files (#764) * notify user about job status changes (#768) * add travis (#769) * refactor endpoint (#770) * repair manager - email alerting refactoring (#747) * add some automatic test in common functionality (#771) * record longer latency for calling some program (#773) * Define new resource type to simplify code logic (#772) * Cluster resource Init * Refactor with Resource type * Rename Resource to ResourceStat * A few bug fixes * Refactor ClusterStatus * revert formatting for gpu usage url * Bug fix * Allow empty GPU type * Revert "Allow empty GPU type" This reverts commit 36e0f83. 
* Add backward compatibility for typo * Namespaced to default for now * Add test_cluster_status * Enable tests for cluster_status and utils in travis * apt-get install python3-pycurl * pycurl * Remove pycurl dependency * next check UI and job submission * git refactor NFS and mount, job running after manually set mysql identity table and secrets * wait and retry deploying service in cloud_init_infra.sh, and test whether repairmanager is up after slightly modify prebuild.sh * update doc for Azure deployment and change default value of workFolderAccessPoint and dataFolderAccessPoint * minor changes, map service names to docker names * resolve v2deploy.sh conflict * fix breaks for citest * fix breaks after rebasing, add default api_servers back after render worker generic * update configure.md * update azure deployment instructions * rename deploy.sh * modify utils.py scp and sudo scp, improve maintain.py * modified config file format merge Hongzhi's update and add more details for docs. * hide config file names when using command, stop generating scripts if not dryrun * use multiprocess and subprocess to parallely adding the vms * improve parallel execution and add node ready verification * update citest to test cloudinit based deployment Co-authored-by: hongyiliu <[email protected]> Co-authored-by: George Cheng <[email protected]> Co-authored-by: hzzhang <[email protected]> Co-authored-by: Di Xu <[email protected]> Co-authored-by: anbhu <[email protected]> Co-authored-by: Hongzhi Li <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: leigaoms <[email protected]> Co-authored-by: Deborah Sandoval <[email protected]>
YinYangOfDao added a commit that referenced this pull request on Jan 29, 2020
* Add coverage report * Add test config * Disable logger in test env * Travis * Use npm to run scripts Ref: istanbuljs/nyc#1181 * Remove console logs * authenticate unit tests * Externalize User.generateToken for test use * bootstrap.js: fix content type * Make Service#context not enumerable * bootstrap.js unit tests * Make user#_token not enumerable * Fix log out test * Use `resolves()` instead of `returns(Promise.resolve())` * add cluster config export test * correct some mistakes in exportClusterConfig.js * adjust test structure * add GET user test * correct the schema for job * add test for jobs.post.js * continue to adjust the test structure * jobs.post.js change post format to make it neat * add test for controllers/user.js * team.js: adjust code structure and fix some mistakes * logger.js: set app.silent as true during test * teams.js: change the test info format * add test for getting job detail * Introduce coveralls * Fix nycrc path * add test for job posting behavior * Add addGroupLink to unit test * add test for job status putting * add test for job priority putting * add test for controllers/team/cluster.js * add test for getting template * add test for updating template * add test for template deletion * add test for controllers/team/jobs.js * Remove debug log * Fill wiki link to test config * Ignore the main module detection in cov test * Ignore another main module detection * controllers/team/jobs.js: add specific user case * controllers/team/jobs.js: add null jobTime case * postJob.js: correct test message * update test for endpoint posting * update test for priority setting * Add badge * Fix badge link * Change appbar to normal format (#616) * add user experience related metrics (#617) * Fix VC storage storage containerPath (#621) * Support custom storage mountpoints * Add extra logging * Add TTLCache for ListVCs (#619) * Lower case custom mountpoint name (#623) * Add default team key into request body (#620) * Add lock for cache * Bump twisted 
from 19.2.1 to 19.7.0 in /src/ClusterManager (#618) Bumps [twisted](https://github.com/twisted/twisted) from 19.2.1 to 19.7.0. - [Release notes](https://github.com/twisted/twisted/releases) - [Changelog](https://github.com/twisted/twisted/blob/trunk/NEWS.rst) - [Commits](twisted/twisted@twisted-19.2.1...twisted-19.7.0) Signed-off-by: dependabot[bot] <[email protected]> * Add support for ./deploy.py connect samba (#625) * Add support for ./deploy.py connect samba * Add samba to all roles * update * Add password to user profile to copy (#635) * add job time from scheduling to running metric (#633) * fix bug in update job metrics (#636) * use redis to save job status info (#637) * avoid override running metric (#641) * Move pure CPU jobs to CPU machines (#640) * Dashboard: upgrade node.js version to erbium * Dashboard API: increase max-http-header-size to 64KB * accelerate jobmanager loop (#642) * azure blobfuse plugin * fix breaks of the deployment pipeline for on premise machines and updated kernel (#639) * Inference job: add soft podAffinity to deployment (#638) * Web Portal: Add userName field in add-endpoints call. 
* support blob array * profile bootstrap (#649) * Dashboard/https support (#647) * repair manager initial code check-in (#644) * ECC Repair Manager Initial Check-In * remove test rule from config * pr feedback * remove test_rule.py * fix double download * Use k8s API to create and delete secrets * Fix typos and add delete_job for succeed and failed cases * let prometheus ignore redis port (#653) * A few fixes * Fix distributed jobs * Ignore invalid strings * Refactor * Add tests for job * Add local fast storage * Support cluster wide local fast storage * tmppath format * Dashboard UI: fix enableJobPath binding (#652) * Fix issue of template save and delete not working with azure blob and optimize the user experience of template (#655) * pull shorter logs (#654) * Add a flag to enable Azure blobfuse (#657) * Add enable_blobfuse flag * Fix default params location * profile more (#656) * Fix job status check * Improve the performance of cluster status & jobs load efficiecncy (#659) * Revert tail=3000 * Change free size to avail size for filesystem (#665) * fix broken deployment * fix init script to support default ssh config in docker image (#666) * add the mount options in azure blob (#671) * support password login (#668) * support password login * fix bug * dockerize and k8s service - repair manager (#663) * dockerize repair manager * make kubernetes service * add repairmanager to params.py * Dashboard backend: set current user in job.post (#674) * Dashboard backend: set current user in job.post * Unit test * Revert lint * Revert "Add script to set up network GC to prevent docker network issue" (#678) * based on worker/nfs PR, refactored the code to load config and create clusterID (#670) * cloudinit for worker node and nfs mount refactorization * add mkdir_and_cp.sh for worker cloud-init * add binaries and copies for mount service * refactor deploy.py and az_tools.py to get rid of global vars and support python3 * refactor code to create clusterID, and read configs 
* minor format/naming bugs * fix formatting issue, add deleted deprecated function back * Init Adding NFS Storage Manager (#672) * Add storage monitor * Add support for expiry * Add subtree atime * Fix tests * Refactor storage monitor * Add a loop in main * Add kubelabels * Add Dockerfile and service yaml file * Modify some code * Create utility servers if specified * Make allowalltcp source range configurable * Fix creating nsg and private ip for utility servers * Fix typo * Fix genconfig vmSize * get nodes by role 'utility' * Add updateutility * Add utility_node in get_node_lists_for_service * Add utility nodes to get_nodes * AAdd docker build for storage manager * Add deploy utility configs and storage configuration file * Fix typo * Temporarily make gpu_type='None' for utility node deployment * Add storage manager mapping * Fix docker image and bugs * Fix typop * mount scanpoints upon service startup * Update to deploy storage manager on nfs node * Modify deployment rendering and mounts * Utility -> NFS * NfS template rendering * Clean rendered target directory for nfs config rendering * Allow master to access nfs * Change nfs_allow_master nsg name * support custom_nfs_nsg_names * Fix typo in custom_nfs_nsg_names * Fix * Mount /data/share for storagemanager * storage_monitor -> storage_manager * Do not include nfs server in get_nodes * Make mountoptions configurable per blobfuse per job (#673) * Make mountoptions configurable per job * Use invalid_entry to check mount_options * Add an additional regular expression check * Dashboard: reduce over-detailed logs (#683) * remove sudo in endpoint manager (#682) * Handle terminating pods when machines are taken away for k8s > 1.13 + expose more job info (#681) * Mark pod with deletion_timestamp as Unknown * Ignore None user_sign_token * fix * Log node_name, host_ip, pod_ip * Add requested and available resource info for queued jobs * Fix resource order * Use literal * email alerts + refactoring (#684) * add 
repairmanager to params.py * Refactoring + Email Alerts * support db pool (#675) * Add GetAllACL API * Change user_sign_token to master_token (#690) * Dashboard deployment (#680) * fix init script to support default ssh config in docker image * deploy dashboard * Fix bad decoding of jobStatusDetail * use init container to copy sshd and openssl command (#662) * Add SKU meta section to support scheduling on CPU machines (#676) * Init machine SKU * Update * A few fixes * Refactor * Adding comments to methods and renaming variables * Fix interface break * Fix * Add comments for command * User synchronizer (#687) * Add user-synchronizer * Fix * add pymysql * Use mysqlclient instead of pymysql * fix * Add cronjob * fix dockerfile * Revert "fix dockerfile" This reverts commit 74abee9. * use prebuild.sh * typo * Usey synchronizer: restfulapi version * Fix * Fix deployment * Add tolerations * typo * Use onPremisesSecurityIdentifier to calculate id * deploy * Issue fix * Add lint sript * Add default group to groups * comfigmap * label * labels * Add NCCL_IB_DISABLE=1 to disable IB usage (#694) * Support per-job configurable docker registry secret (#689) * Allow using custom docker registry * Add flag to config * Change variable name * lowercase for k8s name * Fix the issue of azure blob input remains after chaning template (#696) * Remove kvp file for Network Direct for Infiniband (#699) * repair manager: more refactoring (#697) * add repairmanager to params.py * Refactoring + Email Alerts * repair manager: more refactoring * PR feedback * VC node hard assignment (#698) * VC node hard assignment * try catch invalid cpu and memory spec * Add comma * Update * Set gpuType=None in pod description for CPU jobs * Fix logic for cpu jobs * Add command explanation * sync and service discovery using k8s (#695) * use sync.py to do distributed job sync and service discovery * add sshd check * fix params syntax bug (#701) * use deepscale sshd config (#700) * repair manager email fixes 
(#702) * Install Azure blobfuse at deployment (#705) * Dashboard: ignore frontend build directory * [Temp] Hide the data storage when vc is MMBellevue (#712) * Fix readonly detection * Add kill button to job details * Dashboard frontend: remove requests other than Grafana (#704) * Replace prometheus request with grafana api * Proxy gpu_reporter to GetVC API * Dashboard frontend: Use proxied GPU reporter data. * Support restfulapi w/o gpu_idle proxy * Use batch delete secrets (#706) * Use batch delete secrets * Update Training.tsx Revert * MySQL server deployment (#707) * MYSQL server deployment * Add mysql in allroles * New private ip for mysql * genconfig mysql * deploy.py connect mysql * updatemysql * mysql deployment yaml * get node lists for service - mysql * Take the first element * Update * repair manager - add more details to alert emails (#709) * add repairmanager to params.py * Refactoring + Email Alerts * repair manager: more refactoring * PR feedback * update time between rules * email config fixes * more descriptive email alert * nit * "fixing output error message" * try/catch for prometheus request * email multiple recipients, configurable * add functionaity to email job owners * use logger instead of logging (#710) * dashboard/new-bootstrap-schema (#714) * Refactor: add config to bootstrap param * Support new bootstrap schema in frontend * MySQL server node deployment and support mountOptions list for blobfuse (#720) * Use a unique tmppath for each blobfuse mount and support mount option list * Handle single mysql_node * mysql -> mysqlserver * Convert to string before checking invalid * Make tmppath of format $root_tmppath/$jobId/$podName/$blobfuse_name * Hide all credentials in REST call returns * A few fixes * Fix dashboard deployment * Dashboard backend: adjust some logs to debug level * fix app.silent * Remove default mountOptions for blobfuse (#721) * Add install-blobfuse.sh and docker push init-container to deployment script (#723) * Add 
./deploy.py docker push init-container to deployment script * Execute install-blobfuse.sh at deployment * restrict port range for ssh (#724) * Dashboard: Clarify password and token. (#722) Password: The only term user should use, which is the string-typed user credential for dashboard API use. User should pass `email` as well as `password` as queries in API call to get access of the dashboard resource. For backward compatibility, `token` is also available in query, which is deprecated. Token: Internal used in dashboard **backend**, which stands for the Buffer-typed password. Backend always does not store the string-typed password for security reasons. IdToken: jwt typed token from Azure Active Directory CookieToken: `token` field value in cookie, jwt typed. Should be plain since koa already provided a signed cookie approach. * fix ssh problem (#725) * Mapping the actions into job detail page and adjust layout of appbar in homepage (#726) * Enable PermitUserEnvironment and propagate variables containing NCCL|PATH|DLWS|DLTS (#727) * do not generate new port in host network (#728) * clean up configmap of last retry (#729) * make dry run configurable (#731) * exit on failed to get enough configmap (#732) * longer wait time (#733) * Send email alert for overused storage paths (#730) * Fix typo in storage manager (#734) * add a hidden feature for deploy.py to separate code and config (#736) * fix previous bug (#737) * Revert "add a hidden feature for deploy.py to separate code and config (#736)" This reverts commit bd891b7. * Revert "fix previous bug (#737)" This reverts commit cdb176c. 
* fix yaml load warning * Remove config_dir from unsuccessful reverts and conflict resolutions (#739) * remove yaml load warning (#740) * repair manager - fix email alerting bugs (#741) * perf optimization for job list and detail * optimize authorize cache; update by comment * improve Endpoint API; improve VC list cache * move getAlias to utils; fix typo * join priority when getting my job list * Send to CC list and refactor storagemanager (#743) * Add separated GetJobLog API in restfulapi service * use environment variable to pass user's command (#742) * wait forever in setting up the ssh (#745) * Dashboard backend: remove winbind dependency Lint files Add lint to CI Dashboard frontent: Remove uid dependency Workaround bootstrap unit test * Dashboard: add v2 API View and Manage Jobs V2 Priority snackbar WIP Load MyJobs / AllJobs on demand Use notistack Bump dependencies Lint Lint Reduce dependencies Fix warning Layout Add empty view in AllJobs Fix Fix icon Use clusterId in RouteParams Job Details v2 Fix add job log api to dashboard Use error notistack instead of Error component Implement Console Add Helmet Restructure useConfirm Strict route Issue fix Fix key Add job status change notification detail v2: Use job name as title Container width support action Fix support Compress & cache frontend files usePrevious instead of useChange Leverage priority from job details v2 wider Fix PriorityField * User Synchronizer: use host network (#748) * fix endpoint extract (#746) * fix get all acl * Job table: use Link instead of onRowClick ...to support functionalities of anchor * Fix GPU rendering * User synchronizer: filter out subgroups (#752) * special temporary code to be generalized (#753) * print selected environment variables (#754) * fix env problem (#756) * change cluster_manager, restfulapi and deploy.py to python3 (#750) * Add tooltip to job status * Capitalize job details title * Refine notification notice * typo: Preemptible * typo * Fix job status detail * 
Status tooltip: only show first details * Fix work path * Auth succ: use JS redirect instead of HTTP 302 It seems we hit https://bugs.chromium.org/p/chromium/issues/detail?id=696204 * Fix redirection * job status: place to right * fix scp could not found ssh (#759) * JobV2: fix crash when job data comes early than cluster config (#760) * fix byte/str conversion error (#763) * Install python3 pip3 in prerequisites (#762) * fix ACL isDeny default value (#765) * Refactor authorization.py (#761) * autopep8 some python files (#764) * notify user about job status changes (#768) * add travis (#769) * refactor endpoint (#770) * repair manager - email alerting refactoring (#747) * add some automatic test in common functionality (#771) * record longer latency for calling some program (#773) * Define new resource type to simplify code logic (#772) * Cluster resource Init * Refactor with Resource type * Rename Resource to ResourceStat * A few bug fixes * Refactor ClusterStatus * revert formatting for gpu usage url * Bug fix * Allow empty GPU type * Revert "Allow empty GPU type" This reverts commit 36e0f83. 
* Add backward compatibility for typo * Namespaced to default for now * Add test_cluster_status * Enable tests for cluster_status and utils in travis * apt-get install python3-pycurl * pycurl * Remove pycurl dependency * next check UI and job submission * git refactor NFS and mount, job running after manually set mysql identity table and secrets * wait and retry deploying service in cloud_init_infra.sh, and test whether repairmanager is up after slightly modify prebuild.sh * update doc for Azure deployment and change default value of workFolderAccessPoint and dataFolderAccessPoint * minor changes, map service names to docker names * resolve v2deploy.sh conflict * fix breaks for citest * fix breaks after rebasing, add default api_servers back after render worker generic * update configure.md * update azure deployment instructions * rename deploy.sh * modify utils.py scp and sudo scp, improve maintain.py * modified config file format merge Hongzhi's update and add more details for docs. * hide config file names when using command, stop generating scripts if not dryrun * use multiprocess and subprocess to parallely adding the vms * improve parallel execution and add node ready verification * update citest to test cloudinit based deployment * change CI clustername to lowercase and change pip3 installation commands Co-authored-by: George Cheng <[email protected]> Co-authored-by: hzzhang <[email protected]> Co-authored-by: hongyiliu <[email protected]> Co-authored-by: Di Xu <[email protected]> Co-authored-by: anbhu <[email protected]> Co-authored-by: Hongzhi Li <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: leigaoms <[email protected]> Co-authored-by: Deborah Sandoval <[email protected]>
Preliminary results from the dev1 cluster:
The left one is the result of submitting 200 CPU jobs one by one; the right one is the result of submitting 50 CPU jobs one by one.
Query:
histogram_quantile(0.95, sum(rate(job_state_change_latency_seconds_bucket[5m])) by (le, current_state))
`{current_state="submit"}` is the 95th-percentile latency for a job to change state from approved to submitted to k8s.
`{current_state="approve"}` is the 95th-percentile latency for a job to change state from created in restfulapi to approved.
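To make the query easier to read, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets. It follows the linear interpolation described in the Prometheus documentation for `histogram_quantile`. It is not Prometheus's own code, and it leaves out the `rate(...[5m])` and `sum ... by (le, current_state)` steps. The function name `estimate_quantile` and the bucket bounds and counts are made up for illustration; they are not taken from this PR.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: iterable of (upper_bound, cumulative_count) pairs, where the
    highest upper bound must be math.inf (the "+Inf" bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last one +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all

    # Rank of the requested observation among all observations.
    rank = q * total

    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break

    # The quantile falls in the +Inf bucket: the best available answer is
    # the upper bound of the highest finite bucket.
    if upper == math.inf:
        return buckets[-2][0]

    # Lower edge and count below the chosen bucket (0 for the first bucket).
    lower, below = (0.0, 0) if i == 0 else buckets[i - 1]

    # Assume observations are spread uniformly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical latency buckets in seconds: (le, cumulative count).
example_buckets = [(0.5, 10), (1.0, 60), (2.5, 90), (5.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, example_buckets)
```

Because the estimate interpolates inside one bucket, its precision is limited by the bucket boundaries chosen for `job_state_change_latency_seconds`. A value that lands in the `+Inf` bucket is reported as the largest finite bound.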