This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

VC usage (activity) reports #2073

Closed
scarlett2018 opened this issue Jan 24, 2019 · 6 comments

Comments

@scarlett2018
Member

scarlett2018 commented Jan 24, 2019

As we are suggesting using VCs to organize functional group teams, the operations team needs to know how the users and jobs in each VC are doing:

  • A daily/weekly/monthly report of each team's VC usage, including jobs, GPU, and memory.
  • Each individual's activity in a VC, for example who uses the most resources in the VC.
  • [Value to be evaluated] Alerts and email notifications for VC load.
  • A report summary sent to the admin by email.

@xudifsd
Member

xudifsd commented Mar 25, 2019

Do we need to alert on VC load? I think this would be an abuse of the alert email, since this should be a normal situation and the admin can do nothing about it.

Maybe we can use Power BI to implement the report. It's not possible to send a report via an alert rule; if we want to generate a report we could write another service, but I think BI may be more appropriate.

@xudifsd
Member

xudifsd commented Mar 25, 2019

Two problems with a monthly report:

  • Prometheus currently retains only 15 days of data.
  • We do not save Prometheus data in a host path, so all data is lost when Prometheus is redeployed.

We can extend the retention period, but that may consume too much disk space; this needs investigation.
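
As a rough way to size the disk question, the Prometheus storage documentation gives a rule of thumb: needed disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes). The sketch below applies that rule. The ingestion rate of 10,000 samples per second is a made-up placeholder, not a measurement from this cluster, and `estimate_disk_bytes` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rule-of-thumb TSDB size: retention * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    rate = 10_000  # hypothetical ingestion rate, replace with a measured value
    for days in (15, 30, 90):
        gib = estimate_disk_bytes(days, rate) / 2**30
        print(f"{days:>3} days of retention -> ~{gib:.1f} GiB")
```

The estimate scales linearly with retention, so going from 15 to 30 days should roughly double the footprint. It says nothing about the second problem: data still needs to live on a persistent path to survive redeployment.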

@scarlett2018
Member Author

> Do we need to alert on VC load? I think this would be an abuse of the alert email, since this should be a normal situation and the admin can do nothing about it.
>
> Maybe we can use Power BI to implement the report. It's not possible to send a report via an alert rule; if we want to generate a report we could write another service, but I think BI may be more appropriate.

Sounds reasonable. I just updated the original request and marked that item as to be evaluated. Let's put that one on hold and gather more feedback on whether it is needed.

@scarlett2018
Member Author

> Two problems with a monthly report:
>
>   • Prometheus currently retains only 15 days of data.
>   • We do not save Prometheus data in a host path, so all data is lost when Prometheus is redeployed.
>
> We can extend the retention period, but that may consume too much disk space; this needs investigation.

All good points. My quick thought is that we should figure out a way to persist usage and log-related history. Please investigate and get technical suggestions from @fanyangCS and @sterowang accordingly.

@xudifsd
Member

xudifsd commented Mar 26, 2019

Per-user and per-VC resource usage can be scraped from YARN's /ws/v1/cluster/apps API, which reports each application's resource usage and final status. However, this API retains only a limited number of entries (for example, 1000), so we need another service that periodically fetches this info and persists it somewhere for later reporting.
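
A minimal sketch of what such a collector might compute, assuming each VC corresponds to a YARN queue and using the `user`, `queue`, `memorySeconds`, and `vcoreSeconds` fields of the ResourceManager apps response. The helper names `fetch_apps` and `aggregate_usage` and the ResourceManager address are hypothetical. GPU usage is left out because stock YARN does not report it in these fields.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_apps(rm_address):
    """Fetch the application list from a ResourceManager, e.g. 'http://rm-host:8088'."""
    url = rm_address.rstrip("/") + "/ws/v1/cluster/apps"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)


def aggregate_usage(payload):
    """Sum job count, memory-seconds and vcore-seconds per (queue, user) pair."""
    # YARN returns {"apps": null} when there are no applications.
    apps = (payload.get("apps") or {}).get("app") or []
    usage = defaultdict(lambda: {"jobs": 0, "memorySeconds": 0, "vcoreSeconds": 0})
    for app in apps:
        key = (app.get("queue", "unknown"), app.get("user", "unknown"))
        entry = usage[key]
        entry["jobs"] += 1
        entry["memorySeconds"] += app.get("memorySeconds", 0)
        entry["vcoreSeconds"] += app.get("vcoreSeconds", 0)
    return dict(usage)
```

A periodic job could call `aggregate_usage(fetch_apps(...))` and append the result to durable storage. Because the API drops old entries, the job would also need to deduplicate applications by id across runs, which this sketch does not do.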

@xudifsd
Member

xudifsd commented Mar 26, 2019

YARN's /ws/v1/cluster/scheduler API already has the info we need from /ws/v1/cluster/apps and is easier to consume. We can implement this by scraping the /ws/v1/cluster/scheduler API in yarn-exporter and showing the usage graph in Grafana.
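
A rough sketch of the conversion step, assuming the capacity scheduler's response layout (`scheduler.schedulerInfo.queues.queue[]`, with `queueName`, `resourcesUsed`, and `absoluteUsedCapacity` per queue). The metric names and the helpers `iter_queues` and `render_metrics` are hypothetical and are not taken from the existing yarn-exporter. The output uses the Prometheus text exposition format.

```python
def iter_queues(node):
    """Yield every queue dict nested under a schedulerInfo or parent-queue node."""
    for queue in (node.get("queues") or {}).get("queue") or []:
        yield queue
        yield from iter_queues(queue)  # parent queues nest their children


def render_metrics(payload):
    """Render per-queue usage as Prometheus text-format sample lines."""
    info = payload["scheduler"]["schedulerInfo"]
    lines = []
    for queue in iter_queues(info):
        name = queue.get("queueName", "unknown")
        used = queue.get("resourcesUsed") or {}
        samples = {
            "yarn_queue_used_memory_mb": used.get("memory", 0),
            "yarn_queue_used_vcores": used.get("vCores", 0),
            "yarn_queue_absolute_used_capacity": queue.get("absoluteUsedCapacity", 0),
        }
        for metric, value in samples.items():
            lines.append(f'{metric}{{queue="{name}"}} {value}')
    return "\n".join(lines) + "\n"
```

Since Prometheus would store each scrape as a time series, a Grafana panel could graph these per-queue gauges directly. Reports covering longer periods still depend on the retention and persistence concerns raised earlier in the thread.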
