
Performance app API / UI #141

Closed · ayrat555 opened this issue Sep 29, 2020 · 17 comments

@ayrat555
Contributor

Part of #59
WIP PR omgnetwork/elixir-omg#1745 (adds deposit tests)

This issue is open for discussions.

The API I have in mind:

  • endpoint: POST /api/v1/run_test
  • params: a map with run_config and test-specific chain_config, for example (see the request sketch below):
%{
  chain_config: %{
    token: token,
    amount: amount
  },
  run_config: %{
    tps: 1,
    period_in_seconds: 5
  }
}
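For concreteness, a request against this endpoint could look roughly like the sketch below; the host, the token value, and the use of HTTPoison/Jason are illustrative assumptions, not part of the proposal.

  # Hypothetical client call for POST /api/v1/run_test (host and params are illustrative).
  body =
    Jason.encode!(%{
      chain_config: %{token: "0x0000000000000000000000000000000000000000", amount: 1},
      run_config: %{tps: 1, period_in_seconds: 5}
    })

  HTTPoison.post!("http://localhost:4000/api/v1/run_test", body, [
    {"content-type", "application/json"}
  ])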

Also, I'm planning to add UI which will include:

  • status of the test
  • its params
  • progress
@ayrat555
Contributor Author

@InoMurko @unnawut @boolafish does it look good?

@InoMurko
Contributor

afaik, the perf tests have more than one test, so how do you select a test?
what response does the api give?

is there a test running already? as this is a deployed application, someone could be using it already (or it could still be running a previous test).

how do you plan to check the status of the test? the progress of a test?

@ayrat555
Contributor Author

ayrat555 commented Sep 29, 2020

afaik, the perf tests have more than one test, so how do you select a test?

the name of the test should be passed

what response does the api give?

it will return a 201 status and the id of the test on success, and 422 or 500 on error

is there a test running already? as this is a deployed application, someone could be using it already (or it could still be running a previous test).

I didn't think about that. I can check whether a test with the same parameters is already running.

how do you plan to check the status of the test? the progress of a test?

I was planning to periodically save the state of the test to a Postgres record

@InoMurko
Contributor

InoMurko commented Sep 29, 2020

Okay, as you can see, your initial post does not have all the information. We need a full API spec for the flows we want to support.
For example:

GET: List all tests
/api/v1/tests
Request: /
Response:

[
  {
    description: "blabla",
    id: 55
  },
  {
    description: "blabla",
    id: 666
  },
  ...
]

GET: Test status
/api/v1/tests/666
Response:

{
  status: "running",
  target_environment: "no idea, brainstorming"
}

Now it's your assignment to figure out which flows we need to support; that's why I said it would be useful to talk to @boolafish and @unnawut. I don't mean they should give you a full spec! But ask what the current pipeline looks like and whether they have any special requests. I think it's more or less up to you how you design this, but it's up to us to review it based on what you think we should do.

Once you have these APIs and flows, it's easier to imagine what the frontend should look like and what backend you need to build.
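To show how one of the example flows above could map to code, here is a minimal Phoenix controller sketch; the PerfWeb namespace, controller name, and hard-coded data are assumptions for illustration, not a spec.

  # Hypothetical controller for GET /api/v1/tests (data is hard-coded for the sketch).
  defmodule PerfWeb.TestController do
    use Phoenix.Controller

    def index(conn, _params) do
      # In a real implementation this list would be read from storage.
      tests = [%{id: 55, description: "blabla"}, %{id: 666, description: "blabla"}]
      json(conn, tests)
    end
  end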

@boolafish

let's have a discussion call for this. I would actually like to sync up from my side as well

@boolafish

boolafish commented Sep 30, 2020

A few topics from my side:

  1. Requirement for release pipeline integration
    • basically, the main question is how to hook back into the release pipeline with the perf test result. I think right now it might be 2 webhooks: perf ends up calling either a success webhook or a failure webhook (see the sketch after this list).
    • we need a way to cancel the tests. Otherwise, if we decide to skip a release (e.g. a hot fix), we do not want a surprise hook-back trigger after some long hours.
  2. API security concern. Just a general service concern: if we are doing it this way, we need authentication too, I guess.
  3. Auto scaling/isolation on infra, and allowing multiple test triggers. For the service to be useful long term, I think we need to ensure it always has enough resources to run the tests. I found an interesting article whose approach we might be able to copy: https://medium.com/google-cloud/scale-your-kubernetes-cluster-to-almost-zero-with-gke-autoscaler-9c78051cbf40. However, this means the backend would need to run perf as a Kubernetes job (which is the only scalable way I know of, for now, for long-running jobs), I think.
  4. Alternative competing idea: run directly with the release pipeline tool. For configuration of the tests and visualization, the pipeline tooling can already show you what is running with what config. The missing part is only the current status of the test. But at the same time, since we still need to monitor most metrics from Datadog, it might be simpler to just put the current status of the tests (like how many iterations have been run for each session) on a Datadog dashboard.
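For item 1, the hook-back could be as simple as the sketch below; the module name, webhook URLs, and payload shape are assumptions for illustration only.

  # Hypothetical call-back to the release pipeline once a perf run finishes.
  defmodule Perf.PipelineNotifier do
    def notify(result, run_id) do
      url =
        case result do
          :passed -> "https://ci.example.com/webhooks/perf-success"
          :failed -> "https://ci.example.com/webhooks/perf-failure"
        end

      HTTPoison.post!(url, Jason.encode!(%{run_id: run_id}), [
        {"content-type", "application/json"}
      ])
    end
  end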

@InoMurko
Contributor

I think I've put all the requirements in the parent issue.

@boolafish

ah...okay, add a comment there to link back

@ayrat555
Contributor Author

ayrat555 commented Sep 30, 2020

- Test modules

Currently, I'm using chaperon with SpreadAsync. It calls a function at a given rate over a given interval of time. Example:

  # Enclosing scenario module added so the snippet compiles (module name is illustrative);
  # it assumes `use Chaperon.Scenario` imports the session helpers used below,
  # as in the existing load tests.
  defmodule LoadTest.Scenario.Deposits do
    use Chaperon.Scenario

    def run(session) do
      tps = config(session, [:run_config, :tps])
      period_in_seconds = config(session, [:run_config, :period_in_seconds])

      total_number_of_transactions = tps * period_in_seconds
      period_in_mseconds = period_in_seconds * 1_000

      session
      # spread `create_deposit` calls evenly over the period
      |> cc_spread(
        :create_deposit,
        total_number_of_transactions,
        period_in_mseconds
      )
      # wait for all spawned `create_deposit` tasks to finish
      |> await_all(:create_deposit)
    end
  end

I took a look at the implementation. SpreadAsync starts all tasks up front (each with its own delay), so there are a lot of idle tasks in memory. My concern is that the VM may not survive a very large number of idle tasks: 3000 tasks a second over 10 hours = 3000 * 60 * 60 * 10 = 108_000_000 tasks. If it turns out to be a problem, I'll implement an ad-hoc solution that spawns tasks only as they are needed to keep the required tps.

The test module will accept the following params:

  • test-specific params, i.e. amount to transfer, token addresses, etc.
  • test run params: tps, contract addresses, geth endpoint

Another idea: instead of test run params, we could hardcode the test environments in the application (for example, circle ci - geth url, contracts, tps; staging - ...) and only pass the environment name to the test.
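If we went the hardcoded-environment route, the config could look something like the sketch below; the app name, keys, and values are illustrative assumptions.

  # config/config.exs (hypothetical) - per-environment run params, selected by name at test start.
  import Config

  config :perf, :environments,
    circleci: %{geth_url: "http://localhost:8545", contracts: %{}, tps: 1},
    staging: %{geth_url: "https://geth.staging.example.com", contracts: %{}, tps: 5}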

- Monitoring

The status of the run will be persisted in 2 places:

  1. Permanent storage: a postgres table. It will hold all the required info about a test run (see the schema sketch after this list):
  • id [uuid]
  • name [string]
  • state [enum] (running/finished/canceled)
  • status [enum] (failed/passed/pending)
  • params [jsonb] - params used to start the test
  • aggregated_data [jsonb] - averaged response time, number of errors per type
  • process_pid [string] - pid of the monitoring process if the test is running (see the next step)
  • created_at
  • updated_at
  • finished_at
  2. Temporary storage: a monitoring process which will aggregate info about the current test run. For every running test, there will be a monitoring process.
  • The monitoring process will trigger test runs
  • Test tasks will hold a reference to the monitoring process and notify it of the result (success, or failure with an error description).
  • From time to time, the monitoring process will dump the current status and aggregated data of the test to the postgres table
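The permanent-storage table could translate into an Ecto schema roughly like the sketch below; the module and table names, and the string-backed enums, are assumptions for illustration.

  # Hypothetical Ecto schema for the test run table described above.
  defmodule Perf.TestRun do
    use Ecto.Schema

    @primary_key {:id, :binary_id, autogenerate: true}
    schema "test_runs" do
      field(:name, :string)
      field(:state, :string)          # running / finished / canceled
      field(:status, :string)         # failed / passed / pending
      field(:params, :map)            # params used to start the test (jsonb)
      field(:aggregated_data, :map)   # averaged response time, error counts (jsonb)
      field(:process_pid, :string)    # pid of the monitoring process, if running
      field(:finished_at, :utc_datetime)

      timestamps(inserted_at: :created_at)  # created_at / updated_at
    end
  end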

- UI

Currently, I have five pages in mind:

  1. /test_runs - a paginated list of all test runs sorted by creation date. Each row will contain:
  • test id
  • test name
  • test state
  • test status

I think it may be possible to use LiveView to load new test runs as they appear in the DB, but that's just a UX improvement and I don't think it has a high priority.

  2. /test_runs/:id - shows aggregated info about the test run with the specified id.

It will show all the fields from the postgres table. The monitoring process will broadcast an event when dumping test data to the DB, so LiveView can update this page in real time.

  3. /test_runs/new - new run creation form

The form will contain:

  • a test selector
  • params for the test

  4. /api_tokens - lists all API tokens. See the security section.

  5. /api_tokens/new - creates a new API token. See the security section.

- API

If I understand correctly, the API will be used only to trigger test runs from circle ci, so I suggest adding only two endpoints (a router sketch follows this list):

  1. POST /api/v1/test_runs - triggers a new test run

parameters:

  • test configuration: a map with test-specific params
  • environment or test run params: see the first section
  • test_key (branch)

result:

  • status 201 - id: the id of the test
  • status 422 - errors: a list of validation errors
  • status 500

At any point in time, only one test with a given test key can be running; other runs with the same key will be cancelled. On test finish, it can send a message to slack about the test.

  2. POST /api/v1/test_runs/cancel/:test_key - cancels the test with the specified test_key

returns:

  • status 200 on success
  • status 404 if the test is not found
  • status 500 on server error
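A minimal Phoenix router sketch for these two endpoints could look like this; the PerfWeb namespace and controller names are assumptions for illustration.

  # Hypothetical router wiring for the two API endpoints proposed above.
  defmodule PerfWeb.Router do
    use Phoenix.Router
    import Plug.Conn
    import Phoenix.Controller

    pipeline :api do
      plug(:accepts, ["json"])
    end

    scope "/api/v1", PerfWeb do
      pipe_through(:api)

      post("/test_runs", TestRunController, :create)
      post("/test_runs/cancel/:test_key", TestRunController, :cancel)
    end
  end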

- Security

I think JWT tokens can be generated for every API user of the application. As for the UI pages, I think a hardcoded login/password will suffice?

@boolafish

boolafish commented Oct 1, 2020

SpreadAsync starts all tasks up front (each with its own delay), so there are a lot of idle tasks in memory
...3000 tasks a second over 10 hours = 3000 * 60 * 60 * 10 = 108_000_000 tasks.

From the description, it seems the approach there and the current approach implemented for the childchain transaction test are not too different. One uses concurrent sessions and the other uses concurrent tasks with idle time to achieve the target TPS. Sessions, of course, have higher overhead. But I believe our current approach does not need to spawn that many tasks/processes, since it only needs one per concurrent session and then relies on iteration/looping. Though I am not sure whether there is (too much of) a performance implication or not.

But at the same time, another concern I have is hardware when multiple tests are triggered at the same time. If multiple tests are running on the same hardware, they will compete with each other for resources.

@ayrat555
Contributor Author

ayrat555 commented Oct 5, 2020

@boolafish I created a small library that uses a reasonable number of processes: https://github.com/ayrat555/hornet. Each process periodically executes the given function, and the number of processes is increased only if the required rate cannot be achieved otherwise (a generic sketch of the idea is below).
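The underlying idea (not hornet's actual API) can be sketched in a few lines: a pool of workers, each calling the function on a fixed period, with workers added only when the measured rate falls short of the target. The module below is a simplified illustration, not the library's implementation.

  # Simplified sketch of the fixed-rate worker idea; each worker yields roughly
  # 1000 / period_ms calls per second, so `workers * (1000 / period_ms)` should
  # cover the target rate (more workers are needed if the function runs slowly).
  defmodule RateWorkersSketch do
    def start(func, workers, period_ms) do
      for _ <- 1..workers do
        spawn(fn -> loop(func, period_ms) end)
      end
    end

    defp loop(func, period_ms) do
      func.()
      Process.sleep(period_ms)
      loop(func, period_ms)
    end
  end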

@InoMurko
Contributor

InoMurko commented Oct 5, 2020

What is the purpose of hornet long term? Is it to replace chaperon?

@ayrat555
Contributor Author

ayrat555 commented Oct 5, 2020

I think chaperon cannot be used for long-running performance tests because it spawns too many processes. I wasn't planning to replace chaperon with it. I'm only planning to use hornet for the performance app issue, which needs a reasonable number of processes because, I think, there will be multiple simultaneous test runs.

I will leave the current tests as they are. Or should they be re-written?

@InoMurko
Contributor

InoMurko commented Oct 5, 2020

It's up to you. I just thought that perhaps building a perf framework is a bit too much (I'm not sure how it integrates with the current tests?) and that chaperon could be adjusted.

@boolafish

boolafish commented Oct 5, 2020

@ayrat555 Do you mind giving a brief introduction to what magic your lib is doing when scheduling the workers?

Just curious, have we verified that:

  1. The VM indeed cannot handle such an amount of tasks. Or is it possible to just run on a better instance (I guess with more memory, as idle tasks still seem to need some memory)?
  2. Do we need to run that amount: 3000 tasks a second over 10 hours = 3000 * 60 * 60 * 10 = 108_000_000 tasks? Even Visa only has an average TPS < 2000. I'm not sure whether we want to run endurance tests close to our peak TPS, or whether we should run a shorter period at peak TPS (e.g. 30 min) for business to claim our performance, and use something like 2~10x production load as the long-running endurance test instead.

@ayrat555
Contributor Author

ayrat555 commented Oct 6, 2020

The VM indeed cannot handle such an amount of tasks. Or is it possible to just run on a better instance (I guess with more memory, as idle tasks still seem to need some memory)?

I was experimenting with it today. If the number is too big (6000-10000 tps over 24h), it just fails; in other cases memory usage increases over time.
rate: 1_000_000, interval: seconds(60*60*24) ~ 11.6 tps

Do you mind giving a brief introduction to what magic your lib is doing when scheduling the workers?

My library starts a fixed number of processes which execute the function periodically. Over time, it adjusts the number of processes to keep up with the given rate.

For example, say you want to execute a function at a rate of 5 operations per second. You can do it by setting the params: start_period: 200, tps: 5, func: func.
start_period - every process will execute the function periodically, once per start_period,
so hornet will start 1000 / 200 = 5 processes. If the operation cannot be completed in 200 ms, hornet will start increasing the period and the number of processes.

@ayrat555
Contributor Author

ayrat555 commented Oct 8, 2020

After some discussion in slack (https://omgnetworkhq.slack.com/archives/CV8DJPZ9V/p1602144266169700), my thoughts:

  1. I think test runs can be triggered using elixir scripts executed with mix run. Parameters for the test can be passed as command-line args (see the sketch after this list). I will need a list of the parameters that should be configurable during test runs.
  2. A separate docker container should be created for the perf project, so it can be used in a k8s job.
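A minimal sketch of such a script, assuming a hypothetical file name and flag names, could parse its parameters with OptionParser:

  # scripts/run_test.exs (hypothetical) - run with:
  #   mix run scripts/run_test.exs --test deposits --tps 5 --period-in-seconds 60
  {opts, _argv, _invalid} =
    OptionParser.parse(System.argv(),
      strict: [test: :string, tps: :integer, period_in_seconds: :integer]
    )

  IO.inspect(opts, label: "run params")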
