x-pack/metricbeat/module/openai: Add new module #41516
base: main
Conversation
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Getting hit by this error: #41174 (comment), and hence the CI is failing. Everything else is okay.
To continue with my testing and to avoid it, see this: https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-template.html. This has unblocked me for now, but we definitely need a fix for this.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I've explained the complicated collection mechanism in the PR description itself. The rest is self-explanatory from the code. Please let me know if anything needs further clarification.
x-pack/metricbeat/module/openai/usage/usage_integration_test.go (outdated, resolved)
```json
    }
  ],
  "ft_data": [],
  "dalle_api_data": [],
```
It'd be good to have data for each data set
Yeah, I tried generating ft (fine-tuning) data, but it doesn't seem to work. Since OpenAI keeps this API undocumented, I couldn't find a single source with any samples; I'm not even sure they populate this field in the response of this particular endpoint. For dalle_api_data, I'll add it.
x-pack/metricbeat/module/openai/usage/usage_integration_test.go (outdated, resolved)
You forgot to revert the changes you made to the default config file
return fmt.Errorf("error making request: %w", err) | ||
} | ||
defer resp.Body.Close() | ||
|
||
if resp.StatusCode != http.StatusOK { | ||
return fmt.Errorf("error response from API: %s", resp.Status) | ||
body, _ := io.ReadAll(resp.Body) | ||
return fmt.Errorf("error response from API: status=%s, body=%s", resp.Status, string(body)) |
Be careful here not to leak sensitive information. I usually log the body in its own log line, without including it in the error message.
I was trying this just this morning. I found one case where logging the body is not ideal: for invalid API keys, the body echoes back the API key that failed to authenticate with OpenAI. So if there's a typo when typing the API key, most of the key is exposed. I'll remove this in order to avoid any sensitive info leakage.
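A minimal sketch of the direction discussed here, assuming a `logp`-style sugared logger is available; `checkResponse` and its signature are illustrative, not the module's actual code:

```go
// Sketch only: keep the raw body out of the returned error; if anything about
// it is needed for debugging, log a non-sensitive summary separately.
func checkResponse(logger *logp.Logger, resp *http.Response) error {
	if resp.StatusCode == http.StatusOK {
		return nil
	}
	if body, err := io.ReadAll(resp.Body); err == nil {
		// The body can echo back an invalid API key, so log its length
		// (or a redacted form) rather than its contents.
		logger.Debugw("non-OK response from usage API", "status", resp.Status, "body_len", len(body))
	}
	return fmt.Errorf("error response from API: %s", resp.Status)
}
```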
Sorry, I spotted a wee issue in how you create the HTTP client and measure how long the request takes; as soon as it's fixed, it's good to go.
```go
err := c.Ratelimiter.Wait(req.Context())
waitDuration := time.Since(start)
if err != nil {
	return nil, err
}
```
[NIT|Go convention]
```diff
-err := c.Ratelimiter.Wait(req.Context())
-waitDuration := time.Since(start)
-if err != nil {
-	return nil, err
-}
+err := c.Ratelimiter.Wait(req.Context())
+if err != nil {
+	return nil, err
+}
+waitDuration := time.Since(start)
```
```go
var client = http.DefaultClient
client.Timeout = timeout
```
[Blocker]
You should not change the default client; you should create a new one. `var client = http.DefaultClient` makes `client` reference `http.DefaultClient`, so any change to `client` takes effect for everyone using `http.DefaultClient`.
```diff
-var client = http.DefaultClient
-client.Timeout = timeout
+client := http.Client{Timeout: timeout}
```
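A tiny self-contained illustration of why mutating the shared default is risky (example code, not from the PR):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Aliasing the shared default client and mutating it...
	client := http.DefaultClient
	client.Timeout = 5 * time.Second

	// ...changes the timeout for every user of http.DefaultClient in the
	// process, including plain http.Get and http.Post calls.
	fmt.Println(http.DefaultClient.Timeout) // prints 5s, not the zero value

	// A dedicated client keeps the timeout local to this module.
	own := &http.Client{Timeout: 5 * time.Second}
	_ = own
}
```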
This is a very good point. Thanks.
@shmsr: Started reviewing the module. Can you please link the document where this approach and the metrics we are collecting were decided?
As it is a private repo, I've reverse-linked the issue as per the recommendation. The research there also shows which API to use. The decision on how to process the metrics (more implementation-specific details) is described here in the description itself: #41516 (comment); @muthu-mps and I decided it on a call.
Thanks @AndersonQ. Very helpful review comments; appreciate it! ✨
"api_key_redacted": null, | ||
"api_key_type": null, | ||
"email": null, | ||
"n_cached_context_tokens_total": 0, |
@shmsr, @ishleenk17 - Are we okay with the field names? IMO, updating the field names to something like `n_context_tokens_total` -> `tokens.total` or `tokens_total` would help usability. But it is also good to keep them as-is to relate the ES fields to the OpenAI API metrics. WDYT?
Yeah, we should keep the fields more readable and in line with the other LLM integrations.
I was thinking of keeping it like this here and changing the names in the ingest pipelines in integrations. What do you think?
"time" | ||
) | ||
|
||
type Config struct { |
Although not necessary, the name Config sounds a little vague and doesn't clearly explain its purpose. Consider renaming it for clarity.
Do you have any suggestion? `Config` is named like that in many modules, so I kept it the same. Search for `type Config struct` (case-insensitive) in metricbeat/module/*.
```go
if len(parts) != 2 {
	continue
}
headersMap[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
```
Nit: Extract them separately for clarity.
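i.e., something along these lines (just a sketch of the nit):

```go
// Name the key and value before inserting, instead of indexing parts inline.
key := strings.TrimSpace(parts[0])
value := strings.TrimSpace(parts[1])
headersMap[key] = value
```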
"project_name": usage.ProjectName, | ||
"request_type": usage.RequestType, | ||
"n_cached_context_tokens_total": usage.NCachedContextTokensTotal, | ||
}, |
Does this API provide a division of input and output tokens? If yes, we can include that as well.
I didn't get this. Can you please elaborate?
```go
	}
}
if c.Timeout <= 0 {
	errs = append(errs, errors.New("timeout must be greater than 0"))
```
Can this happen, given that we always have a default value for it?
Ideally, no, because of the default. I just kept it there as it's a one-time check, and it avoids issues if someday someone changes the default to 0 (or lower).
Oh yes, also in the case where the user has configured the timeout as 0 or lower.
```go
APIURL:  "https://api.openai.com/v1/usage",
Timeout: 30 * time.Second,
RateLimit: &rateLimitConfig{
	Limit: ptr(60),
```
What defines these limits? Could you point me to the documentation based on which these defaults were chosen?
The timeout is a sane default, and the rate limit was decided based on testing. If you look at the meta ticket, you'll find an issue there where we researched the rate limits of this API. Not linking it here as it's private; I can share it over Slack.
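For context: if the limiter behind `Ratelimiter` is `golang.org/x/time/rate` and the limit of 60 is treated as requests per minute (both of which are assumptions here, not confirmed by the snippet), the default would translate to something like:

```go
// Sketch: 60 requests per minute => roughly one token per second, burst of 1.
limiter := rate.NewLimiter(rate.Every(time.Minute/60), 1)

// Each request blocks until a token is available or the context is cancelled.
if err := limiter.Wait(req.Context()); err != nil {
	return nil, err
}
```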
```go
}

// Clear removes all state files by deleting and recreating the state directory
func (s *stateStore) Clear() error {
```
If our intent here is to remove the files, since we are recreating the directory anyway, should we delete just the files instead? It might avoid potential permission issues or race conditions. WDYT?
Eg:
```go
files, err := os.ReadDir(s.Dir)
if err != nil {
	return fmt.Errorf("reading state directory: %w", err)
}
for _, file := range files {
	if err := os.Remove(path.Join(s.Dir, file.Name())); err != nil {
		return fmt.Errorf("removing state file: %w", err)
	}
}
```
Actually, when I was writing this I was designing it as a package that could be reused later, and this method was intended to completely reset the state. The earlier plan was to also add subdirectories inside the state folder in case we add a new dataset in the future; for now we just have "usage", but we might add more.
But as it just resides in this package, we are not even using this method; it is unused, so we could even remove it.
That's why it is this way: it's a complete reset of the state.
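For reference, the "complete reset" described above could be as small as the following sketch (the permission bits are illustrative, and this is not necessarily the PR's implementation):

```go
// Sketch: drop the whole state directory, subdirectories included, then
// recreate it empty.
func (s *stateStore) Clear() error {
	if err := os.RemoveAll(s.Dir); err != nil {
		return fmt.Errorf("removing state directory: %w", err)
	}
	if err := os.MkdirAll(s.Dir, 0o755); err != nil {
		return fmt.Errorf("recreating state directory: %w", err)
	}
	return nil
}
```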
```go
	return nil, errors.New("empty path provided")
}

store, err := newStateStore(storePath)
```
Do we need to check if the storePath indeed exists?
We are creating the storePath. Part of it is already there, and the rest we create with MkdirAll.
See this: `newStateManager(paths.Resolve(paths.Data, path.Join("state", base.Module().Name(), base.Name())))`. `paths.Data` is supplied by the Agent using the `--paths.data` flag. So, even if the directory doesn't exist, we create it.
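A sketch of that flow (the directory permissions here are illustrative, not taken from the PR):

```go
// Sketch: resolve the state directory under the configured data path and
// create any missing parents, so a fresh data directory works out of the box.
storePath := paths.Resolve(paths.Data, path.Join("state", base.Module().Name(), base.Name()))
if err := os.MkdirAll(storePath, 0o755); err != nil {
	return nil, fmt.Errorf("creating state directory: %w", err)
}
```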
Proposed commit message
Implement a new module for OpenAI usage collection. This module operates on https://api.openai.com/v1/usage (by default; it is also configurable for proxy URLs, etc.) and collects the limited set of usage metrics emitted by the undocumented endpoint.

Example of how the usage endpoint emits metrics:
Given timestamps `t0`, `t1`, `t2`, ... `tn` in ascending order:
- `t0` (first collection):
- `t1` (after new API usage):
- `t2` (continuous collection):

and so on.
Example response:
As soon as the API is used, usage shows up after a while. So, if the module collects in real time, and multiple times a day at that, it collects duplicates, which is bad for storage as well as for analytics of the usage data.
It's better to collect `time.Now() (in UTC) - 24h` so that we get a full usage collection of the past day (in UTC), which avoids duplication. That's why I have introduced a `realtime` config option and set it to `false`: with collection delayed by 24h, we now get daily data. `realtime: true` works like any other normal collection where metrics are fetched at set intervals. Our recommendation is to keep `realtime: false`.

As this is a Metricbeat module, we do not have an existing package that supports storing a cursor. So, to avoid pulling already-pulled data, timestamps are stored per API key. The logic for how the state is stored is commented in the code. We are using new custom code to store the state so we can keep a cursor and begin from the next available date.
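A rough sketch of the per-API-key cursor described above; the file layout, hashing scheme, and helper names are illustrative assumptions, not the PR's exact implementation:

```go
// Sketch: persist the last collected day per API key, keyed by a hash so the
// key itself never touches disk.
func cursorPath(stateDir, apiKey string) string {
	sum := sha256.Sum256([]byte(apiKey))
	return path.Join(stateDir, hex.EncodeToString(sum[:])+".state")
}

func saveCursor(stateDir, apiKey string, day time.Time) error {
	return os.WriteFile(cursorPath(stateDir, apiKey), []byte(day.UTC().Format("2006-01-02")), 0o600)
}

func loadCursor(stateDir, apiKey string) (time.Time, error) {
	b, err := os.ReadFile(cursorPath(stateDir, apiKey))
	if err != nil {
		return time.Time{}, err
	}
	return time.Parse("2006-01-02", strings.TrimSpace(string(b)))
}
```

With `realtime: false`, collection would then walk from the stored day up to `time.Now().UTC().Add(-24 * time.Hour)`, persisting the cursor after each successfully collected day.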
Checklist
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
.Author's Checklist
How to test this PR locally