[WorkplaceAI] Add Google Drive data source and connector #250677
Apmats merged 17 commits into elastic:main
Conversation
- Add Google Drive data source with OAuth and file extraction workflows
- Refactor generateWorkflows to use ConnectorReference for type-safe connector matching
- Add i18n translations to stackConnector descriptions
- Remove accidental trailing newline
- Fix misleading comment about quote escaping
- Remove mimeType param, improve query description
- Extract Google Drive API base URL to constant
- Simplify ConnectorReference comment
- Revert unnecessary comment edit
- Add connector processing order logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
mattnowzari
left a comment
This PR looks pretty solid overall; I had a couple of questions about how we're doing content extraction and processing currently, as it'd be good to ensure the SPO connector's way of handling processing is aligned with GDrive.
```ts
      ? typedInput.mimeType || DEFAULT_EXPORT_MIME_TYPE
      : fileMetadata.mimeType,
    size: fileMetadata.size,
    content: base64Content,
```
Clarifying for my own understanding - we are returning the content in base64 because we're assuming it'll get shipped to a processor of some sort, correct?
I think I'd seen some conversations in Slack that we may use the ingest pipeline processor (basically Tika) for the time being - do we know if this is going to be the case?
SPO connector will eventually need something like this too for PDFs that are saved in drives, it'd be cool to align that connector with whatever we do here for GDrive
Yes - base64 fits with both the attachment processor and Jina Reader which we use here.
> I think I'd seen some conversations in Slack that we may use the ingest pipeline processor (basically Tika) for the time being - do we know if this is going to be the case?
We've had multiple discussions. What I did here: you can bring your own Jina key and use the Jina Reader, or you can fall back to Tika (it's that simple; no consideration yet of which format gets parsed better by each option).
We can discuss, from both the core functionality and the UX perspective whether this makes sense.
Responding here to the other comment as well since it's quite related -
This API:
https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-ingest-simulate
allows you to run ingest pipelines without actually shipping anything for indexing. By using a pipeline with the attachment processor, we can use Tika on the ES side to extract content (with all the implications: we ship big chunks of data to ES, resource utilization there rises, and the attachment processor is meant for indexing, so it's not optimized for the real-time use we might want here...).
No placeholder or anything, this works now but probably not what we want to do long term. 😀
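For reference, a minimal sketch of what that looks like through the Elasticsearch JS client; the pipeline definition here is illustrative, not the exact one from this PR, and `base64Content` is assumed to come from the Drive download step:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const base64Content = '...'; // base64-encoded bytes from the Drive download step

// Run an attachment (Tika) pipeline via _ingest/pipeline/_simulate:
// nothing is indexed; we only get the extracted content back.
const response = await client.ingest.simulate({
  pipeline: {
    processors: [
      {
        attachment: {
          field: 'data',             // where the base64 payload lives
          target_field: 'attachment',
          remove_binary: true,       // don't echo the base64 back in the result
        },
      },
    ],
  },
  docs: [{ _source: { data: base64Content } }],
});

// Extracted text ends up at response.docs[0].doc._source.attachment.content
```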
```yaml
type: elasticsearch.request
with:
  method: POST
  path: /_ingest/pipeline/_simulate
```
I'm not familiar with the _simulate endpoint - does this workflow currently function as expected or is it simply a stand-in until some functionality regarding ingest pipelines and Workflows is implemented?
```ts
  pageToken?: string;
};

ctx.log.debug(`[google_drive.searchFiles] input: ${JSON.stringify(input)}`);
```
Any chance we can get the google_drive.searchFiles part from the logger's context when it's logged?
Intuitively, ctx.log should log the metadata such as connector name and subaction
My LLM tells me the connector type is actually already in the logger's context; the logger is created with the connector type here:
kibana/x-pack/platform/plugins/shared/actions/server/lib/action_executor.ts
Lines 418 to 419 in 70f00fe
So logs are already prefixed with something like [actions.google_drive].
To also include the action name automatically, we'd need a small change here
This would give us automatic [actions.google_drive.searchFiles] prefixing without manual strings.
I'd prefer to do this in a separate PR since it touches the connector infrastructure and would affect all connectors.
What do you think?
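A rough sketch of the kind of change meant here, assuming the executor derives child loggers via Kibana's `logger.get()` (the variable names are illustrative, not the actual action_executor.ts code):

```ts
import type { Logger } from '@kbn/logging';

declare const baseLogger: Logger;      // Kibana root logger (assumed)
declare const connectorTypeId: string; // e.g. 'google_drive'
declare const subAction: string;       // e.g. 'searchFiles'

// Today (roughly): a child logger scoped to the connector type, so entries
// are already prefixed like [actions.google_drive].
const connectorLogger = baseLogger.get('actions', connectorTypeId);

// Suggested: one more child scope per sub-action, giving
// [actions.google_drive.searchFiles] without manual string prefixes.
const subActionLogger = connectorLogger.get(subAction);
subActionLogger.debug('executing');
```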
Then might not even need to prefix/auto-prefix if it's in a logger's context?
Because if you can go into the obs cluster and, when looking over logs, add columns with action/subaction, you get the same without additional clutter. Or am I misunderstanding it?
If it's there, we can drop this prefixing altogether and it'd be cleaner and nicer.
Yeah, this manual prefixing is gone here. It's just that we're missing the "subAction" in the context here. The change I suggested would add that. I will do it in a separate PR (probably need an issue too).
```ts
const response = await ctx.client.get(`${GOOGLE_DRIVE_API_BASE}/files`, {
  params,
```
Does this part also log what endpoint was called with which params?
From what I understand no, not internally.
The debug logs I added capture the input and API params. These are at debug level intentionally since they may contain user queries. Do you think we need additional logging at a different level?
Also noticed I'm missing the same logging for the other actions, adding.
Even at debug level, if the project/deployment's log level is bumped up to DEBUG, Elastic will get those logs. We shouldn't capture this type of data in logs.
Hrm, okay. I'll remove. Broader question: do we never log any query that might be coming from a user, at any level across the stack, then? 🤔
Or is it just because we think this code will be primarily executed in elastic-managed envs?
Removing anyway, found a similar statement on the sharepoint connector too so removing that as well.
Also removed the error message from the thrown errors, since these might end up getting logged (I think they might be already). For malformed queries, GDrive currently doesn't include them in the message, but it does include file IDs, and a dumb LLM might pass user input as a file ID.
Depends on the PII and stuff, but without logging at least something we won't be able to troubleshoot.
Do you see problems @seanstory? I think we've logged things like that in the past - e.g. calls to Sharepoint Online APIs and such (without POST body, obviously, but with GET query present).
@artem-shelkovnikov our past connectors were just doing pagination, or had folder filters. But since these will power federated search, there's a lot less clarity to the user what will end up being sent to google or our logs.
Logging endpoints should be fine. Logging metadata query params should be fine. Generally logging error response codes and bodies should be fine. But anything in a POST body or any human-language query string can get iffy.
Yeah I could see how logging human queries is going to be a problem. I'm on board with logging things that are safe.
If we could reliably sanitize the query to drop filter values it'd be awesome, but it's likely a lot of work. If we don't sanitize, that's PII and we don't want that 😬
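A minimal sketch of logging that would fit those constraints: record the endpoint and the parameter keys, never human-language values (the helper is hypothetical, not something in this PR):

```ts
// Hypothetical helper: describe a request by its shape only, so query
// text and other potential PII never reach the logs.
function describeRequest(path: string, params: Record<string, unknown>): string {
  const keys = Object.keys(params).sort().join(', ');
  return `GET ${path} with params: [${keys}]`;
}

// Safe: reveals which params were sent, not what the user asked for.
console.log(describeRequest('/files', { q: 'quarterly report draft', pageSize: 20 }));
// -> "GET /files with params: [pageSize, q]"
```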
@artem-shelkovnikov @mattnowzari - after discussing with @seanstory, all work related to using the Jina Reader is dropped for now. This means I reverted all the changes related to registering multiple connectors for a data source, along with all the work to conditionally build the right workflow. This is now a very straightforward data source + connector I think, so it should be much easier to give the go-ahead to. I believe I addressed all your comments as well, so please have a fresh look.
@Apmats it looks like you're missing docs and CODEOWNERS changes. See the list of changed files in https://github.com/elastic/kibana/pull/248737/files for example. For next time, if you use this script at the start, it'll bootstrap all the right files for you.
```ts
};
const googleError = axiosError.response?.data?.error;
if (googleError) {
  throw new Error(`Google Drive API error: ${googleError.message}`);
```
Worth including the code too, just for extra debugging datapoints?
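Something like this, presumably, given that Drive API error payloads carry a numeric `code` alongside the `message`:

```ts
// Include the numeric code for an extra debugging datapoint.
throw new Error(`Google Drive API error ${googleError.code}: ${googleError.message}`);
```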
```ts
};

ctx.log.debug(`[google_drive.searchFiles] input: ${JSON.stringify(input)}`);
```
This logs the query, which we probably shouldn't do, as it could contain sensitive information.
```ts
  params.pageToken = typedInput.pageToken;
}

ctx.log.debug(`[google_drive.searchFiles] API params: ${JSON.stringify(params)}`);
const response = await ctx.client.get(`${GOOGLE_DRIVE_API_BASE}/files`, {
  params,
```
Even at debug level, if the project/deployment's log level is bumped up to DEBUG, Elastic will get those logs. We shouldn't capture this type of data in logs.
```ts
      }
    },
  },
},
```
I think it might be helpful to add a tool to get the current user's information. This might help a prompt like "Get my most recently added doc" be able to figure out who "my" refers to. And looks like you're effectively doing that in the test, so might as well expose. :)
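If such a tool were added, a sketch against the Drive v3 `about` endpoint might look like this (reusing the `ctx.client` and `GOOGLE_DRIVE_API_BASE` conventions from the rest of the connector):

```ts
// Hypothetical getCurrentUser action: Drive v3 exposes the authenticated
// user via GET /about with a fields selector.
const response = await ctx.client.get(`${GOOGLE_DRIVE_API_BASE}/about`, {
  params: { fields: 'user(displayName,emailAddress)' },
});
return response.data.user; // { displayName, emailAddress }
```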
This is a great point actually, but there's a shortcut in the Google Drive query DSL, and in the prompt we already tell (and hope) the LLM to write a query in it (you can do something like 'me' in owners and it does what you expect; it's a magic string).
I can adjust the prompts to highlight this if we think it's important - it saves a tool call and it saves also context from an extra tool description and tool call results.
The "recent" thing you said got me thinking about what other metadata gaps this has. As said on another response, search/list only return a slim set of fields (id, name, type, size, timestamps, link) — enough to find files but not enough to answer "who owns this?" or "who has access?".
I added a getFileMetadata tool that takes an array of file IDs and returns a curated set of fields including owners, last editor, permissions, sharing status, description, etc. Also added orderBy to searchFiles so "get my most recently created doc" can be properly served with 'me' in owners and orderBy: 'createdTime desc'.
I also got my LLM to beef up the search query description (LLM inception) with concrete syntax examples so dumber models are less likely to mess up the DSL.
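Concretely, a sketch of the files.list parameters that would serve that prompt, using standard Drive v3 query syntax (the exact field list here is illustrative):

```ts
// "Get my most recently created doc": 'me' in owners is the Drive DSL
// magic string, and orderBy sorts server-side so pageSize: 1 suffices.
const params = {
  q: "'me' in owners and trashed = false",
  orderBy: 'createdTime desc',
  pageSize: 1,
  fields: 'files(id, name, mimeType, createdTime, webViewLink)',
};
const response = await ctx.client.get(`${GOOGLE_DRIVE_API_BASE}/files`, { params });
```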
```ts
provider: EARSSupportedOAuthProvider.GOOGLE,
initiatePath: '/oauth/start/google_drive',
fetchSecretsPath: '/oauth/fetch_request_secrets',
oauthBaseUrl: 'https://localhost:8052', // update once EARS deploys to QA
```
I'm probably out of touch - but we shouldn't be hitting EARS yet, right? EARS needs the OAuth Authorization Code Flow, which isn't merged. We should be able to use the Data Sources with existing Auth forms from Connectors.
Yeah fair point, I was just following the pattern from the existing Notion and GitHub data sources, which both have the same oauthConfiguration block with localhost:8052. It's all placeholder config that nothing reads right now, as far as I understand. Happy to drop it entirely and add it back when the EARS flow actually lands. The connector already works with the standard bearer token form from the Connectors flyout.
> I was just following the pattern from the existing Notion and GitHub data sources which both have the same oauthConfiguration block
:/ my guess is this is vestigial from the "kitchen sink" work, and probably needs to be cleaned up. @lorenabalan @meghanmurphy1 any reason this should be there for notion/github?
@erikcurrin-elastic had insisted we include something about EARS in the initial implementation, but as Apostolos pointed out, this isn't being read or used anywhere yet. Happy to remove it as well until we integrate EARS with data sources; the design of it has changed anyway, so stuff like fetchSecretsPath doesn't make sense anymore, and I think some of the implementation details might actually be absorbed under the connectors architecture in the future.
Let's remove it, then. +1, when we add in support for Elastic-owned OAuth Apps, it'll be through the Connectors, not in our Data Sources.
Okay, cleaning up for the other connectors too.
@seanstory addressed everything discussed, cleaned up some logs from the sharepoint connector, and also cleaned up the EARS stuff in notion and github as well. I'll test everything again end to end for a few queries. @seanstory @artem-shelkovnikov @mattnowzari I'd appreciate a last look and ✅
alaudazzi
left a comment
Left a minor suggestion, otherwise LGTM.
```
Google Drive connectors have the following configuration properties:

Bearer Token
: A Google OAuth 2.0 access token with Google Drive API scopes. See [Get API credentials](#google-drive-api-credentials) for instructions.
```
```diff
- : A Google OAuth 2.0 access token with Google Drive API scopes. See [Get API credentials](#google-drive-api-credentials) for instructions.
+ : A Google OAuth 2.0 access token with Google Drive API scopes. Check the [Get API credentials](#google-drive-api-credentials) for instructions.
```
Replace visual cues with more accessible terms.
Thank you, applying.
Summary
Resolves https://github.com/elastic/search-team/issues/12165
This PR aims to add a Google Drive Data Source that uses a Google Drive KSC.
So we're adding the data source, and the workflow and tool building logic - we're also adding a KSC following the rest of the recent work done on connectors to third parties (Notion, Github).
A slight rework to allow support of multiple connectors per Data Source is also included.
On the Google Drive KSC:
As of now the Google Drive KSC supports only bearer token auth; testing was done locally with EARS. Service accounts aren't supported because, as far as I understand, we'd need a new auth type and would have to figure out where to do token refresh etc. (and I don't want to extend the scope of these changes).
If we think this is important we can figure it out however.
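For reference, a sketch of the missing piece using google-auth-library's JWT flow (illustrative of what service-account support would involve, not code from this PR):

```ts
import { JWT } from 'google-auth-library';

// The parsed service-account key file (assumption for this sketch).
declare const serviceAccount: { client_email: string; private_key: string };

// Service accounts mint short-lived access tokens from a signed JWT; the
// open question is where this token lifecycle would live in the connector.
const client = new JWT({
  email: serviceAccount.client_email,
  key: serviceAccount.private_key,
  scopes: ['https://www.googleapis.com/auth/drive.readonly'],
});
const { token } = await client.getAccessToken(); // refreshed when expired
```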
The KSC exposes 3 actions - list, search, and download. List allows hierarchical navigation, which might be important for certain flows, but search is probably the most important action here.
Download fetches a base64-encoded form of the document, exporting native Google suite docs as PDFs.
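The underlying Drive v3 calls for that behave roughly like this (standard Drive endpoints; `ctx.client` and the other names are assumed from the connector code quoted above):

```ts
// Native Google docs (Docs/Sheets/Slides) have no binary body and must be
// exported; everything else can be downloaded directly with alt=media.
const isNativeGoogleDoc = fileMetadata.mimeType.startsWith('application/vnd.google-apps.');
const response = isNativeGoogleDoc
  ? await ctx.client.get(`${GOOGLE_DRIVE_API_BASE}/files/${fileId}/export`, {
      params: { mimeType: 'application/pdf' },
      responseType: 'arraybuffer',
    })
  : await ctx.client.get(`${GOOGLE_DRIVE_API_BASE}/files/${fileId}`, {
      params: { alt: 'media' },
      responseType: 'arraybuffer',
    });
const base64Content = Buffer.from(response.data).toString('base64');
```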
On the Google Drive Data Source:
As mentioned, we introduce a rework of how data sources are modeled: they can now refer to multiple KSCs, each of which can be considered primary, required, or optional.
The Data Source depends on a Google Drive KSC and an optional Jina Reader one. It declares the Google Drive one as primary and the other as optional.
When building workflows, we consider whether a Jina Reader connector was set up - if it wasn't, then we instead use the simulate ingest pipeline functionality from ES to do document extraction.
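A hypothetical sketch of the shape this modeling implies; the field names are illustrative and may differ from the PR's actual ConnectorReference type (and per the discussion above, this multi-connector rework was later reverted):

```ts
// Illustrative only: a data source referencing multiple stack connectors,
// each with a role that drives which workflow variant gets generated.
type ConnectorRole = 'primary' | 'required' | 'optional';

interface ConnectorReference {
  connectorTypeId: string; // e.g. '.google_drive', '.jina_reader'
  role: ConnectorRole;
}

const googleDriveConnectors: ConnectorReference[] = [
  { connectorTypeId: '.google_drive', role: 'primary' },
  { connectorTypeId: '.jina_reader', role: 'optional' },
];
```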
On content extraction
Didn't spend too much time focusing on extraction quality at all.
Points that could be considered:
On simulate ingest pipeline
When Jina Reader isn't configured, the download workflow falls back to Elasticsearch's _ingest/pipeline/_simulate API with an attachment processor. This is done via the generic elasticsearch.request step which makes a direct HTTP call to the simulate endpoint. During development, a typed elasticsearch.ingest.simulate action was implemented using the auto-generation from the Elasticsearch OpenAPI spec, but I moved away from it to not complicate things - I don't know how useful these ingest actions will be in workflows, and the codegen added ~1100 lines for a single action.
On the rework to support multiple KSCs per data source
I think the reasoning makes sense; the implementation, especially in the UI/UX, I'm really not sure about - users are presented with all associated connector configs in the flyout, seeing this between optional connectors:
devs can customize this description and the skip-description as well to inform users of the implications.
Code in the FE was heavily vibe-coded by someone who doesn't know React, so I'm open to being told to throw everything away and try again (or handing it off 😁)
Also for consideration:
I'm really hoping for some feedback on this.
Single file downloads
We're handling only single files in the exposed workflows / tool. This is:
a) Because the Jina Reader connector exposes an API for only one file and we want to use that
b) Because `data.set` doesn't behave as expected in `foreach` loops, so it's not easy to aggregate the results from multiple connector calls
c) Less stressful about hitting memory problems!
Once the point in b) is addressed, we can extend the API of the workflows/tools generated to take a list of file IDs (so that the LLM agent won't struggle as much making 10 calls to fetch 10 files) and also potentially expose a parameter to do reranking and pruning of fetched docs.
Update: the fix is here: #250852. Once that gets merged, we can change the download interface here to take a list of files instead, and expose an optional rerank option as well.
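A hypothetical shape for that extended interface (names are illustrative, not from the PR):

```ts
// Illustrative only: batched download input once foreach aggregation works.
interface DownloadFilesInput {
  fileIds: string[];        // a batch of Drive file IDs instead of one per call
  rerank?: boolean;         // optionally rerank/prune the fetched documents
  exportMimeType?: string;  // e.g. 'application/pdf' for native Google docs
}
```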
Checklist
Check that the PR satisfies the following conditions.
Reviewers should verify this PR satisfies this list as well.
- `release_note:breaking` label should be applied in these situations.
- `release_note:*` label is applied per the guidelines
- `backport:*` labels.

Identify risks
Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.
Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.