Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic workspace discovery #442

Merged
merged 6 commits into from
Feb 1, 2021
Merged

Automatic workspace discovery #442

merged 6 commits into from
Feb 1, 2021

Conversation

mrnugget
Copy link
Contributor

@mrnugget mrnugget commented Jan 25, 2021

What?

This adds automatic workspace discovery to src-cli. It allows users to

  1. define where projects in large repositories live and
  2. run the campaign specs steps in those project folders, turning them into workspaces.

How?

Users define workspaces like so:

workspaces:
  - rootAtLocationOf: go.mod
    in: github.com/sourcegraph/sourc*

That means: in every repository that starts with github.com/sourcegraph/sourc projects have a go.mod at its root and those folders should be used as workspaces for the execution of campaign spec steps.

src-cli uses Sourcegraph search under the hood to search for the locations of the rootAtLocationOf file, which means it doesn't need to download the repository first and search the file system.

workspaces can also contain multiple definitions, matching different repositories (but a repository cannot be matched by multiple definitions):

on:
  - repositoriesMatchingQuery: repo:automation-testing$
  - repository: github.com/sourcegraph/sourcegraph

workspaces:
  - rootAtLocationOf: go.mod
    in: github.com/sourcegraph/sourcegraph*
  - rootAtLocationOf: package.json
    in: "*automation-testing"

Since multiple workspaces per repository means that multiple changesets will be produced in a single repository, the changesetTemplate.branch needs to use templating to avoid name clashes.

For that, users can access the template variable steps.path and use helper functions to generate a unique branch name per changeset. Example:

changesetTemplate:
  # [...]
  branch: ${{ join_if "-" "thorsten/workspace-discovery" (replace steps.path "/" "-") }}
  published: false
  commit:
    message: Automatic workspace discovery

(The join_if and the replace helpers are new. join_if joins the given list of strings, but ignoring the blank strings. replace is an alias for strings.ReplaceAll)

Users can, of course, also user other ways to generate a unique branch name per changeset. With outputs, for example:

steps:
  # [...]
  - run:  if [[ -f "package.json" ]]; then cat package.json | jq -j .name; fi
    container: jiapantw/jq-alpine:latest
    outputs:
      projectName:
        value: ${{ step.stdout }}

changesetTemplate:
  # [...]
  branch: thorsten/workspace-discovery-${{ outputs.projectName }}

Or, in combination:

steps:
  # [...]
  - run:  if [[ -f "package.json" ]]; then cat package.json | jq -j .name; fi
    container: jiapantw/jq-alpine:latest
    outputs:
      projectName:
        value: ${{ step.stdout }}

changesetTemplate:
  # If we have an `outputs.projectName` we use it, otherwise we append the path
  # of the workspace. If the path is emtpy (as is the case in the root folder),
  # we ignore it.
  branch: |
    ${{ if eq outputs.projectName "" }}
    ${{ join_if "-" "thorsten/workspace-discovery" (replace steps.path "/" "-") }}
    ${{ else }}
    thorsten/workspace-discovery-${{ outputs.projectName }}
    ${{ end }}

Details & Edge Cases

  • Repositories matching multiple workspace definitions produce an error.
  • If a repository is yielded by on and not matched by an workspaces.in: glob, the steps will be executed in its root folder.
  • If a repository is yielded by on and matches a workspace.in: glob, but there are no workspaces in it that contain the file in rootAtLocationOf then the steps won't be executed in the repository.
  • If a repository contains multiple workspaces where one is the subdirectory of another, then the steps will be executed in both and the parent directory also includes the sub directory. I think this is the most intuitive choice.

Dependency

This requires the addition of workspaces to the campaign spec schema, which means it requires changes to the Sourcegraph server.

The PR is here: https://github.com/sourcegraph/sourcegraph/pull/17757

What's not included

src-cli still downloads a complete archive of every matched repository, even if the steps should only be included in subdirectories. Only downloading archives of the workspace directories is something we should implement in the near future to make support for large monorepos better.

Full campaign spec to try this at home

There you go:

name: workspace-discovery
description: Automatic workspace discovery

on:
  - repositoriesMatchingQuery: repo:automation-testing$
  - repository: github.com/sourcegraph/sourcegraph

workspaces:
  - rootAtLocationOf: go.mod
    in: github.com/sourcegraph/sourcegraph*
  - rootAtLocationOf: package.json
    in: "*automation-testing"

steps:
  - run: "echo \"pwd: $(pwd)\nfiles: $(ls)\" >> message.txt"
    container: alpine:3
  - run:  if [[ -f "package.json" ]]; then cat package.json | jq -j .name; fi
    container: jiapantw/jq-alpine:latest
    outputs:
      projectName:
        value: ${{ step.stdout }}

changesetTemplate:
  title: Automatic workspace discovery
  body:  Automatic workspace discovery

  # If we have an `outputs.projectName` we use it, otherwise we append the path
  # of the workspace. If the path is emtpy (as is the case in the root folder),
  # we ignore it.
  branch: |
    ${{ if eq outputs.projectName "" }}
    ${{ join_if "-" "thorsten/workspace-discovery" (replace steps.path "/" "-") }}
    ${{ else }}
    thorsten/workspace-discovery-${{ outputs.projectName }}
    ${{ end }}
  published: false
  commit:
    message: Automatic workspace discovery

@mrnugget mrnugget force-pushed the mrn/workspace-discovery branch 2 times, most recently from 34edd53 to 987d310 Compare January 27, 2021 11:24
@mrnugget mrnugget changed the title DRAFT: automatic workspace discovery Add automatic workspace discovery Jan 28, 2021
@mrnugget mrnugget changed the title Add automatic workspace discovery Automatic workspace discovery Jan 28, 2021
@mrnugget mrnugget force-pushed the mrn/workspace-discovery branch from 241d657 to a03c060 Compare January 29, 2021 11:14
@mrnugget mrnugget marked this pull request as ready for review January 29, 2021 11:19
@mrnugget mrnugget requested a review from a team January 29, 2021 12:18
@mrnugget
Copy link
Contributor Author

Heads up reviewers: just noticed that I forgot to add a validation that produces an error and human readable message when multiple changesets with the same branch have been produced in the same repository. I'll add that, but that shouldn't stop anyone from reviewing this.

Copy link
Member

@eseliger eseliger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 🌟 :shipit:

defer rz.mu.Unlock()

rz.references -= 1
if rz.references == 0 && rz.fetcher.deleteZips {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this mean when using 1 worker thread, there would be no caching at all? and can there be the case where the workers are utilized like the following

[repo A:/path1]
[repo B]
[repo C]
[repo D]

=>

[repo A:/path2]
[repo B]
[repo C]
[repo D]

where the worker would close the zip for repo A at path1, and then need to refetch for path2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. You're right. I'll address in follow-up PR.

@mrnugget mrnugget force-pushed the mrn/workspace-discovery branch from 40c54da to 358684c Compare February 1, 2021 09:45
@mrnugget
Copy link
Contributor Author

mrnugget commented Feb 1, 2021

There are two things that I need to do:

I'll address those in follow-up PRs to make it easier for review.

@mrnugget mrnugget merged commit 588c7fa into main Feb 1, 2021
@mrnugget mrnugget deleted the mrn/workspace-discovery branch February 1, 2021 09:49
mrnugget added a commit that referenced this pull request Feb 1, 2021
This is a follow-up to #442 and ensures that changeset specs are not
getting silently lost by validating that multiple changeset specs in the
same repository have different branches.

I decided to make this a separate step _after_ the execution of the
steps so that users can leverage the cache. That allows them to change
the campaign spec and then rerun the command after they get this error,
vs. the execution being aborted after running into this error (if we'd
do the check inside executor).
mrnugget added a commit that referenced this pull request Feb 1, 2021
This is a follow-up to fix the issue discovered by @eseliger here: #442 (comment)

Short version: the previous implementation would only avoid deleting an
archive if there were *currently active tasks* holding references to it.
If tasks that need the same archive would execute sequentially, though,
the archive would be downloaded, deleted, downloaded again.

This here is a fix for the issue by first marking all repository
archives for later use and only once all marks have been turned into
references and those references have been closed is the archive deleted.
mrnugget added a commit that referenced this pull request Feb 2, 2021
* Check for branch duplicates after creating changeset specs

This is a follow-up to #442 and ensures that changeset specs are not
getting silently lost by validating that multiple changeset specs in the
same repository have different branches.

I decided to make this a separate step _after_ the execution of the
steps so that users can leverage the cache. That allows them to change
the campaign spec and then rerun the command after they get this error,
vs. the execution being aborted after running into this error (if we'd
do the check inside executor).

* Fix naming in duplicateBranchesErr
scjohns pushed a commit that referenced this pull request Apr 24, 2023
* Implement dynamic workspace discovery

* update schema and fix template helpers

* Add changelog entry

* Rename file

* Change naming

* Use strings.ReplaceAll
scjohns pushed a commit that referenced this pull request Apr 24, 2023
* Check for branch duplicates after creating changeset specs

This is a follow-up to #442 and ensures that changeset specs are not
getting silently lost by validating that multiple changeset specs in the
same repository have different branches.

I decided to make this a separate step _after_ the execution of the
steps so that users can leverage the cache. That allows them to change
the campaign spec and then rerun the command after they get this error,
vs. the execution being aborted after running into this error (if we'd
do the check inside executor).

* Fix naming in duplicateBranchesErr
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants