Discuss switching to a monorepo #192

Closed
zackkrida opened this issue Mar 17, 2022 · 8 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 💬 talk: discussion Open for discussions and feedback

Comments

@zackkrida
Member

zackkrida commented Mar 17, 2022

We likely need an RFC to evaluate the pros and cons of monorepo support. To ease the creation of such an RFC, I think it would be wise to first discuss the pros and cons of moving to one. I'll kick things off with some quick ideas:

Pros

  • Easier to run the entire stack locally (i.e., the catalog, API, and frontend all connected)
  • Could theoretically create an entire docker-compose stack used for production and local development (seems much easier said than done)
  • Simplified discovery of the project, only one repo to look at!
  • Easier to manage cross-repo issues, discussions, and milestones
  • Eliminate code maintenance of multi repo file syncing
  • Could develop a full-stack CI pipeline
  • Can move our project board to the repo, instead of the WordPress org, for easier access

Cons

  • Could potentially increase confusion for developers (debatable, as there are far fewer entry points to access the code)
  • Significant increases in the size of cloning the project, especially for contributors who only want to work on one part
  • CI scripts would need to check which project was modified and only run checks on that project
  • Release management (how do we have actions/deploys coupled to GitHub releases coordinated across multiple projects?)

zackkrida added the 🟧 priority: high, 🚦 status: awaiting triage, 🛠 goal: fix, 💻 aspect: code, and 💬 talk: discussion labels and removed the 🚦 status: awaiting triage label on Mar 17, 2022

@zackkrida
Member Author

@WordPress/openverse-developers can we discuss this async this week?

@sarayourfriend
Contributor

Some initial thoughts about the cons.

Could increase confusion for developers

We got some feedback from a community contributor that the current multiple repository approach is also confusing. A single repository with a single README that links out to the relevant sub-projects would make it easier to direct devs to the project without having to know exactly what kind of stuff they're looking for (JavaScript? Python but not Airflow? Python but Data stuff? etc).

Significant increases in the size of cloning the project, especially for contributors who only want to work on one part

The size of the dependencies for any of the repositories vastly outweighs the size of the codebases themselves, doesn't it? For Airflow and the API, the various Docker image dependencies are huge (several hundred MBs).

It does significantly increase the clone size from a statistical perspective, though, and probably from a practical perspective in some situations. After all, we'd basically be making a single repository out of three small-to-medium sized projects and one very small one (/openverse).

CI scripts would need to check which project was modified and only run checks on that project

and

Release management (how do we have actions/deploys coupled to GitHub releases coordinated across multiple projects?)

These seem less like "cons" to me and more like roadblocks to implementation. Once the problem is solved, it's solved forever (or at least until we need to make other large code infrastructure changes). It's not a long-term cost that we will continue to pay once monorepo-ization is finished.

The pros...

Could theoretically create an entire docker-compose stack used for production and local development (seems much easier said than done)

This would be amazing, especially sharing it with production. I think @rbadillap's ongoing work in the API to deploy it with docker will reveal whether this is a reasonable possibility.

I'm in favor of this change, but I don't want to trivialize the repository size considerations. There are things we can do to make the repository size more efficient and we can trim certain unused things from the history.

@zackkrida
Member Author

zackkrida commented Mar 22, 2022

Completely agree with everything you've said @sarayourfriend. I just talked to @dhruvkb about this a minute ago and we raised some similar points. Ongoing efforts to clean up dependencies will probably better serve users from the perspective of download sizes, and currently dealing with 5+ discrete Openverse repositories is potentially much more confusing than one larger repo.

We also discussed the idea of project: frontend, project: api, and project: catalog style labels to tag issues for each project.
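
A minimal sketch of how such labels could be used to find one project's issues in a shared repository. It assumes labels named exactly "project: frontend", "project: api", and "project: catalog"; the helper name issue_search_query is hypothetical and only builds a query string for GitHub's issue search box.

```python
def issue_search_query(repo, project, state="open"):
    """Build a GitHub issue search query restricted to one project label.

    `repo` is "owner/name"; `project` is the hypothetical label suffix,
    e.g. "api" for a label called "project: api".
    """
    # Label names that contain spaces must be quoted in GitHub search syntax.
    return f'repo:{repo} is:issue is:{state} label:"project: {project}"'


if __name__ == "__main__":
    print(issue_search_query("WordPress/openverse", "api"))
```

A saved search or bookmark per project built this way would give each contributor a filtered view without changing how issues are filed.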

The last thing we mentioned—the API is essentially currently a monorepo, and eliminating that prior to making Openverse a monorepo should make things even simpler.

zackkrida added the 🟨 priority: medium label and removed the 🟧 priority: high label on Mar 22, 2022

@AetherUnbound
Contributor

I'm all for this, and I love the points y'all have brought up. I'll also mention that another project I've worked on previously made a similar move and also had an RFC for their proposal: amundsen-io/rfcs#31

@krysal
Member

krysal commented Mar 24, 2022

I have some questions about this move. The main thing that is not clear to me is: what is the actual need for changing to a monorepo?

Pros

  • Easier to run the entire stack locally (i.e., the catalog, API, and frontend all connected)

How would that look? I guess this is related to the next point, plus a bunch of new just commands that wrap the docker/docker-compose ones. But how often does this scenario actually happen? Most of the time, someone is working just fine on only one of the repositories, or two at most.

  • Could theoretically create an entire docker-compose stack used for production and local development (seems much easier said than done)

Does this make deployments easier or more complicated? I'm not sure so I'd love the opinion of @rbadillap here.

  • Easier to manage cross-repo issues, discussions, and milestones

IMO the linking feature of GitHub (<organization_or_user>/<repository>#<id_of_issue_or_pr>) works pretty well for this.

  • Eliminate code maintenance of multi repo file syncing
  • Can move our project board to the repo, instead of the WordPress org, for easier access

Fair! But I'm not sure if it is worth the move.

  • Could develop a full-stack CI pipeline

What is this needed for? Do we want to check full cycles in every PR? It sounds like a large, heavy process that is prone to failing often, whereas a system divided into parts is much easier to debug.

Cons

From @sarayourfriend:

Could increase confusion for developers

We got some feedback from a community contributor that the current multiple repository approach is also confusing. A single repository with a single README that links out to the relevant sub-projects would make it easier to direct devs to the project without having to know exactly what kind of stuff they're looking for (JavaScript? Python but not Airflow? Python but Data stuff? etc).

Is it an issue with the multiple repositories per se, or more with the lack of documentation? To me it looks like we can already do that in this repo by linking to the rest. Honestly, I don't see the advantage of merging into a monorepo over adding explanations in the docs (which will be required anyway).


From @zackkrida:

the API is essentially currently a monorepo, and eliminating that prior to making Openverse a monorepo should make things even simpler.

What do you mean here? Since we are ditching the analytics server until we find a better solution, only the ingestion-server will remain as the extra service. That makes me think it probably makes more sense to first try a monorepo combining the API and the Catalog. That might be a good proof of concept for a final, big Openverse monorepo.

@dhruvkb
Member

dhruvkb commented Mar 25, 2022

I love all the pros of a monorepo, and would be 100% on board if there was a consensus to move ahead with it. I agree with @krysal about merging the two Pythonic repositories first and, once the issues are ironed out, merging the frontend with them too, instead of going all in.

But for the sake of a balanced argument, here are some of my concerns against the move (in addition to @krysal's points above).

Notifications

Here's a snippet of what my notifications look like right now.
[image: screenshot of GitHub notifications]

If Openverse were a monorepo, the entire section would be a huge mix of issues from every part of the stack.

Documentation

We're currently not in a very developed/stable position with documentation. A monorepo could definitely exacerbate the problem to a whole new level. Combined documentation would be difficult for the documenter to organise and hard for the reader to navigate.

Separation

We move very fast. Given the rate at which we open, close and comment on issues and PRs, the issue and PR tabs will always be in a stormy state. I'm pretty sure I won't be able to find any issue I'm looking for without searching and filtering by labels. They won't be glanceable, as they mostly are right now.

Faux pros

Also I don't agree with some of the pros:

Easier to run the entire stack locally

Could theoretically create an entire docker-compose stack used for production and local development (seems much easier said than done)

Do we really want this? Wouldn't it be better to focus on a part of it without running all the unrelated containers? I can see some utility here but most of the time, I'd be running the API on sample data or the frontend outside of Docker. Running all these containers all the time would be a power/RAM hog while providing limited utility.

Could develop a full-stack CI pipeline

Such a pipeline, though useful, could take very long and be very fragile. To speed it up we might skip steps based on the location of the changes, but then that'll be a very complicated workflow. Fun? Yes. Challenge? Also yes.
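
As a rough illustration of skipping steps based on the location of the changes, here is a minimal sketch. It assumes a hypothetical monorepo layout with one top-level directory per project (catalog, api, frontend); the function name affected_projects and the layout are assumptions, not anything decided in this thread.

```python
from pathlib import PurePosixPath

# Hypothetical monorepo layout: one top-level directory per project.
PROJECT_DIRS = {"catalog", "api", "frontend"}


def affected_projects(changed_files):
    """Return the set of projects touched by a list of changed file paths."""
    affected = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in PROJECT_DIRS:
            affected.add(parts[0])
        else:
            # Files outside any project directory (root config, shared CI
            # files) conservatively trigger checks for every project.
            return set(PROJECT_DIRS)
    return affected


if __name__ == "__main__":
    print(sorted(affected_projects(["api/views.py", "api/tests/test_views.py"])))
```

A CI job could feed this the output of a diff against the base branch and run only the matching checks. The fallback to "everything" for shared files is one conservative design choice among several.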

Can move our project board to the repo, instead of the WordPress org, for easier access

Not really a pro, the location of the board isn't an issue because bookmarks exist 😄

Withdrawn in light of #199 (review).

We got some feedback from a community contributor that the current multiple repository approach is also confusing.

We can do a better job of documenting the different repos in the WordPress/openverse README.md file, allowing us to send devs there and let them find the repo they feel comfortable with. To me this seems like a documentation issue rather than an architectural one.

@zackkrida
Member Author

We might also lose the ability to use the built-in support for assigning reviewers to issues.

No actually, the CODEOWNERS file supports per-directory owners, which would work great!

https://satellytes.com/blog/monorepo-codeowner-github-enterprise/
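
To make the per-directory idea concrete, here is a small sketch that mimics how a CODEOWNERS file resolves owners: rules are read in order and the last matching rule wins. It is a simplification that only models plain directory prefixes, whereas real CODEOWNERS files use gitignore-style patterns. The team handles and directory names are hypothetical.

```python
# Hypothetical rules in file order: (directory prefix, owners).
# An empty prefix plays the role of a catch-all `*` rule.
RULES = [
    ("", ["@example-org/maintainers"]),
    ("api/", ["@example-org/api-team"]),
    ("catalog/", ["@example-org/catalog-team"]),
    ("frontend/", ["@example-org/frontend-team"]),
]


def owners_for(path, rules=RULES):
    """Return the owners of `path`; as in CODEOWNERS, the last match wins."""
    owners = []
    for prefix, rule_owners in rules:
        if path.startswith(prefix):
            owners = rule_owners
    return owners


if __name__ == "__main__":
    print(owners_for("api/views.py"))
    print(owners_for("README.md"))
```

With rules like these, a pull request touching only api/ would request review from the API owners, while changes to shared files would fall back to the catch-all owners.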

@zackkrida
Member Author

zackkrida commented Apr 4, 2022

Closing in favor of #205. It's clear we have enough interest to move forward with this RFC, and follow-up on the points raised here can be addressed in the RFC.
