Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOMA Profiler RFCs #162

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

SOMA Profiler RFCs #162

wants to merge 3 commits into from

Conversation

beroy
Copy link
Collaborator

@beroy beroy commented May 9, 2023

No description provided.

@bkmartinjr
Copy link
Member

@beroy - one thing that would help me review this is some additional context and background information on the requirements and definition of success. Questions top of mind:

  1. Is the primary use case automating a smoke test to detect regressions as part of CI? Or is the goal to help diagnose/debug regressions once they are detected? Or both?
  2. If your goal is a diagnostic tool: my experience is that regressions are often I/O or schema related, or are highly obscured by concurrency issues. I am wondering if a focus on CPU/memory profiling is going to be useful in practice (vs. for example, the built-in TileDB query statistics). Put another way, if your goal is a diagnostic tool, is this the biggest bang for the buck? (very curious to hear what @Shelnutt2 and @gspowley think as I'm sure they have internal tooling that has covered some of this ground).
  3. If your goal is detection/smoke testing for regressions, and you think a historical log of past runs will be useful, do you want to capture core DB statistics?

Lastly (minor point) - we should probably discuss if this belongs in the TileDB-SOMA repo, or as a peer repo (motivated by a goal of keeping each repo simpler/focused - i.e., the old monorepo debate :-)

@beroy
Copy link
Collaborator Author

beroy commented May 10, 2023

@bkmartinjr, thanks a lot for the great comments. We had a design review last week and discussed some of the related issues but did not cover all the questions here. I try to address them here:
1, 2) The main goals of this are a) figuring the hot spots (perf/mem) in our code stack (either python or R) and try to reduce the bottleneck b) the ability of detect regressions as a part of CI. The generic profiler only allows detection in (b) and not necessarily diagnosis as you mentioned. Diagnosis could be a lot more complicated than what software stack frame graphs can provide.
3) That's a great point. I did not think about it but thinking more, I really think that can be very useful if I can do that as custom profiler! That means we still have the general architecture but we also have a TileDB profiler that gets attached to the app and collects the TileDB stats. Not sure how possible it is to access TileDB API from outside a process. In the worst case we can add an access point or just a flag for doing so.

Finally, this repo suggestion seems to come from an agreement between TileDB and CZI for sharing RFCs. @maniarathi can comment here.

@bkmartinjr
Copy link
Member

Hi @beroy,

design review last week

Did this review include the TileDB team? Most of the code is in their DB stack (not in SOMA or Census code), and that is also where most of the complexity and performance sensitive code is (e.g, github.com/TileDB/TileDB and github.com/TileDB/TileDB-Py). The core TileDB team has extensive experience with performance work, and I presume have a tool stack already in place. I just want to make sure we are filling a gap that exists, and that we benefit from the existing knowledge and tools to build upon.

TileDB stats

AFAIK, you can't capture them from outside the process, but enabling in-process collection has very little overhead and they are designed to be captured in the normal course of using the DB. Given that you must run the system with your own driver process, I suggest just capturing them as part of that driver process. The total size of stats summary is tiny and can be written to logs as part of running the core tests.

There are some near-term caveats to this due to our double-instantiation of the core DB, @gspowley and others can talk you through their plans to centralize everything (tactically,just capture stats fro both instances, or perhaps even ignore the second if you are focused on read-only perf). The stats API in SOMA is already exposed via tiledbsoma.tiledbsoma_stats_*, and there are equivalent API in the tiledb package for the second DB instance.

@johnkerl
Copy link
Member

@gspowley and others can talk you through their plans to centralize everything

@bkmartinjr the current POC is @nguyenv

@beroy
Copy link
Collaborator Author

beroy commented May 10, 2023

@bkmartinjr. TileDB people were not in the review but I discuss their profiling tools with @gspowley. While their tool capture some of the needed stuff the main focus is on distributed nodes. Also my goal is to capture the breakdown across python/R and C++. Your suggestion about collecting DB stats makes total sense (directly log them from the driver). BTW, here by core DB do you mean TileDB? Also, I'm not familiar with the double-instantiation plan. Will check with them

@bkmartinjr
Copy link
Member

@beroy - I'm using "core DB" and "embedded TileDB" as synonyms.

We end up with two separate instantiations, with their own private memory/context/etc due to the way we are bootstrapping the SOMA C++ layer. As noted by @johnkerl, @nguyenv is POC on this (apologies for earlier misdirect). Using Python as an example (similar issue in R):

  • We started using only TileDB-Py, which has its own linked TileDB core
  • We then introduced the SOMA-specific (libtiledbsoma) for a narrow set of code paths, which created a second linked C++ shared object, TileDB context, etc. Initial code path was focused on read, grew to include nnz, etc.
  • Over time, we are migrating all functionality to the SOMA C++ layer, and at some future date we should be able to remove the TileDB-Py (and TileDB-R) dependency.

I only point this out as the separate core DB instances have their own stats data.

@thetorpedodog
Copy link
Contributor

two non–content-related notes from me (these apply to both open RFCs right now so I am pasting this note in both places):

  1. Rebase so that you get the pre-commit format verification thingy
  2. (Optional) Consider formatting to one sentence per line. I did this in 3b5891c, but I didn’t formally write it down, so it’s not a hard Rule but it is a format I have found useful for editing.

Copy link
Member

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few requests for clarifications and decisions about scope of the RFC.

rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/images/profilerarchitecture.png Outdated Show resolved Hide resolved
Copy link
Member

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few requests for clarifications and decisions about scope of the RFC.

rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Show resolved Hide resolved
@beroy beroy requested review from gspowley and johnkerl June 6, 2023 16:49
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated Show resolved Hide resolved
rfcs/profiler.md Outdated

### Benefits

- A major benefit of this design is that the profilers (both generic and custom ones) are not necessarily targeted toward single cell applications and can be used for any services across CZI.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- A major benefit of this design is that the profilers (both generic and custom ones) are not necessarily targeted toward single cell applications and can be used for any services across CZI.
- A major benefit of this design is that the profilers (both generic and custom ones) are not necessarily targeted toward SOMA and can be used for various services.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beroy please make the requested change

rfcs/profiler.md Outdated Show resolved Hide resolved
Copy link
Member

@ebezzi ebezzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. As Andrew suggested, I recommend to fill a more detailed implementation/deployment as a separate RFC or tech spec.

@johnkerl johnkerl requested review from johnkerl and atolopko-czi June 9, 2023 21:14
@atolopko-czi
Copy link
Member

@beroy @johnkerl As we now have a functioning profiler, can we merge the RFC and close this out?

@johnkerl johnkerl self-requested a review August 2, 2023 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants