-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use estimated document count when possible for count()
operations
#5398
base: release/v1.3.0
Are you sure you want to change the base?
Conversation
WalkthroughThe pull request introduces optimizations in the FiftyOne library, specifically enhancing the Changes
Suggested labels
Suggested reviewers
Poem
Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
fiftyone/core/dataset.py (1)
363-371
: LGTM! The implementation looks clean and efficient.The new
_estimated_count
method provides a performant way to get approximate document counts by leveraging MongoDB's native estimation capabilities. The implementation correctly handles both sample and frame collections.Using estimated counts can significantly improve performance for large collections where exact counts are not required, as it avoids scanning the entire collection.
fiftyone/core/collections.py (1)
7729-7742
: Performance optimization: Use estimated document count when possibleGood optimization to use estimated document count for simple counting operations. This will improve performance by avoiding full collection scans when counting samples or frames.
A few suggestions to make this more robust:
- Consider adding a comment explaining when the optimization is used
- Consider adding a parameter to force exact count when needed
- Consider adding logging when falling back to exact count
# Optimization: use estimated document count when possible +# Only applies to datasets (not views) when counting all samples or frames if ( isinstance(self, fod.Dataset) and expr is None and ( field_or_expr is None or (field_or_expr == "frames" and self._has_frame_fields()) ) ): frames = field_or_expr == "frames" # pylint: disable=no-member + logger.debug("Using estimated document count") return self._estimated_count(frames=frames)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fiftyone/core/collections.py
(1 hunks)fiftyone/core/dataset.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: e2e / test-e2e
- GitHub Check: lint / eslint
- GitHub Check: build / build
- GitHub Check: build
count()
operations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will speed up a lot and iterating based on a count rather than mapping is very unreliable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually looks like the tests are failing so it might not be this simple
ah I forgot that this optimization isn't applicable to grouped datasets. I can fix |
fiftyone/core/collections.py
Outdated
|
||
# Optimization: use estimated document count when possible | ||
if ( | ||
isinstance(self, fod.Dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should fix it. the way groups are currently implemented, we cannot use estimated count.
isinstance(self, fod.Dataset) | |
isinstance(self, fod.Dataset) | |
and self.media_type != fom.GROUP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generalized it slightly more in d11868c so that estimated counts can also be used for grouped datasets when applicable, namely:
grouped_dataset.select_group_slices().count()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fiftyone/core/collections.py (1)
10617-10632
: Simplify the return statementThe method can be simplified by returning the condition directly instead of using an if/return structure.
def _is_full_collection(self): - if isinstance(self, fod.Dataset) and self.media_type != fom.GROUP: - return True - - # pylint:disable=no-member - if ( - isinstance(self, fov.DatasetView) - and self._dataset.media_type == fom.GROUP - and len(self._stages) == 1 - and isinstance(self._stages[0], fos.SelectGroupSlices) - and self._pipeline() == [] - ): - return True - - return False + return ( + (isinstance(self, fod.Dataset) and self.media_type != fom.GROUP) + or ( # pylint:disable=no-member + isinstance(self, fov.DatasetView) + and self._dataset.media_type == fom.GROUP + and len(self._stages) == 1 + and isinstance(self._stages[0], fos.SelectGroupSlices) + and self._pipeline() == [] + ) + )🧰 Tools
🪛 Ruff (0.8.2)
10622-10631: Return the condition directly
Inline condition
(SIM103)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fiftyone/core/collections.py
(2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/collections.py
10622-10631: Return the condition directly
Inline condition
(SIM103)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (1)
fiftyone/core/collections.py (1)
7729-7740
: LGTM! Performance optimization for count() looks goodThe optimization to use estimated document count when possible is a safe and effective improvement. The conditions ensure it's only used for full collections and simple count operations.
5382fa8
to
652711f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fiftyone/core/collections.py (1)
10621-10636
: Simplify by returning the condition directly.The method can be simplified by returning the condition directly instead of using an if/return structure.
def _is_full_collection(self): - if isinstance(self, fod.Dataset) and self.media_type != fom.GROUP: - return True - - # pylint:disable=no-member - if ( - isinstance(self, fov.DatasetView) - and self._dataset.media_type == fom.GROUP - and len(self._stages) == 1 - and isinstance(self._stages[0], fos.SelectGroupSlices) - and self._pipeline() == [] - ): - return True - - return False + return ( + isinstance(self, fod.Dataset) and self.media_type != fom.GROUP + ) or ( + # pylint:disable=no-member + isinstance(self, fov.DatasetView) + and self._dataset.media_type == fom.GROUP + and len(self._stages) == 1 + and isinstance(self._stages[0], fos.SelectGroupSlices) + and self._pipeline() == [] + )🧰 Tools
🪛 Ruff (0.8.2)
10626-10635: Return the condition directly
Inline condition
(SIM103)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fiftyone/core/collections.py
(2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/collections.py
10626-10635: Return the condition directly
Inline condition
(SIM103)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (1)
fiftyone/core/collections.py (1)
7729-7744
: LGTM! Good optimization for count operations.The optimization to use estimated document counts when possible is a good performance improvement. The logic properly handles both sample and frame counts while maintaining correctness.
652711f
to
d11868c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (11)
fiftyone/utils/transformers.py (2)
457-458
: Consider reducing device initialization duplication.The device initialization pattern is repeated across multiple classes. Consider extracting this into a mixin or base class method.
Example implementation:
class DeviceMixin: def initialize_device(self): self.device = torch.device(self.config.device) self.model.to(self.device)Also applies to: 503-504, 756-757
589-589
: Consider standardizing model loading pattern.The model loading with device specification is repeated across multiple classes. Consider extracting this into a utility function.
Example implementation:
def load_model_to_device(model_class, name_or_path, device): return model_class.from_pretrained(name_or_path).to(device)Also applies to: 701-701, 832-832, 886-886, 938-938
app/packages/operators/src/components/OperatorExecutionButton/index.tsx (2)
1-8
: Consider organizing imports by type and alphabetically.For better maintainability, consider grouping imports into:
- External dependencies (React, MUI)
- Internal components/types
- Local imports
+import React from "react"; +import { Button } from "@mui/material"; + +import { + ExecutionCallback, + ExecutionErrorCallback, +} from "../../types-internal"; +import { OperatorExecutionOption } from "../../state"; +import { OperatorExecutionTrigger } from "../OperatorExecutionTrigger"; -import { Button } from "@mui/material"; -import { OperatorExecutionTrigger } from "../OperatorExecutionTrigger"; -import React from "react"; -import { - ExecutionCallback, - ExecutionErrorCallback, -} from "../../types-internal"; -import { OperatorExecutionOption } from "../../state";
Line range hint
41-57
: Consider memoizing the component with React.memo.Since this is a presentational component that receives multiple props, memoizing it could prevent unnecessary re-renders.
-export const OperatorExecutionButton = ({ +export const OperatorExecutionButton = React.memo(({ // ... props -}) => { +}) => { // ... component body -}; +});app/packages/operators/src/usePanelEvent.ts (1)
Line range hint
8-15
: Consider using a more specific type for currentPanelState.The
any
type in HandlerOptions could be replaced with a more specific type to improve type safety.- currentPanelState?: any; // most current panel state + currentPanelState?: Record<string, unknown>; // most current panel stateapp/packages/core/src/plugins/SchemaIO/index.tsx (1)
Line range hint
61-73
: Consider enhancing coerceValue with more type checks.The coerceValue function could be improved to handle more edge cases.
function coerceValue(value, schema) { + if (value === undefined || value === null) { + return null; + } if (schema.type === "array" && Array.isArray(value) && value.length === 0) { return null; } if (schema.type === "string" && value === "") { return null; } + if (schema.type === "number" && typeof value === "string") { + const num = Number(value); + return isNaN(num) ? null : num; + } return value; }app/packages/operators/src/components/OperatorExecutionTrigger/index.tsx (1)
73-84
: Consider error handling for operator execution.While the execution callback is well-implemented, it might benefit from error handling for failed operations.
const onExecute = useCallback( (options?: OperatorExecutorOptions) => { const resolvedOptions = { ...executorOptions, ...options, }; - return operator.execute(executionParams ?? {}, resolvedOptions); + try { + return operator.execute(executionParams ?? {}, resolvedOptions); + } catch (error) { + console.error('Operator execution failed:', error); + throw error; + } }, [executorOptions, operator, executionParams] );app/packages/state/src/hooks/useCreateLooker.ts (1)
227-230
: Improve key format for better readability and type safety.The concatenation of
thisSampleId
andmediaField
with a hyphen could lead to key collisions if either value contains hyphens. Consider using a more robust key format.-const imavidPartitionKey = `${thisSampleId}-${mediaField}`; +const imavidPartitionKey = `${thisSampleId}:${mediaField}`;fiftyone/core/collections.py (3)
10621-10624
: Consider simplifying the conditionThe if statement can be simplified by directly returning the condition.
- if isinstance(self, fod.Dataset) and self.media_type != fom.GROUP: - return True + return isinstance(self, fod.Dataset) and self.media_type != fom.GROUP
10625-10633
: Consider simplifying the conditionThe if statement can be simplified by directly returning the condition.
- if ( - isinstance(self, fov.DatasetView) - and self._dataset.media_type == fom.GROUP - and len(self._stages) == 1 - and isinstance(self._stages[0], fos.SelectGroupSlices) - and self._pipeline() == [] - ): - return True + return ( + isinstance(self, fov.DatasetView) + and self._dataset.media_type == fom.GROUP + and len(self._stages) == 1 + and isinstance(self._stages[0], fos.SelectGroupSlices) + and self._pipeline() == [] + )
10635-10635
: Remove redundant return statementThis return statement becomes redundant if the above refactoring suggestions are implemented.
- return False
🧰 Tools
🪛 Ruff (0.8.2)
10626-10635: Return the condition directly
Inline condition
(SIM103)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
fiftyone/zoo/models/manifest-torch.json
is excluded by!**/*.json
📒 Files selected for processing (22)
app/packages/core/src/components/Modal/ImaVidLooker.tsx
(1 hunks)app/packages/core/src/components/Modal/ModalLooker.tsx
(1 hunks)app/packages/core/src/plugins/SchemaIO/components/DynamicIO.tsx
(1 hunks)app/packages/core/src/plugins/SchemaIO/components/OperatorExecutionButtonView.tsx
(1 hunks)app/packages/core/src/plugins/SchemaIO/components/TooltipProvider.tsx
(1 hunks)app/packages/core/src/plugins/SchemaIO/index.tsx
(1 hunks)app/packages/operators/src/built-in-operators.ts
(1 hunks)app/packages/operators/src/components/OperatorExecutionButton/index.tsx
(1 hunks)app/packages/operators/src/components/OperatorExecutionTrigger/index.tsx
(3 hunks)app/packages/operators/src/state.ts
(1 hunks)app/packages/operators/src/types-internal.ts
(0 hunks)app/packages/operators/src/usePanelEvent.ts
(2 hunks)app/packages/state/src/hooks/useCreateLooker.ts
(2 hunks)app/packages/state/src/recoil/dynamicGroups.ts
(2 hunks)fiftyone/core/collections.py
(2 hunks)fiftyone/core/dataset.py
(1 hunks)fiftyone/operators/executor.py
(0 hunks)fiftyone/utils/clip/zoo.py
(1 hunks)fiftyone/utils/open_clip.py
(3 hunks)fiftyone/utils/super_gradients.py
(1 hunks)fiftyone/utils/transformers.py
(14 hunks)fiftyone/utils/ultralytics.py
(3 hunks)
🔥 Files not summarized due to errors (1)
- fiftyone/utils/clip/zoo.py: Error: Disallowed special token found: <|endoftext|>
💤 Files with no reviewable changes (2)
- app/packages/operators/src/types-internal.ts
- fiftyone/operators/executor.py
✅ Files skipped from review due to trivial changes (1)
- app/packages/core/src/plugins/SchemaIO/components/TooltipProvider.tsx
🚧 Files skipped from review as they are similar to previous changes (1)
- fiftyone/core/dataset.py
🧰 Additional context used
📓 Path-based instructions (12)
app/packages/core/src/components/Modal/ImaVidLooker.tsx (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/core/src/components/Modal/ModalLooker.tsx (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/core/src/plugins/SchemaIO/components/DynamicIO.tsx (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/operators/src/usePanelEvent.ts (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/core/src/plugins/SchemaIO/index.tsx (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/core/src/plugins/SchemaIO/components/OperatorExecutionButtonView.tsx (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/state/src/hooks/useCreateLooker.ts (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/operators/src/components/OperatorExecutionButton/index.tsx (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/operators/src/built-in-operators.ts (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/state/src/recoil/dynamicGroups.ts (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/operators/src/components/OperatorExecutionTrigger/index.tsx (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/operators/src/state.ts (1)
Pattern **/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
🪛 Ruff (0.8.2)
fiftyone/core/collections.py
10626-10635: Return the condition directly
Inline condition
(SIM103)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: lint / eslint
- GitHub Check: e2e / test-e2e
- GitHub Check: build / build
- GitHub Check: build
🔇 Additional comments (22)
fiftyone/utils/super_gradients.py (1)
99-99
: Great improvement in device handling!The change from
model.cuda()
tomodel.to(self.device)
makes the code more flexible by supporting both CPU and GPU devices, following PyTorch best practices.fiftyone/utils/open_clip.py (2)
109-109
: Consistent device handling improvement!The changes from
.cuda()
to.to(self.device)
make the tensor operations device-agnostic, improving code flexibility.Also applies to: 121-121, 146-146
148-150
: Good enhancement to mixed precision support!The update to
torch.amp.autocast
properly handles device-specific mixed precision operations, ensuring optimal performance on both CPU and GPU.fiftyone/utils/ultralytics.py (2)
382-384
: Well-structured device configuration!The device configuration is properly implemented with a sensible default that automatically detects CUDA availability.
397-398
: Clean device initialization!The model is properly initialized and moved to the configured device using PyTorch's recommended approach.
fiftyone/utils/transformers.py (1)
326-328
: Well-implemented device configuration!The device configuration is properly implemented with automatic CUDA detection as a default.
app/packages/operators/src/components/OperatorExecutionButton/index.tsx (1)
Line range hint
19-40
: Props interface looks good!The TypeScript interface for props is well-defined with appropriate types and optional flags.
app/packages/operators/src/usePanelEvent.ts (1)
23-23
: Verify the impact of removing onCancel.The removal of onCancel from options destructuring and promptForOperator call might affect error handling flows.
Also applies to: 52-52
app/packages/core/src/components/Modal/ModalLooker.tsx (1)
68-71
: Good use of key prop for forcing re-render!The addition of modalMediaField as a key prop ensures proper re-rendering when the media field changes.
However, consider extracting the media field selector to a constant for reusability:
+const MODAL_MEDIA_FIELD_SELECTOR = fos.selectedMediaField(true); + export const ModalLooker = React.memo( ({ sample: propsSampleData }: LookerProps) => { // ... - const modalMediaField = useRecoilValue(fos.selectedMediaField(true)); + const modalMediaField = useRecoilValue(MODAL_MEDIA_FIELD_SELECTOR);app/packages/core/src/plugins/SchemaIO/index.tsx (1)
23-29
: Good improvement in value handling!The use of coerceValue before state updates and callbacks ensures consistent value types throughout the component.
app/packages/operators/src/components/OperatorExecutionTrigger/index.tsx (3)
9-13
: LGTM! Clean import organization.The imports are well-organized and properly scoped from the internal state module.
66-70
: Good use of useMemo for stable handler references.The
operatorHandlers
object is correctly memoized to prevent unnecessary re-renders of the operator executor.
86-89
: LGTM! Clean hook usage.The
useOperatorExecutionOptions
hook is properly used with the required dependencies.app/packages/core/src/plugins/SchemaIO/components/DynamicIO.tsx (1)
39-44
: LGTM! Improved schema fallback handling.The addition of
computedSchema
as a fallback ensures that a valid schema is always provided to the onChange handler.app/packages/core/src/plugins/SchemaIO/components/OperatorExecutionButtonView.tsx (1)
100-101
: LGTM! Simplified icon positioning logic.The icon positioning logic has been simplified by directly using the
icon_position
prop, making the code more maintainable.app/packages/state/src/recoil/dynamicGroups.ts (2)
18-18
: LGTM! Clean import addition.The import for
selectedMediaField
is properly added alongside related imports.
292-294
: LGTM! Enhanced store key generation.The addition of
mediaField
to the store key ensures proper state isolation for different media fields.app/packages/state/src/hooks/useCreateLooker.ts (1)
244-245
: LGTM! The controller retrieval uses the new partition key format.The change ensures consistent key usage between storage and retrieval.
app/packages/core/src/components/Modal/ImaVidLooker.tsx (1)
198-198
: LGTM! Removedonce
option to handle multiple fetch events.The change allows the event listener to persist and handle multiple "fetchMore" events, which is necessary for continuous frame fetching.
app/packages/operators/src/state.ts (1)
659-659
: LGTM! Simplified cancellation logic.The change streamlines the cancellation handling by directly using the
close
function, which aligns with the broader simplification of cancellation functionality across the codebase.app/packages/operators/src/built-in-operators.ts (1)
1048-1048
: LGTM! Removed cancellation handling for consistency.The removal of
on_cancel
parameter aligns with the broader changes to simplify cancellation handling across the codebase.fiftyone/core/collections.py (1)
7729-7744
: LGTM! Nice performance optimizationThe optimization to use estimated document count when possible is a good addition that will improve performance for basic counting operations. The conditions for applying the optimization are properly constrained to ensure correctness.
Change log
Use estimated document counts for
dataset.count()
anddataset.count("frames")
.Example usage
Discussion
When are estimated document counts not exact? See
https://www.mongodb.com/docs/manual/reference/method/db.collection.estimatedDocumentCount/#behavior
For context, aside from cases where a user is directly calling
count()
, there are a couple key operations that callcount()
internally:len(dataset)
callsdataset.count()
print(dataset)
callsdataset.summary()
which callsdataset.count()
In my opinion, 2 is a clear case where it makes sense to use estimated document counts. It's more important for
print(dataset)
to be fast than it is for the count to be accurate in every possible case.1 is a bit dubious though. On one hand, it's reasonable to assume that there are cases where you want to know exactly how many samples are in a dataset. On the other hand, there are times, especially in interactive shells, where you don't need precision, you just want to know, well, roughly how big the dataset is (ya know, an estimate).
My current opinion is that it is probably okay for
len(dataset)
to be an estimate. After all, datasets may have samples added/deleted at any time. So there's really no guarantee how long an "exact" document count will remain exact. And therefore it you're writing code that relies onlen(dataset)
being exact for a long period of time, then you're probably asking for trouble.An alternative strategy would be to make this optimization optional, eg by introducing a syntax like this:
We could then make sure that
print(dataset)
uses the optimization, while making the defaultdataset.count(..., allow_estimates=False)
so thatlen(dataset)
is exact.Summary by CodeRabbit
Performance Optimization
New Features