# added upload caching #387

## Conversation
### PR Review: Upload Caching Feature

**Summary**

This PR adds caching functionality to prevent re-uploading assets that have already been uploaded. The implementation includes a configuration flag and a cache mechanism based on file metadata.

**Code Quality & Best Practices**

Positive aspects:
Areas for improvement:
**Potential Bugs & Issues**

Critical Issues:
**Performance Considerations**

Positive:
Considerations:
**Security Concerns**
**Test Coverage**

Missing test coverage: The PR adds no tests for the new caching functionality. Critical test cases needed:
Recommendation: Add unit tests in a new test file; a sketch of such tests follows below.
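A sketch of what such tests could look like, assuming pytest and the `_get_cache_key` behavior shown in this PR's diff; the `MagicMock` stands in for the real `OpenAPIService`, assuming the constructor only stores it:

```python
# test_asset_uploader_cache.py -- illustrative sketch, not the PR's actual tests.
from unittest.mock import MagicMock

from rapidata.rapidata_client.datapoints._asset_uploader import AssetUploader


def test_url_assets_use_the_url_itself_as_cache_key():
    uploader = AssetUploader(MagicMock())  # mocked OpenAPIService
    url = "https://example.com/image.png"
    assert uploader._get_cache_key(url) == url


def test_local_file_key_changes_when_the_file_changes(tmp_path):
    asset = tmp_path / "image.png"
    asset.write_bytes(b"original bytes")
    uploader = AssetUploader(MagicMock())

    key = uploader._get_cache_key(str(asset))
    assert key.startswith(str(asset))  # key embeds path, size, mtime

    # Rewriting the file changes its size/mtime, which must change the key.
    asset.write_bytes(b"completely different content")
    assert uploader._get_cache_key(str(asset)) != key
```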
**Additional Recommendations**

**Summary Assessment**

Severity Breakdown:
Overall: The feature is well-implemented for basic use cases but needs attention to cache lifecycle management and test coverage before merging. The unbounded cache growth is the primary concern that should be addressed.

Recommendation: Request changes to address memory management and add basic test coverage.
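If the cache is indeed unbounded, an LRU bound is the usual remedy; a minimal sketch using the `cachetools` library (a review further below suggests the implementation may already use `LRUCache`), with an arbitrary `maxsize`:

```python
from cachetools import LRUCache

# Once maxsize entries exist, the least recently used entry is evicted,
# so memory use stays bounded instead of growing with every upload.
_upload_cache: LRUCache = LRUCache(maxsize=100_000)

_upload_cache["image.png:2048:1699999999000"] = "uploaded-name.png"
assert len(_upload_cache) <= 100_000
```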
### Pull Request Overview
This PR implements asset upload caching to avoid redundant uploads of the same assets. The caching system tracks uploaded assets using a combination of file metadata (path, size, modification time) for local files and URLs for remote assets.
- Adds a caching mechanism with configurable toggle to prevent duplicate asset uploads
- Introduces cache key generation based on file metadata and URLs
- Updates configuration to allow users to enable/disable upload caching
**Reviewed Changes**
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/rapidata/rapidata_client/datapoints/_asset_uploader.py | Implements upload cache with key generation logic and cache lookup/storage |
| src/rapidata/rapidata_client/config/upload_config.py | Adds cache_uploads boolean field to control caching behavior |
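Based on the `rapidata_config.upload.cache_uploads` access path visible in the diff below, disabling the cache would presumably look like this (the import location is an assumption):

```python
# Sketch: turning the new flag off so every asset is uploaded fresh.
from rapidata import rapidata_config  # import path assumed

rapidata_config.upload.cache_uploads = False
```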
```python
logger.info(f"Asset uploaded: {response.file_name}")
if rapidata_config.upload.cache_uploads:
    self._upload_cache[asset_key] = response.file_name
logger.debug("Asset added to cache")
```
**Copilot AI** · Nov 10, 2025
The debug log 'Asset added to cache' is always executed regardless of whether caching is enabled or the asset was actually added to the cache. Move this log statement inside the `if rapidata_config.upload.cache_uploads:` block on line 50.
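A sketch of the suggested fix, moving the log into the guarded block from the excerpt above:

```python
logger.info(f"Asset uploaded: {response.file_name}")
if rapidata_config.upload.cache_uploads:
    self._upload_cache[asset_key] = response.file_name
    logger.debug("Asset added to cache")  # now logged only when actually cached
```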
```python
if re.match(r"^https?://", asset):
    return asset
else:
    stat = os.stat(asset)
```
**Copilot AI** · Nov 10, 2025
The `_get_cache_key` method calls `os.stat(asset)` without checking if the file exists first. This will raise `FileNotFoundError` for non-existent files. However, the file existence check happens later in `upload_asset` at line 44. This causes cache key generation to fail before the proper error handling. Consider adding an existence check in `_get_cache_key` or handling the exception appropriately.
Suggested change:

```python
try:
    stat = os.stat(asset)
except FileNotFoundError:
    raise FileNotFoundError(f"File not found: {asset}")
```
```python
maxWorkers: int = Field(default=10)
maxRetries: int = Field(default=3)
chunkSize: int = Field(default=50)
cache_uploads: bool = Field(default=True)
```
**Copilot AI** · Nov 10, 2025
The new `cache_uploads` field is not documented in the class docstring. Add documentation for this field explaining that it controls whether uploaded assets are cached to avoid duplicate uploads.
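A sketch of the requested docstring addition; the class name, base class, and docstring style are assumptions, since only the fields are visible in the diff:

```python
from pydantic import BaseModel, Field  # pydantic inferred from the Field(...) usage


class UploadConfig(BaseModel):  # name and base class are assumptions
    """Configuration for asset uploads.

    Attributes:
        cache_uploads: When True (the default), successfully uploaded assets
            are cached (keyed by file metadata or URL) so that re-uploading
            the same unchanged asset is skipped.
    """

    maxWorkers: int = Field(default=10)
    maxRetries: int = Field(default=3)
    chunkSize: int = Field(default=50)
    cache_uploads: bool = Field(default=True)
```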
### PR Review: Upload Caching Implementation

Thank you for implementing upload caching! This is a valuable performance optimization. I've reviewed the changes and have the following feedback:

**✅ Positives**
**🐛 Potential Issues**

**1. Cache Inconsistency Across Instances (Critical)**

Each `AssetUploader` instance keeps its own cache, so cached uploads are not shared between instances.
Impact: The same file uploaded in different contexts won't benefit from caching. You're potentially uploading the same asset multiple times.

Recommendation: Consider making the cache a class variable or using a module-level cache that's shared across all instances:

```python
_shared_upload_cache: LRUCache = None

def __init__(self, openapi_service: OpenAPIService, max_cache_size: int = 100_000):
    self.openapi_service = openapi_service
    if AssetUploader._shared_upload_cache is None:
        AssetUploader._shared_upload_cache = LRUCache(maxsize=max_cache_size)
    self._upload_cache = AssetUploader._shared_upload_cache
```

**2. Race Condition in Cache Check (Moderate)**

Lines 38-41 and 52-54 have a TOCTOU (Time-of-Check-Time-of-Use) issue. If caching is disabled between the check and the cache write, you could cache when you shouldn't.

Recommendation: Check the `cache_uploads` flag once, before the lookup:

```python
if rapidata_config.upload.cache_uploads:
    asset_key = self._get_cache_key(asset)
    if asset_key in self._upload_cache:
        logger.debug("Asset found in cache")
        return self._upload_cache[asset_key]
```

**3. FileNotFoundError Moved Too Early (Moderate)**

The file existence check moved from lines 47-48 to inside `_get_cache_key`, so a missing file now raises during cache key generation instead of at the original check.
Recommendation: Keep the file check where it was, or restructure the logic:

```python
def _get_cache_key(self, asset: str) -> str:
    """Generate cache key for an asset."""
    if re.match(r"^https?://", asset):
        return asset
    # For file paths
    if not os.path.exists(asset):
        raise FileNotFoundError(f"File not found: {asset}")
    stat = os.stat(asset)
    return f"{asset}:{stat.st_size}:{stat.st_mtime_ns}"
```

**4. Logging Inconsistency (Minor)**

Line 54:

**🧪 Test Coverage**

Missing tests for:
Recommendation: Add unit tests in a new test file; for example, the cache-hit sketch below.
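A sketch of such a cache-hit test, relying only on the `_upload_cache` and `_get_cache_key` internals shown in this diff; the `MagicMock` stands in for the real `OpenAPIService`:

```python
from unittest.mock import MagicMock

from rapidata.rapidata_client.datapoints._asset_uploader import AssetUploader


def test_cache_hit_skips_the_upload_entirely(tmp_path):
    asset = tmp_path / "video.mp4"
    asset.write_bytes(b"payload")

    service = MagicMock()
    uploader = AssetUploader(service)

    # Pre-seed the cache the same way upload_asset does on success.
    key = uploader._get_cache_key(str(asset))
    uploader._upload_cache[key] = "already-uploaded.mp4"

    assert uploader.upload_asset(str(asset)) == "already-uploaded.mp4"
    assert not service.method_calls  # nothing was sent to the backend
```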
**🔒 Security Considerations**

**📝 Documentation**

The new `cache_uploads` field should be documented.

**Performance Considerations**
**Summary**

This is a solid feature implementation, but the cache-sharing issue significantly reduces its effectiveness. I'd recommend addressing issue #1 before merging to maximize the performance benefits.

Priority fixes: Issues #1 and #2
### PR Review: Upload Caching Feature

Thanks for implementing upload caching! This is a valuable performance optimization. Here's my detailed review:

**Code Quality & Best Practices ✅**

Strengths:
Suggestions for Improvement:
**Potential Bugs & Issues**
### Pull Request Review: Upload Caching

**Overview**

This PR adds upload caching functionality to prevent redundant uploads of the same assets. The implementation uses an LRU cache with cache key generation based on file properties.

**Code Quality & Best Practices**

Positive Aspects
**Areas for Improvement**

**1. Thread Safety Concerns (Critical)**

File: `src/rapidata/rapidata_client/datapoints/_asset_uploader.py`

The class-level cache is read and updated without synchronization, so concurrent uploads (the config defaults to `maxWorkers: 10`) can race on it.
Recommendation: Use a lock to synchronize cache access:

```python
import threading

class AssetUploader:
    _shared_upload_cache: LRUCache = LRUCache(maxsize=100_000)
    _cache_lock = threading.Lock()

    def upload_asset(self, asset: str) -> str:
        with tracer.start_as_current_span("AssetUploader.upload_asset"):
            logger.debug("Uploading asset: %s", asset)
            assert isinstance(asset, str), "Asset must be a string"

            asset_key = self._get_cache_key(asset)
            with self._cache_lock:
                if asset_key in self._shared_upload_cache:
                    logger.debug("Asset found in cache")
                    return self._shared_upload_cache[asset_key]

            # Upload logic...

            if rapidata_config.upload.cache_uploads:
                with self._cache_lock:
                    self._shared_upload_cache[asset_key] = response.file_name
                    logger.debug("Asset added to cache")
```

**2. Cache Key Generation Logic Issue**

File: `src/rapidata/rapidata_client/datapoints/_asset_uploader.py`

The `_get_cache_key` method calls `os.stat` before `upload_asset` performs its own file existence check, so a missing local file fails during key generation rather than at the intended check.
Recommendation: Move the file existence check to the caller, keeping `_get_cache_key` simple:

```python
def _get_cache_key(self, asset: str) -> str:
    """Generate cache key for an asset."""
    if re.match(r"^https?://", asset):
        return asset
    # Assume file exists; caller should validate
    stat = os.stat(asset)  # Will raise FileNotFoundError if missing
    return f"{asset}:{stat.st_size}:{stat.st_mtime_ns}"
```

**3. Error Handling Gap**

File: `src/rapidata/rapidata_client/datapoints/_asset_uploader.py`

**4. Cache Size Configuration**

File: `src/rapidata/rapidata_client/datapoints/_asset_uploader.py`

The cache size is hardcoded to 100,000 entries. This could be:
Recommendation: Make this configurable via the upload config:

```python
# In upload_config.py
cacheMaxSize: int = Field(default=100_000)

# In _asset_uploader.py
def __init__(self, openapi_service: OpenAPIService):
    self.openapi_service = openapi_service
    # Initialize instance cache from shared cache or create new
    if not hasattr(AssetUploader, '_cache_initialized'):
        AssetUploader._shared_upload_cache = LRUCache(
            maxsize=rapidata_config.upload.cacheMaxSize
        )
        AssetUploader._cache_initialized = True
```

**5. Potential Cache Invalidation Issue**

File: `src/rapidata/rapidata_client/datapoints/_asset_uploader.py`

For local files, the cache key includes the modification time (`st_mtime_ns`), so any rewrite of a file invalidates its cache entry even when the content is unchanged.
This might be intended behavior, but consider using content hashing (e.g., MD5/SHA256 of the file) for true deduplication, though this adds I/O overhead; see the sketch below.
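A sketch of a content-hash key as an alternative to the metadata-based key; the function name and chunk size are illustrative:

```python
import hashlib


def _content_hash_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Cache key from file contents, so renames/touches don't bust the cache."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so large assets never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```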
**Potential Bugs**

The cache is consulted without first checking whether caching is enabled; the lookup should be guarded by the config flag:

```python
if rapidata_config.upload.cache_uploads:
    asset_key = self._get_cache_key(asset)
    if asset_key in self._shared_upload_cache:
        logger.debug("Asset found in cache")
        return self._shared_upload_cache[asset_key]
```

**Performance Considerations**

Positive
Concerns
**Security Concerns**
**Test Coverage**

Critical Issue: No tests found for the new caching functionality.

Required Tests:
**Documentation**

The new `cache_uploads` field is undocumented.
**Summary**

Priority Issues:
Recommended Actions:
The core idea is excellent and will significantly improve performance for repeated uploads. With the threading fixes and tests, this will be a solid addition to the SDK.