Chunk model and embeddings of message #1651
Conversation
Important: Review skipped due to path filters.
⛔ Files ignored due to path filters (1)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters.

Summary by CodeRabbit

Walkthrough: This change introduces a new Django app for AI functionalities.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
backend/pyproject.toml (1)
42-42: Fix dependency declaration formatting
Add a space around the equals sign for `pgvector` to match the project's existing style (e.g., `pgvector = "^0.4.1"`).

backend/apps/ai/management/commands/slack_create_chunks.py (2)
18-18: Consider standardizing environment variable naming. The environment variable `DJANGO_OPEN_AI_SECRET_KEY` could follow a more consistent pattern like `OPENAI_API_KEY` or `DJANGO_OPENAI_API_KEY` to align with common naming conventions.
46-48: Enhance progress reporting for better user experience. Consider adding more detailed progress reporting, especially for large batches that take significant time to process.

```diff
  processed_count += len(batch_messages)
+
+ if processed_count % (batch_size * 5) == 0:  # Report every 5 batches
+     self.stdout.write(f"Processed {processed_count}/{total_messages} messages...")

  self.stdout.write(f"Completed processing all {total_messages} messages")
```

backend/apps/ai/models/chunk.py (1)
28-32: Consider adding type annotations for consistency. The method works correctly, and the current return annotation is already correct, but consider adding a type hint for the `embedding` parameter:

```diff
- def from_chunk(self, chunk_text: str, message: Message, embedding=None) -> None:
+ def from_chunk(self, chunk_text: str, message: Message, embedding: list | None = None) -> None:
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (13)
- backend/Makefile (1 hunks)
- backend/apps/ai/admin.py (1 hunks)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
- backend/apps/ai/migrations/0001_initial.py (1 hunks)
- backend/apps/ai/models/__init__.py (1 hunks)
- backend/apps/ai/models/chunk.py (1 hunks)
- backend/pyproject.toml (1 hunks)
- backend/settings/base.py (1 hunks)
- backend/tests/apps/ai/management/commands/slack_create_chunks_test.py (1 hunks)
- backend/tests/apps/ai/models/chunk_test.py (1 hunks)
- docker-compose/local.yaml (1 hunks)
- docker-compose/production.yaml (1 hunks)
- docker-compose/staging.yaml (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
backend/tests/apps/ai/management/commands/slack_create_chunks_test.py
[refactor] 78-78: Too many arguments (6/5)
(R0913)
[refactor] 78-78: Too many positional arguments (6/5)
(R0917)
[refactor] 255-255: Too many local variables (16/15)
(R0914)
[refactor] 15-15: Too many public methods (23/20)
(R0904)
backend/apps/ai/admin.py
[refactor] 8-8: Too few public methods (0/2)
(R0903)
backend/apps/ai/migrations/0001_initial.py
[refactor] 9-9: Too few public methods (0/2)
(R0903)
backend/apps/ai/models/chunk.py
[refactor] 14-14: Too few public methods (0/2)
(R0903)
🪛 GitHub Actions: Run CI/CD
backend/tests/apps/ai/management/commands/slack_create_chunks_test.py
[error] 214-214: CSpell: Unknown word 'thumbsup' found during spelling check.
🪛 GitHub Check: CodeQL
backend/apps/ai/management/commands/slack_create_chunks.py
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \u2600-\u2b55 in the same character class, and overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \u2500-\u2bef in the same character class, and overlaps with \u2600-\u2b55 in the same character class, and overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class, and overlaps with \ufffd-\ufffd in the same character class.
[warning] 103-103: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 103-103: Overly permissive regular expression range
Suspicious character range that overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (24)
docker-compose/staging.yaml (1)
40-40: Switch to pgvector-enabled Postgres image
This update aligns the staging environment with local and production configs for pgvector support, which is required for storing vector embeddings.

backend/apps/ai/models/__init__.py (1)
1-1: Expose Chunk model at package level
Importing `Chunk` in `__init__.py` makes the model accessible directly from the `models` package, improving usability.

backend/pyproject.toml (1)
37-38: Add AI and vector processing dependencies
Including `langchain` and `langchain-community` supports embedding generation workflows introduced by the new management command.

backend/Makefile (1)
178-180: Add Makefile target for message chunk creation
The `slack-create-message-chunks` target provides a convenient shortcut to run the new `slack_create_chunks` command in the backend container.

backend/settings/base.py (1)
46-46: Register the AI application
Adding `"apps.ai"` to `LOCAL_APPS` ensures Django recognizes and loads the new AI app and its migrations.

docker-compose/local.yaml (1)
55-55: LGTM! Essential infrastructure change for vector embeddings. The switch to `pgvector/pgvector:pg16` enables PostgreSQL vector extension support required by the new AI app's Chunk model with embedding fields.

backend/apps/ai/admin.py (1)
1-21: LGTM! Clean Django admin implementation. The admin configuration follows Django best practices with appropriate list display fields and search functionality on both the chunk text and related message's Slack ID.
docker-compose/production.yaml (1)
40-40: LGTM! Maintains consistency across environments. The production environment now uses the same pgvector-enabled PostgreSQL image as local development, ensuring consistent vector embedding support.
backend/apps/ai/migrations/0001_initial.py (1)
1-52: LGTM! Well-structured migration with proper vector support. The migration correctly:
- Enables pgvector extension for vector operations
- Creates Chunk model with 1536-dimension embeddings (matching OpenAI's text-embedding-3-small)
- Uses appropriate CASCADE deletion and unique constraints
- Follows Django migration best practices
backend/apps/ai/management/commands/slack_create_chunks.py (2)
34-35: Optimize QuerySet pagination for better performance. Using QuerySet slicing `[offset : offset + batch_size]` is inefficient for large datasets as it doesn't use database-level LIMIT/OFFSET optimization.

```diff
- for offset in range(0, total_messages, batch_size):
-     batch_messages = Message.objects.all()[offset : offset + batch_size]
+ for offset in range(0, total_messages, batch_size):
+     batch_messages = Message.objects.all().order_by('id')[offset:offset + batch_size]
```

Or better yet, use Django's pagination:

```diff
+ from django.core.paginator import Paginator
+
+ paginator = Paginator(Message.objects.all().order_by('id'), batch_size)
+ for page_num in paginator.page_range:
+     batch_messages = paginator.page(page_num).object_list
```

Likely an incorrect or invalid review comment.
79-79: I'll verify how and where `openai` is declared and imported in the repo.

```bash
#!/bin/bash
# 1. Look for "openai" in dependency files
grep -R "openai" -n --include=requirements* --include=Pipfile* --include=pyproject.toml . \
  || echo "No openai entries in dependency files"

# 2. Locate the command file path
file=$(fd slack_create_chunks.py)

# 3. Show any openai imports in the codebase
grep -R "import openai" -n . || echo "No import openai found in code"

# 4. Print the top of the command file to see how openai is used
echo "=== Imports in $file ==="
sed -n '1,30p' "$file"
```

backend/apps/ai/models/chunk.py (5)
1-8: LGTM! Clean imports and good organization.The imports are well-organized and include all necessary dependencies for the chunk model functionality.
11-21: Well-structured model definition with proper constraints. The model definition follows Django best practices with appropriate field types and constraints. The unique constraint on `(message, chunk_text)` prevents duplicate chunks for the same message.
23-26: Good string representation with truncation. The `__str__` method provides a clear, informative representation that includes the chunk ID, message ID, and truncated text preview.
34-39: Efficient bulk save implementation with proper filtering. The static method correctly filters out None values before calling the bulk save operation, which prevents potential errors during database operations.
41-70: Well-implemented conditional creation logic with good documentation. The `update_data` method has comprehensive logic for checking existing chunks and conditionally creating new ones. The docstring is detailed and the method signature uses keyword-only arguments appropriately.

backend/tests/apps/ai/management/commands/slack_create_chunks_test.py (5)
1-13: Clean test setup with appropriate imports. The imports are well-organized and include all necessary testing dependencies.
58-62: Good use of autouse fixture for environment setup. The autouse fixture ensures the OpenAI API key is available for all tests without manual setup in each test method.
78-105: Comprehensive test with good mocking strategy. The test thoroughly covers the successful execution path with appropriate mocking of external dependencies and clear assertions.
255-304: Excellent batch processing test coverage. This test effectively validates that the command processes messages in batches of 1000, which is crucial for performance with large datasets.
208-218: Fix spelling issue in test parameter. The pipeline failure indicates 'thumbsup' is not recognized by the spell checker.

```diff
- ("This is :smile: awesome :thumbsup:", "This is awesome "),
+ ("This is :smile: awesome :thumbs_up:", "This is awesome "),
```

Likely an incorrect or invalid review comment.
backend/tests/apps/ai/models/chunk_test.py (3)
9-14: Useful helper function for mock creation. The `create_model_mock` function provides a clean way to create model mocks with the necessary attributes for testing.
57-93: Thorough bulk_save method testing. The test suite comprehensively covers all scenarios for the bulk_save method, including edge cases with None values and empty lists.
181-193: Excellent model metadata validation. These tests ensure the model's Meta class attributes are correctly configured, which is important for database schema and admin interface behavior.
Actionable comments posted: 2
♻️ Duplicate comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
110-116: Address the overlapping Unicode ranges in regex patterns. The static analysis has flagged overlapping Unicode ranges in regex patterns. While not directly visible in this segment, the past review comments indicate regex issues that should be addressed.
Based on past review feedback, consider using the `emoji` library more effectively instead of custom regex patterns for emoji handling, which you're already doing with `emoji.demojize()` on line 110.
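The cleaning steps discussed in this thread can be sketched with the standard library alone (a rough illustration; the function name and exact patterns are not the project's actual implementation, which also uses the `emoji` library):

```python
import re

def clean_slack_text(text: str) -> str:
    """Approximate the cleaning discussed above: strip Slack user
    mentions, Slack-formatted links, and :emoji_alias: codes, then
    normalize whitespace."""
    text = re.sub(r"<@U[A-Z0-9]+>", "", text)     # Slack user mentions
    text = re.sub(r"<https?://[^>]+>", "", text)  # Slack-formatted links
    text = re.sub(r":\w+:", "", text)             # emoji alias codes
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

print(clean_slack_text("Hey <@U12345>, see <https://example.com> :smile: thanks"))
# → Hey , see thanks
```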
🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
37-50: Consider adding progress reporting for long-running operations. Processing all messages in batches could take a long time for large datasets. Consider adding progress reporting to improve user experience.

```diff
  for offset in range(0, total_messages, batch_size):
+     progress_percent = round((processed_count / total_messages) * 100, 1)
+     self.stdout.write(f"Processing batch at offset {offset} ({progress_percent}% complete)")
+
      batch_messages = Message.objects.all()[offset : offset + batch_size]
```
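The batched-processing-with-progress pattern suggested here can be shown standalone (a minimal sketch with illustrative names, not the command's actual code):

```python
def process_in_batches(items, batch_size=1000, report_every=5):
    """Process items in fixed-size batches, printing progress every N batches."""
    processed = 0
    for batch_start in range(0, len(items), batch_size):
        batch = items[batch_start:batch_start + batch_size]
        # ... handle the batch here (e.g. create chunks, bulk save) ...
        processed += len(batch)
        batch_number = batch_start // batch_size + 1
        if batch_number % report_every == 0:
            print(f"Processed {processed}/{len(items)} items...")
    return processed

print(process_in_batches(list(range(2500)), batch_size=1000))
# → 2500
```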
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
- backend/apps/slack/management/commands/slack_sync_messages.py (1 hunks)
- backend/pyproject.toml (1 hunks)
- cspell/custom-dict.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cspell/custom-dict.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/pyproject.toml
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
- GitHub Check: Run backend tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (3)
backend/apps/slack/management/commands/slack_sync_messages.py (1)
64-68: Verify that removing the sync_messages filter is intentional. The removal of the `sync_messages=True` filter means all conversations in the workspace will now be processed for message synchronization, not just those specifically marked for syncing. This broadens the scope significantly and may impact performance and data volume. Please confirm this change aligns with the AI chunk creation requirements and verify the performance implications:

```bash
#!/bin/bash
# Description: Check how the sync_messages field is used elsewhere in the codebase
# Expected: Understand if this field has other uses that might be affected

# Search for sync_messages field usage
rg -A 3 -B 3 "sync_messages" --type py

# Check if there are any database constraints or migrations related to sync_messages
fd -e py -x grep -l "sync_messages" {} \;
```

backend/apps/ai/management/commands/slack_create_chunks.py (2)
21-27: LGTM! Proper API key validation. Good practice to check for the required environment variable and exit gracefully with a clear error message if it's missing.
89-93: Update exception handling for newer OpenAI library. The exception class `openai.error.OpenAIError` is from an older version of the OpenAI library. The current library uses `openai.OpenAIError`.

```diff
- except openai.error.OpenAIError as e:
+ except openai.OpenAIError as e:
```

Likely an incorrect or invalid review comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
backend/apps/ai/management/commands/slack_create_chunks.py (2)
69-80: Improve rate limiting implementation. The current implementation using `getattr` works but could be cleaner. Also, consider initializing `last_request_time` in `__init__` or at the class level.

```diff
+ def __init__(self, *args, **kwargs):
+     super().__init__(*args, **kwargs)
+     self.last_request_time = None
+
  def create_chunks_from_message(
      self, message: Message, cleaned_text: str
  ) -> list[Chunk | None]:
      """Create chunks from a message."""
      if message.raw_data.get("subtype") in ["channel_join", "channel_leave"]:
          return []

      chunk_texts = self.split_message_text(cleaned_text)
      if not chunk_texts:
          self.stdout.write(
              f"No chunks created for message {message.slack_message_id} - text too short"
          )
          return []

      try:
-         time_since_last_request = datetime.now(UTC) - getattr(
-             self, "last_request_time", datetime.now(UTC) - timedelta(seconds=2)
-         )
-
-         if time_since_last_request < timedelta(seconds=1.2):
-             time.sleep(1.2 - time_since_last_request.total_seconds())
+         # Rate limiting: ensure at least 1.2 seconds between requests
+         if self.last_request_time:
+             time_since_last_request = datetime.now(UTC) - self.last_request_time
+             if time_since_last_request < timedelta(seconds=1.2):
+                 time.sleep(1.2 - time_since_last_request.total_seconds())

          response = self.openai_client.embeddings.create(
              model="text-embedding-3-small", input=chunk_texts
          )
          self.last_request_time = datetime.now(UTC)
```
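The rate-limiting pattern under discussion can be sketched as a standalone helper (a minimal illustration, not the project's actual class; `min_interval` mirrors the 1.2-second spacing above):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, min_interval: float = 1.2):
        self.min_interval = min_interval
        self.last_request_time = None  # initialized up front, no getattr needed

    def wait(self) -> None:
        """Sleep just long enough to honor the minimum interval."""
        if self.last_request_time is not None:
            elapsed = time.monotonic() - self.last_request_time
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request_time = time.monotonic()

limiter = RateLimiter(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # first call returns immediately, later calls pace themselves
print(time.monotonic() - start >= 0.2)
# → True
```

Initializing `last_request_time` in `__init__` makes the first-call case explicit, which is the cleanup the comment suggests.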
111-112: Fix variable reference error in text cleaning. Line 112 references `message_text` instead of `cleaned_text`, which will undo the emoji processing from line 111.

```diff
  cleaned_text = emoji.demojize(message_text, delimiters=("", ""))
- cleaned_text = re.sub(r"<@U[A-Z0-9]+>", "", message_text)
+ cleaned_text = re.sub(r"<@U[A-Z0-9]+>", "", cleaned_text)
```
🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
37-39: Consider using cursor-based pagination to avoid skipping messages. Offset-based pagination can skip messages if records are deleted during processing. Consider using ID-based pagination for more reliable processing.

```diff
- for offset in range(0, total_messages, batch_size):
-     batch_messages = Message.objects.all()[offset : offset + batch_size]
+ last_id = 0
+ while True:
+     batch_messages = Message.objects.filter(id__gt=last_id).order_by('id')[:batch_size]
+     if not batch_messages:
+         break

      batch_chunks = []
      for message in batch_messages:
          cleaned_text = self.clean_message_text(message.raw_data.get("text", ""))
          chunks = self.create_chunks_from_message(message, cleaned_text)
          batch_chunks.extend(chunks)

      if batch_chunks:
          Chunk.bulk_save(batch_chunks)

      processed_count += len(batch_messages)
+     last_id = batch_messages[len(batch_messages) - 1].id
+     self.stdout.write(f"Processed {processed_count}/{total_messages} messages...")
```
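Keyset (ID-based) pagination can be illustrated outside Django with an in-memory table sorted by id (a sketch; `fetch_after` stands in for the filtered, ordered queryset):

```python
def fetch_after(rows, last_id, limit):
    """Return up to `limit` rows with id greater than `last_id` (rows sorted by id)."""
    return [r for r in rows if r["id"] > last_id][:limit]

def paginate(rows, batch_size=2):
    """Walk all rows in id order using keyset pagination."""
    seen, last_id = [], 0
    while True:
        batch = fetch_after(rows, last_id, batch_size)
        if not batch:
            break
        seen.extend(r["id"] for r in batch)
        last_id = batch[-1]["id"]  # cursor advances past the last row seen
    return seen

rows = [{"id": i} for i in (1, 2, 4, 7, 9)]
print(paginate(rows))
# → [1, 2, 4, 7, 9]
```

Because the cursor is the last id seen rather than a positional offset, rows deleted mid-run cannot shift the window and cause skips.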
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: Run frontend unit tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/ai/management/commands/slack_create_chunks.py (2)
1-15: Imports are well-organized and necessary. The imports follow PEP 8 conventions with proper ordering (standard library, third-party, local).
96-105: Text splitting configuration is appropriate. The RecursiveCharacterTextSplitter configuration with chunk size 300 and overlap 40 provides good balance for embedding generation.
A few questions/suggestions I have regarding this:
Actionable comments posted: 0
🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
40-54: Fix whitespace formatting issue. There's a formatting issue with the slice syntax that violates PEP 8.

```diff
- batch_messages = Message.objects.all()[offset : offset + batch_size]
+ batch_messages = Message.objects.all()[offset:offset + batch_size]
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- backend/Makefile (1 hunks)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
- backend/apps/ai/migrations/0001_initial.py (1 hunks)
- backend/apps/ai/models/chunk.py (1 hunks)
- backend/apps/slack/models/message.py (1 hunks)
- backend/settings/base.py (1 hunks)
- backend/tests/apps/ai/models/chunk_test.py (1 hunks)
- cspell/custom-dict.txt (1 hunks)
- docker-compose/production.yaml (1 hunks)
- docker-compose/staging.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- backend/apps/slack/models/message.py
🚧 Files skipped from review as they are similar to previous changes (5)
- cspell/custom-dict.txt
- docker-compose/staging.yaml
- backend/Makefile
- backend/settings/base.py
- docker-compose/production.yaml
🧰 Additional context used
🧬 Code Graph Analysis (1)
backend/apps/ai/models/chunk.py (3)
- backend/apps/common/models.py (2): BulkSaveModel (8-30), TimestampedModel (33-40)
- backend/apps/common/utils.py (1): truncate (164-176)
- backend/apps/slack/models/message.py (4): Message (13-125), Meta (16-19), bulk_save (86-88), update_data (91-125)
🪛 Flake8 (7.2.0)
backend/apps/ai/management/commands/slack_create_chunks.py
[error] 41-41: whitespace before ':'
(E203)
🪛 Pylint (3.3.7)
backend/apps/ai/migrations/0001_initial.py
[refactor] 9-9: Too few public methods (0/2)
(R0903)
backend/apps/ai/models/chunk.py
[refactor] 14-14: Too few public methods (0/2)
(R0903)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Run backend tests
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
🔇 Additional comments (16)
backend/apps/ai/models/chunk.py (4)
11-21: LGTM! Well-structured model with appropriate constraints. The Chunk model is well-designed with proper inheritance from TimestampedModel, appropriate field types including the VectorField for embeddings, and a sensible unique constraint on message and chunk_text combination. The table name and verbose name are clear and consistent.
23-26: LGTM! Clear and informative string representation. The `__str__` method provides a helpful representation with chunk ID, associated message ID, and truncated text preview using the utility function.
28-33: LGTM! Efficient bulk save implementation with proper filtering. The bulk_save method correctly filters out None values and delegates to the base class implementation. The conditional check ensures the bulk save operation only runs when there are valid chunks to process.
35-63: LGTM! Proper duplicate prevention logic. The update_data method correctly checks for existing chunks using the unique constraint fields and returns None for duplicates, preventing unnecessary database operations. The method signature and documentation are clear.
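The duplicate-prevention behavior described here can be sketched without Django (an in-memory stand-in for the unique-constraint check; class and field names are illustrative):

```python
class ChunkStore:
    """In-memory stand-in for Chunk.update_data's duplicate check:
    create a chunk only if (message_id, text) doesn't already exist."""

    def __init__(self):
        self._chunks = {}

    def update_data(self, *, message_id, text, embedding):
        key = (message_id, text)
        if key in self._chunks:  # mirrors the unique constraint on (message, text)
            return None
        chunk = {"message_id": message_id, "text": text, "embedding": embedding}
        self._chunks[key] = chunk
        return chunk

store = ChunkStore()
print(store.update_data(message_id=1, text="hello", embedding=[0.1]) is not None)
# → True  (new chunk created)
print(store.update_data(message_id=1, text="hello", embedding=[0.2]))
# → None  (duplicate skipped)
```

Returning None for duplicates lets callers filter results before a bulk save, as the reviewed code does.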
backend/apps/ai/migrations/0001_initial.py (2)
9-14: LGTM! Proper migration setup with correct dependencies. The migration correctly depends on the slack app migration and is marked as initial for the ai app.
16-51: LGTM! Proper pgvector extension and model creation. The migration correctly enables the VectorExtension before creating the Chunk model, ensuring vector field support is available. The model fields and constraints match the model definition perfectly.
backend/apps/ai/management/commands/slack_create_chunks.py (5)
16-17: LGTM! Good use of constants for configuration. The constants for minimum request interval and default offset improve maintainability and make the rate limiting behavior explicit.
23-35: LGTM! Proper environment variable validation and setup. The command correctly validates the OpenAI API key environment variable and initializes the client. The message count display helps users understand the scope of work.
69-82: LGTM! Effective rate limiting implementation. The rate limiting logic correctly tracks the last request time and enforces the minimum interval between API calls. The use of getattr with a default value handles the first request properly.
94-98: LGTM! Proper error handling for OpenAI API. The exception handling correctly catches OpenAI errors and logs them without stopping the entire process, allowing other messages to be processed.
110-121: LGTM! Comprehensive text cleaning implementation. The text cleaning method properly handles emojis, Slack user mentions, URLs, and emoji codes while normalizing whitespace. The use of the emoji library provides reliable emoji handling.
backend/tests/apps/ai/models/chunk_test.py (5)
9-13: LGTM! Helpful utility function for creating model mocks. The create_model_mock function provides a clean way to create properly configured mock objects with the necessary attributes for testing.
17-30: LGTM! Comprehensive test of string representation. The test properly verifies both the structure and content of the __str__ method output, including truncation behavior.
32-68: LGTM! Thorough testing of bulk_save method. The test suite comprehensively covers different scenarios for bulk_save including valid chunks, None filtering, empty lists, and custom fields parameter.
70-135: LGTM! Complete coverage of update_data method. The tests properly verify both the creation of new chunks and the duplicate prevention logic, including the save parameter behavior.
137-149: LGTM! Proper validation of model metadata and relationships. The tests verify the Meta class attributes and foreign key relationship configuration, ensuring the model is properly configured.
Actionable comments posted: 3
🧹 Nitpick comments (6)
backend/apps/ai/admin.py (1)
8-18: Consider enhancing the admin interface with additional functionality. The current configuration is solid. Consider adding these enhancements for better usability:

```diff
  class ChunkAdmin(admin.ModelAdmin):
+     list_filter = ("message__conversation",)
+     readonly_fields = ("id", "created_at", "updated_at")
      list_display = (
          "id",
          "message",
          "text",
      )
      search_fields = (
          "message__slack_message_id",
          "text",
      )
+     raw_id_fields = ("message",)
```

backend/apps/slack/models/message.py (1)
51-64: Robust text cleaning implementation with room for enhancement. The text cleaning logic is comprehensive and appropriate for preparing Slack messages for embedding. Consider these potential enhancements:

```diff
  @property
  def cleaned_text(self) -> str:
      """Get cleaned text from the message."""
      if not self.text:
          return ""

      text = emoji.demojize(self.text)  # Remove emojis.
      text = re.sub(r"<@U[A-Z0-9]+>", "", text)  # Remove user mentions.
+     text = re.sub(r"<#C[A-Z0-9]+\|[^>]+>", "", text)  # Remove channel mentions.
      text = re.sub(r"<https?://[^>]+>", "", text)  # Remove links.
      text = re.sub(r":\w+:", "", text)  # Remove emoji aliases.
      text = re.sub(r"\s+", " ", text)  # Normalize whitespace.
      return text.strip()
```

backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (2)
72-87: Consider breaking down the complex chunk creation logic. The list comprehension with embedded conditional logic and zip operations is hard to follow. Consider using a more explicit approach for better maintainability.

```diff
- return [
-     chunk
-     for text, embedding in zip(
-         chunk_text,
-         [d.embedding for d in response.data],  # Embedding data from OpenAI response.
-         strict=True,
-     )
-     if (
-         chunk := Chunk.update_data(
-             embedding=embedding,
-             message=message,
-             save=False,
-             text=text,
-         )
-     )
- ]
+ chunks = []
+ embeddings = [d.embedding for d in response.data]
+
+ for text, embedding in zip(chunk_text, embeddings, strict=True):
+     chunk = Chunk.update_data(
+         embedding=embedding,
+         message=message,
+         save=False,
+         text=text,
+     )
+     if chunk:
+         chunks.append(chunk)
+
+ return chunks
```
29-30: Consider performance implications for large message counts. Using `Message.objects.count()` can be expensive for large tables. If the exact count isn't critical for progress reporting, consider using approximate methods or removing this step.

```diff
- total_messages = Message.objects.count()
- self.stdout.write(f"Found {total_messages} messages to process")
+ self.stdout.write("Starting to process messages in batches...")
```

backend/apps/ai/models/chunk.py (2)
39-44: Consider making text splitting parameters configurable. The hardcoded values for chunk size, overlap, and separators might not be optimal for all use cases. Consider making these configurable through settings or method parameters.

```diff
  @staticmethod
- def split_text(text: str) -> list[str]:
+ def split_text(text: str, chunk_size: int = 300, chunk_overlap: int = 40) -> list[str]:
      """Split text into chunks."""
      return RecursiveCharacterTextSplitter(
-         chunk_size=300,
-         chunk_overlap=40,
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
          length_function=len,
          separators=["\n\n", "\n", " ", ""],
      ).split_text(text)
```
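To illustrate what `chunk_size` and `chunk_overlap` mean without pulling in langchain, here is a simplified fixed-window splitter (a sketch only; RecursiveCharacterTextSplitter's actual algorithm additionally respects the separator hierarchy):

```python
def split_with_overlap(text: str, chunk_size: int = 300, chunk_overlap: int = 40) -> list[str]:
    """Slide a fixed window over the text; consecutive chunks share
    `chunk_overlap` characters so context isn't lost at chunk boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

chunks = split_with_overlap("a" * 700, chunk_size=300, chunk_overlap=40)
print(len(chunks))
# → 3
```

The overlap keeps a sentence that straddles a boundary visible to both neighboring embeddings, which is the trade-off the reviewed configuration (300/40) encodes.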
20-20: Document the embedding dimensions choice. The 1536 dimensions correspond to OpenAI's text-embedding-3-small model. Consider adding a comment to clarify this dependency for future maintainers.

```diff
- embedding = VectorField(verbose_name="Embedding", dimensions=1536)
+ embedding = VectorField(verbose_name="Embedding", dimensions=1536)  # OpenAI text-embedding-3-small
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- backend/Makefile (1 hunks)
- backend/apps/ai/admin.py (1 hunks)
- backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (1 hunks)
- backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1 hunks)
- backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1 hunks)
- backend/apps/ai/models/chunk.py (1 hunks)
- backend/apps/slack/models/message.py (3 hunks)
- backend/tests/apps/ai/models/chunk_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/tests/apps/ai/models/chunk_test.py
🧰 Additional context used
🧬 Code Graph Analysis (4)
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (2)
- backend/apps/ai/models/chunk.py (4): Chunk (12-74), bulk_save (32-34), split_text (37-44), update_data (47-74)
- backend/apps/slack/models/message.py (6): Message (15-146), bulk_save (107-109), subtype (78-80), cleaned_text (52-63), text (83-85), update_data (112-146)
backend/apps/ai/admin.py (1)
- backend/apps/ai/models/chunk.py (1): Chunk (12-74)
backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1)
- backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1): Migration (7-27)
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1)
- backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1): Migration (6-22)
🪛 Flake8 (7.2.0)
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py
[error] 37-37: whitespace before ':'
(E203)
🪛 checkmake (0.2.2)
backend/Makefile
[warning] 1-1: Missing required phony target "all"
(minphony)
[warning] 1-1: Missing required phony target "clean"
(minphony)
[warning] 1-1: Missing required phony target "test"
(minphony)
🪛 Pylint (3.3.7)
backend/apps/ai/admin.py
[refactor] 8-8: Too few public methods (0/2)
(R0903)
backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py
[refactor] 7-7: Too few public methods (0/2)
(R0903)
backend/apps/ai/models/chunk.py
[refactor] 15-15: Too few public methods (0/2)
(R0903)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
- GitHub Check: Run backend tests
🔇 Additional comments (6)
backend/Makefile (1)
1-4: LGTM! Clean integration with existing Makefile structure. The new target follows the established pattern and correctly uses the existing `exec-backend-command` infrastructure for running Django management commands.

backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1)
12-22: Migration looks correct and well-structured. The field rename and constraint update are properly implemented. The dependencies correctly reference both the previous AI migration and the related Slack migration.
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1)
17-21: Verify the embedding dimensions align with your model choice. The 1536 dimensions are correct for OpenAI's `text-embedding-ada-002` model. Ensure this matches the embedding model used in your implementation.

```bash
#!/bin/bash
# Description: Verify the embedding model dimensions used in the codebase
# Expected: Find references to embedding models and their dimensions
rg -A 3 -B 3 "text-embedding|openai.*embed|embedding.*model"
```

backend/apps/slack/models/message.py (2)
3-3: LGTM! Appropriate imports for text processing functionality. The `re` and `emoji` imports support the new text cleaning functionality. Also applies to: 6-6
77-86: Clean and consistent property accessors. The `subtype` and `text` properties provide clean access to raw_data fields with appropriate default values.

backend/apps/ai/models/chunk.py (1)
15-18: LGTM on the model Meta configuration. The database table name, verbose name, and unique constraint are well-defined and follow Django conventions appropriately.
@Dishant1804 please check whether code works and my comment below.
Yes, the code works properly, we can proceed with it.



Resolves #1645