Chunk model and embeddings of message #1651
Conversation
Important: Review skipped due to path filters.
⛔ Files ignored due to path filters (1)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters.

Summary by CodeRabbit

Walkthrough: This change introduces a new Django app for AI functionalities.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
backend/pyproject.toml (1)
42-42: Fix dependency declaration formatting
Add a space around the equals sign for `pgvector` to match the project's existing style (e.g., `pgvector = "^0.4.1"`).

backend/apps/ai/management/commands/slack_create_chunks.py (2)
18-18: Consider standardizing environment variable naming. The environment variable `DJANGO_OPEN_AI_SECRET_KEY` could follow a more consistent pattern like `OPENAI_API_KEY` or `DJANGO_OPENAI_API_KEY` to align with common naming conventions.
46-48: Enhance progress reporting for better user experience. Consider adding more detailed progress reporting, especially for large batches that take significant time to process.

```diff
  processed_count += len(batch_messages)
+
+ if processed_count % (batch_size * 5) == 0:  # Report every 5 batches
+     self.stdout.write(f"Processed {processed_count}/{total_messages} messages...")

  self.stdout.write(f"Completed processing all {total_messages} messages")
```

backend/apps/ai/models/chunk.py (1)
28-32: Consider adding type annotations for consistency. The method works correctly, and the current return annotation is already correct, but consider adding a type hint for the `embedding` parameter:

```diff
- def from_chunk(self, chunk_text: str, message: Message, embedding=None) -> None:
+ def from_chunk(self, chunk_text: str, message: Message, embedding: list | None = None) -> None:
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (13)
- backend/Makefile (1 hunks)
- backend/apps/ai/admin.py (1 hunks)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
- backend/apps/ai/migrations/0001_initial.py (1 hunks)
- backend/apps/ai/models/__init__.py (1 hunks)
- backend/apps/ai/models/chunk.py (1 hunks)
- backend/pyproject.toml (1 hunks)
- backend/settings/base.py (1 hunks)
- backend/tests/apps/ai/management/commands/slack_create_chunks_test.py (1 hunks)
- backend/tests/apps/ai/models/chunk_test.py (1 hunks)
- docker-compose/local.yaml (1 hunks)
- docker-compose/production.yaml (1 hunks)
- docker-compose/staging.yaml (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
backend/tests/apps/ai/management/commands/slack_create_chunks_test.py
[refactor] 78-78: Too many arguments (6/5)
(R0913)
[refactor] 78-78: Too many positional arguments (6/5)
(R0917)
[refactor] 255-255: Too many local variables (16/15)
(R0914)
[refactor] 15-15: Too many public methods (23/20)
(R0904)
backend/apps/ai/admin.py
[refactor] 8-8: Too few public methods (0/2)
(R0903)
backend/apps/ai/migrations/0001_initial.py
[refactor] 9-9: Too few public methods (0/2)
(R0903)
backend/apps/ai/models/chunk.py
[refactor] 14-14: Too few public methods (0/2)
(R0903)
🪛 GitHub Actions: Run CI/CD
backend/tests/apps/ai/management/commands/slack_create_chunks_test.py
[error] 214-214: CSpell: Unknown word 'thumbsup' found during spelling check.
🪛 GitHub Check: CodeQL
backend/apps/ai/management/commands/slack_create_chunks.py
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \u2600-\u2b55 in the same character class, and overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class.
[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \u2500-\u2bef in the same character class, and overlaps with \u2600-\u2b55 in the same character class, and overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class, and overlaps with \ufffd-\ufffd in the same character class.
[warning] 103-103: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.
[warning] 103-103: Overly permissive regular expression range
Suspicious character range that overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (24)
docker-compose/staging.yaml (1)
40-40: Switch to pgvector-enabled Postgres image
This update aligns the staging environment with local and production configs for pgvector support, which is required for storing vector embeddings.

backend/apps/ai/models/__init__.py (1)
1-1: Expose Chunk model at package level
Importing `Chunk` in `__init__.py` makes the model accessible directly from the `models` package, improving usability.

backend/pyproject.toml (1)
37-38: Add AI and vector processing dependencies
Including `langchain` and `langchain-community` supports embedding generation workflows introduced by the new management command.

backend/Makefile (1)
178-180: Add Makefile target for message chunk creation
The `slack-create-message-chunks` target provides a convenient shortcut to run the new `slack_create_chunks` command in the backend container.

backend/settings/base.py (1)
46-46: Register the AI application
Adding `"apps.ai"` to `LOCAL_APPS` ensures Django recognizes and loads the new AI app and its migrations.

docker-compose/local.yaml (1)
55-55: LGTM! Essential infrastructure change for vector embeddings. The switch to `pgvector/pgvector:pg16` enables PostgreSQL vector extension support required by the new AI app's Chunk model with embedding fields.

backend/apps/ai/admin.py (1)
1-21: LGTM! Clean Django admin implementation. The admin configuration follows Django best practices with appropriate list display fields and search functionality on both the chunk text and related message's Slack ID.
docker-compose/production.yaml (1)
40-40: LGTM! Maintains consistency across environments. The production environment now uses the same pgvector-enabled PostgreSQL image as local development, ensuring consistent vector embedding support.
backend/apps/ai/migrations/0001_initial.py (1)
1-52: LGTM! Well-structured migration with proper vector support. The migration correctly:
- Enables pgvector extension for vector operations
- Creates Chunk model with 1536-dimension embeddings (matching OpenAI's text-embedding-3-small)
- Uses appropriate CASCADE deletion and unique constraints
- Follows Django migration best practices
backend/apps/ai/management/commands/slack_create_chunks.py (2)
34-35: Optimize QuerySet pagination for better performance. Using QuerySet slicing `[offset : offset + batch_size]` is inefficient for large datasets as it doesn't use database-level LIMIT/OFFSET optimization.

```diff
- for offset in range(0, total_messages, batch_size):
-     batch_messages = Message.objects.all()[offset : offset + batch_size]
+ for offset in range(0, total_messages, batch_size):
+     batch_messages = Message.objects.all().order_by('id')[offset:offset + batch_size]
```

Or better yet, use Django's pagination:

```diff
+ from django.core.paginator import Paginator
+
+ paginator = Paginator(Message.objects.all().order_by('id'), batch_size)
+ for page_num in paginator.page_range:
+     batch_messages = paginator.page(page_num).object_list
```

Likely an incorrect or invalid review comment.
79-79: I'll verify how and where `openai` is declared and imported in the repo.

```bash
#!/bin/bash
# 1. Look for "openai" in dependency files
grep -R "openai" -n --include=requirements* --include=Pipfile* --include=pyproject.toml . \
  || echo "No openai entries in dependency files"

# 2. Locate the command file path
file=$(fd slack_create_chunks.py)

# 3. Show any openai imports in the codebase
grep -R "import openai" -n . || echo "No import openai found in code"

# 4. Print the top of the command file to see how openai is used
echo "=== Imports in $file ==="
sed -n '1,30p' "$file"
```

backend/apps/ai/models/chunk.py (5)
1-8: LGTM! Clean imports and good organization.The imports are well-organized and include all necessary dependencies for the chunk model functionality.
11-21: Well-structured model definition with proper constraints. The model definition follows Django best practices with appropriate field types and constraints. The unique constraint on `(message, chunk_text)` prevents duplicate chunks for the same message.
23-26: Good string representation with truncation. The `__str__` method provides a clear, informative representation that includes the chunk ID, message ID, and truncated text preview.
34-39: Efficient bulk save implementation with proper filtering. The static method correctly filters out None values before calling the bulk save operation, which prevents potential errors during database operations.
41-70: Well-implemented conditional creation logic with good documentation. The `update_data` method has comprehensive logic for checking existing chunks and conditionally creating new ones. The docstring is detailed and the method signature uses keyword-only arguments appropriately.

backend/tests/apps/ai/management/commands/slack_create_chunks_test.py (5)
1-13: Clean test setup with appropriate imports. The imports are well-organized and include all necessary testing dependencies.
58-62: Good use of autouse fixture for environment setup. The autouse fixture ensures the OpenAI API key is available for all tests without manual setup in each test method.
78-105: Comprehensive test with good mocking strategy. The test thoroughly covers the successful execution path with appropriate mocking of external dependencies and clear assertions.
255-304: Excellent batch processing test coverage. This test effectively validates that the command processes messages in batches of 1000, which is crucial for performance with large datasets.
208-218: Fix spelling issue in test parameter. The pipeline failure indicates 'thumbsup' is not recognized by the spell checker.

```diff
- ("This is :smile: awesome :thumbsup:", "This is awesome "),
+ ("This is :smile: awesome :thumbs_up:", "This is awesome "),
```

Likely an incorrect or invalid review comment.
backend/tests/apps/ai/models/chunk_test.py (3)
9-14: Useful helper function for mock creation. The `create_model_mock` function provides a clean way to create model mocks with the necessary attributes for testing.
57-93: Thorough bulk_save method testing. The test suite comprehensively covers all scenarios for the bulk_save method, including edge cases with None values and empty lists.
181-193: Excellent model metadata validation. These tests ensure the model's Meta class attributes are correctly configured, which is important for database schema and admin interface behavior.
Actionable comments posted: 2
♻️ Duplicate comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
110-116: Address the overlapping Unicode ranges in regex patterns. The static analysis has flagged overlapping Unicode ranges in regex patterns. While not directly visible in this segment, the past review comments indicate regex issues that should be addressed.
Based on past review feedback, consider using the `emoji` library more effectively instead of custom regex patterns for emoji handling, which you're already doing with `emoji.demojize()` on line 110.
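The cleaning steps discussed in this thread can be sketched with the standard library alone (a rough illustration; the function name and exact patterns are not the project's actual implementation, which also uses the `emoji` library):

```python
import re

def clean_slack_text(text: str) -> str:
    """Approximate the cleaning discussed above: strip Slack user
    mentions, Slack-formatted links, and :emoji_alias: codes, then
    normalize whitespace."""
    text = re.sub(r"<@U[A-Z0-9]+>", "", text)     # Slack user mentions
    text = re.sub(r"<https?://[^>]+>", "", text)  # Slack-formatted links
    text = re.sub(r":\w+:", "", text)             # emoji alias codes
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

print(clean_slack_text("Hey <@U12345>, see <https://example.com> :smile: thanks"))
# → Hey , see thanks
```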
🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
37-50: Consider adding progress reporting for long-running operations. Processing all messages in batches could take a long time for large datasets. Consider adding progress reporting to improve user experience.

```diff
  for offset in range(0, total_messages, batch_size):
+     progress_percent = round((processed_count / total_messages) * 100, 1)
+     self.stdout.write(f"Processing batch at offset {offset} ({progress_percent}% complete)")
+
      batch_messages = Message.objects.all()[offset : offset + batch_size]
```
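The batched-processing-with-progress pattern suggested here can be shown standalone (a minimal sketch with illustrative names, not the command's actual code):

```python
def process_in_batches(items, batch_size=1000, report_every=5):
    """Process items in fixed-size batches, printing progress every N batches."""
    processed = 0
    for batch_start in range(0, len(items), batch_size):
        batch = items[batch_start:batch_start + batch_size]
        # ... handle the batch here (e.g. create chunks, bulk save) ...
        processed += len(batch)
        batch_number = batch_start // batch_size + 1
        if batch_number % report_every == 0:
            print(f"Processed {processed}/{len(items)} items...")
    return processed

print(process_in_batches(list(range(2500)), batch_size=1000))
# → 2500
```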
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
- backend/apps/slack/management/commands/slack_sync_messages.py (1 hunks)
- backend/pyproject.toml (1 hunks)
- cspell/custom-dict.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cspell/custom-dict.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/pyproject.toml
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
- GitHub Check: Run backend tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (3)
backend/apps/slack/management/commands/slack_sync_messages.py (1)
64-68: Verify that removing the sync_messages filter is intentional. The removal of the `sync_messages=True` filter means all conversations in the workspace will now be processed for message synchronization, not just those specifically marked for syncing. This broadens the scope significantly and may impact performance and data volume. Please confirm this change aligns with the AI chunk creation requirements and verify the performance implications:

```bash
#!/bin/bash
# Description: Check how the sync_messages field is used elsewhere in the codebase
# Expected: Understand if this field has other uses that might be affected

# Search for sync_messages field usage
rg -A 3 -B 3 "sync_messages" --type py

# Check if there are any database constraints or migrations related to sync_messages
fd -e py -x grep -l "sync_messages" {} \;
```

backend/apps/ai/management/commands/slack_create_chunks.py (2)
21-27: LGTM! Proper API key validation. Good practice to check for the required environment variable and exit gracefully with a clear error message if it's missing.
89-93: Update exception handling for newer OpenAI library. The exception class `openai.error.OpenAIError` is from an older version of the OpenAI library. The current library uses `openai.OpenAIError`.

```diff
- except openai.error.OpenAIError as e:
+ except openai.OpenAIError as e:
```

Likely an incorrect or invalid review comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
backend/apps/ai/management/commands/slack_create_chunks.py (2)
69-80: Improve rate limiting implementation. The current implementation using `getattr` works but could be cleaner. Also, consider initializing `last_request_time` in `__init__` or at the class level.

```diff
+ def __init__(self, *args, **kwargs):
+     super().__init__(*args, **kwargs)
+     self.last_request_time = None
+
  def create_chunks_from_message(
      self, message: Message, cleaned_text: str
  ) -> list[Chunk | None]:
      """Create chunks from a message."""
      if message.raw_data.get("subtype") in ["channel_join", "channel_leave"]:
          return []

      chunk_texts = self.split_message_text(cleaned_text)
      if not chunk_texts:
          self.stdout.write(
              f"No chunks created for message {message.slack_message_id} - text too short"
          )
          return []

      try:
-         time_since_last_request = datetime.now(UTC) - getattr(
-             self, "last_request_time", datetime.now(UTC) - timedelta(seconds=2)
-         )
-
-         if time_since_last_request < timedelta(seconds=1.2):
-             time.sleep(1.2 - time_since_last_request.total_seconds())
+         # Rate limiting: ensure at least 1.2 seconds between requests
+         if self.last_request_time:
+             time_since_last_request = datetime.now(UTC) - self.last_request_time
+             if time_since_last_request < timedelta(seconds=1.2):
+                 time.sleep(1.2 - time_since_last_request.total_seconds())

          response = self.openai_client.embeddings.create(
              model="text-embedding-3-small", input=chunk_texts
          )
          self.last_request_time = datetime.now(UTC)
```
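The rate-limiting pattern under discussion can be sketched as a standalone helper (a minimal illustration, not the project's actual class; `min_interval` mirrors the 1.2-second spacing above):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, min_interval: float = 1.2):
        self.min_interval = min_interval
        self.last_request_time = None  # initialized up front, no getattr needed

    def wait(self) -> None:
        """Sleep just long enough to honor the minimum interval."""
        if self.last_request_time is not None:
            elapsed = time.monotonic() - self.last_request_time
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request_time = time.monotonic()

limiter = RateLimiter(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # first call returns immediately, later calls pace themselves
print(time.monotonic() - start >= 0.2)
# → True
```

Initializing `last_request_time` in `__init__` makes the first-call case explicit, which is the cleanup the comment suggests.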
111-112: Fix variable reference error in text cleaning. Line 112 references `message_text` instead of `cleaned_text`, which will undo the emoji processing from line 111.

```diff
  cleaned_text = emoji.demojize(message_text, delimiters=("", ""))
- cleaned_text = re.sub(r"<@U[A-Z0-9]+>", "", message_text)
+ cleaned_text = re.sub(r"<@U[A-Z0-9]+>", "", cleaned_text)
```
🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
37-39: Consider using cursor-based pagination to avoid skipping messages. Offset-based pagination can skip messages if records are deleted during processing. Consider using ID-based pagination for more reliable processing.

```diff
- for offset in range(0, total_messages, batch_size):
-     batch_messages = Message.objects.all()[offset : offset + batch_size]
+ last_id = 0
+ while True:
+     batch_messages = Message.objects.filter(id__gt=last_id).order_by('id')[:batch_size]
+     if not batch_messages:
+         break

      batch_chunks = []
      for message in batch_messages:
          cleaned_text = self.clean_message_text(message.raw_data.get("text", ""))
          chunks = self.create_chunks_from_message(message, cleaned_text)
          batch_chunks.extend(chunks)

      if batch_chunks:
          Chunk.bulk_save(batch_chunks)

      processed_count += len(batch_messages)
+     last_id = batch_messages[len(batch_messages) - 1].id
+     self.stdout.write(f"Processed {processed_count}/{total_messages} messages...")
```
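Keyset (ID-based) pagination can be illustrated outside Django with an in-memory table sorted by id (a sketch; `fetch_after` stands in for the filtered, ordered queryset):

```python
def fetch_after(rows, last_id, limit):
    """Return up to `limit` rows with id greater than `last_id` (rows sorted by id)."""
    return [r for r in rows if r["id"] > last_id][:limit]

def paginate(rows, batch_size=2):
    """Walk all rows in id order using keyset pagination."""
    seen, last_id = [], 0
    while True:
        batch = fetch_after(rows, last_id, batch_size)
        if not batch:
            break
        seen.extend(r["id"] for r in batch)
        last_id = batch[-1]["id"]  # cursor advances past the last row seen
    return seen

rows = [{"id": i} for i in (1, 2, 4, 7, 9)]
print(paginate(rows))
# → [1, 2, 4, 7, 9]
```

Because the cursor is the last id seen rather than a positional offset, rows deleted mid-run cannot shift the window and cause skips.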
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: Run frontend unit tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/ai/management/commands/slack_create_chunks.py (2)
1-15: Imports are well-organized and necessary. The imports follow PEP 8 conventions with proper ordering (standard library, third-party, local).
96-105: Text splitting configuration is appropriate. The RecursiveCharacterTextSplitter configuration with chunk size 300 and overlap 40 provides good balance for embedding generation.
A few questions/suggestions I have regarding this:
Actionable comments posted: 0
🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)
40-54: Fix whitespace formatting issue. There's a formatting issue with the slice syntax that violates PEP 8.

```diff
- batch_messages = Message.objects.all()[offset : offset + batch_size]
+ batch_messages = Message.objects.all()[offset:offset + batch_size]
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- backend/Makefile (1 hunks)
- backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
- backend/apps/ai/migrations/0001_initial.py (1 hunks)
- backend/apps/ai/models/chunk.py (1 hunks)
- backend/apps/slack/models/message.py (1 hunks)
- backend/settings/base.py (1 hunks)
- backend/tests/apps/ai/models/chunk_test.py (1 hunks)
- cspell/custom-dict.txt (1 hunks)
- docker-compose/production.yaml (1 hunks)
- docker-compose/staging.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- backend/apps/slack/models/message.py
🚧 Files skipped from review as they are similar to previous changes (5)
- cspell/custom-dict.txt
- docker-compose/staging.yaml
- backend/Makefile
- backend/settings/base.py
- docker-compose/production.yaml
🧰 Additional context used
🧬 Code Graph Analysis (1)
backend/apps/ai/models/chunk.py (3)
- backend/apps/common/models.py (2): BulkSaveModel (8-30), TimestampedModel (33-40)
- backend/apps/common/utils.py (1): truncate (164-176)
- backend/apps/slack/models/message.py (4): Message (13-125), Meta (16-19), bulk_save (86-88), update_data (91-125)
🪛 Flake8 (7.2.0)
backend/apps/ai/management/commands/slack_create_chunks.py
[error] 41-41: whitespace before ':'
(E203)
🪛 Pylint (3.3.7)
backend/apps/ai/migrations/0001_initial.py
[refactor] 9-9: Too few public methods (0/2)
(R0903)
backend/apps/ai/models/chunk.py
[refactor] 14-14: Too few public methods (0/2)
(R0903)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Run backend tests
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
🔇 Additional comments (16)
backend/apps/ai/models/chunk.py (4)
11-21: LGTM! Well-structured model with appropriate constraints. The Chunk model is well-designed with proper inheritance from TimestampedModel, appropriate field types including the VectorField for embeddings, and a sensible unique constraint on message and chunk_text combination. The table name and verbose name are clear and consistent.
23-26: LGTM! Clear and informative string representation. The `__str__` method provides a helpful representation with chunk ID, associated message ID, and truncated text preview using the utility function.
28-33: LGTM! Efficient bulk save implementation with proper filtering. The bulk_save method correctly filters out None values and delegates to the base class implementation. The conditional check ensures the bulk save operation only runs when there are valid chunks to process.
35-63: LGTM! Proper duplicate prevention logic. The update_data method correctly checks for existing chunks using the unique constraint fields and returns None for duplicates, preventing unnecessary database operations. The method signature and documentation are clear.
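The duplicate-prevention behavior described here can be sketched without Django (an in-memory stand-in for the unique-constraint check; class and field names are illustrative):

```python
class ChunkStore:
    """In-memory stand-in for Chunk.update_data's duplicate check:
    create a chunk only if (message_id, text) doesn't already exist."""

    def __init__(self):
        self._chunks = {}

    def update_data(self, *, message_id, text, embedding):
        key = (message_id, text)
        if key in self._chunks:  # mirrors the unique constraint on (message, text)
            return None
        chunk = {"message_id": message_id, "text": text, "embedding": embedding}
        self._chunks[key] = chunk
        return chunk

store = ChunkStore()
print(store.update_data(message_id=1, text="hello", embedding=[0.1]) is not None)
# → True  (new chunk created)
print(store.update_data(message_id=1, text="hello", embedding=[0.2]))
# → None  (duplicate skipped)
```

Returning None for duplicates lets callers filter results before a bulk save, as the reviewed code does.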
backend/apps/ai/migrations/0001_initial.py (2)
9-14: LGTM! Proper migration setup with correct dependencies. The migration correctly depends on the slack app migration and is marked as initial for the ai app.
16-51: LGTM! Proper pgvector extension and model creation. The migration correctly enables the VectorExtension before creating the Chunk model, ensuring vector field support is available. The model fields and constraints match the model definition perfectly.
backend/apps/ai/management/commands/slack_create_chunks.py (5)
16-17: LGTM! Good use of constants for configuration. The constants for minimum request interval and default offset improve maintainability and make the rate limiting behavior explicit.
23-35: LGTM! Proper environment variable validation and setup. The command correctly validates the OpenAI API key environment variable and initializes the client. The message count display helps users understand the scope of work.
69-82: LGTM! Effective rate limiting implementation. The rate limiting logic correctly tracks the last request time and enforces the minimum interval between API calls. The use of getattr with a default value handles the first request properly.
94-98: LGTM! Proper error handling for OpenAI API. The exception handling correctly catches OpenAI errors and logs them without stopping the entire process, allowing other messages to be processed.
110-121: LGTM! Comprehensive text cleaning implementation. The text cleaning method properly handles emojis, Slack user mentions, URLs, and emoji codes while normalizing whitespace. The use of the emoji library provides reliable emoji handling.
backend/tests/apps/ai/models/chunk_test.py (5)
9-13: LGTM! Helpful utility function for creating model mocks. The create_model_mock function provides a clean way to create properly configured mock objects with the necessary attributes for testing.
17-30: LGTM! Comprehensive test of string representation. The test properly verifies both the structure and content of the __str__ method output, including truncation behavior.
32-68: LGTM! Thorough testing of bulk_save method. The test suite comprehensively covers different scenarios for bulk_save including valid chunks, None filtering, empty lists, and custom fields parameter.
70-135: LGTM! Complete coverage of update_data method. The tests properly verify both the creation of new chunks and the duplicate prevention logic, including the save parameter behavior.
137-149: LGTM! Proper validation of model metadata and relationships. The tests verify the Meta class attributes and foreign key relationship configuration, ensuring the model is properly configured.
Actionable comments posted: 3
🧹 Nitpick comments (6)
backend/apps/ai/admin.py (1)
8-18: Consider enhancing the admin interface with additional functionality. The current configuration is solid. Consider adding these enhancements for better usability:

```diff
  class ChunkAdmin(admin.ModelAdmin):
+     list_filter = ("message__conversation",)
+     readonly_fields = ("id", "created_at", "updated_at")
      list_display = (
          "id",
          "message",
          "text",
      )
      search_fields = (
          "message__slack_message_id",
          "text",
      )
+     raw_id_fields = ("message",)
```

backend/apps/slack/models/message.py (1)
51-64: Robust text cleaning implementation with room for enhancement. The text cleaning logic is comprehensive and appropriate for preparing Slack messages for embedding. Consider these potential enhancements:

```diff
  @property
  def cleaned_text(self) -> str:
      """Get cleaned text from the message."""
      if not self.text:
          return ""

      text = emoji.demojize(self.text)  # Remove emojis.
      text = re.sub(r"<@U[A-Z0-9]+>", "", text)  # Remove user mentions.
+     text = re.sub(r"<#C[A-Z0-9]+\|[^>]+>", "", text)  # Remove channel mentions.
      text = re.sub(r"<https?://[^>]+>", "", text)  # Remove links.
      text = re.sub(r":\w+:", "", text)  # Remove emoji aliases.
      text = re.sub(r"\s+", " ", text)  # Normalize whitespace.
      return text.strip()
```

backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (2)
72-87: Consider breaking down the complex chunk creation logic. The list comprehension with embedded conditional logic and zip operations is hard to follow. Consider using a more explicit approach for better maintainability.

```diff
- return [
-     chunk
-     for text, embedding in zip(
-         chunk_text,
-         [d.embedding for d in response.data],  # Embedding data from OpenAI response.
-         strict=True,
-     )
-     if (
-         chunk := Chunk.update_data(
-             embedding=embedding,
-             message=message,
-             save=False,
-             text=text,
-         )
-     )
- ]
+ chunks = []
+ embeddings = [d.embedding for d in response.data]
+
+ for text, embedding in zip(chunk_text, embeddings, strict=True):
+     chunk = Chunk.update_data(
+         embedding=embedding,
+         message=message,
+         save=False,
+         text=text,
+     )
+     if chunk:
+         chunks.append(chunk)
+
+ return chunks
```
29-30: Consider performance implications for large message counts. Using `Message.objects.count()` can be expensive for large tables. If the exact count isn't critical for progress reporting, consider using approximate methods or removing this step.

```diff
- total_messages = Message.objects.count()
- self.stdout.write(f"Found {total_messages} messages to process")
+ self.stdout.write("Starting to process messages in batches...")
```

backend/apps/ai/models/chunk.py (2)
39-44: Consider making text splitting parameters configurable. The hardcoded values for chunk size, overlap, and separators might not be optimal for all use cases. Consider making these configurable through settings or method parameters.

```diff
  @staticmethod
- def split_text(text: str) -> list[str]:
+ def split_text(text: str, chunk_size: int = 300, chunk_overlap: int = 40) -> list[str]:
      """Split text into chunks."""
      return RecursiveCharacterTextSplitter(
-         chunk_size=300,
-         chunk_overlap=40,
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
          length_function=len,
          separators=["\n\n", "\n", " ", ""],
      ).split_text(text)
```
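To illustrate what `chunk_size` and `chunk_overlap` mean without pulling in langchain, here is a simplified fixed-window splitter (a sketch only; RecursiveCharacterTextSplitter's actual algorithm additionally respects the separator hierarchy):

```python
def split_with_overlap(text: str, chunk_size: int = 300, chunk_overlap: int = 40) -> list[str]:
    """Slide a fixed window over the text; consecutive chunks share
    `chunk_overlap` characters so context isn't lost at chunk boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

chunks = split_with_overlap("a" * 700, chunk_size=300, chunk_overlap=40)
print(len(chunks))
# → 3
```

The overlap keeps a sentence that straddles a boundary visible to both neighboring embeddings, which is the trade-off the reviewed configuration (300/40) encodes.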
20-20: Document the embedding dimensions choice. The 1536 dimensions correspond to OpenAI's text-embedding-3-small model. Consider adding a comment to clarify this dependency for future maintainers.

```diff
- embedding = VectorField(verbose_name="Embedding", dimensions=1536)
+ embedding = VectorField(verbose_name="Embedding", dimensions=1536)  # OpenAI text-embedding-3-small
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- backend/Makefile (1 hunks)
- backend/apps/ai/admin.py (1 hunks)
- backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (1 hunks)
- backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1 hunks)
- backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1 hunks)
- backend/apps/ai/models/chunk.py (1 hunks)
- backend/apps/slack/models/message.py (3 hunks)
- backend/tests/apps/ai/models/chunk_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/tests/apps/ai/models/chunk_test.py
🧰 Additional context used
🧬 Code Graph Analysis (4)
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (2)
- backend/apps/ai/models/chunk.py (4): Chunk (12-74), bulk_save (32-34), split_text (37-44), update_data (47-74)
- backend/apps/slack/models/message.py (6): Message (15-146), bulk_save (107-109), subtype (78-80), cleaned_text (52-63), text (83-85), update_data (112-146)
backend/apps/ai/admin.py (1)
- backend/apps/ai/models/chunk.py (1): Chunk (12-74)
backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1)
- backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1): Migration (7-27)
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1)
- backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1): Migration (6-22)
🪛 Flake8 (7.2.0)
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py
[error] 37-37: whitespace before ':'
(E203)
🪛 checkmake (0.2.2)
backend/Makefile
[warning] 1-1: Missing required phony target "all"
(minphony)
[warning] 1-1: Missing required phony target "clean"
(minphony)
[warning] 1-1: Missing required phony target "test"
(minphony)
🪛 Pylint (3.3.7)
backend/apps/ai/admin.py
[refactor] 8-8: Too few public methods (0/2)
(R0903)
backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py
[refactor] 6-6: Too few public methods (0/2)
(R0903)
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py
[refactor] 7-7: Too few public methods (0/2)
(R0903)
backend/apps/ai/models/chunk.py
[refactor] 15-15: Too few public methods (0/2)
(R0903)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run frontend unit tests
- GitHub Check: Run backend tests
🔇 Additional comments (6)
backend/Makefile (1)
1-4: LGTM! Clean integration with existing Makefile structure. The new target follows the established pattern and correctly uses the existing `exec-backend-command` infrastructure for running Django management commands.

backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1)
12-22: Migration looks correct and well-structured. The field rename and constraint update are properly implemented. The dependencies correctly reference both the previous AI migration and the related Slack migration.
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1)
17-21: Verify the embedding dimensions align with your model choice. The 1536 dimensions are correct for OpenAI's `text-embedding-ada-002` model. Ensure this matches the embedding model used in your implementation.

```bash
#!/bin/bash
# Description: Verify the embedding model dimensions used in the codebase
# Expected: Find references to embedding models and their dimensions
rg -A 3 -B 3 "text-embedding|openai.*embed|embedding.*model"
```

backend/apps/slack/models/message.py (2)
3-3: LGTM! Appropriate imports for text processing functionality. The `re` and `emoji` imports support the new text cleaning functionality. Also applies to: 6-6
77-86: Clean and consistent property accessors. The `subtype` and `text` properties provide clean access to raw_data fields with appropriate default values.

backend/apps/ai/models/chunk.py (1)
15-18: LGTM on the model Meta configuration. The database table name, verbose name, and unique constraint are well-defined and follow Django conventions appropriately.
@Dishant1804 please check whether code works and my comment below.
Yes, the code works properly, we can proceed with it.



Resolves #1645