
Conversation

@Dishant1804
Collaborator

Resolves #1645

  • created chunk model

@Dishant1804 Dishant1804 requested a review from arkid15r as a code owner June 22, 2025 17:16
@coderabbitai
Contributor

coderabbitai bot commented Jun 22, 2025

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** overrides the default block on the dist directory by removing the pattern from both lists.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Summary by CodeRabbit

  • New Features

    • Introduced an AI app for managing text chunks and embeddings from Slack messages.
    • Added admin interface for managing chunks.
    • Added a management command to process Slack messages and generate AI-powered text chunks with embeddings.
  • Improvements

    • Enhanced the Message model with new properties for cleaned text and subtype access.
    • Updated database and Docker configurations to support vector operations for AI features.
  • Bug Fixes

    • Adjusted database schema and model metadata for improved clarity and uniqueness constraints.
  • Tests

    • Added comprehensive tests for the new Chunk model and its methods.
  • Chores

    • Updated dependencies and custom dictionary for new AI features.

Walkthrough

This change introduces a new Django app for AI functionalities, specifically adding a Chunk model to store text chunks and their embeddings from Slack messages. It includes database migrations, model logic, admin configuration, management commands for chunk creation, dependency updates, Docker Compose modifications for pgvector support, and comprehensive unit tests.
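The walkthrough can be condensed into a sketch of the pipeline: clean a message, chunk it, attach embeddings, and build rows for a bulk save. Everything below is illustrative (function names, the placeholder embedding, `textwrap.wrap` standing in for the real splitter) and is not the PR's actual API.

```python
import textwrap


def fake_embed(chunks: list[str]) -> list[list[float]]:
    """Placeholder for the OpenAI embeddings API call."""
    return [[float(len(c))] for c in chunks]


def build_rows(message_id: str, raw: str) -> list[dict]:
    """Clean -> chunk -> embed -> rows ready for a bulk save."""
    text = " ".join(raw.split())       # whitespace normalization only
    chunks = textwrap.wrap(text, 300)  # stand-in for the real splitter
    return [
        {"message": message_id, "chunk_text": c, "embedding": e}
        for c, e in zip(chunks, fake_embed(chunks))
    ]


print(build_rows("M1", "hello   world"))
```

In the actual app the rows become `Chunk` instances and the bulk save is a single database round-trip per batch.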

Changes

Files/Paths Change Summary
backend/apps/ai/models/chunk.py, backend/apps/ai/models/__init__.py Added Chunk model with methods for chunking text, saving, and embedding; exposed in module __init__.
backend/apps/ai/admin.py Registered Chunk model in Django admin with custom display/search.
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py Added management command to generate and store message chunks with embeddings using OpenAI API.
backend/apps/ai/migrations/0001_initial.py, 0002_rename_chunk_text_chunk_text_and_more.py, 0003_alter_chunk_options_alter_chunk_embedding_and_more.py Added initial and subsequent migrations for Chunk model, including schema changes and metadata updates.
backend/settings/base.py Added "apps.ai" to LOCAL_APPS in Django settings.
backend/Makefile Added make target for running the chunk creation management command.
backend/pyproject.toml Added dependencies: emoji, langchain, langchain-community, pgvector.
docker-compose/local.yaml, docker-compose/production.yaml, docker-compose/staging.yaml Changed Postgres image to pgvector/pgvector:pg16 for vector support.
backend/apps/slack/models/message.py Added properties to Message model for cleaned text, subtype, and text access.
backend/tests/apps/ai/models/chunk_test.py Added unit tests for Chunk model methods, meta options, and relationships.
cspell/custom-dict.txt Added "demojize" to custom dictionary.

Assessment against linked issues

Objective Addressed Explanation
Create chunk model and store the chunks and embeddings of the messages (#1645)

Suggested labels

backend, docker


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
backend/pyproject.toml (1)

42-42: Fix dependency declaration formatting
Add a space around the equals sign for pgvector to match the project’s existing style (e.g., pgvector = "^0.4.1").

backend/apps/ai/management/commands/slack_create_chunks.py (2)

18-18: Consider standardizing environment variable naming.

The environment variable DJANGO_OPEN_AI_SECRET_KEY could follow a more consistent pattern like OPENAI_API_KEY or DJANGO_OPENAI_API_KEY to align with common naming conventions.


46-48: Enhance progress reporting for better user experience.

Consider adding more detailed progress reporting, especially for large batches that take significant time to process.

             processed_count += len(batch_messages)
+            
+            if processed_count % (batch_size * 5) == 0:  # Report every 5 batches
+                self.stdout.write(f"Processed {processed_count}/{total_messages} messages...")

         self.stdout.write(f"Completed processing all {total_messages} messages")
backend/apps/ai/models/chunk.py (1)

28-32: Consider adding a type hint for the embedding parameter.

The return type annotation is already correct, but the embedding parameter lacks an explicit type hint for better code documentation:

-    def from_chunk(self, chunk_text: str, message: Message, embedding=None) -> None:
+    def from_chunk(self, chunk_text: str, message: Message, embedding: list | None = None) -> None:
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d16437 and ef999c8.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • backend/Makefile (1 hunks)
  • backend/apps/ai/admin.py (1 hunks)
  • backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
  • backend/apps/ai/migrations/0001_initial.py (1 hunks)
  • backend/apps/ai/models/__init__.py (1 hunks)
  • backend/apps/ai/models/chunk.py (1 hunks)
  • backend/pyproject.toml (1 hunks)
  • backend/settings/base.py (1 hunks)
  • backend/tests/apps/ai/management/commands/slack_create_chunks_test.py (1 hunks)
  • backend/tests/apps/ai/models/chunk_test.py (1 hunks)
  • docker-compose/local.yaml (1 hunks)
  • docker-compose/production.yaml (1 hunks)
  • docker-compose/staging.yaml (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
backend/tests/apps/ai/management/commands/slack_create_chunks_test.py

[refactor] 78-78: Too many arguments (6/5)

(R0913)


[refactor] 78-78: Too many positional arguments (6/5)

(R0917)


[refactor] 255-255: Too many local variables (16/15)

(R0914)


[refactor] 15-15: Too many public methods (23/20)

(R0904)

backend/apps/ai/admin.py

[refactor] 8-8: Too few public methods (0/2)

(R0903)

backend/apps/ai/migrations/0001_initial.py

[refactor] 9-9: Too few public methods (0/2)

(R0903)

backend/apps/ai/models/chunk.py

[refactor] 14-14: Too few public methods (0/2)

(R0903)

🪛 GitHub Actions: Run CI/CD
backend/tests/apps/ai/management/commands/slack_create_chunks_test.py

[error] 214-214: CSpell: Unknown word 'thumbsup' found during spelling check.

🪛 GitHub Check: CodeQL
backend/apps/ai/management/commands/slack_create_chunks.py

[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.


[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.


[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.


[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.


[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \u2600-\u2b55 in the same character class, and overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class.


[warning] 102-102: Overly permissive regular expression range
Suspicious character range that overlaps with \u2500-\u2bef in the same character class, and overlaps with \u2600-\u2b55 in the same character class, and overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class, and overlaps with \ufffd-\ufffd in the same character class.


[warning] 103-103: Overly permissive regular expression range
Suspicious character range that overlaps with \ufffd-\ufffd in the same character class.


[warning] 103-103: Overly permissive regular expression range
Suspicious character range that overlaps with \u2640-\u2642 in the same character class, and overlaps with \u2702-\u27b0 in the same character class.

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (24)
docker-compose/staging.yaml (1)

40-40: Switch to pgvector-enabled Postgres image
This update aligns the staging environment with local and production configs for pgvector support, which is required for storing vector embeddings.

backend/apps/ai/models/__init__.py (1)

1-1: Expose Chunk model at package level
Importing Chunk in __init__.py makes the model accessible directly from the models package, improving usability.

backend/pyproject.toml (1)

37-38: Add AI and vector processing dependencies
Including langchain and langchain-community supports embedding generation workflows introduced by the new management command.

backend/Makefile (1)

178-180: Add Makefile target for message chunk creation
The slack-create-message-chunks target provides a convenient shortcut to run the new slack_create_chunks command in the backend container.

backend/settings/base.py (1)

46-46: Register the AI application
Adding "apps.ai" to LOCAL_APPS ensures Django recognizes and loads the new AI app and its migrations.

docker-compose/local.yaml (1)

55-55: LGTM! Essential infrastructure change for vector embeddings.

The switch to pgvector/pgvector:pg16 enables PostgreSQL vector extension support required by the new AI app's Chunk model with embedding fields.

backend/apps/ai/admin.py (1)

1-21: LGTM! Clean Django admin implementation.

The admin configuration follows Django best practices with appropriate list display fields and search functionality on both the chunk text and related message's Slack ID.

docker-compose/production.yaml (1)

40-40: LGTM! Maintains consistency across environments.

The production environment now uses the same pgvector-enabled PostgreSQL image as local development, ensuring consistent vector embedding support.

backend/apps/ai/migrations/0001_initial.py (1)

1-52: LGTM! Well-structured migration with proper vector support.

The migration correctly:

  • Enables pgvector extension for vector operations
  • Creates Chunk model with 1536-dimension embeddings (matching OpenAI's text-embedding-3-small)
  • Uses appropriate CASCADE deletion and unique constraints
  • Follows Django migration best practices
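The 1536-dimension vectors are stored to support similarity search. As a minimal illustration in pure Python (pgvector performs the equivalent in SQL via its distance operators), cosine similarity between two embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```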
backend/apps/ai/management/commands/slack_create_chunks.py (2)

34-35: Optimize QuerySet pagination for better performance.

Using QuerySet slicing [offset:offset + batch_size] is inefficient for large datasets as it doesn't use database-level LIMIT/OFFSET optimization.

-        for offset in range(0, total_messages, batch_size):
-            batch_messages = Message.objects.all()[offset : offset + batch_size]
+        for offset in range(0, total_messages, batch_size):
+            batch_messages = Message.objects.all().order_by('id')[offset:offset + batch_size]

Or better yet, use Django's pagination:

+from django.core.paginator import Paginator
+
+        paginator = Paginator(Message.objects.all().order_by('id'), batch_size)
+        for page_num in paginator.page_range:
+            batch_messages = paginator.page(page_num).object_list

Likely an incorrect or invalid review comment.


79-79: I’ll verify how and where openai is declared and imported in the repo.

#!/bin/bash
# 1. Look for “openai” in dependency files
grep -R "openai" -n --include=requirements* --include=Pipfile* --include=pyproject.toml . \
  || echo "No openai entries in dependency files"

# 2. Locate the command file path
file=$(fd slack_create_chunks.py)

# 3. Show any openai imports in the codebase
grep -R "import openai" -n . || echo "No import openai found in code"

# 4. Print the top of the command file to see how openai is used
echo "=== Imports in $file ==="
sed -n '1,30p' "$file"
backend/apps/ai/models/chunk.py (5)

1-8: LGTM! Clean imports and good organization.

The imports are well-organized and include all necessary dependencies for the chunk model functionality.


11-21: Well-structured model definition with proper constraints.

The model definition follows Django best practices with appropriate field types and constraints. The unique constraint on (message, chunk_text) prevents duplicate chunks for the same message.


23-26: Good string representation with truncation.

The __str__ method provides a clear, informative representation that includes the chunk ID, message ID, and truncated text preview.


34-39: Efficient bulk save implementation with proper filtering.

The static method correctly filters out None values before calling the bulk save operation, which prevents potential errors during database operations.


41-70: Well-implemented conditional creation logic with good documentation.

The update_data method has comprehensive logic for checking existing chunks and conditionally creating new ones. The docstring is detailed and the method signature uses keyword-only arguments appropriately.

backend/tests/apps/ai/management/commands/slack_create_chunks_test.py (5)

1-13: Clean test setup with appropriate imports.

The imports are well-organized and include all necessary testing dependencies.


58-62: Good use of autouse fixture for environment setup.

The autouse fixture ensures the OpenAI API key is available for all tests without manual setup in each test method.


78-105: Comprehensive test with good mocking strategy.

The test thoroughly covers the successful execution path with appropriate mocking of external dependencies and clear assertions.


255-304: Excellent batch processing test coverage.

This test effectively validates that the command processes messages in batches of 1000, which is crucial for performance with large datasets.


208-218: Fix spelling issue in test parameter.

The pipeline failure indicates 'thumbsup' is not recognized by the spell checker.

-            ("This is :smile: awesome :thumbsup:", "This is  awesome "),
+            ("This is :smile: awesome :thumbs_up:", "This is  awesome "),

Likely an incorrect or invalid review comment.

backend/tests/apps/ai/models/chunk_test.py (3)

9-14: Useful helper function for mock creation.

The create_model_mock function provides a clean way to create model mocks with the necessary attributes for testing.


57-93: Thorough bulk_save method testing.

The test suite comprehensively covers all scenarios for the bulk_save method, including edge cases with None values and empty lists.


181-193: Excellent model metadata validation.

These tests ensure the model's Meta class attributes are correctly configured, which is important for database schema and admin interface behavior.

@github-actions github-actions bot added nestbot and removed backend labels Jun 22, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)

110-116: Address the overlapping Unicode ranges in regex patterns.

The static analysis has flagged overlapping Unicode ranges in regex patterns. While not directly visible in this segment, the past review comments indicate regex issues that should be addressed.

Based on past review feedback, consider using the emoji library more effectively instead of custom regex patterns for emoji handling, which you're already doing with emoji.demojize() on line 110.

🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)

37-50: Consider adding progress reporting for long-running operations.

Processing all messages in batches could take a long time for large datasets. Consider adding progress reporting to improve user experience.

         for offset in range(0, total_messages, batch_size):
+            progress_percent = round((processed_count / total_messages) * 100, 1)
+            self.stdout.write(f"Processing batch at offset {offset} ({progress_percent}% complete)")
+            
             batch_messages = Message.objects.all()[offset : offset + batch_size]
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e55c56 and cbb255f.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
  • backend/apps/slack/management/commands/slack_sync_messages.py (1 hunks)
  • backend/pyproject.toml (1 hunks)
  • cspell/custom-dict.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cspell/custom-dict.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/pyproject.toml
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run frontend unit tests
  • GitHub Check: Run backend tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (3)
backend/apps/slack/management/commands/slack_sync_messages.py (1)

64-68: Verify that removing the sync_messages filter is intentional.

The removal of the sync_messages=True filter means all conversations in the workspace will now be processed for message synchronization, not just those specifically marked for syncing. This broadens the scope significantly and may impact performance and data volume.

Please confirm this change aligns with the AI chunk creation requirements and verify the performance implications:

#!/bin/bash
# Description: Check how the sync_messages field is used elsewhere in the codebase
# Expected: Understand if this field has other uses that might be affected

# Search for sync_messages field usage
rg -A 3 -B 3 "sync_messages" --type py

# Check if there are any database constraints or migrations related to sync_messages
fd -e py -x grep -l "sync_messages" {} \;
backend/apps/ai/management/commands/slack_create_chunks.py (2)

21-27: LGTM! Proper API key validation.

Good practice to check for the required environment variable and exit gracefully with a clear error message if it's missing.


89-93: Update exception handling for newer OpenAI library.

The exception class openai.error.OpenAIError is from an older version of the OpenAI library. The current library uses openai.OpenAIError.

-        except openai.error.OpenAIError as e:
+        except openai.OpenAIError as e:

Likely an incorrect or invalid review comment.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/apps/ai/management/commands/slack_create_chunks.py (2)

69-80: Improve rate limiting implementation.

The current implementation using getattr works but could be cleaner. Also, consider initializing last_request_time in __init__ or at the class level.

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.last_request_time = None
+
     def create_chunks_from_message(
         self, message: Message, cleaned_text: str
     ) -> list[Chunk | None]:
         """Create chunks from a message."""
         if message.raw_data.get("subtype") in ["channel_join", "channel_leave"]:
             return []

         chunk_texts = self.split_message_text(cleaned_text)

         if not chunk_texts:
             self.stdout.write(
                 f"No chunks created for message {message.slack_message_id} - text too short"
             )
             return []

         try:
-            time_since_last_request = datetime.now(UTC) - getattr(
-                self, "last_request_time", datetime.now(UTC) - timedelta(seconds=2)
-            )
-
-            if time_since_last_request < timedelta(seconds=1.2):
-                time.sleep(1.2 - time_since_last_request.total_seconds())
+            # Rate limiting: ensure at least 1.2 seconds between requests
+            if self.last_request_time:
+                time_since_last_request = datetime.now(UTC) - self.last_request_time
+                if time_since_last_request < timedelta(seconds=1.2):
+                    time.sleep(1.2 - time_since_last_request.total_seconds())

             response = self.openai_client.embeddings.create(
                 model="text-embedding-3-small", input=chunk_texts
             )
             self.last_request_time = datetime.now(UTC)

111-112: Fix variable reference error in text cleaning.

Line 112 references message_text instead of cleaned_text, which will undo the emoji processing from line 111.

         cleaned_text = emoji.demojize(message_text, delimiters=("", ""))
-        cleaned_text = re.sub(r"<@U[A-Z0-9]+>", "", message_text)
+        cleaned_text = re.sub(r"<@U[A-Z0-9]+>", "", cleaned_text)
🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)

37-39: Consider using cursor-based pagination to avoid skipping messages.

Offset-based pagination can skip messages if records are deleted during processing. Consider using ID-based pagination for more reliable processing.

-        for offset in range(0, total_messages, batch_size):
-            batch_messages = Message.objects.all()[offset : offset + batch_size]
+        last_id = 0
+        while True:
+            batch_messages = Message.objects.filter(id__gt=last_id).order_by('id')[:batch_size]
+            if not batch_messages:
+                break
             batch_chunks = []

             for message in batch_messages:
                 cleaned_text = self.clean_message_text(message.raw_data.get("text", ""))
                 chunks = self.create_chunks_from_message(message, cleaned_text)
                 batch_chunks.extend(chunks)

             if batch_chunks:
                 Chunk.bulk_save(batch_chunks)

             processed_count += len(batch_messages)
+            last_id = batch_messages[len(batch_messages)-1].id
+            self.stdout.write(f"Processed {processed_count}/{total_messages} messages...")
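The keyset approach suggested above can be sanity-checked without a database. In this sketch, fetch_batch stands in for the filtered QuerySet over id-ordered rows; because the cursor is the last seen id rather than an offset, deletions mid-scan cannot cause rows to be skipped or revisited:

```python
def fetch_batch(rows: list[dict], last_id: int, batch_size: int) -> list[dict]:
    """Stand-in for Message.objects.filter(id__gt=last_id).order_by('id')[:batch_size]."""
    return [r for r in rows if r["id"] > last_id][:batch_size]


def iterate_all(rows: list[dict], batch_size: int = 2) -> list[int]:
    """Walk every row exactly once using keyset pagination."""
    seen, last_id = [], 0
    while True:
        batch = fetch_batch(rows, last_id, batch_size)
        if not batch:
            break
        seen.extend(r["id"] for r in batch)
        last_id = batch[-1]["id"]
    return seen


rows = [{"id": i} for i in (1, 3, 4, 7, 9)]
print(iterate_all(rows))  # → [1, 3, 4, 7, 9]
```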
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cbb255f and 9733fee.

📒 Files selected for processing (1)
  • backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend unit tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (2)
backend/apps/ai/management/commands/slack_create_chunks.py (2)

1-15: Imports are well-organized and necessary.

The imports follow PEP 8 conventions with proper ordering (standard library, third-party, local).


96-105: Text splitting configuration is appropriate.

The RecursiveCharacterTextSplitter configuration with chunk size 300 and overlap 40 provides good balance for embedding generation.
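What "chunk size 300, overlap 40" means in practice: consecutive chunks share a 40-character window, so text cut at a boundary still appears whole in one of the two chunks. A character-window sketch (the real RecursiveCharacterTextSplitter prefers paragraph and sentence boundaries before falling back to characters):

```python
def window_chunks(text: str, size: int = 300, overlap: int = 40) -> list[str]:
    """Fixed-size sliding window; each chunk starts size - overlap after the last."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = window_chunks("x" * 700)
print([len(c) for c in chunks])  # → [300, 300, 180]
```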

Collaborator

@arkid15r arkid15r left a comment

A few questions/suggestions I have regarding this:

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/apps/ai/management/commands/slack_create_chunks.py (1)

40-54: Fix whitespace formatting issue.

There's a formatting issue with the slice syntax that violates PEP 8.

-            batch_messages = Message.objects.all()[offset : offset + batch_size]
+            batch_messages = Message.objects.all()[offset:offset + batch_size]
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9b1be9 and 6798ad6.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • backend/Makefile (1 hunks)
  • backend/apps/ai/management/commands/slack_create_chunks.py (1 hunks)
  • backend/apps/ai/migrations/0001_initial.py (1 hunks)
  • backend/apps/ai/models/chunk.py (1 hunks)
  • backend/apps/slack/models/message.py (1 hunks)
  • backend/settings/base.py (1 hunks)
  • backend/tests/apps/ai/models/chunk_test.py (1 hunks)
  • cspell/custom-dict.txt (1 hunks)
  • docker-compose/production.yaml (1 hunks)
  • docker-compose/staging.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • backend/apps/slack/models/message.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • cspell/custom-dict.txt
  • docker-compose/staging.yaml
  • backend/Makefile
  • backend/settings/base.py
  • docker-compose/production.yaml
🧰 Additional context used
🧬 Code Graph Analysis (1)
backend/apps/ai/models/chunk.py (3)
backend/apps/common/models.py (2)
  • BulkSaveModel (8-30)
  • TimestampedModel (33-40)
backend/apps/common/utils.py (1)
  • truncate (164-176)
backend/apps/slack/models/message.py (4)
  • Message (13-125)
  • Meta (16-19)
  • bulk_save (86-88)
  • update_data (91-125)
🪛 Flake8 (7.2.0)
backend/apps/ai/management/commands/slack_create_chunks.py

[error] 41-41: whitespace before ':'

(E203)

🪛 Pylint (3.3.7)
backend/apps/ai/migrations/0001_initial.py

[refactor] 9-9: Too few public methods (0/2)

(R0903)

backend/apps/ai/models/chunk.py

[refactor] 14-14: Too few public methods (0/2)

(R0903)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Run backend tests
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run frontend unit tests
🔇 Additional comments (16)
backend/apps/ai/models/chunk.py (4)

11-21: LGTM! Well-structured model with appropriate constraints.

The Chunk model is well-designed with proper inheritance from TimestampedModel, appropriate field types including the VectorField for embeddings, and a sensible unique constraint on message and chunk_text combination. The table name and verbose name are clear and consistent.


23-26: LGTM! Clear and informative string representation.

The __str__ method provides a helpful representation with chunk ID, associated message ID, and truncated text preview using the utility function.


28-33: LGTM! Efficient bulk save implementation with proper filtering.

The bulk_save method correctly filters out None values and delegates to the base class implementation. The conditional check ensures the bulk save operation only runs when there are valid chunks to process.


35-63: LGTM! Proper duplicate prevention logic.

The update_data method correctly checks for existing chunks using the unique constraint fields and returns None for duplicates, preventing unnecessary database operations. The method signature and documentation are clear.

backend/apps/ai/migrations/0001_initial.py (2)

9-14: LGTM! Proper migration setup with correct dependencies.

The migration correctly depends on the slack app migration and is marked as initial for the ai app.


16-51: LGTM! Proper pgvector extension and model creation.

The migration correctly enables the VectorExtension before creating the Chunk model, ensuring vector field support is available. The model fields and constraints match the model definition perfectly.

backend/apps/ai/management/commands/slack_create_chunks.py (5)

16-17: LGTM! Good use of constants for configuration.

The constants for minimum request interval and default offset improve maintainability and make the rate limiting behavior explicit.


23-35: LGTM! Proper environment variable validation and setup.

The command correctly validates the OpenAI API key environment variable and initializes the client. The message count display helps users understand the scope of work.


69-82: LGTM! Effective rate limiting implementation.

The rate limiting logic correctly tracks the last request time and enforces the minimum interval between API calls. The use of getattr with a default value handles the first request properly.
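The `getattr`-with-default pattern praised here can be sketched as follows (the interval value and class name are placeholders, not the command's actual constants):

```python
import time

MIN_REQUEST_INTERVAL_SECONDS = 0.05  # assumed value for illustration

class EmbeddingClient:
    """Sketch of the rate-limiting pattern: getattr with a default of 0.0
    handles the very first request, when no timestamp attribute exists yet."""

    def request(self):
        last = getattr(self, "last_request_time", 0.0)
        wait = MIN_REQUEST_INTERVAL_SECONDS - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)  # enforce the minimum interval between calls
        self.last_request_time = time.monotonic()
```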


94-98: LGTM! Proper error handling for OpenAI API.

The exception handling correctly catches OpenAI errors and logs them without stopping the entire process, allowing other messages to be processed.
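The catch-log-continue behavior can be sketched like this; `EmbeddingError` stands in for `openai.OpenAIError`, and `embed` is a placeholder for the API call:

```python
import logging

logger = logging.getLogger(__name__)

class EmbeddingError(Exception):
    """Stands in for openai.OpenAIError in this sketch."""

def embed(text):
    """Placeholder for the embeddings API call; fails on marked inputs."""
    if "boom" in text:
        raise EmbeddingError(f"failed on {text!r}")
    return [0.0]

def process_messages(messages):
    """Catch per-message failures and keep going, as the command does."""
    results = []
    for msg in messages:
        try:
            results.append((msg, embed(msg)))
        except EmbeddingError:
            logger.exception("Skipping message after embedding failure")
    return results
```

One failing message is logged and skipped; the remaining messages are still processed.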


110-121: LGTM! Comprehensive text cleaning implementation.

The text cleaning method properly handles emojis, Slack user mentions, URLs, and emoji codes while normalizing whitespace. The use of the emoji library provides reliable emoji handling.
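A regex-only sketch of those cleaning steps (the real method additionally uses the `emoji` library to demojize unicode emoji before these substitutions):

```python
import re

def clean_slack_text(text):
    """Strip Slack markup from a message before chunking."""
    text = re.sub(r"<@U[A-Z0-9]+>", "", text)     # Slack user mentions
    text = re.sub(r"<https?://[^>]+>", "", text)  # angle-bracketed URLs
    text = re.sub(r":\w+:", "", text)             # emoji codes like :tada:
    text = re.sub(r"\s+", " ", text)              # normalize whitespace
    return text.strip()
```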

backend/tests/apps/ai/models/chunk_test.py (5)

9-13: LGTM! Helpful utility function for creating model mocks.

The create_model_mock function provides a clean way to create properly configured mock objects with the necessary attributes for testing.
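A helper like that can be sketched with `unittest.mock.Mock` and a `spec` (the field names and `FakeMessage` class here are assumptions for illustration, not the test suite's actual code):

```python
from unittest.mock import Mock

def create_model_mock(model_class, **attrs):
    """Build a Mock constrained to the model's attributes via spec,
    with an id preset and any extra attributes applied."""
    instance = Mock(spec=model_class)
    instance.id = attrs.get("id", 1)
    for name, value in attrs.items():
        setattr(instance, name, value)
    return instance

class FakeMessage:
    """Hypothetical stand-in for the Django Message model."""
    id = None
    slack_message_id = None

msg = create_model_mock(FakeMessage, id=7, slack_message_id="M-1")
```

Using `spec` means accessing an attribute the model doesn't define raises `AttributeError`, which catches typos in tests.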


17-30: LGTM! Comprehensive test of string representation.

The test properly verifies both the structure and content of the `__str__` method output, including truncation behavior.


32-68: LGTM! Thorough testing of bulk_save method.

The test suite comprehensively covers different scenarios for bulk_save including valid chunks, None filtering, empty lists, and custom fields parameter.


70-135: LGTM! Complete coverage of update_data method.

The tests properly verify both the creation of new chunks and the duplicate prevention logic, including the save parameter behavior.


137-149: LGTM! Proper validation of model metadata and relationships.

The tests verify the Meta class attributes and foreign key relationship configuration, ensuring the model is properly configured.

@Dishant1804 Dishant1804 requested a review from arkid15r June 25, 2025 19:43
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (6)
backend/apps/ai/admin.py (1)

8-18: Consider enhancing the admin interface with additional functionality.

The current configuration is solid. Consider adding these enhancements for better usability:

 class ChunkAdmin(admin.ModelAdmin):
+    list_filter = ("message__conversation",)
+    readonly_fields = ("id", "created_at", "updated_at")
     list_display = (
         "id",
         "message",
         "text",
     )
     search_fields = (
         "message__slack_message_id",
         "text",
     )
+    raw_id_fields = ("message",)
backend/apps/slack/models/message.py (1)

51-64: Robust text cleaning implementation with room for enhancement.

The text cleaning logic is comprehensive and appropriate for preparing Slack messages for embedding. Consider these potential enhancements:

 @property
 def cleaned_text(self) -> str:
     """Get cleaned text from the message."""
     if not self.text:
         return ""

     text = emoji.demojize(self.text)  # Remove emojis.
     text = re.sub(r"<@U[A-Z0-9]+>", "", text)  # Remove user mentions.
+    text = re.sub(r"<#C[A-Z0-9]+\|[^>]+>", "", text)  # Remove channel mentions.
     text = re.sub(r"<https?://[^>]+>", "", text)  # Remove links.
     text = re.sub(r":\w+:", "", text)  # Remove emoji aliases.
     text = re.sub(r"\s+", " ", text)  # Normalize whitespace.

     return text.strip()
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (2)

72-87: Consider breaking down the complex chunk creation logic.

The list comprehension with embedded conditional logic and zip operations is hard to follow. Consider using a more explicit approach for better maintainability.

-            return [
-                chunk
-                for text, embedding in zip(
-                    chunk_text,
-                    [d.embedding for d in response.data],  # Embedding data from OpenAI response.
-                    strict=True,
-                )
-                if (
-                    chunk := Chunk.update_data(
-                        embedding=embedding,
-                        message=message,
-                        save=False,
-                        text=text,
-                    )
-                )
-            ]
+            chunks = []
+            embeddings = [d.embedding for d in response.data]
+            
+            for text, embedding in zip(chunk_text, embeddings, strict=True):
+                chunk = Chunk.update_data(
+                    embedding=embedding,
+                    message=message,
+                    save=False,
+                    text=text,
+                )
+                if chunk:
+                    chunks.append(chunk)
+            
+            return chunks


29-30: Consider performance implications for large message counts.

Using Message.objects.count() can be expensive for large tables. If the exact count isn't critical for progress reporting, consider using approximate methods or removing this step.

-        total_messages = Message.objects.count()
-        self.stdout.write(f"Found {total_messages} messages to process")
+        self.stdout.write("Starting to process messages in batches...")
backend/apps/ai/models/chunk.py (2)

39-44: Consider making text splitting parameters configurable.

The hardcoded values for chunk size, overlap, and separators might not be optimal for all use cases. Consider making these configurable through settings or method parameters.

     @staticmethod
-    def split_text(text: str) -> list[str]:
+    def split_text(text: str, chunk_size: int = 300, chunk_overlap: int = 40) -> list[str]:
         """Split text into chunks."""
         return RecursiveCharacterTextSplitter(
-            chunk_size=300,
-            chunk_overlap=40,
+            chunk_size=chunk_size,
+            chunk_overlap=chunk_overlap,
             length_function=len,
             separators=["\n\n", "\n", " ", ""],
         ).split_text(text)
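The chunk-size/overlap behavior can be approximated with a stdlib-only sliding window (a simplified sketch; LangChain's `RecursiveCharacterTextSplitter` additionally prefers breaking on `"\n\n"`, `"\n"`, and spaces before falling back to hard cuts):

```python
def split_text(text, chunk_size=300, chunk_overlap=40):
    """Split text into fixed-size chunks where consecutive chunks
    share chunk_overlap characters."""
    if len(text) <= chunk_size:
        return [text]
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```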

20-20: Document the embedding dimensions choice.

The 1536 dimensions corresponds to OpenAI's text-embedding-3-small model. Consider adding a comment to clarify this dependency for future maintainers.

-    embedding = VectorField(verbose_name="Embedding", dimensions=1536)
+    embedding = VectorField(verbose_name="Embedding", dimensions=1536)  # OpenAI text-embedding-3-small
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6798ad6 and d93208c.

📒 Files selected for processing (8)
  • backend/Makefile (1 hunks)
  • backend/apps/ai/admin.py (1 hunks)
  • backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (1 hunks)
  • backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1 hunks)
  • backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1 hunks)
  • backend/apps/ai/models/chunk.py (1 hunks)
  • backend/apps/slack/models/message.py (3 hunks)
  • backend/tests/apps/ai/models/chunk_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/apps/ai/models/chunk_test.py
🧰 Additional context used
🧬 Code Graph Analysis (4)
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py (2)
backend/apps/ai/models/chunk.py (4)
  • Chunk (12-74)
  • bulk_save (32-34)
  • split_text (37-44)
  • update_data (47-74)
backend/apps/slack/models/message.py (6)
  • Message (15-146)
  • bulk_save (107-109)
  • subtype (78-80)
  • cleaned_text (52-63)
  • text (83-85)
  • update_data (112-146)
backend/apps/ai/admin.py (1)
backend/apps/ai/models/chunk.py (1)
  • Chunk (12-74)
backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1)
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1)
  • Migration (7-27)
backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1)
backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1)
  • Migration (6-22)
🪛 Flake8 (7.2.0)
backend/apps/ai/management/commands/ai_create_slack_message_chunks.py

[error] 37-37: whitespace before ':'

(E203)

🪛 checkmake (0.2.2)
backend/Makefile

[warning] 1-1: Missing required phony target "all"

(minphony)


[warning] 1-1: Missing required phony target "clean"

(minphony)


[warning] 1-1: Missing required phony target "test"

(minphony)

🪛 Pylint (3.3.7)
backend/apps/ai/admin.py

[refactor] 8-8: Too few public methods (0/2)

(R0903)

backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py

[refactor] 6-6: Too few public methods (0/2)

(R0903)

backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py

[refactor] 7-7: Too few public methods (0/2)

(R0903)

backend/apps/ai/models/chunk.py

[refactor] 15-15: Too few public methods (0/2)

(R0903)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run frontend unit tests
  • GitHub Check: Run backend tests
🔇 Additional comments (6)
backend/Makefile (1)

1-4: LGTM! Clean integration with existing Makefile structure.

The new target follows the established pattern and correctly uses the existing exec-backend-command infrastructure for running Django management commands.

backend/apps/ai/migrations/0002_rename_chunk_text_chunk_text_and_more.py (1)

12-22: Migration looks correct and well-structured.

The field rename and constraint update are properly implemented. The dependencies correctly reference both the previous AI migration and the related Slack migration.

backend/apps/ai/migrations/0003_alter_chunk_options_alter_chunk_embedding_and_more.py (1)

17-21: Verify the embedding dimensions align with your model choice.

The 1536 dimensions are correct for OpenAI's text-embedding-ada-002 model. Ensure this matches the embedding model used in your implementation.

#!/bin/bash
# Description: Verify the embedding model dimensions used in the codebase
# Expected: Find references to embedding models and their dimensions

rg -A 3 -B 3 "text-embedding|openai.*embed|embedding.*model"
backend/apps/slack/models/message.py (2)

3-3: LGTM! Appropriate imports for text processing functionality.

The re and emoji imports support the new text cleaning functionality.

Also applies to: 6-6


77-86: Clean and consistent property accessors.

The subtype and text properties provide clean access to raw_data fields with appropriate default values.

backend/apps/ai/models/chunk.py (1)

15-18: LGTM on the model Meta configuration.

The database table name, verbose name, and unique constraint are well-defined and follow Django conventions appropriately.

Collaborator

@arkid15r arkid15r left a comment


@Dishant1804 please check whether code works and my comment below.

@Dishant1804
Collaborator Author

@Dishant1804 please check whether code works and my comment below.

Yes, the code works properly; we can proceed with it.

@sonarqubecloud

@arkid15r arkid15r enabled auto-merge June 27, 2025 16:18
@arkid15r arkid15r added this pull request to the merge queue Jun 27, 2025
Merged via the queue into OWASP:main with commit a7510d1 Jun 27, 2025
23 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Jul 1, 2025
1 task
@coderabbitai coderabbitai bot mentioned this pull request Jul 28, 2025
7 tasks
@coderabbitai coderabbitai bot mentioned this pull request Sep 1, 2025
5 tasks
@coderabbitai coderabbitai bot mentioned this pull request Oct 2, 2025
2 tasks
@coderabbitai coderabbitai bot mentioned this pull request Oct 16, 2025
3 tasks

Development

Successfully merging this pull request may close these issues.

Create chunk model and embeddings

2 participants