A Python toolkit for managing conversational threads using TF-IDF and semantic similarity. It groups chat messages into threads and retrieves the most relevant threads for a new user prompt.
Warning: Both TF-IDF and semantic implementations are experimental and may not perfectly separate or rank threads in all cases.
I have mostly moved to the semantic manager and given up on the TF-IDF one.
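To make the grouping idea concrete, here is a minimal, standard-library-only sketch of TF-IDF thread grouping. It is not the code in this package: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `group_messages`), the greedy join-the-best-match strategy, and the `threshold` default are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_messages(messages, threshold=0.2):
    """Greedy grouping: join the best-matching thread or start a new one.

    `threshold` is a hypothetical tuning knob, not a value from this repo.
    """
    vectors = tfidf_vectors(messages)
    threads = []  # each thread is a list of message indices
    for i, vec in enumerate(vectors):
        best, best_score = None, 0.0
        for thread in threads:
            # Score a thread by its most similar existing message.
            score = max(cosine(vec, vectors[j]) for j in thread)
            if score > best_score:
                best, best_score = thread, score
        if best is not None and best_score >= threshold:
            best.append(i)
        else:
            threads.append([i])
    return threads


messages = [
    "How do I install numpy with pip?",
    "What is the capital of France?",
    "pip install numpy fails with a compiler error",
    "Is Paris the capital of France?",
]
threads = group_messages(messages)
print(threads)  # [[0, 2], [1, 3]]
```

In a scheme like this, a lower `threshold` tends to merge unrelated messages and a higher one tends to split related ones, which is the same trade-off the warning above describes.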
Sample data lives in src/data/context.json. It was extracted from the public OASST1 (OpenAssistant) v1.0 dataset, a large archive of user–assistant chat transcripts. A custom extraction script (not included) filtered and formatted the portion you see here.
- Python 3.7–3.11 (breaks on 3.12+)
- POSIX shell (bash/zsh) or Windows PowerShell
- Virtual environment tool (venv, conda)
# Clone repo (if not already)
git clone https://github.com/acoliver/pythreadmind.git
cd pythreadmind
# Create & activate venv
python3.11 -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
# Install deps & package
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .
# Download NLTK data (stopwords)
python -c "import nltk; nltk.download('stopwords')"
Note: requirements.txt includes en-core-web-sm from spaCy; it will auto-install the model.
python src/threadmind/tfidf_manager.py --test
python
>>> from threadmind.semantic_manager import SemanticThreadManager
>>> from datetime import datetime
>>> mgr = SemanticThreadManager()
>>> threads = mgr.threads_for_prompt('user', 'Your query here', datetime.now())
>>> print(threads)
Integration & unit tests live in test/test_thread_manager.py. To run:
pytest test/test_thread_manager.py
This test will:
- Feed the sample context into both TF-IDF and semantic managers
- Verify thread grouping & message retrieval
- Export threads_debug.csv
- Print analysis of the longest threads
Expect some warnings or failures; development is ongoing.
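If you want to inspect the exported CSV yourself, a sketch along these lines may help. The column layout of threads_debug.csv is not documented here, so the `thread_id` column name and the helper `longest_threads` are assumptions; adjust `id_column` to whatever header the file actually contains.

```python
import csv
import io
from collections import Counter


def longest_threads(csv_file, id_column="thread_id", top_n=3):
    """Count rows per thread and return the top_n (thread id, row count) pairs.

    `id_column` is a guess at the header name; check the real file first.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row[id_column] for row in reader)
    return counts.most_common(top_n)


# Stand-in data for illustration; the real export may use different columns.
sample = io.StringIO(
    "thread_id,role,text\n"
    "t1,user,hello\n"
    "t1,assistant,hi\n"
    "t2,user,question\n"
    "t1,user,thanks\n"
)
print(longest_threads(sample))  # [('t1', 3), ('t2', 1)]
```

To run it against the real export, pass an open file instead of the sample, for example `open("threads_debug.csv", newline="")` inside a `with` block.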
- Semantic manager may over-group or under-split threads
- TF-IDF manager’s topic drift penalty can be too aggressive
- Extraction script for context.json is not packaged here
If anything is unclear (dataset provenance, Python version, missing scripts), please open an issue or reach out. Happy threading!