PyThreadMind

A Python toolkit for managing conversational threads using TF-IDF and semantic similarity. It groups chat messages into threads and retrieves the most relevant threads for a new user prompt.

Warning: Both TF-IDF and semantic implementations are experimental and may not perfectly separate or rank threads in all cases.

I have mostly moved to the semantic manager and given up on the TF-IDF one.
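To make the similarity idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting with cosine similarity. It is not the code in tfidf_manager.py: the function names (tokenize, tfidf_vectors, cosine, most_similar) and the smoothed IDF formula are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stopword removal in this sketch)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(prompt, messages):
    """Return (index, score) of the message closest to the prompt.

    Assumes `messages` is non-empty.
    """
    vectors = tfidf_vectors(messages + [prompt])
    prompt_vec = vectors[-1]
    scores = [cosine(prompt_vec, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

A semantic approach would follow the same pattern but compare embedding vectors instead of term weights, so paraphrases with no shared words can still match.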

Dataset

Sample data lives in src/data/context.json. It was extracted from the public OASST1 (OpenAssistant) v1.0 dataset, a large archive of user–assistant chat transcripts. A custom extraction script (not included) filtered and formatted the portion you see here.
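The schema of context.json is not documented here, so a quick way to get oriented is to inspect its top-level shape before relying on any field names. The helper below is a hypothetical sketch (describe_json is not part of the package) and assumes only that the file is valid JSON.

```python
import json
from pathlib import Path


def describe_json(path):
    """Summarise the top-level shape of a JSON file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["length"] = len(data)
    if isinstance(data, dict):
        # Show at most ten keys so large objects stay readable.
        summary["keys"] = sorted(data)[:10]
    elif isinstance(data, list) and data and isinstance(data[0], dict):
        summary["first_item_keys"] = sorted(data[0])
    return summary
```

From the repository root, describe_json("src/data/context.json") should show whether the sample is a list of records or a keyed object.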

Requirements

Python 3.7–3.11 (breaks on 3.12+)
POSIX shell (bash/zsh) or Windows PowerShell
Virtual environment tool (venv, conda)
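If you are unsure which interpreter a virtual environment picked up, a small check against the supported range above can help. This is a sketch only; is_supported is a hypothetical helper, not something the package ships.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True if the interpreter is in the 3.7-3.11 range listed above."""
    return (3, 7) <= tuple(version_info[:2]) <= (3, 11)
```

Calling is_supported() with no arguments checks the running interpreter.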

Installation

# Clone repo (if not already)
git clone https://github.com/acoliver/pythreadmind.git
cd pythreadmind

# Create & activate venv
python3.11 -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install deps & package
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

# Download NLTK data (stopwords)
python -c "import nltk; nltk.download('stopwords')"

Note: requirements.txt includes spaCy's en-core-web-sm model, so installing the requirements also installs the model.

Usage

TF-IDF Thread Manager Demo (not recommended)

python src/threadmind/tfidf_manager.py --test

Semantic Thread Manager in REPL

python
>>> from threadmind.semantic_manager import SemanticThreadManager
>>> from datetime import datetime
>>> mgr = SemanticThreadManager()
>>> threads = mgr.threads_for_prompt('user', 'Your query here', datetime.now())
>>> print(threads)

Running Tests (This runs the semantic one)

Integration & unit tests live in test/test_thread_manager.py. To run:

pytest test/test_thread_manager.py

This test will:

Feed the sample context into both TF-IDF and semantic managers
Verify thread grouping & message retrieval
Export threads_debug.csv
Print analysis of the longest threads

Expect some warnings or failures; development is ongoing.
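Since the column layout of threads_debug.csv is not described here, a schema-agnostic look at the export is a reasonable first step. The sketch below uses only the standard csv module; summarize_csv is a hypothetical helper and assumes the first row of the file is a header.

```python
import csv


def summarize_csv(path):
    """Return the header columns and the number of data rows in a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        return {"columns": reader.fieldnames or [], "rows": row_count}
```

After a test run, summarize_csv("threads_debug.csv") gives the column names to look for when digging into individual threads.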

Limitations & Future Work

Semantic manager may over-group or under-split threads
TF-IDF manager’s topic drift penalty can be too aggressive
Extraction script for context.json is not packaged here
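The over-grouping and under-splitting trade-off can be illustrated with a toy greedy grouper. This is not how SemanticThreadManager works internally: it is a sketch that uses Jaccard word overlap as a stand-in for semantic similarity, and the names jaccard, group_messages, and threshold are assumptions made for the example.

```python
def jaccard(a, b):
    """Word-overlap similarity between two messages, in the range [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def group_messages(messages, threshold):
    """Greedily attach each message to its most similar thread, or start a new one."""
    threads = []
    for msg in messages:
        best_thread, best_score = None, 0.0
        for thread in threads:
            # A thread's score is its best match against any member message.
            score = max(jaccard(msg, other) for other in thread)
            if score > best_score:
                best_thread, best_score = thread, score
        if best_thread is not None and best_score >= threshold:
            best_thread.append(msg)
        else:
            threads.append([msg])
    return threads
```

A low threshold lets weak overlaps pull unrelated messages into one thread (over-grouping), while a high threshold leaves related messages in separate threads (under-splitting in reverse: too many fragments). Any single fixed cutoff tends to be wrong for some conversations.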

Questions / Issues

If anything is unclear (dataset provenance, Python version, missing scripts), please open an issue or reach out. Happy threading!
