acoliver/pythreadmind

PyThreadMind

A Python toolkit for managing conversational threads using TF-IDF and semantic similarity. It groups chat messages into threads and retrieves the most relevant threads for a new user prompt.
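To make the retrieval idea concrete, here is a minimal, standard-library-only sketch of TF-IDF ranking with cosine similarity. It is not the code in this repository: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_threads`) and the dict-of-lists thread layout are assumptions made for illustration, and the smoothed IDF formula is just one common choice.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_threads(threads, prompt):
    """Rank threads by similarity to the prompt.

    `threads` is a hypothetical layout: thread id -> list of message strings.
    Returns (thread_id, score) pairs, highest score first.
    """
    ids = list(threads)
    # Treat each thread as one document; the prompt is the last document.
    docs = [" ".join(threads[i]) for i in ids] + [prompt]
    vecs = tfidf_vectors(docs)
    prompt_vec = vecs[-1]
    scores = [(i, cosine(v, prompt_vec)) for i, v in zip(ids, vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


threads = {
    "cooking": ["How do I bake bread?", "Use flour, yeast and water."],
    "python": ["How do I sort a list in Python?", "Use the sorted function."],
}
for thread_id, score in rank_threads(threads, "What yeast should I use for bread?"):
    print(thread_id, round(score, 3))
```

With these two toy threads, the bread question ranks the "cooking" thread first because it shares the rarer terms "yeast" and "bread". A semantic approach would replace the TF-IDF vectors with sentence embeddings and keep the same cosine-ranking step.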

Warning: Both TF-IDF and semantic implementations are experimental and may not perfectly separate or rank threads in all cases.

I have mostly moved to the semantic manager and stopped developing the TF-IDF one.

Dataset

Sample data lives in src/data/context.json. It was extracted from the public OASST1 (OpenAssistant) v1.0 dataset, a large archive of user–assistant chat transcripts. A custom extraction script (not included) filtered and formatted the portion you see here.

Requirements

  • Python 3.7–3.11 (breaks on 3.12+)
  • POSIX shell (bash/zsh) or Windows PowerShell
  • Virtual environment tool (venv, conda)
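If you are unsure which interpreter a virtual environment picked up, a quick check like the one below can help. It is only a convenience sketch: `is_supported` is a hypothetical helper, not part of this package, and the bounds are copied from the list above.

```python
import sys

# Supported range taken from the Requirements list: 3.7 through 3.11.
MIN_SUPPORTED = (3, 7)
MAX_SUPPORTED = (3, 11)


def is_supported(version=None):
    """Return True if the (major, minor) version is within the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if not is_supported():
    print(
        "Warning: Python %d.%d is outside the supported 3.7-3.11 range."
        % tuple(sys.version_info[:2])
    )
```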

Installation

# Clone repo (if not already)
git clone https://github.com/acoliver/pythreadmind.git
cd pythreadmind

# Create & activate venv
python3.11 -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install deps & package
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

# Download NLTK data (stopwords)
python -c "import nltk; nltk.download('stopwords')"

Note: requirements.txt lists spaCy's en-core-web-sm model, so `pip install -r requirements.txt` installs it automatically; no separate model download is needed.

Usage

TF-IDF Thread Manager Demo (not recommended)

python src/threadmind/tfidf_manager.py --test

Semantic Thread Manager in REPL

python
>>> from threadmind.semantic_manager import SemanticThreadManager
>>> from datetime import datetime
>>> mgr = SemanticThreadManager()
>>> threads = mgr.threads_for_prompt('user', 'Your query here', datetime.now())
>>> print(threads)

Running Tests (runs the semantic manager)

Integration & unit tests live in test/test_thread_manager.py. To run:

pytest test/test_thread_manager.py

This test will:

  • Feed the sample context into both TF-IDF and semantic managers
  • Verify thread grouping & message retrieval
  • Export threads_debug.csv
  • Print analysis of the longest threads
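To inspect the exported file yourself, a small script along these lines can list the longest threads. This is a sketch under assumptions: the column name `thread_id` and the sample rows are hypothetical, so check the header of your own threads_debug.csv and adjust `id_column` accordingly.

```python
import csv
import io
from collections import Counter


def longest_threads(csv_file, id_column="thread_id", top_n=3):
    """Count rows per thread and return the top_n (thread_id, count) pairs."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[id_column] for row in reader)
    return counts.most_common(top_n)


# Illustrative data only; the real columns in threads_debug.csv may differ.
sample = io.StringIO(
    "thread_id,role,text\n"
    "t1,user,hello\n"
    "t1,assistant,hi there\n"
    "t2,user,how do I sort a list?\n"
    "t1,user,thanks\n"
    "t2,assistant,use sorted()\n"
    "t3,user,unrelated question\n"
)
print(longest_threads(sample, top_n=2))
```

For the real export, open the file with `open("threads_debug.csv", newline="")` and pass the file object in place of `sample`.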

Expect some warnings or failures; development is ongoing.

Limitations & Future Work

  • Semantic manager may over-group or under-split threads
  • TF-IDF manager’s topic drift penalty can be too aggressive
  • Extraction script for context.json is not packaged here

Questions / Issues

If anything is unclear (dataset provenance, Python version, missing scripts), please open an issue or reach out. Happy threading!
