This repository contains tools to archive Slack workspaces, clean and chunk messages, generate synthetic search queries against those chunks using LLMs, and convert validated query-chunk pairs to a format compatible with ZeroEntropy's evals repo.
The result is Slack-QA: a dataset of LLM-generated, human-like queries paired with their corresponding answer chunks, drawn from scraped public Slack workspaces.
Core features
- Archive Slack workspaces given an auth token for that workspace (both user-browser and workspace-installed app tokens work)
- Clean and load archived Slack data
- Multiple chunking strategies (individual messages, threads, sliding windows)
- Generate synthetic queries using LLMs (OpenAI / Anthropic via the `ai.py` wrapper)
- Validate and export query-chunk pairs to BEIR-compatible JSONL
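To make the chunking strategies concrete, here is an illustrative sliding-window chunker; the real chunking logic lives in the query-generation pipeline, and the function and parameter names below are hypothetical:

```python
# Illustrative sliding-window chunker over cleaned Slack messages.
# Names and defaults are hypothetical, not the repository's actual code.
def sliding_window_chunks(messages, window_size=10, stride=5):
    """Group consecutive messages into overlapping conversation chunks."""
    chunks = []
    for start in range(0, len(messages), stride):
        window = messages[start:start + window_size]
        # Render each message as "user: text" so a chunk reads like a dialogue.
        chunks.append("\n".join(f"{m['user']}: {m['text']}" for m in window))
        if start + window_size >= len(messages):
            break  # the last window already covers the tail of the channel
    return chunks
```

Thread-based and per-message chunking follow the same shape: a thread chunk joins all replies under one parent, and a per-message chunk is just a window of size one.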
Repository layout (important files)
- `slack_archiver_browser.py` — archiver that uses a browser session (`xoxc` + cookies)
- `slack_archiver.py` — archiver for app tokens (`xoxp`) and related utilities
- `slack_archiver_rich.py` — richer archiving utilities and helpers
- `slack_json_cleaner.py` — cleaning utilities to normalize Slack JSON
- `generate_query.py` — main synthetic query generator and pipeline (LLM-based)
- `generate_query_sequential_version.py` — alternate sequential implementation
- `ai.py` — internal AI wrapper used by the query generator (embeddings, calls)
- `slack_search.py` — small wrapper around the Slack search API (browser or app mode)
- `validated_query_chunk_pairs_to_proper_json_format.py` — converts validated pairs to BEIR format
- `requirements.txt` — Python dependencies used by the project
- `test_boolean_slack_api.py`, `test_search_api_wrapper.py` — small tests / examples
Quick start
- Create a Python 3.11+ venv and install dependencies:
```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
- Archive a workspace (browser mode example)
- You need an `xoxc-...` token and the browser cookie string for the Slack workspace. Both can be obtained from the network tab of any browser while logged in to the Slack workspace in question.
```
# edit the top of slack_archiver_browser.py or call the helper function
python slack_archiver_browser.py
```
- Clean and inspect archived data
- After archiving, files are written under `slack_archive/<WORKSPACE_NAME>/`.
- Use `slack_json_cleaner.py` to normalize the JSON into the `channels-clean/` subfolders.
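As a rough sketch of the kind of normalization the cleaning step performs (the exact rules and field choices in `slack_json_cleaner.py` may differ, this is an assumption):

```python
# Hypothetical normalization pass: drop non-message events and bot/join
# notices, and keep only the fields the downstream pipeline needs.
def clean_messages(raw_messages):
    cleaned = []
    for m in raw_messages:
        # Plain user messages have type "message" and no "subtype";
        # joins, bot posts, etc. carry a subtype and are skipped here.
        if m.get("type") != "message" or m.get("subtype"):
            continue
        cleaned.append({
            "user": m.get("user", ""),
            "ts": m.get("ts", ""),
            "text": m.get("text", ""),
        })
    return cleaned
```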
- Generate synthetic queries (LLM credentials required)
- `generate_query.py` uses the `ai.py` wrapper. Provide your preferred provider and API key via env vars or pass them to the pipeline.
- Example (conceptual):
```
export ANTHROPIC_API_KEY="..."
python generate_query.py
```
Notes about AI and auth
- `ai.py` centralizes calls to OpenAI/Anthropic and embedding models. Review it before running generation.
- `generate_query.py` supports `provider` values like `openai` and `anthropic`. You must supply the corresponding API key.
- Browser-mode Slack scripts require a valid `xoxc` token and full cookie string. App mode uses `xoxp` and requires `search:read` and other scopes as appropriate.
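For intuition, browser-mode authentication typically sends the `xoxc` token in the request body and the browser cookie string in a `Cookie` header. The snippet below only builds such a request (it does not send it); the archiver scripts handle this internally, and all values here are placeholders:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Illustrative only: browser-mode ("xoxc") Slack API auth puts the token in
# the form body and the full browser cookie string (including the "d"
# cookie) in the Cookie header. Values are placeholders.
def build_search_request(token, cookie, query):
    body = urlencode({"token": token, "query": query}).encode()
    return Request(
        "https://slack.com/api/search.messages",
        data=body,
        headers={"Cookie": cookie},
    )

req = build_search_request("xoxc-...", "d=xoxd-...", "deployment error")
```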
evals format export
- Use `validated_query_chunk_pairs_to_proper_json_format.py` to convert `validated_query_chunk_pairs.jsonl` into `documents.jsonl`, `queries.jsonl`, and `qrels.jsonl` for retrieval evaluation.
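Conceptually, the export maps each validated (query, chunk) pair to one record per output file. The sketch below assumes common BEIR field conventions (`_id`, `query-id`, `corpus-id`, `score`); the script's actual field names and ID scheme may differ:

```python
import json

# Hypothetical BEIR-style export: each validated pair yields one line in
# queries.jsonl, one in documents.jsonl, and one relevance judgment in
# qrels.jsonl. Field names are assumptions based on BEIR conventions.
def to_beir(pairs):
    documents, queries, qrels = [], [], []
    for i, pair in enumerate(pairs):
        qid, did = f"q{i}", f"d{i}"
        queries.append({"_id": qid, "text": pair["query"]})
        documents.append({"_id": did, "text": pair["chunk"]})
        qrels.append({"query-id": qid, "corpus-id": did, "score": 1})
    return documents, queries, qrels

def write_jsonl(path, records):
    # One JSON object per line, as retrieval eval harnesses expect.
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```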
Next steps
- Add a small wrapper script to orchestrate the archive → clean → chunk → generate → validate pipeline.
- Expand the dataset with more Slack workspaces
Contact
- Open a PR here or contact the ZeroEntropy team on Slack.