SlackQA is a dataset created by scraping publicly available community Slack workspaces and generating diverse synthetic queries whose answers are particular messages in the Slack history.

zeroentropy-ai/slack-qa


Slack-QA: An Annotated Query-Document Dataset on Public Slack Workspaces

This repository contains tools to archive Slack workspaces, clean and chunk messages, generate synthetic search queries against those chunks using LLMs, and convert validated query-chunk pairs to a format compatible with ZeroEntropy's evals repo.

The result is Slack-QA: a dataset of LLM-generated, human-like queries, each paired with the chunked document from a scraped public Slack workspace that answers it.

Core features

  • Archive Slack workspaces given an auth token for that workspace (both user-browser and workspace-installed app tokens work)
  • Clean and load archived Slack data
  • Multiple chunking strategies (individual messages, threads, sliding windows)
  • Generate synthetic queries using LLMs (OpenAI / Anthropic via ai.py wrapper)
  • Validate and export query-chunk pairs to BEIR-compatible JSONL
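To illustrate the chunking strategies, here is a minimal sketch of a sliding-window chunker over a channel's messages. The function name and default parameters are illustrative, not the repository's actual API; the real implementation lives alongside the generator scripts.

```python
def sliding_window_chunks(messages, window_size=5, stride=2):
    """Group consecutive Slack messages into overlapping chunks.

    messages: list of message texts, in channel order.
    window_size: number of messages per chunk.
    stride: offset between successive windows (overlap = window_size - stride).
    """
    chunks = []
    start = 0
    while start < len(messages):
        # Take up to window_size messages starting at `start`.
        chunk = messages[start:start + window_size]
        chunks.append("\n".join(chunk))
        # Stop once the window has reached the end of the channel.
        if start + window_size >= len(messages):
            break
        start += stride
    return chunks

# Six messages, windows of 3 with stride 2 -> three overlapping chunks.
msgs = [f"msg{i}" for i in range(6)]
print(sliding_window_chunks(msgs, window_size=3, stride=2))
```

Overlap between windows keeps thread context intact at chunk boundaries, at the cost of some duplicated text in the corpus.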

Repository layout (important files)

  • slack_archiver_browser.py — archiver that uses browser session (xoxc + cookies)
  • slack_archiver.py — archiver for app tokens (xoxp) and related utilities
  • slack_archiver_rich.py — richer archiving utilities and helpers
  • slack_json_cleaner.py — cleaning utilities to normalize Slack JSON
  • generate_query.py — main synthetic query generator and pipeline (LLM-based)
  • generate_query_sequential_version.py — alternate sequential implementation
  • ai.py — internal AI wrapper used by the query generator (embeddings, calls)
  • slack_search.py — small wrapper around Slack search API (browser or app mode)
  • validated_query_chunk_pairs_to_proper_json_format.py — converts validated pairs to BEIR format
  • requirements.txt — Python dependencies used by the project
  • test_boolean_slack_api.py, test_search_api_wrapper.py — small tests / examples

Quick start

  1. Create a Python 3.11+ venv and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  2. Archive a workspace (browser mode example)
  • You need an xoxc-... token and the full browser cookie string for the Slack workspace. Both can be copied from the network tab of any browser while signed in to the Slack community in question.
# edit the top of slack_archiver_browser.py or call the helper function
python slack_archiver_browser.py
  3. Clean and inspect archived data
  • After archiving, files are written under slack_archive/<WORKSPACE_NAME>/.
  • Use slack_json_cleaner.py to normalize the JSON into the channels-clean/ subfolders.
  4. Generate synthetic queries (LLM credentials required)
  • generate_query.py uses the ai.py wrapper. Provide your preferred provider and API key via env vars or pass them to the pipeline.
  • Example (conceptual):
export ANTHROPIC_API_KEY="..."
python generate_query.py
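Conceptually, the generator asks an LLM for a realistic search query whose answer is contained in a given chunk. A minimal sketch of that step follows; the prompt wording, helper name, and model name are illustrative, not copied from generate_query.py, and the guarded call assumes the Anthropic Python SDK.

```python
import os

def build_query_prompt(chunk: str) -> str:
    """Construct a prompt asking an LLM for a realistic search query
    whose answer is contained in the given Slack chunk."""
    return (
        "You are simulating a user searching a Slack workspace.\n"
        "Write one short, natural search query that this excerpt answers.\n"
        "Return only the query text.\n\n"
        f"Excerpt:\n{chunk}"
    )

prompt = build_query_prompt("alice: the staging DB password rotates every Friday")

# The actual call would look roughly like this (requires ANTHROPIC_API_KEY):
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # illustrative model name
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.content[0].text)
```

In the real pipeline, generated queries are then validated against their source chunks before export.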

Notes about AI and auth

  • ai.py centralizes calls to OpenAI/Anthropic and embedding models. Review it before running generation.
  • generate_query.py supports provider values like openai and anthropic. You must supply the corresponding API key.
  • Browser-mode Slack scripts require a valid xoxc token and full cookie string. App-mode uses xoxp and requires search:read and other scopes as appropriate.
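Browser mode pairs the xoxc token with the session cookie on every Slack Web API request. A sketch of how such a request is assembled, using only the standard library (the channel ID is a placeholder, and the request is built but not sent):

```python
import urllib.parse
import urllib.request

XOXC_TOKEN = "xoxc-..."            # placeholder: copy from the browser's network tab
COOKIE_STRING = "d=xoxd-...; ..."  # placeholder: full cookie string from the same session

# Slack's Web API accepts the token as a form field; the session cookie
# must accompany xoxc tokens or the call is rejected.
body = urllib.parse.urlencode(
    {"token": XOXC_TOKEN, "channel": "C0123456789", "limit": 100}
).encode()

req = urllib.request.Request(
    "https://slack.com/api/conversations.history",
    data=body,
    headers={"Cookie": COOKIE_STRING},
    method="POST",
)
# Sending requires a live session:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read())
print(req.full_url)
```

conversations.history is the standard Slack endpoint for paging through a channel's messages; the archiver scripts wrap calls like this with pagination and rate-limit handling.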

evals format export

  • Use validated_query_chunk_pairs_to_proper_json_format.py to convert validated_query_chunk_pairs.jsonl into documents.jsonl, queries.jsonl, and qrels.jsonl for retrieval evaluation.
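The BEIR-style layout splits the corpus, the queries, and the relevance judgments into three JSONL files. A sketch of what one validated pair becomes; the field names follow the common BEIR convention and are assumed here rather than read from the conversion script:

```python
import json

# One validated query-chunk pair (illustrative values).
pair = {
    "query": "when does the staging DB password rotate?",
    "chunk_id": "general-00042",
    "chunk_text": "alice: the staging DB password rotates every Friday",
}

# documents.jsonl: one line per chunk in the corpus.
doc_line = json.dumps({"_id": pair["chunk_id"], "title": "", "text": pair["chunk_text"]})
# queries.jsonl: one line per synthetic query.
query_line = json.dumps({"_id": "q1", "text": pair["query"]})
# qrels.jsonl: relevance judgment linking query to its answering chunk.
qrel_line = json.dumps({"query_id": "q1", "doc_id": pair["chunk_id"], "score": 1})

for line in (doc_line, query_line, qrel_line):
    print(line)
```

Each validated pair thus contributes one line to queries.jsonl and qrels.jsonl, while documents.jsonl holds every chunk once, relevant or not.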

Next steps

  • Add a small wrapper script to orchestrate the archive → clean → chunk → generate → validate pipeline.
  • Expand the dataset with more Slack workspaces

Contact
