Repository with example code and data used in the monthly Getting Started with Pinecone Webinar.
- Python >=3.9 (built on v3.13.8)
- Pinecone account (Sign up here)
- Create a project and copy your API key
Want to try it out without setting up your own environment?
You can use the shared API key provided in example.env to access the pre-loaded database indexes and assistant:
- API Key: `pcsk_2go2xm_EAXpMTvVHud6PP3od6iCB5NsCY3PC9smXUQktmh2eVaoDbAhCjqp7Fw5Yqitjqr`

This API key provides read-only access to:
- Both database indexes (`getting-started-webinar-dense` and `getting-started-webinar-sparse`) with pre-loaded Steam game data
- The Pinecone Assistant (`getting-started-webinar-assistant`) with uploaded Steam game data
Web Interface: You can also chat with the assistant directly in your browser at https://getting-started-with-pinecone-webinar.vercel.app/
Copy the API key from example.env into your .env file and you can start querying the database or chatting with the assistant immediately, without running the data load commands.
- Clone this repo: https://github.com/pinecone-io/getting-started-with-pinecone-webinar.git
- Get the data:
  - Download the CSV dataset from Kaggle at: https://www.kaggle.com/datasets/crainbramp/steam-dataset-2025-multi-modal-gaming-analytics?resource=download-directory&select=steam_dataset_2025_csv_package_v1
  - Unzip the download (it should unzip into a folder named `steam_dataset_2025_csv`)
  - Copy the `steam_dataset_2025_csv` folder to the `data` folder in the root of this repo
- Create a Python virtual environment: `python -m venv .venv`
- Activate the virtual environment: `source .venv/bin/activate` (or `.\.venv\Scripts\activate` on Windows)
- Install Python requirements: `pip install -r requirements.txt`
- Copy `example.env` to `.env`: `cp example.env .env`
- Edit `.env` and add your Pinecone API key
Load Steam game applications and reviews into Pinecone indexes:
```shell
python pc-webinar.py database-load
```

This command will:
- Create Pinecone indexes automatically if they don't exist:
  - Dense index: uses the `llama-text-embed-v2` model for integrated semantic embeddings
  - Sparse index: uses the `pinecone-sparse-english-v0` model for integrated keyword embeddings
- Load the Steam dataset CSV files (`applications.csv` and `reviews.csv`)
- Transform the data into text for upserting
- Chunk the text using token-based chunking with overlap
- Upsert all chunks with metadata to both dense and sparse indexes using integrated embeddings
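The chunking step above can be sketched as follows. This is a minimal illustration, not the repo's implementation: it substitutes whitespace tokens for a real tokenizer, and the default sizes come from the `CHUNK_SIZE`/`CHUNK_OVERLAP` configuration described later in this README.

```python
def chunk_tokens(text, chunk_size=1740, chunk_overlap=205):
    """Split text into overlapping token windows (whitespace stand-in tokenizer)."""
    tokens = text.split()  # the real loader likely uses a proper tokenizer
    step = chunk_size - chunk_overlap  # advance this many tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```

With the defaults, consecutive chunks share 205 tokens of context, so a sentence cut at a chunk boundary still appears whole in the neighboring chunk.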
Note: This data load will take several hours:
- It's a lot of records (hundreds of thousands of applications and over a million reviews)
- The upsert uses integrated embedding, which makes embedding model calls for every vector it upserts
- Progress bars will show real-time progress for both applications and reviews
Resilience: The load process is resilient to individual vector failures. If any vectors produce invalid embeddings (e.g., empty sparse vectors), they will be automatically skipped and the process will continue. A warning message will be displayed at the end listing any skipped vector IDs.
Output: The command prints the number of vectors successfully upserted for both applications and reviews.
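The skip-and-continue behavior can be sketched roughly like this; the helper name and the vector fields (`values`, `sparse_values`) are assumptions for illustration, not the repo's actual code:

```python
def filter_valid(vectors):
    """Separate upsertable vectors from ones with empty embeddings."""
    valid, skipped = [], []
    for v in vectors:
        # A vector is usable if it has dense values or sparse values
        if v.get("values") or v.get("sparse_values"):
            valid.append(v)
        else:
            skipped.append(v["id"])  # collect IDs to warn about at the end
    return valid, skipped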
Query the Pinecone database using hybrid, semantic, or lexical search:
```shell
# Hybrid search (default) - queries both dense and sparse indexes, then reranks results
python pc-webinar.py database-query "action games with good graphics"

# Semantic search - uses only the dense index for semantic similarity
python pc-webinar.py database-query "action games with good graphics" --mode semantic

# Lexical search - uses only the sparse index for keyword matching
python pc-webinar.py database-query "action games with good graphics" --mode lexical

# Specify number of results to return
python pc-webinar.py database-query "strategy games" --top-k 20

# Combine options
python pc-webinar.py database-query "indie puzzle games" -m semantic -t 5
```

Search Modes:
- hybrid (default): Queries both dense and sparse indexes, merges and deduplicates results, then reranks with `cohere-rerank-3.5` to surface the most relevant results
- semantic: Uses only the dense index for semantic similarity search
- lexical: Uses only the sparse index for keyword-based search
Arguments:
- `query` (required): The search query string
- `-m, --mode` (optional): Search mode - `hybrid`, `semantic`, or `lexical` (default: `hybrid`)
- `-t, --top-k` (optional): Number of results to return (default: `10`)
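The merge-and-deduplicate step of hybrid mode can be sketched like this. The hit shape (`id`, `score` keys) is an assumption about the result format, and the final sort by retrieval score merely stands in for the `cohere-rerank-3.5` call made in the real pipeline:

```python
def merge_hits(dense_hits, sparse_hits, top_k=10):
    """Combine dense and sparse results, keeping the best score per vector ID."""
    best = {}
    for hit in dense_hits + sparse_hits:
        prev = best.get(hit["id"])
        if prev is None or hit["score"] > prev["score"]:
            best[hit["id"]] = hit  # deduplicate: keep the higher-scoring copy
    # Real pipeline: send these candidates to cohere-rerank-3.5.
    # Stand-in: order by the retrieval score.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```

Deduplicating before reranking matters because the same chunk often appears in both result sets, and sending duplicates to the reranker wastes candidate slots.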
Load Steam game applications and reviews into the Pinecone Assistant:
```shell
python pc-webinar.py assistant-load
```

This command will:
- Convert the Steam dataset CSV files (`applications.csv` and `reviews.csv`) to JSON format and save them in the same directory as the CSV files
- Automatically split large JSON files into 100MB chunks if needed (files are named `applications_part1.json`, `reviews_part1.json`, etc. when split)
- Get or create the Pinecone Assistant (if it doesn't exist)
- Upload all JSON files (including split parts) to the assistant for use in chat completions
Note: The CSV files are automatically converted to JSON format before upload, as the Pinecone Assistant doesn't accept CSV files but does accept JSON files. Large files (over 100MB) are automatically split into multiple smaller JSON files to ensure successful uploads.
Note: This data load will take several hours:
- The files are large (`reviews.csv` has over 1 million records)
- Large files are automatically split into 100MB chunks to avoid upload size limits
- The upload automatically creates dense and sparse indexes and uses hybrid search (none of this is visible to the user)
- Progress bars will show file-level progress for each JSON file being uploaded
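The CSV-to-JSON conversion with size-based splitting can be sketched as below. The `_partN.json` naming follows this README; the function name, the per-record schema (one JSON object per CSV row), and the exact splitting rule are illustrative assumptions:

```python
import csv
import json
import os

def csv_to_json_parts(csv_path, max_bytes=100 * 1024 * 1024):
    """Convert a CSV to one or more JSON files, each under roughly max_bytes."""
    stem = os.path.splitext(csv_path)[0]
    parts, current, size = [], [], 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            encoded = json.dumps(row)
            # Start a new part when adding this row would exceed the budget
            if current and size + len(encoded) > max_bytes:
                parts.append(current)
                current, size = [], 0
            current.append(row)
            size += len(encoded)
    if current:
        parts.append(current)
    out_files = []
    for i, records in enumerate(parts, start=1):
        out_path = f"{stem}_part{i}.json"  # e.g. applications_part1.json
        with open(out_path, "w", encoding="utf-8") as out:
            json.dump(records, out)
        out_files.append(out_path)
    return out_files
```

Splitting by serialized size rather than row count keeps each part under the upload limit even when rows vary widely in length, as Steam reviews do.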
Chat with the Pinecone Assistant using the uploaded Steam game data:
```shell
# Use default model (gpt-4o)
python pc-webinar.py assistant-prompt "What are the most popular Steam games?"

# Specify a different model using the short flag
python pc-webinar.py assistant-prompt "Tell me about Counter-Strike's review sentiment" -m claude-3-5-sonnet

# Use the full --model flag
python pc-webinar.py assistant-prompt "What games have the best reviews?" --model gemini-2.5-pro
```

Arguments:
- `prompt` (required): The prompt/question to ask the assistant
- `-m, --model` (optional): Model to use for the assistant (default: `gpt-4o`)
Available Models:
- `gpt-4o` (default)
- `gpt-4.1`
- `o4-mini`
- `claude-3-5-sonnet`
- `claude-3-7-sonnet`
- `gemini-2.5-pro`
The project uses environment variables configured in .env:
- `PINECONE_API_KEY` (required): Your Pinecone API key
- `PINECONE_DENSE_INDEX` (optional): Name for the dense index (default: `getting-started-webinar-dense`)
- `PINECONE_SPARSE_INDEX` (optional): Name for the sparse index (default: `getting-started-webinar-sparse`)
- `PINECONE_ASSISTANT_NAME` (optional): Name for the Pinecone Assistant (default: `getting-started-webinar-assistant`)
- `CHUNK_SIZE` (optional): Chunk size in tokens for text processing (default: `1740`)
- `CHUNK_OVERLAP` (optional): Overlap between chunks in tokens (default: `205`)
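The way these optional variables likely fall back to their defaults can be sketched with `os.getenv`. The `load_config` helper and its return shape are illustrative, not the script's actual code; the default values mirror the list above:

```python
import os

def load_config():
    """Resolve optional settings from the environment, falling back to defaults."""
    return {
        "dense_index": os.getenv("PINECONE_DENSE_INDEX", "getting-started-webinar-dense"),
        "sparse_index": os.getenv("PINECONE_SPARSE_INDEX", "getting-started-webinar-sparse"),
        "assistant_name": os.getenv("PINECONE_ASSISTANT_NAME", "getting-started-webinar-assistant"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "1740")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "205")),
    }
```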
Index Configuration:
- Dense index uses the `llama-text-embed-v2` model with integrated embeddings
- Sparse index uses the `pinecone-sparse-english-v0` model with integrated embeddings
We are using 2 CSV files from the Steam Dataset 2025: Multi-Modal Gaming Analytics dataset from Kaggle.
- GitHub repo: vintagedon/steam-dataset-2025
- `applications.csv`: Contains all game application data from Steam
- `reviews.csv`: Contains all game review data from Steam
| Command | Description | Arguments |
|---|---|---|
| `database-load` | Load Steam data into Pinecone indexes | None |
| `database-query` | Query Pinecone indexes | `query` (required), `-m/--mode` (optional), `-t/--top-k` (optional) |
| `assistant-load` | Load Steam data into Pinecone Assistant | None |
| `assistant-prompt` | Chat with Pinecone Assistant | `prompt` (required), `-m/--model` (optional) |