Repository with example code and data used in the monthly Getting Started with Pinecone Webinar.
- Python >=3.9 (built on v3.13.8)
- Pinecone account (Sign up here)
- Create a project and copy your API key
Want to try it out without setting up your own environment?
You can use the shared API key provided in example.env to access the pre-loaded database indexes and assistant:
- API Key: `pcsk_2go2xm_EAXpMTvVHud6PP3od6iCB5NsCY3PC9smXUQktmh2eVaoDbAhCjqp7Fw5Yqitjqr`

This API key provides read-only access to:
- Both database indexes (`getting-started-webinar-dense` and `getting-started-webinar-sparse`) with pre-loaded Steam game data
- The Pinecone Assistant (`getting-started-webinar-assistant`) with uploaded Steam game data
Web Interface: You can also chat with the assistant directly in your browser at https://getting-started-with-pinecone-webinar.vercel.app/
Copy the API key from example.env into your .env file and you can start querying the database or chatting with the assistant immediately, without running the data load commands.
- Clone this repo: https://github.com/pinecone-io/getting-started-with-pinecone-webinar.git
- Get the data:
  - Download the CSV dataset from Kaggle at: https://www.kaggle.com/datasets/crainbramp/steam-dataset-2025-multi-modal-gaming-analytics?resource=download-directory&select=steam_dataset_2025_csv_package_v1
  - Unzip the download (it should unzip into a folder named `steam_dataset_2025_csv`)
  - Copy the `steam_dataset_2025_csv` folder to the `data` folder in the root of this repo
- Create a Python virtual environment: `python -m venv .venv`
- Activate the virtual environment: `source .venv/bin/activate` (or `.\.venv\Scripts\activate` on Windows)
- Install Python requirements: `pip install -r requirements.txt`
- Copy `example.env` to `.env`: `cp example.env .env`
- Edit `.env` and add your Pinecone API key
Load Steam game applications and reviews into Pinecone indexes:
```shell
python pc-webinar.py database-load
```

This command will:
- Create Pinecone indexes automatically if they don't exist:
  - Dense index: uses the `llama-text-embed-v2` model for integrated semantic embeddings
  - Sparse index: uses the `pinecone-sparse-english-v0` model for integrated keyword embeddings
- Load the Steam dataset CSV files (`applications.csv` and `reviews.csv`)
- Transform the data into text for upserting
- Chunk the text using token-based chunking with overlap
- Upsert all chunks with metadata to both dense and sparse indexes using integrated embeddings
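The chunking step above can be sketched as follows. This is a minimal illustration, not the repo's implementation: it substitutes whitespace tokens for a real tokenizer, and the default sizes come from the `CHUNK_SIZE`/`CHUNK_OVERLAP` configuration described later in this README.

```python
def chunk_tokens(text, chunk_size=1740, chunk_overlap=205):
    """Split text into overlapping token windows (whitespace stand-in tokenizer)."""
    tokens = text.split()  # the real loader likely uses a proper tokenizer
    step = chunk_size - chunk_overlap  # advance this many tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```

With the defaults, consecutive chunks share 205 tokens of context, so a sentence cut at a chunk boundary still appears whole in the neighboring chunk.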
Note: This data load will take several hours:
- It's a lot of records (hundreds of thousands of applications and over a million reviews)
- The upsert uses integrated embedding, which makes embedding model calls for every vector it upserts
- Progress bars will show real-time progress for both applications and reviews
Resilience: The load process is resilient to individual vector failures. If any vectors produce invalid embeddings (e.g., empty sparse vectors), they will be automatically skipped and the process will continue. A warning message will be displayed at the end listing any skipped vector IDs.
Output: The command prints the number of vectors successfully upserted for both applications and reviews.
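The skip-and-continue behavior can be sketched roughly like this; the helper name and the vector fields (`values`, `sparse_values`) are assumptions for illustration, not the repo's actual code:

```python
def filter_valid(vectors):
    """Separate upsertable vectors from ones with empty embeddings."""
    valid, skipped = [], []
    for v in vectors:
        # A vector is usable if it has dense values or sparse values
        if v.get("values") or v.get("sparse_values"):
            valid.append(v)
        else:
            skipped.append(v["id"])  # collect IDs to warn about at the end
    return valid, skipped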
Query the Pinecone database using hybrid, semantic, or lexical search:
```shell
# Hybrid search (default) - queries both dense and sparse indexes, then reranks results
python pc-webinar.py database-query "action games with good graphics"

# Semantic search - uses only the dense index for semantic similarity
python pc-webinar.py database-query "action games with good graphics" --mode semantic

# Lexical search - uses only the sparse index for keyword matching
python pc-webinar.py database-query "action games with good graphics" --mode lexical

# Specify number of results to return
python pc-webinar.py database-query "strategy games" --top-k 20

# Combine options
python pc-webinar.py database-query "indie puzzle games" -m semantic -t 5
```

Search Modes:
- hybrid (default): Queries both dense and sparse indexes, merges and deduplicates results, then reranks with `cohere-rerank-3.5` to surface the most relevant results
- semantic: Uses only the dense index for semantic similarity search
- lexical: Uses only the sparse index for keyword-based search
Arguments:
- `query` (required): The search query string
- `-m, --mode` (optional): Search mode - `hybrid`, `semantic`, or `lexical` (default: `hybrid`)
- `-t, --top-k` (optional): Number of results to return (default: `10`)
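The merge-and-deduplicate step of hybrid mode can be sketched like this. The hit shape (`id`, `score` keys) is an assumption about the result format, and the final sort by retrieval score merely stands in for the `cohere-rerank-3.5` call made in the real pipeline:

```python
def merge_hits(dense_hits, sparse_hits, top_k=10):
    """Combine dense and sparse results, keeping the best score per vector ID."""
    best = {}
    for hit in dense_hits + sparse_hits:
        prev = best.get(hit["id"])
        if prev is None or hit["score"] > prev["score"]:
            best[hit["id"]] = hit  # deduplicate: keep the higher-scoring copy
    # Real pipeline: send these candidates to cohere-rerank-3.5.
    # Stand-in: order by the retrieval score.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```

Deduplicating before reranking matters because the same chunk often appears in both result sets, and sending duplicates to the reranker wastes candidate slots.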
Load Steam game applications and reviews into the Pinecone Assistant:
```shell
python pc-webinar.py assistant-load
```

This command will:
- Convert the Steam dataset CSV files (`applications.csv` and `reviews.csv`) to JSON format and save them in the same directory as the CSV files
- Automatically split large JSON files into 100MB chunks if needed (files are named `applications_part1.json`, `reviews_part1.json`, etc. when split)
- Get or create the Pinecone Assistant (if it doesn't exist)
- Upload all JSON files (including split parts) to the assistant for use in chat completions
Note: The CSV files are automatically converted to JSON format before upload, as the Pinecone Assistant doesn't accept CSV files but does accept JSON files. Large files (over 100MB) are automatically split into multiple smaller JSON files to ensure successful uploads.
Note: This data load will take several hours:
- The files are large (`reviews.csv` has over 1 million records)
- Large files are automatically split into 100MB chunks to avoid upload size limits
- The upload automatically creates dense and sparse indexes and uses hybrid search (none of this is visible to the user)
- Progress bars will show file-level progress for each JSON file being uploaded
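The CSV-to-JSON conversion with size-based splitting can be sketched as below. The `_partN.json` naming follows this README; the function name, the per-record schema (one JSON object per CSV row), and the exact splitting rule are illustrative assumptions:

```python
import csv
import json
import os

def csv_to_json_parts(csv_path, max_bytes=100 * 1024 * 1024):
    """Convert a CSV to one or more JSON files, each under roughly max_bytes."""
    stem = os.path.splitext(csv_path)[0]
    parts, current, size = [], [], 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            encoded = json.dumps(row)
            # Start a new part when adding this row would exceed the budget
            if current and size + len(encoded) > max_bytes:
                parts.append(current)
                current, size = [], 0
            current.append(row)
            size += len(encoded)
    if current:
        parts.append(current)
    out_files = []
    for i, records in enumerate(parts, start=1):
        out_path = f"{stem}_part{i}.json"  # e.g. applications_part1.json
        with open(out_path, "w", encoding="utf-8") as out:
            json.dump(records, out)
        out_files.append(out_path)
    return out_files
```

Splitting by serialized size rather than row count keeps each part under the upload limit even when rows vary widely in length, as Steam reviews do.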
Chat with the Pinecone Assistant using the uploaded Steam game data:
```shell
# Use default model (gpt-4o)
python pc-webinar.py assistant-prompt "What are the most popular Steam games?"

# Specify a different model using the short flag
python pc-webinar.py assistant-prompt "Tell me about Counter-Strike's review sentiment" -m claude-3-5-sonnet

# Use the full --model flag
python pc-webinar.py assistant-prompt "What games have the best reviews?" --model gemini-2.5-pro
```

Arguments:
- `prompt` (required): The prompt/question to ask the assistant
- `-m, --model` (optional): Model to use for the assistant (default: `gpt-4o`)
Available Models:
- `gpt-4o` (default)
- `gpt-4.1`
- `o4-mini`
- `claude-3-5-sonnet`
- `claude-3-7-sonnet`
- `gemini-2.5-pro`
The project uses environment variables configured in .env:
- `PINECONE_API_KEY` (required): Your Pinecone API key
- `PINECONE_DENSE_INDEX` (optional): Name for the dense index (default: `getting-started-webinar-dense`)
- `PINECONE_SPARSE_INDEX` (optional): Name for the sparse index (default: `getting-started-webinar-sparse`)
- `PINECONE_ASSISTANT_NAME` (optional): Name for the Pinecone Assistant (default: `getting-started-webinar-assistant`)
- `CHUNK_SIZE` (optional): Chunk size in tokens for text processing (default: `1740`)
- `CHUNK_OVERLAP` (optional): Overlap between chunks in tokens (default: `205`)
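The way these optional variables likely fall back to their defaults can be sketched with `os.getenv`. The `load_config` helper and its return shape are illustrative, not the script's actual code; the default values mirror the list above:

```python
import os

def load_config():
    """Resolve optional settings from the environment, falling back to defaults."""
    return {
        "dense_index": os.getenv("PINECONE_DENSE_INDEX", "getting-started-webinar-dense"),
        "sparse_index": os.getenv("PINECONE_SPARSE_INDEX", "getting-started-webinar-sparse"),
        "assistant_name": os.getenv("PINECONE_ASSISTANT_NAME", "getting-started-webinar-assistant"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "1740")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "205")),
    }
```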
Index Configuration:
- Dense index uses the `llama-text-embed-v2` model with integrated embeddings
- Sparse index uses the `pinecone-sparse-english-v0` model with integrated embeddings
We are using 2 CSV files from the Steam Dataset 2025: Multi-Modal Gaming Analytics dataset from Kaggle.
- GitHub repo: vintagedon/steam-dataset-2025
- `applications.csv`: Contains all game application data from Steam
- `reviews.csv`: Contains all game review data from Steam
| Command | Description | Arguments |
|---|---|---|
| `database-load` | Load Steam data into Pinecone indexes | None |
| `database-query` | Query Pinecone indexes | `query` (required), `-m/--mode` (optional), `-t/--top-k` (optional) |
| `assistant-load` | Load Steam data into Pinecone Assistant | None |
| `assistant-prompt` | Chat with Pinecone Assistant | `prompt` (required), `-m/--model` (optional) |