- Advanced Features
- Performance Optimization
- Model Management
- System Configuration
- CLI Configuration
- Best Practices
## Advanced Features

### Loading Custom Models

#### Using CLI (New!)

```bash
# Load a custom model with the CLI
locallab start --model meta-llama/Llama-2-7b-chat-hf
```
#### Using Environment Variables

```python
import os

from locallab import start_server

# Load any HuggingFace model
os.environ["HUGGINGFACE_MODEL"] = "meta-llama/Llama-2-7b-chat-hf"

# Configure model settings
os.environ["LOCALLAB_MODEL_TEMPERATURE"] = "0.8"
os.environ["LOCALLAB_MODEL_MAX_LENGTH"] = "4096"
os.environ["LOCALLAB_MODEL_TOP_P"] = "0.95"

start_server()
```
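Once the server is running, you can connect with the bundled Python client and generate text. A minimal sketch, using the same `LocalLabClient` and `generate` call shown later in this guide:

```python
import asyncio

from locallab.client import LocalLabClient

async def main() -> None:
    client = LocalLabClient("http://localhost:8000")
    # Generate with the model and sampling settings configured above
    response = await client.generate("Summarize the benefits of local inference.")
    print(response)

asyncio.run(main())
```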
### Batch Processing

```python
import asyncio

from locallab.client import LocalLabClient

async def main() -> None:
    client = LocalLabClient("http://localhost:8000")

    # Process multiple prompts in parallel
    prompts = [
        "Write a poem about spring",
        "Explain quantum computing",
        "Tell me a joke",
    ]
    responses = await client.batch_generate(prompts)
    print(responses)

asyncio.run(main())
```
## Performance Optimization

### Memory Optimization

#### Using CLI (New!)

```bash
# Enable memory optimizations via CLI
locallab start --quantize --quantize-type int8 --attention-slicing
```
#### Using Environment Variables

```python
# Enable memory optimizations
os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int8"  # or "int4" for more savings
os.environ["LOCALLAB_ENABLE_CPU_OFFLOADING"] = "true"
```
### Speed Optimization

#### Using CLI (New!)

```bash
# Enable speed optimizations via CLI
locallab start --flash-attention --better-transformer
```
#### Using Environment Variables

```python
# Enable speed optimizations
os.environ["LOCALLAB_ENABLE_FLASH_ATTENTION"] = "true"
os.environ["LOCALLAB_ENABLE_ATTENTION_SLICING"] = "true"
os.environ["LOCALLAB_ENABLE_BETTERTRANSFORMER"] = "true"
```
### Resource Limits

```python
# Set resource limits
os.environ["LOCALLAB_MIN_FREE_MEMORY"] = "2000"  # MB
os.environ["LOCALLAB_MAX_BATCH_SIZE"] = "4"
os.environ["LOCALLAB_REQUEST_TIMEOUT"] = "30"  # seconds
```
## Model Management

```python
from locallab import MODEL_REGISTRY

# Check available models
print(MODEL_REGISTRY.keys())

# Load a specific model
await client.load_model("microsoft/phi-2")

# Get current model info
model_info = await client.get_current_model()
```
```python
# Define custom model settings
os.environ["LOCALLAB_CUSTOM_MODEL"] = "your-org/your-model"
os.environ["LOCALLAB_MODEL_INSTRUCTIONS"] = """You are a helpful AI assistant.
Please provide clear and concise responses."""
```
## System Configuration

```python
# Configure server settings
os.environ["LOCALLAB_HOST"] = "0.0.0.0"
os.environ["LOCALLAB_PORT"] = "8000"
os.environ["LOCALLAB_WORKERS"] = "4"
os.environ["LOCALLAB_ENABLE_CORS"] = "true"

# Configure logging
os.environ["LOCALLAB_LOG_LEVEL"] = "INFO"
os.environ["LOCALLAB_ENABLE_FILE_LOGGING"] = "true"
os.environ["LOCALLAB_LOG_FILE"] = "locallab.log"
```
## CLI Configuration

The LocalLab CLI provides a powerful way to configure and manage your server. Here are some advanced CLI features:
```bash
# Run the configuration wizard
locallab config

# Get detailed system information
locallab info

# Start with advanced configuration
locallab start \
  --model microsoft/phi-2 \
  --port 8080 \
  --quantize \
  --quantize-type int4 \
  --attention-slicing \
  --flash-attention \
  --better-transformer
```
The CLI stores your configuration in `~/.locallab/config.json`. You can edit this file directly for advanced configuration:
```json
{
  "model_id": "microsoft/phi-2",
  "port": 8080,
  "enable_quantization": true,
  "quantization_type": "int8",
  "enable_attention_slicing": true,
  "enable_flash_attention": true,
  "enable_better_transformer": true
}
```
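If you prefer scripting changes over hand-editing, here is a small sketch using only the standard library (the key names are those shown above):

```python
import json
from pathlib import Path

config_path = Path.home() / ".locallab" / "config.json"

# Read the existing configuration, tweak it, and write it back
config = json.loads(config_path.read_text())
config["quantization_type"] = "int4"
config["port"] = 8080
config_path.write_text(json.dumps(config, indent=2))
```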
For more details, see the CLI Guide.
## Best Practices

- **Resource Management**
  - Monitor system resources
  - Use the quantization level appropriate to your model and hardware
  - Enable optimizations based on your hardware (see the sketch after this list)
- **Error Handling**

  ```python
  try:
      response = await client.generate("Hello")
  except Exception as e:
      if "out of memory" in str(e):
          # Fall back to a smaller model
          await client.load_model("microsoft/phi-2")
  ```
- **Performance Monitoring**

  ```python
  # Get system information
  system_info = await client.get_system_info()
  print(f"CPU Usage: {system_info.cpu_usage}%")
  print(f"Memory Usage: {system_info.memory_usage}%")
  print(f"GPU Info: {system_info.gpu_info}")
  ```
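As referenced under Resource Management above, here is a minimal sketch of choosing optimization settings from detected hardware. It assumes PyTorch is available and only sets the environment variables documented earlier; the 8 GB threshold is illustrative, not an official recommendation:

```python
import os

import torch

def configure_for_hardware() -> None:
    """Pick optimization settings from available GPU memory (illustrative thresholds)."""
    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        free_gb = free_bytes / 1024**3
        if free_gb < 8:
            # Tight on VRAM: quantize aggressively and slice attention
            os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
            os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int4"
            os.environ["LOCALLAB_ENABLE_ATTENTION_SLICING"] = "true"
        else:
            os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
            os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int8"
    else:
        # CPU-only: offload weights and keep batches small
        os.environ["LOCALLAB_ENABLE_CPU_OFFLOADING"] = "true"
        os.environ["LOCALLAB_MAX_BATCH_SIZE"] = "1"

configure_for_hardware()
```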