The model should be capable of testing different factors.
It should support both long and short positions.
The backtest should report the following key metrics:
- Sharpe Ratio
- Sortino Ratio
- Annualized Return
- Cumulative Return
- Max Drawdown
- Number of trades over the period
Whenever the Sharpe Ratio exceeds 2.5, we perform further analysis:
- Equity curve of the strategy (compared with the BTC price chart)
- 3D visualization to check for potential overfitting (see the sketch below)
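When the parameter grid is two-dimensional, a surface plot of Sharpe ratios makes isolated spikes (a common overfitting symptom) easy to spot. A minimal sketch using matplotlib; the `results` DataFrame and its `param_1`, `param_2`, and `sharpe_ratio` columns are assumptions, not the project's actual schema:

```python
# Hypothetical sketch: surface plot of Sharpe ratio over a 2D parameter grid.
# Column names (param_1, param_2, sharpe_ratio) are assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_sharpe_surface(results: pd.DataFrame) -> None:
    # Pivot the flat results table into a grid: rows = param_1, cols = param_2
    grid = results.pivot(index="param_1", columns="param_2", values="sharpe_ratio")
    X, Y = np.meshgrid(grid.columns.values, grid.index.values)

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(X, Y, grid.values, cmap="viridis")
    ax.set_xlabel("param_2")
    ax.set_ylabel("param_1")
    ax.set_zlabel("Sharpe Ratio")
    plt.show()
```

A smooth plateau of good Sharpe ratios suggests a robust parameter region; a single sharp peak suggests the parameters are fit to noise.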
```mermaid
graph TD
    B --> E
    H --> I
    H --> K
    M --> I
    J --> P
    subgraph Data Process
        A[Download Data] --> B[Load and Process Data]
    end
    subgraph Factor Initialization
        E[Initialize Factor Engine] --> F[Define and Register Factor]
        F --> G[Calculate Factor Values]
        G --> H[Initialize Factor-Based Strategy]
    end
    subgraph Backtest
        I[Run Backtest with Parameters] --> J[Evaluate Performance Metrics]
    end
    subgraph Optimization
        K[Define Parameter Search Space] --> L[Run Parameter Optimization]
        L --> M[Find Optimal Parameters]
    end
    subgraph Save Results
        P[Save Results to CSV and TXT Files] --> Q[Visualize Optimization Results]
    end
```
```mermaid
graph LR
    A[Data Module]
    B[Factor Engine]
    C[Backtest Engine]
    D[Performance Evaluation]
    subgraph Evaluation
        D1[Performance Metrics] --> D
        D2[Visualization] --> D
    end
    subgraph Backtest System
        C1[Strategy Base Class] --> C
        C2[Trading Strategy] --> C
    end
    subgraph Factor System
        B1[Factor Base Class] --> B
        B2[Custom Factors] --> B
    end
    subgraph Data Pipeline
        A1[Exchange Data] --> A
        A2[External Data] --> A
    end
```
```mermaid
classDiagram
    class DataDownloader {
        download_data()
        process_data()
        fetch_and_process_data()
    }
    class DataLoader {
        load_data()
        merge_data()
        preprocess_data()
    }
    DataDownloader --> DataLoader
```
The DataDownloader class is designed to download and process cryptocurrency market data from various exchanges. Here's a comprehensive guide on how to use it.
```python
from data.data_downloader import DataDownloader

# Initialize DataDownloader
downloader = DataDownloader(
    symbol="BTCUSDT",          # Trading pair
    interval="1d",             # Time interval
    start_date="2023-01-01",   # Start date
    end_date="2023-12-31",     # End date
    data_folder="./dataset",   # Data storage location
    data_type="spot",          # Market type
    exchange="binance"         # Exchange name
)

# Download and process data
data = downloader.fetch_and_process_data()
```
Parameters Explanation
- Required Parameters
  - symbol: Trading pair symbol (e.g., "BTCUSDT", "ETHUSDT")
  - interval: Kline interval (e.g., "1m", "5m", "1h", "1d")
  - start_date: Start date in "YYYY-MM-DD" format
  - end_date: End date in "YYYY-MM-DD" format
  - data_folder: Directory path for data storage
- Optional Parameters
  - data_type: Market type (default: "spot")
    - "spot": Spot market data
    - "futures": Futures market data
  - exchange: Exchange name (default: "binance")
File Storage Structure

```
dataset/
└── binance/
    └── BTCUSDT/
        └── spot/
            └── 1d/
                └── BTCUSDT_1d_2023-01-01_to_2023-12-31.h5
```
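The DataLoader side of the pipeline reads these files back and prepares them for the factor engine. A hypothetical usage sketch; the module path, constructor arguments, and method signatures are assumptions based only on the class diagram above:

```python
# Hypothetical sketch based on the class diagram; the import path and all
# signatures are assumptions, not the actual API.
from data.data_loader import DataLoader

loader = DataLoader(data_folder="./dataset")

# Load the HDF5 file written by DataDownloader
price_data = loader.load_data(
    symbol="BTCUSDT", interval="1d",
    start_date="2023-01-01", end_date="2023-12-31",
)

# Merge external data (e.g., on-chain metrics) and clean the result
merged = loader.merge_data(price_data)
data = loader.preprocess_data(merged)
```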
```mermaid
classDiagram
    class FactorEngine {
        register_factor()
        calculate_factors()
        list_factors()
    }
    class BaseFactor {
        __init__()
        calculate()
    }
    class CustomFactor {
        __init__()
        calculate()
    }
    BaseFactor <|-- CustomFactor
    FactorEngine --> BaseFactor
```
Example
The provided code defines a USDT Issuance Factor as a class named USDTIssuance2Factor. This factor is part of a trading strategy framework and is designed to generate trading signals based on the daily issuance of USDT.
The USDTIssuance2Factor analyzes USDT issuance changes and determines whether to buy, sell, or hold a position based on predefined thresholds:
- Long Signal (1): When the issuance exceeds the upper threshold.
- Short Signal (-1): When the issuance falls below the lower threshold.
- Close Signal (0): When the issuance is between the two thresholds.
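A minimal sketch of such a factor; the BaseFactor interface (inferred from the class diagram), the import path, and the `usdt_issuance` column name are assumptions:

```python
# Minimal sketch of the factor described above; the BaseFactor interface and
# the 'usdt_issuance' column name are assumptions based on the class diagram.
import numpy as np
import pandas as pd

from factor.base_factor import BaseFactor  # assumed module path

class USDTIssuance2Factor(BaseFactor):
    def __init__(self, upper_threshold: float, lower_threshold: float):
        self.upper_threshold = upper_threshold
        self.lower_threshold = lower_threshold

    def calculate(self, data: pd.DataFrame) -> np.ndarray:
        issuance = data["usdt_issuance"].values
        signals = np.zeros(len(issuance))             # 0: close / no position
        signals[issuance > self.upper_threshold] = 1  # long signal
        signals[issuance < self.lower_threshold] = -1 # short signal
        return signals
```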
```mermaid
classDiagram
    class BacktestEngine {
        execute_trade()
    }
    class BaseStrategy {
        generate_signals()
    }
    class PerformanceEvaluator {
        calculate_metrics()
    }
    BaseStrategy --> BacktestEngine
    BacktestEngine --> PerformanceEvaluator
```
Trading Strategy
- Signal-Based Trading:
  - The strategy depends on signals (buy, sell, or hold) generated by a strategy class.
  - Signals are numeric values:
    - 1: Long signal
    - -1: Short signal
    - 0: Close position (no action)
- Trade Execution (see the sketch after the slippage rules below):
  - No Position:
    - Signal 1: Open long position
    - Signal -1: Open short position
    - Signal 0: No action
  - Long Position:
    - Signal 1: No action
    - Signal -1: Switch to short (sell 2x position)
    - Signal 0: Close position
  - Short Position:
    - Signal 1: Switch to long (buy 2x position)
    - Signal -1: No action
    - Signal 0: Close position
- Added a slippage parameter, initialized to 0.001. The current logic is as follows:
  - Open Long Position: Buy at price * (1 + slippage)
  - Open Short Position: Sell at price * (1 - slippage)
  - Close Long Position: Sell at price * (1 - slippage)
  - Close Short Position: Buy at price * (1 + slippage)
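A condensed sketch of the execution and slippage rules above; the function and variable names are illustrative, not the engine's actual internals:

```python
# Illustrative sketch of the execution rules above; names are hypothetical.
SLIPPAGE = 0.001

def execute_trade(position: int, signal: int, price: float) -> tuple[int, float]:
    """Return the new position and the effective fill price."""
    if signal == position or signal not in (-1, 0, 1):
        return position, price                # no action
    if signal == 1:
        # Open long, or switch short -> long (a switch buys 2x the size)
        return 1, price * (1 + SLIPPAGE)      # buys pay slippage upward
    if signal == -1:
        # Open short, or switch long -> short (a switch sells 2x the size)
        return -1, price * (1 - SLIPPAGE)     # sells receive slippage downward
    # signal == 0: close whichever position is open
    if position == 1:
        return 0, price * (1 - SLIPPAGE)      # sell to close long
    return 0, price * (1 + SLIPPAGE)          # buy to close short
```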
Performance Metrics Explanation and Formulas
- Total Return
  - Description: Measures the overall return of the portfolio relative to the initial investment.
  - Formula:
$$ \text{Total Return} = \frac{\text{Final Portfolio Value} - \text{Initial Investment}}{\text{Initial Investment}} $$
- Annualized Return
  - Description: Adjusts the total return to an annualized rate, taking into account the duration of the investment.
  - Formula:
$$ \text{Annualized Return} = \left(1 + \text{Total Return}\right)^{\frac{1}{\text{Years}}} - 1 $$
$$ \text{Years} = \frac{\text{End Date} - \text{Start Date}}{365.25} $$
- Sharpe Ratio
  - Description: Measures the strategy's risk-adjusted return by comparing the excess return (return above the risk-free rate) to its volatility.
  - Formula:
$$ \text{Sharpe Ratio} = \frac{\text{Mean(Excess Daily Returns)}}{\text{Standard Deviation of Daily Returns}} \times \sqrt{365} $$
$$ \text{Excess Daily Returns} = \text{Daily Returns} - \frac{\text{Risk-Free Rate (0.0)}}{365} $$
- Sortino Ratio
  - Description: Similar to the Sharpe Ratio, but focuses only on downside risk, which is more relevant for risk-averse investors.
  - Formula:
$$ \text{Sortino Ratio} = \frac{\text{Mean(Excess Daily Returns)}}{\text{Downside Deviation}} \times \sqrt{365} $$
$$ \text{Excess Daily Returns} = \text{Daily Returns} - \frac{\text{Risk-Free Rate (0.0)}}{365} $$
$$ \text{Downside Deviation} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\text{Negative Returns}_i)^2} $$
  - Only negative returns (below the target return of 0.0) are considered.
- Maximum Drawdown
  - Description: Measures the largest peak-to-trough decline in portfolio value, representing the worst potential loss.
  - Formula:
$$ \text{Max Drawdown} = \min(\text{Cumulative Portfolio Returns} - \text{Running Maximum}) $$
$$ \text{Running Maximum}_t = \max(\text{Cumulative Portfolio Returns}_{1..t}) $$
- Number of Trades
  - Description: Counts the total number of trades executed during the backtest.
  - Formula:
$$ \text{Number of Trades} = \text{Count(Position Changes)} $$
  - Each change in position (buy or sell) is counted as a trade.
- Cumulative Returns
  - Description: Tracks the portfolio's return over time relative to the initial investment.
  - Formula:
$$ \text{Cumulative Return at Time } t = \frac{\text{Portfolio Value at } t}{\text{Initial Investment}} $$
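A compact sketch applying these formulas to a daily portfolio value series with NumPy; the trade count is omitted since it requires the position series:

```python
# Sketch of the metric formulas above, assuming a daily portfolio value series.
import numpy as np

def performance_metrics(values: np.ndarray, risk_free_rate: float = 0.0) -> dict:
    daily_returns = np.diff(values) / values[:-1]
    excess = daily_returns - risk_free_rate / 365

    total_return = values[-1] / values[0] - 1
    years = len(values) / 365.25                 # daily bars, so length ~ days
    annualized = (1 + total_return) ** (1 / years) - 1

    sharpe = excess.mean() / daily_returns.std() * np.sqrt(365)

    # Downside deviation: only returns below the 0.0 target contribute
    negative = np.minimum(daily_returns, 0.0)
    downside_dev = np.sqrt(np.mean(negative ** 2))
    sortino = excess.mean() / downside_dev * np.sqrt(365)

    cumulative = values / values[0]
    running_max = np.maximum.accumulate(cumulative)
    max_drawdown = np.min(cumulative - running_max)

    return {
        "total_return": total_return,
        "annualized_return": annualized,
        "sharpe_ratio": sharpe,
        "sortino_ratio": sortino,
        "max_drawdown": max_drawdown,
    }
```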
The system has undergone significant performance optimization, achieving a 23x speed improvement. Here are the key optimization strategies:
- Before: Frequent Pandas DataFrame operations
- After: Using NumPy arrays for calculations
- Why it's faster:
- Lower-level implementation
- No index overhead
- More efficient memory access patterns
- Direct CPU array operations
- Before: Loop-based calculations and multiple if-else statements
- After: Vectorized operations using `np.where` and array operations
- Why it's faster:
- Leverages CPU's SIMD (Single Instruction Multiple Data) capabilities
- Reduces branch prediction failures
- Allows parallel processing at CPU level
- Minimizes Python interpreter overhead
- Before: Frequent DataFrame updates and copies
- After: Batch operations on NumPy arrays
- Why it's faster:
- Reduced memory allocations
- Fewer data copies
- Better cache utilization
- Single DataFrame update at the end
```python
import numpy as np

# Before Optimization: row-by-row loop over the DataFrame (slow)
holdings_col = portfolio.columns.get_loc('holdings')
for i in range(1, len(data)):
    if portfolio.iloc[i - 1, holdings_col] == 0:
        if signals.iloc[i]['signal'] == 1:
            portfolio.iloc[i, holdings_col] = 1
        elif signals.iloc[i]['signal'] == -1:
            portfolio.iloc[i, holdings_col] = -1

# After Optimization: a single vectorized pass over NumPy arrays
signal_array = signals['signal'].values
holdings_array = np.where(signal_array == 1, 1,
                          np.where(signal_array == -1, -1, 0))
portfolio['holdings'] = holdings_array
```
- Trading logic execution: 23x faster
- Memory usage: Significantly reduced
- Code maintainability: Improved through consistent vectorization patterns
- Scalability: Better handling of large datasets
- Use NumPy arrays for numerical computations whenever possible
- Vectorize operations instead of using loops
- Minimize DataFrame operations and perform them in batch
- Keep data in contiguous memory blocks
- Reduce object creation and copying
- Use appropriate data structures for the task
These optimizations demonstrate how proper vectorization and data structure selection can dramatically improve performance in Python data processing applications.
The FactorEngine has been optimized with a state management system:
- State Reset Mechanism
  - Added a `reset()` method to clear cached factor values and signals
  - Allows reuse of FactorEngine instances across multiple tests
  - Maintains factor definitions while clearing computed results (see the sketch below)
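A sketch of what such a reset can look like; the attribute names are assumptions:

```python
# Sketch of the reset mechanism; attribute names are assumptions.
class FactorEngine:
    def __init__(self):
        self.factors = {}        # registered factor definitions (kept on reset)
        self.factor_values = {}  # cached computed values (cleared on reset)
        self.signals = None      # cached signals (cleared on reset)

    def reset(self) -> None:
        """Clear cached results while keeping factor registrations."""
        self.factor_values.clear()
        self.signals = None
```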
- Memory Management
  - Efficient reuse of engine instances reduces memory allocation
  - Prevents memory leaks during large-scale optimization
  - Minimizes garbage collection overhead
- Parallel Processing
  - Automatic CPU core detection and limitation
  - Optimized batch size based on CPU cores

```python
n_jobs = min(psutil.cpu_count(), 32)  # Limit max processes
batch_size = max(10, n_jobs * 10)     # Dynamic batch sizing
```
- Resource Management
  - Pre-allocated FactorEngine pool for each process
  - Cyclic engine reuse pattern to minimize resource consumption

```python
factor_engines = [FactorEngine() for _ in range(n_jobs)]
# ...
engine = factor_engines[i % len(factor_engines)]  # Cyclic usage
```
- Batch Processing
  - Efficient parameter combination testing in batches
  - Reduced inter-process communication overhead
  - Optimized progress tracking with batch updates (combined with pooling in the sketch below)
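Putting the pooling, batching, and progress-tracking ideas together, a single-process sketch; `run_backtest` and `param_combos` are hypothetical placeholders, and in the real system each batch would be dispatched to worker processes:

```python
# Single-process sketch of the pooling + batching pattern described above;
# run_backtest and param_combos are hypothetical placeholders.
import time
import psutil

n_jobs = min(psutil.cpu_count(), 32)   # limit max worker processes
batch_size = max(10, n_jobs * 10)      # dynamic batch sizing

factor_engines = [FactorEngine() for _ in range(n_jobs)]  # pre-allocated pool
results, best_sharpe = {}, float("-inf")
start_time = time.time()

for i, combo in enumerate(param_combos):
    engine = factor_engines[i % len(factor_engines)]  # cyclic engine reuse
    engine.reset()                                    # clear cached state
    metrics = run_backtest(engine, combo)             # hypothetical objective
    results[combo] = metrics
    best_sharpe = max(best_sharpe, metrics["sharpe_ratio"])
    if (i + 1) % batch_size == 0:                     # batch-level progress update
        elapsed = time.time() - start_time
        print(f"{i + 1}/{len(param_combos)} done, "
              f"best Sharpe {best_sharpe:.2f}, {elapsed:.0f}s elapsed")
```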
- Performance Monitoring
  - Real-time tracking of the best Sharpe ratio
  - Elapsed time monitoring per combination
  - Batch-level progress updates
- Memory Efficiency
  - Reduced memory allocation frequency
  - Better memory usage patterns
  - Minimized object creation/destruction cycles
- Processing Speed
  - Optimized parallel execution
  - Efficient resource utilization
  - Reduced system call overhead
- Scalability
  - Handles large parameter spaces efficiently
  - Automatic resource allocation
  - Balanced CPU utilization
- Reliability
  - Robust error handling
  - Process isolation
  - State management consistency
- Vectorized Operations
  - Pre-allocated numpy arrays for signals
  - Batch factor value processing
  - Reduced DataFrame operations

```python
signals = np.zeros(len(data))
signals[ma_1 > ma_2] = 1
signals[ma_1 < ma_2] = -1
```
- Efficient Moving Average Calculation
  - Used numpy's convolution for MA computation
  - Optimized padding for missing values

```python
ma = np.convolve(values, np.ones(ma_period) / ma_period, mode='valid')
ma = np.pad(ma, (ma_period - 1, 0), mode='edge')
```
- Vectorized Signal Generation
  - Single-pass signal calculation
  - Eliminated loops and conditionals
  - Optimized threshold comparisons

```python
signals[values > self.upper_threshold] = 1
signals[values < self.lower_threshold] = -1
```
- Memory Management
  - Pre-allocated result arrays
  - Cached factor values in numpy arrays
  - Reduced DataFrame conversions

```python
factor_arrays = np.empty((self._data_length, len(self.factors)))
```
- Computation Optimization
  - Single-pass factor calculation
  - Efficient state management
  - Optimized reset mechanism

```python
self.factor_values[name] = factor_arrays[:, i]
```
- Timestamp Processing
  - Direct integer conversion
  - Avoided string operations
  - Used compact data types

```python
# Convert nanosecond epoch timestamps to seconds stored as compact int32
data['timestamp_start'] = (data['timestamp_start'].astype(np.int64) // 1000000000).astype(np.int32)
```
- Data Selection
  - Used views instead of copies
  - Pre-defined required columns
  - Optimized memory usage

```python
# Shallow copy: a new frame whose column data is shared with the original
data = data[required_columns].copy(deep=False)
```
- Memory Efficiency
  - Reduced memory allocations
  - Used numpy arrays instead of DataFrames where possible
  - Implemented efficient data type conversions
  - Minimized data copying
- Computational Speed
  - Vectorized operations throughout
  - Eliminated loops and conditionals
  - Reduced DataFrame operations
  - Optimized numerical computations
- Data Processing
  - Efficient timestamp handling
  - Optimized data selection
  - Reduced type conversions
  - Minimized string operations
- Resource Management
  - Efficient memory usage
  - Optimized state management
  - Reduced object creation
  - Better garbage collection
The optimizer showed memory leaks during large-scale parameter optimization, particularly when processing multiple trading pairs and timeframes.
First attempt: continuous memory monitoring.

```python
# Added aggressive memory management
import gc
import psutil

memory_threshold = 85.0  # Threshold in percent (memory_percent() returns 0-100)
memory_usage = psutil.Process().memory_percent()
if memory_usage > memory_threshold:
    gc.collect()
```

Result: Significant performance degradation due to frequent memory checks and garbage collection
Second attempt: restructuring the results container.

```python
# Changed data structure
results = []  # Instead of results = {}
results.append((combo, result))
results_dict = dict(results)
```

Result: Added unnecessary conversion overhead without memory benefits
Final approach: keep the structures simple and clean up periodically.

```python
# Keep original dictionary structure
results = {}

# Periodic cleanup only
if i % (self.batch_size * 5) == 0:
    gc.collect()

# Factor engine pooling with minimal management
factor_engines = [FactorEngine() for _ in range(n_jobs)]

# Clean up batch results immediately
del processed_results
del valid_results
```
- Less is More
  - Minimal memory management performs better
  - Avoid frequent garbage collection
  - Keep data structures simple
- Resource Management
  - Pool and reuse resources where possible
  - Clean up resources immediately after use
  - Use periodic rather than continuous cleanup
- Performance Impact
  - Frequent garbage collection significantly slows processing
  - Data structure conversions add unnecessary overhead
  - Simple periodic cleanup provides the best balance
- Memory Management
  - Use periodic cleanup instead of continuous monitoring
  - Keep original data structures when possible
  - Clean up batch results immediately
- Resource Handling
  - Pool resources for reuse
  - Implement cleanup in finally blocks
  - Reset pooled resources between uses
- Code Structure
  - Maintain simple, direct code paths
  - Avoid unnecessary data transformations
  - Focus on essential cleanup points
- Simplified Signal Processing
  - Removed nested `np.where` conditional checks
  - Replaced with direct array operations
  - Achieved a 4x performance improvement
- Before Optimization:

```python
holdings_array = np.where(signal_array == 1, 1,
                          np.where(signal_array == -1, -1, 0))
```

- After Optimization:

```python
# Direct boolean-mask assignment instead of nested np.where
holdings_array[signal_array == 1] = 1
holdings_array[signal_array == -1] = -1
holdings_array[signal_array == 0] = 0
```
- Performance Improvement Reasons:
- Eliminated conditional check overhead
- Avoided temporary array creation
- More efficient memory access patterns
- Reduced CPU instruction count
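A quick way to sanity-check this comparison on your own hardware; a micro-benchmark sketch using `timeit`:

```python
# Micro-benchmark sketch comparing nested np.where against boolean masks.
import timeit
import numpy as np

signal_array = np.random.choice([-1, 0, 1], size=1_000_000)

def with_where():
    return np.where(signal_array == 1, 1,
                    np.where(signal_array == -1, -1, 0))

def with_masks():
    holdings = np.empty_like(signal_array)
    holdings[signal_array == 1] = 1
    holdings[signal_array == -1] = -1
    holdings[signal_array == 0] = 0
    return holdings

print("np.where:     ", timeit.timeit(with_where, number=100))
print("boolean masks:", timeit.timeit(with_masks, number=100))
```

Results vary with array size, signal distribution, and NumPy version, so it is worth measuring on data shaped like your own.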