yifei99/quant

The model should be capable of testing different factors.

Enable long/short

Some key metrics should be included in the backtest:

  • Sharpe Ratio
  • Sortino Ratio
  • Annualized return
  • Cum. Return
  • Max Drawdown
  • No. of trades throughout the period

Whenever the Sharpe ratio exceeds 2.5, we will do further analysis:

  • Equity curve of the strategy (compare with BTC price chart)
  • 3D visualization to inspect potential overfitting

Workflow

graph TD
B --> E
H --> I
H --> K
M --> I 
J --> P

subgraph Data Process
A[Download Data] --> B[Load and Process Data]
end

subgraph Factor initialization
E[Initialize Factor Engine] --> F[Define and Register Factor]
F --> G[Calculate Factor Values]
G --> H[Initialize Factor-Based Strategy]
end

subgraph Backtest
I[Run Backtest with Parameters] --> J[Evaluate Performance Metrics]
end

subgraph Optimization
K[Define Parameter Search Space] --> L[Run Parameter Optimization]
L --> M[Find Optimal Parameters]
end

subgraph Save results
P[Save Results to CSV and TXT Files] --> Q[Visualize Optimization Results]
end

Code module

graph LR
A[Data Module] 
B[Factor Engine]
C[Backtest Engine]
D[Performance Evaluation]

subgraph Evaluation
D1[Performance Metrics] --> D
D2[Visualization] --> D
end

subgraph Backtest System
C1[Strategy Base Class] --> C
C2[Trade strategy] --> C
end

subgraph Factor System
B1[Factor Base Class] --> B
B2[Custom Factors] --> B
end

subgraph Data Pipeline
A1[Exchange Data] --> A
A2[External Data] --> A
end

Data Module

classDiagram
class DataDownloader {
download_data()
process_data()
fetch_and_process_data()
}
class DataLoader {
load_data()
merge_data()
preprocess_data()
}
DataDownloader --> DataLoader

The DataDownloader class is designed to download and process cryptocurrency market data from various exchanges. Here's a comprehensive guide on how to use it.

from data.data_downloader import DataDownloader
# Initialize DataDownloader
downloader = DataDownloader(
    symbol="BTCUSDT",          # Trading pair
    interval="1d",             # Time interval
    start_date="2023-01-01",   # Start date
    end_date="2023-12-31",     # End date
    data_folder="./dataset",   # Data storage location
    data_type="spot",          # Market type
    exchange="binance"         # Exchange name
)

# Download and process data
data = downloader.fetch_and_process_data()

Parameters Explanation

  1. Required Parameters
    • symbol: Trading pair symbol (e.g., "BTCUSDT", "ETHUSDT")
    • interval: Kline interval (e.g., "1m", "5m", "1h", "1d")
    • start_date: Start date in "YYYY-MM-DD" format
    • end_date: End date in "YYYY-MM-DD" format
    • data_folder: Directory path for data storage
  2. Optional Parameters
    • data_type: Market type (default: "spot")
      • "spot": Spot market data
      • "futures": Futures market data
    • exchange: Exchange name (default: "binance")

File Storage Structure

dataset/
└── binance/
	└── BTCUSDT/
		└── spot/
			└── 1d/
				└── BTCUSDT_1d_2023-01-01_to_2023-12-31.h5
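
Once downloaded, the stored file can be read back directly. Below is a minimal sketch using pandas; the repository's DataLoader class wraps steps like this behind load_data(), merge_data(), and preprocess_data(), and the single-table HDF5 layout is an assumption:

import pandas as pd

# Read the HDF5 file written by DataDownloader (path follows the structure above)
path = "./dataset/binance/BTCUSDT/spot/1d/BTCUSDT_1d_2023-01-01_to_2023-12-31.h5"
data = pd.read_hdf(path)  # assumes the file stores a single table
print(data.head())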

Factor System

classDiagram
class FactorEngine {
register_factor()
calculate_factors()
list_factors()
}
class BaseFactor {
__init__()
calculate()
}
class CustomFactor {
__init__()
calculate()
}
BaseFactor <|-- CustomFactor
FactorEngine --> BaseFactor

Example

The provided code defines a USDT Issuance Factor as a class named USDTIssuance2Factor. This factor is part of a trading strategy framework and is designed to generate trading signals based on the daily issuance of USDT.

The USDTIssuance2Factor analyzes the issuance changes of USDT and determines whether to buy, sell, or hold a position based on predefined thresholds:

  • Long Signal (1): When the issuance exceeds the upper threshold.
  • Short Signal (-1): When the issuance falls below the lower threshold.
  • Close Signal (0): When the issuance is between the two thresholds.
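
A sketch of what such a factor could look like under the BaseFactor interface shown above; the threshold values, column name, and exact signature are illustrative assumptions, not the repository's actual code:

import numpy as np
import pandas as pd

class USDTIssuance2Factor:
    """Generates long/short/close signals from daily USDT issuance."""

    def __init__(self, upper_threshold: float, lower_threshold: float):
        self.upper_threshold = upper_threshold
        self.lower_threshold = lower_threshold

    def calculate(self, data: pd.DataFrame) -> np.ndarray:
        issuance = data["usdt_issuance"].values  # hypothetical column name
        signals = np.zeros(len(issuance))
        signals[issuance > self.upper_threshold] = 1   # long signal
        signals[issuance < self.lower_threshold] = -1  # short signal
        return signals  # 0 between thresholds: close position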

Backtest System

classDiagram
class BacktestEngine {
	execute_trade()
}
class BaseStrategy {
	generate_signals()
}
class PerformanceEvaluator {
	calculate_metrics()
}
BaseStrategy --> BacktestEngine
BacktestEngine --> PerformanceEvaluator

Trading strategy

  • Signal-Based Trading:

    • The strategy depends on signals (buy, sell, or hold) generated by a strategy class.
    • Signals are numeric values:
      • 1: Long signal.
      • -1: Short signal.
      • 0: Close position (no action).
  • Trade Execution:

    No Position:

    • Signal 1: Open long position
    • Signal -1: Open short position
    • Signal 0: No action

    Long Position:

    • Signal 1: No action
    • Signal -1: Switch to short (sell 2x position)
    • Signal 0: Close position

    Short Position:

    • Signal 1: Switch to long (buy 2x position)
    • Signal -1: No action
    • Signal 0: Close position
  • Added a slippage parameter, initialized to 0.001.

    The current logic is as follows (a code sketch follows this list):

    • Open Long Position: Buy at price * (1 + slippage)
    • Open Short Position: Sell at price * (1 - slippage)
    • Close Long Position: Sell at price * (1 - slippage)
    • Close Short Position: Buy at price * (1 + slippage)
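
A sketch of this slippage adjustment in code; the helper name is illustrative:

def execution_price(price: float, side: str, slippage: float = 0.001) -> float:
    # Buys (opening a long or closing a short) fill above the quoted price;
    # sells (opening a short or closing a long) fill below it.
    if side == "buy":
        return price * (1 + slippage)
    return price * (1 - slippage)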

Performance Metrics Explanation and Formulas

  1. Total Return

    • Description: Measures the overall return of the portfolio relative to the initial investment.

    • Formula:

      $$ \text{Total Return} = \frac{\text{Final Portfolio Value}}{\text{Initial Investment}} - 1 $$

  2. Annualized Return

  • Description: Adjusts the total return to an annualized rate, taking into account the duration of the investment.

  • Formula:

    $$ \text{Annualized Return} = \left(1 + \text{Total Return}\right)^{\frac{1}{\text{Years}}} - 1 $$

    $$ \text{Years} = \frac{\text{End Date} - \text{Start Date}}{365.25} $$

  3. Sharpe Ratio

    • Description: Measures the strategy's risk-adjusted return by comparing the excess return (return above the risk-free rate) to its volatility.
    • Formula:

    $$ \text{Sharpe Ratio} = \frac{\text{Mean(Excess Daily Returns)}}{\text{Standard Deviation of Daily Returns}} \times \sqrt{365} $$

    $$ \text{Excess Daily Returns} = \text{Daily Returns} - \frac{\text{Risk-Free Rate (0.0)}}{365} $$

  4. Sortino Ratio

    • Description: Similar to the Sharpe Ratio but focuses only on downside risk, which is more relevant for risk-averse investors.

    • Formula:

      $$ \text{Sortino Ratio} = \frac{\text{Mean(Excess Daily Returns)}}{\text{Downside Deviation}} \times \sqrt{365} $$

      $$ \text{Excess Daily Returns} = \text{Daily Returns} - \frac{\text{Risk-Free Rate (0.0)}}{365} $$

    $$ \text{Downside Deviation} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\text{Negative Returns})^2} $$

    Only negative returns (below the target return of 0.0) are considered.

  5. Maximum Drawdown

    • Description: Measures the largest peak-to-trough decline in portfolio value, representing the worst potential loss.
    • Formula:

    $$ \text{Max Drawdown} = \min(\text{Cumulative Portfolio Returns} - \text{Running Maximum}) $$

    $$ \text{Running Maximum} = \max(\text{Cumulative Portfolio Returns}) $$

  6. Number of Trades

    • Description: Counts the total number of trades executed during the backtest.

    • Formula:

      $$ \text{Number of Trades} = \text{Count(Position Changes)} $$

      • Each change in position (buy or sell) is counted as a trade.
  7. Cumulative Returns

    • Description: Tracks the portfolio's return over time relative to the initial investment.
    • Formula:

    $$ \text{Cumulative Return at Time } t = \frac{\text{Portfolio Value at } t}{\text{Initial Investment}} $$
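
For concreteness, here is a minimal sketch of how these metrics could be computed from a daily series of portfolio values and positions; the function name and inputs are illustrative, not the repository's actual API:

import numpy as np

def compute_metrics(portfolio_values: np.ndarray, positions: np.ndarray, years: float) -> dict:
    daily_returns = np.diff(portfolio_values) / portfolio_values[:-1]
    excess = daily_returns - 0.0 / 365        # risk-free rate assumed 0.0, as above

    total_return = portfolio_values[-1] / portfolio_values[0] - 1
    annualized_return = (1 + total_return) ** (1 / years) - 1
    sharpe = excess.mean() / daily_returns.std() * np.sqrt(365)

    # Downside deviation uses only negative returns (target return 0.0)
    downside = np.sqrt(np.mean(np.minimum(daily_returns, 0.0) ** 2))
    sortino = excess.mean() / downside * np.sqrt(365)

    cumulative = portfolio_values / portfolio_values[0]
    running_max = np.maximum.accumulate(cumulative)
    max_drawdown = np.min(cumulative - running_max)   # peak-to-trough, per the formula above

    n_trades = int(np.count_nonzero(np.diff(positions)))  # each position change is one trade

    return {"total_return": total_return, "annualized_return": annualized_return,
            "sharpe": sharpe, "sortino": sortino,
            "max_drawdown": max_drawdown, "n_trades": n_trades}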

Performance Optimization

The system has undergone significant performance optimization, achieving a 23x speed improvement. Here are the key optimization strategies:

1. NumPy Array Operations

  • Before: Frequent Pandas DataFrame operations
  • After: Using NumPy arrays for calculations
  • Why it's faster:
    • Lower-level implementation
    • No index overhead
    • More efficient memory access patterns
    • Direct CPU array operations

2. Vectorization

  • Before: Loop-based calculations and multiple if-else statements
  • After: Vectorized operations using np.where and array operations
  • Why it's faster:
    • Leverages CPU's SIMD (Single Instruction Multiple Data) capabilities
    • Reduces branch prediction failures
    • Allows parallel processing at CPU level
    • Minimizes Python interpreter overhead

3. Memory Management

  • Before: Frequent DataFrame updates and copies
  • After: Batch operations on NumPy arrays
  • Why it's faster:
    • Reduced memory allocations
    • Fewer data copies
    • Better cache utilization
    • Single DataFrame update at the end

4. Code Example

import numpy as np

# Before Optimization: row-by-row loop with repeated DataFrame indexing
for i in range(1, len(data)):
    if portfolio['holdings'].iloc[i - 1] == 0:
        if signals['signal'].iloc[i] == 1:
            portfolio.loc[portfolio.index[i], 'holdings'] = 1
        elif signals['signal'].iloc[i] == -1:
            portfolio.loc[portfolio.index[i], 'holdings'] = -1

# After Optimization: one vectorized pass over NumPy arrays
signal_array = signals['signal'].values
holdings_array = np.where(signal_array == 1, 1,
                 np.where(signal_array == -1, -1, 0))
portfolio['holdings'] = holdings_array

5. Key Improvements

  • Trading logic execution: 23x faster
  • Memory usage: Significantly reduced
  • Code maintainability: Improved through consistent vectorization patterns
  • Scalability: Better handling of large datasets

6. Best Practices Learned

  1. Use NumPy arrays for numerical computations whenever possible
  2. Vectorize operations instead of using loops
  3. Minimize DataFrame operations and perform them in batch
  4. Keep data in contiguous memory blocks
  5. Reduce object creation and copying
  6. Use appropriate data structures for the task

These optimizations demonstrate how proper vectorization and data structure selection can dramatically improve performance in Python data processing applications.

7. FactorEngine Optimization

The FactorEngine has been optimized with a state management system:

  • State Reset Mechanism

    • Added reset() method to clear cached factor values and signals
    • Allows reuse of FactorEngine instances across multiple tests
    • Maintains factor definitions while clearing computed results
  • Memory Management

    • Efficient reuse of engine instances reduces memory allocation
    • Prevents memory leaks during large-scale optimization
    • Minimizes garbage collection overhead
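
A minimal sketch of what this state-reset mechanism could look like; the real FactorEngine internals may differ:

class FactorEngine:
    def __init__(self):
        self.factors = {}         # registered factor definitions (kept across resets)
        self.factor_values = {}   # cached computed values (cleared on reset)
        self.signals = None       # cached signals (cleared on reset)

    def register_factor(self, name, factor):
        self.factors[name] = factor

    def reset(self):
        # Clear computed results but keep factor definitions so the same
        # engine instance can be reused across multiple tests.
        self.factor_values.clear()
        self.signals = None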

8. Strategy Optimizer Enhancements

  • Parallel Processing

    • Automatic CPU core detection and limitation
    n_jobs = min(psutil.cpu_count(), 32)  # Limit max processes
    • Optimized batch size based on CPU cores
    batch_size = max(10, n_jobs * 10)  # Dynamic batch sizing
  • Resource Management

    • Pre-allocated FactorEngine pool for each process
    • Cyclic engine reuse pattern to minimize resource consumption
    factor_engines = [FactorEngine() for _ in range(n_jobs)]
    # ... 
    factor_engines[i % len(factor_engines)]  # Cyclic usage
  • Batch Processing

    • Efficient parameter combination testing in batches
    • Reduced inter-process communication overhead
    • Optimized progress tracking with batch updates
  • Performance Monitoring

    • Real-time tracking of best Sharpe ratio
    • Elapsed time monitoring per combination
    • Batch-level progress updates
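
A simplified, single-process sketch of how these pieces fit together; run_backtest, the parameter grid, and the stub FactorEngine below are illustrative placeholders (the real optimizer dispatches batches to worker processes):

import itertools
import psutil

class FactorEngine:                      # stub standing in for the repo's engine
    def reset(self):
        pass

def run_backtest(params, engine):        # stub: would return the strategy's Sharpe ratio
    fast, slow = params
    return -abs(fast - slow) / 100.0     # dummy score for illustration

n_jobs = min(psutil.cpu_count(), 32)     # limit max processes
batch_size = max(10, n_jobs * 10)        # dynamic batch sizing
factor_engines = [FactorEngine() for _ in range(n_jobs)]  # pre-allocated engine pool

param_grid = list(itertools.product(range(5, 30, 5), range(30, 120, 10)))
best_sharpe, best_params = float("-inf"), None

for start in range(0, len(param_grid), batch_size):
    for i, params in enumerate(param_grid[start:start + batch_size]):
        engine = factor_engines[i % len(factor_engines)]  # cyclic engine reuse
        engine.reset()                                    # clear cached state between runs
        sharpe = run_backtest(params, engine)
        if sharpe > best_sharpe:                          # track the best Sharpe ratio
            best_sharpe, best_params = sharpe, params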

9. Key Improvements

  1. Memory Efficiency

    • Reduced memory allocation frequency
    • Better memory usage patterns
    • Minimized object creation/destruction cycles
  2. Processing Speed

    • Optimized parallel execution
    • Efficient resource utilization
    • Reduced system call overhead
  3. Scalability

    • Handles large parameter spaces efficiently
    • Automatic resource allocation
    • Balanced CPU utilization
  4. Reliability

    • Robust error handling
    • Process isolation
    • State management consistency

10. Strategy Optimization (strategy.py)

  • Vectorized Operations
    • Pre-allocated numpy arrays for signals
    • Batch factor value processing
    • Reduced DataFrame operations
    signals = np.zeros(len(data))
    signals[ma_1 > ma_2] = 1
    signals[ma_1 < ma_2] = -1

11. Factor Definitions (factor_definitions.py)

  • Efficient Moving Average Calculation

    • Used numpy's convolution for MA computation
    • Optimized padding for missing values
    ma = np.convolve(values, np.ones(ma_period)/ma_period, mode='valid')
    ma = np.pad(ma, (ma_period-1, 0), mode='edge')
  • Vectorized Signal Generation

    • Single-pass signal calculation
    • Eliminated loops and conditionals
    • Optimized threshold comparisons
    signals[(values > self.upper_threshold)] = 1
    signals[(values < self.lower_threshold)] = -1
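
Putting the two snippets together, a self-contained sketch of the vectorized 2-MA crossover (the price series and MA periods are illustrative):

import numpy as np

def moving_average(values: np.ndarray, period: int) -> np.ndarray:
    # Convolution-based MA; 'valid' mode drops the warm-up window,
    # and edge-padding restores the original length.
    ma = np.convolve(values, np.ones(period) / period, mode='valid')
    return np.pad(ma, (period - 1, 0), mode='edge')

prices = np.cumsum(np.random.default_rng(0).normal(size=500)) + 100.0  # synthetic series
ma_1 = moving_average(prices, 10)   # fast MA
ma_2 = moving_average(prices, 30)   # slow MA

signals = np.zeros(len(prices))
signals[ma_1 > ma_2] = 1            # long when the fast MA is above the slow MA
signals[ma_1 < ma_2] = -1           # short when the fast MA is below the slow MA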

12. Factor Engine (factor_engine.py)

  • Memory Management

    • Pre-allocated result arrays
    • Cached factor values in numpy arrays
    • Reduced DataFrame conversions
    factor_arrays = np.empty((self._data_length, len(self.factors)))
  • Computation Optimization

    • Single-pass factor calculation
    • Efficient state management
    • Optimized reset mechanism
    self.factor_values[name] = factor_arrays[:, i]

13. Example Optimization (2ma_factor_mining.py)

  • Timestamp Processing

    • Direct integer conversion
    • Avoided string operations
    • Used compact data types
    data['timestamp_start'] = (data['timestamp_start'].astype(np.int64) // 1000000000).astype(np.int32)
  • Data Selection

    • Used views instead of copies
    • Pre-defined required columns
    • Optimized memory usage
    data = data[required_columns].copy(deep=False)

Key Performance Improvements

  1. Memory Efficiency

    • Reduced memory allocations
    • Used numpy arrays instead of DataFrames where possible
    • Implemented efficient data type conversions
    • Minimized data copying
  2. Computational Speed

    • Vectorized operations throughout
    • Eliminated loops and conditionals
    • Reduced DataFrame operations
    • Optimized numerical computations
  3. Data Processing

    • Efficient timestamp handling
    • Optimized data selection
    • Reduced type conversions
    • Minimized string operations
  4. Resource Management

    • Efficient memory usage
    • Optimized state management
    • Reduced object creation
    • Better garbage collection

Strategy Optimizer Memory Management

Problem Evolution

1. Initial Memory Issue

The optimizer showed memory leaks during large-scale parameter optimization, particularly when processing multiple trading pairs and timeframes.

2. First Optimization Attempt

# Added aggressive memory management
memory_threshold = 85.0  # memory usage threshold, in percent
memory_usage = psutil.Process().memory_percent()  # returns a 0-100 percentage
if memory_usage > memory_threshold:
    gc.collect()

Result: Significant performance degradation due to frequent memory checks and garbage collection

3. Second Attempt

# Changed data structure
results = []  # Instead of results = {}
results.append((combo, result))
results_dict = dict(results)

Result: Added unnecessary conversion overhead without memory benefits

Final Solution

1. Simplified Memory Management

# Keep original dictionary structure
results = {}

# Periodic cleanup only
if i % (self.batch_size * 5) == 0:
    gc.collect()

2. Resource Pooling

# Factor engine pooling with minimal management
factor_engines = [FactorEngine() for _ in range(n_jobs)]

3. Immediate Cleanup

# Clean up batch results immediately
del processed_results
del valid_results
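
Put together, the final optimizer loop looks roughly like this; evaluate_batch and the dummy parameter batches are illustrative placeholders:

import gc

def evaluate_batch(batch):                   # placeholder for the real per-batch backtest
    return {combo: sum(combo) for combo in batch}

batches = [[(a, b) for b in range(3)] for a in range(100)]  # dummy parameter batches
results = {}                                 # keep the original dictionary structure

for i, batch in enumerate(batches):
    processed_results = evaluate_batch(batch)
    valid_results = {c: r for c, r in processed_results.items() if r is not None}
    results.update(valid_results)

    # Clean up batch intermediates immediately
    del processed_results
    del valid_results

    # Periodic cleanup only, not every iteration
    if i % 5 == 0:
        gc.collect()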

Key Learnings

  1. Less is More

    • Minimal memory management performs better
    • Avoid frequent garbage collection
    • Keep data structures simple
  2. Resource Management

    • Pool and reuse resources where possible
    • Clean up resources immediately after use
    • Use periodic rather than continuous cleanup
  3. Performance Impact

    • Frequent garbage collection significantly slows processing
    • Data structure conversions add unnecessary overhead
    • Simple periodic cleanup provides the best balance

Best Practices

  1. Memory Management

    • Use periodic cleanup instead of continuous monitoring
    • Keep original data structures when possible
    • Clean up batch results immediately
  2. Resource Handling

    • Pool resources for reuse
    • Implement cleanup in finally blocks
    • Reset pooled resources between uses
  3. Code Structure

    • Maintain simple, direct code paths
    • Avoid unnecessary data transformations
    • Focus on essential cleanup points

14. Trading Logic Optimization (trading_logic.py)

  • Simplified Signal Processing

    • Removed np.where conditional checks
    • Replaced with direct array operations
    • Achieved 4x performance improvement
  • Before Optimization:

holdings_array = np.where(signal_array == 1, 1,
                 np.where(signal_array == -1, -1, 0))
  • After Optimization:
# Direct array operations instead of np.where
holdings_array[signal_array == 1] = 1
holdings_array[signal_array == -1] = -1
holdings_array[signal_array == 0] = 0
  • Performance Improvement Reasons:
    • Eliminated conditional check overhead
    • Avoided temporary array creation
    • More efficient memory access patterns
    • Reduced CPU instruction count
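
The claimed speedup can be sanity-checked with a micro-benchmark along these lines; exact ratios will vary with array size and hardware:

import timeit
import numpy as np

signal_array = np.random.default_rng(0).choice([-1, 0, 1], size=1_000_000)

def with_where():
    return np.where(signal_array == 1, 1,
           np.where(signal_array == -1, -1, 0))

def with_masks():
    holdings = np.zeros_like(signal_array)   # starts at 0, so no explicit 0-assignment needed
    holdings[signal_array == 1] = 1
    holdings[signal_array == -1] = -1
    return holdings

print("np.where:", timeit.timeit(with_where, number=100))
print("masks:   ", timeit.timeit(with_masks, number=100))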
