This directory contains a workflow for comparing Alpha158 version 0_7 (original) vs 0_7_beta (enhanced with VAE embeddings) predictions.
## Overview
The goal is to evaluate whether the beta version of Alpha158 factors produces better predictions than the original 0_7 version when used with the d033 prediction model.
## Directory Structure
```
stock_1d/d033/alpha158_beta/
├── README.md # This file
├── config.yaml # VAE model configuration
├── pipeline.py # Main orchestration script
├── scripts/ # Core pipeline scripts
│ ├── generate_beta_embedding.py # Generate VAE embeddings from beta factors
│ ├── generate_returns.py # Generate actual returns from kline data
│ ├── fetch_predictions.py # Fetch original predictions from DolphinDB
│ ├── predict_with_embedding.py # Generate predictions using beta embeddings
```
The original `handler.yaml` uses the `<TODAY>` placeholder, which always loads data up to today's date. The modified version in `config/handler.yaml` uses a `<LOAD_END>` placeholder that can be controlled via arguments:
```yaml
# Original (always loads until today)
load_start: &load_start <SINCE_DATE>
load_end: &load_end <TODAY>
# Modified (configurable end date)
load_start: &load_start <LOAD_START>
load_end: &load_end <LOAD_END>
```
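As a rough illustration of how such placeholders might be filled in, the sketch below substitutes `<LOAD_START>` and `<LOAD_END>` in a template string. The function name `render_handler_template` and the use of plain string replacement are assumptions made for this example; the actual substitution logic in `qlib_loader.py` may differ.

```python
from datetime import date

# Hypothetical template mirroring the modified handler.yaml fragment.
HANDLER_TEMPLATE = """\
load_start: &load_start <LOAD_START>
load_end: &load_end <LOAD_END>
"""


def render_handler_template(template: str, load_start: date, load_end: date) -> str:
    """Replace the date placeholders with ISO-formatted dates (illustrative only)."""
    if load_start > load_end:
        raise ValueError("load_start must not be after load_end")
    return (
        template
        .replace("<LOAD_START>", load_start.isoformat())
        .replace("<LOAD_END>", load_end.isoformat())
    )


rendered = render_handler_template(HANDLER_TEMPLATE, date(2019, 1, 1), date(2019, 1, 31))
print(rendered)
```

Validating the date order before substitution keeps an inverted range from reaching the data loader.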
### Using qlib_loader.py
```python
from stock_1d.d033.alpha158_beta.src.qlib_loader import (
    load_data_from_handler,
    load_data_with_proc_list,
    load_and_dump_data,
)

# Load data with configurable date range
df = load_data_from_handler(
    since_date="2019-01-01",
    end_date="2019-01-31",
    buffer_days=20,  # Extra days for diff calculations
    verbose=True,
)

# Load and apply preprocessing pipeline
df_processed = load_data_with_proc_list(
    since_date="2019-01-01",
    end_date="2019-01-31",
    proc_list_path="/path/to/proc_list.proc",
    buffer_days=20,
)

# Load and dump both raw and processed data to pickle files
raw_df, processed_df = load_and_dump_data(
    since_date="2019-01-01",
    end_date="2019-01-31",
    output_dir="data/",
    fill_con_rating_nan=True,  # Fill NaN in con_rating_strength column
    verbose=True,
)
```
### Key Features
1. **Configurable end date**: Unlike the original `handler.yaml`, the end date is now respected.
2. **Buffer period handling**: Automatically loads extra days before `since_date` for diff calculations.
3. **NaN handling**: Optional filling of NaN values in the `con_rating_strength` column.
4. **Dual output**: Saves both raw (before proc_list) and processed (after proc_list) data.
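
The buffer period idea can be sketched as follows: load from an earlier start date, compute the diffs over the full loaded range, then trim the result back to the requested window. The helper names `buffered_start`, `daily_diffs`, and `trim_to_range` are hypothetical, and treating `buffer_days` as calendar days (rather than trading days) is an assumption; the loader itself may behave differently.

```python
from datetime import date, timedelta


def buffered_start(since_date: str, buffer_days: int) -> date:
    """Return the date to start loading from (assumes calendar days)."""
    if buffer_days < 0:
        raise ValueError("buffer_days must be non-negative")
    return date.fromisoformat(since_date) - timedelta(days=buffer_days)


def daily_diffs(rows):
    """Compute value[t] - value[t-1] for consecutive (date, value) rows."""
    return [(d1, v1 - v0) for (_, v0), (d1, v1) in zip(rows, rows[1:])]


def trim_to_range(rows, since_date: str, end_date: str):
    """Keep only rows whose date lies inside [since_date, end_date]."""
    start, end = date.fromisoformat(since_date), date.fromisoformat(end_date)
    return [(d, v) for d, v in rows if start <= d <= end]


load_start = buffered_start("2019-01-01", 20)

# Toy rows: the first one lies in the buffer period before since_date.
rows = [
    (date(2018, 12, 20), 10.0),
    (date(2019, 1, 2), 10.5),
    (date(2019, 1, 3), 10.25),
]
diffs = trim_to_range(daily_diffs(rows), "2019-01-01", "2019-01-31")
print(load_start, diffs)
```

Without the buffered row, the diff for the first in-range date would have no predecessor and would be missing.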
### Processor Fixes
`qlib_loader.py` includes fixed implementations of qlib processors that correctly handle column names in the `::` separator format.
**Note**: The `FlagSTInjector` step is skipped because it fails silently even in the gold-standard qlib code (see `BUG_ANALYSIS_FINAL.md` for details).
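
To make the `::` separator format concrete, here is a minimal sketch that assumes flattened column names of the form `group::field`. The example column names and the helpers `split_column` and `columns_in_group` are hypothetical and are not taken from `qlib_loader.py`.

```python
def split_column(name: str, sep: str = "::") -> tuple[str, str]:
    """Split 'group::field' into (group, field); ungrouped names get an empty group."""
    group, found, field = name.partition(sep)
    if not found:
        return ("", name)
    return (group, field)


def columns_in_group(columns, group: str):
    """Select the columns whose group part matches exactly."""
    return [c for c in columns if split_column(c)[0] == group]


# Hypothetical column names, used only for illustration.
columns = ["feature::KMID", "feature_ext::turnover", "label::LABEL0", "instrument"]
selected = columns_in_group(columns, "feature")
print(selected)
```

Matching on the exact group part, rather than on a string prefix, avoids picking up `feature_ext::turnover` when only the `feature` group is wanted.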
The pipeline uses pre-fitted normalization parameters extracted from Qlib's `proc_list.proc` file. These parameters are stored in `data/robust_zscore_params/` and can be loaded using the `RobustZScoreNorm.from_version()` method.
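
For intuition, the sketch below applies pre-fitted robust z-score parameters to a list of values. It uses the common formulation (median as the center, median absolute deviation scaled by 1.4826 as the spread, clipping to [-3, 3]). Whether the stored parameters take exactly this form is an assumption, and the function names `fit_robust_params` and `apply_robust_zscore` are illustrative; this is not the signature of `RobustZScoreNorm.from_version()`.

```python
from statistics import median

MAD_SCALE = 1.4826  # makes the MAD comparable to a standard deviation for normal data
EPS = 1e-12         # guards against division by zero for constant columns


def fit_robust_params(values):
    """Fit (center, scale) once, e.g. on a training window."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return center, mad * MAD_SCALE + EPS


def apply_robust_zscore(values, center, scale, clip=3.0):
    """Normalize with pre-fitted parameters and clip outliers to [-clip, clip]."""
    return [max(-clip, min(clip, (v - center) / scale)) for v in values]


train = [1.0, 2.0, 3.0, 4.0, 100.0]
center, scale = fit_robust_params(train)
normalized = apply_robust_zscore(train, center, scale)
print(center, scale, normalized)
```

Fitting once and reusing the stored parameters means new data is normalized the same way as the data the model was trained on.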