# Alpha158 0_7 vs 0_7_beta Prediction Comparison

This directory contains a workflow for comparing Alpha158 version 0_7 (original) vs 0_7_beta (enhanced with VAE embeddings) predictions.

## Overview

The goal is to evaluate whether the beta version of Alpha158 factors produces better predictions than the original 0_7 version when used with the d033 prediction model.

## Directory Structure

```
stock_1d/d033/alpha158_beta/
├── README.md                          # This file
├── config.yaml                        # VAE model configuration
├── pipeline.py                        # Main orchestration script
├── scripts/                           # Core pipeline scripts
│   ├── generate_beta_embedding.py     # Generate VAE embeddings from beta factors
│   ├── generate_returns.py            # Generate actual returns from kline data
│   ├── fetch_predictions.py           # Fetch original predictions from DolphinDB
│   ├── predict_with_embedding.py      # Generate predictions using beta embeddings
│   └── compare_predictions.py         # Compare 0_7 vs 0_7_beta predictions
├── src/                               # Source modules
│   └── qlib_loader.py                 # Qlib data loader with configurable date range
├── config/                            # Configuration files
│   └── handler.yaml                   # Modified handler with configurable end date
└── data/                              # Data files (gitignored)
    ├── embedding_0_7_beta.parquet
    ├── predictions_beta_embedding.parquet
    ├── original_predictions_0_7.parquet
    ├── actual_returns.parquet
    ├── raw_data_*.pkl                 # Raw data before preprocessing
    └── processed_data_*.pkl           # Processed data after preprocessing
```

## Data Loading with Configurable Date Range

### handler.yaml Modification

The original `handler.yaml` uses a placeholder which always loads data until today's date.
The modified version in `config/handler.yaml` uses a placeholder that can be controlled via arguments:

```yaml
# Original (always loads until today)
load_start: &load_start
load_end: &load_end

# Modified (configurable end date)
load_start: &load_start
load_end: &load_end
```

### Using qlib_loader.py

```python
from stock_1d.d033.alpha158_beta.src.qlib_loader import (
    load_data_from_handler,
    load_data_with_proc_list,
    load_and_dump_data,
)

# Load data with configurable date range
df = load_data_from_handler(
    since_date="2019-01-01",
    end_date="2019-01-31",
    buffer_days=20,  # Extra days for diff calculations
    verbose=True,
)

# Load and apply preprocessing pipeline
df_processed = load_data_with_proc_list(
    since_date="2019-01-01",
    end_date="2019-01-31",
    proc_list_path="/path/to/proc_list.proc",
    buffer_days=20,
)

# Load and dump both raw and processed data to pickle files
raw_df, processed_df = load_and_dump_data(
    since_date="2019-01-01",
    end_date="2019-01-31",
    output_dir="data/",
    fill_con_rating_nan=True,  # Fill NaN in con_rating_strength column
    verbose=True,
)
```

### Key Features

1. **Configurable end date**: Unlike the original handler.yaml, the end date is now respected
2. **Buffer period handling**: Automatically loads extra days before `since_date` for diff calculations
3. **NaN handling**: Optional filling of NaN values in `con_rating_strength` column
4. **Dual output**: Saves both raw (before proc_list) and processed (after proc_list) data

### Processor Fixes

The `qlib_loader.py` includes fixed implementations of qlib processors that correctly handle the `::` separator column format:

- `FixedDiff` - Fixes column naming bug (creates proper `feature::col_diff` names)
- `FixedColumnRemover` - Handles `::` separator format
- `FixedRobustZScoreNorm` - Uses trained `mean_train`/`std_train` parameters from pickle
- `FixedIndusNtrlInjector` - Industry neutralization with `::` format
- Other fixed processors for the full preprocessing pipeline

All fixed processors preserve the trained parameters from the original proc_list pickle.

## Workflow

### 1. Generate Beta Embeddings

Generate VAE embeddings from the alpha158_0_7_beta factors:

```bash
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

This loads data from Parquet, applies the full feature transformation pipeline, and encodes with the VAE model.

Output: `data/embedding_0_7_beta.parquet`

### 2. Fetch Original Predictions

Fetch the original 0_7 predictions from DolphinDB:

```bash
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/original_predictions_0_7.parquet`

### 3. Generate Predictions with Beta Embeddings

Use the d033 model to generate predictions from the beta embeddings:

```bash
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/predictions_beta_embedding.parquet`

### 4. Generate Actual Returns

Generate actual returns from kline data for IC calculation:

```bash
python scripts/generate_returns.py
```

Output: `data/actual_returns.parquet`

### 5. Compare Predictions

Compare the 0_7 vs 0_7_beta predictions:

```bash
python scripts/compare_predictions.py
```

This calculates:

- Prediction correlation (Pearson and Spearman)
- Daily correlation statistics
- IC metrics (mean, std, IR)
- RankIC metrics
- Top-tier returns (top 10%)

## Quick Start

Run the full pipeline:

```bash
python pipeline.py --start-date 2019-01-01 --end-date 2020-11-30
```

Or run individual steps:

```bash
# Step 1: Generate embeddings
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 2: Fetch original predictions
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 3: Generate beta predictions
python scripts/predict_with_embedding.py

# Step 4: Generate returns
python scripts/generate_returns.py

# Step 5: Compare
python scripts/compare_predictions.py
```

## Data Dependencies

### Input Data (from Parquet)

- `/data/parquet/dataset/stg_1day_wind_alpha158_0_7_beta_1D/` - Alpha158 beta factors
- `/data/parquet/dataset/stg_1day_wind_kline_adjusted_1D/` - Market data (kline)
- `/data/parquet/dataset/stg_1day_gds_indus_flag_cc1_1D/` - Industry flags

### Models

- `/home/guofu/Workspaces/alpha/data_ops/tasks/dwm_feature_vae/model/csiallx_feature2_ntrla_flag_pnlnorm_vae4_dim32a_beta0001/module.pt` - VAE encoder
- `/home/guofu/Workspaces/alpha/data_ops/tasks/app_longsignal/model/host140_exp20_d033/module.pt` - d033 prediction model

### DolphinDB

- Table: `dfs://daily_stock_run_multicast/app_1day_multicast_longsignal_port`
- Version: `host140_exp20_d033`

## Key Metrics

The comparison script outputs:

| Metric | Description |
|--------|-------------|
| Pearson Correlation | Overall correlation between 0_7 and beta predictions |
| Spearman Correlation | Rank correlation between predictions |
| Daily Correlation | Mean and std of daily correlations |
| IC Mean | Average information coefficient |
| IC Std | Standard deviation of IC |
| IC IR | Information ratio (IC Mean / IC Std) |
| RankIC | Spearman correlation with returns |
| Top-tier Return | Average return of top 10% predictions |

## Notes

- All scripts can be run from the `alpha158_beta/` directory
- Scripts use relative paths (`../data/`) to locate data files
- The VAE model expects 341 input features after the transformation pipeline
- The d033 model expects 32-dimensional embeddings with a 40-day lookback window
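
### Metric Sketch

For readers who want to sanity-check the numbers listed under Key Metrics, the following is a minimal sketch of how the IC, RankIC, IC IR, and top-tier return could be computed with pandas. It is not the code in `compare_predictions.py`. The column names (`datetime`, `pred`, `ret`) and the helper names are assumptions chosen for illustration, and the top 10% cutoff is taken per day with a simple quantile threshold.

```python
import numpy as np
import pandas as pd


def daily_ic(df, pred_col, ret_col, rank=False, date_col="datetime"):
    """Per-day correlation between predictions and returns.

    With rank=True both columns are ranked within the day first, so the
    Pearson correlation of the ranks equals the Spearman correlation (RankIC).
    Column names are hypothetical; adapt them to the actual parquet schema.
    """
    out = {}
    for day, grp in df.groupby(date_col):
        x, y = grp[pred_col], grp[ret_col]
        if rank:
            x, y = x.rank(), y.rank()
        out[day] = x.corr(y)
    return pd.Series(out, dtype=float)


def summarize_ic(ic):
    """IC mean, IC std, and IC IR (mean / std), as defined in the table."""
    mean, std = ic.mean(), ic.std()
    return {"mean": mean, "std": std, "ir": mean / std}


def top_tier_return(df, pred_col, ret_col, top_frac=0.1, date_col="datetime"):
    """Per-day average return of the top `top_frac` share of predictions."""
    out = {}
    for day, grp in df.groupby(date_col):
        cutoff = grp[pred_col].quantile(1.0 - top_frac)
        out[day] = grp.loc[grp[pred_col] >= cutoff, ret_col].mean()
    return pd.Series(out, dtype=float)


# Synthetic demo data (not real results): 5 days x 50 instruments.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "datetime": np.repeat(pd.date_range("2019-01-01", periods=5), 50),
    "ret": rng.normal(size=250),
})
demo["pred"] = demo["ret"] + rng.normal(size=250)  # noisy signal

print(summarize_ic(daily_ic(demo, "pred", "ret")))
print(summarize_ic(daily_ic(demo, "pred", "ret", rank=True)))
print(top_tier_return(demo, "pred", "ret").mean())
```

Running the same helpers once on the 0_7 predictions and once on the 0_7_beta predictions, joined to `actual_returns.parquet` on date and instrument, would give a like-for-like comparison under these assumptions.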