# Alpha158 0_7 vs 0_7_beta Prediction Comparison

This directory contains a workflow for comparing Alpha158 version 0_7 (original) vs 0_7_beta (enhanced with VAE embeddings) predictions.

## Overview

The goal is to evaluate whether the beta version of the Alpha158 factors produces better predictions than the original 0_7 version when used with the d033 prediction model.

## Directory Structure

```
stock_1d/d033/alpha158_beta/
├── README.md                        # This file
├── config.yaml                      # VAE model configuration
├── pipeline.py                      # Main orchestration script
├── scripts/                         # Core pipeline scripts
│   ├── generate_beta_embedding.py   # Generate VAE embeddings from beta factors
│   ├── generate_returns.py          # Generate actual returns from kline data
│   ├── fetch_predictions.py         # Fetch original predictions from DolphinDB
│   ├── predict_with_embedding.py    # Generate predictions using beta embeddings
│   ├── compare_predictions.py       # Compare 0_7 vs 0_7_beta predictions
│   ├── dump_polars_dataset.py       # Dump raw and processed datasets using polars pipeline
│   └── extract_qlib_params.py       # Extract RobustZScoreNorm parameters from Qlib proc_list
├── src/                             # Source modules
│   └── qlib_loader.py               # Qlib data loader with configurable date range
├── config/                          # Configuration files
│   └── handler.yaml                 # Modified handler with configurable end date
├── data/                            # Data files (gitignored)
│   ├── robust_zscore_params/        # Pre-fitted normalization parameters
│   │   └── csiallx_feature2_ntrla_flag_pnlnorm/
│   │       ├── mean_train.npy
│   │       ├── std_train.npy
│   │       └── metadata.json
│   ├── embedding_0_7_beta.parquet
│   ├── predictions_beta_embedding.parquet
│   ├── original_predictions_0_7.parquet
│   ├── actual_returns.parquet
│   ├── raw_data_*.pkl               # Raw data before preprocessing
│   └── processed_data_*.pkl         # Processed data after preprocessing
└── data_polars/                     # Polars-generated datasets (gitignored)
    ├── raw_data_*.pkl
    └── processed_data_*.pkl
```

## Data Loading with Configurable Date Range

### handler.yaml Modification

The original `handler.yaml` uses a `` placeholder
which always loads data until today's date. The modified version in `config/handler.yaml` uses a `` placeholder whose value can be controlled via arguments:

```yaml
# Original (always loads until today)
load_start: &load_start
load_end: &load_end

# Modified (configurable end date)
load_start: &load_start
load_end: &load_end
```

### Using qlib_loader.py

```python
from stock_1d.d033.alpha158_beta.src.qlib_loader import (
    load_data_from_handler,
    load_data_with_proc_list,
    load_and_dump_data,
)

# Load data with a configurable date range
df = load_data_from_handler(
    since_date="2019-01-01",
    end_date="2019-01-31",
    buffer_days=20,  # Extra days for diff calculations
    verbose=True,
)

# Load and apply the preprocessing pipeline
df_processed = load_data_with_proc_list(
    since_date="2019-01-01",
    end_date="2019-01-31",
    proc_list_path="/path/to/proc_list.proc",
    buffer_days=20,
)

# Load and dump both raw and processed data to pickle files
raw_df, processed_df = load_and_dump_data(
    since_date="2019-01-01",
    end_date="2019-01-31",
    output_dir="data/",
    fill_con_rating_nan=True,  # Fill NaN in the con_rating_strength column
    verbose=True,
)
```

### Key Features

1. **Configurable end date**: Unlike the original `handler.yaml`, the end date is now respected.
2. **Buffer period handling**: Automatically loads extra days before `since_date` for diff calculations.
3. **NaN handling**: Optional filling of NaN values in the `con_rating_strength` column.
4. **Dual output**: Saves both raw (before proc_list) and processed (after proc_list) data.

### Processor Fixes

The `qlib_loader.py` module includes fixed implementations of qlib processors that correctly handle the `::` separator column format:

- `FixedDiff` - Fixes column naming bug (creates proper `feature::col_diff` names)
- `FixedColumnRemover` - Handles `::` separator format
- `FixedRobustZScoreNorm` - Uses trained `mean_train`/`std_train` parameters from pickle
- `FixedIndusNtrlInjector` - Industry neutralization with `::` format
- `FixedFlagMarketInjector` - Adds `market_0`, `market_1` columns based on instrument codes
- `FixedFlagSTInjector` - Creates `IsST` column from `ST_S`, `ST_Y` flags

All fixed processors preserve the trained parameters from the original proc_list pickle.

### Polars Dataset Generation

The `scripts/dump_features.py` script generates datasets using a polars-based pipeline that replicates the qlib preprocessing:

```bash
# Generate merged features (flat columns)
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups merged

# Generate with struct columns (packed feature groups)
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups merged --pack-struct

# Generate specific feature groups
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups alpha158 market_ext
```

This script:

1. Loads data from Parquet files (alpha158, kline, market flags, industry flags)
2. Applies the full processor pipeline:
   - Diff processor (adds diff features)
   - FlagMarketInjector (adds market_0, market_1)
   - ColumnRemover (removes log_size_diff, IsN, IsZt, IsDt)
   - FlagToOnehot (converts 29 industry flags to indus_idx)
   - IndusNtrlInjector (industry neutralization)
   - RobustZScoreNorm (using pre-fitted qlib parameters via `from_version()`)
   - Fillna (fill NaN with 0)
3. Saves to parquet/pickle format

**Output modes:**

- **Flat mode (default)**: All columns as separate fields (348 columns for merged)
- **Struct mode (`--pack-struct`)**: Feature groups packed into struct columns:
  - `features_alpha158` (316 fields)
  - `features_market_ext` (14 fields)
  - `features_market_flag` (11 fields)

**Note**: The `FlagSTInjector` step is skipped because it fails silently even in the gold-standard qlib code (see `BUG_ANALYSIS_FINAL.md` for details).

Output structure:

- Raw data: 204 columns (158 feature + 4 feature_ext + 12 feature_flag + 30 indus_flag)
- Processed data: 348 columns (318 alpha158 + 14 market_ext + 14 market_flag + 2 index)
- VAE input dimension: 341 (excluding indus_idx)

### RobustZScoreNorm Parameter Extraction

The pipeline uses pre-fitted normalization parameters extracted from Qlib's `proc_list.proc` file. These parameters are stored in `data/robust_zscore_params/` and can be loaded with the `RobustZScoreNorm.from_version()` method.

**Extract parameters from the Qlib proc_list:**

```bash
python scripts/extract_qlib_params.py --version csiallx_feature2_ntrla_flag_pnlnorm
```

This creates:

- `data/robust_zscore_params/{version}/mean_train.npy` - Pre-fitted mean parameters, shape (330,)
- `data/robust_zscore_params/{version}/std_train.npy` - Pre-fitted std parameters, shape (330,)
- `data/robust_zscore_params/{version}/metadata.json` - Feature column names and metadata

**Use in Polars processors:**

```python
from cta_1d.src.processors import RobustZScoreNorm

# Load pre-fitted parameters by version name
processor = RobustZScoreNorm.from_version("csiallx_feature2_ntrla_flag_pnlnorm")

# Apply normalization to a DataFrame
df = processor.process(df)
```

**Parameter details:**

- Fit period: 2013-01-01 to 2018-12-31
- Feature count: 330 (158 alpha158_ntrl + 158 alpha158_raw + 7 market_ext_ntrl + 7 market_ext_raw)
- Fields: `['feature', 'feature_ext']`
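As background, the robust z-score itself can be sketched in NumPy. This is only an illustration of the convention Qlib-style RobustZScoreNorm normalization follows (median as the "mean", 1.4826 × MAD as the "std", outliers clipped to ±3), not the project's `cta_1d.src.processors` implementation:

```python
import numpy as np

def fit_robust_zscore(train: np.ndarray):
    """Fit per-feature robust location/scale on the training period.

    Mirrors the idea behind mean_train.npy / std_train.npy: the "mean"
    is the column median and the "std" is 1.4826 * MAD, which makes the
    scale estimate consistent with a normal distribution's std.
    """
    mean_train = np.nanmedian(train, axis=0)
    mad = np.nanmedian(np.abs(train - mean_train), axis=0)
    std_train = mad * 1.4826 + 1e-12  # epsilon avoids division by zero
    return mean_train, std_train

def apply_robust_zscore(x: np.ndarray, mean_train, std_train, clip: float = 3.0):
    """Normalize with pre-fitted parameters and clip outliers to +/- clip."""
    z = (x - mean_train) / std_train
    return np.clip(z, -clip, clip)
```

Fitting once on 2013-2018 data and then reusing the saved parameters at inference time is exactly why the `.npy` files above are versioned rather than refitted per run.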
## Workflow

### 1. Generate Beta Embeddings

Generate VAE embeddings from the alpha158_0_7_beta factors:

```bash
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

This loads data from Parquet, applies the full feature transformation pipeline, and encodes with the VAE model.

Output: `data/embedding_0_7_beta.parquet`

### 2. Fetch Original Predictions

Fetch the original 0_7 predictions from DolphinDB:

```bash
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/original_predictions_0_7.parquet`

### 3. Generate Predictions with Beta Embeddings

Use the d033 model to generate predictions from the beta embeddings:

```bash
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/predictions_beta_embedding.parquet`

### 4. Generate Actual Returns

Generate actual returns from kline data for the IC calculation:

```bash
python scripts/generate_returns.py
```

Output: `data/actual_returns.parquet`
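The forward-return definition needed for IC can be sketched in pandas. This is a hypothetical version, not `generate_returns.py` itself; the column names `instrument`, `datetime`, `close` and the one-day horizon are assumptions:

```python
import pandas as pd

def forward_returns(kline: pd.DataFrame, horizon: int = 1) -> pd.DataFrame:
    """Compute per-instrument forward returns from adjusted closes.

    The return attached to date t is close[t + horizon] / close[t] - 1,
    so it aligns with predictions made at date t.
    """
    kline = kline.sort_values(["instrument", "datetime"])
    fwd_close = kline.groupby("instrument")["close"].shift(-horizon)
    out = kline[["instrument", "datetime"]].copy()
    out["return"] = fwd_close / kline["close"] - 1.0
    # The last `horizon` rows of each instrument have no forward close
    return out.dropna(subset=["return"])
```

Whatever the real script does, the key property to preserve is alignment: the return row must carry the prediction date, not the realization date, or every IC in step 5 is computed against shifted data.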
### 5. Compare Predictions

Compare the 0_7 vs 0_7_beta predictions:

```bash
python scripts/compare_predictions.py
```

This calculates:

- Prediction correlation (Pearson and Spearman)
- Daily correlation statistics
- IC metrics (mean, std, IR)
- RankIC metrics
- Top-tier returns (top 10%)

## Quick Start

Run the full pipeline:

```bash
python pipeline.py --start-date 2019-01-01 --end-date 2020-11-30
```

Or run individual steps:

```bash
# Step 1: Generate embeddings
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 2: Fetch original predictions
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 3: Generate beta predictions
python scripts/predict_with_embedding.py

# Step 4: Generate returns
python scripts/generate_returns.py

# Step 5: Compare
python scripts/compare_predictions.py
```

## Data Dependencies

### Input Data (from Parquet)

- `/data/parquet/dataset/stg_1day_wind_alpha158_0_7_beta_1D/` - Alpha158 beta factors
- `/data/parquet/dataset/stg_1day_wind_kline_adjusted_1D/` - Market data (kline)
- `/data/parquet/dataset/stg_1day_gds_indus_flag_cc1_1D/` - Industry flags

### Models

- `/home/guofu/Workspaces/alpha/data_ops/tasks/dwm_feature_vae/model/csiallx_feature2_ntrla_flag_pnlnorm_vae4_dim32a_beta0001/module.pt` - VAE encoder
- `/home/guofu/Workspaces/alpha/data_ops/tasks/app_longsignal/model/host140_exp20_d033/module.pt` - d033 prediction model

### DolphinDB

- Table: `dfs://daily_stock_run_multicast/app_1day_multicast_longsignal_port`
- Version: `host140_exp20_d033`

## Key Metrics

The comparison script outputs:

| Metric | Description |
|--------|-------------|
| Pearson Correlation | Overall correlation between 0_7 and beta predictions |
| Spearman Correlation | Rank correlation between predictions |
| Daily Correlation | Mean and std of daily correlations |
| IC Mean | Average information coefficient |
| IC Std | Standard deviation of IC |
| IC IR | Information ratio (IC Mean / IC Std) |
| RankIC | Spearman correlation with returns |
| Top-tier Return | Average return of the top 10% of predictions |

## Notes

- All scripts can be run from the `alpha158_beta/` directory.
- Scripts use relative paths (`../data/`) to locate data files.
- The VAE model expects 341 input features after the transformation pipeline.
- The d033 model expects 32-dimensional embeddings with a 40-day lookback window.
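The IC-style metrics in the Key Metrics table can be reproduced in a few lines of pandas. This is a minimal sketch, not the logic of `compare_predictions.py`; the column names `datetime`, `pred`, and `return` are assumptions:

```python
import pandas as pd

def ic_metrics(df: pd.DataFrame) -> dict:
    """Daily IC / RankIC summary for one prediction column vs realized returns.

    Expects one row per (datetime, instrument) with columns
    'datetime', 'pred', 'return'.
    """
    by_day = df.groupby("datetime")
    # Per-day cross-sectional correlations between predictions and returns
    ic = by_day.apply(lambda g: g["pred"].corr(g["return"]))
    rank_ic = by_day.apply(lambda g: g["pred"].corr(g["return"], method="spearman"))
    # Mean realized return of the top 10% of predictions each day
    top = by_day.apply(
        lambda g: g.nlargest(max(1, len(g) // 10), "pred")["return"].mean()
    )
    return {
        "ic_mean": ic.mean(),
        "ic_std": ic.std(),
        "ic_ir": ic.mean() / ic.std() if ic.std() > 0 else float("nan"),
        "rank_ic_mean": rank_ic.mean(),
        "top_tier_return": top.mean(),
    }
```

Computing the correlations per day and then averaging (rather than pooling all days) is what makes IC IR meaningful: the day-to-day std in the denominator measures signal stability over time.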