# Alpha158 0_7 vs 0_7_beta Prediction Comparison

This directory contains a workflow for comparing Alpha158 version 0_7 (original) vs 0_7_beta (enhanced with VAE embeddings) predictions.

## Overview

The goal is to evaluate whether the beta version of Alpha158 factors produces better predictions than the original 0_7 version when used with the d033 prediction model.

## Directory Structure

```
stock_1d/d033/alpha158_beta/
├── README.md                    # This file
├── config.yaml                  # VAE model configuration
├── pipeline.py                  # Main orchestration script
├── scripts/                     # Core pipeline scripts
│   ├── generate_beta_embedding.py   # Generate VAE embeddings from beta factors
│   ├── generate_returns.py          # Generate actual returns from kline data
│   ├── fetch_predictions.py         # Fetch original predictions from DolphinDB
│   ├── predict_with_embedding.py    # Generate predictions using beta embeddings
│   ├── compare_predictions.py       # Compare 0_7 vs 0_7_beta predictions
│   ├── dump_polars_dataset.py       # Dump raw and processed datasets using polars pipeline
│   └── extract_qlib_params.py       # Extract RobustZScoreNorm parameters from Qlib proc_list
├── src/                         # Source modules
│   └── qlib_loader.py           # Qlib data loader with configurable date range
├── config/                      # Configuration files
│   └── handler.yaml             # Modified handler with configurable end date
├── data/                        # Data files (gitignored)
│   ├── robust_zscore_params/    # Pre-fitted normalization parameters
│   │   └── csiallx_feature2_ntrla_flag_pnlnorm/
│   │       ├── mean_train.npy
│   │       ├── std_train.npy
│   │       └── metadata.json
│   ├── embedding_0_7_beta.parquet
│   ├── predictions_beta_embedding.parquet
│   ├── original_predictions_0_7.parquet
│   ├── actual_returns.parquet
│   ├── raw_data_*.pkl           # Raw data before preprocessing
│   └── processed_data_*.pkl     # Processed data after preprocessing
└── data_polars/                 # Polars-generated datasets (gitignored)
    ├── raw_data_*.pkl
    └── processed_data_*.pkl
```

## Data Loading with Configurable Date Range

### handler.yaml Modification

The original handler.yaml uses the `<TODAY>` placeholder, which always loads data up to today's date. The modified version in `config/handler.yaml` uses `<LOAD_START>`/`<LOAD_END>` placeholders whose values can be set via arguments:

```yaml
# Original (always loads until today)
load_start: &load_start <SINCE_DATE>
load_end: &load_end <TODAY>

# Modified (configurable end date)
load_start: &load_start <LOAD_START>
load_end: &load_end <LOAD_END>
```
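Filling the placeholders can be as simple as a string substitution over the YAML template before it is parsed. A minimal sketch of that idea (the helper name `render_handler` is hypothetical, not part of the actual module):

```python
from datetime import date
from typing import Optional


def render_handler(template: str, load_start: str, load_end: Optional[str] = None) -> str:
    """Fill the <LOAD_START>/<LOAD_END> placeholders in a handler.yaml template.

    If load_end is omitted, fall back to today's date, mimicking the
    behaviour of the original <TODAY> placeholder.
    """
    if load_end is None:
        load_end = date.today().isoformat()
    return (template
            .replace("<LOAD_START>", load_start)
            .replace("<LOAD_END>", load_end))


template = "load_start: &load_start <LOAD_START>\nload_end: &load_end <LOAD_END>\n"
print(render_handler(template, "2019-01-01", "2019-01-31"))
```

The rendered text can then be fed to any YAML parser as usual.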

### Using qlib_loader.py

```python
from stock_1d.d033.alpha158_beta.src.qlib_loader import (
    load_data_from_handler,
    load_data_with_proc_list,
    load_and_dump_data,
)

# Load data with a configurable date range
df = load_data_from_handler(
    since_date="2019-01-01",
    end_date="2019-01-31",
    buffer_days=20,  # extra days for diff calculations
    verbose=True,
)

# Load and apply the preprocessing pipeline
df_processed = load_data_with_proc_list(
    since_date="2019-01-01",
    end_date="2019-01-31",
    proc_list_path="/path/to/proc_list.proc",
    buffer_days=20,
)

# Load and dump both raw and processed data to pickle files
raw_df, processed_df = load_and_dump_data(
    since_date="2019-01-01",
    end_date="2019-01-31",
    output_dir="data/",
    fill_con_rating_nan=True,  # fill NaN in the con_rating_strength column
    verbose=True,
)
```

### Key Features

1. **Configurable end date**: unlike the original handler.yaml, the end date is now respected
2. **Buffer period handling**: automatically loads extra days before `since_date` for diff calculations
3. **NaN handling**: optional filling of NaN values in the `con_rating_strength` column
4. **Dual output**: saves both raw (before proc_list) and processed (after proc_list) data
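The buffer logic amounts to shifting the start of the load window backwards before the data is trimmed to the requested range. A minimal sketch using calendar days (the real loader may count trading days instead, and `buffered_window` is an illustrative name):

```python
from datetime import date, timedelta


def buffered_window(since_date: str, end_date: str, buffer_days: int = 20):
    """Return (load_start, load_end): the window actually loaded, with the
    start extended backwards so diff-style features have enough history."""
    start = date.fromisoformat(since_date) - timedelta(days=buffer_days)
    return start.isoformat(), end_date


load_start, load_end = buffered_window("2019-01-01", "2019-01-31", buffer_days=20)
print(load_start)  # 2018-12-12
```

After the processors run, rows earlier than `since_date` are dropped so only the requested range is kept.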

## Processor Fixes

qlib_loader.py includes fixed implementations of the qlib processors that correctly handle the `::` separator column format:

- `FixedDiff` - fixes a column-naming bug (creates proper `feature::col_diff` names)
- `FixedColumnRemover` - handles the `::` separator format
- `FixedRobustZScoreNorm` - uses the trained `mean_train`/`std_train` parameters from the pickle
- `FixedIndusNtrlInjector` - industry neutralization with the `::` format
- `FixedFlagMarketInjector` - adds `market_0`, `market_1` columns based on instrument codes
- `FixedFlagSTInjector` - creates an `IsST` column from the `ST_S`, `ST_Y` flags

All fixed processors preserve the trained parameters from the original proc_list pickle.
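Most of these fixes come down to splitting column names on the `::` group separator consistently. A pure-Python sketch of the naming pattern (the function names here are illustrative, not the actual module API):

```python
def diff_name(col: str) -> str:
    """Map 'feature::close' to 'feature::close_diff', keeping the group prefix."""
    group, _, name = col.partition("::")
    return f"{group}::{name}_diff"


def remove_columns(columns, names_to_remove):
    """Drop columns whose name part (after '::') matches, regardless of group."""
    return [c for c in columns if c.partition("::")[2] not in names_to_remove]


cols = ["feature::close", "feature_flag::IsN", "feature_flag::IsZt"]
print(diff_name("feature::close"))            # feature::close_diff
print(remove_columns(cols, {"IsN", "IsZt"}))  # ['feature::close']
```

The bug the fixed processors address is precisely the failure to keep the `group::name` split intact when renaming or matching columns.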

## Polars Dataset Generation

The scripts/dump_polars_dataset.py script generates datasets using a polars-based pipeline that replicates the qlib preprocessing:

```bash
# Generate raw and processed datasets
python scripts/dump_polars_dataset.py
```

This script:

1. Loads data from Parquet files (alpha158, kline, market flags, industry flags)
2. Saves raw data (before processors) to `data_polars/raw_data_*.pkl`
3. Applies the full processor pipeline:
   - `Diff` (adds diff features)
   - `FlagMarketInjector` (adds `market_0`, `market_1`)
   - `ColumnRemover` (removes `log_size_diff`, `IsN`, `IsZt`, `IsDt`)
   - `FlagToOnehot` (converts 29 industry flags to `indus_idx`)
   - `IndusNtrlInjector` (industry neutralization)
   - `RobustZScoreNorm` (using pre-fitted qlib parameters via `from_version()`)
   - `Fillna` (fills NaN with 0)
4. Saves processed data to `data_polars/processed_data_*.pkl`

> **Note:** The FlagSTInjector step is skipped because it fails silently even in the gold-standard qlib code (see BUG_ANALYSIS_FINAL.md for details).
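Running the chain is plain sequential application: each processor's output feeds the next. A dependency-free sketch of that pattern, with toy stand-ins for the real polars processors (each exposing a `process` method, as `RobustZScoreNorm` does):

```python
from functools import reduce


class Fillna:
    """Toy processor: replace None values with 0 (stand-in for the real Fillna step)."""
    def process(self, rows):
        return [{k: (0 if v is None else v) for k, v in r.items()} for r in rows]


class Scale:
    """Toy processor: double every numeric value (stand-in for a normalization step)."""
    def process(self, rows):
        return [{k: v * 2 for k, v in r.items()} for r in rows]


def run_pipeline(rows, processors):
    """Apply processors in order, feeding each one's output to the next."""
    return reduce(lambda acc, p: p.process(acc), processors, rows)


rows = [{"close": 1.0, "volume": None}]
print(run_pipeline(rows, [Fillna(), Scale()]))  # [{'close': 2.0, 'volume': 0}]
```

Skipping a step (as with FlagSTInjector) is then just omitting it from the processor list.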

Output structure:

- Raw data: 204 columns (158 feature + 4 feature_ext + 12 feature_flag + 30 indus_flag)
- Processed data: 342 columns (316 feature + 14 feature_ext + 11 feature_flag + 1 indus_idx)
- VAE input dimension: 341 (excluding `indus_idx`)

## RobustZScoreNorm Parameter Extraction

The pipeline uses pre-fitted normalization parameters extracted from Qlib's proc_list.proc file. These parameters are stored in data/robust_zscore_params/ and can be loaded with the `RobustZScoreNorm.from_version()` method.

Extract parameters from the Qlib proc_list:

```bash
python scripts/extract_qlib_params.py --version csiallx_feature2_ntrla_flag_pnlnorm
```

This creates:

- `data/robust_zscore_params/{version}/mean_train.npy` - pre-fitted mean parameters, shape (330,)
- `data/robust_zscore_params/{version}/std_train.npy` - pre-fitted std parameters, shape (330,)
- `data/robust_zscore_params/{version}/metadata.json` - feature column names and metadata

Use in Polars processors:

```python
from cta_1d.src.processors import RobustZScoreNorm

# Load pre-fitted parameters by version name
processor = RobustZScoreNorm.from_version("csiallx_feature2_ntrla_flag_pnlnorm")

# Apply normalization to a DataFrame
df = processor.process(df)
```
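Applying the pre-fitted parameters reduces to a per-feature z-score with outlier clipping. A numpy sketch of the arithmetic (the clipping threshold of 3 and the function name are assumptions about the implementation, not confirmed details):

```python
import numpy as np


def robust_zscore(x: np.ndarray, mean: np.ndarray, std: np.ndarray,
                  clip: float = 3.0) -> np.ndarray:
    """Normalize with pre-fitted per-feature mean/std, then clip outliers.

    In the real pipeline, mean and std would come from the extracted
    mean_train.npy / std_train.npy files rather than being built inline.
    """
    z = (x - mean) / std
    return np.clip(z, -clip, clip)


mean = np.array([0.0, 10.0])
std = np.array([1.0, 2.0])
x = np.array([[5.0, 10.0]])
print(robust_zscore(x, mean, std))  # [[3. 0.]]  (first value clipped from 5 to 3)
```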

Parameter details:

- Fit period: 2013-01-01 to 2018-12-31
- Feature count: 330 (158 alpha158_ntrl + 158 alpha158_raw + 7 market_ext_ntrl + 7 market_ext_raw)
- Fields: `['feature', 'feature_ext']`

## Workflow

### 1. Generate Beta Embeddings

Generate VAE embeddings from the alpha158_0_7_beta factors:

```bash
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

This loads data from Parquet, applies the full feature transformation pipeline, and encodes with the VAE model.

Output: `data/embedding_0_7_beta.parquet`

### 2. Fetch Original Predictions

Fetch the original 0_7 predictions from DolphinDB:

```bash
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/original_predictions_0_7.parquet`

### 3. Generate Predictions with Beta Embeddings

Use the d033 model to generate predictions from the beta embeddings:

```bash
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/predictions_beta_embedding.parquet`

### 4. Generate Actual Returns

Generate actual returns from kline data for IC calculation:

```bash
python scripts/generate_returns.py
```

Output: `data/actual_returns.parquet`

### 5. Compare Predictions

Compare the 0_7 vs 0_7_beta predictions:

```bash
python scripts/compare_predictions.py
```

This calculates:

- Prediction correlation (Pearson and Spearman)
- Daily correlation statistics
- IC metrics (mean, std, IR)
- RankIC metrics
- Top-tier returns (top 10%)
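The IC metrics boil down to a per-day rank correlation between predictions and realized returns, aggregated across days. A dependency-free sketch of that calculation (a pandas/scipy version would be shorter; this one assumes no tied values for brevity):

```python
from statistics import mean, stdev


def ranks(xs):
    """Rank values (1 = smallest); assumes no ties for brevity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(a, b):
    """Spearman rank correlation via Pearson correlation on ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var = (sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return cov / var


def ic_stats(daily_preds, daily_rets):
    """RankIC mean/std/IR over per-day (predictions, returns) pairs."""
    ics = [spearman(p, r) for p, r in zip(daily_preds, daily_rets)]
    ic_mean, ic_std = mean(ics), stdev(ics)
    return ic_mean, ic_std, ic_mean / ic_std


preds = [[0.1, 0.3, 0.2], [0.5, 0.1, 0.4]]
rets = [[0.01, 0.03, 0.02], [0.00, 0.04, 0.02]]
print(ic_stats(preds, rets))  # day 1 is perfectly ranked, day 2 inverted: IC mean 0.0
```

IC (as opposed to RankIC) uses Pearson correlation on the raw values instead of ranks; the aggregation is identical.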

## Quick Start

Run the full pipeline:

```bash
python pipeline.py --start-date 2019-01-01 --end-date 2020-11-30
```

Or run individual steps:

```bash
# Step 1: Generate embeddings
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 2: Fetch original predictions
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 3: Generate beta predictions
python scripts/predict_with_embedding.py

# Step 4: Generate returns
python scripts/generate_returns.py

# Step 5: Compare
python scripts/compare_predictions.py
```

## Data Dependencies

### Input Data (from Parquet)

- `/data/parquet/dataset/stg_1day_wind_alpha158_0_7_beta_1D/` - Alpha158 beta factors
- `/data/parquet/dataset/stg_1day_wind_kline_adjusted_1D/` - market data (kline)
- `/data/parquet/dataset/stg_1day_gds_indus_flag_cc1_1D/` - industry flags

### Models

- `/home/guofu/Workspaces/alpha/data_ops/tasks/dwm_feature_vae/model/csiallx_feature2_ntrla_flag_pnlnorm_vae4_dim32a_beta0001/module.pt` - VAE encoder
- `/home/guofu/Workspaces/alpha/data_ops/tasks/app_longsignal/model/host140_exp20_d033/module.pt` - d033 prediction model

### DolphinDB

- Table: `dfs://daily_stock_run_multicast/app_1day_multicast_longsignal_port`
- Version: `host140_exp20_d033`

## Key Metrics

The comparison script outputs:

| Metric | Description |
| --- | --- |
| Pearson Correlation | Overall correlation between 0_7 and beta predictions |
| Spearman Correlation | Rank correlation between predictions |
| Daily Correlation | Mean and std of daily correlations |
| IC Mean | Average information coefficient |
| IC Std | Standard deviation of IC |
| IC IR | Information ratio (IC Mean / IC Std) |
| RankIC | Spearman correlation with returns |
| Top-tier Return | Average return of top 10% predictions |

## Notes

- All scripts can be run from the `alpha158_beta/` directory
- Scripts use relative paths (`../data/`) to locate data files
- The VAE model expects 341 input features after the transformation pipeline
- The d033 model expects 32-dimensional embeddings with a 40-day lookback window
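Since the d033 model consumes 32-dim embeddings over a 40-day lookback, each prediction input is a window of shape (40, 32) per instrument. A numpy sketch of the windowing (the function name and shapes beyond those two constants are illustrative):

```python
import numpy as np


def lookback_windows(embeddings: np.ndarray, lookback: int = 40) -> np.ndarray:
    """Stack sliding windows over the date axis for one instrument.

    embeddings: (num_days, embed_dim) embedding history.
    Returns an array of shape (num_days - lookback + 1, lookback, embed_dim).
    """
    n, d = embeddings.shape
    return np.stack([embeddings[i:i + lookback] for i in range(n - lookback + 1)])


history = np.random.randn(100, 32)  # 100 days of 32-dim embeddings
batch = lookback_windows(history, lookback=40)
print(batch.shape)  # (61, 40, 32): one window per prediction date
```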