Data Pipeline Bug Analysis - Final Status

Summary

After fixing all identified bugs, the feature count now matches the VAE input dimension (341), but the resulting embeddings remain uncorrelated with the 0_7 version in the database.

Latest Version: v6

  • Feature count: 341 ✓ (matches VAE input dim)
  • Mean correlation with DB: 0.0050 (essentially zero)
  • Status: All identified bugs fixed, IsST issue documented
  • New: Polars-based dataset generation script added (scripts/dump_polars_dataset.py)

Bugs Fixed

1. Market Classification (FlagMarketInjector) ✓ FIXED

  • Bug: Used instrument >= 600000, which misclassified NEEQ (New Third Board) instruments
  • Fix: Use string prefix matching with vocab_size=2 (not 3)
  • Impact: 167 instruments corrected
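A minimal sketch of prefix-based classification, assuming instrument codes are strings that carry an exchange prefix (for example SH600000). The prefix lists are the ones given for FixedFlagMarketInjector in this document; the function names, the sample code BJ830799, and the choice to encode unrecognized codes as (0, 0) are hypothetical and not taken from the pipeline.

```python
from typing import Optional, Tuple

# Prefixes as listed for FixedFlagMarketInjector in this document.
MARKET_0_PREFIXES = ("SH60", "SZ00")
MARKET_1_PREFIXES = ("SH688", "SH689", "SZ300", "SZ301")


def classify_market(instrument: str) -> Optional[int]:
    """Return 0 or 1 for a recognized board, or None for anything else."""
    code = instrument.upper()
    # str.startswith accepts a tuple and matches any of its entries.
    if code.startswith(MARKET_1_PREFIXES):
        return 1
    if code.startswith(MARKET_0_PREFIXES):
        return 0
    return None


def one_hot(instrument: str) -> Tuple[int, int]:
    """Encode the market as (market_0, market_1) with vocab_size=2.

    Assumption: codes outside both prefix lists get (0, 0).
    """
    market = classify_market(instrument)
    return (int(market == 0), int(market == 1))


print(one_hot("SH600000"))  # (1, 0)
print(one_hot("SZ300750"))  # (0, 1)
print(one_hot("BJ830799"))  # (0, 0), hypothetical code outside both lists
```

Unlike the numeric test, a code such as 830799 is not swept into the Shanghai bucket merely because its number is at least 600000.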

2. ColumnRemover Missing IsN ✓ FIXED

  • Bug: Only removed IsZt, IsDt but not IsN
  • Fix: Added IsN to removal list
  • Impact: Feature count alignment

3. RobustZScoreNorm Scope ✓ FIXED

  • Bug: Applied normalization to all 341 features
  • Fix: Normalize only 330 features (alpha158 and market_ext, each in original and neutralized form)
  • Impact: Correct normalization scope
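The scoping rule can be sketched as follows, assuming the 330 normalized features occupy the first 330 positions of each 341-feature row and that per-feature center and scale values were computed beforehand. The function name, the parameter names, and the clipping bound of 3.0 are assumptions for illustration, not values read from the pipeline.

```python
from typing import List, Sequence

N_NORMALIZED = 330  # alpha158 + market_ext, original and neutralized
CLIP_BOUND = 3.0    # assumed outlier clipping bound (hypothetical here)


def normalize_row(
    row: Sequence[float],
    center: Sequence[float],
    scale: Sequence[float],
    clip: float = CLIP_BOUND,
) -> List[float]:
    """Z-score the first 330 features and pass the remaining ones through."""
    if len(center) != N_NORMALIZED or len(scale) != N_NORMALIZED:
        raise ValueError("expected 330 center and 330 scale parameters")
    if len(row) < N_NORMALIZED:
        raise ValueError("row has fewer than 330 features")
    normalized = [
        max(-clip, min(clip, (x - c) / s))
        for x, c, s in zip(row[:N_NORMALIZED], center, scale)
    ]
    # Flag columns after position 330 are left untouched.
    return normalized + list(row[N_NORMALIZED:])


row = [100.0] + [5.0] * 329 + [1.0] * 11  # 341 features, last 11 are flags
out = normalize_row(row, center=[4.0] * 330, scale=[2.0] * 330)
print(len(out), out[0], out[1], out[-1])  # 341 3.0 0.5 1.0
```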

4. Wrong Data Sources for Market Flags ✓ FIXED

  • Bug: Used Limit, Stopping (Float64) from kline_adjusted
  • Fix: Load from correct sources:
    • kline_adjusted: IsZt, IsDt, IsN, IsXD, IsXR, IsDR (Boolean)
    • market_flag: open_limit, close_limit, low_limit, high_stop (Boolean, 4 cols)
  • Impact: Correct boolean flag data

5. Feature Count Mismatch ✓ FIXED

  • Bug: 344 features (3 extra)
  • Fix: vocab_size=2 + 4 market_flag cols = 341 features
  • Impact: VAE input dimension matches

6. Fixed* Processors Not Adding Required Columns ✓ FIXED

  • Bug: FixedFlagMarketInjector only converted dtype but didn't add market_0, market_1 columns
  • Bug: FixedFlagSTInjector only converted dtype but didn't create IsST column from ST_S, ST_Y
  • Fix:
    • FixedFlagMarketInjector: Now adds market_0 (SH60xxx, SZ00xxx) and market_1 (SH688xxx, SH689xxx, SZ300xxx, SZ301xxx)
    • FixedFlagSTInjector: Now creates IsST = ST_S | ST_Y
  • Impact: Processed data now has 408 columns (was 405), matching original qlib output

Important Discovery: IsST Column Issue in Gold-Standard Code

Problem Description

The FlagSTInjector processor in the original qlib proc_list is supposed to create an IsST column in the feature_flag group from the ST_S and ST_Y columns in the st_flag group. However, this processor fails silently even in the gold-standard qlib code.

Root Cause

The FlagSTInjector processor attempts to access columns using a format that doesn't match the actual column structure in the data:

  1. Expected format: The processor expects columns like st_flag::ST_S and st_flag::ST_Y (string format with :: separator)
  2. Actual format: The qlib handler produces MultiIndex tuple columns like ('st_flag', 'ST_S') and ('st_flag', 'ST_Y')

This format mismatch causes the processor to fail to find the ST flag columns, and thus no IsST column is created.
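The mismatch is easy to reproduce without qlib. The sketch below uses a plain list of tuples to stand in for the MultiIndex columns; `find_column` is a hypothetical helper that accepts either format, not part of qlib or of this pipeline.

```python
from typing import Hashable, Optional, Sequence


def find_column(
    columns: Sequence[Hashable], group: str, name: str, sep: str = "::"
) -> Optional[Hashable]:
    """Return the key matching (group, name) in tuple or string form, else None."""
    for candidate in ((group, name), f"{group}{sep}{name}"):
        if candidate in columns:
            return candidate
    return None


# Tuple-style columns, as produced by the qlib handler per this document.
tuple_columns = [("st_flag", "ST_S"), ("st_flag", "ST_Y"), ("feature_flag", "IsN")]

# A lookup by the "::" string form finds nothing and raises no error.
print("st_flag::ST_S" in tuple_columns)               # False
print(find_column(tuple_columns, "st_flag", "ST_S"))  # ('st_flag', 'ST_S')
```

A membership test that returns False instead of raising is what lets a processor skip its work without any visible failure.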

Evidence

# Check proc_list
import pickle as pkl
with open('proc_list.proc', 'rb') as f:
    proc_list = pkl.load(f)

# FlagSTInjector config
flag_st = proc_list[2]
print(f"fields_group: {flag_st.fields_group}")  # 'feature_flag'
print(f"col_name: {flag_st.col_name}")  # 'IsST'
print(f"st_group: {flag_st.st_group}")  # 'st_flag'

# Check if IsST exists in processed data
with open('processed_data.pkl', 'rb') as f:
    df = pkl.load(f)

feature_flag_cols = [c[1] for c in df.columns if c[0] == 'feature_flag']
print('IsST' in feature_flag_cols)  # False!

Impact

  • VAE training: The VAE model was trained on data without the IsST column
  • VAE input dimension: 341 features (excluding IsST), not 342
  • Polars pipeline: Should also skip IsST to maintain compatibility

Resolution

The polars-based pipeline (dump_polars_dataset.py) now correctly skips the FlagSTInjector step to match the gold-standard behavior:

# Step 3: FlagSTInjector - SKIPPED (fails even in gold-standard)
print("[3] Skipping FlagSTInjector (as per gold-standard behavior)...")
market_flag_with_st = market_flag_with_market  # No IsST added

Lessons Learned

  1. Verify processor execution: Don't assume all processors in the proc_list executed successfully. Check the output data to verify expected columns exist.

  2. Column format matters: The qlib processors were designed for specific column formats (MultiIndex tuples vs :: separator strings). Format mismatches can cause silent failures.

  3. Match the gold-standard bugs: When replicating a pipeline, sometimes you need to replicate the bugs too. The VAE was trained on data without IsST, so our pipeline must also exclude it.

  4. Debug by comparing intermediate outputs: Use scripts like debug_data_divergence.py to compare raw and processed data between the gold-standard and polars pipelines.
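The first lesson can be turned into a small guard that runs after each processor. This is a sketch under the assumption that columns are (group, name) tuples; the helper names are hypothetical.

```python
from typing import Hashable, Iterable, List


def missing_columns(
    columns: Iterable[Hashable], expected: Iterable[Hashable]
) -> List[Hashable]:
    """List the expected columns that are absent from the output."""
    present = set(columns)
    return [col for col in expected if col not in present]


def check_processor_output(
    columns: Iterable[Hashable], expected: Iterable[Hashable], processor: str
) -> None:
    """Raise instead of letting a processor fail silently."""
    missing = missing_columns(columns, expected)
    if missing:
        raise RuntimeError(f"{processor} did not produce: {missing}")


output_columns = [("feature_flag", "IsN"), ("st_flag", "ST_S"), ("st_flag", "ST_Y")]
print(missing_columns(output_columns, [("feature_flag", "IsST")]))
# [('feature_flag', 'IsST')]
```

When replicating the gold-standard behavior, the same check can be inverted to assert that IsST is absent, so that an accidental fix does not change the feature count.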


Correlation Results (v5)

| Metric                     | Value   |
|----------------------------|---------|
| Mean correlation (32 dims) | 0.0050  |
| Median correlation         | 0.0079  |
| Min                        | -0.0420 |
| Max                        | 0.0372  |
| Overall (flattened)        | 0.2225  |

Conclusion: Embeddings remain essentially uncorrelated with the database.
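For reference, the per-dimension and flattened figures can be computed as in the sketch below, which assumes both embedding sets are row-aligned lists of samples with the same dimensions. The function names and the toy data are hypothetical. The toy data also shows one possible reason the flattened value can sit well above the per-dimension values: pooling all dimensions picks up offsets shared between dimensions even when no single dimension is correlated.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def correlation_report(
    ours: List[List[float]], reference: List[List[float]]
) -> Dict[str, float]:
    """Per-dimension statistics plus the correlation of the flattened arrays."""
    n_dims = len(ours[0])
    per_dim = [
        pearson([row[d] for row in ours], [row[d] for row in reference])
        for d in range(n_dims)
    ]
    flat = pearson(
        [v for row in ours for v in row], [v for row in reference for v in row]
    )
    return {
        "mean": mean(per_dim),
        "median": median(per_dim),
        "min": min(per_dim),
        "max": max(per_dim),
        "overall_flattened": flat,
    }


# Toy data: each dimension is uncorrelated, but dimension 1 sits near 10 in both sets.
ours = [[1, 11], [-1, 9], [1, 11], [-1, 9]]
reference = [[1, 11], [1, 11], [-1, 9], [-1, 9]]
report = correlation_report(ours, reference)
print(report["mean"], round(report["overall_flattened"], 4))  # 0.0 0.9615
```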


Possible Remaining Issues

  1. Different input data values: The alpha158_0_7_beta Parquet files may contain different values than the original DolphinDB data used to train the VAE.

  2. Feature ordering mismatch: The 330 RobustZScoreNorm parameters must be applied in this exact order:

    • [0:158] = alpha158 original
    • [158:316] = alpha158_ntrl
    • [316:323] = market_ext original (7 cols)
    • [323:330] = market_ext_ntrl (7 cols)

  3. Industry neutralization differences: Our IndusNtrlInjector implementation may differ from qlib's.

  4. Missing transformations: There may be additional preprocessing steps not captured in handler.yaml.

  5. VAE model mismatch: The VAE model may have been trained with different data than what handler.yaml specifies.
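The ordering in item 2 can be checked mechanically. The sketch below derives the slice boundaries from the group widths stated above and validates a column list against them; the group labels and function names are hypothetical, and columns are assumed to be (group, name) tuples.

```python
from typing import Dict, List, Sequence, Tuple

# Widths from the ordering listed above; the labels are placeholders.
FEATURE_LAYOUT: List[Tuple[str, int]] = [
    ("alpha158", 158),
    ("alpha158_ntrl", 158),
    ("market_ext", 7),
    ("market_ext_ntrl", 7),
]


def layout_slices(layout: Sequence[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each group to its half-open [start, stop) range."""
    slices, start = {}, 0
    for group, width in layout:
        slices[group] = (start, start + width)
        start += width
    return slices


def check_order(
    columns: Sequence[Tuple[str, str]], layout: Sequence[Tuple[str, int]]
) -> None:
    """Raise if any column sits outside the range reserved for its group."""
    for group, (start, stop) in layout_slices(layout).items():
        block = columns[start:stop]
        if len(block) != stop - start:
            raise ValueError(f"{group}: expected {stop - start} columns, got {len(block)}")
        for offset, (col_group, _) in enumerate(block):
            if col_group != group:
                raise ValueError(f"position {start + offset} holds {col_group}, expected {group}")


print(layout_slices(FEATURE_LAYOUT))
# {'alpha158': (0, 158), 'alpha158_ntrl': (158, 316), 'market_ext': (316, 323), 'market_ext_ntrl': (323, 330)}
```

This only verifies group placement; the order of features within each group would still need to be compared against the order used when the parameters were fitted.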


Recommended Next Steps

  1. Compare intermediate features: Run both the qlib pipeline and our pipeline on the same input data and compare outputs at each step.

  2. Verify RobustZScoreNorm parameter order: Check if our feature ordering matches the order used during VAE training.

  3. Compare predictions, not embeddings: Instead of comparing VAE embeddings, compare the final d033 model predictions with the original 0_7 predictions.

  4. Check alpha158 data source: Verify that stg_1day_wind_alpha158_0_7_beta_1D contains the same data as the original DolphinDB stg_1day_wind_alpha158_0_7_beta table.
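A step-by-step comparison can be sketched as a search for the first stage at which the two pipelines disagree. The sketch assumes each pipeline can dump its intermediate output as a list of (stage name, {column: values}) pairs with rows in the same order; the stage names, the tolerance, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Stage = Tuple[str, Dict[str, Sequence[float]]]


def values_match(a: float, b: float, tol: float) -> bool:
    """Treat two NaNs as equal; otherwise compare within an absolute tolerance."""
    if math.isnan(a) and math.isnan(b):
        return True
    return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)


def first_divergence(
    reference: List[Stage], candidate: List[Stage], tol: float = 1e-6
) -> Optional[Tuple[str, str]]:
    """Return (stage, reason) for the first stage that differs, or None."""
    for (ref_name, ref_cols), (cand_name, cand_cols) in zip(reference, candidate):
        if ref_name != cand_name:
            return ref_name, f"stage order differs: got {cand_name}"
        if set(ref_cols) != set(cand_cols):
            diff = sorted(set(ref_cols) ^ set(cand_cols))
            return ref_name, f"column sets differ: {diff}"
        for col in sorted(ref_cols):
            ref_vals, cand_vals = ref_cols[col], cand_cols[col]
            if len(ref_vals) != len(cand_vals):
                return ref_name, f"row count differs in {col}"
            if not all(values_match(a, b, tol) for a, b in zip(ref_vals, cand_vals)):
                return ref_name, f"values differ in {col}"
    if len(reference) != len(candidate):
        return "(end)", "pipelines have a different number of stages"
    return None


qlib_stages = [("raw", {"close": [1.0, 2.0]}), ("normalized", {"close": [-1.0, 1.0]})]
polars_stages = [("raw", {"close": [1.0, 2.0]}), ("normalized", {"close": [-1.0, 0.5]})]
print(first_divergence(qlib_stages, polars_stages))
# ('normalized', 'values differ in close')
```

Stopping at the first differing stage narrows the search to a single processor instead of the whole pipeline.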