# Data Pipeline Bug Analysis - Final Status

## Summary

After fixing all identified bugs, the feature count now matches (341), but the embeddings remain uncorrelated with the database 0_7 version.

**Latest Version**: v6

- Feature count: 341 ✓ (matches VAE input dim)
- Mean correlation with DB: 0.0050 (essentially zero)
- Status: All identified bugs fixed, IsST issue documented
- **New**: Polars-based dataset generation script added (`scripts/dump_polars_dataset.py`)

---

## Bugs Fixed

### 1. Market Classification (`FlagMarketInjector`) ✓ FIXED

- **Bug**: Used `instrument >= 600000`, which misclassified 新三板 (NEEQ) instruments
- **Fix**: Use string prefix matching with vocab_size=2 (not 3)
- **Impact**: 167 instruments corrected

### 2. ColumnRemover Missing `IsN` ✓ FIXED

- **Bug**: Only removed `IsZt, IsDt` but not `IsN`
- **Fix**: Added `IsN` to removal list
- **Impact**: Feature count alignment

### 3. RobustZScoreNorm Scope ✓ FIXED

- **Bug**: Applied normalization to all 341 features
- **Fix**: Only normalize 330 features (alpha158 + market_ext, both original + neutralized)
- **Impact**: Correct normalization scope

### 4. Wrong Data Sources for Market Flags ✓ FIXED

- **Bug**: Used `Limit, Stopping` (Float64) from kline_adjusted
- **Fix**: Load from correct sources:
  - kline_adjusted: `IsZt, IsDt, IsN, IsXD, IsXR, IsDR` (Boolean)
  - market_flag: `open_limit, close_limit, low_limit, high_stop` (Boolean, 4 cols)
- **Impact**: Correct boolean flag data

### 5. Feature Count Mismatch ✓ FIXED

- **Bug**: 344 features (3 extra)
- **Fix**: vocab_size=2 + 4 market_flag cols = 341 features
- **Impact**: VAE input dimension matches

### 6. Fixed* Processors Not Adding Required Columns ✓ FIXED

- **Bug**: `FixedFlagMarketInjector` only converted dtype but didn't add `market_0`, `market_1` columns
- **Bug**: `FixedFlagSTInjector` only converted dtype but didn't create `IsST` column from `ST_S`, `ST_Y`
- **Fix**:
  - `FixedFlagMarketInjector`: Now adds `market_0` (SH60xxx, SZ00xxx) and `market_1` (SH688xxx, SH689xxx, SZ300xxx, SZ301xxx)
  - `FixedFlagSTInjector`: Now creates `IsST = ST_S | ST_Y`
- **Impact**: Processed data now has 408 columns (was 405), matching original qlib output

---

## Important Discovery: IsST Column Issue in Gold-Standard Code

### Problem Description

The `FlagSTInjector` processor in the original qlib proc_list is supposed to create an `IsST` column in the `feature_flag` group from the `ST_S` and `ST_Y` columns in the `st_flag` group. However, this processor **fails silently** even in the gold-standard qlib code.

### Root Cause

The `FlagSTInjector` processor attempts to access columns using a format that doesn't match the actual column structure in the data:

1. **Expected format**: The processor expects columns like `st_flag::ST_S` and `st_flag::ST_Y` (string format with `::` separator)
2. **Actual format**: The qlib handler produces MultiIndex tuple columns like `('st_flag', 'ST_S')` and `('st_flag', 'ST_Y')`

This format mismatch causes the processor to fail to find the ST flag columns, and thus no `IsST` column is created.

### Evidence

```python
# Check proc_list
import pickle as pkl

# Load the pickled processor list used by the gold-standard pipeline
with open('proc_list.proc', 'rb') as f:
    proc_list = pkl.load(f)

# FlagSTInjector config
flag_st = proc_list[2]
print(f"fields_group: {flag_st.fields_group}")  # 'feature_flag'
print(f"col_name: {flag_st.col_name}")          # 'IsST'
print(f"st_group: {flag_st.st_group}")          # 'st_flag'

# Check if IsST exists in processed data
with open('processed_data.pkl', 'rb') as f:
    df = pkl.load(f)

# Columns are (group, name) tuples; collect the names in the feature_flag group
feature_flag_cols = [c[1] for c in df.columns if c[0] == 'feature_flag']
print('IsST' in feature_flag_cols)  # False!
```

### Impact

- **VAE training**: The VAE model was trained on data **without** the `IsST` column
- **VAE input dimension**: 341 features (excluding IsST), not 342
- **Polars pipeline**: Should also skip `IsST` to maintain compatibility

### Resolution

The polars-based pipeline (`dump_polars_dataset.py`) now correctly **skips** the `FlagSTInjector` step to match the gold-standard behavior:

```python
# Step 3: FlagSTInjector - SKIPPED (fails even in gold-standard)
print("[3] Skipping FlagSTInjector (as per gold-standard behavior)...")
market_flag_with_st = market_flag_with_market  # No IsST added
```

### Lessons Learned

1. **Verify processor execution**: Don't assume all processors in the proc_list executed successfully. Check the output data to verify expected columns exist.
2. **Column format matters**: The qlib processors were designed for specific column formats (MultiIndex tuples vs `::` separator strings). Format mismatches can cause silent failures.
3. **Match the gold-standard bugs**: When replicating a pipeline, sometimes you need to replicate the bugs too. The VAE was trained on data without `IsST`, so our pipeline must also exclude it.
4. **Debug by comparing intermediate outputs**: Use scripts like `debug_data_divergence.py` to compare raw and processed data between the gold-standard and polars pipelines.

---

## Correlation Results (v5)

| Metric | Value |
|--------|-------|
| Mean correlation (32 dims) | 0.0050 |
| Median correlation | 0.0079 |
| Min | -0.0420 |
| Max | 0.0372 |
| Overall (flattened) | 0.2225 |

**Conclusion**: Embeddings remain essentially uncorrelated with database.

---

## Possible Remaining Issues

1. **Different input data values**: The alpha158_0_7_beta Parquet files may contain different values than the original DolphinDB data used to train the VAE.
2. **Feature ordering mismatch**: The 330 RobustZScoreNorm parameters must be applied in the exact order:
   - [0:158] = alpha158 original
   - [158:316] = alpha158_ntrl
   - [316:323] = market_ext original (7 cols)
   - [323:330] = market_ext_ntrl (7 cols)
3. **Industry neutralization differences**: Our `IndusNtrlInjector` implementation may differ from qlib's.
4. **Missing transformations**: There may be additional preprocessing steps not captured in handler.yaml.
5. **VAE model mismatch**: The VAE model may have been trained with different data than what handler.yaml specifies.

---

## Recommended Next Steps

1. **Compare intermediate features**: Run both the qlib pipeline and our pipeline on the same input data and compare outputs at each step.
2. **Verify RobustZScoreNorm parameter order**: Check if our feature ordering matches the order used during VAE training.
3. **Compare predictions, not embeddings**: Instead of comparing VAE embeddings, compare the final d033 model predictions with the original 0_7 predictions.
4. **Check alpha158 data source**: Verify that `stg_1day_wind_alpha158_0_7_beta_1D` contains the same data as the original DolphinDB `stg_1day_wind_alpha158_0_7_beta` table.
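
One way to approach the parameter-order check from the steps above is to validate the column layout before applying any normalization parameters. The following is a minimal sketch, not the pipeline's actual code. It assumes columns are available as ordered `(group, name)` tuples, and that the group labels `alpha158`, `alpha158_ntrl`, `market_ext`, and `market_ext_ntrl` mirror the slice layout listed under "Feature ordering mismatch". The helper names `check_layout` and `first_order_mismatch` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Column = Tuple[str, str]  # (group, name); assumed column label format

# Slice layout of the 330 normalized features as listed in this report.
# The group labels are assumptions; substitute the names your data uses.
EXPECTED_LAYOUT: List[Tuple[str, int, int]] = [
    ("alpha158", 0, 158),
    ("alpha158_ntrl", 158, 316),
    ("market_ext", 316, 323),
    ("market_ext_ntrl", 323, 330),
]


def check_layout(
    columns: Sequence[Column],
    layout: Sequence[Tuple[str, int, int]] = EXPECTED_LAYOUT,
) -> List[str]:
    """Return a list of problems; an empty list means the layout matches."""
    problems: List[str] = []
    total = layout[-1][2]
    if len(columns) != total:
        problems.append(f"expected {total} columns, got {len(columns)}")
    for group, start, stop in layout:
        block = columns[start:stop]
        for pos, (col_group, col_name) in enumerate(block, start=start):
            if col_group != group:
                problems.append(
                    f"position {pos}: expected group {group!r}, "
                    f"found {col_group!r} ({col_name})"
                )
                break  # report only the first mismatch per block
    return problems


def first_order_mismatch(
    ours: Sequence[Column], reference: Sequence[Column]
) -> Optional[int]:
    """Return the first index where two column orderings differ, else None."""
    for i, (a, b) in enumerate(zip(ours, reference)):
        if a != b:
            return i
    if len(ours) != len(reference):
        return min(len(ours), len(reference))
    return None
```

In practice, `columns` could be built from `list(df.columns)` for a frame with MultiIndex tuple columns, and `reference` from the column order of the gold-standard processed data. Running the check before normalization surfaces a silent misalignment of the 330 parameters earlier than an embedding correlation would.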