### 6. Fixed Processors Not Adding Required Columns ✓ FIXED
- **Bug**: `FixedFlagMarketInjector` only converted dtype but didn't add `market_0`, `market_1` columns
- **Bug**: `FixedFlagSTInjector` only converted dtype but didn't create `IsST` column from `ST_S`, `ST_Y`
- **Fix**:
  - `FixedFlagMarketInjector`: now adds `market_0` (SH60xxx, SZ00xxx) and `market_1` (SH688xxx, SH689xxx, SZ300xxx, SZ301xxx)
  - `FixedFlagSTInjector`: now creates `IsST = ST_S | ST_Y`
- **Impact**: Processed data now has 408 columns (was 405), matching original qlib output
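A minimal sketch of the market-flag rule described above, using prefix matching on instrument codes (the function name and exact prefix handling are our own illustration, not the injector's actual implementation):

```python
def market_flags(instrument: str) -> tuple[int, int]:
    """Classify an instrument code into (market_0, market_1) flags.

    market_0: main boards  (SH60xxxx, SZ00xxxx)
    market_1: STAR/ChiNext (SH688xxx, SH689xxx, SZ300xxx, SZ301xxx)
    """
    market_0 = int(instrument.startswith(("SH60", "SZ00")))
    market_1 = int(instrument.startswith(("SH688", "SH689", "SZ300", "SZ301")))
    return market_0, market_1

# The ST fix is simpler: IsST is the element-wise OR of the two flags,
# i.e. IsST = ST_S | ST_Y
```

Note that `SH688xxx` codes also match the `SH60` prefix test only if the prefixes overlapped, which they do not here, so the two flags are mutually exclusive for these boards.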
---
## Important Discovery: IsST Column Issue in Gold-Standard Code
### Problem Description
The `FlagSTInjector` processor in the original qlib proc_list is supposed to create an `IsST` column in the `feature_flag` group from the `ST_S` and `ST_Y` columns in the `st_flag` group. However, this processor **fails silently** even in the gold-standard qlib code.
### Root Cause
The `FlagSTInjector` processor attempts to access columns using a format that doesn't match the actual column structure in the data:
1. **Expected format**: The processor expects columns like `st_flag::ST_S` and `st_flag::ST_Y` (string format with `::` separator)
2. **Actual format**: The qlib handler produces MultiIndex tuple columns like `('st_flag', 'ST_S')` and `('st_flag', 'ST_Y')`
This format mismatch causes the processor to fail to find the ST flag columns, so no `IsST` column is ever created:

```python
# Inspecting the processed output confirms the column is missing
feature_flag_cols = [c[1] for c in df.columns if c[0] == 'feature_flag']
print('IsST' in feature_flag_cols)  # False!
```
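The mismatch itself is easy to reproduce in isolation: a `::`-separated string key never matches a MultiIndex tuple column (a minimal, self-contained illustration):

```python
import pandas as pd

# Columns as the qlib handler actually produces them: MultiIndex tuples
cols = pd.MultiIndex.from_tuples([('st_flag', 'ST_S'), ('st_flag', 'ST_Y')])
df = pd.DataFrame([[0, 1]], columns=cols)

# The string format the processor looks up is simply absent
print('st_flag::ST_S' in df.columns)      # string key: not found
print(('st_flag', 'ST_S') in df.columns)  # tuple key: found
```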
### Impact
- **VAE training**: The VAE model was trained on data **without** the `IsST` column
- **VAE input dimension**: 341 features (excluding IsST), not 342
- **Polars pipeline**: Should also skip `IsST` to maintain compatibility
### Resolution
The polars-based pipeline (`dump_polars_dataset.py`) now correctly **skips** the `FlagSTInjector` step to match the gold-standard behavior:
```python
# Step 3: FlagSTInjector - SKIPPED (fails even in gold-standard)
print("[3] Skipping FlagSTInjector (as per gold-standard behavior)...")
market_flag_with_st = market_flag_with_market # No IsST added
```
### Lessons Learned
1. **Verify processor execution**: Don't assume all processors in the proc_list executed successfully. Check the output data to verify expected columns exist.
2. **Column format matters**: The qlib processors were designed for specific column formats (MultiIndex tuples vs `::` separator strings). Format mismatches can cause silent failures.
3. **Match the gold-standard bugs**: When replicating a pipeline, sometimes you need to replicate the bugs too. The VAE was trained on data without `IsST`, so our pipeline must also exclude it.
4. **Debug by comparing intermediate outputs**: Use scripts like `debug_data_divergence.py` to compare raw and processed data between the gold-standard and polars pipelines.
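Lesson 1 can be turned into a fail-fast check run after the proc_list (the helper name and the example columns are our own, hypothetical illustration):

```python
import pandas as pd

def assert_columns_exist(df: pd.DataFrame, expected: list) -> None:
    """Fail fast if any expected (group, name) column is missing from the output."""
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise ValueError(f"Processor output is missing columns: {missing}")

# Hypothetical post-pipeline check: market flags should exist, IsST should not
cols = pd.MultiIndex.from_tuples([('feature_flag', 'market_0'),
                                  ('feature_flag', 'market_1')])
df = pd.DataFrame([[0, 1]], columns=cols)
assert_columns_exist(df, [('feature_flag', 'market_0'),
                          ('feature_flag', 'market_1')])
```

A check like this would have surfaced the silent `FlagSTInjector` failure the first time the pipeline ran.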
**Conclusion**: Despite these fixes, the embeddings remain essentially uncorrelated with the database values.
---
## Possible Remaining Issues
1. **Different input data values**: The alpha158_0_7_beta Parquet files may contain different values than the original DolphinDB data used to train the VAE.
2. **Feature ordering mismatch**: The 330 RobustZScoreNorm parameters must be applied in the exact order:
   - [0:158] = alpha158 original
   - [158:316] = alpha158_ntrl
   - [316:323] = market_ext original (7 cols)
   - [323:330] = market_ext_ntrl (7 cols)
3. **Industry neutralization differences**: Our `IndusNtrlInjector` implementation may differ from qlib's.
4. **Missing transformations**: There may be additional preprocessing steps not captured in handler.yaml.
5. **VAE model mismatch**: The VAE model may have been trained with different data than what handler.yaml specifies.
---
## Recommended Next Steps
1. **Compare intermediate features**: Run both the qlib pipeline and our pipeline on the same input data and compare outputs at each step.
2. **Verify RobustZScoreNorm parameter order**: Check if our feature ordering matches the order used during VAE training.
3. **Compare predictions, not embeddings**: Instead of comparing VAE embeddings, compare the final d033 model predictions with the original 0_7 predictions.
4. **Check alpha158 data source**: Verify that `stg_1day_wind_alpha158_0_7_beta_1D` contains the same data as the original DolphinDB `stg_1day_wind_alpha158_0_7_beta` table.
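Step 1 can be supported by a simple per-column diff between the two pipelines' outputs at each stage (the function name and tolerance are placeholders; it only compares columns present in both frames):

```python
import numpy as np
import pandas as pd

def diverging_columns(a: pd.DataFrame, b: pd.DataFrame, atol: float = 1e-6) -> list:
    """Return columns whose values differ beyond tolerance between two pipelines."""
    diverging = []
    for col in a.columns.intersection(b.columns):
        if not np.allclose(a[col].to_numpy(), b[col].to_numpy(),
                           atol=atol, equal_nan=True):
            diverging.append(col)
    return diverging
```

Running this after each processor step localizes exactly where the qlib and polars outputs first diverge, which is cheaper than comparing only the final embeddings.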