Data Pipeline Bug Analysis
Summary
The generated embeddings do not match the database 0_7 embeddings because of several bugs introduced when the data pipeline was migrated from qlib to a standalone Polars implementation.
Bugs Fixed
1. Market Classification (FlagMarketInjector) ✓ FIXED
Original (incorrect):
market_0 = (instrument >= 600000) # SH
market_1 = (instrument < 600000) # SZ
Fixed:
inst_str = str(instrument).zfill(6)
market_0 = inst_str.startswith('6') # SH: 6xxxxx
market_1 = inst_str.startswith('0') | inst_str.startswith('3') # SZ: 0xxxxx, 3xxxxx
market_2 = inst_str.startswith('4') | inst_str.startswith('8') # NE: 4xxxxx, 8xxxxx
Impact: 167 instruments (4xxxxx, 8xxxxx, i.e. the NEEQ "New Third Board") were misclassified.
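To make the failure mode concrete, here is a minimal sketch that contrasts the two rules on single instrument codes. The function names `classify_old` and `classify_fixed` are hypothetical, the codes are illustrative rather than taken from the dataset, and the real FlagMarketInjector presumably works on whole columns rather than one integer at a time.

```python
def classify_old(instrument: int) -> tuple[bool, bool]:
    """Original threshold rule: returns (market_0, market_1)."""
    return instrument >= 600000, instrument < 600000


def classify_fixed(instrument: int) -> tuple[bool, bool, bool]:
    """Prefix rule: returns (market_0, market_1, market_2)."""
    inst_str = str(instrument).zfill(6)
    market_0 = inst_str.startswith("6")         # SH: 6xxxxx
    market_1 = inst_str.startswith(("0", "3"))  # SZ: 0xxxxx, 3xxxxx
    market_2 = inst_str.startswith(("4", "8"))  # NE: 4xxxxx, 8xxxxx
    return market_0, market_1, market_2


# Illustrative codes only (not taken from the dataset).
for code in (600000, 1, 300001, 430001, 830001):
    print(str(code).zfill(6), classify_old(code), classify_fixed(code))
```

Under the threshold rule, an 8xxxxx code lands in SH and a 4xxxxx code lands in SZ; the prefix rule places both in the separate market_2 flag.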
2. ColumnRemover Missing IsN ✓ FIXED
Original (incorrect):
columns_to_remove = ['TotalValue_diff', 'IsZt', 'IsDt']
Fixed:
columns_to_remove = ['TotalValue_diff', 'IsN', 'IsZt', 'IsDt']
Impact: The extra IsN column caused a feature dimension mismatch.
3. RobustZScoreNorm Applied to Wrong Columns ✓ FIXED
Original (incorrect): Applied normalization to ALL 341 features including market flags and indus_idx.
Fixed:
Only normalize alpha158 + alpha158_ntrl + market_ext + market_ext_ntrl (330 features), excluding:
- Market flags (Limit, Stopping, IsTp, IsXD, IsXR, IsDR, market_0, market_1, market_2, IsST)
- indus_idx
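The selective normalization can be sketched as follows. This is a simplified illustration, not the pipeline's code: `normalize_row` and its arguments are hypothetical names, the parameters are assumed to be pre-fitted per-column medians and scales, and the clip to [-3, 3] is an assumption about the qlib configuration.

```python
# Columns that must pass through unchanged, as listed above.
PASSTHROUGH = {
    "Limit", "Stopping", "IsTp", "IsXD", "IsXR", "IsDR",
    "market_0", "market_1", "market_2", "IsST", "indus_idx",
}


def normalize_row(row, norm_columns, medians, scales, clip=3.0):
    """Apply pre-fitted robust z-score parameters to norm_columns only.

    row: dict mapping column name -> value
    norm_columns: ordered names matching medians/scales position by position
    """
    if not (len(norm_columns) == len(medians) == len(scales)):
        raise ValueError("parameter shape does not match column list")
    overlap = PASSTHROUGH.intersection(norm_columns)
    if overlap:
        raise ValueError(f"flag columns must not be normalized: {sorted(overlap)}")
    out = dict(row)  # untouched columns are copied as-is
    for name, med, scale in zip(norm_columns, medians, scales):
        z = (row[name] - med) / scale
        out[name] = max(-clip, min(clip, z))  # assumed outlier clipping
    return out


# Toy example with two normalized features and two passthrough columns.
row = {"f0": 8.0, "f1": -10.0, "market_0": 1.0, "indus_idx": 17.0}
print(normalize_row(row, ["f0", "f1"], [4.0, 0.0], [2.0, 1.0]))
```

Rejecting any passthrough column in `norm_columns` turns the original bug into a hard error instead of a silent change to the model inputs.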
Critical Remaining Issue: Data Schema Mismatch
Limit and Stopping Column Types Changed
Original qlib pipeline expected:
Limit: Boolean flag (True = limit up)
Stopping: Boolean flag (True = suspended trading)
Current Parquet data has:
Limit: Float64 price change percentage (0.0 to 1301.3)
Stopping: Float64 price change percentage
Evidence:
Limit values sample: [8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89]
Limit == 0: only 2 rows
Limit > 0: 3738 rows
This is a fundamental data schema change. The current Parquet files contain different data than what the original VAE model was trained on.
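A quick sanity check along these lines can flag such a column before it reaches the model. The helper name `looks_like_boolean_flag` is hypothetical; the sample values are the ones quoted in the evidence above.

```python
def looks_like_boolean_flag(values) -> bool:
    """True if every non-null value is 0 or 1 (True/False also qualify)."""
    return all(v in (0, 1) for v in values if v is not None)


limit_sample = [8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89]
print(looks_like_boolean_flag(limit_sample))          # False
print(looks_like_boolean_flag([True, False, True]))   # True
```

Note that an empty or all-null column passes this check trivially, so a row count should be verified separately.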
Possible fixes:
- Convert Limit and Stopping to boolean flags using a threshold
- Find the original data source that had boolean flags
- Re-train the VAE model with the new data schema
Correlation Results
After fixing bugs 1-3, the embedding correlation with database 0_7:
| Metric | Value |
|---|---|
| Mean correlation (32 dims) | 0.0068 |
| Median correlation | 0.0094 |
| Overall correlation | 0.2330 |
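Conclusion: The per-dimension correlations are essentially zero (mean 0.0068, median 0.0094), so the generated embeddings still do not reproduce the database embeddings.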
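The table does not state how each metric was computed. The sketch below assumes that the per-dimension figures are Pearson correlations computed column by column and that "overall" is the Pearson correlation of all values flattened together; the function names and toy data are hypothetical. Under that assumption, the overall value can be well above zero even when every dimension is uncorrelated, simply because the dimensions have different means.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def per_dim_correlation(generated, reference):
    """One Pearson value per embedding dimension (rows are samples)."""
    return [pearson(g, r) for g, r in zip(zip(*generated), zip(*reference))]


def overall_correlation(generated, reference):
    """Pearson correlation over all values flattened together."""
    flat_g = [v for row in generated for v in row]
    flat_r = [v for row in reference for v in row]
    return pearson(flat_g, flat_r)


# Toy data: 4 samples x 2 dimensions; dimension 1 is offset by 10.
generated = [[-1, 9], [1, 11], [-1, 9], [1, 11]]
reference = [[-1, 9], [-1, 9], [1, 11], [1, 11]]
print(per_dim_correlation(generated, reference))  # zero in both dimensions
print(overall_correlation(generated, reference))  # close to 1
```

For this reason the per-dimension mean and median are the more informative figures when judging whether the embeddings match.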
Root Cause
The Limit/Stopping data schema change is the most likely root cause. The VAE model learned to encode features that included binary limit/stopping flags, but the standalone pipeline feeds it continuous price change percentages instead.
Next Steps
- Verify original data schema:
  - Check if the original DolphinDB table had boolean Limit and Stopping columns
  - Compare with the current Parquet schema
- Fix the data loading:
  - Either convert continuous values to binary flags
  - Or use the correct boolean columns (IsZt, IsDt) for limit flags
- Verify feature order:
  - Ensure the qlib RobustZScoreNorm parameters are applied in the correct order
  - Check that [alpha158, alpha158_ntrl, market_ext, market_ext_ntrl] matches the 330-parameter shape
- Re-run comparison:
  - Generate new embeddings with the corrected pipeline
  - Compare correlation with database
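The feature-order check could look roughly like the sketch below. The helper names and the toy column names are hypothetical; only the group order and the 330-parameter count come from this analysis, and the per-group column lists would have to be supplied by the actual pipeline.

```python
GROUP_ORDER = ["alpha158", "alpha158_ntrl", "market_ext", "market_ext_ntrl"]
EXPECTED_PARAMS = 330  # number of normalized features stated in this analysis


def ordered_norm_columns(groups):
    """Concatenate column names group by group, following GROUP_ORDER."""
    missing = [g for g in GROUP_ORDER if g not in groups]
    if missing:
        raise KeyError(f"missing feature groups: {missing}")
    columns = [c for g in GROUP_ORDER for c in groups[g]]
    if len(set(columns)) != len(columns):
        raise ValueError("duplicate column names across groups")
    return columns


def check_param_shape(columns, n_params=EXPECTED_PARAMS):
    """Fail loudly if the column count differs from the parameter count."""
    if len(columns) != n_params:
        raise ValueError(f"{len(columns)} columns vs {n_params} parameters")


# Toy groups with made-up column names, checked against a toy count of 4.
toy_groups = {
    "alpha158": ["a0"],
    "alpha158_ntrl": ["a0_ntrl"],
    "market_ext": ["m0"],
    "market_ext_ntrl": ["m0_ntrl"],
}
toy_columns = ordered_norm_columns(toy_groups)
check_param_shape(toy_columns, n_params=4)
print(toy_columns)
```

A matching count does not prove the order is right, so comparing the resulting name list against the column order used when the parameters were fitted is still necessary.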