
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

Alpha Lab is a quantitative research experiment framework built on the qshare library. It uses a notebook-centric approach for exploring trading strategies and ML models. The codebase is organized around two prediction tasks:

  • cta_1d: CTA (Commodity Trading Advisor) futures 1-day return prediction
  • stock_15m: Stock 15-minute forward return prediction using high-frequency features

Directory Structure

alpha_lab/
├── common/              # Shared utilities
│   ├── __init__.py
│   ├── paths.py         # Path management
│   └── plotting.py      # Common plotting functions
│
├── cta_1d/             # CTA 1-day return prediction
│   ├── __init__.py     # Re-exports from src/
│   ├── config.yaml     # Task configuration
│   ├── src/            # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py   # CTA1DLoader
│   │   ├── train.py    # Training functions
│   │   ├── backtest.py # Backtest functions
│   │   └── labels.py   # Label blending utilities
│   └── *.ipynb         # Experiment notebooks
│
├── stock_15m/          # Stock 15-minute return prediction
│   ├── __init__.py     # Re-exports from src/
│   ├── config.yaml     # Task configuration
│   ├── src/            # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py   # Stock15mLoader
│   │   └── train.py    # Training functions
│   └── *.ipynb         # Experiment notebooks
│
└── results/            # Output directory (gitignored)

Common Commands

Development Setup

# Install dependencies
pip install -r requirements.txt

# Create environment configuration
cp .env.template .env
# Edit .env with your DolphinDB host and data paths

Running Experiments

# Start Jupyter for interactive experiments
jupyter notebook

# Train CTA model from config
python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01

# Train Stock 15m model
python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01

# Run CTA backtest
python -m cta_1d.backtest \
    --model results/cta_1d/exp01/model.json \
    --dt-range 2023-01-01 2023-12-31 \
    --output results/cta_1d/backtest_01

Python API Usage

# CTA 1D workflow
from cta_1d import CTA1DLoader, train_model, TrainConfig

loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])

config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'], feature_sets=['alpha158'])
model, metrics = train_model(config, output_dir='results/exp01')

# Stock 15m workflow
from stock_15m import Stock15mLoader, train_model, TrainConfig

loader = Stock15mLoader(normalization_mode='dual')
dataset = loader.load(
    dt_range=['2020-01-01', '2023-12-31'],
    feature_path='/data/parquet/stock_1min_alpha158',
    kline_path='/data/parquet/stock_1min_kline'
)

Architecture

Module Organization

All implementation code lives in src/ subdirectories:

  • cta_1d/src/: CTA-specific implementations

    • loader.py: CTA1DLoader class
    • train.py: train_model, TrainConfig
    • backtest.py: run_backtest, BacktestConfig
    • labels.py: Label blending utilities
  • stock_15m/src/: Stock-specific implementations

    • loader.py: Stock15mLoader class
    • train.py: train_model, TrainConfig

Root __init__.py files re-export public APIs for backward compatibility:

from cta_1d import CTA1DLoader  # Imports from cta_1d.src
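
A minimal sketch of such a re-export layer, using the symbols documented above (the actual __init__.py may export more or fewer names):

# cta_1d/__init__.py -- illustrative re-export layer
from .src.loader import CTA1DLoader
from .src.train import train_model, TrainConfig
from .src.backtest import run_backtest, BacktestConfig

__all__ = ['CTA1DLoader', 'train_model', 'TrainConfig', 'run_backtest', 'BacktestConfig']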

Data Flow

Both tasks follow a consistent pattern:

  1. Loaders (src/loader.py): Fetch data from DolphinDB (CTA) or Parquet files (Stock), apply normalization, compute sample weights, return pl_Dataset
  2. Training (src/train.py): XGBoost with early stopping, outputs model JSON + metrics
  3. Backtest (src/backtest.py): CTA-only; uses qshare.eval.cta.backtest.CTABacktester for strategy simulation
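
Put together, a typical CTA run chains these stages roughly as sketched below; the loader and training calls mirror the API examples above, while the run_backtest / BacktestConfig argument names are assumptions for illustration:

from cta_1d import CTA1DLoader, TrainConfig, train_model, run_backtest, BacktestConfig

# 1. Load: fetch features from DolphinDB, normalize, attach sample weights
loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])

# 2. Train: XGBoost with early stopping; writes model JSON + metrics to the output dir
config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'], feature_sets=['alpha158'])
model, metrics = train_model(config, output_dir='results/cta_1d/exp01')

# 3. Backtest (CTA only): wraps qshare.eval.cta.backtest.CTABacktester
#    (BacktestConfig field names below are illustrative)
bt_config = BacktestConfig(model_path='results/cta_1d/exp01/model.json',
                           dt_range=['2023-01-01', '2023-12-31'])
report = run_backtest(bt_config, output_dir='results/cta_1d/backtest_01')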

Key Classes

  • CTA1DLoader: Loads alpha158/hffactor features from DolphinDB; supports 5 normalization modes (zscore, cs_zscore, rolling_20, rolling_60, dual)
  • Stock15mLoader: Loads Alpha158 features computed on 1-minute bars from Parquet; computes 15-minute forward returns; normalization modes: industry, cs_zscore, dual
  • pl_Dataset: From qshare.data; provides .with_segments(), .split(), .to_numpy() methods
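
The pl_Dataset methods above are typically chained as in this sketch; the arguments are deliberately elided because the exact signatures are defined in qshare.data:

# Illustrative chaining only -- see qshare.data for the real signatures
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])

dataset = dataset.with_segments(...)    # tag rows with train/valid/test segments
train_ds, test_ds = dataset.split(...)  # split the dataset by segment
X_train, y_train = train_ds.to_numpy()  # dense arrays for XGBoost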

Normalization Modes

CTA 1D (dual blending):

  • zscore: Fit-time mean/std normalization
  • cs_zscore: Cross-sectional z-score per datetime
  • rolling_20/60: Rolling window normalization
  • dual: Weighted blend (default: [0.2, 0.1, 0.3, 0.4])

Stock 15m:

  • industry: Industry-neutralized returns
  • cs_zscore: Cross-sectional z-score
  • dual: 80% industry-neutral + 20% cs_zscore
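
A minimal sketch of what a dual blend computes, assuming the CTA weights map to the four listed modes in that order (the real blending lives inside the loaders and label utilities):

import numpy as np

# Hypothetical illustration -- not the actual loader code
CTA_WEIGHTS = {'zscore': 0.2, 'cs_zscore': 0.1, 'rolling_20': 0.3, 'rolling_60': 0.4}

def dual_blend(normalized: dict, weights: dict) -> np.ndarray:
    """Weighted sum of per-mode normalized arrays (all arrays share the same shape)."""
    return sum(w * normalized[mode] for mode, w in weights.items())

# Stock 15m follows the same idea with two components:
# 0.8 * industry-neutralized + 0.2 * cs_zscore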

Experiment Tracking

Manual tracking in results/{task}/README.md:

## 2025-01-15: Baseline XGB
- Notebook: `cta_1d/03_baseline_xgb.ipynb` (cells 1-50)
- Config: eta=0.5, lambda=0.1
- Train IC: 0.042
- Test IC: 0.038
- Notes: Dual normalization, 4 trades/day

Dependencies on qshare

The codebase relies heavily on the qshare library (already installed in the venv):

  • qshare.data.pl_Dataset: Dataset container with Polars backend
  • qshare.io.ddb: DolphinDB session management
  • qshare.io.polars: Parquet loading utilities
  • qshare.algo.polars: Industry neutralization, cross-sectional z-score
  • qshare.eval.cta.backtest: CTA backtesting framework
  • qshare.config.research.cta: Predefined column lists (HFFACTOR_COLS)
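
These dotted paths translate directly into imports, for example (names taken from the list above and from the backtest description earlier):

from qshare.data import pl_Dataset
from qshare.eval.cta.backtest import CTABacktester
from qshare.config.research.cta import HFFACTOR_COLS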

Configuration Files

YAML configs define data ranges, model hyperparameters, and output settings:

data:
  dt_range: ['2020-01-01', '2023-12-31']
  feature_sets: [alpha158, hffactor]
  normalization: dual
model:
  type: xgb
  params: {eta: 0.05, max_depth: 6}

Load a config either through the CLI (python -m cta_1d.train --config config.yaml) or programmatically with yaml.safe_load().
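
When loading a config programmatically, the mapping from the YAML sections to the Python config looks roughly like this sketch (field names follow the examples above; TrainConfig's full signature may differ):

import yaml
from cta_1d import TrainConfig, train_model

with open('cta_1d/config.yaml') as f:
    cfg = yaml.safe_load(f)

# Field names mirror the YAML example above; additional keys (e.g. model params)
# would be passed through in the same way
config = TrainConfig(dt_range=cfg['data']['dt_range'],
                     feature_sets=cfg['data']['feature_sets'])
model, metrics = train_model(config, output_dir='results/cta_1d/exp01')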