πŸ“¦ Working with Multiple Datasets

A comprehensive guide to uploading single, multiple, and large datasets across all VEOX problem types.


Overview

VEOX evaluates every pipeline candidate across all provided datasets and averages the fitness scores. This produces more robust, generalizable solutions because:

  • High-variance candidates are penalized β€” the engine uses a Lower Confidence Bound (mean βˆ’ λ·std) to discourage lucky one-off scores.
  • Overfitting to a single dataset is harder β€” a pipeline must perform well across distributions.
  • Real-world scenarios often involve heterogeneous data (e.g., multiple hospitals, different markets, seasonal vs. non-seasonal time series).

Input Patterns

All categories that accept client-uploaded data support these input patterns:

Single DataFrame

import pandas as pd
from veox import VeoxEvolver

evolver = VeoxEvolver("binary")
evolver.fit(data=df, target_column="target", max_generations=5)  # df: any DataFrame containing a "target" column

List of DataFrames

evolver.fit(
    data=[df1, df2, df3],
    target_column="target",
    max_generations=5,
)

CSV File Path (String)

evolver.fit(
    data="path/to/train.csv",
    target_column="target",
    max_generations=5,
)

CSV File Path (pathlib.Path)

from pathlib import Path

evolver.fit(
    data=Path("path/to/train.csv"),
    target_column="target",
    max_generations=5,
)

Mixed List

from pathlib import Path

evolver.fit(
    data=[df1, "train_b.csv", Path("train_c.csv")],
    target_column="target",
    max_generations=5,
)
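Whatever pattern is used, every entry ultimately has to become a DataFrame before upload. A minimal sketch of that coercion step, assuming the SDK does something equivalent internally (the helper name `normalize_inputs` is illustrative, not part of the public API):

```python
from pathlib import Path

import pandas as pd


def normalize_inputs(data) -> list:
    """Coerce a DataFrame, a CSV path, or a mixed list into a list of DataFrames."""
    items = data if isinstance(data, list) else [data]
    frames = []
    for item in items:
        if isinstance(item, pd.DataFrame):
            frames.append(item)
        elif isinstance(item, (str, Path)):
            frames.append(pd.read_csv(item))  # load CSV from disk
        else:
            raise TypeError(f"Unsupported input type: {type(item)!r}")
    return frames
```

An unsupported type (e.g. a dict) fails fast with a TypeError instead of surfacing as a confusing server-side error.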

Category Support Matrix

Category          | Client Upload | Server-side Data         | Multi-Dataset
------------------|---------------|--------------------------|------------------
binary            | ✅            | —                        | ✅
regression        | ✅            | —                        | ✅
outlier           | ✅            | ✅ (bundled)             | ✅
time_series       | ✅            | ✅ (seed cache)          | ✅
trading           | ✅            | ✅ (golden set)          | ✅
optimization      | —             | ✅ (objective functions) | ✅ (server-side)
signal_separation | ✅ (ZIP/dir)  | ✅ (seed cache)          | ✅

Column Requirements by Category

Category          | Required Column            | Notes
------------------|----------------------------|---------------------------------------------------------------
binary            | Target column (0/1)        | Specified via target_column=
regression        | Target column (continuous) | Specified via target_column=
outlier           | Auto-detected              | Checks: is_outlier, outlier, class, target, then the last column
time_series       | unique_id, ds, y           | Auto-formatted if columns are missing
trading           | close (price)              | Auto-detected: close, *_close, mid_price
signal_separation | N/A                        | Uses mixture/ground_truth CSV pairs in the ZIP
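For the outlier category, the documented detection order is simple enough to sketch directly (illustrative; the SDK's actual implementation may differ):

```python
import pandas as pd

# Documented priority order for the outlier label column
PREFERRED_LABELS = ["is_outlier", "outlier", "class", "target"]


def detect_outlier_column(df: pd.DataFrame) -> str:
    """Return the first preferred label column present, else fall back to the last column."""
    for name in PREFERRED_LABELS:
        if name in df.columns:
            return name
    return df.columns[-1]
```

If none of the preferred names are present, the last column is assumed to be the label, so column order matters for unlabeled-looking frames.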

Large Dataset Handling

When passing file paths, the SDK reads CSV data from disk as a string and streams it to the server β€” avoiding a full DataFrame round-trip through memory:

# Good β€” streams from disk
evolver.fit(data="huge_dataset.csv", target_column="target")

# Also good β€” pathlib.Path works too
from pathlib import Path
evolver.fit(data=Path("huge_dataset.csv"), target_column="target")

If a DataFrame exceeds 50MB when serialized to CSV, the SDK will print a warning suggesting file paths instead.
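To decide up front which pattern to use, you can estimate the serialized size yourself. A sketch, assuming the same 50 MB threshold the SDK warns at (`csv_size_bytes` and `should_use_file_path` are illustrative helpers, not SDK functions):

```python
import pandas as pd

SIZE_WARN_BYTES = 50 * 1024 * 1024  # the SDK's documented 50 MB warning threshold


def csv_size_bytes(df: pd.DataFrame) -> int:
    """Size of the DataFrame once serialized to CSV, in bytes."""
    return len(df.to_csv(index=False).encode("utf-8"))


def should_use_file_path(df: pd.DataFrame) -> bool:
    """True when the frame is large enough that a file path would avoid the warning."""
    return csv_size_bytes(df) > SIZE_WARN_BYTES
```

When this returns True, write the frame to disk once with `df.to_csv(path, index=False)` and pass the path to `fit()` instead.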

Progress Bars

When uploading multiple datasets, the SDK shows a Rich progress bar:

β ‹ [2/5] Uploading fraud_dataset  ━━━━━━━━━━━━━━━━━━━━ 40%
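The SDK renders this with the Rich library. The bar itself is easy to reproduce with the standard library, which also shows how the `[2/5]` and `40%` figures are derived (spinner omitted; `render_progress` is an illustrative helper, not part of the SDK):

```python
def render_progress(index: int, total: int, name: str, width: int = 20) -> str:
    """Render one upload-progress line: completed fraction as a bar plus a percentage."""
    frac = index / total
    filled = int(width * frac)
    bar = "━" * filled + " " * (width - filled)
    return f"[{index}/{total}] Uploading {name}  {bar} {int(frac * 100)}%"


print(render_progress(2, 5, "fraud_dataset"))
```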

Server-Side Storage

All uploaded data lives in /tmp/veox_uploads/. This directory:

  • Has a 2 GB ceiling β€” the server auto-cleans old uploads when exceeded.
  • Is ephemeral β€” in Docker containers, it's cleared on restart.
  • Is cleaned after each job completes.
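The 2 GB ceiling implies an eviction policy along these lines, oldest uploads first (a sketch of the behavior described above, assuming mtime-ordered cleanup; not the server's actual code):

```python
from pathlib import Path

CEILING_BYTES = 2 * 1024**3  # the documented 2 GB ceiling


def prune_uploads(upload_dir: Path, ceiling: int = CEILING_BYTES) -> None:
    """Delete the oldest files until the directory fits under the ceiling."""
    files = [p for p in upload_dir.glob("**/*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime)  # oldest first
    total = sum(p.stat().st_size for p in files)
    for p in files:
        if total <= ceiling:
            break
        total -= p.stat().st_size
        p.unlink()
```

The practical upshot is the same either way: treat uploads as disposable and keep the canonical copies of your datasets on the client side.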

How Fitness Aggregation Works

  1. Each pipeline candidate is evaluated on every dataset independently.
  2. Scores are averaged (mean_score = mean(all_dataset_scores)).
  3. Standard deviation is calculated (std = std(all_dataset_scores)).
  4. The GrandmasterEvaluator computes the final score using a Lower Confidence Bound:
    final_score = mean_score - λ·std - complexity_penalty - time_penalty
    
  5. This means high-variance candidates (great on one dataset, terrible on another) get penalized.
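The aggregation above can be reproduced with NumPy. λ and the two penalties are left as explicit parameters because their server-side values are internal to the GrandmasterEvaluator:

```python
import numpy as np


def lcb_fitness(dataset_scores, lam=1.0, complexity_penalty=0.0, time_penalty=0.0):
    """Lower Confidence Bound aggregation: mean - lam*std - penalties."""
    scores = np.asarray(dataset_scores, dtype=float)
    mean_score = scores.mean()
    std = scores.std()
    return mean_score - lam * std - complexity_penalty - time_penalty
```

With equal means, the steady candidate wins: `lcb_fitness([0.80, 0.80, 0.80])` exceeds `lcb_fitness([0.95, 0.65, 0.80])`, which is exactly the high-variance penalty of step 5.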

Complete Examples

Binary Classification (2 datasets)

from sklearn.datasets import make_classification
import pandas as pd
from veox import VeoxEvolver

# Two datasets with different class balances
X1, y1 = make_classification(n_samples=800, n_features=20, weights=[0.7, 0.3], random_state=42)
df1 = pd.DataFrame(X1, columns=[f"f{i}" for i in range(20)]); df1["target"] = y1

X2, y2 = make_classification(n_samples=800, n_features=20, weights=[0.9, 0.1], random_state=99)
df2 = pd.DataFrame(X2, columns=[f"f{i}" for i in range(20)]); df2["target"] = y2

evolver = VeoxEvolver("binary")
evolver.fit(data=[df1, df2], target_column="target", max_generations=5)

Time Series (3 series)

import pandas as pd, numpy as np
from veox import VeoxEvolver

def make_ts(name, length=200, seed=42):
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2020-01-01", periods=length, freq="D")
    return pd.DataFrame({"unique_id": name, "ds": dates,
                         "y": np.linspace(10, 50, length) + rng.normal(0, 3, length)})

evolver = VeoxEvolver("time_series")
evolver.fit(data=[make_ts("a", seed=1), make_ts("b", seed=2), make_ts("c", seed=3)],
            target_column="y", max_generations=5)

Trading (from CSV files)

from pathlib import Path
from veox import VeoxEvolver

evolver = VeoxEvolver("trading")
evolver.fit(
    data=[Path("eth_ohlcv.csv"), Path("btc_ohlcv.csv")],
    target_column="close",
    max_generations=3,
)

# Potential data sources:
#    https://www.kaggle.com/datasets/farhanbhamgara/eth-all?resource=download
#    https://www.kaggle.com/datasets/stijnvanleeuwen/bitcoin-intraday-ohlcv-data