📦 Working with Multiple Datasets
A comprehensive guide to uploading single, multiple, and large datasets across all VEOX problem types.
Overview
VEOX evaluates every pipeline candidate across all provided datasets and averages the fitness scores. This produces more robust, generalizable solutions because:
- High-variance candidates are penalized: the engine uses a Lower Confidence Bound (mean − λ·std) to discourage lucky one-off scores.
- Overfitting to a single dataset is harder: a pipeline must perform well across distributions.
- Real-world scenarios often involve heterogeneous data (e.g., multiple hospitals, different markets, seasonal vs. non-seasonal time series).
Input Patterns
All categories that accept client-uploaded data support these input patterns:
Single DataFrame
import pandas as pd
from veox import VeoxEvolver
# df is a pandas DataFrame that already contains the "target" column
evolver = VeoxEvolver("binary")
evolver.fit(data=df, target_column="target", max_generations=5)
List of DataFrames
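A minimal sketch of this pattern, assuming the same `fit` arguments used in the other examples in this guide. The two small DataFrames are made-up illustration data, and `fit_on_datasets` is a hypothetical wrapper (not part of the SDK) that defers the `veox` import until the function is actually called:

```python
import pandas as pd

# Two illustrative datasets that share the same schema (values are made up).
df1 = pd.DataFrame({"f0": [0.1, 0.4, 0.9, 0.3],
                    "f1": [1.0, 0.2, 0.5, 0.7],
                    "target": [0, 1, 1, 0]})
df2 = pd.DataFrame({"f0": [0.8, 0.6, 0.2, 0.5],
                    "f1": [0.3, 0.9, 0.1, 0.4],
                    "target": [1, 1, 0, 0]})
datasets = [df1, df2]


def fit_on_datasets(frames):
    """Hypothetical wrapper: pass a list of DataFrames to fit().

    Calling this requires the veox SDK and a reachable server.
    """
    from veox import VeoxEvolver  # lazy import; only needed when called

    evolver = VeoxEvolver("binary")
    evolver.fit(data=frames, target_column="target", max_generations=5)
    return evolver
```

Each DataFrame in the list is treated as a separate dataset, so every one of them needs the target column.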
CSV File Path (String)
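A minimal sketch of this pattern, assuming a plain string path is passed to `data=` exactly as in the large-dataset example later in this guide. The tiny CSV is written to a temporary directory only so the path points at a real file; `fit_on_csv` is a hypothetical wrapper, not an SDK function:

```python
import os
import tempfile

import pandas as pd

# Write a tiny illustrative CSV so there is a real file to point at.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "train.csv")  # a plain str, not a pathlib.Path
pd.DataFrame({"f0": [0.1, 0.9], "target": [0, 1]}).to_csv(csv_path, index=False)


def fit_on_csv(path):
    """Hypothetical wrapper: pass a CSV file path (str) to fit().

    Calling this requires the veox SDK and a reachable server.
    """
    from veox import VeoxEvolver  # lazy import; only needed when called

    evolver = VeoxEvolver("binary")
    evolver.fit(data=path, target_column="target", max_generations=5)
    return evolver
```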
CSV File Path (pathlib.Path)
from pathlib import Path
evolver.fit(
    data=Path("path/to/train.csv"),
    target_column="target",
    max_generations=5,
)
Mixed List
from pathlib import Path
evolver.fit(
    data=[df1, "train_b.csv", Path("train_c.csv")],
    target_column="target",
    max_generations=5,
)
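One way to picture a mixed list is that every entry ends up as CSV text before upload, whatever its original type. The sketch below illustrates that idea only. It is not the SDK's implementation, and `to_csv_text` and `normalize` are hypothetical names:

```python
import tempfile
from pathlib import Path

import pandas as pd


def to_csv_text(item):
    """Hypothetical helper: turn one dataset entry into CSV text."""
    if isinstance(item, pd.DataFrame):
        return item.to_csv(index=False)
    if isinstance(item, (str, Path)):
        return Path(item).read_text(encoding="utf-8")
    raise TypeError(f"Unsupported dataset type: {type(item).__name__}")


def normalize(data):
    """Accept a single entry or a list of entries; return a list of CSV strings."""
    items = data if isinstance(data, list) else [data]
    return [to_csv_text(item) for item in items]


# Demo with one DataFrame, one str path, and one pathlib.Path.
df1 = pd.DataFrame({"f0": [0.1, 0.9], "target": [0, 1]})
tmp = Path(tempfile.mkdtemp())
df1.to_csv(tmp / "train_b.csv", index=False)
df1.to_csv(tmp / "train_c.csv", index=False)

csv_texts = normalize([df1, str(tmp / "train_b.csv"), tmp / "train_c.csv"])
```

All three entries produce the same kind of payload, which is why the input types can be mixed freely within one list.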
Category Support Matrix
| Category | Client Upload | Server-side Data | Multi-Dataset |
|---|---|---|---|
| `binary` | Yes | No | Yes |
| `regression` | Yes | No | Yes |
| `outlier` | Yes | Yes (bundled) | Yes |
| `time_series` | Yes | Yes (seed cache) | Yes |
| `trading` | Yes | Yes (golden set) | Yes |
| `optimization` | No | Yes (objective functions) | Yes (server-side) |
| `signal_separation` | Yes (ZIP/dir) | Yes (seed cache) | Yes |
Column Requirements by Category
| Category | Required Column | Notes |
|---|---|---|
| `binary` | Target column (0/1) | Specified via `target_column=` |
| `regression` | Target column (continuous) | Specified via `target_column=` |
| `outlier` | Auto-detected | Checks: `is_outlier`, `outlier`, `class`, `target`, then last column |
| `time_series` | `unique_id`, `ds`, `y` | Auto-formatted if columns are missing |
| `trading` | `close` (price) | Auto-detected: `close`, `*_close`, `mid_price` |
| `signal_separation` | N/A | Uses mixture/ground_truth CSV pairs in ZIP |
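The auto-detection orders in the Notes column can be read as a first-match search. The sketch below assumes exact, case-sensitive name matching and treats `*_close` as a suffix match. Both are assumptions, and `guess_outlier_column` and `guess_price_column` are hypothetical names rather than SDK functions:

```python
import pandas as pd

# Candidate label names for the outlier category, in the documented order.
OUTLIER_CANDIDATES = ["is_outlier", "outlier", "class", "target"]


def guess_outlier_column(df):
    """Return the first matching candidate, falling back to the last column."""
    for name in OUTLIER_CANDIDATES:
        if name in df.columns:
            return name
    return df.columns[-1]


def guess_price_column(df):
    """Return the price column using the order: close, *_close, mid_price."""
    if "close" in df.columns:
        return "close"
    for col in df.columns:
        if str(col).endswith("_close"):
            return col
    if "mid_price" in df.columns:
        return "mid_price"
    return None  # nothing recognizable; the caller must decide what to do


labels = guess_outlier_column(pd.DataFrame({"a": [1], "class": [0], "target": [1]}))
price = guess_price_column(pd.DataFrame({"open": [1], "btc_close": [2], "mid_price": [3]}))
```

Running a check like this locally before uploading shows which column would be picked up when a dataset has several plausible candidates.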
Large Dataset Handling
File Paths (Recommended for >50MB files)
When passing file paths, the SDK reads CSV data from disk as a string and streams it to the server, avoiding a full DataFrame round-trip through memory:
# Good: streams from disk
evolver.fit(data="huge_dataset.csv", target_column="target")
# Also good: pathlib.Path works too
from pathlib import Path
evolver.fit(data=Path("huge_dataset.csv"), target_column="target")
If a DataFrame exceeds 50MB when serialized to CSV, the SDK will print a warning suggesting file paths instead.
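To check ahead of time whether a DataFrame is likely to trigger that warning, its serialized size can be measured locally. This is a rough sketch: it assumes "50MB" means 50 × 1024 × 1024 bytes and that the size is measured on UTF-8 CSV output without the index, neither of which this guide confirms. The function names are hypothetical:

```python
import pandas as pd

# Assumed threshold: 50 MiB. The SDK's exact cutoff may differ.
SIZE_LIMIT_BYTES = 50 * 1024 * 1024


def csv_size_bytes(df):
    """Size of the DataFrame when serialized to CSV (no index), in bytes."""
    return len(df.to_csv(index=False).encode("utf-8"))


def should_use_file_path(df, limit=SIZE_LIMIT_BYTES):
    """True if the serialized DataFrame exceeds the limit."""
    return csv_size_bytes(df) > limit


small = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})
small_is_large = should_use_file_path(small)
```

Note that `to_csv` builds the whole string in memory, so for very large frames it is cheaper to write the CSV to disk once and pass the path.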
Progress Bars
When uploading multiple datasets, the SDK shows a Rich progress bar.
Server-Side Storage
All uploaded data lives in /tmp/veox_uploads/. This directory:
- Has a 2 GB ceiling: the server auto-cleans old uploads when exceeded.
- Is ephemeral: in Docker containers, it's cleared on restart.
- Is cleaned after each job completes.
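The ceiling behaviour can be pictured as oldest-first eviction. The guide only says that old uploads are cleaned, so the policy below (delete by modification time until the directory fits) is an assumption, and `prune_oldest` is a hypothetical function, not server code. The demo runs against a throwaway temporary directory with a 150-byte limit instead of 2 GB:

```python
import os
import tempfile
from pathlib import Path


def prune_oldest(directory, max_bytes):
    """Delete the oldest files until the total size is at most max_bytes.

    Returns the names of the deleted files, oldest first.
    """
    files = sorted(
        (p for p in Path(directory).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    total = sum(p.stat().st_size for p in files)
    removed = []
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        removed.append(path.name)
    return removed


# Demo: three 100-byte files with increasing timestamps (a.csv is oldest).
demo_dir = Path(tempfile.mkdtemp())
for age, name in enumerate(["a.csv", "b.csv", "c.csv"]):
    path = demo_dir / name
    path.write_bytes(b"x" * 100)
    os.utime(path, (1_000 + age, 1_000 + age))

removed = prune_oldest(demo_dir, max_bytes=150)
remaining = sorted(p.name for p in demo_dir.iterdir())
```

With a 2 GB ceiling the equivalent call would use `max_bytes=2 * 1024**3`.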
How Fitness Aggregation Works
- Each pipeline candidate is evaluated on every dataset independently.
- Scores are averaged (mean_score = mean(all_dataset_scores)).
- Standard deviation is calculated (std = std(all_dataset_scores)).
- The GrandmasterEvaluator computes a final score using the Lower Confidence Bound (mean_score − λ·std).
- This means high-variance candidates (great on one dataset, terrible on another) get penalized.
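The aggregation steps above can be sketched in a few lines of NumPy. This illustrates the Lower Confidence Bound (mean minus λ times the standard deviation) and is not the GrandmasterEvaluator's code. The value λ = 1.0, the use of the population standard deviation (NumPy's default), the example scores, and the name `lower_confidence_bound` are all assumptions:

```python
import numpy as np


def lower_confidence_bound(scores, lam=1.0):
    """Aggregate per-dataset scores as mean - lam * std (population std)."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean() - lam * scores.std())


# Made-up per-dataset scores for two candidates.
steady = [0.80, 0.82, 0.78]  # consistent across datasets
spiky = [0.99, 0.95, 0.55]   # higher mean, but one weak dataset

steady_lcb = lower_confidence_bound(steady)
spiky_lcb = lower_confidence_bound(spiky)
```

The spiky candidate has the higher plain average, yet the steady one ends up with the higher Lower Confidence Bound, which is the penalty for variance described in the last bullet.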
Complete Examples
Binary Classification (2 datasets)
from sklearn.datasets import make_classification
import pandas as pd
from veox import VeoxEvolver
# Two datasets with different class balances
X1, y1 = make_classification(n_samples=800, n_features=20, weights=[0.7, 0.3], random_state=42)
df1 = pd.DataFrame(X1, columns=[f"f{i}" for i in range(20)])
df1["target"] = y1
X2, y2 = make_classification(n_samples=800, n_features=20, weights=[0.9, 0.1], random_state=99)
df2 = pd.DataFrame(X2, columns=[f"f{i}" for i in range(20)])
df2["target"] = y2
evolver = VeoxEvolver("binary")
evolver.fit(data=[df1, df2], target_column="target", max_generations=5)
Time Series (3 series)
import numpy as np
import pandas as pd
from veox import VeoxEvolver

def make_ts(name, length=200, seed=42):
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2020-01-01", periods=length, freq="D")
    return pd.DataFrame({"unique_id": name, "ds": dates,
                         "y": np.linspace(10, 50, length) + rng.normal(0, 3, length)})

evolver = VeoxEvolver("time_series")
evolver.fit(data=[make_ts("a", seed=1), make_ts("b", seed=2), make_ts("c", seed=3)],
            target_column="y", max_generations=5)
Trading (from CSV files)
from pathlib import Path
from veox import VeoxEvolver
evolver = VeoxEvolver("trading")
evolver.fit(
    data=[Path("eth_ohlcv.csv"), Path("btc_ohlcv.csv")],
    target_column="close",
    max_generations=3,
)
# Potential data sources:
# https://www.kaggle.com/datasets/farhanbhamgara/eth-all?resource=download
# https://www.kaggle.com/datasets/stijnvanleeuwen/bitcoin-intraday-ohlcv-data