πŸ“¦ Working with Multiple Datasets

A comprehensive guide to uploading single, multiple, and large datasets across all VEOX problem types.


Overview

VEOX evaluates every pipeline candidate across all provided datasets and averages the fitness scores. This produces more robust, generalizable solutions because:

  • High-variance candidates are penalized β€” the engine uses a Lower Confidence Bound (mean βˆ’ λ·std) to discourage lucky one-off scores.
  • Overfitting to a single dataset is harder β€” a pipeline must perform well across distributions.
  • Real-world scenarios often involve heterogeneous data (e.g., multiple hospitals, different markets, seasonal vs. non-seasonal time series).

Input Patterns

All categories that accept client-uploaded data support these input patterns:

Single DataFrame

import pandas as pd
from veox import VeoxEvolver

evolver = VeoxEvolver("binary")
evolver.fit(data=df, target_column="target", max_generations=5)  # df: any DataFrame containing a "target" column

List of DataFrames

evolver.fit(
    data=[df1, df2, df3],
    target_column="target",
    max_generations=5,
)

CSV File Path (String)

evolver.fit(
    data="path/to/train.csv",
    target_column="target",
    max_generations=5,
)

CSV File Path (pathlib.Path)

from pathlib import Path

evolver.fit(
    data=Path("path/to/train.csv"),
    target_column="target",
    max_generations=5,
)

Mixed List

from pathlib import Path

evolver.fit(
    data=[df1, "train_b.csv", Path("train_c.csv")],
    target_column="target",
    max_generations=5,
)
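Whatever pattern is used, every entry ultimately has to become a DataFrame before upload. A minimal sketch of that coercion step, assuming the SDK does something equivalent internally (the helper name `normalize_inputs` is illustrative, not part of the public API):

```python
from pathlib import Path

import pandas as pd


def normalize_inputs(data) -> list:
    """Coerce a DataFrame, a CSV path, or a mixed list into a list of DataFrames."""
    items = data if isinstance(data, list) else [data]
    frames = []
    for item in items:
        if isinstance(item, pd.DataFrame):
            frames.append(item)
        elif isinstance(item, (str, Path)):
            frames.append(pd.read_csv(item))  # load CSV from disk
        else:
            raise TypeError(f"Unsupported input type: {type(item)!r}")
    return frames
```

An unsupported type (e.g. a dict) fails fast with a TypeError instead of surfacing as a confusing server-side error.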

Category Support Matrix

Category          | Client Upload | Server-side Data         | Multi-Dataset
------------------|---------------|--------------------------|------------------
binary            | ✅            | —                        | ✅
regression        | ✅            | —                        | ✅
outlier           | ✅            | ✅ (bundled)             | ✅
time_series       | ✅            | ✅ (seed cache)          | ✅
trading           | ✅            | ✅ (golden set)          | ✅
optimization      | —             | ✅ (objective functions) | ✅ (server-side)
signal_separation | ✅ (ZIP/dir)  | ✅ (seed cache)          | ✅

Column Requirements by Category

Category          | Required Column            | Notes
------------------|----------------------------|---------------------------------------------------------------
binary            | Target column (0/1)        | Specified via target_column=
regression        | Target column (continuous) | Specified via target_column=
outlier           | Auto-detected              | Checks: is_outlier, outlier, class, target, then the last column
time_series       | unique_id, ds, y           | Auto-formatted if columns are missing
trading           | close (price)              | Auto-detected: close, *_close, mid_price
signal_separation | N/A                        | Uses mixture/ground_truth CSV pairs in the ZIP
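For the outlier category, the documented detection order is simple enough to sketch directly (illustrative; the SDK's actual implementation may differ):

```python
import pandas as pd

# Documented priority order for the outlier label column
PREFERRED_LABELS = ["is_outlier", "outlier", "class", "target"]


def detect_outlier_column(df: pd.DataFrame) -> str:
    """Return the first preferred label column present, else fall back to the last column."""
    for name in PREFERRED_LABELS:
        if name in df.columns:
            return name
    return df.columns[-1]
```

If none of the preferred names are present, the last column is assumed to be the label, so column order matters for unlabeled-looking frames.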

Large Dataset Handling

When passing file paths, the SDK reads CSV data from disk as a string and streams it to the server β€” avoiding a full DataFrame round-trip through memory:

# Good β€” streams from disk
evolver.fit(data="huge_dataset.csv", target_column="target")

# Also good β€” pathlib.Path works too
from pathlib import Path
evolver.fit(data=Path("huge_dataset.csv"), target_column="target")

If a DataFrame exceeds 50MB when serialized to CSV, the SDK will print a warning suggesting file paths instead.
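To decide up front which pattern to use, you can estimate the serialized size yourself. A sketch, assuming the same 50 MB threshold the SDK warns at (`csv_size_bytes` and `should_use_file_path` are illustrative helpers, not SDK functions):

```python
import pandas as pd

SIZE_WARN_BYTES = 50 * 1024 * 1024  # the SDK's documented 50 MB warning threshold


def csv_size_bytes(df: pd.DataFrame) -> int:
    """Size of the DataFrame once serialized to CSV, in bytes."""
    return len(df.to_csv(index=False).encode("utf-8"))


def should_use_file_path(df: pd.DataFrame) -> bool:
    """True when the frame is large enough that a file path would avoid the warning."""
    return csv_size_bytes(df) > SIZE_WARN_BYTES
```

When this returns True, write the frame to disk once with `df.to_csv(path, index=False)` and pass the path to `fit()` instead.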

Progress Bars

When uploading multiple datasets, the SDK shows a Rich progress bar:

β ‹ [2/5] Uploading fraud_dataset  ━━━━━━━━━━━━━━━━━━━━ 40%
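The SDK renders this with the Rich library. The bar itself is easy to reproduce with the standard library, which also shows how the `[2/5]` and `40%` figures are derived (spinner omitted; `render_progress` is an illustrative helper, not part of the SDK):

```python
def render_progress(index: int, total: int, name: str, width: int = 20) -> str:
    """Render one upload-progress line: completed fraction as a bar plus a percentage."""
    frac = index / total
    filled = int(width * frac)
    bar = "━" * filled + " " * (width - filled)
    return f"[{index}/{total}] Uploading {name}  {bar} {int(frac * 100)}%"


print(render_progress(2, 5, "fraud_dataset"))
```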

Server-Side Storage

All uploaded data lives in /tmp/veox_uploads/. This directory:

  • Has a 2 GB ceiling β€” the server auto-cleans old uploads when exceeded.
  • Is ephemeral β€” in Docker containers, it's cleared on restart.
  • Is cleaned after each job completes.
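The 2 GB ceiling implies an eviction policy along these lines, oldest uploads first (a sketch of the behavior described above, assuming mtime-ordered cleanup; not the server's actual code):

```python
from pathlib import Path

CEILING_BYTES = 2 * 1024**3  # the documented 2 GB ceiling


def prune_uploads(upload_dir: Path, ceiling: int = CEILING_BYTES) -> None:
    """Delete the oldest files until the directory fits under the ceiling."""
    files = [p for p in upload_dir.glob("**/*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime)  # oldest first
    total = sum(p.stat().st_size for p in files)
    for p in files:
        if total <= ceiling:
            break
        total -= p.stat().st_size
        p.unlink()
```

The practical upshot is the same either way: treat uploads as disposable and keep the canonical copies of your datasets on the client side.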

How Fitness Aggregation Works

  1. Each pipeline candidate is evaluated on every dataset independently.
  2. Scores are averaged (mean_score = mean(all_dataset_scores)).
  3. Standard deviation is calculated (std = std(all_dataset_scores)).
  4. The GrandmasterEvaluator computes the final score using a Lower Confidence Bound:
    final_score = mean_score - λ·std - complexity_penalty - time_penalty
    
  5. This means high-variance candidates (great on one dataset, terrible on another) get penalized.
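The aggregation above can be reproduced with NumPy. λ and the two penalties are left as explicit parameters because their server-side values are internal to the GrandmasterEvaluator:

```python
import numpy as np


def lcb_fitness(dataset_scores, lam=1.0, complexity_penalty=0.0, time_penalty=0.0):
    """Lower Confidence Bound aggregation: mean - lam*std - penalties."""
    scores = np.asarray(dataset_scores, dtype=float)
    mean_score = scores.mean()
    std = scores.std()
    return mean_score - lam * std - complexity_penalty - time_penalty
```

With equal means, the steady candidate wins: `lcb_fitness([0.80, 0.80, 0.80])` exceeds `lcb_fitness([0.95, 0.65, 0.80])`, which is exactly the high-variance penalty of step 5.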

Complete Examples

Binary Classification (2 datasets)

from sklearn.datasets import make_classification
import pandas as pd
from veox import VeoxEvolver

# Two datasets with different class balances
X1, y1 = make_classification(n_samples=800, n_features=20, weights=[0.7, 0.3], random_state=42)
df1 = pd.DataFrame(X1, columns=[f"f{i}" for i in range(20)]); df1["target"] = y1

X2, y2 = make_classification(n_samples=800, n_features=20, weights=[0.9, 0.1], random_state=99)
df2 = pd.DataFrame(X2, columns=[f"f{i}" for i in range(20)]); df2["target"] = y2

evolver = VeoxEvolver("binary")
evolver.fit(data=[df1, df2], target_column="target", max_generations=5)

Time Series (3 series)

import pandas as pd, numpy as np
from veox import VeoxEvolver

def make_ts(name, length=200, seed=42):
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2020-01-01", periods=length, freq="D")
    return pd.DataFrame({"unique_id": name, "ds": dates,
                         "y": np.linspace(10, 50, length) + rng.normal(0, 3, length)})

evolver = VeoxEvolver("time_series")
evolver.fit(data=[make_ts("a", seed=1), make_ts("b", seed=2), make_ts("c", seed=3)],
            target_column="y", max_generations=5)

Trading (from CSV files)

from pathlib import Path
from veox import VeoxEvolver

evolver = VeoxEvolver("trading")
evolver.fit(
    data=[Path("eth_ohlcv.csv"), Path("btc_ohlcv.csv")],
    target_column="close",
    max_generations=3,
)

# Potential data sources:
#    https://www.kaggle.com/datasets/farhanbhamgara/eth-all?resource=download
#    https://www.kaggle.com/datasets/stijnvanleeuwen/bitcoin-intraday-ohlcv-data