Data input formats

gamfit.fit() and Model.predict() accept several kinds of rectangular (tabular) Python input. The Python layer normalises each input to a (headers, rows) pair before crossing the Rust FFI boundary.

Supported input types

Input	Notes
`pandas.DataFrame`	Columns taken from `df.columns`.
`polars.DataFrame`	Columns taken from `df.columns`.
`pyarrow.Table`	Columns taken from `table.column_names`.
`numpy.ndarray` (1-D or 2-D)	Columns auto-named `x0`, `x1`, …. 1-D becomes a single column `x0`.
`Mapping[str, sequence]`	Keys are column names, values are 1-D sequences.
`list[Mapping[str, Any]]`	List of records. The full set of keys across rows defines the column order; each row must contain every key.
`Sequence[Sequence]` (2-D)	Columns auto-named `x0`, `x1`, …. All rows must have the same width.

pandas, polars, and pyarrow are optional dependencies: they are detected at runtime via _try_import and are not required at install time.

Equivalent inputs for a two-column dataset:

import pandas as pd
import numpy as np
import pyarrow as pa

pd.DataFrame({"y": [1.0, 2.0, 3.0], "x": [0.0, 1.0, 2.0]})
pa.table({"y": [1.0, 2.0, 3.0], "x": [0.0, 1.0, 2.0]})
{"y": [1.0, 2.0, 3.0], "x": [0.0, 1.0, 2.0]}
[{"y": 1.0, "x": 0.0}, {"y": 2.0, "x": 1.0}, {"y": 3.0, "x": 2.0}]
np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])  # columns become x0, x1
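
To make the (headers, rows) shape concrete, the following is a minimal sketch of how the three pure-Python inputs could be normalised. It is not gamfit's implementation: the function name to_headers_rows is hypothetical, and taking the column order from the first record is an assumption made for illustration.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def to_headers_rows(data: Any) -> tuple[list[str], list[list[Any]]]:
    """Illustrative only: normalise pure-Python inputs to (headers, rows)."""
    # Mapping[str, sequence]: keys are column names, values are columns.
    if isinstance(data, Mapping):
        headers = list(data.keys())
        columns = [list(data[name]) for name in headers]
        lengths = {len(col) for col in columns}
        if len(lengths) > 1:
            raise ValueError(f"column lengths differ: {sorted(lengths)}")
        # Transpose columns into rows.
        return headers, [list(row) for row in zip(*columns)]

    if isinstance(data, Sequence) and not isinstance(data, (str, bytes)):
        records = list(data)

        # list[Mapping]: every record must carry the same keys.
        if records and all(isinstance(rec, Mapping) for rec in records):
            headers = list(records[0].keys())  # assumption: first row sets the order
            for i, rec in enumerate(records):
                if set(rec.keys()) != set(headers):
                    raise ValueError(f"row {i} keys do not match {headers}")
            return headers, [[rec[name] for name in headers] for rec in records]

        # Sequence[Sequence]: auto-name columns x0, x1, ...
        width = len(records[0]) if records else 0
        for i, row in enumerate(records):
            if len(row) != width:
                raise ValueError(f"row {i} has width {len(row)}, expected {width}")
        return [f"x{j}" for j in range(width)], [list(row) for row in records]

    raise TypeError(f"unsupported input type: {type(data).__name__}")


headers, rows = to_headers_rows({"y": [1.0, 2.0, 3.0], "x": [0.0, 1.0, 2.0]})
print(headers)  # ['y', 'x']
print(rows)     # [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
```

Under these assumptions the dict form and the list-of-records form yield the same pair, while the nested-list form differs only in its auto-generated headers.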

Validation rules

Cells are stringified before crossing the FFI boundary. stringify_cell enforces two hard rules:

None is rejected.
NaN floats and empty strings are rejected.

Booleans become "1" / "0". Numbers use repr(). Other values are passed through str(...). The engine handles numeric coercion from strings; explicit casting to float is unnecessary.
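
The rules above can be restated as a short sketch. This is not the library's stringify_cell: the name stringify_cell_sketch is hypothetical, and raising ValueError for rejected cells is an assumption, since the exception type for rejected cells is not stated here.

```python
import math
from typing import Any


def stringify_cell_sketch(value: Any) -> str:
    """Illustrative restatement of the documented rules, not gamfit's code."""
    if value is None:
        # Assumption: rejection is signalled with ValueError.
        raise ValueError("None is not allowed; drop or impute missing values")
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, float) and math.isnan(value):
        raise ValueError("NaN is not allowed; drop or impute missing values")
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str) and value == "":
        raise ValueError("empty strings are not allowed")
    return str(value)


print(stringify_cell_sketch(True))      # 1
print(stringify_cell_sketch(1.5))       # 1.5
print(stringify_cell_sketch("site_a"))  # site_a
```

The boolean check comes first because isinstance(True, int) is true in Python; without that ordering, True would be rendered as "True" by repr() instead of "1".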

Column lengths must agree. Mismatches raise ValueError before the engine sees the data.

String columns are accepted for terms like group(site) and are encoded by the engine.

Missing data

gamfit does not impute missing values. Drop incomplete rows or impute values upstream:

df.dropna(subset=[...]) in pandas.
sklearn.impute or equivalent.
For survival models, ensure the entry/exit/event columns are complete.
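
For list-of-records input without pandas, incomplete rows can be dropped with a small helper before calling fit or predict. This is a minimal sketch: the helpers is_missing and drop_incomplete are hypothetical, and the entry/exit/event records are made-up illustration data.

```python
import math
from collections.abc import Iterable, Mapping
from typing import Any


def is_missing(value: Any) -> bool:
    """True for the cell values the validation rules reject."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value == ""


def drop_incomplete(
    records: Iterable[Mapping[str, Any]], required: Iterable[str]
) -> list[Mapping[str, Any]]:
    """Keep only records where every required column is present and non-missing."""
    required = list(required)
    return [
        rec
        for rec in records
        if not any(is_missing(rec.get(name)) for name in required)
    ]


# Made-up survival-style records; two of the three have an incomplete exit time.
records = [
    {"entry": 0.0, "exit": 5.0, "event": 1},
    {"entry": 0.0, "exit": None, "event": 0},
    {"entry": 1.0, "exit": float("nan"), "event": 1},
]
clean = drop_incomplete(records, ["entry", "exit", "event"])
print(len(clean))  # 1
```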

What `predict()` returns

By default, the return container matches the input kind, falling back to the training kind, then to dict. Override with return_type=:

`return_type`	Returns
`None` (default)	Input kind, else training kind, else `dict`.
`"dict"`	`dict[str, list]`.
`"numpy"`	2-D `numpy.ndarray` with columns in fixed order.
`"pandas"`	`pandas.DataFrame`.
`"polars"`	`polars.DataFrame`.
`"pyarrow"`	`pyarrow.Table`.

model.predict(test_df, return_type="dict")
model.predict(test_df, return_type="numpy")
model.predict(test_df, return_type="pandas")
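
The fallback order in the table can be sketched as a small resolver. This only illustrates the documented precedence, not gamfit's internals: resolve_return_kind is a hypothetical name, and representing "no recognised kind" as None and rejecting unknown return_type values with ValueError are both assumptions.

```python
from typing import Optional

# The return_type values listed in the table above.
VALID_KINDS = {"dict", "numpy", "pandas", "polars", "pyarrow"}


def resolve_return_kind(
    return_type: Optional[str],
    input_kind: Optional[str],
    training_kind: Optional[str],
) -> str:
    """Explicit return_type wins, then input kind, then training kind, then dict."""
    if return_type is not None:
        if return_type not in VALID_KINDS:
            raise ValueError(f"unknown return_type: {return_type!r}")
        return return_type
    for kind in (input_kind, training_kind):
        if kind is not None:
            return kind
    return "dict"


print(resolve_return_kind(None, "pandas", "polars"))   # pandas
print(resolve_return_kind(None, None, "polars"))       # polars
print(resolve_return_kind(None, None, None))           # dict
print(resolve_return_kind("numpy", "pandas", "polars"))  # numpy
```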

Array-returning model classes

Transformation-normal models and Bernoulli marginal-slope models return a 1-D numpy.ndarray of shape (n_samples,) by default. Passing id_column= or return_type= switches them to a two-column table.

# 1-D numpy by default
z = model.predict(test_df)                       # shape (n,)

# Two-column table when id_column or return_type is set
df = model.predict(test_df, id_column="patient", return_type="pandas")
z = df["z"].to_numpy()                           # transformation-normal

The value column is named z for transformation-normal output and mean for Bernoulli marginal-slope output. Converting the two-column table with np.asarray(...) produces an array of shape (n, 2), not (n,); extract the value column explicitly when a 1-D array is wanted.

Identifier columns

A column that is not part of the model can be carried through to the output by naming it with id_column=:

preds = model.predict(
    [
        {"patient_id": "P001", "x": 1.5},
        {"patient_id": "P002", "x": 2.5},
    ],
    id_column="patient_id",
    return_type="dict",
)
# preds = {"patient_id": ["P001", "P002"], "eta": [...], "mean": [...]}

The id column is excluded from the model, and its values may be of any type that stringify_cell accepts.
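
Because the id column travels with the predictions, a dict result can be re-keyed by identifier without relying on row order. The sketch below uses a placeholder dict in the documented shape; the eta and mean numbers are made up for illustration and are not real model output.

```python
# Placeholder result in the shape shown above; the values are not real output.
preds = {
    "patient_id": ["P001", "P002"],
    "eta": [0.0, 0.0],
    "mean": [0.5, 0.5],
}

id_column = "patient_id"
value_columns = [name for name in preds if name != id_column]

# Re-key the column-oriented dict by identifier.
by_patient = {
    pid: {name: preds[name][i] for name in value_columns}
    for i, pid in enumerate(preds[id_column])
}

print(by_patient["P002"]["mean"])  # 0.5
```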