Guide · Bring your own cohort

Use the Standard Model on your data.

How to apply the Standard Model to your own EHR data: reshape and tokenize medical events, build a labels table, and train four types of predictors on Standard Model embeddings.

Make sure you've completed the quickstart setup. Commands below assume quickstart/ as the working directory and use uv run.

§ 01 · Format & tokenize

Format & tokenize data

MEDS events + labels

To apply the Standard Model, you first need an events table in MEDS (Medical Event Data Standard) format, with one row per clinical event. ETLs are available for common source formats (OMOP, MIMIC-IV, MEDS Unsorted). The end-to-end example further describes the input data format.

The Standard Model was built to operate on multiple modalities. For simplicity, this tutorial focuses on EHR text data. A more advanced multi-modal tutorial is forthcoming.

Events data — example rows

subject_id	time	code	table	value
10000032	2022-01-15 08:00:00	`ICD10:I10`	condition	—
10000032	2022-01-15 09:30:00	`LOINC:2093-3`	lab	145.2
10001217	2022-02-01 14:00:00	`RxNorm:861004`	medication	—
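
A quick schema check before tokenizing can catch problems early. The sketch below is one guess at what is worth checking (required columns, parseable timestamps, a code on every row, chronological order within each subject). `check_events` is a hypothetical helper written for this illustration and is not part of smb_utils.

```python
import pandas as pd

REQUIRED = ["subject_id", "time", "code", "table", "value"]

def check_events(events: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: validate and sort a MEDS-style events table."""
    missing = [c for c in REQUIRED if c not in events.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = events.copy()
    # Parse timestamps; unparseable values raise instead of passing silently
    out["time"] = pd.to_datetime(out["time"])
    if out["code"].isna().any():
        raise ValueError("every event needs a code")
    # One row per clinical event, ordered by time within each subject
    return out.sort_values(["subject_id", "time"], kind="stable").reset_index(drop=True)

# Toy rows mirroring the example table, deliberately out of order
events = pd.DataFrame({
    "subject_id": [10000032, 10001217, 10000032],
    "time": ["2022-01-15 09:30:00", "2022-02-01 14:00:00", "2022-01-15 08:00:00"],
    "code": ["LOINC:2093-3", "RxNorm:861004", "ICD10:I10"],
    "table": ["lab", "medication", "condition"],
    "value": [145.2, None, None],
})
events = check_events(events)
```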

You should also have a labels table with one row per subject, listing subjects in the same order as they appear in your events table. Its columns hold the ground truth for each prediction task and are used for performance evaluation.

Labels data — example rows

subject_id	prediction_time	readmission_risk	phenotype_class	overall_survival_months	event_observed
10000032	2022-04-12 12:00:00	0	0	68.1	1
10001217	2022-06-15 08:00:00	0	2	45.8	1
10002428	2022-02-14 13:30:00	1	1	35.5	1

Your labels should be derived from linked patient data and should match the data types your clinical outcome predictors expect. prediction_time is passed as end_time during serialization, so it is the cutoff for the events used to build each embedding; think of it as the "as-of" time for the prediction.
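
To make the as-of semantics concrete, the sketch below drops events that fall after each subject's prediction_time using plain pandas. It assumes an inclusive boundary (events at exactly prediction_time are kept), which may differ from what the library does. `events_as_of` is a hypothetical helper; process_ehr_info applies its own cutoff through end_time.

```python
import pandas as pd

def events_as_of(events: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep events at or before each subject's prediction_time."""
    if labels["subject_id"].duplicated().any():
        raise ValueError("labels must have one row per subject")
    cutoffs = labels[["subject_id", "prediction_time"]]
    merged = events.merge(cutoffs, on="subject_id", how="inner", validate="many_to_one")
    keep = merged["time"] <= merged["prediction_time"]  # assumed inclusive boundary
    return merged.loc[keep, events.columns].reset_index(drop=True)

events = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "time": pd.to_datetime(["2022-01-15", "2022-05-01", "2022-02-01"]),
    "code": ["ICD10:I10", "LOINC:2093-3", "RxNorm:861004"],
})
labels = pd.DataFrame({
    "subject_id": [1, 2],
    "prediction_time": pd.to_datetime(["2022-04-12", "2022-06-15"]),
})
visible = events_as_of(events, labels)  # the 2022-05-01 lab for subject 1 is dropped
```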

We provide smb_utils.process_ehr_info to convert the MEDS events table into a tokenizable XML-like stream. Time is measured in days; all events at the same timestamp are grouped by event category with XML-style tags, in chronological order.

smb-v1-1.7b supports a max token length of 4096, yet many patient histories exceed this constraint.

Our smb-utils package offers multiple strategies as a temporary solution:

Filter events data by modality (i.e., code or table columns)
Organize events into time bins, keeping the events most recent relative to an anchor date
Flexibly define custom event categories
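
The first two strategies can be approximated in plain pandas before serialization. This is a sketch under assumptions, not the smb-utils API: `trim_history`, its parameters, and the per-subject event cap are all hypothetical, and an event count is only a rough proxy for a token budget.

```python
import pandas as pd

def trim_history(events, keep_tables, anchor, max_events):
    """Hypothetical helper: filter by table, then keep the most recent
    events at or before an anchor date for each subject."""
    subset = events[events["table"].isin(keep_tables) & (events["time"] <= anchor)]
    subset = subset.sort_values(["subject_id", "time"], kind="stable")
    # tail() keeps the last max_events rows per subject, i.e. the most recent ones
    return subset.groupby("subject_id", sort=False).tail(max_events).reset_index(drop=True)

events = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "time": pd.to_datetime(["2021-01-01", "2021-06-01", "2021-12-01", "2022-03-01"]),
    "code": ["ICD10:I10", "LOINC:2093-3", "ICD10:E11.9", "ICD10:C34.90"],
    "table": ["condition", "lab", "condition", "condition"],
})
recent = trim_history(events, ["condition"], pd.Timestamp("2022-01-01"), max_events=1)
```

Here the lab row is removed by the table filter, the 2022 condition falls after the anchor, and only the latest remaining condition is kept.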

Longer context length is on our roadmap — better support is rolling out soon.

§ 01.5 · ETL example

ETL example

Serialize one patient

Assume your events are saved as internal_cohort_meds.parquet with the schema above. Load and verify, then serialize one patient:

python

import pandas as pd
from smb_utils import process_ehr_info

# Load events (MEDS format: subject_id, time, code, table, value)
df = pd.read_parquet("internal_cohort_meds.parquet")
assert {"subject_id", "time", "code", "table", "value"}.issubset(df.columns)

# Example: Serialize a single patient history
# end_time is the as-of cutoff: later events are excluded, so the model never sees future data
input_text = process_ehr_info(
    df,
    subject_id="patient_5521",
    end_time=pd.Timestamp("2024-01-01")
)

print(input_text)
# Output:
# [2023-11-15]
# <conditions>
# ICD10:C34.90
# </conditions>
# <medications>
# RxNorm:583214
# </medications>

Labels must be aligned to the same patient order as your embedding matrix (i.e. same subject_id order as the events you passed to the model). This is important to resolve the correct end_time for each patient when creating embeddings.
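
One way to enforce this alignment is to reorder the labels by the exact list of subject ids used for embedding, and to fail loudly when a subject has no label row. `align_labels` below is a hypothetical helper for illustration, and the toy column values are made up.

```python
import pandas as pd

def align_labels(labels: pd.DataFrame, pids) -> pd.DataFrame:
    """Hypothetical helper: reorder labels to match pids, failing loudly on gaps."""
    indexed = labels.set_index("subject_id")
    if not indexed.index.is_unique:
        raise ValueError("labels must have one row per subject")
    missing = [p for p in pids if p not in indexed.index]
    if missing:
        raise KeyError(f"{len(missing)} subjects have no label row, e.g. {missing[:3]}")
    return indexed.loc[list(pids)].reset_index()

labels = pd.DataFrame({"subject_id": [3, 1, 2], "readmission_risk": [1, 0, 0]})
aligned = align_labels(labels, [1, 2, 3])  # rows now follow the order 1, 2, 3
```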

If you're using a coding agent (Claude Code, Codex, Gemini, etc.), you can derive a new labels table from your events data with the prompt below, which runs in about 1 min.

Prompt — Create labels table

Use the Standard Model Biomedicine documentation here (https://docs.standardmodel.bio/your-own-data) and my MEDS-formatted events data to write a new labels table. The outcomes I'd like to include are a binary for mortality, a multi-class category for disposition at final discharge (i.e., went home, institutionalized, death/hospice, other), and a continuous variable for survival days after last admission. Since all of these correspond to the end of each patient's history, record the prediction_time as the time stamp strictly before their last hospital discharge.

§ 02 · Shortcut

Embeddings & classifiers in one prompt.

From here, a coding agent can generally run inference, train classifiers, and report evaluation results from a single prompt. It typically finishes in 3–4 min.


Prompt — Embeddings + classifiers

Use my MEDS event data and labels table to extract embeddings using the appropriate script from this documentation page (https://docs.standardmodel.bio/your-own-data). Use the page as a guide to then train binary, multi-class, regression, and Cox models with the resulting embeddings and report each model's performance.

§ 03 · Represent

Represent data as embeddings

Last-token pooling

The script below loads the model once; get_embeddings then serializes each patient's MEDS data into text, tokenizes the text stream, and runs inference, looping over a full dataframe of patients.

We use Last Token Pooling to extract the final hidden state, which represents the patient's entire causal trajectory up to the end_time (inclusive).
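
For intuition, last-token pooling on a padded batch amounts to picking, for each sequence, the hidden state at its last non-padding position. The NumPy sketch below uses made-up arrays and assumes right padding; it illustrates the pooling step only and is not the model's own code.

```python
import numpy as np

def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token.
    Assumes right padding, so the last real token sits at index sum(mask) - 1."""
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]

# Toy batch: 2 sequences, 4 positions, 3 hidden dimensions
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],   # full-length sequence
                 [1, 1, 0, 0]])  # two real tokens, two padding positions
pooled = last_token_pool(hidden, mask)  # one vector per sequence, shape (2, 3)
```

The extraction script on this page processes one patient at a time, so there is no padding and indexing position -1 already selects the last real token.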

python

import pandas as pd
import torch
from smb_utils import process_ehr_info
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm

MODEL_ID = "standardmodelbio/smb-v1-1.7b"

# 1. Load Model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)
model.eval()

# 2. Batch Extraction (df from Format data step above)
def get_embeddings(df, pids, end_time):
    embeddings = []
    for pid in tqdm(pids):

        # Resolve per-patient end_time if a mapping is provided
        if isinstance(end_time, (pd.Series, dict)):
            patient_end_time = pd.Timestamp(end_time[pid])
        else:
            patient_end_time = end_time

        # A. Serialize (MEDS -> Text)
        text = process_ehr_info(df, subject_id=pid, end_time=patient_end_time)

        # B. Tokenize
        inputs = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=4096
        ).to(model.device)

        # C. Inference (Hidden States)
        with torch.no_grad():
            outputs = model(inputs.input_ids, output_hidden_states=True)
            # Extract last token vector
            vec = outputs.hidden_states[-1][:, -1, :].float().cpu()  # cast so .numpy() also works with bf16/fp16 weights
            embeddings.append(vec)

    return torch.cat(embeddings, dim=0).numpy()

# Execute
labels_df = pd.read_parquet("your_labels.parquet")  # labels table, one row per subject
pids = df["subject_id"].unique()
end_times = labels_df.set_index("subject_id")["prediction_time"]
X = get_embeddings(df, pids, end_times)

The output X is a 2-dimensional NumPy array of shape (n_patients, hidden_size): one fixed-length embedding per patient, summarizing that patient's medical history in df up to their end_time. Rows follow the order of pids, so align your labels table to pids before training.
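
Extraction is the slow step, so it can be worth caching X together with the subject ids it was computed for. The sketch below uses NumPy's `.npz` format; `save_embeddings` and `load_embeddings` are hypothetical helpers, and the arrays are stand-ins for the real X and pids.

```python
import os
import tempfile

import numpy as np

def save_embeddings(path, X, pids):
    """Hypothetical helper: store embeddings and their subject ids side by side."""
    X = np.asarray(X)
    pids = np.asarray(pids)
    if X.ndim != 2 or X.shape[0] != len(pids):
        raise ValueError("expected one embedding row per subject id")
    np.savez(path, X=X, pids=pids)

def load_embeddings(path):
    """Return (X, pids) in the order they were saved."""
    with np.load(path) as data:
        return data["X"], data["pids"]

# Stand-in values; in practice X and pids come from the extraction step
X_demo = np.random.default_rng(0).normal(size=(3, 8)).astype(np.float32)
pids_demo = np.array([10000032, 10001217, 10002428])
path = os.path.join(tempfile.mkdtemp(), "embeddings.npz")
save_embeddings(path, X_demo, pids_demo)
X_back, pids_back = load_embeddings(path)
```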

§ 04 · Train predictors

Train clinical predictors

4 task heads on your cohort

With X and your labels table, train multiple types of task heads to predict clinical outcomes.

A

Readmission risk — Binary

Logistic regression → ROC-AUC

After the imports and the train/test split, Task A trains a binary classifier that predicts readmission from the readmission_risk column and reports ROC-AUC.

python — setup + task a

# --- Setup ---
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, accuracy_score, mean_absolute_error
from lifelines import CoxPHFitter

# X, pids from embedding step above. Align labels to X row order:
labels_df = pd.read_parquet("your_labels.parquet")
labels = labels_df.set_index("subject_id").loc[pids].reset_index()

X_train, X_test, labels_train, labels_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)
print(f"Training on {len(X_train)} samples, testing on {len(X_test)}")


# --- Task A: Binary (Readmission Risk) ---
print("\n--- Task A: Binary Classification ---")
clf_bin = LogisticRegression(max_iter=1000)
clf_bin.fit(X_train, labels_train["readmission_risk"])
y_prob = clf_bin.predict_proba(X_test)[:, 1]
auc = roc_auc_score(labels_test["readmission_risk"], y_prob)
print(f"-> ROC-AUC: {auc:.3f}")
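
On a small test split, a single ROC-AUC value can be noisy. The sketch below is one possible way to attach a percentile bootstrap interval to it, written in plain NumPy with made-up scores. `auc_rank` and `bootstrap_auc` are hypothetical helpers and not part of the workflow above.

```python
import numpy as np

def auc_rank(y_true, y_score):
    """ROC-AUC as the probability that a positive outranks a negative (ties count half)."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """Percentile bootstrap interval over resamples of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample lacks one class, so AUC is undefined
        stats.append(auc_rank(y_true[idx], y_score[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Toy labels and scores; in practice pass the test labels and y_prob
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])
point = auc_rank(y_true, y_score)
low, high = bootstrap_auc(y_true, y_score, n_boot=200)
```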

B

Phenotype stage — Multiclass

Logistic regression → Accuracy

Task B is a multi-class classifier that predicts the class recorded in phenotype_class (for example, cancer stage 1–4). Accuracy is reported.

python — task b

# --- Task B: Multiclass (Phenotype) ---
print("\n--- Task B: Multiclass Phenotyping ---")
clf_multi = LogisticRegression(max_iter=1000)
clf_multi.fit(X_train, labels_train["phenotype_class"])
y_pred_class = clf_multi.predict(X_test)
acc = accuracy_score(labels_test["phenotype_class"], y_pred_class)
print(f"-> Accuracy: {acc:.3f}")
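
Accuracy is easier to interpret next to a trivial baseline, especially when classes are imbalanced. As a minimal sketch with toy labels, the hypothetical helper below computes the accuracy of always predicting the most frequent training class.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    classes, counts = np.unique(np.asarray(y_train), return_counts=True)
    majority = classes[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))

# Toy labels; in practice pass the train and test phenotype_class columns
y_train_demo = np.array([0, 0, 0, 1, 2, 2])
y_test_demo = np.array([0, 1, 0, 2])
baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
```

A classifier whose accuracy is close to this baseline is learning little beyond class frequencies.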

C

Survival months — Regression

Ridge regression → MAE

Task C uses the continuous overall_survival_months to predict months of survival for subjects who died. Mean absolute error is reported.

python — task c

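# --- Task C: Regression (Survival months) ---
print("\n--- Task C: Regression ---")
# Restrict to subjects with an observed death, so censored
# follow-up times are not treated as survival times
obs_train = (labels_train["event_observed"] == 1).values
obs_test = (labels_test["event_observed"] == 1).values
reg = Ridge(alpha=1.0)
reg.fit(X_train[obs_train], labels_train.loc[obs_train, "overall_survival_months"])
y_pred_reg = reg.predict(X_test[obs_test])
mae = mean_absolute_error(labels_test.loc[obs_test, "overall_survival_months"], y_pred_reg)
print(f"-> MAE: {mae:.2f}")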

D

Cox proportional hazards — Survival

PCA + Cox PH → C-Index

Task D fits a Cox proportional hazards model for risk of death on the first 10 principal components of the embeddings, using overall_survival_months as the duration and event_observed as the event indicator. The concordance index is reported.

python — task d

# --- Task D: Survival (Cox PH) ---
print("\n--- Task D: Survival Analysis ---")
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
cox_df = pd.DataFrame(X_train_pca, columns=[f"PC{i}" for i in range(10)])
cox_df["T"] = labels_train["overall_survival_months"].values
cox_df["E"] = labels_train["event_observed"].values
cph = CoxPHFitter()
cph.fit(cox_df, duration_col="T", event_col="E")
test_cox_df = pd.DataFrame(X_test_pca, columns=[f"PC{i}" for i in range(10)])
test_cox_df["T"] = labels_test["overall_survival_months"].values
test_cox_df["E"] = labels_test["event_observed"].values
c_index = cph.score(test_cox_df, scoring_method="concordance_index")
print(f"-> C-Index: {c_index:.3f}")
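
To show what the concordance index measures, here is a simplified hand computation of Harrell's C on made-up data. It ignores tied survival times and is meant as an explanation of the metric, not as a replacement for the lifelines scoring call above; `concordance_index` here is a hypothetical standalone function.

```python
import numpy as np

def concordance_index(durations, risk, observed):
    """Harrell's C: among comparable pairs, the fraction where the subject with
    the shorter survival time has the higher predicted risk (risk ties count half).
    A pair is comparable when the shorter time is an observed event."""
    durations = np.asarray(durations, dtype=float)
    risk = np.asarray(risk, dtype=float)
    observed = np.asarray(observed).astype(bool)
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        if not observed[i]:
            continue  # a censored subject cannot be the earlier event in a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

durations = [5.0, 10.0, 15.0, 20.0]
observed = [1, 1, 0, 1]        # the third subject is censored
risk = [0.9, 0.3, 0.5, 0.1]    # higher means higher predicted hazard
c = concordance_index(durations, risk, observed)  # 4 of 5 comparable pairs agree
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of risk against survival time.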

The same pattern extends to other model types, including more complex downstream models that take Standard Model embeddings as input.
