Welcome to the API docs for RIDDLE!

Check version

import riddle
  > Hello, World
  > My name is RIDDLE 2.0.0

RIDDLE (Race and ethnicity Imputation from Disease history with Deep LEarning) is an open-source Python 2 library that uses deep learning to impute race and ethnicity information in anonymized electronic medical records (EMRs). RIDDLE lets you (1) build models that estimate race and ethnicity from clinical features, and (2) interpret trained models to describe how specific features contribute to predictions. The library implements the methods introduced in “RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning” (PLOS Computational Biology, 2018).

Compared to alternative methods (e.g., scikit-learn/Python, glm/R), RIDDLE is designed to handle large, high-dimensional datasets efficiently. It trains models using a parallelized TensorFlow-under-Keras backend, and avoids memory overflow by preprocessing data in conjunction with batch-wise training, so the full preprocessed dataset never has to be held in memory at once.
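To make the batch-wise idea concrete, here is a minimal sketch in plain Python. It is not RIDDLE's implementation: `iter_batches` is a hypothetical helper, and the toy records and the `str.split` preprocessing step are assumptions chosen only for illustration.

```python
from itertools import islice


def iter_batches(records, preprocess, batch_size):
    """Lazily yield preprocessed batches; only one batch is held in memory."""
    it = iter(records)
    while True:
        # Pull at most `batch_size` raw records from the underlying iterator.
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # Preprocess only the records in the current batch.
        yield [preprocess(record) for record in chunk]


# Toy records (hypothetical); `records` could equally be an open file
# that is read line by line.
records = ["a b", "c", "d e", "a e", "b"]
batches = list(iter_batches(records, preprocess=str.split, batch_size=2))
```

Because the generator preprocesses one chunk at a time, peak memory depends on the batch size rather than on the size of the dataset. In a real training loop you would consume the generator batch by batch instead of collecting it into a list as done here for inspection.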

RIDDLE uses Keras to specify and train the underlying deep neural networks, and DeepLIFT to compute feature-to-class contribution scores. The default architecture is a deep multi-layer perceptron (deep MLP) that takes binary-encoded features and targets. However, you can specify any neural network architecture (e.g., LSTM, CNN) and data format by writing your own model_module files (see Configuration)!
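As an illustration of what "binary-encoded features and targets" can look like, the sketch below builds multi-hot feature vectors and one-hot target vectors from toy records. The helper names (`build_index`, `multi_hot`, `one_hot`), the feature codes, and the class labels are all hypothetical; RIDDLE's own encoding may differ in detail.

```python
def build_index(items):
    """Map each distinct item to a column index (sorted for reproducibility)."""
    return {item: i for i, item in enumerate(sorted(set(items)))}


def multi_hot(codes, index):
    """Binary feature vector: 1 where a code is present, 0 elsewhere."""
    vec = [0] * len(index)
    for code in codes:
        vec[index[code]] = 1
    return vec


def one_hot(label, index):
    """Binary target vector with a single 1 at the label's class index."""
    vec = [0] * len(index)
    vec[index[label]] = 1
    return vec


# Hypothetical toy records: (list of feature codes, class label).
records = [(["A01", "B02"], "class_x"), (["B02", "C03"], "class_y")]

feature_index = build_index(code for codes, _ in records for code in codes)
class_index = build_index(label for _, label in records)

x = [multi_hot(codes, feature_index) for codes, _ in records]
y = [one_hot(label, class_index) for _, label in records]
```

Each row of `x` has one column per distinct feature code, and each row of `y` has one column per class, which is the kind of input a multi-layer perceptron with a softmax output layer typically expects.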


Shell commands:

# Install HDF5 (only non-pip dependency)
apt-get install libhdf5-serial-dev

# Option 1) Clone from GitHub
git clone --recursive git://
cd riddle
pip install -r requirements.txt
apt-get install libhdf5-serial-dev

# Option 2) Install using pip
pip install git+
pip install git+

Install the following libraries/software:


High-level API

Quickstart commands

# run in repository directory

Template script for a toy pipeline

import numpy as np
from sklearn.metrics import accuracy_score

from riddle import emr, models

# get data
x, y = emr.get_data(...)
x_train, y_train, x_val, y_val, x_test, y_test = emr.get_k_fold_partition(x, y, ...)

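# specify model (assumes MLP is exposed by the imported riddle.models package;
# adjust the reference if the class lives in a submodule)
model = models.MLP(...)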

# train and evaluate model
model.train(x_train, y_train, x_val, y_val)

y_probas = model.predict_proba(x_test)
y_pred = np.argmax(y_probas, axis=1)
print('accuracy: {:.4f}'.format(accuracy_score(y_test, y_pred)))
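
The last three lines of the template turn class probabilities into predicted labels and score them against integer class labels. The dependency-free sketch below walks through the same two steps with made-up numbers; `argmax_rows` and `accuracy` are hypothetical stand-ins for `np.argmax(..., axis=1)` and `accuracy_score`, not part of RIDDLE.

```python
def argmax_rows(probas):
    """Index of the largest value in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probas]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # float() keeps the division exact under Python 2 as well as Python 3.
    return float(correct) / len(y_true)


# Made-up probabilities for 4 samples and 3 classes.
y_probas = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],  # tie between classes 0 and 1: class 0 is chosen
]
y_test = [0, 2, 1, 1]

y_pred = argmax_rows(y_probas)
print('accuracy: {:.4f}'.format(accuracy(y_test, y_pred)))
```

Three of the four toy predictions match, so the script prints `accuracy: 0.7500`. Note that the comparison only makes sense when `y_test` holds integer class indices rather than one-hot vectors.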


Module            Description
riddle/           Reads in data files and preprocesses the data
riddle/           Computes and summarizes DeepLIFT feature contribution scores
riddle/           Plots ROC curves and computes ROC AUC scores
riddle/           Implements parameter tuning functions
riddle/models/    Base Model class for defining model architectures
riddle/models/    MLP architecture (used in the PLOS CB paper)


Script Description
Runs parameter tuning
Runs experiments (model training and evaluation) using k-fold cross-validation
Runs a pipeline to compute DeepLIFT scores
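
For readers unfamiliar with k-fold cross-validation with a separate validation set, the following sketch shows one way to derive train, validation, and test indices for each fold. It is a hypothetical illustration, not the behavior of `emr.get_k_fold_partition`: the function name `k_fold_partition`, the interleaved fold assignment, and the choice of the "next" fold as the validation set are all assumptions.

```python
def k_fold_partition(n_samples, k, fold):
    """Split sample indices into train/val/test index lists for one fold.

    Assumptions for this sketch: fold `fold` is the test set, the next fold
    (wrapping around) is the validation set, and the remaining folds form
    the training set. Requires k >= 3 so that all three sets are non-empty.
    """
    # Assign samples to folds in an interleaved fashion: 0, k, 2k, ... go to fold 0.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    val_fold = (fold + 1) % k
    test_idx = folds[fold]
    val_idx = folds[val_fold]
    train_idx = sorted(
        i for j, f in enumerate(folds)
        if j not in (fold, val_fold) for i in f
    )
    return train_idx, val_idx, test_idx


k = 5
partitions = [k_fold_partition(10, k, fold) for fold in range(k)]
train_idx, val_idx, test_idx = partitions[0]
```

Across the `k` partitions every sample appears in the test set exactly once, so the per-fold test metrics can be averaged into a single cross-validated estimate.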


Ji-Sung Kim
Princeton University
hello (at) (technical inquiries)

Xin Gao, Associate Professor
King Abdullah University of Science and Technology

Andrey Rzhetsky, Edna K. Papazian Professor
University of Chicago
andrey.rzhetsky (at) (research inquiries)

License & Attribution

All media (including but not limited to designs, images and logos) are copyrighted by Ji-Sung Kim (2018).

Project Python code (explicitly excluding media) is licensed under the Apache License 2.0. If you would like to use or modify this project or any code presented here, please include the notice and license files, and cite the paper.