QIAI Research Overview

This page will be updated as the work progresses. Briefly, I’m working on two projects: the first is more independent (and potentially publishable), and the second is part of a team in a large, interdisciplinary effort.

Chest X-ray

This project, which I’m running on the side (though I’m now prioritizing the second project), was inspired by two ongoing developments.

Jared Dunnmon, one of the post-docs at QIAI, introduced me to prior work on co-learning representations across different modalities, specifically this paper. He’s also known for his contributions to Snorkel, which provides programming tools for extracting weak labels from data sources in a way that lets the labelers’ accuracies be recovered for use in weakly supervised learning.


The task of radiologists is inherently multi-modal: they cross-reference both the X-rays and the radiology reports. Prior work like CheXpert has used automated label extractors to pull labels from the reports to supervise the X-ray model, followed by different strategies for handling the uncertain labels. Intuitively, this fails to (a) leverage all of the information available in the reports and (b) provide a measure of robustness to the degree of uncertainty (related to the topic of robust optimization). In biomedical applications, this uncertainty is especially relevant if these models are to be used in clinical diagnosis.
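To make “different strategies for handling the uncertain labels” concrete: the CheXpert labeler marks each observation as positive (1), negative (0), or uncertain (-1), and the CheXpert paper compares policies such as U-Ignore, U-Zeros, and U-Ones. Below is a minimal sketch of those three policies feeding a masked binary cross-entropy. The function names are my own, and this is an illustration rather than the CheXpert code.

```python
from math import log
from typing import List, Tuple

# CheXpert-style label encoding: 1 = positive, 0 = negative, -1 = uncertain.
UNCERTAIN = -1.0

def apply_policy(labels: List[float], policy: str) -> List[Tuple[float, float]]:
    """Map raw labels to (target, weight) pairs; weight 0.0 drops a label from the loss."""
    pairs = []
    for y in labels:
        if y == UNCERTAIN:
            if policy == "U-Ignore":
                pairs.append((0.0, 0.0))  # mask the uncertain label out entirely
            elif policy == "U-Zeros":
                pairs.append((0.0, 1.0))  # treat uncertain as negative
            elif policy == "U-Ones":
                pairs.append((1.0, 1.0))  # treat uncertain as positive
            else:
                raise ValueError(f"unknown policy: {policy}")
        else:
            pairs.append((float(y), 1.0))
    return pairs

def masked_bce(probs: List[float], pairs: List[Tuple[float, float]]) -> float:
    """Weighted binary cross-entropy, averaged over the labels that are not masked out."""
    eps = 1e-7
    total, weight_sum = 0.0, 0.0
    for p, (target, weight) in zip(probs, pairs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -weight * (target * log(p) + (1.0 - target) * log(1.0 - p))
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0
```

Each policy commits to a single hard answer for every uncertain label, which is exactly the lack of robustness to the degree of uncertainty described above.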


After discussions with Jared, this motivation led to an approach where we co-learn representations for (X-ray, report) pairs in a shared latent space so as to minimize the cosine distance (i.e., maximize the cosine similarity) between each X-ray and its report.
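A minimal sketch of the pairing objective, assuming each X-ray and each report has already been encoded into a vector of the same dimension. The encoders are left out, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def paired_cosine_loss(image_embs: List[Sequence[float]],
                       report_embs: List[Sequence[float]]) -> float:
    """Mean cosine distance (1 - cos) over matched (X-ray, report) pairs."""
    if len(image_embs) != len(report_embs):
        raise ValueError("every X-ray needs exactly one paired report")
    dists = [1.0 - cosine_similarity(x, r) for x, r in zip(image_embs, report_embs)]
    return sum(dists) / len(dists)
```

If both encoders are trained jointly, this loss alone can collapse, since mapping everything to one vector gives zero loss. Holding the report embeddings fixed (as with pretrained doc2vec targets) or adding mismatched pairs as negatives are common ways around that.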

We are reasonably confident in this approach because Stefano Ermon (my instructor for CS236!) and others have applied it to Wikipedia articles and geolocated images in WikiSatNet. Chest X-rays and radiology reports both come from significantly narrower distributions than geolocated images and Wikipedia articles!

One concern (which has already shown up in initial experiments) is that radiology reports, unlike Wikipedia articles, are highly sensitive to small changes, such as “inflated lungs” versus “de-inflated lungs”. This likely means the latent space will have highly “oscillating” regions. While I’ve started replicating the doc2vec training used in WikiSatNet, alternative ways to learn these embeddings should be considered.
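One quick way to see the problem is that a surface-level text representation barely moves under an edit like the one above. This sketch assumes a simple bag-of-words vector as a stand-in for a report embedding (it is not doc2vec) and measures the cosine similarity between two reports.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words counts; hyphenated words split into parts ("de-inflated" -> "de", "inflated")."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two reports."""
    va, vb = bow(a), bow(b)
    dot = sum(va[t] * vb[t] for t in va)  # missing tokens count as 0
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Here `bow_cosine("inflated lungs", "de-inflated lungs")` is 2/√6 ≈ 0.82, a high similarity for a one-token edit that changes the finding.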

EEG Project (Video group)

This project is part of the video group of the EEG project, which works with the largest EEG dataset to date in partnership with Stanford Hospital.


The lab has previously established baselines on the EEG dataset using annotations by technicians and clinicians. On their own, these baselines did not perform well because of inherent difficulties of the dataset, which include:

  • Scale of the dataset (it is the LARGEST)
  • Variability in annotations
  • Patient demographics
  • Seizure types
  • Environment (ICU, nursing room, etc.)
  • Idiosyncratic response/symptoms of seizure
  • False positives/negatives due to inherent inaccuracies of EEG measurements
  • Clinicians’ annotations

With EEG supervision alone, the neurologist in the group says it can account for only ~90% of labels. However, there is also a vast volume of unlabelled video from room cameras, timestamped 1:1 with the EEG measurements.
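Because the video is timestamped 1:1 with the EEG, each frame maps to a fixed window of EEG samples. A minimal sketch, assuming a shared clock and constant frame and sampling rates. The function name and the 30 fps / 200 Hz values in the usage note are hypothetical, not the dataset’s actual rates.

```python
import math
from typing import Tuple

def frame_to_eeg_window(frame_idx: int, fps: float, eeg_hz: float,
                        video_start_s: float = 0.0,
                        eeg_start_s: float = 0.0) -> Tuple[int, int]:
    """Return the [start, end) EEG sample indices covering one video frame.

    Assumes both streams share one clock; video_start_s and eeg_start_s are
    each stream's start time in seconds on that clock (hypothetical inputs).
    """
    frame_start = video_start_s + frame_idx / fps
    frame_end = frame_start + 1.0 / fps
    start = math.floor((frame_start - eeg_start_s) * eeg_hz)
    end = math.ceil((frame_end - eeg_start_s) * eeg_hz)
    return max(start, 0), max(end, 0)  # clamp frames recorded before the EEG began
```

With those assumed rates, `frame_to_eeg_window(30, 30.0, 200.0)` returns `(200, 207)`, i.e., samples 200 through 206.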


The lab formed a video group, currently headed by neurologist Dr. Chris-Lee Messer, a graduate student, and myself, to leverage this unlabelled data. The first step is training a SOTA Siamese tracker to pseudolabel the volume of videos, filtering them down to only those in which a patient is detected. TL;DR: the tracker requires only the first frame to be labelled, and it trains a correlation filter as one of the layers of a deep convolutional network (so, unlike traditional trackers, there is no online adaptation of the filter).
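The core scoring step Siamese trackers share is cross-correlating the first-frame template’s features with each new frame’s features to get a response (heat) map, whose peak is where the target most likely is. A minimal sketch, assuming small 2D grids stand in for the CNN feature maps and leaving out the learned network and the correlation-filter layer itself. The presence filter and its `threshold` and `min_fraction` parameters are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # stand-in for a single-channel feature map

def cross_correlate(search: Grid, template: Grid) -> Grid:
    """Valid-mode 2D cross-correlation: slide the template over the search map."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    return [
        [
            sum(template[a][b] * search[i + a][j + b]
                for a in range(th) for b in range(tw))
            for j in range(sw - tw + 1)
        ]
        for i in range(sh - th + 1)
    ]

def peak(response: Grid) -> Tuple[float, int, int]:
    """Highest response score and its (row, col) location in the response map."""
    return max((v, i, j) for i, row in enumerate(response) for j, v in enumerate(row))

def video_has_patient(frames: List[Grid], template: Grid,
                      threshold: float, min_fraction: float = 0.5) -> bool:
    """Hypothetical filter: keep a video if enough frames peak above threshold."""
    hits = sum(1 for f in frames if peak(cross_correlate(f, template))[0] > threshold)
    return hits / len(frames) >= min_fraction if frames else False
```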

This tracker will be used for two things:

  1. Be fed in as a feature to an existing attention model on EEG annotations
    Since the output of the tracker is a feature heat map of how closely each part of the frame cross-correlates with the labelled bounding box, one direct approach is to run the tracker in parallel with an EEG attention model, feeding in the temporal feature map at every time step.
  2. Be used to run additional pose/gesture recognition/detection models on the tracked patient
    With the labelled bounding boxes from the tracker, we can also detect a variety of idiosyncratic responses. I’m working closely with the neurologist to learn about specific seizure types/responses, some of which include patterns that we hope to pick up with machine learning. One of the machine learning researchers in the lab developed this, which comes with a data/programming model for composing the outputs of models into event queries (e.g., if a nurse is detected shortly after the model detects a seizure, then it likely is one), which can further aid diagnosis.
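An event query like the nurse example can be expressed as a temporal join over detection intervals. A minimal sketch in plain Python, not the API of the lab’s tool; the `Interval` type, `followed_by`, and the `max_gap` window are hypothetical.

```python
from typing import List, NamedTuple

class Interval(NamedTuple):
    start: float  # seconds from the start of the recording (assumed unit)
    end: float

def followed_by(events: List[Interval], responses: List[Interval],
                max_gap: float) -> List[Interval]:
    """Keep events with a response starting during the event or within max_gap seconds after it ends."""
    return [
        e for e in events
        if any(e.start <= r.start <= e.end + max_gap for r in responses)
    ]
```

For example, `followed_by(seizure_detections, nurse_detections, max_gap=30.0)` (hypothetical inputs) would keep only the seizure detections where a nurse appears during the detection or within 30 seconds after it.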

The uncertainty about whether, and how well, these approaches will work is a source of both excitement and anxiety, but that is what research is all about.
