Note: I finished my BS and MS degrees in June 2022 but have continued to work on these projects since. I wrote up these reflections, following the format suggested by my recommenders, while preparing for CS Ph.D. applications. Updates since November 2022 may not be recorded.
Professor: Jure Leskovec, worked with: Tailin Wu
Learning Efficient Hybrid Particle-continuum Representations of Non-equilibrium N-body Systems [submission]
Publication status: Under review at ICLR; also accepted to the NeurIPS 2022 AI4Science workshop
Author order: Tailin Wu, Michael Sun, H.G. Jason Chou, Pranay Reddy Samala, Sithipont Cholsaipant, Sophia Kivelson, Jacqueline Yau, Zhitao Ying, E. Paulo Alves, Jure Leskovec, Frederico Fiuza
Date: January 2022 – present
I was fully devoted to this project during the winter and spring quarters of 2021-22, taking 6 units of 399 in both quarters. I continued full-time (40h a week) at SLAC over the summer with financial support. I began my full-time job in late August but continued helping with the project until submission.
This paper has already been accepted to the NeurIPS AI4Science workshop! I will be traveling to NeurIPS with Tailin in late November to present our work.
Collaborators and their role
As the second author, I integrated the majority of the pipeline of our system (LHPC). I also ran and iterated on all the experiments for LHPC that led to the final results. Tailin implemented the solver components that were used in LHPC, and Pranay ran initial experiments for the neural models before joining the project. Jason (SLAC) prepared the data. The senior authors are Paulo, Jure, and Fiuza.
Explain the problem
Partial differential equations (PDEs) form the foundation of our understanding of many scientific phenomena. Traditionally, first-principles solvers for dynamical systems, like the particle-in-cell method, are burdened by the cost of an all-particle representation. Meanwhile, most numerical PDE solvers and neural-based PDE solvers assume a uniform representation on top of a time-space discretization of the domain.
However, certain multi-scale phenomena, like the acceleration of ions within high-intensity laser-plasma interaction, require a novel hybrid approach because neither a uniform nor an all-particle representation is sufficient for capturing their chaotic multi-scale dynamics.
Why is it important
Such multi-scale behavior is omnipresent across many systems and is important in science and engineering. Examples include tumor therapy, nuclear fusion, plasma modeling in astronomical events, and more. Laser-driven ion acceleration, a prime example of multi-scale dynamics, has direct application to cancer radiotherapy.
Why is it hard
In laser-plasma interaction, only ~0.01% of the particles are accelerated, yet they can carry 10-50% of the total energy. Thus, the system simultaneously consists of a highly dynamic subset of fast particles and a dominant background subset of thermal particles. Using a first-principles solver sacrifices speed because it expends an unnecessary amount of computation on the dominant thermal background, while a pure neural approach sacrifices accuracy on the small but highly dynamic subset. Simulating the evolution of this multi-scale dynamical system requires a new approach that achieves a good tradeoff between accuracy and speed.
Describe the solution
This project is the first-known neural-hybrid solver that simultaneously models a small but highly dynamic subset of the system (accelerated ions) with a first-principles solver and the background subset (fluid-like plasma) with efficient neural solvers. One of our key innovations addresses a new challenge that this hybrid approach introduces: the coupling between the two subsets. Namely, how do we handle particles that have accelerated past a threshold and should now be classified as highly dynamic? We designed what we believe is the most natural way to model particle-continuum coupling with machine learning: based on the fluid representation, we predict the moments of the distribution of the particles to be injected, then sample from this distribution during long-term rollout. To our knowledge, we are the first to use deep learning for particle-continuum coupling. We train the entire system end-to-end using multi-step backpropagation, and we apply the pushforward trick and other regularization strategies so the model learns from its own outputs, allowing for efficient and stable long-term rollout.
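To make the coupling idea concrete, here is a minimal, hypothetical sketch in Python. A stand-in function plays the role of the learned coupling model, mapping local fluid moments to the moments (mean, std) of the injected-particle velocity distribution, which is then sampled during rollout. The function names and toy moment arithmetic are illustrative only, not our actual architecture (which uses a neural network trained end-to-end with the solver).

```python
import math
import random

random.seed(0)

def predict_injection_moments(fluid_cell):
    # Stand-in (hypothetical) for the learned coupling model: map local
    # fluid features to moments of the velocity distribution of the
    # particles to be injected into the particle representation.
    density, momentum, energy = fluid_cell
    mean_v = momentum / density                       # bulk velocity
    var_v = max(energy / density - mean_v ** 2, 1e-8) # thermal spread
    return mean_v, math.sqrt(var_v)

def inject_particles(fluid_cell, n_inject):
    # Sample n_inject particle velocities from the predicted
    # distribution -- the coupling step used during long-term rollout.
    mean_v, std_v = predict_injection_moments(fluid_cell)
    return [random.gauss(mean_v, std_v) for _ in range(n_inject)]

cell = (1.0, 0.5, 1.25)  # toy (density, momentum, energy) for one cell
velocities = inject_particles(cell, n_inject=200)
```

Sampling (rather than emitting particles deterministically) keeps the injected population consistent with the predicted distribution over long rollouts.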
Key results and outcomes
We show via ablations that each component of the system is essential: (a) using an all-particle representation is 8x slower, (b) using an all-fluid representation increases error up to 6.8x on key quantities of interest, and (c) ignoring the coupling between the two subsets increases error up to 2.6x. We conclude that our system achieves the goal of modeling the laser-plasma multi-scale dynamics while maintaining a good tradeoff between speed and accuracy.
Impact of Work
The project is a joint collaboration between Stanford and SLAC and is currently under review. During my summer at SLAC, Fiuza told me this project helped him write an important grant application, and that ML-accelerated plasma modeling is the topic of many new grant proposals. The techniques developed are also helping third author Jason Chou build an ion accelerator for tumor therapy. If it is published, I believe the precedent we set will pave the way for many future collaborations.
Professor: Jure Leskovec, personal project, supervised by Antoine
Graph Inductive Inference of Personalized Content for Out-of-matrix Users Via Content Baskets
Publication status: Awaiting Revision Decision at Springer [preprint]
Author order: Michael Sun, Andrew Wang
Date: December 2020 – June 2021
This project began as a 224N project in winter 2020-21. I had been pursuing a side project constructing my own dataset from web-scraped data from a large anime website. I devoted all my spare time (3h+ a day) to it and built my own map-reduce data pipeline to scrape, clean, and process the data. During the course, we obtained some initial results, but due to a lack of compute, our system was far from usable. I spent my entire spring break building the website that now hosts the system (https://otakuroll.net). I single-handedly manage the full tech stack, backend, and AWS infrastructure to this day.
How I met Jure
During 224N, one of the TAs was Andrew Wang (the second author), who had also been an RA in SNAP. He first suggested we use a graph neural network to model item-item interactions. I was intrigued, and during our discussions Jure's name came up a few times. At the start of spring quarter, I reached out to Jure out of a desire to learn more about GNNs. To my surprise, Jure not only replied but introduced me to Antoine, whose dual expertise in NLP and GNNs made him the perfect mentor for my continuing work. I enrolled in independent research in spring 2020-21 to complete the project.
I submitted to Springer's data science journal, IJDSA. After multiple rebuttal-revision rounds and further revisions to the manuscript this year, we are now awaiting the final decision.
Collaborators and their role
As written in the paper, “The first (corresponding) author made the most significant contribution to all steps in the research process, including the topic conception, the code implementation and the results presentation. He built the data curation pipeline, made the new dataset, implemented the recommendation framework, carried out all experiments, designed the ablation studies, and analyzed all findings. He wrote the first draft of the manuscript. He also built and currently maintains the live website (https://otakuroll.net/) showcasing the model.”
Second author Andrew also helped edit parts of the manuscript. My other two project partners in 224N helped analyze initial findings that were presented in the 224N final project.
As written in the paper, “We would like to thank Stanford professors Dr. Jure Leskovec, Dr. Chris Manning and EPFL professor Dr. Antoine Bosselut for their mentoring and support for the project. We also want to acknowledge the Stanford undergraduate students who helped during the initial phases of the research when it was a course project. Finally, the experimental results would not have been possible without the computational resources of the Stanford Network Analysis Project while the authors were students in the group.”
Explain the problem
Guest users are common in real world applications, requiring industrial recommendation systems to handle the “cold start” problem, where no existing interactions between new users and recommendable items can be drawn from to make predictions.
Why is it important
Recommendation systems are central to our everyday lives. Every system has to grapple with the “cold start” problem at the beginning of deployment.
Why is it hard
Prior work addresses this problem by learning profiled user representations to bootstrap recommendations for new users. However, this process is often either invasive, requiring new users to submit personal data, or shallow, yielding representations too unexpressive for accurate recommendations. No existing solution addresses this problem without sacrificing either user privacy or expressiveness. Systems therefore either cannot serve quality recommendations to new users, lowering user conversion rates, or must risk the legal issues of user privacy (e.g. at TikTok, I see first-hand the measures the company takes to deal with user privacy).
Describe the solution
In this work, we propose new representations for guest users based on their "content basket": a set of seed items submitted by the user to use the service, allowing each user to be represented as a function of a collection of items. Simultaneously, we design a graph representation space in which items (nodes) are connected by edges that signify joint, written recommendations between items. We propose a graph neural network architecture that inductively learns item and inter-item (edge) representations as a combination of deep language encodings of textual content descriptions and graph embeddings learned via message passing over the edges. This enables effective generalization to items unseen during training. To evaluate our model and demonstrate a novel application, we present a new dataset for anime recommendations, AnimeULike, containing anonymized interactions between ~13k users and 10k animes, and an accompanying recommendation engine that can exclusively serve guest users. Our empirical results on AnimeULike and a standard recommender systems benchmark demonstrate significant performance improvements over previous cold start solutions that do not learn to dynamically represent new users.
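As a toy illustration of the basket idea (not our full architecture, which combines deep language encodings with GNN message passing), a guest user can be represented by aggregating the embeddings of their seed items and then scored against candidate items. The item names and 2-D embeddings below are hypothetical:

```python
def basket_representation(basket, item_emb):
    # Represent a guest user as the mean of their seed items'
    # embeddings -- "a function of a collection of items".
    dim = len(next(iter(item_emb.values())))
    rep = [0.0] * dim
    for item in basket:
        for i, x in enumerate(item_emb[item]):
            rep[i] += x / len(basket)
    return rep

def score(user_rep, item_vec):
    # Dot-product relevance score between user and candidate item.
    return sum(u * v for u, v in zip(user_rep, item_vec))

# Hypothetical item embeddings (in practice: learned from text + graph).
item_emb = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
user = basket_representation(["A", "B"], item_emb)  # -> [0.5, 0.5]
ranked = sorted(item_emb, key=lambda it: score(user, item_emb[it]),
                reverse=True)
```

Because the user representation is built purely from items, it requires no personal data from the user, which is the privacy-preserving point of the design.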
Key results and outcomes
We introduce DeepNaniNet, a graph neural recommender framework inspired by denoising autoencoders: it reconstructs user-item preferences from rich item content representations, coupled with our novel content-basket design for cold start, and yields both strong warm start and cold start performance. Our framework capably handles the user cold start setting when user content is not available. It replicates a prior state-of-the-art method's cold start results on the benchmark CiteULike, maintaining equally strong performance across in-matrix (existing) and out-of-matrix (guest) users, hence outperforming DropoutNet in the realistic real-world setting where a significant fraction of users are guests.
We also release AnimeULike, a new graph-based interaction dataset with rich user-item, item, and item-item textual content that is reflective of real-world conditions regarding data sparsity. We demonstrated DeepNaniNet's strong performance on it across both warm and cold start, notably a 7-fold improvement over the WMF baseline. Furthermore, we observed gains in cold start generalization from using jointly learnt graph representations that leverage the rich inter-item relations. Finally, we made the case for deep, differentiable language encoders by investigating the feasibility of end-to-end training. We hope our methods, results, and detailed analyses can be a positive step towards creating services that deliver meaningful, personalized, and engaging content for all users (whether old, new, or guest) in the next generation of privacy-aware recommender systems.
Impact of Work
The website I built, which currently hosts our system (https://otakuroll.net), now has over 1000 registered users, and has likely served many more guest users whom I do not track (which is the point of the project). Some of the positive reception the site has received can be found in section 6 of the preprint.
Professor: Percy Liang, worked with: Ananya Kumar
Improving Representational Continuity via Continued Pretraining
Publication status: Preparing ICML submission [doc][overleaf]
Author order: Michael Sun, Ananya Kumar, Divyam Madaan, Percy Liang
Date: March 2022 – present
I devoted 20h a week as a student during spring quarter of 2022 and attended the group meeting weekly. I graduated in June and continued with the same time commitment over the summer, though I was employed by SLAC. I started working full-time for TikTok at the end of August and have been spending my spare time on the project since.
Continual learning (CL) simulates an important real-world scenario in which data arrives in a sequence of tasks, e.g. chronologically or across high-to-low-resource subdomains. Many algorithms have been successful but incur some memory cost, either requiring a replay buffer of past data or storing past model weights for computing regularization. In recent years, foundation models have made it possible to transfer rich representations to a wide variety of downstream tasks. These two developments have motivated continual representation learning, where the desirable evaluation scheme borrows from how foundation models have traditionally been used: transferring downstream via a probing classifier. Our project introduces a simple yet effective technique, LPFT (linear probing then fine-tuning), that improves performance in both the CL and representation settings. It can be used as a standalone method to improve fine-tuning, avoiding memory costs that foundation model practitioners cannot afford, but can also be combined with CL algorithms like SI, DER, and LwF. We find that (1) finetuning on few-shot data using LPFT as a standalone method compares favorably to replay and regularization methods on the standard KNN evaluation while being nearly equal on our few-shot probing evaluation protocol, (2) coupling LPFT with replay and regularization either maintains or improves performance, (3) LPFT scales to real-world data, and (4) a variant of LPFT achieves SOTA accuracy on an NLP CL benchmark.
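For intuition, LPFT's two phases can be sketched on a toy model where the "backbone" and "head" are each a single scalar and the prediction is their product applied to the input. This is an illustrative sketch under simplified assumptions, not our implementation:

```python
def lpft(xs, ys, steps=200, lr=0.05):
    # Minimal LPFT sketch: prediction = head * (backbone * x), trained
    # with gradient descent on squared error.
    a, b = 1.0, 0.0  # "pretrained" backbone a, freshly-initialized head b
    n = len(xs)
    # Phase 1: linear probe -- freeze the backbone, train only the head.
    for _ in range(steps):
        grad_b = sum(2 * (b * a * x - y) * (a * x)
                     for x, y in zip(xs, ys)) / n
        b -= lr * grad_b
    # Phase 2: fine-tune -- update all parameters, starting from the
    # probed head instead of a random one.
    for _ in range(steps):
        grad_a = sum(2 * (b * a * x - y) * (b * x)
                     for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (b * a * x - y) * (a * x)
                     for x, y in zip(xs, ys)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

xs = [0.5, 1.0, 1.5, 2.0]
ys = [2 * x for x in xs]  # target function y = 2x
a, b = lpft(xs, ys)
```

The point of the probe-first ordering is that fine-tuning begins near a good head, so the backbone's pretrained features are distorted less, which is what makes LPFT attractive for preserving representations in CL.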
What I did
I extended Divyam's UCL codebase to implement LPFT for the supervised and unsupervised settings. I ran initial experiment sweeps for both settings and manually iterated on them from March to June, using tools I'm familiar with like TensorBoard and Ray Tune. The finetuning + LPFT results were positive from the start, but I actually spent most of my effort trying to get LPFT to work with the unsupervised Siamese network. In doing so, I implemented the training, evaluation, and logging utilities for the project. Every week, I would share and discuss the results with Ananya, and we would talk through the intuition for why things worked, which hyperparameters were more important, etc.
Sometime in June, Ananya suggested I automate my training and analysis flow further, as he had done in past projects, so I invested the month streamlining my prior flow into an automated one, building on a script from Ananya's previous repository. My script can auto-submit batches of jobs that grid search over any number of parameters, including model hyperparameters, dataset size, etc. Afterwards, a separate script summarizes all the results to quickly find the best runs; results are saved and logged to wandb for weekly discussion. The script also sweeps over multiple learning rates and seeds, then auto-groups and summarizes them for analysis. Each run has a dedicated folder named by a parameter string, and the same script can later relaunch those runs' best checkpoints with a different set of parameters.
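The core grid-expansion and run-naming logic of such a flow can be sketched in a few lines; this is a simplified illustration of the idea only (the actual script additionally submits cluster jobs and summarizes results), and the parameter names are hypothetical:

```python
import itertools

def param_string(params):
    # Deterministic folder name for a run, built from its parameters,
    # so runs can later be grouped, summarized, and relaunched.
    return "_".join(f"{k}={v}" for k, v in sorted(params.items()))

def expand_grid(grid):
    # Expand a dict of {param: [values]} into one config per
    # combination -- a grid search over any number of parameters.
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

grid = {"lr": [1e-3, 1e-2], "seed": [0, 1, 2], "dataset_frac": [0.1, 1.0]}
runs = list(expand_grid(grid))               # 2 * 3 * 2 = 12 configs
names = [param_string(p) for p in runs]      # one unique folder per run
```

Naming folders by parameter string makes the sweep self-documenting: the best run's configuration can be read off its folder name and relaunched directly.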
In June and August, I carried out experiments coupling LPFT with each CL method while refining the flow, reporting the numbers to Percy during regular meetings. I also implemented probing evaluation for representations and integrated it with the training flow. We conjectured that, with few-shot probing on the best checkpoints, regular finetuning could close the gap to other SOTA CL algorithms. I refined the flow to work with this alternate evaluation protocol and solidified our conclusions by running comprehensive experiments on CIFAR10 and CIFAR100, as well as the real-world datasets TinyImages and FMOW.
In late August, I started working full-time.
In September, Percy brought to my attention a paper adapting BERT for continual learning. Since then, I have focused on applying LPFT to the paper's NLP CL dataset (read more below).
From September to October, I compiled an outline and wrote sections 2, 3, and 4 of the current Overleaf draft, which we almost submitted to ICLR. In the end, Ananya and I decided not to, so that we could get Percy on the same page and have time to polish the writeup. The progress we have made would not have been possible without Ananya's guidance at important junctures, like deciding to focus on the supervised representation learning direction, or investing the time to adopt an automated workflow in June. I have learned, and continue to learn, a lot from Ananya's feedback on how to write and present results.
We are still making good progress and anticipate being fully prepared for a solid submission to ICML.
Why what I did was important
My implementation and experimental work drove the execution of the project; the strong results from the experiments I carried out brought us closer to achieving its goal.
Challenges I overcame
In September, Percy brought to my attention a paper adapting BERT for continual learning. I read the paper's proposed method, B-CL, in depth, but I have to admit it was complicated for me (and Ananya) to grasp at first. Moreover, the codebase implements 5 papers and over 40 baselines and was difficult to parse. I wanted to understand the method in detail, so I set breakpoints and traced line-by-line through the training of B-CL, writing comments for my own reference on which equation or quantity from the paper each variable referred to. With some tedious effort, I got through the code in detail but still felt a bit lost amongst the knowledge-sharing capsules, agreement-based routing between capsules, the pseudo-gate functions, etc.

Faced with this conceptually difficult method, I decided to take a step back and ask: what are the main modules, and why does each of them matter? I let high-level questions about our end goal be my guide. (1) Why is this model task-incremental? In the task-incremental setting, only the task-specific modules really matter, so I was able to isolate the parameters that are reset between tasks. (2) What aspects of the method were designed to protect against forgetting? It turns out there is a task mask output by a gated function, which dedicates separate neurons to separate tasks within a fully connected layer. This led me to the final question: (3) how should LPFT be applied? I realized that within each encoder layer, the task-specific mask and neurons can also be included in the definition of "head", in addition to the classification head. This led to a more general insight: we can linear-probe only the task-specific subset of weights, then fine-tune all weights. I eagerly implemented LPFT this way, and have since gained exciting results.
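The resulting recipe (probe only the task-specific subset of weights, then fine-tune everything) can be sketched abstractly. The parameter names and toy quadratic loss below are hypothetical, standing in for B-CL's task masks and classification head:

```python
def lpft_task_incremental(params, task_specific, grad_fn, lr=0.1, steps=50):
    # Generalized LPFT sketch: params maps name -> scalar weight,
    # task_specific names the subset treated as the "head"
    # (e.g. task masks + classification head in B-CL).
    # Phase 1: linear probe -- update only task-specific parameters.
    for _ in range(steps):
        grads = grad_fn(params)
        for name in task_specific:
            params[name] -= lr * grads[name]
    # Phase 2: fine-tune -- update every parameter.
    for _ in range(steps):
        grads = grad_fn(params)
        for name in params:
            params[name] -= lr * grads[name]
    return params

def grad_fn(p):
    # Toy quadratic loss: (shared + head - 3)^2; same gradient for both.
    g = 2 * (p["shared"] + p["head"] - 3.0)
    return {"shared": g, "head": g}

params = lpft_task_incremental({"shared": 0.0, "head": 0.0},
                               {"head"}, grad_fn)
```

In the sketch, phase 1 moves only the "head" toward the optimum, so phase 2 starts near a solution and barely disturbs the shared weights, mirroring why probing first protects the backbone.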
By overcoming this obstacle, I learned two things: (1) it is easier to understand a conceptually difficult method by asking guiding questions about its high-level goal, and (2) simplicity is very desirable in a flexible technique like LPFT, which couples even with a complicated method that eludes understanding at first glance, reinforcing our goal of proving LPFT's wide applicability to other CL methods.
I also familiarized myself with slurm and using a compute cluster. This makes iterating extremely efficient: progress is only limited by the job limit (15), and at any point in time I have dozens of jobs in queue. This experience instilled in me a valuable instinct to prioritize automating experiment iteration after the initial phase of trial-and-error. In future projects, I will prioritize implementing a similar flow to make the cost of iterating nearly zero.
Professor: Monica Lam, worked with: Mehrad Moradshahi
WorldWideWoZ: Creating High Quality Multilingual Dialogue Datasets
Publication status: Preparing ARR submission [overleaf]
Author order: TBD
Date: September 2021 – present
Description: Task-oriented dialogue systems have been trained only on high-resource languages like English and Chinese, while hand-collecting data in a target language is expensive because there are only a small number of human experts. Existing attempts to train dialogue systems on low-resource languages rely on machine translation of dialogue datasets in the source language, but the quality of translated data is not sufficient to make such "zero-shot" systems practical. We simulate this challenge by using Chinese as the source language and English as the target language. We investigate three settings: (0) full-shot, where we use all annotated English data; (1) few-shot, where we start with a zero-shot model and finetune on a small set of labeled English data; and (2) zero-shot, where no labeled few-shot data is allowed. In the zero-shot setting, we use a novel auto-annotation approach to leverage vast quantities of unlabeled English data. First, a zero-shot model trained on noisy translated English data is used to annotate the unlabeled corpus. Then, an error classifier filters out incorrect auto-annotations. We compare the result against the lower bound of using no auto-annotated data and the upper bound of an oracle that filters out all incorrect annotations.
What I did
I helped design, train, and deploy the auto-annotation error classifier. The dataset to train the error classifier comprises positive examples taken from the translated English data and synthesized negative examples, which I found a creative way to generate. First, I analyzed what types of errors the zero-shot model makes on gold few-shot data. Then, I built an automated pipeline that takes the model errors as input and outputs a statistical analysis of error types at the slot, value, pair, and dialogue-act levels. For example, it records the exact count of how often the model omits a (slot, value) pair during dialogue state tracking (DST).
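A minimal sketch of the omission-counting step of such an analysis pipeline, with hypothetical slot names (the real pipeline also analyzes errors at the value, pair, and dialogue-act levels):

```python
from collections import Counter

def omission_counts(gold_states, pred_states):
    # Count how often each (slot, value) pair present in the gold
    # dialogue state is omitted from the model's predicted state.
    omitted = Counter()
    for gold, pred in zip(gold_states, pred_states):
        for pair in set(gold) - set(pred):
            omitted[pair] += 1
    return omitted

# Hypothetical gold vs. predicted dialogue states for two turns.
gold = [
    [("restaurant-area", "centre"), ("restaurant-food", "thai")],
    [("restaurant-area", "centre"), ("hotel-stars", "4")],
]
pred = [
    [("restaurant-area", "centre")],
    [("hotel-stars", "4")],
]
errors = omission_counts(gold, pred)
```

Aggregating counts like these over the gold few-shot data is what reveals which error types the synthesized negative examples should imitate.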