NLP x RecSys

UPDATE 2/21: I hacked together a prototype with a well-known baseline algorithm. Try it out now!

UPDATE 2/16: We handed in our project proposal, so it’s time to start building models! We’re building on the codebase of a prior NeurIPS paper.

UPDATE 2/07: We finished scraping our dataset and are converging on a paper we want to emulate.

(Currently, this is a developing CS224N team project that may spin off into something more.)

Inspiration

Most recommendation engines are trained on a large user-item (preference) matrix. Combined with matrix-completion techniques like robust PCA (RPCA), this gives rise to traditional algorithms like collaborative filtering.
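For reference, here’s a minimal sketch of that traditional baseline: factoring a toy user-item matrix with plain SGD on the observed entries. The matrix, latent dimension, and hyperparameters are all placeholder assumptions, not anything from our project:

```python
# Minimal matrix-factorization sketch: factor a user-item ratings matrix R
# into user factors U and item factors V by SGD on observed entries only.
# The toy matrix and hyperparameters below are illustrative assumptions.
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)  # 0 = unrated
mask = R > 0                                # train only on observed ratings
k, lr, reg = 2, 0.01, 0.02                  # latent dim, step size, L2 weight

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(5000):
    err = mask * (R - U @ V.T)              # error on observed entries only
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 1))                 # completed matrix = predictions
```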

However, I believe that for higher forms of art like anime, these independence assumptions don’t hold: anime recommendations inherently travel by word of mouth. Internet forums create huge chain-reaction effects that inflate the viewership of bad anime, while good ones often slip under the radar.

To take the best of both worlds, we want to pioneer an NLP-driven recommendation system that uses representations from Transformer encoders like BERT as the basis for content embeddings.
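To make the content-embedding step concrete, here’s a hedged sketch assuming the Hugging Face transformers package; the model name, mean-pooling choice, and example descriptions are placeholders, not our final pipeline:

```python
# Sketch of the content-embedding idea: mean-pool BERT's last hidden states
# into one vector per show description. Model and pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Return one mean-pooled BERT vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore pad tokens
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, 768)

show_vecs = embed(["A boy trains to become the strongest ninja in his village.",
                   "Two brothers study alchemy to restore their bodies."])
print(show_vecs.shape)  # torch.Size([2, 768])
```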

Approach

Our goal is to first train a good language model on Internet forum data, Crunchyroll reviews, and show descriptions. That becomes the basis for an actual recommendation engine, which takes user-submitted sets of favorite shows that look like:

user 1: {naruto, bleach, fullmetal alchemist, etc.}

user 2: {hunter x hunter, one piece, naruto, etc.}

We can embed users by the shows they enjoyed, showcasing our representations, then feed those embeddings to the downstream tasks of reconstructing relevances and retrieving nearest neighbors (see the sketch after this list). There are many ways to go about this, but we want to combine ideas from:

  • Matrix factorization
    • Dense latent vectors of users/shows
    • Support for cold start (new) users/shows
  • Representation Learning
    • Encodings of shows’ content
    • Hybrid loss functions
      • Same ones as word2vec for embeddings
      • Regression losses like RMSE for relevance
      • Reconstruction losses for latent vectors
  • Online Learning
    • Retraining latent representations
    • Fine-tuning embedding layer over time
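Here’s the retrieval sketch promised above: a user vector is the mean of their favorite shows’ content vectors, and recommendations are the cosine nearest neighbors among unseen shows. The show_vec dict and its random vectors are placeholders; in practice the vectors would come from the encoder sketched earlier:

```python
# Sketch of the retrieval step: embed a user as the mean of their favorite
# shows' vectors, then rank unseen shows by cosine similarity. All names
# and vectors here are hypothetical placeholders.
import numpy as np

def user_embedding(favorites, show_vec):
    """Average the embeddings of a user's favorite shows."""
    return np.mean([show_vec[s] for s in favorites], axis=0)

def recommend(user_vec, show_vec, exclude, top_k=3):
    """Rank shows the user hasn't seen by cosine similarity."""
    scores = {}
    for title, vec in show_vec.items():
        if title in exclude:
            continue
        scores[title] = vec @ user_vec / (np.linalg.norm(vec) * np.linalg.norm(user_vec))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

rng = np.random.default_rng(0)
show_vec = {t: rng.normal(size=768) for t in
            ["naruto", "bleach", "fullmetal alchemist", "hunter x hunter", "one piece"]}
favs = {"naruto", "bleach", "fullmetal alchemist"}
print(recommend(user_embedding(favs, show_vec), show_vec, exclude=favs))
```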

The challenge will be supporting many different ideas while still optimizing one fully end-to-end network. We hope to come up with something innovative!
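As a very rough starting point, here’s a sketch of how two of the losses above could fold into a single differentiable objective. Everything here (the encoder/decoder architecture, the bilinear relevance scorer, the 0.1 loss weight, the dimensions) is an assumption for illustration, not our final design:

```python
# Hedged sketch of a hybrid objective: a regression (MSE) loss on predicted
# user-show relevance plus a reconstruction loss on the latent show vectors,
# optimized end to end. Architecture and weights are illustrative only.
import torch
import torch.nn as nn

class HybridRecModel(nn.Module):
    def __init__(self, content_dim=768, latent_dim=64):
        super().__init__()
        self.encode = nn.Linear(content_dim, latent_dim)     # content -> latent
        self.decode = nn.Linear(latent_dim, content_dim)     # latent -> content
        self.score = nn.Bilinear(latent_dim, latent_dim, 1)  # user-show relevance

    def forward(self, user_content, show_content):
        u, v = self.encode(user_content), self.encode(show_content)
        return self.score(u, v).squeeze(-1), self.decode(v)

model = HybridRecModel()
mse = nn.MSELoss()
user_x, show_x = torch.randn(8, 768), torch.randn(8, 768)  # dummy batch
target = torch.rand(8)                                     # dummy relevances

pred, recon = model(user_x, show_x)
loss = mse(pred, target) + 0.1 * mse(recon, show_x)  # relevance + reconstruction
loss.backward()
print(float(loss))
```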
