> Learning notes from [Google's ML course](https://developers.google.com/machine-learning/recommendation)

### Components of a recommendation system

- Candidate generation: a system might have multiple candidate generators depending on the use case
- Scoring: a scoring function ranks the candidates, where the scores represent relevance to a specific user
- Reranking: the last step of recommendation, where the system can enforce rules or criteria before presenting results to users, e.g., to increase diversity and freshness, ensure fairness, remove duplicates or bias, or even enforce corporate policies

There are two main scenarios for recommendation systems:

1. Home page recommendation
2. Related item recommendation

### Content-based vs. collaborative filtering

#### Content-based filtering

As the name suggests, content-based filtering relies on items similar to those a specific user has interacted with to make recommendations. For a given user vector $q$, we measure the similarity between an item $x$ and the user with a common metric, e.g., cosine similarity, dot product, or Euclidean distance. Given the user $q$, the system returns the $N$ closest items $\{x_i\}_{i=1}^N$.

- **Advantages**
    - No features from other users are required, which makes it scalable to a large number of users
    - Captures a user's niche interests, which might be underrepresented among other users
- **Disadvantages**
    - Only recommends items based on the user's historical interests
    - Hand-crafted features with domain knowledge are required

#### Collaborative filtering

CF addresses the disadvantages of content-based filtering by using both user and item similarity. The system can recommend items derived from other similar users' preferences.

The core idea is to represent user-item preference as a 2-D matrix: rows are users, columns are items. The value of each cell is given by a similarity measurement between user embedding $u$ and item embedding $v$. If we already know (part of) this feedback matrix, we effectively have labels for which user prefers which items, so we can learn both user and item embeddings by training an ML model. Therefore we no longer need to craft features manually, which is a clear advantage over content-based filtering.

##### Matrix factorisation

Learn a user matrix $U\in R^{m\times d}$ and an item matrix $V\in R^{n\times d}$ by minimizing the squared loss between each feedback matrix entry $a_{ij}$ and $u_i v_j^T$. Unobserved entries should also contribute to the loss, subject to a weight. The optimization typically uses weighted alternating least squares (WALS), which converges faster than SGD. A small code sketch follows the list below.

- **Advantages**
    - No manual features or domain knowledge are required
    - The system can recommend items to a user that are not present in his/her interaction history
    - Great starting point: it only needs a feedback matrix, no contextual features are required. In practice, this is usually used as one of the candidate generators.
- **Disadvantages**
    - The training stage needs all known users and items, which makes it hard to represent new users/items. But there are workarounds, e.g., projection in WALS, or averaging the embeddings of items with the same category and uploader to represent a new item
    - Hard to add side features (e.g., country, age, ...) to WALS, although they may be possible to add to a generalized version of WALS
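To make the WALS idea concrete, here is a minimal NumPy sketch of weighted matrix factorization on a toy feedback matrix. The matrix, the observed/unobserved weights (`w_obs`, `w_unobs`), the regularization strength, and the number of iterations are illustrative assumptions, not values from the course.

```python
import numpy as np

# Toy feedback matrix: rows = users, columns = items, 1.0 = observed interaction.
# Sizes and hyperparameters below are illustrative assumptions for this sketch.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])
m, n = A.shape
d = 2                      # embedding dimension
w_obs, w_unobs = 1.0, 0.1  # weights for observed / unobserved entries
reg = 0.05                 # L2 regularization

W = np.where(A > 0, w_obs, w_unobs)        # per-entry weights
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(m, d))     # user embeddings
V = rng.normal(scale=0.1, size=(n, d))     # item embeddings

def solve_side(fixed, weights, targets, reg):
    """Solve the weighted least squares problem for one side (users or items),
    holding the other side's embeddings fixed."""
    out = np.zeros((weights.shape[0], fixed.shape[1]))
    for i in range(weights.shape[0]):
        Wi = np.diag(weights[i])                                   # weights for row i
        lhs = fixed.T @ Wi @ fixed + reg * np.eye(fixed.shape[1])
        rhs = fixed.T @ Wi @ targets[i]
        out[i] = np.linalg.solve(lhs, rhs)
    return out

for _ in range(20):                 # alternate: fix V, solve U; fix U, solve V
    U = solve_side(V, W, A, reg)
    V = solve_side(U, W.T, A.T, reg)

scores = U @ V.T                    # predicted preference for every user-item pair
print(np.round(scores, 2))
```

Because each row update is an independent least-squares solve, WALS parallelizes well across users and items, which is part of why it scales to large feedback matrices.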
##### Deep learning based methods

- **Softmax model**
    - The main idea is to feed user features into a DNN that outputs the user embedding in $R^d$, then add a multi-class classification head assigning probabilities to the $n$ items, with ground-truth labels taken from the feedback matrix
    - The problem is that $n$ is usually very large, so we use negative sampling: combine all positive samples with some negative samples instead of computing gradients over all $n$ items
    - The weights of the last layer $V \in R^{n\times d}$ end up representing the embeddings of the $n$ items
    - It is easy to see that it's difficult to represent a new item that was not seen during training
    - Negative sampling can also be challenging. We can choose the negative samples with higher predicted probability, as they contribute more to the gradients; these are called hard negatives
- **Two-tower model**
    - Unlike the softmax model, the two-tower model uses two DNNs, one for the user and one for the item. Each tower learns the embedding for its side, and the dot product of the two embeddings gives the similarity
    - The advantage is that each tower can compute an embedding for a user or item even when it is new
    - A common loss choice is the N-pair loss: select one positive sample and $n-1$ negative samples, then apply cross-entropy
    - Each tower should have its own parameters, unless the two inputs represent similar types of items

##### Comparing DNN and matrix factorization

|                       | MF                                             | DNN                                              |
| --------------------- | ---------------------------------------------- | ------------------------------------------------ |
| Side features         | Hard to incorporate                            | Easy to incorporate                              |
| Cold start            | Hard                                           | Easier to handle new users/items                 |
| Folding               | Handles unobserved entries easily              | May suffer from folding; needs negative sampling |
| Training scalability  | Scales to hundreds of millions of users/items  | Expensive to train                               |
| Inference scalability | Static embeddings, precomputed, easy to serve  | Embeddings can be precomputed                    |

- Matrix factorization is usually the better choice for large corpora. It is easier to scale, cheaper to query, and less prone to folding.
- DNN models can better capture personalized preferences, but are harder to train and more expensive to query. DNN models are preferable to matrix factorization for **scoring** because they can use more features to better capture relevance. Also, it is usually acceptable for DNN models to fold, since you mostly care about ranking a pre-filtered set of candidates assumed to be relevant.

### Model serving

After training a model, e.g., a matrix factorization or two-tower DNN, we need to serve it in production to give recommendations. There are normally three stages in the pipeline:

- **Retrieval**: given a query, retrieve a pool of candidates from the index
- **Scoring**: score the candidates and rank them according to different criteria
- **Reranking**: apply reranking after scoring to address issues like freshness, diversity, and fairness

#### Retrieval

Given a query, we retrieve its embedding and then fetch similar embeddings from a precomputed index, normally using approximate nearest neighbors. Check [[Vector indexing and approximate search]] for details.

#### Scoring

In practice, you will have multiple candidate generators, so you need another model to rank the merged candidates, for example a DNN trained as a ranking model (learning-to-rank).

> But why can't we just use the candidate generators to score?

*There are two reasons. First, different candidate generators produce candidates in different ways, so their scores are not always comparable. Second, candidate generators do not necessarily use features rich enough to rank items accurately, since they need to cope with a large corpus of items and therefore trade richness for efficiency.*
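As a simplified picture of the scoring stage, the sketch below merges candidates from two hypothetical generators, deduplicates them, and ranks them with a single scoring function over richer features. The generator outputs, the feature vectors, and the fixed linear scorer standing in for a trained ranking DNN are all assumptions made for illustration.

```python
import numpy as np

# Candidates from two hypothetical generators (item_id -> retrieval score).
# Scores from different generators are not directly comparable, which is one
# reason a separate scoring model is needed.
mf_candidates = {"item_1": 0.91, "item_7": 0.84, "item_3": 0.62}
two_tower_candidates = {"item_3": 4.2, "item_9": 3.7, "item_1": 3.1}

candidates = set(mf_candidates) | set(two_tower_candidates)  # merge + deduplicate

# Hypothetical per-item features that the cheap generators do not use,
# e.g., [predicted watch time, item age in days, past CTR].
item_features = {
    "item_1": np.array([3.5, 2.0, 0.12]),
    "item_3": np.array([1.2, 30.0, 0.05]),
    "item_7": np.array([5.1, 1.0, 0.20]),
    "item_9": np.array([2.4, 90.0, 0.08]),
}

def score(features: np.ndarray) -> float:
    """Stand-in for a learned ranking model (e.g., a DNN trained with a
    learning-to-rank objective); here just a fixed linear function."""
    weights = np.array([0.6, -0.01, 2.0])
    return float(features @ weights)

# Items ordered by the unified score, ready for the reranking stage.
ranked = sorted(candidates, key=lambda item: score(item_features[item]), reverse=True)
print(ranked)
```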
**Positional bias**

Positional bias arises when the position at which a recommendation is shown influences the user's choices, independently of relevance. In practice, this needs to be accounted for so the feedback signal is not misleading.

### Reranking

We'll mainly discuss three dimensions:

- Freshness
    - Retrain more frequently to include new items
    - Use warm starting to accelerate retraining
    - Add freshness-related features, such as video age or time since last view
    - Use a two-tower model to cope with new users/items, or heuristic approaches such as averaging embeddings for matrix factorization
- Diversity
    - Use multiple candidate generators trained with different rules
    - Use multiple rankers with different objective functions
    - Rerank items using metadata such as genre or category
- Fairness
    - Use different models for underserved populations
    - Monitor and evaluate model performance across different geographic groups
    - Include diverse perspectives in design and development
    - Train on comprehensive datasets

### Evaluation

#### Online metrics

- Click-through rate (CTR)
- Conversion rate (e.g., purchases, watch time, completed sessions)
- Hit rate

#### Offline metrics

- Precision, recall, F1-score
- Mean average precision (MAP)
- Mean reciprocal rank (MRR)
- Normalized discounted cumulative gain (NDCG)

See more in [[Similarity search metrics]]
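As a quick reference for the offline metrics above, here is a small Python sketch computing precision@k, reciprocal rank, and (binary-relevance) NDCG for a single ranked list. The ranked list and the relevance judgments are made-up toy data.

```python
import math

# Toy example: a ranked list produced by the recommender and the ground-truth
# relevant items for one user (made-up data for illustration).
ranked = ["item_7", "item_2", "item_9", "item_4", "item_1"]
relevant = {"item_2", "item_4", "item_8"}

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item (0 if no relevant item appears)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

print(precision_at_k(ranked, relevant, 3))   # 0.333... (1 relevant item in top 3)
print(reciprocal_rank(ranked, relevant))     # 0.5 (first hit at rank 2)
print(ndcg_at_k(ranked, relevant, 5))        # ~0.50
```

These are per-user values; in practice you would average them over all evaluation users (giving MAP, MRR, and mean NDCG).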