**Summary:** In this deep dive, we’ll explore how modern recommendation systems are built and adapted. We start with foundational techniques – from classic collaborative filtering and matrix factorization to advanced two-tower neural networks and learning-to-rank methods. We then examine how different domains (ads, social networks, local businesses, and events) impose unique constraints and trade-offs on model architecture, loss functions, and objectives. Finally, we compare offline evaluation metrics with online experimentation, focusing on how A/B tests validate real-world performance. Let’s get started on designing recommendation systems that fit the problem at hand!

## **Overview of Foundational Recommendation Techniques**

_Summary:_ This section reviews the fundamental building blocks of recommender systems.

- We’ll discuss **collaborative filtering** (how recommendations can be based on user-user or item-item similarities) and **content-based filtering** (recommending by matching item features with user profiles).
- We then introduce **matrix factorization** as a way to learn latent factors from interaction data, and **two-tower** (and multi-tower) **neural architectures** that scale recommendations with learned embeddings.
- We also cover **learning-to-rank** approaches, explaining pointwise vs. pairwise vs. listwise training and how they improve ranking quality.

By the end of this section, you’ll see how these foundational methods relate and how they set the stage for more domain-specific designs.

### **Collaborative Filtering (CF) – Learning from Similar Users or Items**

At its core, collaborative filtering uses the **wisdom of the crowd** to make recommendations. The idea is that users with similar tastes in the past will have similar preferences in the future. In practice, collaborative filtering can be **user-based** (find users similar to the target user and recommend what those users liked) or **item-based** (find items similar to those the user liked before, and recommend those). A classic example: if Alice and Bob have both highly rated _The Matrix_ and _Inception_, and Alice also loved _Interstellar_, a user-based CF system might recommend _Interstellar_ to Bob because Alice (a similar user) liked it.

User-based CF is flexible at capturing niche interests and can surface serendipitous finds, because it looks for other users whose tastes resemble yours. However, it becomes slow as the user base grows, and it stumbles when a brand-new user with no history shows up (cold start), since there are no neighbors to lean on. Conversely, item-based CF flips the focus: by precomputing item–item similarities, it is fast at scale and more stable over time (new ratings won’t wildly shift the model). The trade-off is that it can feel less adventurous, mostly nudging you toward things very similar to what you’ve already consumed, with less room for off-beat, cross-genre surprises.

**Memory-based CF:** Traditional CF algorithms operate directly on the user-item interaction data (often a sparse matrix of users vs. items). They compute similarity between users or between items using measures like **cosine similarity** or **Pearson correlation** on their rating vectors. For instance, an _item-based_ CF system might observe that people who like **Item A** often also like **Item B**, so these two items have a high similarity. Then, if a user liked Item A, the system recommends Item B.
These methods are simple and **interpretable** (you can often explain a recommendation by saying “users who liked X also liked Y”), but they can struggle when data is sparse or when scaling to millions of users or items (computing pairwise similarities for everything is expensive).

> **Example – Item-Item Collaborative Filtering in Python:** Let’s illustrate a basic item-based CF. Suppose we have 3 users and 4 items, and we know which items each user liked (1) or didn’t interact with (0). We can compute item similarity and then recommend new items to a user based on items they already like:

```
import numpy as np

# Toy user-item interaction matrix (1 = user liked the item, 0 = not liked)
# Users: 0,1,2; Items: 0,1,2,3
# e.g., user0 likes items 0 and 2; user1 likes 2 and 3; user2 likes 0, 1, 3.
interactions = np.array([
    [1, 0, 1, 0],  # user0
    [0, 0, 1, 1],  # user1
    [1, 1, 0, 1]   # user2
], dtype=float)

# Compute cosine similarity between item vectors
item_vectors = interactions.T  # shape: 4 items x 3 users
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
item_vectors = item_vectors / np.where(norms == 0, 1e-9, norms)  # normalize, avoid /0
similarity = item_vectors.dot(item_vectors.T)
np.fill_diagonal(similarity, 0)  # not interested in self-similarity

# Choose a target user (user0) and compute scores for items they haven't liked
target_user = 0
liked = np.where(interactions[target_user] > 0)[0]  # items user0 liked
scores = {}
for item in range(similarity.shape[0]):
    if interactions[target_user, item] == 0:  # not liked yet
        # Sum similarity of this candidate item to each item the user likes
        scores[item] = similarity[item, liked].sum()

# Recommend the item with highest score:
best_item = max(scores, key=scores.get)
print(f"Recommend item {best_item} to user {target_user} (score = {scores[best_item]:.2f})")
```

Running this code might output something like: **“Recommend item 3 to user 0 (score = 1.00)”**, meaning our simple recommender suggests Item 3 for User 0. This makes sense because user0 liked items 0 and 2, and item 3 is highly similar to those (item 3 was liked by users who also liked item 0 and item 2 in our toy data). This demonstrates item-based CF: we used other users’ behavior implicitly by focusing on item similarity.

**Pros and Cons:** Collaborative filtering can uncover complex, unexpected patterns (“serendipity”) because it’s not limited by item attributes – it purely uses the **collective behavior** of users. It’s domain-agnostic: whether recommending movies, products, or friends, as long as you have user interaction data, CF can work. However, CF has a notorious **cold start** problem: when a new user joins or a new item is added, there’s not enough data to make recommendations. In those cases, CF doesn’t know what to do (since no one has interacted with that new item, or the new user hasn’t interacted with anything). Another challenge is sparsity – in large systems, each user might have interacted with only a tiny fraction of items, making meaningful similarity computations difficult for long-tail users or items. Later, we’ll discuss how content-based methods and hybrid approaches address these cold start issues.

### **Content-Based Filtering – Learning from Item Attributes**

Content-based filtering recommends items **similar to those the user already likes, based on item features**. Unlike CF, it does **not** rely on other users at all – only on the user’s own history and the attributes of items.
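Before walking through a concrete movie example, here is a minimal sketch of the profile-matching idea. The items, feature vectors, and weights below are made up for illustration; a real system would derive them from TF-IDF text features or learned embeddings rather than hand-coded genre scores:

```
import numpy as np

# Hypothetical item feature vectors (columns: sci-fi, drama, comedy, action)
item_features = {
    "Inception":    np.array([1.0, 0.2, 0.0, 0.8]),
    "Interstellar": np.array([1.0, 0.6, 0.0, 0.3]),
    "The Hangover": np.array([0.0, 0.1, 1.0, 0.2]),
}

# Items the user already liked
liked = ["Inception"]

# Build the user profile by averaging the feature vectors of liked items
profile = np.mean([item_features[i] for i in liked], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Score every unseen item by similarity to the profile, then rank
scores = {name: cosine(profile, vec)
          for name, vec in item_features.items() if name not in liked}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

With only one liked item, the profile is just that item’s vector; as the user interacts more, the profile becomes a (possibly weighted) average of everything they liked.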
For example, if you liked the movie _Inception_, a content-based system might recommend other sci-fi movies directed by Christopher Nolan, because it knows _Inception_’s genre, director, cast, etc., and finds similar movies with those features. In content-based methods, we represent each item by a set of features or descriptive attributes (keywords, categories, embeddings of text, etc.), and we build a **user profile** by aggregating the features of items the user has interacted with. If a user reads a lot of sports articles, their profile might be biased towards the “sports” category, so the system will recommend other sports articles. One can use classic information retrieval techniques (like TF-IDF vectors for text content) or modern embeddings (e.g. using a language model to embed item descriptions) to represent items in a feature space. Recommendations then boil down to finding items whose feature vectors are most similar to the user’s profile vector.

**Advantages:** Content-based filtering doesn’t suffer as much from cold start for new items. If you can describe a new item’s attributes, you can immediately start recommending it to users with matching tastes. For example, a new restaurant on Yelp can be recommended if its cuisine, location, price range, etc. match what the user likes, even if no user has reviewed it yet. Also, content-based models are inherently **explainable**: you can often say _“We recommended this book because it’s in the same genre as other books you liked”_.

**Drawbacks:** Pure content-based systems can **narrow the experience** – they tend to show more of what the user is already interested in, possibly missing serendipitous finds. There’s a risk of the user being stuck in a “filter bubble” (only seeing very similar content). Content-based methods also require good item metadata; if items lack rich features or the features don’t truly capture what makes items appealing, recommendations will suffer. They can’t easily introduce something completely new or unexpected because everything is based on similarity to past liked content. Furthermore, content-based systems struggle with new users (cold start on the user side), since with no history the system doesn’t know the user’s preferences (this problem is somewhat symmetric to CF’s new-item cold-start issue).

**Hybrid approaches:** Many real-world systems combine collaborative filtering and content-based ideas to get the best of both. For instance, a “**hybrid**” recommender might start by using content-based rules for new users (until they’ve made enough interactions) and then gradually switch to collaborative filtering, which can pick up more nuanced patterns once there’s sufficient data. In fact, Netflix’s famous prize-winning algorithm in 2009 was a blend of collaborative filtering and content-based models, and modern deep learning recommenders often incorporate both interaction data and content features.

### **Matrix Factorization – Latent Factors from Ratings**

Matrix factorization (MF) is a **model-based** collaborative filtering technique that became hugely popular after the Netflix Prize (2006–2009) demonstrated its effectiveness. The basic idea is to take the large user-item interaction matrix (which is mostly empty) and factorize it into two smaller matrices: one matrix of **user factors** and one matrix of **item factors**, such that their product reconstructs the original interactions as well as possible.
In formula terms, if $R$ is the user-item matrix (with $R_{u,i}$ as user $u$’s rating for item $i$, or 0 if none), we seek $U$ (user latent matrix) and $V$ (item latent matrix) such that $R \approx U \times V^T$, where row $U_u$ is the feature vector for user $u$, and row $V_i$ is the feature vector for item $i$. Each user and item is thus represented in a **shared latent feature space** of dimension $k$ (where $k$ is much smaller than the number of users or items). The dot product $U_u \cdot V_i$ produces the system’s predicted score for user $u$ on item $i$.

Intuitively, MF discovers **hidden factors** that explain ratings. For movies, the factors might correspond to interpretable concepts like “action vs. drama”, “serious vs. humorous”, or other abstract dimensions of taste – but the model learns them automatically without explicit labels. A user’s factor vector encodes how much they align with each latent dimension, and an item’s factor vector encodes how much it exhibits that dimension. A high dot product means the user and item match on these hidden preferences.

**Why is MF useful?** First, it’s **compact** – instead of storing huge similarity matrices, you learn a relatively small set of vectors. It also handles sparsity well by learning from the global patterns in the data (filling in the missing entries with the model’s estimates). MF can generalize to predict user-item combinations that were never seen in training, by leveraging the latent factor structure. In fact, from a retrieval perspective, matrix factorization is essentially learning an **embedding for users and items** such that similar users and items have higher dot-product similarity in that embedding space. It’s closely related to the idea of a two-tower neural network (which we’ll discuss next), but MF is usually a simpler linear model (often optimized with techniques like **alternating least squares (ALS)** or **stochastic gradient descent (SGD)** on a loss function like squared error or logistic loss).

To make this concrete: imagine a simplified scenario with 2 latent factors. After factorizing, we might interpret factor 1 as an “action-film affinity” and factor 2 as an “artistic vs. mainstream preference” (we only infer these semantics post-hoc). A user $u$ might have a vector $U_u = [0.9, -0.2]$, meaning they love action and prefer artsy films a bit more than mainstream. A movie $i$ might have $V_i = [0.8, -0.1]$ (an action-packed blockbuster), and another movie $j$ has $V_j = [0.1, -0.9]$ (a slow artistic indie film). The dot product $U_u \cdot V_i \approx 0.9 \times 0.8 + (-0.2) \times (-0.1) = 0.74$ is higher than $U_u \cdot V_j \approx 0.9 \times 0.1 + (-0.2) \times (-0.9) = 0.27$, predicting that user $u$ will like movie $i$ (the action blockbuster) more than movie $j$. In this way, MF transforms the discrete, sparse interaction matrix into a **dense continuous feature space** where recommendation is a matter of nearest neighbors or high scores in that learned space.

Modern recommenders often extend MF with additional bells and whistles (bias terms, implicit feedback handling, etc. – e.g., SVD++, factorization machines). But more notably, MF’s concept of learning **embeddings** for users and items paved the way for deep learning methods. In fact, a neural network can be viewed as learning an even richer factorization where latent factors are discovered through non-linear transformations rather than just linear dot products. We’ll see that next.
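As a minimal sketch of how these factors can be learned (a toy example, not a production recipe), here is an SGD loop that fits $U$ and $V$ to the observed entries of a small rating matrix; the data, factor count, and hyperparameters below are made up:

```
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix: 4 users x 5 items, 0 means "unknown" (not a real rating)
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

k = 2                                              # number of latent factors
U = 0.1 * rng.standard_normal((R.shape[0], k))     # user factors
V = 0.1 * rng.standard_normal((R.shape[1], k))     # item factors
lr, reg = 0.05, 0.02                               # learning rate, L2 regularization

# SGD over observed entries only: minimize (R_ui - U_u . V_i)^2 plus regularization
for epoch in range(200):
    for u, i in zip(*np.nonzero(R)):
        err = R[u, i] - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_old - reg * V[i])

# Predicted score for a user-item pair that was never observed in training
print(f"Predicted rating for user 0, item 2: {U[0] @ V[2]:.2f}")
```

Libraries such as Spark’s ALS implementation handle this at a much larger scale, but the core idea is the same: fit user and item vectors so their dot products reproduce the observed interactions.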
### **Two-Tower and Multi-Tower Neural Architectures**

As recommendation systems scaled and started using richer input features, the **two-tower architecture** emerged as a **powerful deep learning paradigm for recommendations** (also called dual-encoder or Siamese networks in some contexts). A two-tower model has two neural network “towers”: one ingests **user features** to produce a user embedding, the other ingests **item features** to produce an item embedding. The outputs of the towers are vectors in the same latent space, and the model’s predicted score for a user-item pair is usually the **dot product** (or cosine similarity) between these two embeddings. This is conceptually similar to matrix factorization (user and item vectors, dot product) but with a crucial difference: the towers can be deep networks that take into account not only an ID but any features of the user or item (age, gender, context for the user; title, category, image, etc. for the item). The two-tower model effectively generalizes MF by allowing complex transformations to learn embeddings that capture **non-linear feature interactions** and rich content.

**Why two towers?** The major advantage is **scalability in retrieval**. In large systems (imagine matching a user to one video among millions on YouTube), you can’t feasibly run an expensive neural network that examines every user-item pair at query time. Two-tower models solve this by **decoupling user and item computation**. You can pre-compute and store embeddings for every item in the catalog (say, all videos) offline. Then at runtime, you only need to compute the user’s embedding (from the user tower) and perform a fast **nearest-neighbor search** in the embedding space to find items with the closest vectors. This nearest-neighbor search can be optimized with approximate methods (like Annoy, Faiss, ScaNN, etc.) to handle billions of vectors efficiently. Essentially, the two-tower model turns the recommendation problem into an **embedding similarity search** problem, which is much faster than brute-force scoring every candidate through a full model. Google and others widely use this approach for first-stage retrieval in systems like search, ads, and YouTube recommendations.

Another benefit is that two-tower models naturally handle **multi-modal input features**. For example, Uber Eats built a two-tower model where the user tower encodes a user’s profile and context, and the item tower encodes a restaurant’s info (cuisine, location, etc.). They train the model on historical “engagement” pairs (user ordered from the restaurant or not) – effectively a classification task – and learn embeddings such that a user’s vector is close to the vectors of restaurants they’re likely to order from. This drastically improved scalability and **retrieval latency** in their system: instead of evaluating a deep model for every (user, restaurant) pair in a city (which would be millions of pairs per query), they precompute all restaurant embeddings and only compute one user embedding at query time. The dot products between that user vector and all restaurant vectors can be quickly approximated by ANN search, cutting down computation while maintaining good recall of relevant items. Uber’s team noted this was a huge win over their previous city-by-city deep matrix factorization model, which wasn’t scalable as the number of cities grew.
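Here is a minimal PyTorch-style sketch of the pattern, assuming dense, pre-extracted user and item feature vectors; real systems add embedding layers for IDs, train on logged engagement with sampled negatives, and index the item vectors in an ANN library such as Faiss:

```
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Minimal two-tower model: user features -> user vector, item features -> item vector."""
    def __init__(self, user_dim, item_dim, embed_dim=32):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

    def forward(self, user_feats, item_feats):
        u = self.user_tower(user_feats)   # (batch, embed_dim)
        v = self.item_tower(item_feats)   # (batch, embed_dim)
        return (u * v).sum(dim=-1)        # dot product = relevance score

model = TwoTowerModel(user_dim=10, item_dim=20)
loss_fn = nn.BCEWithLogitsLoss()          # pointwise "did the user engage?" loss

# One fake training step on random data, just to show the shapes
user_feats = torch.randn(8, 10)
item_feats = torch.randn(8, 20)
labels = torch.randint(0, 2, (8,)).float()
loss = loss_fn(model(user_feats, item_feats), labels)
loss.backward()

# At serving time: precompute item vectors once, then only run the user tower
with torch.no_grad():
    catalog = torch.randn(1000, 20)              # hypothetical item feature matrix
    item_vecs = model.item_tower(catalog)        # these would be indexed in an ANN library
    user_vec = model.user_tower(torch.randn(1, 10))
    top_items = torch.topk(item_vecs @ user_vec.T, k=5, dim=0).indices.squeeze()
```

The important property is in the last block: the item tower runs offline over the whole catalog, and only the user tower plus a nearest-neighbor lookup run per request.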
> **Example – Two-Tower vs. Single-Tower:** YouTube’s recommendation system (circa 2016) popularized a two-stage approach: first, a two-tower (user/content) model generates a few hundred candidates from a huge corpus (millions of videos) using embeddings; second, a more complex ranking model refines those candidates. The candidate generator (two-tower) essentially did collaborative filtering in a learned embedding space – it treated recommendation as an extreme classification problem where the user’s watch history and other features are used to predict which video out of millions they might watch next. The network was inspired by language models (Word2Vec’s continuous bag-of-words), learning embeddings for video IDs and search tokens, and it used techniques like **negative sampling** to train efficiently. This deep learning approach was a generalized matrix factorization: _“Using a neural network as a generalization of matrix factorization allows arbitrary continuous and categorical features to be added to the model easily”_. In other words, the two-tower model retained MF’s efficiency and simplicity (embeddings + dot product) while allowing much richer input data.

**Multi-tower architectures:** Sometimes we have more than two distinct types of inputs or tasks. A **multi-tower** model might have, say, three or more parallel towers whose outputs are combined (via concatenation or another layer) to make a prediction. This is common in multi-modal recommenders – for example, one tower for text features of an item, one for image features, one for user features. It’s also used in **multi-task learning** and **multi-domain** scenarios: e.g., Pinterest’s ads system had to serve two different products (Standard Ads vs. Shopping Ads), which have very different characteristics. They found that a single model struggled to learn both distributions, so they used a **shared-bottom, multi-tower** network: the lower layers are shared to learn general patterns, then split into separate MLP towers, one for each ad type. Each tower learns the specifics of that ad domain without interference from the other, and then they recombine at the top or simply have separate outputs. This yielded better performance on both ad types (offline AUC and log-loss improved), and those gains translated to online metrics in A/B tests. Multi-tower designs are effective for isolating influences when you have heterogeneous inputs or objectives, while still allowing some shared representation. They do increase model complexity, so there’s a trade-off in training cost and a risk of overfitting if there isn’t enough data for each tower, but in industry settings (where data is plentiful) multi-tower setups for multi-task recommender models are increasingly common.

### **Learning to Rank – Optimizing Loss for Ranking Quality**

Thus far, we’ve talked about model architectures and approaches to represent users and items. Equally important is **how we train these models** – specifically, the **loss functions and training objectives** designed for recommendations. This is where **learning-to-rank (LTR)** comes in. In recommendation (and information retrieval in general), getting the order of results right is critical. You often care less about the absolute predicted score and more that the top of the list contains the truly relevant items. Learning-to-rank techniques aim to train models to produce a good _relative_ ordering of items.
LTR is typically divided into three types of approaches: **pointwise**, **pairwise**, and **listwise**:

- **Pointwise**: You treat the problem like regression or classification on single items. For example, given a (user, item) pair, predict the probability that the user will click, or the rating the user would give. Each item’s score is learned independently, and then for ranking you’d sort by those scores. If using a pointwise approach, you might use a standard loss like mean squared error (for ratings) or cross-entropy (for click/no-click). Pointwise methods are simple and allow using all the machinery of classification/regression. A lot of deep recommendation models (like the ones for ads or YouTube candidates) indeed use pointwise training – e.g., treat a click as 1 and a non-click as 0 and do binary cross-entropy. **Pros:** Pointwise is easy to implement, and training is often fast. **Cons:** It doesn’t directly account for relative ordering; it might produce good probability estimates but not necessarily the best top-$k$ ranking. It can also be overly influenced by popular items (since there are many more negatives than positives, one needs proper sampling and weighting).
- **Pairwise**: The model is trained on pairs of items, with the goal of scoring a known positive item higher than some negative (unpreferred) item. For example, from a user’s history, take a pair: item $i$ the user clicked vs. item $j$ the user skipped; you want $f(\text{user}, i) > f(\text{user}, j)$ by a margin. Loss functions like **hinge loss** or **pairwise logistic loss** (used in **RankNet**, etc.) are employed. A well-known pairwise method in recommender systems is **BPR (Bayesian Personalized Ranking)**, which optimizes a pairwise loss for implicit feedback: it samples observed (user, item) interactions and compares them to random unobserved items, pushing the observed items to have higher scores than the unobserved ones (a minimal sketch follows this list). **Pros:** Pairwise training directly optimizes the ordering – the model learns to distinguish “what should be higher vs. lower”. It often yields better **rank-based metrics** (like NDCG) than pointwise. **Cons:** It’s more computationally expensive (there are many more possible pairs than individual points), and sampling the pairs effectively is tricky. Also, optimizing pairwise differences doesn’t guarantee the **global** order will be perfect; it handles relative orderings locally.
- **Listwise**: This approach considers a whole list of items at once and tries to optimize a metric that measures the quality of an ordered list (like NDCG, MAP, etc.). Listwise methods include algorithms like **LambdaMART** (boosted trees optimizing NDCG) or neural approaches that directly approximate list losses. These are the most direct in theory – you can aim to maximize NDCG directly – but they can be complex. Some listwise losses are not smooth or easy to optimize, so surrogate losses are used. **Pros:** If done well, listwise training most closely aligns with how we evaluate recommendations (by ranked lists). **Cons:** It’s even more complex and data-hungry; listwise methods can overfit and sometimes underperform pairwise methods in practice if not carefully regularized.
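As referenced in the pairwise bullet above, here is a minimal BPR-style training sketch. It assumes plain ID embeddings and uniformly sampled unobserved items as negatives, which is a simplification of the full algorithm:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

n_users, n_items, dim = 100, 500, 16
user_emb = nn.Embedding(n_users, dim)
item_emb = nn.Embedding(n_items, dim)
opt = torch.optim.Adam(list(user_emb.parameters()) + list(item_emb.parameters()), lr=0.01)

def bpr_step(users, pos_items, neg_items):
    """One BPR update: push the observed (positive) item above a sampled negative."""
    u = user_emb(users)
    pos = (u * item_emb(pos_items)).sum(-1)   # score for the item the user interacted with
    neg = (u * item_emb(neg_items)).sum(-1)   # score for a random unobserved item
    loss = -F.logsigmoid(pos - neg).mean()    # pairwise logistic loss on the score gap
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Fake batch of observed interactions plus uniformly sampled negatives
users = torch.randint(0, n_users, (64,))
pos_items = torch.randint(0, n_items, (64,))
neg_items = torch.randint(0, n_items, (64,))
print(bpr_step(users, pos_items, neg_items))
```

Note that the loss only looks at the *difference* between the two scores, which is exactly the "relative ordering" objective described above.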
**Trade-offs:** Pointwise methods have the advantage of **linear inference complexity** (scoring each item is O(1) per item, then you sort) and a well-understood training process, but they _“fail to capture crucial comparative information between items”_ – e.g., a pointwise model might assign scores that are not calibrated for direct comparison. Pairwise and listwise methods incorporate those comparisons and often yield more accurate rankings, but they can be expensive: pairwise has to consider many item pairs and doesn’t have a simple probabilistic interpretation; listwise often requires approximations or suffers in practice despite theoretical appeal . In fact, a recent Amazon Science publication (2024) noted that pairwise/listwise approaches, while powerful, can have practical limitations (computational cost, or instability), and proposed a hybrid that integrates pointwise and pairwise training to get the best of both . In practice, many industry recommender systems use _pointwise_ training for their primary models (especially in deep learning recommenders for ads or feeds, where they optimize cross-entropy/AUC), but then may apply a _pairwise reranker_ or use _heuristics_ to adjust the results. For example, a system might first score candidates with a deep model (pointwise probability of interaction), then do a second pass to ensure that certain known liked items outrank known less-liked ones (a kind of pairwise adjustment). There’s also a trend of using **reinforcement learning** or **bandit** algorithms to directly optimize long-term rewards, which can be seen as an advanced form of listwise optimization (optimizing sequence of recommendations over sessions), but that’s beyond our scope here. **Learning-to-rank in context:** If your recommendation problem is, say, a ranked list of top-10 items, and your offline metric is NDCG@10, it might make sense to explore pairwise or listwise loss. On the other hand, if you’re doing a prediction task like “will the user click on this ad (yes/no)”, a pointwise loss (logistic) is natural, and you rely on sorting predicted probabilities for ranking. There is no one-size-fits-all – it often depends on your data volume and what you’re optimizing for. Many academic competitions (e.g., KDD Cup, RecSys challenges) have shown pairwise or listwise methods winning when fine-tuned, but the complexity of implementing them at scale means pointwise deep models remain popular in industry, supplemented by careful evaluation and tuning to ensure the ranking makes sense. To close this section, here’s an important point: whichever approach you choose, it’s crucial to align your training loss with your **evaluation metric** and business goal as much as possible. If you care about top-k precision, ensure your model gets feedback about ranking (not just classification accuracy). This alignment often dictates which method (pointwise vs pairwise) will serve you best. ## **Domain-Specific Recommendation Systems and Trade-offs** _Summary:_ In this section, we delve into how recommendation strategies differ across domains. We’ll look at **Ad Recommendations**, where maximizing clicks and conversions (while respecting budgets and user experience) is key, and the models (like wide-and-deep, DLRM) that handle massive sparse features. We’ll explore **Social Recommendations** (e.g. “People You May Know”), which leverage social graph connections and need to predict two-sided acceptance. 
Next, **Place Recommendations** (restaurants or local businesses) where context like location and personal taste are crucial – we’ll see how systems like Uber Eats or Google Maps approach “where to eat” suggestions. Finally, **Event Recommendations** (concerts, live events) where the content is ephemeral and matching user interests (often from other domains like music) poses a cold-start challenge. For each domain, we’ll discuss typical model architectures, appropriate loss functions/objectives, and any special evaluation considerations or constraints. The goal is to understand that **one size doesn’t fit all** in recommender systems – the best design is highly dependent on what you’re recommending and what “success” means in that context. ### **Ad Recommendations – Optimizing Clicks & Conversions under Constraints** Recommender systems for advertisements (sponsored content) are ubiquitous – think of the ads you see in your Facebook feed, Google search results, or e-commerce sites. The recommendation problem here is to select the most relevant ads for a user **that also perform well for the advertisers**. This domain has unique challenges: **multi-objective optimization** (balancing user engagement with advertiser ROI), extremely **high stakes for relevance** (showing too many poor ads can drive users away, but showing the right ad yields revenue), and often the need for **real-time bidding or budget pacing** (an ad might not be shown if its campaign budget is exhausted, etc.). **Models and Features:** Ad recommender systems usually ingest a **large number of sparse categorical features** – e.g. user ID, ad ID, keywords, demographics, context (time, device), etc. – and some **dense features** (like user age, maybe continuous stats). A famous industry model is Facebook’s **DLRM (Deep Learning Recommendation Model)** . In DLRM, input features are split into two types: **dense features** go through a small neural network, while **categorical features** (like IDs) are each converted into embeddings (via an embedding lookup) . They then combine these by taking **dot products of all pairs of feature vectors** (this captures interactions between every pair of features, similar to factorization machines), concatenate those interactions with the original dense features, and feed it into another MLP to get a final prediction . This architecture is quite sophisticated: it learns both linear contributions of features and pairwise interactions in a differentiable way. Google’s **Wide & Deep** model (for Play Store apps and ads) took a slightly different approach: it has a “wide” part (basically a linear model with cross-feature inputs manually engineered or learned) and a “deep” part (an MLP on embeddings), and merges them at the end . The wide part helps with memorization of frequent patterns (e.g. some popular app gets a boost), while the deep part generalizes to new combinations of features . Another Google model, **Deep & Cross Network (DCN)**, explicitly learns higher-order feature crosses via hidden layers . All these are tackling a core issue: ads data is **high-dimensional and sparse**, so embedding layers to reduce dimensionality and special structures to handle feature interactions are key. On top of these base models, ad recommenders often employ **multi-task learning**. For instance, predicting _click-through rate (CTR)_ is the primary task (will the user click the ad?), but an adjacent task could be _conversion rate_ (will the click lead to a purchase or sign-up?). 
Pinterest, for example, trained a multi-task model for ads that predicted not just clicks but also downstream engagement like whether the user bounced or scrolled on the landing page . Multi-task learning allowed them to improve overall user satisfaction by not solely optimizing the immediate click, but also the quality of the click (a “good” click vs. one where the user regretted it) . The loss function in that case is a weighted sum of losses for each task, or even a custom loss that combines them. An example multi-task loss could be $L = \sum_{t} \alpha_t \cdot \frac{1}{N}\sum_{n} \ell(\hat{y}_{n,t}, y_{n,t})$ for tasks $t$, distributing weight $\alpha_t$ to each task’s error . **Objectives and trade-offs:** The primary metric for ad recommendations is often **CTR (click-through rate)** or **CPA (cost per action)** if optimizing conversions. However, the _true_ objective might be revenue or long-term advertiser value. There’s an inherent trade-off: showing only the highest predicted CTR ads could lead to showing cheap gimmicky ads too often, potentially hurting user experience, while showing only “pleasant” low-paying ads might miss revenue. Many systems solve this via a **second price auction** mechanism combined with relevance prediction – i.e., rank ads by something like $p(\text{click}|user,ad) \times \text{bid}$ (expected value). From a modeling perspective, one often needs to ensure the probability estimates are **calibrated**, so that this product is meaningful . Calibration means if an ad is predicted 0.1 CTR, it should actually get ~10% CTR on average; this is critical when money is involved, as it ensures the platform charges advertisers fairly. Techniques like **Platt scaling** or adding a small calibration model on top are common to adjust raw model scores into well-calibrated probabilities . Another challenge is **exploration vs. exploitation**. If we always recommend the top-predicted ad, we might never discover that some new ad actually performs well (because it gets no exposure to gather data). Real ad systems often include an exploration strategy or at least don’t overly punish initially “cold” ads – some percentage of traffic may be set aside to randomly (or systematically) try less-proven ads, to gather data. **Industry example – multi-tower for ads:** We mentioned Pinterest’s multi-tower model . Here’s more detail: they had two separate ad sources (Shopping vs. regular ads). Instead of one tower mixing both, they used a **shared bottom** (common layers that process features both ad types share) and then **two separate towers** on top – one tower specialized on Shopping Ads features, one on Standard Ads . Each tower then had its own outputs for the tasks (CTR, etc.). This way, the model could learn some commonalities (like general user engagement patterns) but not confuse the peculiar signals of shopping ads (which have product images, prices, etc.) with standard ads (which might be more about creative and targeting). The result was a significant lift in performance on the Shopping slice without degrading Standard ads . The takeaway: in ad recommendations, you often juggle multiple data distributions or objectives (maybe optimize revenue, but also user retention, etc.), and multi-tower or multi-objective setups let you address those explicitly, rather than trying to bake everything into one black-box prediction. 
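To make the multi-objective idea concrete, here is a minimal sketch of a shared-bottom model trained with the $\alpha_t$-weighted multi-task loss described earlier, with a click head and a conversion head. The architecture, features, and weights are illustrative, not Pinterest’s actual setup:

```
import torch
import torch.nn as nn

class SharedBottomMultiTask(nn.Module):
    """Shared layers feed two task heads: click prediction and conversion prediction."""
    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.click_head = nn.Linear(hidden, 1)
        self.conv_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.shared(x)
        return self.click_head(h).squeeze(-1), self.conv_head(h).squeeze(-1)

model = SharedBottomMultiTask()
bce = nn.BCEWithLogitsLoss()
alpha = {"click": 1.0, "conversion": 0.3}   # per-task weights (the alpha_t in the formula)

x = torch.randn(128, 32)                    # fake feature batch
y_click = torch.randint(0, 2, (128,)).float()
y_conv = torch.randint(0, 2, (128,)).float()

click_logit, conv_logit = model(x)
loss = alpha["click"] * bce(click_logit, y_click) + alpha["conversion"] * bce(conv_logit, y_conv)
loss.backward()
print(f"combined loss: {loss.item():.3f}")
```

In a real ad system, the heads would usually be followed by a calibration step so the predicted probabilities can be multiplied by bids and compared across campaigns.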
**Real-time aspects:** Ad recommenders usually operate under strict latency constraints because they often run in real-time when a user opens an app or webpage. They may need a multi-stage approach (retrieve candidates quickly, then rank). Two-tower models are heavily used for the retrieval stage in ads at Facebook and Google for this reason – they can retrieve a few thousand candidate ads from billions in milliseconds. The ranking stage can then use a more expressive model (like DLRM or an ensemble) on those candidates, since a few thousand is manageable to score with a deeper network in tens of milliseconds. Also, because ad systems deal with **time-sensitive data** (an ad campaign starts, stops, changes bids), features like “time since last ad shown” or “current budget left” might come into play and require dynamic feature updates even for the same ad. This is a layer of complexity beyond standard recommenders. **Evaluation:** Offline, ad recommenders are often evaluated with **AUC (Area Under ROC)** and **log loss** for click/no-click prediction , because those metrics relate to ranking and calibration. Online, the key metrics are CTR, conversion rate, revenue per impression, etc., measured via A/B tests (we’ll discuss evaluation later). A tricky thing is that an improvement in offline AUC doesn’t always mean higher revenue – for example, the model might better predict who won’t click anything (improving AUC by getting negatives right) but not actually rank the top few positives any better. Thus, a lot of fine-tuning and careful experimental validation is needed for ad recommenders. To summarize, ad recommendation pushes the limits of model complexity and data scale: massive feature spaces (hence heavy use of embeddings), multiple objectives (click, convert, revenue), need for calibration, and exploration. The models have evolved from simple logistic regression on manual features to deep networks like DLRM that automatically learn feature interactions . The guiding principle is **predict the user’s likelihood to engage with the ad**, but the real magic is handling all the side constraints and making sure these predictions translate into actual business value. ### **Social Recommendations – Connecting People and Content** Social recommendation systems suggest **people or content within a social network** that you might be interested in. Examples include “People You May Know” (friend suggestions on LinkedIn or Facebook), “Users to follow” (on Twitter, Instagram), or group/page recommendations. These systems leverage the structure of the social graph (who is connected with whom) in addition to behavioral signals. Consider **People You May Know (PYMK)** on LinkedIn . The goal is to recommend connections that a user is likely to know and want to add. This is a two-sided recommendation: it’s not enough that _you_ might click “Connect” – the other person might need to accept your invite. So they actually optimize for the probability of a successful connection (invite sent _and_ accepted) . That already complicates the loss function: you could break it into two models (P(invite) and P(accept|invite)), or use one model to predict some combined utility of a connection. LinkedIn’s approach is to use **multiple stages** and multiple metrics : an initial stage focuses on recall (getting a pool of potential candidates likely to be known), and later stages refine with precision and multi-objective scoring (taking into account invite acceptance, etc.). 
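As a minimal sketch of that two-sided objective, here is how one might rank candidates by the expected probability of a successful connection. The probabilities below stand in for the outputs of two separately trained, calibrated models, and the numbers are made up:

```
# Hypothetical calibrated probabilities from two separate models
candidates = {
    "candidate_a": {"p_send": 0.30, "p_accept_given_send": 0.80},
    "candidate_b": {"p_send": 0.50, "p_accept_given_send": 0.20},
    "candidate_c": {"p_send": 0.25, "p_accept_given_send": 0.90},
}

# Expected probability of a *successful* connection = P(send) * P(accept | send)
scores = {name: c["p_send"] * c["p_accept_given_send"] for name, c in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: expected successful connection = {score:.2f}")
```

Note how candidate_b, the invite most likely to be sent, ranks last once acceptance is factored in – this is exactly why optimizing only the "send" side of the interaction would be misleading.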
**Candidate generation in social rec:** Unlike e-commerce, where you might have a catalog of products, in social networks the “items” can be other users (for friend rec) or posts/groups. For friend recommendations, the candidate set is basically all other users minus current connections – billions of possibilities! Social recommender systems heavily rely on **graph algorithms** to generate candidates. For example, the simplest heuristic: recommend **friends-of-friends**. If Alice and Bob share 10 mutual friends, maybe they should know each other. This can be formalized by graph traversal: perform a 2-hop random walk from the user to reach candidates . Many systems use this: Facebook’s PYMK early on was rumored to be largely based on mutual friend count. LinkedIn’s PYMK blog describes using **n-hop neighborhood random walks** to get graph-based candidates , which captures those friend-of-friend relationships. Another source of candidates is **embedding-based retrieval (EBR)** – which is essentially a two-tower model where one tower represents the source user and the other represents candidate profiles, and they retrieve by vector similarity (similar to what we discussed in two-tower approach). This can capture more nuanced similarities (maybe you and someone else went to the same university or have similar interests, even if you don’t share a direct friend). LinkedIn hints at using a mix of graph-based and embedding-based candidate generators in their first stage . They even use simple heuristics like “new users in your geographic area” as candidates – a reminder that sometimes non-ML methods (like a business rule) can be part of the candidate mix. After generating a candidate pool (say a few thousand people you might know), the system goes through **ranking stages** . A light ranker might score how likely you are to connect (could be a logistic regression or gradient boosted trees combining features like mutual friend count, same company, etc.) . Then a heavy ranker (a deep model) might predict probabilities of various outcomes: invite sent, invite accepted, message sent after connecting (as a downstream value), etc. . In LinkedIn’s case, they use deep neural nets at the “rich ranker” stage to predict multiple engagement probabilities , and they care about **calibration** of these probabilities because the scores might be reused by other systems . They also finally apply a **re-ranking stage** that enforces business rules or fairness constraints – e.g. ensure diversity (don’t show 10 suggestions from the same company or all of one gender) . **Features and model:** Social rec involves a lot of graph features. For PYMK, important features include number of mutual connections, common workplaces or schools, whether the other person viewed your profile, etc. These can be plugged into machine learning models. In recent years, **graph embedding** techniques (Node2vec, GraphSAGE, etc.) are also used to embed each user as a vector that encodes graph structure, which can then be used in a two-tower retrieval or as features in ranking. There’s also a trend of using **GNNs (Graph Neural Networks)** to directly learn on the social graph to predict links (friend recommendations are essentially link prediction on a graph). But GNNs can be heavy, so often a combination is used: e.g. use simpler graph analytics for candidate gen, then a GNN-inspired feature for ranking if feasible. **Trade-offs:** One big issue is **explaining recommendations** and avoiding awkward suggestions. 
If you purely go by an algorithm, sometimes PYMK could suggest someone’s ex-partner or a person you deliberately chose not to connect with – which can be awkward. Social recommenders often incorporate **privacy and sensitivity heuristics**: e.g. don’t suggest someone who searched for you if that’s deemed creepy, or avoid recommending people who have blocked each other. There’s a lot of product thinking here beyond the algorithm. Another challenge is **fairness and filter bubbles in social context**. If the algorithm always connects like-minded people, you risk reinforcing homogeneity. LinkedIn explicitly mentions adding a fairness re-ranker to ensure a fair chance across gender and age, etc. . This shows how business rules might override the raw model output in later stages, to align with ethical or strategic goals. From an objective perspective, for friend recommendations the ideal outcome is a successful connection. That’s a relatively sparse signal (you might send a few invites out of many suggestions, and acceptance is another drop-off). So offline evaluation might use proxy metrics like precision@k of whether suggestions eventually got connected, but ultimately, online A/B testing to see if the feature leads to more connections made (without harming session time, etc.) is key. Also, social recs can have **network effects** – if I connect to more people, I become more engaged and that could drive others to connect too. This complicates evaluation, as a single user’s action can influence others. **Example – LinkedIn PYMK multi-stage:** Summarizing LinkedIn’s pipeline : - **L0 (Recall stage):** multiple candidate sources (graph walks, embedding sim, heuristics), optimize for **Recall@k** (ensure the true positives are in the candidate set) . For example, recall of actual known contacts that aren’t connected. - **L1 (Light ranker):** calibrate and merge candidates from different sources, down-select to a few hundred, still focusing on recall but now at a smaller k (like recall@500) . Light models (maybe logistic regression or XGBoost) ensure apples-to-apples comparison between a candidate suggested by graph vs by embedding by producing a unified score. - **L2 (Heavy ranker):** one or more deep models that predict probabilities of events (invite sent, accepted, etc.), used to rank the top N (maybe 10 shown). Optimize **AUC, Precision@k** here for accuracy of ranking . The loss might be a weighted combination of “likelihood of send _and_ accept” to directly optimize successful connection. - **Re-ranker:** enforce constraints like diversity and fairness before finalizing the list . This is more complex than a typical recommender because of the multi-stage and multi-objective nature, but it’s a template many large-scale recommenders follow. Beyond friend suggestions, social platforms also recommend **content** (posts, news feed items, groups, events). Those recommendations blend into the feed ranking problem (which is itself a recommendation task: ranking stories from friends or groups). The algorithms for feed ranking (like Facebook’s or Twitter’s algorithms) consider social signals (if many of your friends liked a post, it’s recommended to you) along with personal relevance. They often use **edge-ranking** (evaluate each content “edge” between user and item with a score). 
The details are vast, but one can see it as another domain: recommending posts is different from recommending movies because posts have very short lifespan (freshness is key) and social proof (friend likes) plays a big role. The model might incorporate features like “how many friends liked this post” directly, which is something unique to social content recommendation. To conclude, social recommenders uniquely leverage **social graph structure** and often have to handle **two-sided compatibility** (both parties benefit). They emphasize recall in early stages (due to huge candidate space) and fine-tune with multiple objectives (like making sure suggestions are both relevant and reciprocal). The human element (privacy, fairness) is quite pronounced here – a reminder that metrics aren’t everything when your recommendations involve real people. ### **Place Recommendations (Where to Eat or Visit) – Local Context and Personal Taste** Recommending “places to eat” or generally local businesses (restaurants, cafes, attractions) is another interesting domain. Unlike movies or friends, places have a strong **geographical** component and often a **momentary context** (time of day, current location). If you open Google Maps at noon, the app might recommend nearby lunch spots. The problem is a mix of content-based (match user preferences: e.g., you love sushi, so show sushi restaurants) and collaborative filtering (many people like this new cafe, so you might too). **Data and signals:** For restaurants, common signals include reviews/ratings, user check-ins or past visits, and possibly transaction data (if integrated with payment or reservation systems). Users might explicitly mark favorites or give star ratings, which gives item-specific feedback. There’s also an **implicit feedback** signal: if a user has navigated to a restaurant’s info or looked at it on the map, that indicates interest. **Content features:** Places have rich metadata – cuisine type, price range, hours, location (coordinates), maybe a description or menu, and user attributes like dietary preferences. These can feed into a content-based model. For instance, a user’s profile might infer “likes Italian, moderate price, within 5 miles”. The system will filter or boost restaurants that match those criteria. **Collaborative signal:** If many users who are similar to you (maybe similar tastes or demographics) liked a certain restaurant, CF would recommend it to you. Also, one can imagine an item-item CF: “users who liked these 3 restaurants you liked also tend to rate this other restaurant highly”. Those patterns can be captured if you have enough rating data or check-in data. **Context-awareness:** A key difference is that the _context_ (where and when the recommendation is happening) plays a huge role. Recommending a breakfast cafe at 8am vs. a bar at 8pm vs. a fine-dining restaurant for Saturday night are very different scenarios. So context features like current time, day of week, and location of user are important inputs. Many location-based recommenders use **contextual bandit** approaches to quickly learn a user’s preferences in a certain context. **Model approach:** A typical architecture might involve a two-stage system: first filter by location (e.g., within X radius or in the city the user is looking at), then rank by a learned model. 
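Here is a minimal sketch of that first, geography-driven stage, plus an exponential distance-decay factor that could be handed to the ranking model as a feature. The coordinates, radius, decay constant, and base relevance scores are made up, and a real system would use a geo-index rather than scanning every place:

```
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

user_loc = (37.7749, -122.4194)   # hypothetical user position
places = [
    {"name": "Sushi Bar",   "lat": 37.7790, "lon": -122.4312, "base_score": 0.9},
    {"name": "Pizza Place", "lat": 37.8044, "lon": -122.2712, "base_score": 0.8},
    {"name": "Taqueria",    "lat": 37.7599, "lon": -122.4148, "base_score": 0.7},
]

RADIUS_KM = 10.0   # stage 1: hard filter by radius
DECAY_KM = 3.0     # stage 2 feature: exp(-distance / DECAY_KM)

ranked = []
for p in places:
    d = haversine_km(*user_loc, p["lat"], p["lon"])
    if d <= RADIUS_KM:
        # base_score stands in for the learned model's relevance estimate
        ranked.append((p["name"], p["base_score"] * math.exp(-d / DECAY_KM), d))

for name, score, d in sorted(ranked, key=lambda t: -t[1]):
    print(f"{name}: score={score:.2f} (distance {d:.1f} km)")
```

Whether the distance decay is applied as a hard multiplier like this or simply fed to the ranker as a feature is a design choice; the point is that geography enters the score explicitly.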
The ranking model could be a two-tower: user tower could incorporate the user’s cuisine preferences vector (learned from past behavior or explicit preferences) and context (like a vector encoding “lunchtime” or “evening”), item tower would embed the place with its attributes (cuisine, popularity, distance from user). The dot product would give a base relevance score. Then an MLP might further refine by combining additional features (like “this restaurant is currently very highly rated by similar users”). Uber, in their Uber Eats system, faced an interesting variant: recommending restaurants (stores) to a user in the app’s feed. They initially had many separate models for different cities (deep matrix factorization per city), but this didn’t scale . They moved to a global two-tower model which could handle any city by including location as a feature and still keep training feasible . The two-tower approach allowed them to precompute embeddings for all restaurants and index them in their search system (they call it SIA) . At query time (when a user opens the app), they compute the user embedding and do an ANN search to get say top 500 restaurant candidates that roughly match the user’s tastes and context. Then a second stage might rank those with a more precise model (taking into account real-time info like whether a restaurant is open, wait time, etc.). **Special considerations:** - **Diversity:** In food, people often prefer a variety. If your last 5 orders were pizza, maybe you want something different tonight. A good place recommender might deliberately diversify suggestions (not 10 sushi restaurants even if you love sushi, but maybe 2 sushi, 3 pizza, 2 Thai, etc.). This can be handled via a diversity-promoting re-ranker or by adding a penalty for too-similar items in the top list. - **Novelty vs. Familiarity:** There’s an exploration aspect: some users want new experiences, others stick to what they know. Personalizing this is hard – a solution might be to mix known favorites with a few new recommendations. For known favorites, content-based logic (past high ratings => recommend it again) works; for novel suggestions, collaborative filtering might surface something trending that the user hasn’t tried. - **Temporal dynamics:** Restaurants can rise or fall in popularity. A recommender should adapt (through retraining or through features like “average rating last month”). Also seasonal effects (ice cream places in summer, pumpkin spice in fall) may occur. - **Geo-spatial relevance:** The distance or travel time to a place is a big factor. Often, an exponential decay function on distance is used as a feature or multiplier – people will rarely drive 50 miles for a lunch suggestion. So the system might hard-filter by a radius or include distance in the score (learning that closer is better, all else equal). **Event Recommendations:** Let’s include concerts and events here, as they are related to places but with a time component. Recommending a concert involves matching a user’s musical taste with upcoming events in their area. The **cold start** problem is huge: every event is new (it hasn’t happened before), so you can’t rely on prior interactions with that event. You have to rely on the artist’s popularity (if user listens to that artist or similar artists) – a content-based approach linking the event to the user’s music preferences. 
Collaborative filtering in events might only manifest as “people who bought tickets to X also bought tickets to Y”, but since events are one-time, the data is sparse and we often fall back to content- or knowledge-based filtering (artist genre, etc.). A known approach is to use the user’s listening history (from a service like Spotify or Last.fm) and the event’s artist lineup to compute a match. E.g., generate an affinity score = the fraction of the event’s artists the user likes, then weight by how popular or soon the event is. The model might be a linear model combining such features. Research from Facebook noted the “**transiency of events**” – events have short lifespans and user-event interaction history is extremely sparse, causing severe cold-start issues. They worked on a joint embedding of users and entities (artists/genres) to improve event recommendations, essentially pulling in side information to make up for the lack of direct interaction data. This highlights the need for **cross-domain features**: to recommend events, use music preferences (a different domain) and social connections (if friends are interested in the event) to inform the recommendations.

So for an event recommender:

- Likely features: the user’s known interests (artists, categories of events), event attributes (performer, category, location, date).
- Possibly social: show events that the user’s friends are interested in (Facebook Events does this: “3 friends are going to this event” – strong social proof).
- Time urgency: if an event is this weekend, it might be pushed higher today than an event 6 months away, since the user must decide now for the former.
- Diversity: mix big popular events (likely to be broadly appealing) with some niche ones the user specifically would like.

**Evaluation in place/event rec:** Offline, you can use rating prediction metrics if you have them (predict the user’s rating of a place), or top-k precision (did the user eventually go there?), but it’s tricky. Online, A/B testing can measure engagement (clicks on a place, bookings, etc.) or retention (did suggesting better places make users use the app more?). For events, a metric might be tickets sold or event RSVPs attributable to recommendations.

In summary, recommending places and events brings in **context (time/location)** and often requires blending **content-based filtering (attributes)** with **collaborative signals**. It’s a domain where you may even incorporate external data (like a trend: “this new café is trending on social media, show it to more users”). The models might be simpler than those for ads (some systems lean on heuristics and simpler scoring functions), but they are increasingly adopting the same deep learning approaches – Google Maps has started using more AI to tailor restaurant suggestions, likely using embeddings for users and places plus context. Domain knowledge (like geography) still plays a big role in feature engineering and model design, so it’s a nice example of combining data-driven learning with rule-based constraints.

### **Concrete Example (Python, Illustrative) – Combining Content and Collaborative Signals**

To tie together some ideas, imagine we have a small dataset of users, the concerts they attended, and their music preferences. A simple approach could be:

1. Use collaborative filtering on the user-concert matrix to find some likely concerts (e.g. ones that similar users attended).
2. Use content: ensure the recommended concerts feature artists the user likes.
We won’t code a full solution here, but conceptually:

```
# Pseudo-code for a hybrid event recommender
user_profile = get_user_top_artists(user)           # content: top artists the user listens to
candidate_events = collaborative_candidates(user)   # CF: events attended by similar users

# Rank by a combination of content and CF scores
scores = {}
for event in candidate_events:
    cf_score = collaborative_score(user, event)
    content_score = similarity(user_profile, event.artists)  # e.g. Jaccard or cosine
    scores[event] = 0.5 * cf_score + 0.5 * content_score     # weighted hybrid

top_rec = argmax(scores)
```

This pseudocode mixes collaborative and content signals. A real system would learn the weighting and possibly use a model to combine them rather than a fixed 0.5/0.5. But it highlights that in domains like events, pure CF might fail for brand-new events (cf_score will be 0 if no similar users have attended, because the event is new), so content_score (matching the user’s profile to the event) rescues the recommendation. Many production systems implement such hybrids to solve cold start issues.

## **Evaluating Recommendation Systems: Offline Metrics vs. Online A/B Testing**

_Summary:_ This final section focuses on **evaluation strategies** for recommender systems. We start with **offline evaluation** – how to use historical data and metrics like precision, recall, NDCG, RMSE, etc., to gauge a model’s performance in a lab setting (and common pitfalls of offline tests). We then discuss **online evaluation**, particularly A/B testing, which is the gold standard for measuring real user impact. We’ll explain how to design an A/B test, interpret results (statistical significance, multiple metrics), and touch on advanced methods (multi-armed bandits, interleaving) that can accelerate online comparisons. We’ll also cover the importance of considering long-term effects and not just optimizing one metric at the expense of others. By the end, you should appreciate why **both** offline and online evaluations are needed and how to balance them when iterating on a recommender system.

### **Offline Evaluation – Metrics and Methodologies**

Offline evaluation means testing your model on historical data. You typically have a dataset of users, items, and past interactions (e.g., who rated/clicked what). You hold out some interactions as a test set. Then you ask: “If my model had been deployed, how well would it predict those held-out interactions?” This is done without any live users – it’s all on the computer, so it’s fast and repeatable.

**Common metrics:** For rating prediction tasks (like predicting a 5-star rating), you’d use **RMSE (root mean squared error)** or MAE (mean absolute error) between predicted and actual ratings. But many modern recommender tasks are about **top-$k$ recommendations**. Here, we use ranking metrics:

- **Precision@k**: Of the top $k$ items recommended, what fraction are relevant (the user interacted with them in the test set)?
- **Recall@k**: Of all relevant items for the user, how many did we manage to include in the top $k$? Recall is important in candidate generation stages – e.g. does the user’s true favorite show up in the candidates? High recall means we’re not missing good items.
- **MAP (Mean Average Precision)**: The average precision of the ranked list, taking into account the position of relevant items (higher-ranked relevant items yield higher precision), then averaged across users.
- **NDCG (Normalized Discounted Cumulative Gain)**: This metric accounts for the rank of each relevant item with a logarithmic discount, and also allows graded relevance (not all hits are equal). NDCG@k is often used if we have a notion of “relevance grade” or just to emphasize that getting a hit in position 1 is better than position 5 . We’ll demonstrate NDCG shortly. - **AUC (Area Under the ROC Curve)**: Often used when the task is viewed as binary classification (relevant or not). AUC is the probability a random relevant item is ranked above a random non-relevant item. It’s popular for implicit feedback scenarios and is a nice single-number metric that correlates with ranking quality . - **Coverage, Diversity, Novelty**: These are secondary metrics looking at the recommendations across all users. Coverage asks: what fraction of items can our model recommend (does it only ever show 10 popular items, or a broad range)? Diversity measures if the list for a user contains different genres or categories. Novelty tries to quantify how “surprising” or new the recommendations are (there are definitions using popularity inverse frequency). These metrics help ensure the system isn’t too narrow. Let’s illustrate **NDCG** with a short code snippet for clarity, as it’s a bit tricky: ``` # Example: compute DCG and NDCG for a single user's recommendation list import math # Suppose the user has 2 relevant items: "itemA" and "itemD" relevant_items = {"itemA", "itemD"} # Our system recommended the following ranked list: recommended_list = ["itemB", "itemD", "itemC", "itemA"] # Compute DCG (Discounted Cumulative Gain) dcg = 0.0 for rank, item in enumerate(recommended_list, start=1): rel = 1.0 if item in relevant_items else 0.0 # relevance is 1 if relevant, else 0 dcg += rel / math.log2(rank + 1) # note: log2(rank+1) print(f"DCG = {dcg:.3f}") # Compute IDCG (Ideal DCG) for the ideal ordering of those 2 relevant items at the top ideal_list = ["itemA", "itemD"] # one ideal ordering (both relevant items in positions 1 and 2) idcg = 0.0 for rank, item in enumerate(ideal_list, start=1): rel = 1.0 idcg += rel / math.log2(rank + 1) print(f"IDCG = {idcg:.3f}") ndcg = dcg / idcg print(f"NDCG = {ndcg:.3f}") ``` If we run this, we might get output like: DCG = 1.062, IDCG = 1.631, NDCG = 0.651 This means our recommended list achieved about 65.1% of the ideal gain. We got one relevant item (“itemD”) in rank 2 (which contributed $1/\log_2(3) \approx 0.63$) and another relevant “itemA” at rank 4 ($1/\log_2(5) \approx 0.43$), summing to 1.06. The ideal would have been both relevant items at ranks 1 and 2 giving 1 + 0.63 = 1.63. So NDCG=0.65 tells us the ranking was okay but not perfect (we didn’t surface both relevant items at the very top). > **Offline train-test splits:** It’s important how you split data for offline tests. A common approach for recommendation is **leave-one-out**: for each user, hide one interaction (like their last interaction) as test. Another is a full **temporal split**: use all data up to time T for training, and interactions after T for testing  . Temporal splits mimic the real scenario of predicting future behavior from past data and avoid “time travel” leakage. Random splits can artificially make it easier (because some training data might be from after some test data if time isn’t respected). **Pitfalls of offline evaluation:** Offline metrics are convenient but they have pitfalls : - **Behavioral biases:** The data you collected was a result of the previous system or user interface. 
For example, if your app always recommended popular items, the log data will be skewed towards popular items. An offline evaluation might then unfairly favor a model that also recommends popular items, because those will match the historical data (this is called **popularity bias or data bias**) . It doesn’t mean that’s the truly best experience; it just mirrors the past. This is tricky: a new model that tries to introduce niche items might look worse offline because those niche items weren’t clicked historically (since they were never shown!), but it could actually do better if deployed. - **Missing data (unobserved preferences):** If a user didn’t listen to a song, we interpret it as “not relevant” in evaluation, but maybe they would have loved it and just never discovered it. The absence of interaction is not a true negative in recommender systems – a big difference from, say, classifying an image where every image has a known label. This makes metrics like precision/recall hard to interpret because the set of “relevant” items is incomplete. One technique is to assume all unobserved interactions are negative for evaluation, which is very strict. Another is to only evaluate on a subset of “seen” items (e.g., known positives vs. sampled negatives). - **Collection bias and feedback loops:** Similar to popularity bias, there is also the issue that once you deploy a recommender, user behavior changes. They might start interacting differently (feedback loop). Offline eval can’t capture that – it evaluates in a static world where your model doesn’t influence the data generation. When you put the model live, it _will_ influence what users see and do (observational vs. interventional gap ). - **Cold-start in test:** If your test set includes new users or items that weren’t in training, offline metrics will be harsh (because your model had no data for them). In reality, you might handle cold start with other means. So one must be deliberate – maybe evaluate cold start separately from warm start. **Addressing biases offline:** There’s research on **counterfactual evaluation** and **propensity scoring** to correct offline metrics for how data was collected . For example, **IPS (Inverse Propensity Score)** weighting can adjust for popularity bias: if popular items were shown more, you weight their errors less, so a model doesn’t get over-credited for recommending popular stuff . There are also techniques to do offline A/B tests by replaying logs (assuming you have logging that includes randomized exploration, you can simulate how a new model would have done). These are advanced topics, but it’s good to know that naive offline evaluation can be misleading, and there are ways researchers try to make it more reliable . In practice, offline metrics are very useful for **model development**. They let you try many ideas quickly. If your new matrix factorization model has a higher NDCG on a holdout set than the previous one, it’s a positive signal. It’s much faster than deploying each idea online. However, teams often find that **offline gains don’t always translate to online gains**. Thus, offline eval is a necessary step but not sufficient. A good practice is to use multiple metrics offline. For example, track NDCG for all users, but also maybe track the score for different segments (new vs. old users, or for popular vs. tail items). Also track diversity/novelty if that matters to you, so you don’t inadvertently optimize just one aspect. 
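To make the IPS idea above a bit more concrete, here is a minimal sketch of off-policy evaluation with inverse propensity weighting. The logged data layout, the propensities, and the `new_policy_prob` helper are all hypothetical toy assumptions:

```
# Minimal IPS sketch: estimate how a *new* policy would have performed
# using logs collected under the *old* policy.
# Each logged impression records: (item shown, clicked?, propensity under the old policy).
logged_impressions = [
    ("item_1", 1, 0.50),   # popular item, shown often by the old policy
    ("item_2", 0, 0.50),
    ("item_3", 1, 0.20),   # niche item, shown rarely
    ("item_4", 0, 0.10),
]

def new_policy_prob(item_id):
    """Probability the new policy would show this item (toy values)."""
    return {"item_1": 0.25, "item_2": 0.25, "item_3": 0.40, "item_4": 0.10}[item_id]

# Naive (biased) estimate: just the logged click rate
naive_ctr = sum(clicked for _, clicked, _ in logged_impressions) / len(logged_impressions)

# IPS estimate: reweight each logged click by how much more (or less) often
# the new policy would have shown that item compared to the old one.
ips_ctr = sum(
    (new_policy_prob(item) / propensity) * clicked
    for item, clicked, propensity in logged_impressions
) / len(logged_impressions)

print(f"naive logged CTR = {naive_ctr:.2f}, IPS-estimated CTR of new policy = {ips_ctr:.2f}")
```

Here the new policy gets credit for favoring the under-exposed niche item, which a naive replay of the logs would undervalue. Real counterfactual evaluation also has to manage variance (weight clipping, self-normalized IPS, doubly robust estimators), which this sketch glosses over.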
### **Online Evaluation – A/B Testing in Production**

Online evaluation means testing with real users in a live system. This is the ultimate judge of a recommender system’s effectiveness: do users click more, buy more, are they happier, do they stick around longer? The standard method is an **A/B test (randomized controlled experiment)**. You deploy the new recommendation algorithm (variant B) to a subset of users – say 10% of traffic – while the current system (variant A, the control) serves the rest; alternatively you compare two equal-sized slices (e.g., 10% vs. 10%, or a 50/50 split), depending on how much traffic you can afford to expose to the experiment. By randomly assigning users (or sessions) to each variant, you ensure the groups are statistically equivalent on average. Any difference in their behavior can be attributed to the new algorithm, with some margin of error.

**What to measure:** This depends on your product goals. Common primary metrics for recommenders:

- Click-through rate (CTR) on recommendations.
- Conversion rate or purchase rate (if e-commerce).
- Time spent or sessions (for content feeds, the goal may be to increase user engagement time).
- Number of connections made (for PYMK, the success metric might be connection accepts).
- Revenue (for ads, an obvious metric).
- Long-term retention (does this feature make users come back more often in the following weeks?).

Often you have a **primary metric** (the one you want to move) and a set of **guardrail metrics** (things you don’t want to hurt, like user retention, diversity, or system load). A/B testing will reveal whether variant B is significantly better or worse on these metrics.

**Running an A/B test:** Typically, you let it run long enough to gather sufficient data for significance – maybe two weeks or more, depending on traffic. You monitor the metrics daily to ensure nothing catastrophic is happening. At the end, you perform statistical tests (e.g., a t-test or chi-square test) to see if differences are likely real or just noise (a toy significance check is sketched below).

**Interpreting results:** If variant B’s CTR is +5% with a p-value of 0.01, you conclude it’s an improvement. But you also check other metrics: did average session length drop? If so, maybe the recs got more clicks but annoyed users (they left earlier). Or did revenue drop? That could happen if you optimized for clicks but they were low-value clicks.

**Pitfalls of A/B tests:** They are considered the gold standard, but:

- They can be **time-consuming and costly**. You might hold back a chunk of users on a less-optimized variant, which has an opportunity cost . If every experiment needs 2 weeks and you have to run 10 to find a good one, that’s slow. This motivates techniques like multi-armed bandits (below) to speed up learning.
- **Single-metric focus:** If you focus on one metric, you might harm others . For instance, optimizing click rate can reduce diversity or long-term satisfaction . It’s crucial to take a holistic view – many teams have experiment dashboards with 5–10 metrics. If an experiment wins on 1 metric but tanks 3 others, it’s usually a no-go.
- **Segment analysis:** A change might be good on average but bad for a subset of users . E.g., the new algorithm is great for active users (they get super-personalized lists), but new users get worse recommendations than before. If you only look at the aggregate, you might miss that. It’s wise to slice results by user type, country, etc. .
- **Short-term vs. long-term:** Some changes yield immediate metric boosts but have long-term downsides.
Example: a recommendation algorithm might learn to exploit clickbait – users click more this week, but over a month they start trusting the recommendations less and engagement drops. Ideally, one could run long-duration experiments or have proxy metrics for long-term (maybe measure repeat engagement or user survey happiness). One interesting practice is **holdout groups** that are kept on an old recommendation policy indefinitely to measure long-term differences . For example, keep 1% of users on the old algorithm for months as a baseline, while 99% are on the new; if after 3 months the baseline group is somehow retaining more users, that’s a red flag. **Beyond A/B:** - **Multi-armed bandits:** Instead of a fixed A/B split, bandit algorithms dynamically adjust traffic to variants based on performance . If variant B looks promising early, it gets more traffic, speeding up results. This can maximize reward during experiments and find the best option faster. It’s useful when you have many variants or want to continuously optimize a parameter (like explore/exploit trade-off). - **Interleaving tests:** This is mainly used in search ranking, but can apply to recommenders for comparing two ranked lists. Interleaving means you merge results from A and B into one list shown to the user (in some alternating fashion) . By seeing which system’s items the user clicked more, you infer preference between A and B for that user. Interleaving allows very fast comparison with fewer users, because each user effectively tries both algorithms simultaneously . However, it’s tricky to implement for recommendation scenarios outside search (where results from A and B are comparable). **Trusting offline vs. online:** There’s often a healthy tension. In one famous anecdote, Netflix found that the algorithm that won the Netflix Prize (best offline RMSE on ratings) did not actually improve their business metric (hours watched) much when implemented. Why? Because optimizing RMSE on predicted stars wasn’t aligned enough with driving user engagement. They shifted to other approaches (including more implicit feedback models and personal taste profiles). The moral: **offline metrics are surrogates** – you optimize them hoping it translates to real impact. But the ultimate check is the A/B test with real users. **Example scenario:** Suppose offline tests showed Model X had higher NDCG than Model Y, so you deploy X in an A/B test expecting a CTR lift. The online result shows no significant difference in CTR, but a slight decrease in time spent. What could have happened? Maybe Model X, in maximizing NDCG, ends up recommending a few very relevant items that the user clicks quickly and leaves, whereas Model Y gave a broader list that kept the user browsing longer. Or maybe the offline data didn’t include how users react to seeing the same recommendations repeatedly (maybe Model X over-personalizes and always shows the same top items, causing fatigue). This highlights things like **novelty** – sometimes you want to deliberately inject less relevant but novel items to keep engagement up over multiple sessions (a purely greedy offline-optimal model might not do that). These considerations often only surface in online tests. In conclusion, a practical workflow is: use offline evaluation to **narrow down** models (it’s necessary to sift through ideas), but **validate with A/B tests** to choose what actually goes live . 
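As a toy illustration of the significance testing mentioned earlier, here is a minimal two-proportion z-test on CTR. The traffic and click counts are made up, and a real experimentation platform would handle this (plus variance reduction, sequential testing, etc.) for you:

```
import math

# Made-up A/B results: impressions and clicks for control (A) and treatment (B)
imps_a, clicks_a = 100_000, 4_100
imps_b, clicks_b = 100_000, 4_350

ctr_a = clicks_a / imps_a
ctr_b = clicks_b / imps_b

# Pooled two-proportion z-test for the difference in CTR
p_pool = (clicks_a + clicks_b) / (imps_a + imps_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b))
z = (ctr_b - ctr_a) / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value (normal approximation)

print(f"CTR A = {ctr_a:.4f}, CTR B = {ctr_b:.4f}, relative lift = {(ctr_b - ctr_a) / ctr_a:+.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these numbers the roughly +6% lift comes out statistically significant, but as discussed above, a significant CTR win alone wouldn’t justify a launch if guardrail metrics moved the wrong way.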
Over time, you also refine your offline metrics to better correlate with online outcomes (perhaps by incorporating multiple metrics or more realistic simulation of online behavior). Online evaluation gives you the ground truth on user satisfaction and business KPIs, while offline gives you development speed. Both are essential in recommender system design. A/B testing, despite requiring careful design and interpretation, remains the **cornerstone of evaluating changes** in real-world systems because it captures all the complex effects in a way offline tests can’t. As an engineer or data scientist, mastering both offline and online eval will make you much more effective in iterating on recommendation algorithms. ## **Conclusion** In this deep exploration, we’ve journeyed through the landscape of recommendation systems – from fundamental algorithms to domain-tailored designs and evaluation techniques. We started by understanding core methods like collaborative filtering and content-based filtering, which showed us the power (and limits) of using past user behavior and item attributes to predict preferences. We saw how matrix factorization distills interactions into latent factors, and how modern deep learning approaches like two-tower models extend that idea to handle web-scale candidate retrieval and rich features . We also recognized that training objectives matter: learning-to-rank strategies align our models with how we evaluate rankings, whether via pointwise simplicity or pairwise/listwise attention to ordering . Then we dove into **four specific domains** and discovered just how adaptive recommender system design needs to be. In **advertising**, the emphasis is on predicting clicks and conversions with high-dimensional features and doing so fairly and optimally for revenue – hence complex models like DLRM that mix dense and sparse features and careful calibration . We learned from industry examples how multi-task and multi-tower networks help juggle multiple objectives or data domains (e.g. Pinterest’s ads model) . For **social recommendations**, the social graph and two-sided nature of connections drive a multi-stage approach that balances recall and precision across candidates, and addresses things like fairness and diversity via re-ranking . In **local recommendations** (places to eat), context is king – models integrate location and time, and two-tower setups provide scalable retrieval of nearby favorites . We see the importance of mixing content knowledge (cuisine, etc.) with collaborative trends (popular spots) and handling new-item cold start with side information. For **events**, short item lifetimes demand heavy use of content (like artist similarity) and perhaps social proof, demonstrating a case where pure CF is often insufficient and hybrid approaches shine (a research point underscored by Facebook’s event rec paper on dealing with sparse user-event feedback) . Throughout these domains, a theme emerged: **the best recommender is the one tailored to the task and the user’s context**. There is no single algorithm that wins everywhere. You often combine building blocks – for instance, using a two-tower model for candidate generation, matrix factorization or deep networks for scoring, and then heuristic re-rankers for business rules – to meet all requirements. We also saw how **real-world systems** (Google, LinkedIn, Meta, Uber, etc.) 
have converged on surprisingly similar multi-stage architectures : a recall stage (often using embeddings) followed by one or more ranking stages, and domain-specific tweaks at each step. It’s reassuring that the theory we learn maps to practice, albeit with a dose of engineering pragmatism. Finally, we emphasized **evaluation** – because if you can’t measure it, you can’t improve it. Offline metrics like precision, recall, NDCG, and AUC are invaluable for rapid experimentation, but they come with biases and blind spots. We discussed how to use them wisely and the importance of aligning them with your product goals (e.g., if novelty matters, include a novelty metric so you don’t accidentally sacrifice it). Then we covered online A/B testing, the ultimate arbiter, with all its considerations from statistical significance to long-term impacts . The key takeaway is that a combination of offline and online eval leads to the best outcomes: offline to **develop** and online to **validate**. We also gave practical illustrations with Python snippets – showing how to compute similarities for CF, how to calculate NDCG, etc. – to demystify some of the math behind the metrics and algorithms. These tools and techniques should empower you to prototype your own simple recommender and understand how to measure its success. To wrap up, designing a recommendation system is as much an **art** as a science: it requires understanding your users, the items, and the business context deeply. It requires picking the right model architecture (or mixture of models) that can handle the scale and data of your problem, choosing loss functions that steer the model towards what you care about, and constantly evaluating and iterating. The field continues to evolve (we didn’t even touch on the latest like transformers for sequential recommendations, or knowledge graph enhanced recommenders, or the interplay of recommenders and explainable AI – there’s always more!), but the foundational principles covered here will serve as a solid base. When you go forth to design or improve a recommender, remember to **ask the right questions**: What’s the objective? What data do we have? Where are the cold-start gaps? How will we know it’s working (which metric, and whose experience does that metric capture)? Armed with the knowledge from this deep dive and a healthy curiosity, you’ll be well-equipped to tackle those questions and build systems that meaningfully connect users with the content they’ll love. Happy recommending! 
## **Table of Contents** - [Overview of Foundational Recommendation Techniques](#overview-of-foundational-recommendation-techniques) - [Collaborative Filtering (CF) – Learning from Similar Users or Items](#collaborative-filtering-cf--learning-from-similar-users-or-items) - [Content-Based Filtering – Learning from Item Attributes](#content-based-filtering--learning-from-item-attributes) - [Matrix Factorization – Latent Factors from Ratings](#matrix-factorization--latent-factors-from-ratings) - [Two-Tower and Multi-Tower Neural Architectures](#two-tower-and-multi-tower-neural-architectures) - [Learning to Rank – Optimizing Loss for Ranking Quality](#learning-to-rank--optimizing-loss-for-ranking-quality) - [Domain-Specific Recommendation Systems and Trade-offs](#domain-specific-recommendation-systems-and-trade-offs) - [Ad Recommendations – Optimizing Clicks & Conversions under Constraints](#ad-recommendations--optimizing-clicks--conversions-under-constraints) - [Social Recommendations – Connecting People and Content](#social-recommendations--connecting-people-and-content) - [Place Recommendations (Where to Eat or Visit) – Local Context and Personal Taste](#place-recommendations-where-to-eat-or-visit--local-context-and-personal-taste) - [Concerted Example (Python) – Combining Content and Collaborative Signals _(Illustrative)_](#concerted-example-python--combining-content-and-collaborative-signals-illustrative) - [Evaluating Recommendation Systems: Offline Metrics vs. Online A/B Testing](#evaluating-recommendation-systems-offline-metrics-vs-online-ab-testing) - [Offline Evaluation – Metrics and Methodologies](#offline-evaluation--metrics-and-methodologies) - [Online Evaluation – A/B Testing in Production](#online-evaluation--ab-testing-in-production) - [Conclusion](#conclusion)