An outfit recommendation lane for the women's Jeans category page.
For each of the five top jeans, the lane shows multiple auto-curated looks with different companion combinations, generated purely from product-performance signals.
It helps shoppers find outfit inspiration for each hero jeans.
On the women's Jeans category page, AY wants a new lane that shows multiple outfit suggestions per jeans. The challenge is to design the system that picks those outfits automatically using only product-performance data, then to spec the data model, the data contract with the SWEs on the shop backend, and the A/B test that decides whether to ship.
A rich demand signal with a missing visual layer.
The sample dataset for this challenge has a variety of demand signals: basket and wishlist counts, category rank, brand rank, and yesterday's gross-sales percentile. However, it lacks the substrate that fashion-ecom recommenders usually run on. There is no visual metadata, session-level co-purchase, customer identity, price, or stock. The question then becomes which use cases this dataset can serve well and which require a different data foundation.
- Products: 36,387
- Virtual categories: 51
- Jeans (hero candidates): 2,348
- No null values (100% coverage): category-rank, gross-sales, brand-rank
- Partial coverage: baskets (76%), wishlists (40%)
- Color / fit / style: no attributes
- Co-purchase signal: no session data
- Product images: no visuals or embeddings
- Brand name: only brand_rank (category-relative)
- Price / stock: no price, discount or in-stock information
Where the data is enough and where it isn't
The dataset is enough to launch a useful lane and measure whether it works. It is not enough if we want to personalize, style outfits properly, or train anything that learns from shoppers.
- Enough to ship a useful lane. Basket and wishlist counts tell us what people actually want. Combined with category rank, that's enough to pick credible products for each outfit and refresh the lane daily.
- Enough to measure whether it works. The Jeans page would get enough traffic that a real lift would show up in days. Every feature in the score has a clear weight, so when the numbers move we can explain why.
- Not enough for personalization. Nothing in the data ties a product to a specific shopper. There are no past orders, no browsing history, no profile. Every visitor would see the same five jeans.
- Not enough to make outfits look styled. With no colors, materials, photos, or even brand names, the system can't tell that navy jeans go with a cream top. And with no price or stock data, it can surface a product that's sold out or heavily discounted.
- Not enough to train a smarter recommender. The models that actually learn from shopper behavior need click logs, session histories, or product imagery. The dataset has none of these, so a hand-tuned formula is honestly the best we can do until that changes.
Find a product score, then sample looks from different brands.
The recommender is a heuristic by design, which makes it defensible and transparent. It can also be shipped fast, so A/B testing can start quickly. A composite relevance score mixes five features from the dataset. For each hero jeans, I sample three companion pieces per variant (drawn from tops, outerwear, shoes, and accessories) from the top-N of the relevant category pool, weighted by score, without re-using a product across variants of the same hero. Slot composition rotates across variants so the looks feel different from each other.
- 35% Basket intent. Closest to revenue: a basket-add is one click from a purchase.
- 12% Wishlist intent. Tiebreaker. 40% coverage and ρ=0.43 with baskets reduce its independent work.
- 23% Yesterday's sales. Captures momentum the basket signal lags on. Full coverage, only partly correlated with baskets (ρ=0.53).
- 20% Category rank. The only signal for the 35% of jeans with null intent; it carries the no-data tail.
- 10% Female preference. In-brand rank divergence (overall rank − female segment).
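The weighting above can be sketched as a small scoring function. This is a minimal sketch, not the actual implementation: it assumes each feature has already been normalised to [0, 1] (1 = best), and it assumes that a missing signal is skipped and the remaining weights are renormalised. Only the weights themselves come from the list above; the function name and the null-handling rule are illustrative.

```python
from typing import Mapping, Optional

# Weights as listed above; they sum to 1.0.
WEIGHTS = {
    "baskets": 0.35,
    "wishlists": 0.12,
    "gross_sales": 0.23,
    "cat_rank": 0.20,
    "female_pref": 0.10,
}


def composite_score(features: Mapping[str, Optional[float]]) -> float:
    """Weighted mean of the features that are present.

    Values are assumed to be pre-normalised to [0, 1].
    Missing (None) signals are skipped and the remaining weights are
    renormalised; this rule is an assumption, not taken from the source.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for name, weight in WEIGHTS.items():
        value = features.get(name)
        if value is None:
            continue
        weighted_sum += weight * value
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


# Hypothetical products: one with full coverage, one with null intent signals.
full = composite_score(
    {"baskets": 0.9, "wishlists": 0.8, "gross_sales": 0.7,
     "cat_rank": 0.95, "female_pref": 0.5}
)
sparse = composite_score(
    {"baskets": None, "wishlists": None, "gross_sales": 0.7,
     "cat_rank": 0.95, "female_pref": 0.5}
)
```

Under this rule a product with no basket or wishlist data is ranked on sales, category rank, and female preference alone, which is how category rank ends up carrying the no-data tail.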
The reasoning for the weights and how to evolve them
Every weight should have a reason today and a way to be refined tomorrow. The informed priors below are paired with a post-launch tuning plan.
- 01 The demand signals are positively correlated. Baskets and wishlists have a ρ of 0.43, and baskets and gross_sales have a ρ of 0.53. Stacking all three at high weights partly double-counts the same underlying preference.
- 02 The coverage is asymmetric: baskets 76%, wishlists 40%, and gross_sales, brand_rank, and brand_rank_female 100%. Weighting a sparse signal heavily means I am only ranking the 40% of products that have it.
- 03 Hand-tuned weights generalise better than NNLS-fit weights. I held baskets out as the proxy outcome and fit weights to predict it. The fitted weights (ρ = 0.42 on the held-out 20%) underperformed the hand-tuned ones (ρ = 0.50). The sample is too small and the proxy too noisy to beat informed priors offline.
Bottom line: with only the sample CSV, I chose a defensible prior plus a plan to refine it post-launch, because offline optimisation against a noisy proxy would overfit.
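The ρ values quoted above are rank correlations. As a minimal sketch of how such a held-out comparison could be scored, here is Spearman's ρ in plain Python. The toy numbers are invented for illustration and are not taken from the challenge data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy held-out set: composite scores vs. basket counts (illustrative values).
scores = [0.9, 0.7, 0.4, 0.8, 0.1]
baskets = [120, 60, 30, 45, 5]
rho = spearman_rho(scores, baskets)
```

Running the same function on the hand-tuned and the fitted score columns against the held-out baskets gives the two numbers to compare.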
Three mechanisms, ordered by what to reach for first. The first two work from day one of the A/B test, the third is a permanent low-cost application.
- 01 Slot-level A/B ladder. We deploy today's weights as the control. Each subsequent test moves ONE weight by ±15% in a direction a stakeholder hypothesised. This builds a weight-impact table over a few weeks.
- 02 Thompson sampling on weight vectors. We first define N candidate weight configurations, then allocate impressions via Thompson sampling on outfit-lane CTR. The allocation should converge on the winning vector within a couple of days.
- 03 User evaluation at scale. Another approach is to show pairs of outfits (variant A vs variant B) and ask the user 'which would you actually wear?' We can then infer preference weights with a Bradley-Terry model. This is useful when traffic is thin or for new categories with no live data yet.
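The second mechanism can be sketched as a Beta-Bernoulli bandit, where each arm is one candidate weight vector and the reward is a click on the lane. This is a minimal sketch under stated assumptions: the class name, the uniform Beta(1, 1) prior, and the simulated click-through rates are all hypothetical.

```python
import random


class WeightVectorBandit:
    """Beta-Bernoulli Thompson sampling over N candidate weight vectors."""

    def __init__(self, n_arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm: alpha counts clicks, beta counts non-clicks.
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms

    def choose(self):
        """Sample a plausible CTR per arm and serve the arm with the highest draw."""
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, clicked):
        """Record one impression outcome for the served arm."""
        if clicked:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1


# Simulation with hypothetical true CTRs for three weight configurations.
true_ctr = [0.03, 0.04, 0.06]
bandit = WeightVectorBandit(len(true_ctr), seed=42)
traffic = random.Random(7)
pulls = [0] * len(true_ctr)
for _ in range(20000):
    arm = bandit.choose()
    pulls[arm] += 1
    bandit.update(arm, traffic.random() < true_ctr[arm])
```

In the simulation, `pulls` shows how impressions drift towards the configuration with the highest click-through rate while weaker ones still get occasional exploration traffic.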
Alternatives I considered
- Considered: Collaborative filtering. It needs co-purchase data, which the dataset does not have.
- Considered: CLIP image-embedding pairing. There are no product images in the source dataset (although I tackled this in Go Beyond by scraping top candidates).
- Considered: Brand-aesthetic clustering. brand_rank is a per-product rank within its brand, not a brand identifier, and the dataset has no brand identifier.
- Chosen: Hero + companions, weighted sampling, no product repeats. This is the main approach explained in this section. It is defensible, varied, robust to nulls, and easy to A/B and evolve. The results can be seen as variants 1–3.
- Go Beyond: Scrape top candidates' product data, compose with a multimodal LLM. Heavier infrastructure and slower iteration, but it unlocks visual coherence and natural-language rationales. Ships in parallel as variants 4–6. See the next section for more information.
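The chosen approach (score-weighted sampling from the top-N of each category pool, with no product repeats across variants of the same hero) can be sketched as follows. The slot rotation, the `top_n` default, and the pool layout are assumptions made for illustration; only the seed value 42 and the slot names echo the published snapshot.

```python
import random

# Hypothetical slot rotation: three of the four slot types per variant.
SLOT_ROTATION = [
    ("top", "outerwear", "shoes"),
    ("top", "shoes", "accessories"),
    ("top", "outerwear", "accessories"),
]


def build_variants(pools, top_n=20, seed=42):
    """Build one look per rotation entry for a single hero.

    pools maps slot name -> list of (product_id, score) candidates,
    with scores assumed to be strictly positive.
    Returns a list of dicts mapping slot -> product_id; a product is
    never reused across variants of the same hero.
    """
    rng = random.Random(seed)
    used = set()
    variants = []
    for slots in SLOT_ROTATION:
        look = {}
        for slot in slots:
            ranked = sorted(pools[slot], key=lambda p: p[1], reverse=True)
            candidates = [p for p in ranked[:top_n] if p[0] not in used]
            if not candidates:
                continue  # pool exhausted; leave the slot empty
            ids = [p[0] for p in candidates]
            weights = [p[1] for p in candidates]
            # Score-weighted draw of one product for this slot.
            pick = rng.choices(ids, weights=weights, k=1)[0]
            used.add(pick)
            look[slot] = pick
        variants.append(look)
    return variants


# Hypothetical pools: five candidates per slot with descending scores.
pools = {
    slot: [(f"{slot}_{i}", 1.0 - 0.1 * i) for i in range(5)]
    for slot in ("top", "outerwear", "shoes", "accessories")
}
variants = build_variants(pools)
```

Fixing the seed keeps the daily snapshot reproducible: the same pools and seed always yield the same looks.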
Layer a multimodal LLM on top of the heuristic for higher-quality looks.
The heuristic model is blind to color, fabric, and silhouette, all features that can take a fashion ecommerce recommender to another level. The Go Beyond approach works as a reranker on top of the heuristic ranker. It adds a multimodal LLM layer that reads the rich product specs of the top products, scraped from aboutyou.de via the shop_link provided in the sample set, and asks Claude to compose three additional outfits per hero. These appear as variants 4–6 in the same carousel.
Same data contract
- AI variants land in the same heroes[].variants[] array; the shop backend reads exactly one JSON, and a source field can be used to discriminate heuristic vs AI.
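A minimal sketch of how both kinds of variants could share one array, assuming the discriminator values are the strings "heuristic" and "ai" (hypothetical; the text only says a source field can be used) and that variant IDs follow the var_NN pattern from the snapshot:

```python
def merge_variants(heuristic, ai):
    """Concatenate both variant lists into one carousel-ordered array.

    Each variant is copied, tagged with its origin in a "source" field,
    and renumbered var_01..var_NN. The input lists are not modified.
    """
    merged = []
    for source, variants in (("heuristic", heuristic), ("ai", ai)):
        for variant in variants:
            merged.append({**variant, "source": source})
    for i, variant in enumerate(merged, start=1):
        variant["variant_id"] = f"var_{i:02d}"
    return merged


# Hypothetical inputs: three heuristic looks and three LLM-composed looks.
heuristic_looks = [{"pieces": {}} for _ in range(3)]
ai_looks = [{"pieces": {}} for _ in range(3)]
all_variants = merge_variants(heuristic_looks, ai_looks)
```

The shop backend can then render every entry the same way and use `source` only for analytics or labelling.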
The recommendation lane
Each hero below gets its own carousel of looks (inspired by About You's own product page), i.e., same jeans with different companions. Variants 1–3 come from the heuristic, variants 4–6 from Claude. Use the arrows or swipe to flip through; click any tile to open it on aboutyou.de.
Wear it with
From source to the final data contract
Task 3 from the challenge has three parts: the data model that turns raw product performance into a precomputed lane, the data-quality tests that catch regressions before the shop sees them, and the contract that the shop backend receives.
I imagine this pipeline as a dbt project that produces one JSON snapshot per day. Each stage adds one thing and is tested before the next runs, and each stage tests different data-quality (DQ) dimensions. The assumption is that the Source stage embeds the upstream contract. Click any stage to see its full dbt YAML.
Three gates over the lifecycle of a code change and the daily refresh. It's all to avoid a data-quality failure that might happen at 03:30 CET.
One JSON snapshot per day, mirrored to gs://ay-reco-lanes/jeans/de/<version>/<date>.json. It is schema-validated before publish and the shop backend pins it by lane_version and caches it in Redis for one hour. Breaking changes ship under a new version with a one-week double-publish window.
{
  "version": "v3",
  "lane_id": "jeans_wear_it_with_de_w",
  "generated_with_seed": 42,
  "weights": {
    "baskets": 0.35,
    "wishlists": 0.12,
    "gross_sales": 0.23,
    "cat_rank": 0.2,
    "female_pref": 0.1
  },
  "heroes": [
    {
      "hero_id": "hero_01",
      "hero_jeans": {
        "product_id": 29754268,
        "shop_link": "https://aboutyou.de/p/x/x-29754268",
        "virtual_category_name": "Jeans",
        "score": 0.8999,
        "baskets": 298,
        "wishlists": 7755,
        "brand_rank": 1,
        "image_url": "https://cdn.aboutstatic.com/file/images/dcb9fc9ee527b53ecb35e956e4f3cff0.png?bg=F4F4F5&quality=75&trim=1&height=630&width=1200&expand=1"
      },
      "variants": [
        {
          "variant_id": "var_01",
          "composite_score": 0.8712,
          "pieces": {
            "top": {
              "product_id": 4861398,
              "shop_link": "https://aboutyou.de/p/x/x-4861398",
              "virtual_category_name": "Blusen & Tuniken",
              "score": 0.836,
              "baskets": 5,
              "wishlists": 440,
              "brand_rank": 7,
              "image_url": "https://cdn.aboutstatic.com/file/images/6b650889ff4ca324be7c691d150230be?bg=F4F4F5&quality=75&trim=1&height=630&width=1200&expand=1"
            },
            "outerwear": {
              "product_id": 30778531,
              "shop_link": "https://aboutyou.de/p/x/x-30778531",
              "virtual_category_name": "Pullover & Strick",
              "score": 0.8515,
              "baskets": 6,
              "wishlists": 157,
              "brand_rank": 17,
              "image_url": "https://cdn.aboutstatic.com/file/images/49fe04f004b144b6f27dd57594070954.jpg?brightness=0.96&quality=75&trim=1&height=630&width=1200&expand=1"
            },
            "shoes": {
              "product_id": 5728267,
              "shop_link": "https://aboutyou.de/p/x/x-5728267",
              "virtual_category_name": "Pumps & High Heels",
              "score": 0.8975,
              "baskets": 50,
              "wishlists": 483,
              "brand_rank": 12,
              "image_url": "https://cdn.aboutstatic.com/file/images/f689e2cfafa682eab539a6c0645d5f29.png?bg=F4F4F5&quality=75&trim=1&height=630&width=1200&expand=1"
            }
          }
        },
        "… 5 more variants …"
      ]
    },
    "… 4 more heroes …"
  ]
}
Three events can be used for the A/B test and the post-launch dashboard:
This is what the recommendations team promises the shop backend. Each SLO maps to an alert and a fallback behaviour. There is never a blank lane: at the very least, yesterday's snapshot stays live if anything breaks.
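The "never a blank lane" promise can be sketched as a validate-then-fallback step before publishing. This is a minimal sketch: the specific checks (required keys, weights summing to 1) and both function names are hypothetical, chosen to match the fields visible in the snapshot, and stand in for whatever schema validation the pipeline actually runs.

```python
REQUIRED_TOP_LEVEL = ("version", "lane_id", "weights", "heroes")
REQUIRED_PRODUCT = ("product_id", "shop_link", "score", "image_url")


def validate_snapshot(snapshot):
    """Return a list of human-readable problems; an empty list means publishable."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP_LEVEL
                if k not in snapshot]
    if problems:
        return problems
    if abs(sum(snapshot["weights"].values()) - 1.0) > 1e-6:
        problems.append("weights do not sum to 1")
    if not snapshot["heroes"]:
        problems.append("no heroes")
    for hero in snapshot["heroes"]:
        # Check the hero jeans and every companion piece of every variant.
        products = [hero.get("hero_jeans", {})]
        for variant in hero.get("variants", []):
            products.extend(variant.get("pieces", {}).values())
        for product in products:
            for key in REQUIRED_PRODUCT:
                if key not in product:
                    problems.append(
                        f"{hero.get('hero_id')}: product missing {key}")
    return problems


def choose_snapshot(today, yesterday):
    """Publish today's snapshot if it validates, else keep yesterday's live."""
    if today is not None and not validate_snapshot(today):
        return today
    return yesterday
```

A failed or missing daily build then degrades to a stale lane rather than an empty one, and the list of problems can feed the alert.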
Hypothesis, metrics, and the decision rule.
The hypothesis is that the outfit lane increases the AOV of jeans-purchase sessions by ≥5% without regressing CLP bounce or page latency. At a 4% baseline CTR with a 5% MDE, α=0.05 and power=0.80, the test needs about 75,000 visitors per arm. I would hold the test open for at least a full week to absorb day-of-week effects.
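A minimal sketch of the sample-size arithmetic, assuming the 5% MDE is relative to the 4% baseline CTR and the test is two-sided. Both functions use the standard normal approximation. Under these assumptions the single-proportion form lands near the 75,000 quoted above, while the form that estimates both arms' rates asks for roughly twice that, so it is worth confirming which form the plan intends before fixing the test duration.

```python
from math import ceil
from statistics import NormalDist


def z_total(alpha=0.05, power=0.80):
    """Sum of the two-sided significance and power quantiles."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_single_proportion(p, rel_mde, alpha=0.05, power=0.80):
    """Visitors needed to detect a shift from a fixed, known baseline rate p."""
    delta = p * rel_mde
    return ceil(z_total(alpha, power) ** 2 * p * (1 - p) / delta ** 2)


def n_per_arm_two_proportions(p, rel_mde, alpha=0.05, power=0.80):
    """Visitors per arm when control and treatment rates are both estimated."""
    p2 = p * (1 + rel_mde)
    delta = p2 - p
    variance = p * (1 - p) + p2 * (1 - p2)
    return ceil(z_total(alpha, power) ** 2 * variance / delta ** 2)


n_one = n_single_proportion(0.04, 0.05)
n_two = n_per_arm_two_proportions(0.04, 0.05)
```

Dividing the chosen figure by the daily visitors per arm gives the minimum run time, which should still be rounded up to whole weeks.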
- The North star says whether it is relevant to the user.
- The Primary metric says whether it converts.
- The Guardrails say whether it broke something.
- The Diagnostics inform the next iteration.
Eight parameters that define the experiment. Click any to see the rationale and what would break if it were chosen differently.
I would write down the decision tree before the test runs so we can't move the goalposts when results come in.
Path to a better model.
- → From the AOV perspective, the first thing to add is the co-purchase signal from session data (item-item collaborative filtering).
- → If CLIP image embeddings are available, we can pair items by color and silhouette (content-based filtering with visual embeddings).
- → Using style and fit attributes as structured catalog fields (content-based recommendations / feature engineering).
- → Pre-filtering on catalogue attributes such as stock_available (business-rule filtering), so the lane never shows a dead or out-of-category product.
- → Personalization by cohort or past purchases (user-based collaborative filtering / session-based recommendations) once customer-level data is in scope.