About You Tech Challenge - Data Analyst

Help shoppers get outfit inspiration for each hero jeans.

On the women's Jeans category page, AY wants a new lane that shows multiple outfit suggestions per jeans. The challenge is to design the system that picks those outfits automatically using only product-performance data, then to spec the data model, the data contract with the SWEs on the shop backend, and the A/B test that decides whether to ship.

A rich demand signal with a missing visual layer.

The sample dataset for this challenge offers a variety of demand signals: basket and wishlist counts, category rank, brand rank, and yesterday's gross-sales percentile. However, it lacks the substrate that fashion e-commerce recommenders usually run on: there is no visual metadata, session-level co-purchase data, customer identity, price, or stock. The question then becomes which use cases this dataset can serve well and which require a different data foundation.

What we have

Products: 36,387
Virtual categories: 51
Jeans (hero candidates): 2,348
No null values (100% coverage): category-rank & gross-sales & brand-rank
Partial coverage: baskets (76%) & wishlists (40%)

What we don't have

Color / fit / style: No attributes
Co-purchase signal: No session data
Product images: No visuals & embeddings
Brand name: Only brand_rank (category-relative)
Price / stock: No price, discount or in-stock information

Where the data is enough and where it isn't

The dataset is enough to launch a useful lane and measure whether it works. It is not enough if we want to personalize, style outfits properly, or train anything that learns from shoppers.

Enough to ship a useful lane. Basket and wishlist counts tell us what people actually want. Combined with category rank, that's enough to pick credible products for each outfit and refresh the lane daily.
Enough to measure whether it works. The Jeans page would get enough traffic that a real lift would show up in days. Every feature in the score has a clear weight, so when the numbers move we can explain why.
Not enough for personalization. Nothing in the data ties a product to a specific shopper. There are no past orders, no browsing history, no profile. Every visitor would see the same five jeans.
Not enough to make outfits look styled. With no colors, materials, photos, or even brand names, the system can't tell that navy jeans go with a cream top. And with no price or stock data, it can surface a product that's sold out or heavily discounted.
Not enough to train a smarter recommender. The models that actually learn from shopper behavior need click logs, session histories, or product imagery. The dataset has none of these, so a hand-tuned formula is honestly the best we can do until that changes.

Find a product score, then sample looks from different brands.

The recommender is a heuristic by design, which makes it defensible and transparent. It can also ship fast, so A/B testing can start quickly. A composite relevance score mixes five features from the dataset. For each hero jeans, I sample three companion pieces per variant (drawn from the tops, outerwear, shoes, and accessories slots) from the top-N of the relevant category pool, weighted by score, without re-using a product across variants of the same hero. Slot composition rotates across variants so the looks feel different from each other.

score(p) = 0.35 · (1 − basket_pct) + 0.12 · (1 − wishlist_pct) + 0.23 · (1 − sales_pct) + 0.20 · inv_rank(category) + 0.10 · female_pref_bonus

Basket intent (35%): Closest to revenue, since a basket-add is one click from a purchase.
Wishlist intent (12%): Tiebreaker. Its 40% coverage and ρ = 0.43 with baskets limit how much independent information it adds.
Yesterday's sales (23%): Captures momentum that the basket signal lags on. Full coverage, and only partly correlated with baskets (ρ = 0.53).
Category rank (20%): The only signal for the 35% of jeans with null intent, so it carries the no-data tail.
Female preference (10%): In-brand rank divergence (overall rank − female-segment rank).

Percentile semantics in this dataset are inverted: 0.0 means the product is at the top of the catalog, so every percentile term flips the raw value with (1 − pct). Null intent values fall back to 0.5 (neutral), not 0 (worst).
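A minimal sketch of the formula and the null fallback described above. The function and argument names are illustrative, not taken from the challenge code. It also assumes that inv_rank(category) and female_pref_bonus arrive already scaled to [0, 1] with higher meaning better, since the text does not spell out how those two terms are computed.

```python
from typing import Optional

# Weights copied from the score formula; they sum to 1.0.
WEIGHTS = {
    "basket": 0.35,
    "wishlist": 0.12,
    "sales": 0.23,
    "category": 0.20,
    "female": 0.10,
}
NEUTRAL = 0.5  # fallback for null percentiles: neutral, not worst


def flip(pct: Optional[float]) -> float:
    """Turn an inverted percentile (0.0 = best) into a higher-is-better value."""
    if pct is None:
        return NEUTRAL
    return 1.0 - pct


def score(
    basket_pct: Optional[float],
    wishlist_pct: Optional[float],
    sales_pct: Optional[float],
    inv_category_rank: float,   # assumed pre-scaled to [0, 1], higher = better
    female_pref_bonus: float,   # assumed pre-scaled to [0, 1]
) -> float:
    """Composite relevance score as a weighted sum of the five features."""
    return (
        WEIGHTS["basket"] * flip(basket_pct)
        + WEIGHTS["wishlist"] * flip(wishlist_pct)
        + WEIGHTS["sales"] * flip(sales_pct)
        + WEIGHTS["category"] * inv_category_rank
        + WEIGHTS["female"] * female_pref_bonus
    )
```

Under these assumptions, a product that tops every signal scores 1.0 up to floating-point rounding, and a product with no basket or wishlist data is still ranked by its sales, category rank, and female-preference terms rather than being pushed to the bottom.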

Hero selection is deterministic (top-K by score), but companions are drawn by weighted sampling from the top-25 of each slot. The seed (default 42) is persisted to generated_with_seed in the output so any shipped lane can be reproduced exactly.
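The companion draw can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: the pool layout (slot name mapped to a list of (product_id, score) pairs), the function name, and the rotation rule (each variant skips one of the four slots) are all hypothetical. Only the top-25 cut-off, the score weighting, the no-repeat rule, the default seed of 42, and the generated_with_seed field come from the text.

```python
import random
from typing import Dict, List, Tuple

SLOTS = ["tops", "outerwear", "shoes", "accessories"]
Pool = List[Tuple[int, float]]  # (product_id, score) pairs


def sample_looks(
    pools: Dict[str, Pool],
    n_variants: int = 3,
    top_n: int = 25,
    seed: int = 42,
) -> dict:
    """Draw companion pieces for one hero, reproducibly for a given seed."""
    rng = random.Random(seed)
    used = set()  # product ids already placed in a variant of this hero
    variants = []
    for v in range(n_variants):
        # Assumed rotation rule: variant v leaves out slot v, so each look
        # has three pieces and a different slot composition.
        slots = [s for i, s in enumerate(SLOTS) if i != v % len(SLOTS)]
        look = {}
        for slot in slots:
            ranked = sorted(pools[slot], key=lambda p: p[1], reverse=True)[:top_n]
            candidates = [(pid, sc) for pid, sc in ranked if pid not in used]
            if not candidates:
                continue  # nothing left in this slot's top-N; skip the slot
            ids = [pid for pid, _ in candidates]
            weights = [max(sc, 1e-9) for _, sc in candidates]  # keep weights positive
            pick = rng.choices(ids, weights=weights, k=1)[0]
            used.add(pick)
            look[slot] = pick
        variants.append(look)
    return {"generated_with_seed": seed, "variants": variants}
```

Because all randomness flows through one random.Random(seed) instance, calling the function twice with the same pools and seed returns the same looks, which is the property the persisted seed relies on.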

The reasoning for the weights and how to evolve them

Every weight should have a reason today and a way to be refined tomorrow. The current weights are informed priors, paired with a post-launch tuning plan.

What the sample data alone tells us

01. The demand signals are positively correlated: baskets and wishlists have ρ = 0.43, and baskets and gross_sales have ρ = 0.53. Stacking all three at high weights partly double-counts the same underlying preference.
02. Coverage is asymmetric: baskets 76%, wishlists 40%, and gross_sales, brand_rank, and brand_rank_female 100%. Weighting a sparse signal heavily means ranking only the products that have it (40% in the case of wishlists).
03. Hand-tuned weights generalise better than NNLS-fit weights. I held baskets out as the proxy outcome and fit weights to predict it. On the held-out 20%, the fitted weights (ρ = 0.42) underperformed the hand-tuned ones (ρ = 0.50). The sample is too small and the proxy too noisy to beat informed priors offline.
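The text does not say which coefficient ρ denotes. Assuming it is a Spearman rank correlation, and assuming nulls are handled by dropping any product that lacks either signal (pairwise deletion), the quoted figures could be computed along these lines. The helper names are hypothetical.

```python
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]) -> float:
    """Spearman correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    rx = average_ranks([p[0] for p in pairs])
    ry = average_ranks([p[1] for p in pairs])
    n = len(pairs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Pairwise deletion matters here because of the asymmetric coverage: a correlation between baskets and wishlists can only be estimated on the products that have both values.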

Bottom line: with only the sample CSV, I chose a defensible prior plus a plan to refine it post-launch, because offline optimisation against a noisy proxy would overfit.

How we evolve the weights once we're live

Three mechanisms, ordered by what to reach for first. The first two work from day one of the A/B test; the third is a low-cost method that can run permanently.

01. Slot-level A/B ladder: We deploy today's weights as the control. Each subsequent test moves ONE weight by ±15% in a direction a stakeholder hypothesised. This builds a weight-impact table over a few weeks.
02. Thompson sampling on weight vectors: We first define N candidate weight configurations, then allocate impressions via Thompson sampling on outfit-lane CTR. The allocation should converge on the winning vector within a couple of days.
03. User evaluation at scale: Another approach is to show pairs of outfits (variant A vs variant B) and ask the user "which would you actually wear?" We can then infer preference weights via a Bradley-Terry model. This is useful when traffic is thin or for new categories with no live data yet.
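The second mechanism can be sketched as a Beta-Bernoulli bandit, one common way to implement Thompson sampling when the reward is a click or no click. The class name, the uniform Beta(1, 1) prior, and the simulated click-through rates are assumptions for illustration; none of them come from the challenge data.

```python
import random
from typing import List, Sequence


class ThompsonAllocator:
    """Beta-Bernoulli Thompson sampling over a fixed set of weight vectors."""

    def __init__(self, n_arms: int, seed: int = 0) -> None:
        self.clicks = [0] * n_arms  # impressions that got a click, per arm
        self.skips = [0] * n_arms   # impressions without a click, per arm
        self.rng = random.Random(seed)

    def choose(self) -> int:
        """Sample a plausible CTR for each arm and serve the best draw."""
        draws = [
            self.rng.betavariate(1 + c, 1 + s)  # Beta(1, 1) prior, assumed
            for c, s in zip(self.clicks, self.skips)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm: int, clicked: bool) -> None:
        if clicked:
            self.clicks[arm] += 1
        else:
            self.skips[arm] += 1


def simulate(true_ctrs: Sequence[float], impressions: int, seed: int = 0) -> List[int]:
    """Return how many impressions each arm received in a simulated run."""
    allocator = ThompsonAllocator(len(true_ctrs), seed)
    traffic = random.Random(seed + 1)  # separate stream for simulated shoppers
    served = [0] * len(true_ctrs)
    for _ in range(impressions):
        arm = allocator.choose()
        served[arm] += 1
        allocator.update(arm, traffic.random() < true_ctrs[arm])
    return served
```

Each arm stands for one candidate weight configuration. In a simulation with made-up CTRs such as simulate([0.02, 0.05, 0.20], 3000), most impressions drift to the arm with the highest rate. How fast that happens on live traffic depends on the real CTR gaps, which the sample data cannot tell us.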

Alternatives I considered

Considered: Collaborative filtering. It needs co-purchase data, which the dataset does not have.
Considered: CLIP image-embedding pairing. There are no product images in the source dataset (although I tackled this in Go Beyond by scraping top candidates).
Considered: Brand-aesthetic clustering. brand_rank is a per-product, in-brand rank, not a brand identifier, and the dataset has no brand identifier.
Chosen: Hero + companions, weighted sampling, no product repeats. This is the main approach explained in this section. It is defensible, varied, robust to nulls, and easy to A/B test and evolve. The results appear as variants 1–3.
Go Beyond: Scrape top candidates' product data, compose with a multimodal LLM. Heavier infrastructure and slower iteration, but it unlocks visual coherence and natural-language rationales. Ships in parallel as variants 4–6. See the next section for more information.

Layer a multimodal LLM on top of the heuristic for higher-quality looks.

The heuristic model is blind to color, fabric, and silhouette, features that can take a fashion e-commerce recommender to another level. The Go Beyond approach works as a reranker on top of the heuristic ranker. It scrapes rich product specs for the top products from aboutyou.de, using the shop_link provided in the sample set, and passes them to a multimodal LLM: Claude is asked to compose three additional outfits per hero. These appear as variants 4–6 in the same carousel.

example product_specs.json entry

{
  "product_id": 443634,
  "name": "PIECES Umhängetasche (cognac, One Size)",
  "brand": "PIECES Umhängetasche",
  "description": "PIECES Umhängetasche (cognac, One Size)",
  "color": "cognac",
  "price_amount": 34.9,
  "price_currency": "EUR",
  "availability": "InStock",
  "sku": "6348125",
  "images": [
    "https://cdn.aboutstatic.com/file/e99c14e5…?bg=F4F4F5&quality=75&trim=1",
    "https://cdn.aboutstatic.com/file/679ccb83…?bg=F4F4F5&quality=75&trim=1",
    "https://cdn.aboutstatic.com/file/b58c1580…?brightness=0.96&quality=75&trim=1"
    // …5 images total per product
  ],
  "fetched_at": "2026-05-24T22:59:17+00:00"
}

what claude is asked

You are a senior womenswear merchandiser for AboutYou, a fashion
e-commerce platform serving the German market. You compose outfit
suggestions ("Wear it with" looks) for a category-page recommendation
lane on the women's Jeans page. Every outfit anchors on a hero pair of
jeans and combines exactly three companion pieces drawn from the
candidates provided. Your goal is high visual coherence and shoppability
— a real shopper should look at the look and think "yes, I would wear
this together".

Hard coherence rules — break them and the look is rejected:
1. SEASON. The look must read as one season. No winter coat or wool knit
   with sandals or slides. No swimwear with boots. If the hero jeans is
   heavy / dark / lined denim, lean cooler; if it's light wash, frayed,
   or summery, lean warmer.
2. GENDER. This lane is women's only. Reject any candidate whose image
   obviously reads men's or kids'.
3. MATERNITY. If the HERO jeans is NOT a maternity product, DO NOT
   include any maternity companion. Conversely, if the hero IS
   maternity, prefer maternity companions where available.
4. CATEGORY DUPLICATION. Never include two pieces from the same slot.
5. COLOR / SILHOUETTE / FORMALITY. Build a deliberate color story;
   balance proportions; match formality.

If a slot's candidate pool contains nothing that satisfies the rules,
drop that slot. Always return through the `return_outfits` tool.

The system prompt is versioned (SYSTEM_PROMPT_VERSION) and included in the cache key, so any change re-generates all compositions on the next run.
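One way such a cache key could be built is sketched below. Only the constant name SYSTEM_PROMPT_VERSION comes from the text; its value, the function name, and the other key fields (hero id and candidate ids) are assumptions about what a composition depends on.

```python
import hashlib
import json
from typing import Iterable

SYSTEM_PROMPT_VERSION = "v1"  # hypothetical value; only the name is from the text


def cache_key(
    hero_id: int,
    candidate_ids: Iterable[int],
    prompt_version: str = SYSTEM_PROMPT_VERSION,
) -> str:
    """Hash everything the composition is assumed to depend on."""
    payload = {
        "prompt_version": prompt_version,
        "hero_id": hero_id,
        "candidate_ids": sorted(candidate_ids),  # order-independent
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Because the prompt version is part of the hashed payload, bumping it changes every key, so every cached composition misses and is regenerated on the next run.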

Same data contract

AI variants land in the same heroes[].variants[] array; the shop backend reads exactly one JSON, and a source field can be used to discriminate heuristic vs AI.
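A hedged sketch of what that single payload and a consumer-side check might look like. Only heroes, variants, source, and generated_with_seed are named in the text. Every other field name, the two source labels, and the top-level placement of generated_with_seed are hypothetical.

```python
from typing import Any, Dict, List

ALLOWED_SOURCES = {"heuristic", "ai"}  # hypothetical labels for the source field

# Hypothetical payload shape; ids and companion lists are placeholders.
EXAMPLE_LANE: Dict[str, Any] = {
    "generated_with_seed": 42,
    "heroes": [
        {
            "hero_id": 1,
            "variants": [
                {"variant": 1, "source": "heuristic", "companion_ids": [11, 12, 13]},
                {"variant": 4, "source": "ai", "companion_ids": [21, 22, 23]},
            ],
        }
    ],
}


def validate_lane(payload: Dict[str, Any]) -> None:
    """Raise ValueError if any variant lacks a recognised source label."""
    for hero in payload["heroes"]:
        for variant in hero["variants"]:
            if variant.get("source") not in ALLOWED_SOURCES:
                raise ValueError(f"unknown source: {variant.get('source')!r}")


def variants_by_source(payload: Dict[str, Any], source: str) -> List[Dict[str, Any]]:
    """Collect all variants with the given source across heroes."""
    return [
        variant
        for hero in payload["heroes"]
        for variant in hero["variants"]
        if variant["source"] == source
    ]
```

Keeping both kinds of variant in one array means the backend needs no second code path, while the A/B analysis can still split results by source.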

The recommendation lane

Each hero below gets its own carousel of looks (inspired by About You's own product page), that is, the same jeans with different companions. Variants 1–3 come from the heuristic, variants 4–6 from Claude. Use the arrows or swipe to flip through; click any tile to open it on aboutyou.de.

An outfit recommendation lane for the women's Jeans category page.