Kimi K2.5 vs Claude Sonnet 4.6 - FFL A/B Experiment Findings

Experiment start: Feb 26, 2026
Data snapshot: Feb 27, 2026 (~2 days of data)
Traffic split: 50/50 random assignment per request
Scope: All unified processor agents (extractor, completion, chef, reference, supplement)

How the Experiment Works

The KIMI_EXPERIMENT_ENABLED flag toggles the experiment on/off
Each incoming FFL request is randomly assigned to the kimi or default group
The group is propagated via ContextVar to all agent model factories
Kimi group: Kimi K2.5 via Bedrock Converse API (primary), with Claude Sonnet 4.6 as fallback
Default group: Claude Sonnet 4.6 via Bedrock (primary), standard fallback chain
Experiment group is stored in ffl_feedback.experiment_group

Summary Metrics (from production Supabase)

Metric	Sonnet 4.6 (`default`)	Kimi K2.5 (`kimi`)
Total feedback events	64	35
Unique users who gave feedback	~41	~23
Likes	62	35
Dislikes	2	0
Approval rate	96.9%	100%
Text feedback submitted	2	0

Daily Breakdown

Date	Sonnet likes / dislikes	Kimi likes / dislikes
Feb 26	32 / 1	12 / 0
Feb 27	30 / 1	23 / 0

The 2 Sonnet Dislikes (Neither Is Accuracy-Related)

UX bug report: "Glitchy with changing the meal time. Changes back after changing it to a different meal. Have to delete entry and start again to fix" - Query: "Snack of quarter cup of blueberries"
Feature request: Long text about "fridge longevity" and wanting food recommendations to account for a rolling one-week history. Also asked whether food corrections persist across future entries.

Neither dislike is related to food logging accuracy or model quality.

Macro Correction Data (Stronger Signal Than Thumbs Up/Down)

Only Kimi-group meals had macro corrections in this window. All corrections were downward (Kimi overestimated calories):

Food	Kimi Estimated	User Corrected To	% Change	Notes
Funyuns onion rings (2 servings)	200 cal	15 cal	-92.5%	Likely quantity misinterpretation - "2 funyunis" probably meant 2 individual rings, not 2 servings
Cereal (1 bowl)	383 cal	200 cal	-47.8% (after two corrections)	Progressive correction: 383 -> 280 (-26.9%) -> 200 (-28.6%). Generic food, ambiguous portion
Pepperoni (17 slices)	428 cal	182 cal	-57.5%	Branded food, overestimated per-slice calories
Activia strawberry yogurt (1 cup)	218 cal	82 cal	-62.4%	Branded food - about 2.7x overestimate

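Average calorie correction: about -188 cal per item (-187.5, from the first estimates and final corrected values above)
Median absolute % change: about 60% (midpoint of the pepperoni and yogurt corrections)
Direction of all corrections: Down (overestimation)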

No corrections were found for Sonnet-group meals that also had feedback in this window.
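As a cross-check, the summary figures can be recomputed from the estimated and corrected values listed in the correction table. The sketch below measures each change from Kimi's first estimate to the user's final value, so a two-step correction such as the cereal counts once; the variable and function names are illustrative only.

```python
from statistics import mean, median

# (food, Kimi's first estimate in cal, final user-corrected value in cal),
# copied from the correction table in this section.
corrections = [
    ("Funyuns onion rings", 200, 15),
    ("Cereal", 383, 200),
    ("Pepperoni", 428, 182),
    ("Activia strawberry yogurt", 218, 82),
]


def pct_change(estimated: float, corrected: float) -> float:
    """Signed percentage change from the estimate to the corrected value."""
    return (corrected - estimated) / estimated * 100


deltas = [corrected - estimated for _, estimated, corrected in corrections]
pcts = [pct_change(estimated, corrected) for _, estimated, corrected in corrections]

summary = {
    "mean_delta_cal": mean(deltas),
    "median_abs_pct": median(abs(p) for p in pcts),
    "all_downward": all(d < 0 for d in deltas),
}

if __name__ == "__main__":
    for (food, _, _), p in zip(corrections, pcts):
        print(f"{food}: {p:.1f}%")
    print(summary)
```

With only four items, the mean is sensitive to a single outlier such as the Funyuns entry, so the median is the more stable of the two summaries.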

Key Findings

1. User satisfaction is comparable

Both models have very high approval (62/64 likes for Sonnet, 35/35 for Kimi). With only 2 dislikes in total, the 100% vs 96.9% gap is not statistically significant at these sample sizes. The Sonnet dislikes aren't accuracy complaints anyway.
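One way to check the significance claim is a two-sided Fisher exact test on the like/dislike counts from the summary table. The sketch below is a small pure-Python version (the function name is made up for this example, and it treats each feedback event as independent, which is an assumption since some users gave feedback more than once).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Rows: Sonnet, Kimi. Columns: likes, dislikes.
p_value = fisher_exact_two_sided(62, 2, 35, 0)

if __name__ == "__main__":
    print(f"p = {p_value:.3f}")
```

For these counts the p-value comes out at roughly 0.54, far from any conventional significance threshold.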

2. Kimi may have a calorie overestimation pattern

All 4 corrected Kimi items were corrected downward. With only 4 items this is suggestive rather than conclusive, but it may indicate that Kimi systematically overestimates:

Branded food portions (Activia yogurt, pepperoni) - possibly not calibrated to real-world serving sizes
Ambiguous quantities (Funyuns "2" interpreted as servings instead of pieces)

3. Feedback volume is uneven between groups

Kimi got 35 feedback events vs Sonnet's 64 (~35% vs ~65%), despite 50/50 random assignment per request. Possible explanations:

Kimi-group users tapped the feedback button less often (response bias or chance)
Kimi requests may have different timing characteristics affecting when the feedback prompt appears
The actual traffic split may not be 50/50; verify against total request counts per group in Logfire
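To gauge whether a 35 vs 64 split could plausibly arise by chance, an exact two-sided binomial test against p = 0.5 is a reasonable first look. The sketch below uses a hypothetical helper name and assumes feedback events are independent; since several events can come from the same user, the true evidence is weaker than the p-value suggests.

```python
from math import comb


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes in n trials under p = 0.5.

    The null distribution is symmetric, so the two-sided value is twice
    the smaller tail, capped at 1.
    """
    tail = min(k, n - k)
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


kimi_events, sonnet_events = 35, 64
p_split = binom_two_sided_p(kimi_events, kimi_events + sonnet_events)

if __name__ == "__main__":
    print(f"p = {p_split:.4f}")
```

A small p-value here would only say that feedback volume differs between groups; it cannot tell whether the cause is the traffic split itself or differing feedback behavior, which is why the request counts still need to be checked.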

4. Sample size is too small for conclusions

99 total feedback events over ~2 days are not enough to declare a winner. At least 5-7 more days of data are needed.
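A confidence interval on each approval rate gives a rough sense of how little the current sample pins down. The sketch below computes 95% Wilson score intervals for the counts in the summary table; the function name is illustrative, and the calculation again assumes independent feedback events.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


sonnet_ci = wilson_interval(62, 64)  # likes / total feedback events
kimi_ci = wilson_interval(35, 35)

if __name__ == "__main__":
    print(f"Sonnet: {sonnet_ci[0]:.3f} to {sonnet_ci[1]:.3f}")
    print(f"Kimi:   {kimi_ci[0]:.3f} to {kimi_ci[1]:.3f}")
```

Both intervals start at about 0.89 to 0.90 and overlap almost entirely, so the data so far cannot distinguish the two models on approval rate.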

Open Questions / Next Steps

Pull Logfire latency data to compare response times between groups
Verify actual traffic split (total requests per group, not just feedback)
Monitor whether the Kimi overestimation pattern persists with more data
Check if Kimi's fallback to Sonnet is triggering frequently (would dilute the comparison)
Look at error rates per group (food items stuck in pending or error state)
Get at least 7 days of data before making any routing decisions

Technical References

Experiment code: app/core/alma_agentic/unified_processor/model_utils.py
Sampling: sample_experiment_group() - 50/50 coin flip when KIMI_EXPERIMENT_ENABLED=True
Context propagation: ContextVar per async task tree
Feedback table: ffl_feedback (column: experiment_group)
Config flag: KIMI_EXPERIMENT_ENABLED in app/config.py

Raw Queries Used

-- Summary by group
SELECT experiment_group, like_status, COUNT(*) as count, COUNT(DISTINCT user_id) as unique_users
FROM ffl_feedback
WHERE experiment_group IN ('kimi', 'default')
GROUP BY experiment_group, like_status;

-- Approval rates
SELECT experiment_group,
  COUNT(CASE WHEN like_status = 'like' THEN 1 END) as likes,
  COUNT(CASE WHEN like_status = 'dislike' THEN 1 END) as dislikes,
  ROUND(COUNT(CASE WHEN like_status = 'like' THEN 1 END)::numeric / COUNT(*) * 100, 1) as approval_rate
FROM ffl_feedback
WHERE experiment_group IN ('kimi', 'default')
GROUP BY experiment_group;

-- Macro corrections for Kimi meals
SELECT cl.original_name, cl.original_calories, cl.new_calories, cl.calories_pct_change, cl.correction_direction
FROM food_items_change_log cl
JOIN food_items fi ON fi.id = cl.food_item_id
JOIN ffl_feedback fb ON fb.meal_id = fi.meal_id
WHERE fb.experiment_group = 'kimi'
  AND cl.change_type = 'MACRO_UPDATE'
  AND cl.change_timestamp >= NOW() - INTERVAL '7 days';