Rami Alhamad

Writing · Apr 22, 2026

Kimi K2.5 vs Claude Sonnet 4.6 - FFL A/B Experiment Findings

  • Experiment start: Feb 26, 2026
  • Data snapshot: Feb 27, 2026 (~2 days of data)
  • Traffic split: 50/50 random assignment per request
  • Scope: All unified processor agents (extractor, completion, chef, reference, supplement)

How the Experiment Works

  • KIMI_EXPERIMENT_ENABLED flag toggles the experiment on/off
  • Each incoming FFL request is randomly assigned to kimi or default group
  • The group is propagated via ContextVar to all agent model factories
  • Kimi group: Kimi K2.5 via Bedrock Converse API (primary), with Claude Sonnet 4.6 as fallback
  • Default group: Claude Sonnet 4.6 via Bedrock (primary), standard fallback chain
  • Experiment group is stored in ffl_feedback.experiment_group
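The assignment and propagation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in model_utils.py: the signature of sample_experiment_group(), the ContextVar name, the helpers model_chain_for_current_request(), run_agent(), and handle_request(), and the model ID strings are all assumptions made for the example. The default group's standard fallback chain is left out because this write-up does not list it.

```python
import asyncio
import random
from contextvars import ContextVar

# Assumption: stands in for the KIMI_EXPERIMENT_ENABLED flag in app/config.py.
KIMI_EXPERIMENT_ENABLED = True

# Hypothetical ContextVar name; each asyncio task works on its own copy of the context.
experiment_group = ContextVar("experiment_group", default="default")


def sample_experiment_group(rng=None):
    """50/50 coin flip per request; always 'default' when the flag is off."""
    if not KIMI_EXPERIMENT_ENABLED:
        return "default"
    rng = rng or random.Random()
    return "kimi" if rng.random() < 0.5 else "default"


def model_chain_for_current_request():
    """Hypothetical model factory: primary model first, then fallbacks."""
    if experiment_group.get() == "kimi":
        return ["kimi-k2.5", "claude-sonnet-4.6"]  # placeholder IDs, not Bedrock model IDs
    return ["claude-sonnet-4.6"]  # standard fallback chain omitted (not listed in this doc)


async def run_agent(agent_name):
    # Agents never receive the group as an argument; they read it from the context.
    return agent_name, model_chain_for_current_request()[0]


async def handle_request(group=None):
    # Set once per request; tasks created afterwards inherit a copy of this context.
    experiment_group.set(group or sample_experiment_group())
    return await asyncio.gather(run_agent("extractor"), run_agent("chef"))


async def demo():
    # Two concurrent requests keep separate groups because each runs in its own task.
    return await asyncio.gather(handle_request("kimi"), handle_request("default"))


if __name__ == "__main__":
    for result in asyncio.run(demo()):
        print(result)
```

The point of using a ContextVar instead of a function argument is that every agent factory under the same request sees the same group without the group being threaded through each call signature.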

Summary Metrics (from production Supabase)

| Metric | Sonnet 4.6 (default) | Kimi K2.5 (kimi) |
|---|---|---|
| Total feedback events | 64 | 35 |
| Unique users who gave feedback | ~41 | ~23 |
| Likes | 62 | 35 |
| Dislikes | 2 | 0 |
| Approval rate | 96.9% | 100% |
| Text feedback submitted | 2 | 0 |

Daily Breakdown

| Date | Sonnet likes / dislikes | Kimi likes / dislikes |
|---|---|---|
| Feb 26 | 32 / 1 | 12 / 0 |
| Feb 27 | 30 / 1 | 23 / 0 |

The 2 Sonnet Dislikes (Neither Is Accuracy-Related)

  1. UX bug report: "Glitchy with changing the meal time. Changes back after changing it to a different meal. Have to delete entry and start again to fix" - Query: "Snack of quarter cup of blueberries"
  2. Feature request: A long message about "fridge longevity", asking for food recommendations to account for a rolling one-week history. The user also asked whether food corrections persist across future entries.

Neither dislike is related to food logging accuracy or model quality.

Macro Correction Data (Stronger Signal Than Thumbs Up/Down)

Only Kimi-group meals had macro corrections in this window. All corrections were downward (Kimi overestimated calories):

| Food | Kimi estimated | User corrected to | % change | Notes |
|---|---|---|---|---|
| Funyuns onion rings (2 servings) | 200 cal | 15 cal | -92.5% | Likely quantity misinterpretation - "2 funyunis" probably meant 2 individual rings, not 2 servings |
| Cereal (1 bowl) | 383 cal | 200 cal | -47.8% cumulative (last step, 280 -> 200: -28.6%) | Progressive correction over two edits: 383 -> 280 -> 200. Generic food, ambiguous portion |
| Pepperoni (17 slices) | 428 cal | 182 cal | -57.5% | Branded food, overestimated per-slice calories |
| Activia strawberry yogurt (1 cup) | 218 cal | 82 cal | -62.5% | Branded food - roughly a 2.7x overestimate |

  • Average calorie correction: -186 cal per item
  • Median absolute % change: 57.5%
  • Direction of all corrections: down (Kimi overestimated)

No corrections were found for Sonnet-group meals that also had feedback in this window.
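The percentages and summary figures for the Kimi corrections can be re-derived from the calorie values in the table. The sketch below works only from the rounded numbers shown in this write-up, so small differences from the reported figures are expected. With these inputs the per-item mean comes out at -187.5 cal (reported: -186, possibly computed from unrounded values), Activia comes out at -62.4% rather than -62.5%, and the reported 57.5% median matches the median over the five individual correction steps rather than over the four items.

```python
from statistics import mean, median

# (food, calorie values in order: Kimi estimate, then each user correction)
corrections = [
    ("Funyuns", [200, 15]),
    ("Cereal", [383, 280, 200]),
    ("Pepperoni", [428, 182]),
    ("Activia yogurt", [218, 82]),
]


def pct_change(old, new):
    """Signed percent change from old to new."""
    return (new - old) / old * 100


# Per item: Kimi's first estimate vs the final corrected value.
per_item_pct = {food: round(pct_change(v[0], v[-1]), 1) for food, v in corrections}
per_item_delta = [v[-1] - v[0] for _, v in corrections]

# Per correction step: one entry for every individual edit (cereal has two).
per_step_pct = [
    round(pct_change(old, new), 1)
    for _, values in corrections
    for old, new in zip(values, values[1:])
]

mean_delta = mean(per_item_delta)
median_abs_step_pct = median(abs(p) for p in per_step_pct)

if __name__ == "__main__":
    print(per_item_pct)
    print(mean_delta)
    print(per_step_pct)
    print(median_abs_step_pct)
```

Keeping the per-item and per-step views separate makes it explicit which one a summary statistic refers to, which matters here because the cereal entry was corrected twice.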

Key Findings

1. User satisfaction is comparable

Both models have very high approval. The 100% vs 96.9% gap is not statistically significant at these sample sizes. The Sonnet dislikes aren't accuracy complaints anyway.
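One way to check the significance claim is a two-sided Fisher exact test on the like/dislike counts (62/2 for Sonnet, 35/0 for Kimi). The sketch below implements the test with the standard library only. It treats every feedback event as independent, which is an assumption, since several events can come from the same user.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Rows: Sonnet (default), Kimi. Columns: likes, dislikes.
p_value = fisher_exact_two_sided(62, 2, 35, 0)

if __name__ == "__main__":
    print(f"p = {p_value:.3f}")
```

With these counts the p-value is about 0.54, far above the usual 0.05 threshold, which is consistent with treating the approval gap as noise.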

2. Kimi has a calorie overestimation pattern

4 out of 4 corrected Kimi items were corrected downward. This suggests Kimi may systematically overestimate:

  • Branded food portions (Activia yogurt, pepperoni) - possibly not calibrated to real-world serving sizes
  • Ambiguous quantities (Funyuns "2" interpreted as servings instead of pieces)

3. Traffic split may be uneven in feedback

Kimi got 35 feedback events vs Sonnet's 64 (~35% vs ~65%), despite 50/50 random assignment per request. Possible explanations:

  • Fewer Kimi-group users happened to tap the feedback button (chance or response bias)
  • Kimi requests may have different timing characteristics affecting when the feedback prompt appears

To confirm, verify the actual traffic split against total request counts per group in Logfire.
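A quick sample-ratio check shows how surprising a 35 vs 64 split would be if feedback events really arrived 50/50. The sketch below is an exact two-sided binomial test using only the standard library. It assumes independent events, which is not strictly true here because several events can come from the same user, so the result overstates the evidence and should be read as a prompt to check the request counts rather than as a conclusion.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n fair coin flips (p = 0.5)."""
    tail = min(k, n - k)
    # At p = 0.5 the distribution is symmetric, so double the smaller tail and cap at 1.
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


# 35 Kimi feedback events out of 99 total (35 + 64).
p_events = binomial_two_sided_p(35, 99)

if __name__ == "__main__":
    print(f"p = {p_events:.4f}")
```

For 35 of 99 the p-value is well below 0.05 under the independence assumption. The same function can be applied to total requests per group once those counts are pulled, which is the comparison that actually tests the 50/50 assignment.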

4. Sample size is too small for conclusions

99 feedback events (64 Sonnet + 35 Kimi) over ~2 days are not enough to declare a winner. At least 5-7 more days of data are needed.
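To make the sample-size concern concrete, the sketch below computes 95% Wilson score intervals for the two approval rates. The Wilson interval is chosen here as an illustration because it behaves sensibly at proportions near 100%. It is not part of the original analysis, and it treats feedback events as independent.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


sonnet_low, sonnet_high = wilson_interval(62, 64)
kimi_low, kimi_high = wilson_interval(35, 35)

if __name__ == "__main__":
    print(f"Sonnet: {sonnet_low:.3f} to {sonnet_high:.3f}")
    print(f"Kimi:   {kimi_low:.3f} to {kimi_high:.3f}")
```

Both intervals span roughly 0.89-0.90 up to 0.99-1.00 and overlap almost entirely, so the current data cannot separate the two models on approval rate.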

Open Questions / Next Steps

  • Pull Logfire latency data to compare response times between groups
  • Verify actual traffic split (total requests per group, not just feedback)
  • Monitor whether the Kimi overestimation pattern persists with more data
  • Check if Kimi's fallback to Sonnet is triggering frequently (would dilute the comparison)
  • Look at error rates per group (food items stuck in pending or error state)
  • Get at least 7 days of data before making any routing decisions

Technical References

  • Experiment code: app/core/alma_agentic/unified_processor/model_utils.py
  • Sampling: sample_experiment_group() - 50/50 coin flip when KIMI_EXPERIMENT_ENABLED=True
  • Context propagation: ContextVar per async task tree
  • Feedback table: ffl_feedback (column: experiment_group)
  • Config flag: KIMI_EXPERIMENT_ENABLED in app/config.py

Raw Queries Used

-- Summary by group
SELECT experiment_group, like_status, COUNT(*) as count, COUNT(DISTINCT user_id) as unique_users
FROM ffl_feedback
WHERE experiment_group IN ('kimi', 'default')
GROUP BY experiment_group, like_status;

-- Approval rates
SELECT experiment_group,
  COUNT(CASE WHEN like_status = 'like' THEN 1 END) as likes,
  COUNT(CASE WHEN like_status = 'dislike' THEN 1 END) as dislikes,
  ROUND(COUNT(CASE WHEN like_status = 'like' THEN 1 END)::numeric / COUNT(*) * 100, 1) as approval_rate
FROM ffl_feedback
WHERE experiment_group IN ('kimi', 'default')
GROUP BY experiment_group;

-- Macro corrections for Kimi meals
SELECT cl.original_name, cl.original_calories, cl.new_calories, cl.calories_pct_change, cl.correction_direction
FROM food_items_change_log cl
JOIN food_items fi ON fi.id = cl.food_item_id
JOIN ffl_feedback fb ON fb.meal_id = fi.meal_id
WHERE fb.experiment_group = 'kimi'
  AND cl.change_type = 'MACRO_UPDATE'
  AND cl.change_timestamp >= NOW() - INTERVAL '7 days';
