What a multimodal AI sees when it reads your restaurant receipt, why it gets things right (and occasionally wrong), and how to spot mistakes.
Receipt scanning used to be the kind of feature that almost worked. You'd snap a photo, the app would catch the total maybe 60% of the time, miss the date, hallucinate a line item, and you'd spend longer fixing it than typing the receipt manually.
That changed in about 2024. Multimodal AI models can now read a crumpled, sideways, slightly-wet restaurant bill in two seconds and produce a structured list of items with prices, tax, tip, and the merchant name. It's not magic - it's linear algebra and a lot of training data - but the practical effect is that splitting an itemised dinner takes 30 seconds instead of 5 minutes.
Here's what's actually happening when you point your phone at a receipt, why it works now when it didn't before, and where it still trips up.
Multimodal language models read the image in a single inference pass and emit structured JSON. Accuracy on clean receipts is ~95%; on photos taken in a moving car after three drinks, somewhat less. Always sanity-check the total against the printed bill before you commit.
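That last check is easy to automate on the app side too. A minimal sketch - the field names are illustrative, not EvenRound's actual schema - that flags a scan whose parts don't add up to the printed total:

```typescript
type ScannedReceipt = {
  merchant: string;
  currency: string;
  lineItems: { description: string; price: number }[];
  tax: number;
  tip: number;
  total: number;
};

// Flag scans where items + tax + tip drift from the stated total.
// A small tolerance absorbs rounding; anything bigger usually means
// the model misread a digit and a human should look at it.
function totalLooksRight(r: ScannedReceipt, tolerance = 0.01): boolean {
  const computed =
    r.lineItems.reduce((sum, item) => sum + item.price, 0) + r.tax + r.tip;
  return Math.abs(computed - r.total) <= tolerance;
}
```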
Before 2023, receipt scanning meant OCR plus rules. The OCR engine (Tesseract or one of its commercial cousins) would convert the image to text. Then a rule-based parser would try to identify the total ("look for 'TOTAL', take the number to its right"), the line items ("each row with a quantity and a price"), and the merchant ("first non-numeric line at the top").
It worked for receipts that looked like the receipts the rules were written for. It failed catastrophically the moment a receipt deviated from that template: crumpled, sideways, laid out differently, or labelled in a language the rules didn't expect.
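To see why the approach was so brittle, here's a caricature of those rules in code - a hypothetical sketch of the general technique, not any particular app's parser:

```typescript
// Naive rules over OCR'd text lines - the pre-2023 approach, roughly.
function parseReceipt(ocrLines: string[]) {
  // "Look for 'TOTAL', take the number to its right."
  // (Happily matches 'SUBTOTAL' too - one of many ways this breaks.)
  const totalLine = ocrLines.find((l) => /total/i.test(l));
  const total = totalLine?.match(/(\d+[.,]\d{2})/)?.[1];

  // "Each row with a quantity and a price is a line item."
  const items = ocrLines
    .filter((l) => /^\s*\d+\s+.+\s+\d+[.,]\d{2}\s*$/.test(l))
    .map((l) => l.trim());

  // "The merchant is the first non-numeric line at the top."
  const merchant = ocrLines.find((l) => l.trim() && !/\d/.test(l));

  return { merchant, items, total };
}
```

Every quoted rule is a guess about layout, and every guess fails on some real receipt.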
Splitwise, Tricount, and most receipt-scanning expense apps used OCR-plus-rules until 2024. Hence the 60% accuracy.
Two things, simultaneously.
First: multimodal models. GPT-4V, Claude, Gemini - all of them learned to "see" images and text in the same representation space. Instead of converting the image to text and then reasoning over the text, they reason over the image directly. That means the model can use spatial information: "this number is directly below this label and aligned to the right, so they belong to the same row".
Second: structured output. The same models can emit JSON that matches a schema you define. So instead of asking "what does this receipt say?" and parsing prose, the app asks "fill in this schema" and gets back a structured object: merchant, total, line items, currency, date.
These two together collapse the OCR-plus-rules pipeline into one inference pass. The accuracy on real-world receipts goes from ~60% to ~95% on clean images, ~85% on bad ones.
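In code, that collapse is visible: the whole pipeline becomes one API call. Here's a minimal sketch against Anthropic's Messages API - the model name, tool name, and schema fields are illustrative assumptions, not EvenRound's production code:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function scanReceipt(imagePath: string) {
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest", // assumed model; use whatever you like
    max_tokens: 1024,
    // Forcing a tool call is the standard way to get schema-shaped JSON back.
    tools: [
      {
        name: "record_receipt",
        description: "Record the structured contents of a receipt image.",
        input_schema: {
          type: "object",
          properties: {
            merchant: { type: "string" },
            date: { type: "string" },
            currency: { type: "string" },
            line_items: {
              type: "array",
              items: {
                type: "object",
                properties: {
                  description: { type: "string" },
                  price: { type: "number" },
                },
                required: ["description", "price"],
              },
            },
            tax: { type: "number" },
            tip: { type: "number" },
            total: { type: "number" },
          },
          required: ["merchant", "total", "line_items"],
        },
      },
    ],
    tool_choice: { type: "tool", name: "record_receipt" },
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/jpeg",
              data: readFileSync(imagePath).toString("base64"),
            },
          },
          { type: "text", text: "Read this receipt and record its contents." },
        ],
      },
    ],
  });

  // The structured object comes back as the tool call's input.
  const toolUse = response.content.find((block) => block.type === "tool_use");
  return toolUse?.type === "tool_use" ? toolUse.input : null;
}
```

No OCR step, no rules: the image goes in, a schema-shaped object comes out.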
Step by step:

1. You snap the photo from the new-expense form.
2. The image travels from your phone to our EU server, then to Anthropic's API.
3. The model reads the image in a single pass and emits JSON matching our schema: merchant, date, currency, line items, tax, tip, total.
4. The response comes back and the expense form fills itself in.
5. You tap items against names to record who had what.
Total elapsed time: about 5 seconds for the model call, 30 seconds for the human tapping items against names. End to end, an itemised £140 dinner for six people is logged in well under a minute.
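The tapping step is the only part that needs a human; the arithmetic behind it is simple. Here's a sketch - types and helper are hypothetical, not EvenRound's internals - of turning those taps into per-person totals, splitting shared items evenly and spreading tax and tip proportionally:

```typescript
type LineItem = { id: string; description: string; price: number };
type Assignment = { itemId: string; memberIds: string[] };

// Each item's price splits evenly among the people who tapped it;
// tax and tip are then spread in proportion to each person's share.
function perPersonTotals(
  items: LineItem[],
  assignments: Assignment[],
  tax: number,
  tip: number,
): Map<string, number> {
  const itemsById = new Map(items.map((i) => [i.id, i] as const));
  const itemsSum = items.reduce((s, i) => s + i.price, 0);
  const totals = new Map<string, number>();

  for (const { itemId, memberIds } of assignments) {
    const item = itemsById.get(itemId);
    if (!item || memberIds.length === 0) continue;
    const share = item.price / memberIds.length;
    for (const id of memberIds) {
      totals.set(id, (totals.get(id) ?? 0) + share);
    }
  }

  // Distribute tax and tip proportionally, then round to pennies.
  const scale = itemsSum > 0 ? (itemsSum + tax + tip) / itemsSum : 1;
  for (const [id, amount] of totals) {
    totals.set(id, Math.round(amount * scale * 100) / 100);
  }
  return totals;
}
```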
We'd rather over-disclose than oversell. The model still struggles on the worst inputs - the crumpled, the sideways, the slightly wet, and the photographed-in-a-moving-car-after-three-drinks genre from earlier. Those land in the ~85% category, not the ~95% one.
For the worst-case receipts, you can always just type the total manually. The scanner is a speed-up, not a replacement.
We have more on the failure modes in when AI gets the receipt wrong: what to check.
Two questions people ask, both fair.
"Does the model train on my receipts?"No. We use Anthropic's API, which doesn't train on customer data. The same applies to OpenAI's API and Google's Gemini API tiers we'd use. The free consumer-facing chatbots have different policies; the API products do not train on customer inputs by default.
"Where do receipt photos go?"EU-region Supabase storage. They're associated with the group they belong to, governed by the same row-level security as the expense itself. When you delete a group, the receipts go too. We covered this in detail in where do receipt photos actually go? (an honest answer).
Each receipt scan costs us about £0.005-£0.01 in model tokens. It's the reason AI features tend to be paywalled in expense apps - someone has to pay for the inference.
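The back-of-envelope arithmetic, with assumed token counts and rates (check current provider pricing; these are illustrative, not measured):

```typescript
// Rough cost per scan. Every number here is an assumption for illustration.
const imageTokens = 1_500;     // a phone photo, downscaled for the model
const outputTokens = 300;      // the structured JSON response
const inputPricePerMTok = 3;   // USD per million input tokens (assumed)
const outputPricePerMTok = 15; // USD per million output tokens (assumed)

const usd =
  (imageTokens / 1e6) * inputPricePerMTok +
  (outputTokens / 1e6) * outputPricePerMTok;
// ≈ $0.009, roughly £0.007 - inside the £0.005-£0.01 range above.
console.log(usd.toFixed(4));
```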
Free tier on EvenRound gets 5 receipt scans per group per month, which covers most casual users. Plus is unlimited (within a fair-use ceiling) for £3.99/month if you want to scan everything.
We'd move the model call to the edge. Right now the receipt photo travels from your phone, to our EU server, to Anthropic's API, back to our server, back to you. That's a lot of transit. With a streaming response and the right edge runtime, the round-trip drops by maybe 30% - which on a 4-second call is over a second of perceived speed-up. It's on the roadmap.
Next time you split a restaurant bill, create a group, add the members, and tap "Scan receipt" on the new-expense form. Snap the bill on the table. The form fills in. Tap items against names. You're done before the card terminal has finished printing.
Free forever. No signup. Works in your browser in 30 seconds.