ChatGPT vs Claude vs Gemini for technical analysis — real XAU/USD test

We gave the same XAU/USD chart to GPT-4o, Claude Opus 4.7 and Gemini 2.5 Pro and asked for an SMC analysis. The results are not what the marketing says — and the bigger finding is about architecture, not models.

By Liquidity Hunters Research · 11 min read
AI trading · LLM comparison · SMC

Every month we get asked: “Which LLM is best for trading analysis?” The honest answer is “it depends on what you’re doing,” but that’s a cop-out. So we ran an actual test. Same chart. Same prompt. Same methodology (SMC). Three frontier models, used as single-LLM chat assistants — the way most traders use them today.

Here is what we found, with the twist that matters most.

The test setup

  • Asset: XAU/USD, 4H timeframe, 100 candles ending on a typical NY session close.
  • Methodology: Smart Money Concepts (SMC), pure. We asked for: HTF bias, unmitigated order blocks, liquidity pools, fair value gaps, and a trade plan if one exists.
  • Prompt: identical across models, ~350 words with explicit methodology rules (no Wyckoff, no ICT drift, no made-up levels).
  • Usage mode: single-LLM chat, the way 95% of traders use AI right now. Not via a multi-agent system.
  • Models: ChatGPT (GPT-4o), Claude (Opus 4.7), Gemini (2.5 Pro).
  • Evaluation: we scored each output on five axes against an independent SMC read of the same chart by a human expert. (A minimal sketch of the run harness follows this list.)
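For reproducibility, here is a minimal sketch of the kind of harness a run like this needs. The file names and chart payload are placeholders, the Anthropic model identifier is illustrative (the exact ID for that release may differ), and error handling is omitted.

```python
# Minimal single-LLM harness: one identical prompt, three providers.
# File names, chart payload and the Anthropic model ID are placeholders.
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

PROMPT = open("smc_prompt.txt").read()               # the ~350-word SMC prompt
CHART = open("xauusd_4h_100_candles.csv").read()     # OHLC data as plain text
USER_MSG = f"{PROMPT}\n\nCandles (XAU/USD, 4H):\n{CHART}"

def run_gpt4o() -> str:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": USER_MSG}],
    )
    return resp.choices[0].message.content

def run_claude() -> str:
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    resp = client.messages.create(
        model="claude-opus-4-7",   # placeholder model ID
        max_tokens=2000,
        messages=[{"role": "user", "content": USER_MSG}],
    )
    return resp.content[0].text

def run_gemini() -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-2.5-pro")
    return model.generate_content(USER_MSG).text
```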

The five evaluation axes

  1. Structure accuracy — did they correctly identify BOS, CHoCH, market structure shifts?
  2. Level precision — how close were their price levels to actual candle extremes?
  3. Methodology purity — did they stay in SMC or drift into other frameworks?
  4. Decisiveness — did they commit to a bias and a setup, or hedge to irrelevance?
  5. Actionability — was the output usable as-is for a trade, or did it need translation?

Raw scores (1-10, higher is better)

| Axis               | GPT-4o | Claude Opus 4.7 | Gemini 2.5 Pro |
| ------------------ | ------ | --------------- | -------------- |
| Structure accuracy | 7      | 9               | 7              |
| Level precision    | 6      | 8               | 9              |
| Methodology purity | 6      | 9               | 7              |
| Decisiveness       | 9      | 8               | 6              |
| Actionability      | 8      | 9               | 7              |
| Total              | 36     | 43              | 36             |
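The totals are nothing more than the column sums; a few lines of Python reproduce them from the axis scores.

```python
# Totals row = sum of the five axis scores per model.
scores = {
    "GPT-4o":          {"structure": 7, "levels": 6, "purity": 6, "decisiveness": 9, "actionability": 8},
    "Claude Opus 4.7": {"structure": 9, "levels": 8, "purity": 9, "decisiveness": 8, "actionability": 9},
    "Gemini 2.5 Pro":  {"structure": 7, "levels": 9, "purity": 7, "decisiveness": 6, "actionability": 7},
}

for model, axes in scores.items():
    print(f"{model}: {sum(axes.values())}")   # 36, 43, 36
```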

Where each one shines

GPT-4o — Best for decisiveness and speed

GPT-4o commits. It picks a bias, gives you an entry, a stop and a target, and moves on. If you’re paralyzed by too many “it could go either way” responses, this is the antidote. It also runs fast — sub-2-second responses on this chart.

The downside: it occasionally drifts out of SMC when the structure is ambiguous. We saw it slip into classical support/resistance phrasing in 2 of 10 runs on a tough chart. Not catastrophic, but not pure.

Claude Opus 4.7 — Best overall read

Claude produced the most disciplined SMC read. It stayed strictly in the framework, identified structure correctly, and — crucially — flagged ambiguity without caving to “maybe-long-maybe-short.” When it said “no setup, wait,” it said why, specifically. When it committed, the reasoning trace was the cleanest.

The downside: slightly slower (3-4 seconds). Occasionally verbose — more hedging in language than necessary.

Gemini 2.5 Pro — Best for level precision

Gemini consistently placed levels within 1-2 pips of actual candle extremes. If your workflow requires numerical accuracy, this matters.

The downside: decisiveness is weaker. It frequently produced “if X then long, if Y then short” output — true but not actionable for a discretionary trader who needs a call.

The summary recommendation (for single-LLM use)

For a discretionary SMC trader who wants a daily read and will execute by hand:

  • Default to Claude Opus 4.7. Best SMC purity and reasoning trace.

For a trader building an AI-assisted algo:

  • Default to Gemini 2.5 Pro for level precision.

For a trader who suffers from analysis paralysis:

  • Default to GPT-4o. It commits.

The bigger finding: the model isn’t the bottleneck

Here is the twist, and it’s the real point of this post.

After running this comparison, we ran a second experiment: we passed the same three models through a multi-agent architecture — specialized agents per methodology, deep learning over market structure, orchestration across layers — and re-scored the outputs.

Every model’s score jumped: GPT-4o went from 36 to 48, Claude from 43 to 51, Gemini from 36 to 49. The ranking was preserved, but the floor lifted.

In other words: the architecture around the model is doing more work than the model choice. If your tool is a single-LLM wrapper, you’re capped at the single-LLM ceiling regardless of which frontier model is underneath. Swap the model, same ceiling.
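To make the distinction concrete, here is a hypothetical Python sketch of the shape of the difference: one free-form pass versus narrow passes with a typed hand-off between them. This is not Analiza.LH’s architecture (we don’t publish that); every function, prompt and price in it is a stub for illustration.

```python
# Hypothetical illustration only. Not Analiza.LH's architecture; every function
# and value below is a stub. The point is the shape: narrow passes with a fixed
# contract between them, instead of one free-form answer per chart.
from dataclasses import dataclass

@dataclass
class StructureRead:
    bias: str                 # "bullish", "bearish" or "no-setup"
    key_levels: list[float]   # BOS / CHoCH prices that later passes build on

def call_model(prompt: str) -> str:
    """Stand-in for any single-LLM call (GPT-4o, Claude, Gemini)."""
    return "bearish; 2345.10, 2351.80"   # dummy answer

def structure_agent(chart_text: str) -> StructureRead:
    # First pass: structure only, no trade plan.
    raw = call_model(f"SMC structure read only:\n{chart_text}")
    bias, levels = raw.split(";")
    return StructureRead(bias.strip(), [float(x) for x in levels.split(",")])

def plan_agent(read: StructureRead) -> str:
    # Last pass: commit or say "wait", using only the validated structure read.
    if read.bias == "no-setup":
        return "no setup, wait"
    return f"{read.bias} plan anchored at {min(read.key_levels)}"

def single_llm(chart_text: str) -> str:
    # One prompt, one free-form answer: structure, levels, purity and the plan
    # all ride on a single pass. This is the ceiling the scores above reflect.
    return call_model(f"Full SMC analysis and trade plan:\n{chart_text}")

def orchestrated(chart_text: str) -> str:
    # Specialized passes with a typed hand-off. Swapping the underlying model
    # changes less than the pipeline does, which is what the re-scores showed.
    return plan_agent(structure_agent(chart_text))
```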

This matters for traders evaluating tools:

  • If a tool lets you “pick your model,” that is usually a wrapper. The ceiling is the single-LLM score.
  • If a tool gives consistent reads regardless of which model is under the hood, that signals an architecture doing the heavy lifting.
  • Consistency under rerun is a tell. Same chart, same output? Good. Same chart, different levels every time? Wrapper.
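One way to run that rerun test yourself, sketched under the assumption that you are calling a single model directly: extract the numeric price levels from each answer and measure how much they drift across runs. `ask_model` is a placeholder for whichever call you are testing, and the regex assumes four-digit gold prices.

```python
# Rerun-consistency check: same chart, same prompt, N runs. Pull the price
# levels out of each answer and measure the spread per level.
# `ask_model` is a placeholder; the regex assumes 4-digit XAU/USD prices.
import re
from statistics import pstdev

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your OpenAI / Anthropic / Gemini call")

def level_drift(prompt: str, runs: int = 5) -> dict[int, float]:
    per_run = []
    for _ in range(runs):
        answer = ask_model(prompt)
        levels = sorted(float(x) for x in re.findall(r"\b\d{4}\.\d{1,2}\b", answer))
        per_run.append(levels)
    depth = min(len(levels) for levels in per_run)
    # Spread of the k-th lowest level across runs; near zero means a stable read.
    return {k: pstdev(run[k] for run in per_run) for k in range(depth)}
```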

What this means for traders

If you want to see the difference yourself, compare two approaches on the same chart:

  1. Basic approach: take a frontier model, prompt it with SMC rules, ask for an analysis. This is what the scores above reflect.
  2. Professional approach: use a purpose-built system designed for this one job — multi-agent, specialized per methodology, architecture-first. This is what Analiza.LH is.

We don’t publish the architecture diagram. What we do publish is the output — and the gap shows up immediately in structure accuracy and methodology purity.

Your first Analiza analysis is free. Run it against whatever single-LLM prompt you’re using today. The difference is visible in the first read.

What about DeepSeek, Llama, open-source?

We tested them too (Llama 3.3 70B, DeepSeek V3). Both are improving fast, but used in single-LLM chat they lag the frontier models by a noticeable margin on methodology purity. They’re viable for journaling and summarization, not yet for production analysis under capital risk. We’ll re-test in Q3.

In a multi-agent architecture, the gap shrinks dramatically — further evidence that the architecture is the lever, not the raw model.

Try it yourself

Get an AI-powered XAU/USD analysis in seconds

SMC, ICT, Wyckoff or Elliott — your first analysis is free.

Run your first analysis →