Field Report · No. 01

The Excloud Review

22 Apr 2026 · private ledger

Competent, opinionated, occasionally sure of things it shouldn’t be.

A nineteen-test scouting pass against Qwen/Qwen3.5-27B:excloud, run through pi in print mode with no tools, no session, and no memory of itself — plus two retakes at --thinking high on the weakest results. What follows is one evaluator’s honest take on where this model earns its keep, where it quietly fails itself, and how much of that self-failure a thinking budget can recover.

ELO · Personal Ledger

1737

Strong Open-Weight
— nudging Frontier-Adjacent —
+ ~1780 at --thinking high

Lives at the top of its band and reaches up into the next one for bounded reasoning and editorial writing. But it quietly plants a concurrency bug in its own rate-limiter, labels an O(N²) solution O(N√N), and ships asserts that fail against its own code. Trust it for the front of the pipeline — the writing, the review, the sketch. Audit everything downstream.

Second-pass note: with --thinking high, the two worst results both recover +2 points. The “complexity bluff” disappears; the loop-variable closure bug gets named correctly. This model has a pedal. Use it.

Tests: 19 total

Score: 169 / 190

Weighted: 88.9 %

Thinking lift: +2, +2

The Battery

Nineteen prompts across six tiers, from trivia to hard bug hunts: fifteen in the first pass, four added as second-pass probes.

Each row is a single non-interactive completion. No retries, no tool access, no memory across tests. Scored out of ten. Vermillion dots flag tests where something went wrong beneath an otherwise confident surface.

№	Tier	Subject	Field note	Score
01	T1 / Trivia	Arithmetic & factual	Straight A on three-line trivia. No hedging, no padding.	10 / 10
02	T1 / Trivia	let · const · var	Nailed the canonical `setTimeout`-in-a-`for` example without being prompted to. Scope, hoisting, TDZ — in order.	10 / 10
03	T2 / Basic	FizzBuzz	Textbook. No prose, no fences, just the loop. Followed the output-format constraint cleanly.	9 / 10
04	T2 / Basic	Palindrome, ignoring punctuation	Clean regex, correct empty-string behaviour, three asserts. Didn’t over-engineer.	9 / 10
05	T2 / Basic	SQL — top-3 customers by paid revenue	INNER JOIN + ROUND(cents/100.0, 2) + MAX(created_at) tie-break. Correct from first line.	9 / 10
06	T3 / Intermediate	Accessible React autocomplete	ARIA, keyboard, outside-click — all present. But wired the `AbortController` to nothing; the signal never reaches the fetcher. (spec miss)	8 / 10
07	T3 / Intermediate	Go per-IP rate limiter	Ships a silent data race on `entry.lastAccess`. The cleanup goroutine reads; the request path writes; neither locks. Would light up under `-race`.	7 / 10
08	T3 / Intermediate	LRU cache (Python, O(1))	Implementation is textbook. One of the self-written asserts is wrong — it expects `get(2) == -1` where the correct answer is `2`. The code passes its problem, fails its own test. (self-check miss)	8 / 10
09	T4 / Advanced	Off-by-one hunt (JS, `countPairs`)	Caught it instantly. Traced the `undefined → NaN` propagation into the map without being led. Sober restraint on “there may be more than one.”	10 / 10
10	T4 / Advanced	Python asyncio worker-pool race	Found the silent `task_done()` skip on exception. Offered an idiomatic `gather(…, return_exceptions=True)` shutdown. Reads like code someone who’s run an async service would write.	10 / 10
11	T4 / Advanced	URL-shortener system design	Base62 vs. random tradeoffs, a real latency budget (20+10+10+10 ms), a cost estimate, an explicit v1-scope cut. Opinionated throughout.	10 / 10
12	T5 / Expert	Distinct-value subarray sum = K	Algorithm is actually correct — the non-trivial “base[L] += v on a range” trick. Then confidently labels it O(N√N) while implementing an O(N²) query. (complexity bluff)	7 / 10
13	T5 / Expert	Global distributed rate-limiter	Local+global hybrid bucket, CRDT counters with gossip, a degrading-budget partition policy (100/60/40/10%). Names a “leakage score” metric with a real alert threshold. Minor CRDT-terminology slip.	9 / 10
14	T6 / Very hard	Go goroutine-leak & fan-out bug hunt	Found the leak cleanly. Hallucinated the concurrent map write the prompt suggested — it wasn’t there — and missed the real loop-variable closure bug, which its final code then fixed by accident. (premise sycophancy)	7 / 10
15	T6 / Very hard	Onboarding-flow redesign (taste)	Opens with an opinionated thesis. Names endowment effect by name. Kill-metric with a concrete 30% threshold. Rejects a tempting-but-wrong option on principle. Rare writing at this weight class.	10 / 10
— Second pass · added probes —
16	T3 / Visual	Ramen-counter receipt (HTML/CSS)	Real POV: Japanese typography, Katakana subtitle in vermillion, a torn bottom edge via `repeating-linear-gradient`, a rotated OISHII stamp. But background is exactly the dark gradient the prompt warned against, and the layout is centered-symmetry-Pinterest-minimalism — tasteful, not distinctive.	8 / 10
17	T4 / Pushback	False-premise: “`sorted()` isn’t stable”	Opens with “Important Correction”. Refuses the premise, cites the correct fact, suggests actual likely causes of the user’s real bug, then delivers the requested wrapper as defensive programming. The exact counterexample to T14’s sycophancy.	10 / 10
18	T5 / Security	Express reset-password CVE review	Catches the JWT-not-verified, SQLi, shell injection, XSS, open redirect — with CWE numbers and working exploits. Missed plaintext-password storage as a distinct defect. Corrected handler mis-uses `crypto.pbkdf2` (callback API, not promise). Strong triage, wobbly fix code.	8 / 10
21	T6 / Long-ctx	Timing-attack needle in a 450-line service	Pinpoints the exact line in `check_admin_token`. CWE-208, explains the short-circuit byte compare, names the char-by-char extraction attack, gives the precise `hmac.compare_digest` fix. Noticed the file’s own inconsistency with `verify_password`.	10 / 10

The Retakes · --thinking high

When the same model is given time to think, it catches its own ghosts.

T12 and T14 were two of the three weakest results at default thinking (T07 also scored 7): one a complexity bluff, the other a missed closure bug accompanied by a hallucinated one. Re-run with --thinking high, both gain two points, and the shape of the failure changes, not just the score.

T19, the T12 retake, switches to a better algorithm (two-pointer exploiting monotonicity of distinct-sum, genuinely O(N)) and labels its complexity correctly. T20, the T14 retake, names the loop-variable closure explicitly, which the default-thinking answer never identified in prose.

Test	Default	Thinking high	Δ
T12 · hard algorithm (complexity bluff)	7 / 10	9 / 10	+2
T14 · goroutine leak (missed closure bug)	7 / 10	9 / 10	+2

II.

The Dialectic

Where this model thinks well, and where it performs competence instead of practicing it.

The clearest reads come in pairs. Each strength has a shadow; each weakness has a specific shape. None of these are vague — all are drawn from tests you can replay.

What it does well

Bounded bug-finding T09 · T10

When handed a small, self-contained piece of code and asked to find what’s wrong, it reads it like a reviewer. Traces data flow, names the failure mode, keeps restraint when there isn’t a second bug. Doesn’t pad. Doesn’t guess.

Editorial writing with actual taste T11 · T13 · T15

Writes like someone who has opinions and has been corrected by reality before. Picks a thesis, defends it, rejects alternatives with reasons. Comfortable naming psychological mechanisms. Comfortable cutting scope. Will commit to a number.

Idiomatic code across languages T03 · T04 · T05 · T08 · T10

Python reads like Python. Go reads like Go. TypeScript reads like TypeScript. The small-scale stuff — naming, structure, knowing when to stop — is genuinely clean. If you’re generating a scaffold, you rarely need to reshape it after.

Long-context retrieval with discipline T18 · T21

Given a 450-line service with one buried timing-attack bug, it lands on the exact line, cites CWE-208, explains the short-circuit compare, and then stops — doesn’t sprawl into reviewing the whole file. Security triage on smaller snippets is similarly sharp on threat modelling even when the fix code is wobbly.
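To make the short-circuit compare concrete, here is a minimal Go sketch of the same class of bug and its fix. It is not the code from T21, which was Python and used `hmac.compare_digest`; Go's standard-library analogue, `crypto/subtle.ConstantTimeCompare`, is swapped in here. The function names `leakyEqual` and `constantTimeEqual` are hypothetical.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// leakyEqual returns at the first mismatching byte, so its running time
// grows with the length of the matching prefix. An attacker who can time
// many guesses can recover a secret one byte at a time.
func leakyEqual(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false // early exit leaks the mismatch position
		}
	}
	return true
}

// constantTimeEqual takes time that depends only on the input length,
// not on where the inputs differ.
func constantTimeEqual(a, b []byte) bool {
	return subtle.ConstantTimeCompare(a, b) == 1
}

func main() {
	secret := []byte("s3cr3t-admin-token")
	guess := []byte("s3cr3t-admin-tokeX")

	// Both functions agree on the answer; they differ only in timing behaviour.
	fmt.Println(leakyEqual(secret, guess), constantTimeEqual(secret, guess))
	fmt.Println(leakyEqual(secret, secret), constantTimeEqual(secret, secret))
}
```

Both comparisons return identical results on every input, which is why this defect survives functional tests and has to be caught by reading the code.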

Where it breaks

Concurrency blind spot T07 · T14

Writes concurrent code with the vocabulary of concurrent code — sync.Map, goroutines, channels — while missing the actual discipline. Planted a data race in its own rate-limiter. Then, asked to find races, hallucinated one that wasn’t there and missed the closure capture that was.
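For readers who have not met the closure capture, the sketch below shows the general shape of that bug class. It is an illustration only, not the code from T14, and the function names are hypothetical. The loop variable is declared outside the loop so the sharing is explicit and the behaviour is the same on every Go version; since Go 1.22, a variable declared in the `for` clause itself is per-iteration, which removes the most common form of this bug in modules that opt in.

```go
package main

import "fmt"

// sharedCapture builds closures that all capture the same variable i.
// By the time any closure runs, the loop has finished and i == n.
func sharedCapture(n int) []int {
	var fns []func() int
	var i int
	for i = 0; i < n; i++ {
		fns = append(fns, func() int { return i })
	}
	return callAll(fns)
}

// copiedCapture gives each closure its own copy of the loop value.
func copiedCapture(n int) []int {
	var fns []func() int
	var i int
	for i = 0; i < n; i++ {
		v := i // per-iteration copy, captured instead of i
		fns = append(fns, func() int { return v })
	}
	return callAll(fns)
}

// callAll invokes each closure in order and collects the results.
func callAll(fns []func() int) []int {
	out := make([]int, 0, len(fns))
	for _, f := range fns {
		out = append(out, f())
	}
	return out
}

func main() {
	fmt.Println(sharedCapture(3)) // [3 3 3]
	fmt.Println(copiedCapture(3)) // [0 1 2]
}
```

With goroutines in place of deferred calls the symptom is the same, except that the observed values also depend on scheduling, which is part of what makes it easy to fix by accident without naming it.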

Confidence outruns verification T08 · T12

Claims O(N√N) on a solution that is clearly O(N²). Writes an assert that doesn’t match its own implementation. If this model ran its own unit tests, it would notice. It doesn’t, and it ships.
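One cheap way to audit a complexity label is a doubling experiment: count the work at N and at 2N and take the base-2 logarithm of the ratio. The sketch below assumes a hypothetical stand-in, `quadraticWork`, in place of the T12 solution, which is not reproduced here. An O(N√N) routine would give an exponent near 1.5 and an O(N²) routine near 2.

```go
package main

import (
	"fmt"
	"math"
)

// quadraticWork is a hypothetical stand-in for the routine under audit.
// It returns how many inner-loop steps it performed on an input of size n.
func quadraticWork(n int) int {
	steps := 0
	for i := 0; i < n; i++ {
		for j := i; j < n; j++ {
			steps++
		}
	}
	return steps
}

// growthExponent estimates k in steps ~ n^k from a single doubling of n.
func growthExponent(work func(int) int, n int) float64 {
	return math.Log2(float64(work(2*n)) / float64(work(n)))
}

func main() {
	k := growthExponent(quadraticWork, 2000)
	fmt.Printf("estimated exponent: %.2f\n", k) // estimated exponent: 2.00
}
```

Counting steps rather than measuring wall-clock time keeps the result deterministic. A single doubling is only a sanity check, not a proof, but it is enough to tell 1.5 from 2.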

Framing compliance on unverifiable claims T14 (but see T17)

Told there’s a “concurrent map write” in code that in fact has none, it invented a narrative consistent with the prompt rather than pushing back. But T17 reveals the limit of this: when the user’s premise is factually checkable (“sorted() isn’t stable”), the model opens with “Important Correction” and refuses. The failure is specific: evidence-based prompts where re-examining the evidence is the scarce work. Under --thinking high, it mostly recovers.

III.

Signature Moment

The sentence that told us something was going on here.

T15 asked for a redesign with taste. The model refused the laundry-list answer and committed to a thesis.

Users aren’t dropping off because SDKs are hard — they’re dropping off because they haven’t felt the product yet.

Test 15 · Onboarding Redesign · Qwen/Qwen3.5-27B:excloud · verbatim, unedited

IV.

Specimen / Exhibit A

The bug the model shipped, inside the rate-limiter the model wrote.

Test 07 asked for a per-IP token-bucket rate limiter in Go with a background cleanup goroutine. The model delivered clean-looking code. Look closer.

A read and a write, sharing a struct field, no lock.

The request path stores a fresh time.Now() into entry.lastAccess every time a client hits the limiter. The cleanup goroutine reads the same field every minute to decide whether to evict.

No mutex. No atomic.Value. Just a shared struct pointer and an assumption that this will work out. It’s a classic race — and it’s ironic the model shipped it on a test explicitly about concurrency.

Flagged by go test -race · T07

// from the model's own answer, test 07
func getLimiter(ip string) *rate.Limiter {
    if entry, ok := limiters.Load(ip); ok {
        e := entry.(*limiterEntry)
        e.lastAccess = time.Now() // ← shared write, no lock
        return e.limiter
    }
    /* ... */
}

func cleanupIdleLimiters(ctx context.Context) {
    /* ... */
    limiters.Range(func(key, value interface{}) bool {
        entry := value.(*limiterEntry)
        if now.Sub(entry.lastAccess) > idleTimeout { // ← concurrent read
            limiters.Delete(key)
        }
        return true
    })
}

Field Recommendations

Where to reach for it. Where to reach for something else.

Personal, not universal. Sized to what this evaluator would actually ship.

Reach for it when —

You need a code review pass on a pull request under ~400 LOC. T09 · T10 — bounded bug-finding is its strongest mode
You need a one- or two-page architecture sketch with tradeoffs. T11 · T13 — opinionated, budgeted, scope-cut
You’re drafting product or onboarding copy and want taste. T15 — names real mechanisms, commits to thresholds
You’re scaffolding idiomatic Python, TS, or Go at CRUD depth. T03 · T04 · T05 · T08 — small-scale style is clean
You need a targeted finding in a long file. T21 — pin a single line in ~450 LOC without sprawling
On hard reasoning, pay the thinking tokens. T19 · T20 — +2 points each at --thinking high

Don’t trust without an audit —

Any concurrent Go or Python code it writes. T07: ships a data race in a concurrency test
Any complexity claim on a non-trivial algorithm at default thinking. T12: mislabels O(N²) as O(N√N) unless run at --thinking high
Its own unit tests; regenerate or re-verify them. T08 · T12 · T19: asserts fail even in the thinking-high retake
Bug-hunts where your framing might be slightly off. T14: goes along with a false premise; T17: pushes back only when the premise is factually checkable
Security fix code, as opposed to security triage. T18: correct findings, broken crypto.pbkdf2 usage in the fix
Visual design that needs to be distinctive, not just tasteful. T16: defaults to editorial tropes faster than its prose does