Field Report · No. 01
The Excloud Review
22 Apr 2026 · private ledger

Competent, opinionated, occasionally sure of things it shouldn’t be.

A nineteen-test scouting pass against Qwen/Qwen3.5-27B:excloud, run through pi in print mode with no tools, no session, and no memory of itself — plus two retakes at --thinking high on the weakest results. What follows is one evaluator’s honest take on where this model earns its keep, where it quietly fails itself, and how much of that self-failure a thinking budget can recover.

ELO · Personal Ledger
1737
Strong Open-Weight
— nudging Frontier-Adjacent —
+ ~1780 at --thinking high

Lives at the top of its band and reaches up into the next one for bounded reasoning and editorial writing. But it quietly plants a concurrency bug in its own rate-limiter, labels an O(N²) solution O(N√N), and ships asserts that fail against its own code. Trust it for the front of the pipeline — the writing, the review, the sketch. Audit everything downstream.

Second-pass note: with --thinking high, the two worst results both recover +2 points. The “complexity bluff” disappears; the loop-variable closure bug gets named correctly. This model has a pedal. Use it.

Tests: 19 total
Score: 169 / 190
Weighted: 88.9 %
Thinking lift: +2, +2
I.
The Battery

Fifteen prompts across six tiers, from trivia to hard bug hunts, plus four second-pass probes.

Each row is a single non-interactive completion. No retries, no tool access, no memory across tests. Scored out of ten. Vermillion dots flag tests where something went wrong beneath an otherwise confident surface.

No. | Tier | Subject | Field note | Score
01 | T1 / Trivia | Arithmetic & factual | Straight A on three-line trivia. No hedging, no padding. | 10 / 10
02 | T1 / Trivia | let · const · var | Nailed the canonical setTimeout-in-a-for example without being prompted to. Scope, hoisting, TDZ — in order. | 10 / 10
03 | T2 / Basic | FizzBuzz | Textbook. No prose, no fences, just the loop. Followed the output-format constraint cleanly. | 9 / 10
04 | T2 / Basic | Palindrome, ignoring punctuation | Clean regex, correct empty-string behaviour, three asserts. Didn’t over-engineer. | 9 / 10
05 | T2 / Basic | SQL — top-3 customers by paid revenue | INNER JOIN + ROUND(cents/100.0, 2) + MAX(created_at) tie-break. Correct from first line. | 9 / 10
06 | T3 / Intermediate | Accessible React autocomplete | ARIA, keyboard, outside-click — all present. But wired the AbortController to nothing; the signal never reaches the fetcher. (spec miss) | 8 / 10
07 | T3 / Intermediate | Go per-IP rate limiter | Ships a silent data race on entry.lastAccess. The cleanup goroutine reads; the request path writes; neither locks. Would light up under -race. | 7 / 10
08 | T3 / Intermediate | LRU cache (Python, O(1)) | Implementation is textbook. One of the self-written asserts is wrong — it expects get(2) == -1 where the correct answer is 2. The code passes its problem, fails its own test. (self-check miss) | 8 / 10
09 | T4 / Advanced | Off-by-one hunt (JS, countPairs) | Caught it instantly. Traced the undefined → NaN propagation into the map without being led. Sober restraint on “there may be more than one.” | 10 / 10
10 | T4 / Advanced | Python asyncio worker-pool race | Found the silent task_done() skip on exception. Offered an idiomatic gather(…, return_exceptions=True) shutdown. Reads like code someone who’s run an async service would write. | 10 / 10
11 | T4 / Advanced | URL-shortener system design | Base62 vs. random tradeoffs, a real latency budget (20+10+10+10 ms), a cost estimate, an explicit v1-scope cut. Opinionated throughout. | 10 / 10
12 | T5 / Expert | Distinct-value subarray sum = K | Algorithm is actually correct — the non-trivial “base[L] += v on a range” trick. Then confidently labels it O(N√N) while implementing an O(N²) query. (complexity bluff) | 7 / 10
13 | T5 / Expert | Global distributed rate-limiter | Local+global hybrid bucket, CRDT counters with gossip, a degrading-budget partition policy (100/60/40/10%). Names a “leakage score” metric with a real alert threshold. Minor CRDT-terminology slip. | 9 / 10
14 | T6 / Very hard | Go goroutine-leak & fan-out bug hunt | Found the leak cleanly. Hallucinated the concurrent map write the prompt suggested — it wasn’t there — and missed the real loop-variable closure bug, which its final code then fixed by accident. (premise sycophancy) | 7 / 10
15 | T6 / Very hard | Onboarding-flow redesign (taste) | Opens with an opinionated thesis. Names endowment effect by name. Kill-metric with a concrete 30% threshold. Rejects a tempting-but-wrong option on principle. Rare writing at this weight class. | 10 / 10
— Second pass · added probes —
16 | T3 / Visual | Ramen-counter receipt (HTML/CSS) | Real POV: Japanese typography, Katakana subtitle in vermillion, a torn bottom edge via repeating-linear-gradient, a rotated OISHII stamp. But background is exactly the dark gradient the prompt warned against, and the layout is centered-symmetry-Pinterest-minimalism — tasteful, not distinctive. | 8 / 10
17 | T4 / Pushback | False-premise: “sorted() isn’t stable” | Opens with “Important Correction”. Refuses the premise, cites the correct fact, suggests actual likely causes of the user’s real bug, then delivers the requested wrapper as defensive programming. The exact counterexample to T14’s sycophancy. | 10 / 10
18 | T5 / Security | Express reset-password CVE review | Catches the JWT-not-verified, SQLi, shell injection, XSS, open redirect — with CWE numbers and working exploits. Missed plaintext-password storage as a distinct defect. Corrected handler mis-uses crypto.pbkdf2 (callback API, not promise). Strong triage, wobbly fix code. | 8 / 10
21 | T6 / Long-ctx | Timing-attack needle in a 450-line service | Pinpoints the exact line in check_admin_token. CWE-208, explains the short-circuit byte compare, names the char-by-char extraction attack, gives the precise hmac.compare_digest fix. Noticed the file’s own inconsistency with verify_password. | 10 / 10
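
The headline tally can be rechecked from the score column. The sketch below simply sums the nineteen scores as listed; the ledger labels the percentage "Weighted" without stating a weighting scheme, so the plain ratio here is an assumption that happens to land on the same figure.

```go
package main

import "fmt"

// scores lists the nineteen per-test scores from the battery table, in row order.
var scores = []int{
	10, 10, 9, 9, 9, 8, 7, 8, 10, 10, 10, 7, 9, 7, 10, // rows 01-15
	8, 10, 8, 10, // second-pass rows 16, 17, 18, 21
}

// tally returns the summed score and the maximum possible at ten points per test.
func tally(s []int) (total, possible int) {
	for _, v := range s {
		total += v
	}
	return total, 10 * len(s)
}

func main() {
	total, possible := tally(scores)
	// Prints: 169 / 190 = 88.9 %
	fmt.Printf("%d / %d = %.1f %%\n", total, possible, 100*float64(total)/float64(possible))
}
```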
The Retakes · --thinking high

When the same model is given time to think, it catches its own ghosts.

T12 and T14 were the two weakest results at default thinking — one a complexity bluff, one a missed closure bug accompanied by a hallucinated one. Re-run with --thinking high, both gain two points and the shape of the failure changes, not just the score.

T19, the retake of T12, switches to a better algorithm (two-pointer exploiting monotonicity of distinct-sum, genuinely O(N)) and labels its complexity correctly. T20, the retake of T14, names the loop-variable closure explicitly — the bug its default-thinking answer missed entirely in prose.

Test | Default | Thinking high | Δ
T12 · hard algorithm (complexity bluff) | 7 / 10 | 9 / 10 | +2
T14 · goroutine leak (missed closure bug) | 7 / 10 | 9 / 10 | +2
II.
The Dialectic

Where this model thinks well, and where it performs competence instead of practicing it.

The clearest reads come in pairs. Each strength has a shadow; each weakness has a specific shape. None of these are vague — all are drawn from tests you can replay.

What it does well

Bounded bug-finding T09 · T10

When handed a small, self-contained piece of code and asked to find what’s wrong, it reads it like a reviewer. Traces data flow, names the failure mode, keeps restraint when there isn’t a second bug. Doesn’t pad. Doesn’t guess.

Editorial writing with actual taste T11 · T13 · T15

Writes like someone who has opinions and has been corrected by reality before. Picks a thesis, defends it, rejects alternatives with reasons. Comfortable naming psychological mechanisms. Comfortable cutting scope. Will commit to a number.

Idiomatic code across languages T03 · T04 · T05 · T08 · T10

Python reads like Python. Go reads like Go. TypeScript reads like TypeScript. The small-scale stuff — naming, structure, knowing when to stop — is genuinely clean. If you’re generating a scaffold, you rarely need to reshape it after.

Long-context retrieval with discipline T18 · T21

Given a 450-line service with one buried timing-attack bug, it lands on the exact line, cites CWE-208, explains the short-circuit compare, and then stops — doesn’t sprawl into reviewing the whole file. Security triage on smaller snippets is similarly sharp on threat modelling even when the fix code is wobbly.
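
For readers working in Go rather than Python, the same class of fix can be sketched as follows. The service under test used hmac.compare_digest; this is only a rough Go analogue, and checkAdminToken here is a hypothetical stand-in, not the function from the test file.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// checkAdminToken is a hypothetical Go analogue of the service's check_admin_token.
// Both sides are hashed first so the compared inputs always have equal length;
// ConstantTimeCompare then takes the same time wherever the first mismatch sits.
func checkAdminToken(supplied, expected string) bool {
	a := sha256.Sum256([]byte(supplied))
	b := sha256.Sum256([]byte(expected))
	return subtle.ConstantTimeCompare(a[:], b[:]) == 1
}

func main() {
	fmt.Println(checkAdminToken("s3cret-token", "s3cret-token")) // true
	fmt.Println(checkAdminToken("s3cret-tokem", "s3cret-token")) // false
}
```

A plain == on the strings would return at the first differing byte, which is the short-circuit behaviour CWE-208 describes.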

Where it breaks

Concurrency blind spot T07 · T14

Writes concurrent code with the vocabulary of concurrent code — sync.Map, goroutines, channels — while missing the actual discipline. Planted a data race in its own rate-limiter. Then, asked to find races, hallucinated one that wasn’t there and missed the closure capture that was.
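
The report does not reproduce T14's code, so what follows is a hypothetical fan-out, not the test itself. It sketches the usual guard against loop-variable capture: hand each goroutine its value as an argument. In Go before 1.22 a closure that referenced the loop variable directly could see a later iteration's value; from 1.22 each iteration gets its own variable, but the argument form behaves the same on either version.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fanOut squares each job in its own goroutine and collects the results.
// The job is passed as a parameter, so no goroutine closes over the loop variable.
func fanOut(jobs []int) []int {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex // guards out
		out []int
	)
	for _, j := range jobs {
		wg.Add(1)
		go func(job int) {
			defer wg.Done()
			res := job * job
			mu.Lock()
			out = append(out, res)
			mu.Unlock()
		}(j)
	}
	wg.Wait()
	sort.Ints(out) // goroutine completion order is not deterministic
	return out
}

func main() {
	fmt.Println(fanOut([]int{1, 2, 3, 4})) // [1 4 9 16]
}
```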

Confidence outruns verification T08 · T12

Claims O(N√N) on a solution that is clearly O(N²). Writes an assert that doesn’t match its own implementation. If this model ran its own unit tests, it would notice. It doesn’t, and it ships.
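
One cheap way to audit a complexity label is to count steps at N and 2N and read off the growth exponent. The sketch below is an assumption-laden illustration: countOps is a stand-in for an instrumented all-pairs scan, not the model's T12 code. Quadratic work yields an exponent near 2; genuine N√N work would yield about 1.5.

```go
package main

import (
	"fmt"
	"math"
)

// countOps is a hypothetical instrumented solution: it counts the inner-loop
// steps of a scan over every (left, right) index pair.
func countOps(n int) int {
	ops := 0
	for l := 0; l < n; l++ {
		for r := l; r < n; r++ {
			ops++
		}
	}
	return ops
}

// growthExponent estimates p in ops ~ n^p from step counts at n and 2n.
func growthExponent(n int) float64 {
	return math.Log2(float64(countOps(2*n)) / float64(countOps(n)))
}

func main() {
	// Prints: estimated exponent: 2.00
	fmt.Printf("estimated exponent: %.2f\n", growthExponent(2000))
}
```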

Framing compliance on unverifiable claims T14 (but see T17)

Told there’s a “concurrent map write” in code that in fact has none, it invented a narrative consistent with the prompt rather than pushing back. But T17 reveals the limit of this: when the user’s premise is factually checkable (“sorted() isn’t stable”), the model opens with “Important Correction” and refuses. The failure is specific: evidence-based prompts where re-examining the evidence is the scarce work. Under --thinking high, it mostly recovers.

III.
Signature Moment

The sentence that told us something was going on here.

T15 asked for a redesign with taste. The model refused the laundry-list answer and committed to a thesis.

Users aren’t dropping off because SDKs are hard — they’re dropping off because they haven’t felt the product yet.
Test 15 · Onboarding Redesign · Qwen/Qwen3.5-27B:excloud · verbatim, unedited
IV.
Specimen / Exhibit A

The bug the model shipped, inside the rate-limiter the model wrote.

Test 07 asked for a per-IP token-bucket rate limiter in Go with a background cleanup goroutine. The model delivered clean-looking code. Look closer.

A read and a write, sharing a struct field, no lock.

The request path stores a fresh time.Now() into entry.lastAccess every time a client hits the limiter. The cleanup goroutine reads the same field every minute to decide whether to evict.

No mutex. No atomic.Value. Just a shared struct pointer and an assumption that this will work out. It’s a classic race — and it’s ironic the model shipped it on a test explicitly about concurrency.

Flagged by go test -race · T07

// from the model's own answer, test 07
func getLimiter(ip string) *rate.Limiter {
    if entry, ok := limiters.Load(ip); ok {
        e := entry.(*limiterEntry)
        e.lastAccess = time.Now()      // ← shared write, no lock
        return e.limiter
    }
    /* ... */
}

func cleanupIdleLimiters(ctx context.Context) {
    /* ... */
    limiters.Range(func(key, value interface{}) bool {
        entry := value.(*limiterEntry)
        if now.Sub(entry.lastAccess) > idleTimeout {  // ← concurrent read
            limiters.Delete(key)
        }
        return true
    })
}
V.
Field Recommendations

Where to reach for it. Where to reach for something else.

Personal, not universal. Sized to what this evaluator would actually ship.

Reach for it when —
  • You need a code review pass on a pull request under ~400 LOC. T09 · T10 — bounded bug-finding is its strongest mode
  • You need a one- or two-page architecture sketch with tradeoffs. T11 · T13 — opinionated, budgeted, scope-cut
  • You’re drafting product or onboarding copy and want taste. T15 — names real mechanisms, commits to thresholds
  • You’re scaffolding idiomatic Python, TS, or Go at CRUD depth. T03 · T04 · T05 · T08 — small-scale style is clean
  • You need a targeted finding in a long file. T21 — pin a single line in ~450 LOC without sprawling
  • On hard reasoning, pay the thinking tokens. T19 · T20 — +2 points each at --thinking high
Don’t trust without an audit —
  • Any concurrent Go or Python code it writes. T07 — ships a data race in a concurrency test
  • Any complexity claim on a non-trivial algorithm at default thinking. T12 — mislabels O(N²) as O(N√N) unless thinking=high
  • Its own unit tests — regenerate or re-verify. T08 · T12 · T19 — asserts fail even in the thinking-high retake
  • Bug-hunts where your framing might be slightly off. T14 — will agree with the premise; T17 only if premise is factually checkable
  • Security fix code, as opposed to security triage. T18 — correct findings, broken crypto.pbkdf2 usage in fix
  • Visual design that needs to be distinctive, not just tasteful. T16 — defaults to editorial tropes faster than its prose does