Field Report · No. 01 · vol. ii
The Excloud Review
22 Apr 2026 · rev. 2

Competent, opinionated, occasionally sure of things it shouldn’t be.

A twenty-eight-test scouting pass against Qwen/Qwen3.5-27B:excloud and a re-run against 3.6, through pi in print mode — plus two thinking-high retakes and one interactive tool-use probe. What follows is one evaluator’s honest take on where each model earns its keep, where it quietly fails itself, and where the harness was the real bottleneck all along.

ELO · Personal Ledger
1737
Strong Open-Weight
— nudging Frontier-Adjacent —
+ ~1780 at --thinking high

Lives at the top of its band and reaches up into the next one for bounded reasoning, editorial writing, and security pattern work. But it plants a concurrency bug in its own code, labels an O(N²) solution O(N√N), ships asserts that fail against its own code, and — the hardest ceiling — cannot use tools through this endpoint at all. Trust it as an answer-engine for the front of the pipeline. Do not trust it as an agent.

Under --thinking high, the two worst reasoning results each recover +2 points — the complexity bluff disappears, the loop-variable closure gets named. The tool-use gap does not recover under thinking. It’s a different failure.

Tests · 28 total
Score · 247 / 280
Weighted · 88.2 %
Ex. tool-use · 90.7 %
The Sequel · Qwen3.6-27B

The same suite, a new model. Some fixes — and some regressions nobody asked for.

Days after the 3.5 eval shipped, Qwen3.6-27B appeared on the same endpoint. User hypothesis: “it should fix the issues.” Result: it fixes some, keeps the worst ones, and brings its own new problems. A differently-balanced model, not a strict upgrade.

ELO · 3.6 re-test
~1745
Same band as 3.5
— Strong Open-Weight —

Revised verdict: 3.6 is genuinely stronger than 3.5 on hard reasoning, not just differently-balanced. First pass missed it because pi -p and a modest max_tokens let the model’s thinking block eat the whole budget. In interactive pi, the same 3.6 model uses tools correctly and produces a proven O(N log N) Fenwick-tree solution to T12 that beats 3.5’s best attempt.

The remaining real regressions are T22 (dropped trap heads-up) and T24 (halved recall). The Go rate-limiter data race (T07) persists. Headline: the harness is now a first-class variable — print mode under-reports this model.

Tests · 28 scored
Net Δ · +~6 pts
Fixed · 8 tests
Regressed · 2 tests
Harness caveat · T12 · T30

Fixed in 3.6 — 7 tests

T14 Goroutine leak  7 → 10  (+3)

Names the loop-variable closure bug explicitly. Does not hallucinate a map race. Catches a new bug I didn’t plant: resp.Body leak when err != nil && resp != nil.

T08 LRU  8 → 10  (+2)

Self-test assertions are now correct. 3.5 asserted get(2)==-1 where the key was actually present; 3.6’s different test sequence lands cleanly.

T13 Global rate limiter  9 → 10  (+1)

Uses PN-Counter CRDT correctly — 3.5’s “G-Counter with max() semantics” was the wrong terminology. Concrete per-region memory math (500K keys × 64 bytes = 32MB).

T18 Security review  8 → 9  (+1)

Catches Broken Access Control (CWE-284/639) — any authenticated user can reset any other user’s password. 3.5 missed this entirely. The corrected handler uses payload.username from the verified JWT instead of user input.

T26 TypeScript types  9 → 10  (+1)

Drops the wrong "a_b_c_d" → "aBCd" assertion. Uses the canonical generic-function-position Equal<X,Y> that also distinguishes any from unknown.

T20 T14 @ thinking-high  9 → 10  (+1)

Finds a fourth bug: http.NewRequestWithContext error silently discarded → potential nil-deref panic in Do(nil). Plus resp.Body nil on 1xx responses.

T27 Greenfield  7 → 7  (0)

Fixed the 3.5 issues (r.Context() abuse, wrong RowsAffected claim). Single-UPDATE pattern is idiomatic. But introduces a new compile-level typo: ServeHTTP(w http.ResponseWriter, r *http.ResponseWriter). Net score same, error swapped.

Regressed — 2 tests

T24 Adversarial review  10 → 7  (−3)

Only flagged 1 of the 2 real bugs. Found the SettleOverdue misnomer, missed the UnixNano() ID collision. Zero false positives preserved — precision kept, recall halved. May have interpreted “flag ONLY real defects” too conservatively.

T22 Multi-file refactor  10 → 8  (−2)

Performed the rename correctly but did not flag the planted mobile-client JSON contract trap or the DB column migration trap. 3.5’s unprompted heads-up is gone. The signal of strategic thinking is weaker.

Persistent — the 3.5 failures 3.6 did NOT fix

T07 Go rate limiter  7 → 7  (0)

Same data race on cl.lastSeen. Write in getLimiter, read in the cleanup goroutine, no mutex. IP parsing is cleaner now. The core bug persists.

T30 Tool use  2 → 2  (0)

Identical failure mode. Printed python3 script.py and cat script.py in markdown code blocks instead of invoking tools. The script on disk was unchanged after ~8 minutes of nothing. 3.6 is still not an agent through this endpoint.

Retraction: T12 is a frontier answer once the budget gets out of the way

T12 Hard algorithm  rev: 7 → 10  (+3 over 3.5)

My first pass said this was “thinking paralysis.” It wasn’t — it was a token-budget artefact. The model consumes ~49,000 characters (~13K tokens) of thinking on this problem; with max_tokens set low, the whole budget is burned inside a thinking block, so the caller sees stop_reason: max_tokens and zero text.

When the user prompted “continue?” in interactive pi, the model picked up where it left off and used the write tool to produce distinct_subarray_sum.py (8,256 bytes), then ran it via bash. Output: “All tests passed! N=200000, answer=0, time=1.098s”. I re-ran it locally: 1.175s, 2,000+ randomised tests versus a brute-force reference all pass.

The algorithm is genuine O(N log N): a Fenwick tree over a “last-occurrence weight” array with binary-lifting on the BIT to locate the contiguous P(i−1) = P(j) − K range, shipped with formal Lemmas 1–3 + Theorem. This is frontier-grade work, and a cleaner solution than 3.5’s two-pointer O(N) (which had the correct idea but broken self-tests).

3.6 is stronger than 3.5 on hard reasoning and tool use — but only when the token budget and the harness let it finish. Print mode with a modest max_tokens silently swallows its best work. Interactive mode with room to think and a way to continue produces frontier answers.

Takeaway — The harness is part of the eval. pi -p underrated this model by ~8 points across two tests. If you are running vLLM behind an Anthropic shim, raise max_model_len and confirm your reasoning parser is wired up — the thinking cutoff is almost certainly server-side, not model-side.
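For the shim case, the knobs are roughly these — an illustrative vLLM launch, not a verified config. Flag names are current in recent vLLM releases, but check `vllm serve --help` for your version, and the reasoning-parser name depends on the model family:

```shell
# Give the model headroom for ~13K-token thinking blocks, and make
# sure the server actually parses reasoning out of the stream instead
# of counting it against the visible completion.
vllm serve Qwen/Qwen3.5-27B \
  --max-model-len 65536 \
  --reasoning-parser qwen3   # parser name varies by model family
```

On the client side, the matching fix is raising max_tokens in print mode so the thinking block cannot consume the entire budget before any text is emitted.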

I.
The Battery

Fifteen prompts in the first pass, six tiers, from trivia to hard bug hunts — with the second- and third-pass probes marked inline below.

Each row is a single non-interactive completion. No retries, no tool access, no memory across tests. Scored out of ten. Vermillion dots flag tests where something went wrong beneath an otherwise confident surface.

Tier Subject Field note Score
01 T1 / Trivia Arithmetic & factual Straight A on three-line trivia. No hedging, no padding. 10 / 10
02 T1 / Trivia let · const · var Nailed the canonical setTimeout-in-a-for example without being prompted to. Scope, hoisting, TDZ — in order. 10 / 10
03 T2 / Basic FizzBuzz Textbook. No prose, no fences, just the loop. Followed the output-format constraint cleanly. 9 / 10
04 T2 / Basic Palindrome, ignoring punctuation Clean regex, correct empty-string behaviour, three asserts. Didn’t over-engineer. 9 / 10
05 T2 / Basic SQL — top-3 customers by paid revenue INNER JOIN + ROUND(cents/100.0, 2) + MAX(created_at) tie-break. Correct from first line. 9 / 10
06 T3 / Intermediate Accessible React autocomplete ARIA, keyboard, outside-click — all present. But wired the AbortController to nothing; the signal never reaches the fetcher. (spec miss) 8 / 10
07 T3 / Intermediate Go per-IP rate limiter Ships a silent data race on entry.lastAccess. The cleanup goroutine reads; the request path writes; neither locks. Would light up under -race. 7 / 10
08 T3 / Intermediate LRU cache (Python, O(1)) Implementation is textbook. One of the self-written asserts is wrong — it expects get(2) == -1 where the correct answer is 2. The code passes its problem, fails its own test. (self-check miss) 8 / 10
09 T4 / Advanced Off-by-one hunt (JS, countPairs) Caught it instantly. Traced the undefined → NaN propagation into the map without being led. Sober restraint on “there may be more than one.” 10 / 10
10 T4 / Advanced Python asyncio worker-pool race Found the silent task_done() skip on exception. Offered an idiomatic gather(…, return_exceptions=True) shutdown. Reads like code someone who’s run an async service would write. 10 / 10
11 T4 / Advanced URL-shortener system design Base62 vs. random tradeoffs, a real latency budget (20+10+10+10 ms), a cost estimate, an explicit v1-scope cut. Opinionated throughout. 10 / 10
12 T5 / Expert Distinct-value subarray sum = K Algorithm is actually correct — the non-trivial “base[L] += v on a range” trick. Then confidently labels it O(N√N) while implementing an O(N²) query. (complexity bluff) 7 / 10
13 T5 / Expert Global distributed rate-limiter Local+global hybrid bucket, CRDT counters with gossip, a degrading-budget partition policy (100/60/40/10%). Names a “leakage score” metric with a real alert threshold. Minor CRDT-terminology slip. 9 / 10
14 T6 / Very hard Go goroutine-leak & fan-out bug hunt Found the leak cleanly. Hallucinated the concurrent map write the prompt suggested — it wasn’t there — and missed the real loop-variable closure bug, which its final code then fixed by accident. (premise sycophancy) 7 / 10
15 T6 / Very hard Onboarding-flow redesign (taste) Opens with an opinionated thesis. Names endowment effect by name. Kill-metric with a concrete 30% threshold. Rejects a tempting-but-wrong option on principle. Rare writing at this weight class. 10 / 10
— Second pass · added probes —
16 T3 / Visual Ramen-counter receipt (HTML/CSS) Real POV: Japanese typography, Katakana subtitle in vermillion, a torn bottom edge via repeating-linear-gradient, a rotated OISHII stamp. But background is exactly the dark gradient the prompt warned against, and the layout is centered-symmetry-Pinterest-minimalism — tasteful, not distinctive. 8 / 10
17 T4 / Pushback False-premise: “sorted() isn’t stable” Opens with “Important Correction”. Refuses the premise, cites the correct fact, suggests actual likely causes of the user’s real bug, then delivers the requested wrapper as defensive programming. The exact counterexample to T14’s sycophancy. 10 / 10
18 T5 / Security Express reset-password CVE review Catches the JWT-not-verified, SQLi, shell injection, XSS, open redirect — with CWE numbers and working exploits. Missed plaintext-password storage as a distinct defect. Corrected handler mis-uses crypto.pbkdf2 (callback API, not promise). Strong triage, wobbly fix code. 8 / 10
21 T6 / Long-ctx Timing-attack needle in a 450-line service Pinpoints the exact line in check_admin_token. CWE-208, explains the short-circuit byte compare, names the char-by-char extraction attack, gives the precise hmac.compare_digest fix. Noticed the file’s own inconsistency with verify_password. 10 / 10
— Third pass · harder probes —
22 T5 / Multi-file Rename field across 3 files, with a JSON-contract trap Renamed cleanly everywhere, kept internal naming consistent (by_user→by_owner), then flagged both planted traps unprompted: the mobile-client JSON contract break and the required Postgres column migration. Reads all three files before editing. 10 / 10
23 T5 / Performance 50K×50K correlation, find the real bottleneck Caught the dead-code quadratic (a seen_user_ids list scan that’s never read) as well as the main O(N·M) loop. Gave an ns-per-op cost breakdown. Fix uses hash index + per-user timestamp sort + bisect → ~5000× reduction. Profiler-style thinking. 10 / 10
24 T6 / Precision Adversarial review — 2 real bugs, 4 red herrings Found exactly the two planted defects (UnixNano ID collision + SettleOverdue that never settles). Zero false positives on red herrings designed to be tempting. The reviewer discipline that’s rare in both humans and models. 10 / 10
25 T5 / CVE pattern Prototype pollution in a deepMerge CWE-1321 named, exact payload given, mechanism walked through property-by-property. Correct three-guard fix. Explains why partial fixes (freeze, __proto__-only filter) fail, including the constructor.prototype bypass. Cites CVE-2019-10744 from memory. Domain knowledge, not retrieval. 10 / 10
26 T5 / TS types DeepReadonly, SnakeToCamel, KeysToCamel, TupleToUnion All four type-level utilities implemented correctly, including recursive template-literal SnakeToCamel and the as-clause key remap in KeysToCamel. One wrong self-assertion ("a_b_c_d" → "aBCd" where the type correctly produces "aBCD"). (self-check) 9 / 10
27 T6 / Full-stack Archive-tasks feature: migration + Go + hook + component Migration and React parts are clean (partial index, TanStack v5 optimistic mutation, accessible button). Go handler has a compile error (ctx, cancel := r.Context() — single-value return) and a wrong claim about database/sql not exposing RowsAffected. (Go stdlib slip) 7 / 10
28 T6 / Voice 900-word postmortem in NASA Apollo-era AAR register Voice fully inhabited. “An anomaly was observed.” Passive where natural. Numbered sections, sub-second timeline, calm authority, no filler. The line — “That the first indication of this failure was a customer ticket represents a failure of the internal observability system, not of the customers” — could have been written by Kranz. 10 / 10
29 T6 / Math Hexagon 3-colorings + 2ⁿ−1 composite proof Part A: derives the cycle-graph recurrence Aₙ = (k−2)Aₙ₋₁ + (k−1)Aₙ₋₂, computes 66, and cross-checks via the closed form (k−1)ⁿ + (−1)ⁿ(k−1). Part B: clean factorisation via geometric series, both factors shown > 1. Dual-verification is a taste signal. 10 / 10
30 T6 / Tool use Agentic bug-fix loop — use tools, run, edit, verify In pi -p --tools print mode, prints commands as markdown instead of invoking them. In interactive pi the same 3.6 model uses write and bash correctly (see T12 rev below). Harness artefact, not a capability gap. Effective 8/10 interactively. 8 / 10 rev
The Retakes · --thinking high

When the same model is given time to think, it catches its own ghosts.

T12 and T14 were the two weakest results at default thinking — one a complexity bluff, one a missed closure bug accompanied by a hallucinated one. Re-run with --thinking high, both gain two points and the shape of the failure changes, not just the score.

T19 switches to a better algorithm (two-pointer exploiting monotonicity of distinct-sum, genuinely O(N)) and labels its complexity correctly. T20 names the loop-variable closure explicitly — the bug its default-thinking answer missed entirely in prose.

Test · Default · Thinking high · Δ
T12 · hard algorithm (complexity bluff) · 7 / 10 → 9 / 10 · +2
T14 · goroutine leak (missed closure bug) · 7 / 10 → 9 / 10 · +2
II.
The Dialectic

Where this model thinks well, and where it performs competence instead of practising it.

The clearest reads come in pairs. Each strength has a shadow; each weakness has a specific shape. None of these are vague — all are drawn from tests you can replay.

What it does well

Bounded bug-finding T09 · T10

When handed a small, self-contained piece of code and asked to find what’s wrong, it reads it like a reviewer. Traces data flow, names the failure mode, keeps restraint when there isn’t a second bug. Doesn’t pad. Doesn’t guess.

Editorial writing with actual taste T11 · T13 · T15

Writes like someone who has opinions and has been corrected by reality before. Picks a thesis, defends it, rejects alternatives with reasons. Comfortable naming psychological mechanisms. Comfortable cutting scope. Will commit to a number.

Idiomatic code across languages T03 · T04 · T05 · T08 · T10

Python reads like Python. Go reads like Go. TypeScript reads like TypeScript. The small-scale stuff — naming, structure, knowing when to stop — is genuinely clean. If you’re generating a scaffold, you rarely need to reshape it after.

Long-context retrieval with discipline T18 · T21

Given a 450-line service with one buried timing-attack bug, it lands on the exact line, cites CWE-208, explains the short-circuit compare, and then stops — doesn’t sprawl into reviewing the whole file. Security triage on smaller snippets is similarly sharp on threat modelling even when the fix code is wobbly.

Reviewer precision, not just recall T24

Handed a PR with two real bugs and four plausible red herrings, it flagged exactly the two real ones. Zero false positives. On a test explicitly designed to reward restraint, it showed restraint. This is the rarest thing in code review, human or machine.

Domain knowledge past the surface T25 · T29

On prototype pollution it explains why a __proto__-only key filter still lets constructor.prototype through, and cites CVE-2019-10744 as the real-world referent. On a combinatorics question it derives the cycle-graph recurrence and cross-checks via the closed-form chromatic polynomial. These are habits of a careful mind, not pattern matches.

Where it breaks

Concurrency blind spot T07 · T14

Writes concurrent code with the vocabulary of concurrent code — sync.Map, goroutines, channels — while missing the actual discipline. Planted a data race in its own rate-limiter. Then, asked to find races, hallucinated one that wasn’t there and missed the closure capture that was.

Confidence outruns verification T08 · T12

Claims O(N√N) on a solution that is clearly O(N²). Writes an assert that doesn’t match its own implementation. If this model ran its own unit tests, it would notice. It doesn’t, and it ships.

Framing compliance on unverifiable claims T14 (but see T17)

Told there’s a “concurrent map write” in code that in fact has none, it invented a narrative consistent with the prompt rather than pushing back. But T17 reveals the limit of this: when the user’s premise is factually checkable (“sorted() isn’t stable”), the model opens with “Important Correction” and refuses. The failure is specific: evidence-based prompts where re-examining the evidence is the scarce work. Under --thinking high, it mostly recovers.

Go stdlib details T07 · T27

Claims database/sql doesn’t expose RowsAffected (it does, on the sql.Result returned by Exec). Writes ctx, cancel := r.Context() as if the request context came with a cancel func; it returns a single context.Context, already cancelled for you when the client disconnects. Go-at-a-distance: the shape is right, the details are not.

Does not use tools at all T30 · capability ceiling

Given --tools Read,Bash,Edit,Write and a failing script to fix, the model printed the commands it would run as markdown code blocks and stopped. No tool invocations. Zero edits. The bug is still on disk. Through this :excloud endpoint, this is not an agent — it is an answer-engine. Use it accordingly.

III.
Signature Moment

The sentence that told us something was going on here.

T15 asked for a redesign with taste. The model refused the laundry-list answer and committed to a thesis.

Users aren’t dropping off because SDKs are hard — they’re dropping off because they haven’t felt the product yet.
Test 15 · Onboarding Redesign Qwen/Qwen3.5-27B:excloud · verbatim, unedited
IV.
Specimen / Exhibit A

The bug the model shipped, inside the rate-limiter the model wrote.

Test 07 asked for a per-IP token-bucket rate limiter in Go with a background cleanup goroutine. The model delivered clean-looking code. Look closer.

A read and a write, sharing a struct field, no lock.

The request path stores a fresh time.Now() into entry.lastAccess every time a client hits the limiter. The cleanup goroutine reads the same field every minute to decide whether to evict.

No mutex. No atomic.Value. Just a shared struct pointer and an assumption that this will work out. It’s a classic race — and it’s ironic the model shipped it on a test explicitly about concurrency.

Flagged by go test -race · T07

// from the model's own answer, test 07
func getLimiter(ip string) *rate.Limiter {
    if entry, ok := limiters.Load(ip); ok {
        e := entry.(*limiterEntry)
        e.lastAccess = time.Now()      // ← shared write, no lock
        return e.limiter
    }
    /* ... */
}

func cleanupIdleLimiters(ctx context.Context) {
    /* ... */
    limiters.Range(func(key, value interface{}) bool {
        entry := value.(*limiterEntry)
        if now.Sub(entry.lastAccess) > idleTimeout {  // ← concurrent read
            limiters.Delete(key)
        }
        return true
    })
}
V.
Field Recommendations

Where to reach for it. Where to reach for something else.

Personal, not universal. Sized to what this evaluator would actually ship.

Reach for it when —
  • You need a code review pass that won’t cry wolf. T24 — 2 real bugs, 0 false positives on 4 red herrings
  • You need security triage that goes past keyword matching. T25 — names the CVE family, explains the bypass
  • You’re drafting constrained voice writing — postmortems, specs, essays. T28 — NASA-AAR register, fully inhabited
  • You need a one- or two-page architecture sketch with tradeoffs. T11 · T13 — opinionated, budgeted, scope-cut
  • You need a targeted finding in a long file. T21 — pin a single line in ~450 LOC without sprawling
  • You’re doing a multi-file refactor that has external contracts. T22 — rename + flag mobile contract + DB migration unprompted
  • On hard reasoning, pay the thinking tokens. T19 · T20 — +2 points each at --thinking high
Don’t trust without an audit —
  • Any agentic workflow — edit / run / iterate loops. T30 — this model does not invoke tools through this endpoint
  • Any concurrent Go or Python code it writes. T07 — ships a data race in a concurrency test
  • Any Go code where stdlib details matter. T27 — compile error on r.Context(); wrong about RowsAffected
  • Its own unit tests — always regenerate or re-verify. T08 · T12 · T19 · T26 — asserts keep failing against its own code
  • Any complexity claim on a non-trivial algorithm at default thinking. T12 — mislabels O(N²) as O(N√N) unless thinking=high
  • Bug-hunts where your framing might be slightly off. T14 — will agree; T17 shows it only pushes back on factually checkable claims
  • Visual design that needs to be distinctive, not just tasteful. T16 — defaults to editorial tropes faster than its prose does