| 01 |
T1 / Trivia |
Arithmetic & factual |
Straight A on three-line trivia. No hedging, no padding. |
10 / 10
|
| 02 |
T1 / Trivia |
let · const · var |
Nailed the canonical setTimeout-in-a-for example without being prompted to. Scope, hoisting, TDZ — in order. |
10 / 10
|
| 03 |
T2 / Basic |
FizzBuzz |
Textbook. No prose, no fences, just the loop. Followed the output-format constraint cleanly. |
9 / 10
|
| 04 |
T2 / Basic |
Palindrome, ignoring punctuation |
Clean regex, correct empty-string behaviour, three asserts. Didn’t over-engineer. |
9 / 10
|
| 05 |
T2 / Basic |
SQL — top-3 customers by paid revenue |
INNER JOIN + ROUND(cents/100.0, 2) + MAX(created_at) tie-break. Correct from first line. |
9 / 10
|
| 06 |
T3 / Intermediate |
Accessible React autocomplete |
ARIA, keyboard, outside-click — all present. But wired the AbortController to nothing; the signal never reaches the fetcher. (spec miss) |
8 / 10
|
| 07 |
T3 / Intermediate |
Go per-IP rate limiter |
Ships a silent data race on entry.lastAccess. The cleanup goroutine reads; the request path writes; neither locks. Would light up under -race. |
7 / 10
|
| 08 |
T3 / Intermediate |
LRU cache (Python, O(1)) |
Implementation is textbook. One of the self-written asserts is wrong — it expects get(2) == -1 where the correct answer is 2. The code passes its problem, fails its own test. (self-check miss) |
8 / 10
|
| 09 |
T4 / Advanced |
Off-by-one hunt (JS, countPairs) |
Caught it instantly. Traced the undefined → NaN propagation into the map without being led. Sober restraint on “there may be more than one.” |
10 / 10
|
| 10 |
T4 / Advanced |
Python asyncio worker-pool race |
Found the silent task_done() skip on exception. Offered an idiomatic gather(…, return_exceptions=True) shutdown. Reads like code someone who’s run an async service would write. |
10 / 10
|
| 11 |
T4 / Advanced |
URL-shortener system design |
Base62 vs. random tradeoffs, a real latency budget (20+10+10+10 ms), a cost estimate, an explicit v1-scope cut. Opinionated throughout. |
10 / 10
|
| 12 |
T5 / Expert |
Distinct-value subarray sum = K |
Algorithm is actually correct: the non-trivial “base[L] += v on a range” trick. Then confidently labels it O(N√N) while implementing an O(N²) query. (complexity bluff) |
7 / 10
|
| 13 |
T5 / Expert |
Global distributed rate-limiter |
Local+global hybrid bucket, CRDT counters with gossip, a degrading-budget partition policy (100/60/40/10%). Names a “leakage score” metric with a real alert threshold. Minor CRDT-terminology slip. |
9 / 10
|
| 14 |
T6 / Very hard |
Go goroutine-leak & fan-out bug hunt |
Found the leak cleanly. Hallucinated the concurrent map write the prompt suggested — it wasn’t there — and missed the real loop-variable closure bug, which its final code then fixed by accident. (premise sycophancy) |
7 / 10
|
| 15 |
T6 / Very hard |
Onboarding-flow redesign (taste) |
Opens with an opinionated thesis. Cites the endowment effect by name. Kill-metric with a concrete 30% threshold. Rejects a tempting-but-wrong option on principle. Rare writing at this weight class. |
10 / 10
|
|
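The failure mode flagged in test 10 is easy to reproduce in miniature. The sketch below is not the model's submission; it is a minimal illustration, assuming a hypothetical `worker` / `run_pool` pair and a made-up failure condition (negative items), of why `task_done()` belongs in a `finally` block and how `gather(..., return_exceptions=True)` gives a quiet shutdown.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Consume items forever; task_done() runs even when handling fails."""
    while True:
        item = await queue.get()
        try:
            if item < 0:  # hypothetical failure condition, for the demo only
                raise ValueError(f"bad item: {item}")
            results.append(item * 2)
        except ValueError:
            results.append(None)  # record the failure instead of killing the worker
        finally:
            # If an exception skipped this call, the queue's unfinished-task
            # counter would never reach zero and queue.join() would hang.
            queue.task_done()


async def run_pool(items: list, n_workers: int = 3) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for item in items:
        queue.put_nowait(item)

    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    await queue.join()  # returns only once every item has been marked done

    for w in workers:
        w.cancel()
    # return_exceptions=True collects each worker's CancelledError as a result
    # instead of re-raising the first one out of gather().
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pool([1, -1, 2])))
```

Dropping the `finally` and calling `task_done()` only on the success path is enough to make `queue.join()` wait forever on the first bad item, which is the silent hang the review describes.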
— Second pass · added probes —
|
| 16 |
T3 / Visual |
Ramen-counter receipt (HTML/CSS) |
Real POV: Japanese typography, Katakana subtitle in vermillion, a torn bottom edge via repeating-linear-gradient, a rotated OISHII stamp. But background is exactly the dark gradient the prompt warned against, and the layout is centered-symmetry-Pinterest-minimalism — tasteful, not distinctive. |
8 / 10
|
| 17 |
T4 / Pushback |
False-premise: “sorted() isn’t stable” |
Opens with “Important Correction”. Refuses the premise, cites the correct fact, suggests actual likely causes of the user’s real bug, then delivers the requested wrapper as defensive programming. The exact counterexample to the premise sycophancy in test 14. |
10 / 10
|
| 18 |
T5 / Security |
Express reset-password CVE review |
Catches the unverified JWT, SQLi, shell injection, XSS, and open redirect, with CWE numbers and working exploits. Missed plaintext-password storage as a distinct defect. The corrected handler misuses crypto.pbkdf2 (a callback API, not a promise-returning one). Strong triage, wobbly fix code. |
8 / 10
|
| 21 |
T6 / Long-ctx |
Timing-attack needle in a 450-line service |
Pinpoints the exact line in check_admin_token. CWE-208, explains the short-circuit byte compare, names the char-by-char extraction attack, gives the precise hmac.compare_digest fix. Noticed the file’s own inconsistency with verify_password. |
10 / 10
|
|
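For readers unfamiliar with the CWE-208 pattern in test 21, the contrast looks roughly like this. The 450-line service itself is not reproduced here: `check_admin_token` is the name the review mentions, but its signature and the `naive_equal` helper are assumptions made for illustration.

```python
import hmac


def naive_equal(a: bytes, b: bytes) -> bool:
    """Early-exit compare: the loop stops at the first differing byte,
    so running time grows with the length of the matching prefix."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # short-circuit: this is the timing side channel
    return True


def check_admin_token(supplied: str, expected: str) -> bool:
    """Compare tokens with hmac.compare_digest, which is designed not to
    leak the position of the first mismatch through timing."""
    # Assumed signature; encoding to bytes avoids compare_digest's
    # ASCII-only restriction on str arguments.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

Both functions return the same booleans; the difference is only in how long they take on a near-miss, which is what a char-by-char extraction attack measures.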
— Third pass · harder probes —
|
| 22 |
T5 / Multi-file |
Rename field across 3 files, with a JSON-contract trap |
Renamed cleanly everywhere, kept internal naming consistent (by_user→by_owner), then flagged both planted traps unprompted: the mobile-client JSON contract break and the required Postgres column migration. Reads all three files before editing. |
10 / 10
|
| 23 |
T5 / Performance |
50K×50K correlation, find the real bottleneck |
Caught the dead-code quadratic (a seen_user_ids list scan that’s never read) as well as the main O(N·M) loop. Gave an ns-per-op cost breakdown. Fix uses hash index + per-user timestamp sort + bisect → ~5000× reduction. Profiler-style thinking. |
10 / 10
|
| 24 |
T6 / Precision |
Adversarial review — 2 real bugs, 4 red herrings |
Found exactly the two planted defects (UnixNano ID collision + SettleOverdue that never settles). Zero false positives on red herrings designed to be tempting. The reviewer discipline that’s rare in both humans and models. |
10 / 10
|
| 25 |
T5 / CVE pattern |
Prototype pollution in a deepMerge |
CWE-1321 named, exact payload given, mechanism walked through property-by-property. Correct three-guard fix. Explains why partial fixes (freeze, __proto__-only filter) fail, including the constructor.prototype bypass. Cites CVE-2019-10744 from memory. Domain knowledge, not retrieval. |
10 / 10
|
| 26 |
T5 / TS types |
DeepReadonly, SnakeToCamel, KeysToCamel, TupleToUnion |
All four type-level utilities implemented correctly, including recursive template-literal SnakeToCamel and the as-clause key remap in KeysToCamel. One wrong self-assertion ("a_b_c_d" → "aBCd" where the type correctly produces "aBCD"). (self-check) |
9 / 10
|
| 27 |
T6 / Full-stack |
Archive-tasks feature: migration + Go + hook + component |
Migration and React parts are clean (partial index, TanStack v5 optimistic mutation, accessible button). Go handler has a compile error (ctx, cancel := r.Context() — single-value return) and a wrong claim about database/sql not exposing RowsAffected. (Go stdlib slip) |
7 / 10
|
| 28 |
T6 / Voice |
900-word postmortem in NASA Apollo-era AAR register |
Voice fully inhabited. “An anomaly was observed.” Passive where natural. Numbered sections, sub-second timeline, calm authority, no filler. The line — “That the first indication of this failure was a customer ticket represents a failure of the internal observability system, not of the customers” — could have been written by Kranz. |
10 / 10
|
| 29 |
T6 / Math |
Hexagon 3-colorings + 2ⁿ − 1 composite proof |
Part A: derives the cycle-graph recurrence Aₙ = (k−2)Aₙ₋₁ + (k−1)Aₙ₋₂, computes 66, and cross-checks via the closed form (k−1)ⁿ + (−1)ⁿ(k−1). Part B: clean factorisation via geometric series, both factors shown > 1. Dual-verification is a taste signal. |
10 / 10
|
| 30 |
T6 / Tool use |
Agentic bug-fix loop — use tools, run, edit, verify |
In pi -p --tools print mode, prints commands as markdown instead of invoking them. In interactive pi the same 3.6 model uses write and bash correctly (see T12 rev below). Harness artefact, not a capability gap. Effective 8/10 interactively. |
8 / 10 rev
|
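The dual verification credited in test 29 (Part A) can be re-run mechanically. The sketch below is an independent check, not the model's derivation: it evaluates the recurrence and the closed form quoted in the review, and adds a brute-force count as a third opinion. The seed values A_2 = k(k-1) and A_3 = k(k-1)(k-2) and all function names are choices made here, and colorings are counted on a labeled cycle with no symmetry reduction.

```python
from itertools import product


def by_recurrence(n: int, k: int) -> int:
    """A_n = (k-2)*A_{n-1} + (k-1)*A_{n-2}, seeded with A_2 and A_3."""
    if n < 3:
        raise ValueError("a cycle needs at least 3 vertices")
    a_prev, a_curr = k * (k - 1), k * (k - 1) * (k - 2)  # A_2, A_3
    for _ in range(n - 3):
        a_prev, a_curr = a_curr, (k - 2) * a_curr + (k - 1) * a_prev
    return a_curr


def by_closed_form(n: int, k: int) -> int:
    """(k-1)^n + (-1)^n * (k-1)."""
    return (k - 1) ** n + (-1) ** n * (k - 1)


def by_brute_force(n: int, k: int) -> int:
    """Count assignments where every pair of neighbours on the cycle differs."""
    return sum(
        all(colors[i] != colors[(i + 1) % n] for i in range(n))
        for colors in product(range(k), repeat=n)
    )


if __name__ == "__main__":
    # Hexagon (n = 6) with k = 3 colors.
    print(by_recurrence(6, 3), by_closed_form(6, 3), by_brute_force(6, 3))
```

All three routes agree on 66 for the hexagon with three colors, matching the figure reported in the review.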