Methodology: how a thesis becomes a Gap score
A four-step pipeline that turns a natural-language investment thesis into a single comparable probability-gap number. Every step is transparent and independently verifiable.
The one-sentence version
Decompose → retrieve → classify → score. The primary LLM breaks the thesis into verifiable sub-claims, the Polymarket API retrieves matching markets, a dual-model ensemble labels direction, and a weighted formula produces a Gap score from −100 to +100. Typical runtime is 30–90 seconds.
Step 1: Decompose the thesis
Input: a natural-language thesis of any length. Output: a set of atomic sub-claims, each with:
- claim_text: a single verifiable statement (“Fed cuts cumulatively 50bps or less by December 2026”)
- timeframe: an explicit time window (“by 2026-12-31”)
- verifiable_predicate: a machine-checkable condition (“fed_funds_rate decreases by ≤ 0.005”)
- search_keywords: 3–5 terms to drive retrieval in step 2
- importance: critical / high / medium, used as a weight in step 4
Model: the primary LLM (via a large-language-model API), chosen for its ability to unpack implicit conditions in long theses. Older-generation models frequently treat "BTC breaks $120K by year-end" as one atomic claim; the primary model recognizes that it simultaneously encodes three independently verifiable dimensions: a time condition, a price threshold, and the ambiguity of "breaks" (reach or exceed?).
Temperature is set to 0.1 so the same input produces stable output. One retry is attempted on failure; on a second failure the thesis is marked decompose_failed and the credit is refunded automatically.
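The shape of a decomposed sub-claim can be sketched as a plain record. This is illustrative only: the field names match the schema listed above, but the values and the dict representation are a hypothetical example, not the tool's actual output format.

```python
# Hypothetical sub-claim as step 1 might emit it; field names follow the
# schema above, values are an invented example.
sub_claim = {
    "claim_text": "Fed cuts cumulatively 50bps or less by December 2026",
    "timeframe": "by 2026-12-31",
    "verifiable_predicate": "fed_funds_rate decreases by <= 0.005",
    "search_keywords": ["fed", "rate cut", "FOMC", "2026"],
    "importance": "critical",  # critical / high / medium -> weight 3 / 2 / 1 in step 4
}

# Basic sanity checks a validator might apply before step 2.
assert sub_claim["importance"] in {"critical", "high", "medium"}
assert 3 <= len(sub_claim["search_keywords"]) <= 5
```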
Step 2: Retrieve markets
Each sub-claim's search_keywords are passed to the Polymarket CLOB API's /markets endpoint. We keep only markets that are:
- active (not yet resolved)
- trading at least $1,000 in the last 24 hours (to exclude dead long-tail markets)
- closing within the sub-claim's time window
Surviving candidates are ranked by keyword relevance and trading volume. The top 5 per sub-claim are passed to step 3.
If a sub-claim has no surviving candidates at all (e.g. the thesis touches a niche Polymarket doesn't cover), it's marked polymarket_matchability = "none" and handed off to the AI fallback described below.
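The three filters and the ranking rule above can be sketched as a single function. This is a minimal illustration, not the actual implementation: the market dict fields (`active`, `volume_24h`, `end_date`, `relevance`) are assumed stand-ins for whatever the Polymarket response actually contains, and dates are compared as ISO strings for simplicity.

```python
def filter_markets(candidates, window_end, min_volume_24h=1_000, top_n=5):
    """Apply the step-2 filters to one sub-claim's candidate pool.

    `candidates`: list of dicts with hypothetical fields `active` (bool),
    `volume_24h` (USD), `end_date` (ISO date string), `relevance` (float).
    `window_end`: the sub-claim's time window as an ISO date string.
    """
    survivors = [
        m for m in candidates
        if m["active"]                          # not yet resolved
        and m["volume_24h"] >= min_volume_24h   # drop dead long-tail markets
        and m["end_date"] <= window_end         # closes within the time window
    ]
    # Rank by keyword relevance, with 24h volume as the tiebreaker.
    survivors.sort(key=lambda m: (m["relevance"], m["volume_24h"]), reverse=True)
    return survivors[:top_n]
```

An empty return value here is what triggers the `polymarket_matchability = "none"` path and the AI fallback.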
Step 3: Classify direction
Given a candidate pool, we classify each market's direction relative to its sub-claim:
- supports (aligns): a high market YES probability implies a high sub-claim probability
- contradicts: a high market YES probability implies a low sub-claim probability
- neutral: not directly related to the sub-claim
This step uses a dual-model ensemble: two independent LLMs from different families label the same (sub-claim, market) pair in parallel. When they agree, the label is accepted. When they disagree, the primary model breaks the tie.
Why not trust a single model? Investment-domain semantics are nuanced — “the Fed doesn't cut rates” and “the Fed holds rates flat” are not strictly equivalent in market terms (the second excludes a hike). Using two models from different families independently catches blind spots that one model alone misses. In practice the two agree on ~82% of pairs; of the remaining ~18%, the primary model was correct more often in manual review, so it's the tiebreaker.
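The agree-or-tiebreak rule is simple enough to state in a few lines. In this sketch the three model arguments are callables standing in for LLM calls; their interface is hypothetical, and each is assumed to return one of the three direction labels.

```python
DIRECTIONS = {"supports", "contradicts", "neutral"}

def classify_pair(sub_claim, market, model_a, model_b, primary):
    """Dual-model ensemble from step 3 (sketch).

    model_a and model_b are the two independent LLMs from different
    families; primary is the tiebreaker. All three are hypothetical
    callables returning a label in DIRECTIONS.
    """
    a = model_a(sub_claim, market)
    b = model_b(sub_claim, market)
    assert a in DIRECTIONS and b in DIRECTIONS
    if a == b:
        return a  # agreement (~82% of pairs): accept the shared label
    return primary(sub_claim, market)  # disagreement: primary model decides
```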
Fallback for unmatched claims. When a sub-claim has zero candidate markets in step 2, a separate LLM writes a standalone 100–200-word analysis for that claim, rendered as an amber card labeled “AI-only signal, not market-backed”. The user decides whether to fold it into the Gap calculation.
Step 4: Compute the Gap score
The last step is pure arithmetic, running live in the browser. For each sub-claim i:
- uᵢ: the user's subjective probability from the slider (0–1)
- mᵢ: the direction-weighted market probability
- wᵢ: importance weight (critical=3, high=2, medium=1)
mᵢ direction-weights the candidate pool: a supporting market contributes its YES probability P, a contradicting market contributes 1 − P, and neutral markets are excluded. The result is normalized by relevance.
Final Gap score:
Gap = 100 × ( Σᵢ wᵢ × (uᵢ - mᵢ) ) / ( Σᵢ wᵢ )
Range −100 to +100. Positive = you are more bullish / rate the thesis higher than the market. Negative = you are more bearish. Near zero = your assumptions match the consensus almost exactly.
Color bands. |Gap| < 10 green (you're with the crowd). 10–25 amber (meaningful disagreement). > 25 red (you and the market disagree materially — worth re-examining your assumptions).
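The whole of step 4 fits in a short function pair, matching the formula above term by term. This is a sketch under stated assumptions: the dict fields (`direction`, `yes_prob`, `relevance`, `user_prob`, `market_prob`, `importance`) are illustrative names, not the app's actual data model.

```python
WEIGHTS = {"critical": 3, "high": 2, "medium": 1}

def direction_weighted_prob(markets):
    """Collapse one sub-claim's pool into m_i.

    Supporting markets contribute P, contradicting markets 1 - P,
    neutral markets are excluded; relevance is the normalization weight.
    Returns None when the pool has no directional markets.
    """
    num = den = 0.0
    for mkt in markets:
        if mkt["direction"] == "neutral":
            continue
        p = mkt["yes_prob"] if mkt["direction"] == "supports" else 1 - mkt["yes_prob"]
        num += mkt["relevance"] * p
        den += mkt["relevance"]
    return num / den if den else None

def gap_score(sub_claims):
    """Gap = 100 * sum_i(w_i * (u_i - m_i)) / sum_i(w_i), in [-100, +100]."""
    num = den = 0.0
    for c in sub_claims:
        w = WEIGHTS[c["importance"]]
        num += w * (c["user_prob"] - c["market_prob"])
        den += w
    return 100 * num / den
```

For example, a critical claim where you sit at 0.8 against a market at 0.5, plus a medium claim at 0.4 against 0.5, yields 100 × (3 × 0.3 − 1 × 0.1) / 4 = +20, landing in the amber band.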
What happens when it fails?
Credits are refunded automatically in these cases:
- Step 1 fails (the primary LLM refuses or errors after retry)
- Step 1 produces zero valid sub-claims
- The background task exceeds 90 seconds without responding (typically a Vercel serverless timeout)
If step 2 finds zero markets across all sub-claims, no refund is issued, but the AI fallback in step 3 still produces a per-claim analysis, so you get a full report. The Gap score in that case is purely AI-derived, not market-grounded.
Known limitations
- Polymarket coverage is uneven. US politics, crypto, and specific macro events are densely covered; small-cap equities, China macro, and private markets are nearly empty. Theses touching those areas route heavily to the AI fallback, so the market signal is weak.
- Liquidity noise. Markets with < $10K volume can move 5–10% on a single $1,000 order. We filter out < $1K 24h volume, but the $1K–$10K band still shows up in pools. Treat those probabilities as weak signals.
- Classification errors. The ensemble still mislabels ~5–8% of (claim, market) pairs. The thesis detail page lets you manually override any link's direction.
- Geographic restrictions. Polymarket is CFTC-restricted in the US and inaccessible from mainland China. Legality is undetermined in several other jurisdictions (India, multiple Middle Eastern countries). We don't recommend trading on Polymarket from those regions — but reading market data for research is typically compliant (public information).
Ready to try it?