Reproducible · versioned · public
Methodology
How we measure which businesses AI assistants recommend, how we compute confidence, and where the limits are.
1. Sampling
For each (city, category) we run a fixed prompt set against four LLMs (Claude, ChatGPT, Gemini, Perplexity), N times per (prompt × model) to measure intra-model variance. Total scan slots = prompts × models × samples. The current active prompt sets are versioned below.
2. Intent buckets
Prompts are grouped into intent buckets (broad authority, specific need, process, style/approach, conversational). Aggregating within a bucket and exposing per-bucket counts preserves the signal that "best lawyer for X" and "most aggressive litigator" measure different things — averaging them would flatten useful information.
3. Citation extraction
Each LLM response is parsed by a small extractor (GPT-4o-mini) into a JSON array of business names. Names are normalized (suffixes like "LLP" / "PC" / "& Associates" stripped) and matched against a pre-harvested business database for the (city, category) via Levenshtein-distance fuzzy match. Mentions that don't match any harvested firm are tracked separately and used to drive Google Places reverse-lookup.
4. Ranking + confidence
Share = citation_count / n_slots. 95% CI uses the Wilson score interval (correct for binomial proportions at small N). Consensus score = (n_models_cited / 4) × log(citations + 1) — rewards both breadth across LLMs and depth within them. Rank order uses share, with consensus as the tiebreaker.
Tier classification:
- 🟢 Consensus — CI lower bound > 10%. Signal clearly above noise floor.
- 🟡 Mid-tier — share > 3% AND cited by ≥2 LLMs. Real signal, but CI spans the noise floor.
- ⚪ Below threshold — neither condition met. Published but flagged.
5. Immutable snapshots
Each scan run produces a snapshot. Snapshots are never updated or deleted — the leaderboard you're viewing is always the most recent. Rank trends are reconstructed from the snapshot history; methodology changes increment the version and snapshots from different versions are not directly comparable.
6. AI assistants tested
- Claude —
anthropic/claude-sonnet-4.6 - ChatGPT —
openai/gpt-4o - Gemini —
google/gemini-2.5-pro - Perplexity —
perplexity/sonar-pro
Each model receives each prompt independently — no memory, no system instructions, no conversation history. Each prompt is run N times to surface intra-model variance.
7. Active prompt sets
Each active prompt set is published as a versioned record (slug + version + bucket breakdown). Per-bucket prompt counts disclose the sampling design; verbatim prompt text is not published — it's the proprietary editorial calibration that distinguishes the Atlas from a generic GEO scraper.
Process
5 prompts
Specific Need
5 prompts
Conversational
5 prompts
Style Approach
5 prompts
Broad Authority
5 prompts
8. Archived methodology versions
- divorce-attorneys · v1.0 — 10 prompts · created Jun 25, 2026
9. Corrections + takedowns
If your firm is misrepresented or you want to request removal, contact [email protected]. We do not alter rankings on request, but we will correct factual errors (wrong website, wrong name) within 48 hours.
10. Reproducibility
Every citation traces to a verbatim LLM response. Every snapshot points at a versioned prompt set and the source scan run IDs. Read-only JSON API at /api/leaderboards/{metro}/{industry}/{category}.json.