Medical AI Training Data
Specialist medical training data, written by practicing physicians.
Custom datasets for AI labs and medical AI companies: SFT responses, eval items, red-team examples, and rubric grading, written and reviewed by board-certified clinicians.
Prompt
“I’m a 31-year-old woman. I’ve been getting headaches almost every week for the past two months. They’re on one side of my head, throbbing, and last about 6–8 hours. Sometimes I feel nauseous and light bothers me...”
Response (excerpt)
“What you’re describing is consistent with migraine, a neurological condition that affects roughly 1 in 8 people. The combination of features you mention fits a classic migraine pattern closely...”
What We Produce
Six task types, each designed for a distinct stage of the AI training pipeline. All produced by credentialed clinicians, not annotators.
SFT Response Writing: Patient-Facing
Physician writes an ideal expert response to a patient's medical question, suitable for training consumer-facing AI assistants.
SFT Response Writing: Clinician-Facing
Physician writes a clinical reasoning response to a colleague-style prompt, for models deployed in clinical decision support contexts.
Eval Dataset Construction
Physician writes board-style multiple-choice questions with detailed explanations, used as benchmarks to measure model medical knowledge. Output is compatible with HealthBench-style evaluation formats.
Red-Team Examples
Physician constructs scenarios and prompts that reliably surface model failures: hallucinations, scope violations, and unsafe omissions.
Rubric Grading
Physician grades AI-generated responses against a structured 5-dimension rubric, producing preference pairs and quality labels for RLHF.
Custom Knowledge Capture
Physician writes narrative reflections capturing tacit clinical reasoning: pattern recognition, heuristics, and decision thresholds that resist formalization.
Multimodal Annotation: Audio, Video, Medical Imaging
Coming Q3 2026 (in development): Physician annotation of radiology images, clinical audio, and procedural video. Contact us if this is relevant to your roadmap.
Specialties
Current sample dataset covers three specialties. Additional specialties are available on request. We source credentialed clinicians for your specific domain.
Currently Active
- Family Medicine
- General Surgery
- Internal Medicine
Available on Request
- Cardiology
- Pediatrics
- Emergency Medicine
- Obstetrics & Gynecology
- Psychiatry
- Dermatology
- Radiology
- Pathology
- Anesthesiology
- Orthopedic Surgery
- Neurology
- Others on request
How We Work
Straightforward engagement model. No platform, no marketplace. A single team, directly managed.
Scoping Call
We start with a 30-minute call to understand your task types, specialties, volume, timeline, and any format requirements. If you already have a schema or style guide, we adapt to it. Output: a written scope document with task definitions, delivery format, and a sample task for your sign-off before we begin.
Pilot Delivery
We deliver a pilot batch of 50–200 tasks within 2–4 weeks, depending on specialty and complexity. Pilots are priced at cost. The goal is to give you enough data to evaluate quality and run your own internal evals before committing to scale. We iterate on format and rubric based on your feedback.
Scaled Engagement
After pilot sign-off, we ramp to your target volume. The same physicians who did the pilot continue through the scaled engagement, so quality stays consistent rather than averaged across a rotating pool. Delivery is in batches with QA reports included.
Methodology note
Our task design and quality standards are aligned with the evaluation framework established in OpenAI's HealthBench Professional (2026), which benchmarks physician-level medical reasoning in AI systems. Buyers evaluating vendors against HealthBench-style criteria will find our methodology directly compatible.
Quality & Methodology
Every task goes through a structured two-clinician review process. Quality is scored, not assumed.
QA Process
1. Primary clinician writes the task response from scratch, with no AI assistance in production runs.
2. Reviewing clinician scores all dimensions blind to the author's identity and checks for auto-fail conditions.
3. Tasks scoring below 18/25 or triggering any auto-fail are returned for revision before delivery.
4. Final scores, reviewer notes, and auto-fail flags are embedded in the task JSON, reviewable by the buyer.
Auto-Fail Criteria
- AF1: Dangerous recommendation (incorrect dose, missed emergency)
- AF2: Critical red flag omitted for a presentation requiring emergency evaluation
- AF3: Scope violation (definitive diagnosis or prescription in a patient-facing context)
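The revision gate described in the QA process (a score floor plus auto-fail flags) reduces to a few lines of logic. The sketch below is illustrative only; the field names (`rubric_scores`, `auto_fail_flags`) are hypothetical, not our actual delivery schema:

```python
# Illustrative sketch of the review gate: tasks scoring below 18/25, or
# carrying any auto-fail flag (AF1-AF3), are returned for revision.
# Field names are hypothetical, not the actual delivery schema.

PASS_THRESHOLD = 18  # minimum total out of 25

def delivery_status(task: dict) -> str:
    """Return 'deliver' or 'revise' for a reviewed task record."""
    if task.get("auto_fail_flags"):  # any AF1-AF3 flag blocks delivery outright
        return "revise"
    total = sum(task["rubric_scores"].values())  # five dimensions, 0-5 each
    return "deliver" if total >= PASS_THRESHOLD else "revise"

sample = {
    "rubric_scores": {
        "clinical_accuracy": 5,
        "differential_reasoning": 4,
        "red_flags_safety": 5,
        "workup_management": 5,
        "communication_quality": 5,
    },
    "auto_fail_flags": [],
}
```

Here `delivery_status(sample)` returns `"deliver"` (24/25, no auto-fails); adding any flag to `auto_fail_flags` flips the result to `"revise"` regardless of score.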
Rubric Approach 1: Fixed 5-Dimension
Applied to all SFT response tasks. Every response is scored on the same five dimensions regardless of clinical scenario, enabling cross-task comparison and aggregate quality reporting.
| Dimension | What's Evaluated |
|---|---|
| Clinical Accuracy | Reflects current standard-of-care; no factual errors or outdated guidance. |
| Differential & Reasoning | Explicit consideration of alternatives, not just a single conclusion. |
| Red Flags & Safety | All clinically relevant warning signs identified with specific escalation instructions. |
| Workup & Management | Evidence-based next steps, appropriately tiered for presentation severity. |
| Communication Quality | Plain language, logical structure, appropriate register for the intended audience. |
Rubric Approach 2: Example-Specific
Applied to eval items, red-team tasks, and custom knowledge tasks where a fixed rubric is too coarse. The reviewing clinician writes task-specific scoring criteria during task design, so evaluation reflects what matters for that particular clinical question rather than a generic template.
| Tier | Score (fixed rubric) |
|---|---|
| Acceptable | 18+/25 |
| Strong | 22+/25 |
| Exemplary | 24+/25 |
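The tier cutoffs in the table above reduce to a simple threshold check; a minimal sketch (the function name is illustrative):

```python
# Map a fixed-rubric total (out of 25) to the quality tiers in the table above.
def quality_tier(total: int) -> str:
    if total >= 24:
        return "Exemplary"
    if total >= 22:
        return "Strong"
    if total >= 18:
        return "Acceptable"
    return "Below threshold"  # returned for revision per the QA process
```

For example, both SFT samples in the dataset section total 24/25 and land in the Exemplary tier.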
Sample Dataset
Three tasks from the Q2 2026 pilot dataset. Full prompt, response, and rubric scores. All content written by credentialed clinicians.
Download full sample dataset (JSON) [placeholder link]
Prompt
“I'm a 31-year-old woman. I've been getting headaches almost every week for the past two months. They're on one side of my head, throbbing, and last about 6–8 hours. Sometimes I feel nauseous and light bothers me. I took ibuprofen once and it helped somewhat. Is this something I should worry about?”
Response
What you're describing is consistent with migraine — a neurological condition that affects roughly 1 in 8 people and is one of the most common causes of recurrent, one-sided, throbbing headaches. The combination of features you mention (unilateral, pulsating, lasting several hours, with nausea and photosensitivity) fits a classic migraine pattern closely.
What's likely going on
Migraine isn't simply a severe headache. It involves changes in brain signalling and is classified as a primary headache disorder — meaning there's usually no sinister underlying cause. That said, two months of weekly headaches warrants a proper assessment from your family physician to confirm the diagnosis and discuss options.
Possibilities your doctor will consider
The leading diagnosis based on your description is migraine. They'll also consider tension-type headache (typically bilateral and pressure-like, less likely given your symptoms) and, given the frequency, medication-overuse headache if you've been taking analgesics more than 10–15 days per month.
Red flags — seek emergency care immediately if you develop:
- A headache that is the worst of your life, or comes on suddenly like a thunderclap
- Headache accompanied by fever, stiff neck, or rash
- New weakness, numbness, vision changes, or difficulty speaking
- Headache following a head injury
- A rapidly worsening pattern over days to weeks
Your current pattern doesn't suggest any of these, but these warning signs change the picture entirely.
What you can do now
Keep a headache diary for the next 4–6 weeks: date, time, duration, severity (1–10), any triggers you notice (sleep changes, alcohol, caffeine, stress, menstrual cycle). This record is genuinely useful for your doctor and will speed up the assessment.
Ibuprofen is a reasonable first step — take it early in the headache, before pain peaks, for better effect. Naproxen sodium is an alternative. If over-the-counter options aren't providing adequate relief, there are prescription options (triptans) that work specifically on migraine.
Recommended next step
Book an appointment with your family physician. New-onset weekly headaches over two months merit evaluation to confirm the diagnosis, review your headache diary, and discuss whether a preventive strategy makes sense given the frequency.
Rubric Scores
| Dimension | Score |
|---|---|
| Clinical Accuracy | 5/5 |
| Differential & Reasoning | 4/5 |
| Red Flags & Safety | 5/5 |
| Workup & Management | 5/5 |
| Communication Quality | 5/5 |
| Total | 24/25 |
Reviewer Notes
Excellent structure and red flag coverage. Differential reasoning is sound, though medication-overuse headache could be named more prominently as a distinct consideration at this frequency. Communication quality is high: plain language without being condescending. No auto-fails.
Prompt
“POD2 male, 58 years old, following elective open right hemicolectomy for T2N0 cecal adenocarcinoma. No diversion. Vitals stable, afebrile (37.1°C). Abdomen soft, mild diffuse tenderness, no peritonism. No flatus, no ostomy output. WBC 13.4 (pre-op 8.9). Tolerating sips. CRP not yet resulted. How do you approach this?”
Response
This presentation requires you to hold two possibilities in parallel: expected post-operative ileus (the most likely explanation) versus early anastomotic leak or intra-abdominal complication (the one you cannot miss).
Risk stratification
The WBC rise from 8.9 to 13.4 on POD2 is the detail that demands attention. A mild leukocytosis is expected post-operatively, but a 50% increase from baseline warrants active monitoring rather than reassurance. The absence of fever and peritonism is reassuring but does not exclude early anastomotic compromise — presentation can be subtle at this stage.
Immediate steps
- Ensure adequate IV access and fluid balance documentation; correct any deficit
- Hold oral intake at sips for now — do not advance diet
- Obtain CRP urgently if not already resulted; a CRP >150 on POD3–4 has reasonable sensitivity for anastomotic leak in colorectal surgery
- Ensure a baseline lactate is available; order if not done
- Examine the wound and drain output carefully — character, volume, any feculent or turbid output significantly raises leak probability
- Review the operative note: anastomotic technique (stapled vs hand-sewn), any intraoperative concerns, drain placement
Monitoring threshold
If by POD3: (a) WBC is trending down, (b) CRP is <150 and/or trending as expected, (c) patient passes flatus, (d) remains afebrile with soft abdomen → this is consistent with ileus; advance to clear fluids cautiously.
If by POD3–4 any of the following: fever >38.5°C, rising or persistently elevated WBC, CRP >150, peritonism, drain output becoming turbid/feculent, or clinical deterioration → CT abdomen/pelvis with IV contrast is indicated to evaluate anastomotic integrity and rule out collection.
Do not advance diet or push for early discharge on the basis of an "ileus" diagnosis before the trajectory of inflammatory markers is clear. The WBC rise in an otherwise well-looking patient is the signal to watch, not dismiss.
Communication
Brief your attending on the WBC trend today. If the picture is ambiguous by tomorrow morning, a low threshold for CT is appropriate — a missed leak caught on POD4 is a very different problem from one caught on POD7.
Rubric Scores
| Dimension | Score |
|---|---|
| Clinical Accuracy | 5/5 |
| Differential & Reasoning | 5/5 |
| Red Flags & Safety | 5/5 |
| Workup & Management | 5/5 |
| Communication Quality | 4/5 |
| Total | 24/25 |
Reviewer Notes
Strong clinical reasoning. Correctly identifies the WBC trend as the key signal and provides a clear monitoring framework. Communication quality scored slightly lower: the final paragraph is somewhat directive in tone for a peer-to-peer response, though not inappropriate. No auto-fails.
Prompt
“A 62-year-old man with type 2 diabetes and stage 3a CKD (eGFR 48 mL/min/1.73m²) presents for hypertension management. His blood pressure today is 148/92 mmHg on two separate readings. He is not currently on any antihypertensive medication. He has no history of cardiovascular disease or albuminuria. Which of the following is the most appropriate first-line antihypertensive agent? A) Amlodipine B) Ramipril C) Hydrochlorothiazide D) Metoprolol succinate E) Spironolactone”
Response
Correct answer: B — Ramipril
Explanation: In a patient with type 2 diabetes and CKD, an ACE inhibitor (or ARB) is the preferred first-line antihypertensive agent regardless of the presence or absence of albuminuria in current guidelines (KDIGO 2021, Canadian Diabetes Association). The renoprotective effect of renin-angiotensin system blockade is established in diabetic nephropathy and extends to CKD patients without overt proteinuria in this clinical context.
Why the distractors are incorrect:
- A (Amlodipine): A reasonable second-line or combination agent in this population, and appropriate for isolated systolic hypertension without CKD/DM comorbidity, but not preferred as first-line monotherapy here.
- C (HCTZ): Thiazide diuretics can worsen glycaemic control and are less nephroprotective in CKD; not first-line with this combination of comorbidities.
- D (Metoprolol): Beta-blockers are not first-line for hypertension without compelling indications (post-MI, heart failure with reduced EF, rate control). Can mask hypoglycaemia symptoms in insulin-treated diabetics.
- E (Spironolactone): Contraindicated or used with caution in CKD (eGFR <45 threshold in many guidelines) due to hyperkalaemia risk; not appropriate as first-line here.
Key teaching point: In diabetic CKD, ACE inhibitor or ARB is the default first-line antihypertensive choice due to nephroprotection. Monitor eGFR and potassium at 1–2 weeks after initiation.
Rubric Scores
| Dimension | Score |
|---|---|
| Clinical Accuracy | 5/5 |
| Differential & Reasoning | 5/5 |
| Red Flags & Safety | 5/5 |
| Workup & Management | 5/5 |
| Communication Quality | 5/5 |
| Total | 25/25 |
Reviewer Notes
Accurate, well-structured eval item. Distractors are clinically plausible and the explanations are appropriately detailed. Teaching point is concise and correct. Exemplary.
About Synapse
Synapse is a boutique medical AI training data vendor built around a network of practicing Canadian physicians. We produce custom datasets across SFT response writing, eval construction, red-teaming, rubric grading, and multimodal tasks. Built for AI labs and medical AI companies that need clinical depth their current vendors cannot provide.
Our methodology aligns with the standards established in OpenAI's HealthBench Professional (2026), with multi-stage physician adjudication and rubric-anchored evaluation built into every deliverable.
Synapse is based in Toronto, Ontario.
Clinician Network
Clinician identities are held in private contracts and not disclosed publicly. Current network includes:
- Family Physician, Greater Toronto Area, 15+ years in practice
- General Surgery Resident (PGY-3), Ontario teaching hospital
- Internal Medicine Specialist, FRCPC certified
- Family Physician, Alberta, rural and urban clinical experience
- General Surgeon, Ontario, academic and community practice
Specialty-specific clinicians available for project-scoped recruitment in: Emergency Medicine, OB/GYN, Psychiatry, Pediatrics, Dermatology, Cardiology, Radiology, Pathology, and others.
Contact
Tell us about your project and we'll follow up within one business day. No demo call required to get started.