validate.science
Claim-Level Epistemic Risk Assessment
Epistemic Risk Assessment Report
Novel Treatment Significantly Reduces Symptoms: A Pilot Study
Generated: 2/3/2026, 10:33:32 AM

Introduction

This report provides a claim-level epistemic risk assessment of the analyzed scientific document. Each claim extracted from the document has been evaluated against the evidence presented to identify potential instances of overreach—where claims may exceed what the evidence actually supports.

The assessment focuses on three primary failure modes: causal claims from correlational evidence, overgeneralization beyond sample scope, and underpowered claims from small samples.

Executive Summary

Total Claims: 3
Flagged Claims: 3
Evidence Found: 2
Other Findings: 0

Risk Distribution
High: 3, Medium: 0, Low: 0

All Claims

# | Claim | Risk Level | Score | Failure Modes
1 | Treatment group showed 50% reduction in symptom scores compared to 15% in control group (N=12, p=0.04). | High | 65% | Underpowered
2 | The 50% improvement demonstrates that Treatment X represents a breakthrough in managing this condition. | High | 88% | Underpowered, Overgeneralization
3 | The large effect size confirms that this treatment is superior to existing options. | High | 82% | Underpowered

Flagged Claims Details

1. Treatment group showed 50% reduction in symptom scores compared to 15% in control group (N=12, p=0.04).

Risk Score: 65%

Failure Modes: Underpowered

Evidence:

1. This pilot study with 12 patients demonstrates that Treatment X is highly effective. [N=12]
2. Treatment group showed 50% reduction in symptom scores compared to 15% in control group (N=12, p=0.04). Effect size was very large (Cohen's d=2.8). [N=12] [p=0.04]

Explanation:

With only 12 participants in total (likely 6 per group), this study is severely underpowered. The p-value of 0.04 sits just below the conventional 0.05 threshold and is highly sensitive to sampling variability. In samples this small, results that reach significance tend to overstate the true effect size, and a larger share of them are false positives.
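
The inflation described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not an analysis of the study's data: it assumes two independent groups of 6, normally distributed scores with unit variance, a pooled-variance two-sample t-test, and a hypothetical true effect of d=0.5. The function name `simulate` and all parameter values are illustrative.

```python
import math
import random
import statistics

N_PER_GROUP = 6        # assumed split of the 12 participants
T_CRIT_DF10 = 2.228    # two-sided 5% critical value of Student's t, 10 degrees of freedom


def simulate(true_d=0.5, runs=5000, seed=0):
    """Return (share of runs significant in favor of treatment,
    mean observed Cohen's d among those significant runs)."""
    rng = random.Random(seed)
    significant_ds = []
    for _ in range(runs):
        control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        treated = [rng.gauss(true_d, 1.0) for _ in range(N_PER_GROUP)]
        # Pooled standard deviation for two equal-sized groups.
        pooled_sd = math.sqrt(
            (statistics.variance(control) + statistics.variance(treated)) / 2
        )
        d = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd
        # For equal groups, t = d * sqrt(n / 2).
        t = d * math.sqrt(N_PER_GROUP / 2)
        if t > T_CRIT_DF10:
            significant_ds.append(d)
    return len(significant_ds) / runs, statistics.mean(significant_ds)


share_significant, mean_significant_d = simulate()
print(f"significant runs: {share_significant:.1%}")
print(f"mean observed d among significant runs: {mean_significant_d:.2f}")
```

Because only runs that clear the significance threshold are kept, the average observed d among them is necessarily well above the true d of 0.5 used in the simulation, which is the selection effect the explanation refers to.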

2. The 50% improvement demonstrates that Treatment X represents a breakthrough in managing this condition.

Risk Score: 88%

Failure Modes: Underpowered, Overgeneralization

Evidence:

1. This pilot study with 12 patients demonstrates that Treatment X is highly effective. [N=12]
2. Treatment group showed 50% reduction in symptom scores compared to 15% in control group (N=12, p=0.04). Effect size was very large (Cohen's d=2.8). [N=12] [p=0.04]

Explanation:

Calling a treatment a "breakthrough" based on a 12-person pilot study is premature. The study lacks statistical power to reliably detect true effects, and the large effect size (Cohen's d=2.8) is likely inflated due to small sample size. Pilot studies are meant to inform larger trials, not establish clinical efficacy.

3. The large effect size confirms that this treatment is superior to existing options.

Risk Score: 82%

Failure Modes: Underpowered

Evidence:

1. Treatment group showed 50% reduction in symptom scores compared to 15% in control group (N=12, p=0.04). Effect size was very large (Cohen's d=2.8). [N=12] [p=0.04]

Explanation:

Effect sizes from small samples are notoriously unreliable and tend to be inflated. The claimed Cohen's d=2.8 is exceptionally large and should be viewed with skepticism. Without comparison to "existing options" in a properly powered trial, claims of superiority are unsupported.
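
One way to see why a significant result from a sample this small almost has to report a large effect is to convert between Cohen's d and the t-statistic. The sketch below assumes two independent groups of 6 and a pooled-variance two-sample t-test; the source document does not state which test was used, so these are assumptions, and the helper names `t_from_d` and `d_from_t` are illustrative.

```python
import math

N_PER_GROUP = 6        # assumed split of the 12 participants
T_CRIT_DF10 = 2.228    # two-sided 5% critical value of Student's t, 10 degrees of freedom


def t_from_d(d, n_per_group):
    """t-statistic implied by Cohen's d for two independent, equal-sized groups."""
    return d * math.sqrt(n_per_group / 2)


def d_from_t(t, n_per_group):
    """Cohen's d implied by a t-statistic for two independent, equal-sized groups."""
    return t * math.sqrt(2 / n_per_group)


# Smallest |d| that can reach p < 0.05 with 6 participants per group.
min_significant_d = d_from_t(T_CRIT_DF10, N_PER_GROUP)

# t-statistic that the reported d = 2.8 would imply under the same assumptions.
implied_t = t_from_d(2.8, N_PER_GROUP)

print(f"minimum significant d: {min_significant_d:.2f}")
print(f"t implied by d = 2.8:  {implied_t:.2f}")
```

Under these assumptions, any result significant at the 5% level corresponds to an observed d of roughly 1.29 or more, so a pilot of this size cannot report a significant small or medium effect. The same conversion gives a t-statistic of about 4.85 for d=2.8, far above the 2.228 threshold, whereas p=0.04 sits close to it. If the study used a different test or an unequal group split, these figures would change.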

Evidence Extracted

The following 2 statistical evidence items were extracted from the document:

1. This pilot study with 12 patients demonstrates that Treatment X is highly effective. [N=12]
2. Treatment group showed 50% reduction in symptom scores compared to 15% in control group (N=12, p=0.04). Effect size was very large (Cohen's d=2.8). [N=12] [p=0.04]
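
For illustration, values like the N and p tags above could be pulled from a sentence with regular expressions. The report does not describe how its extraction works, so this is only a sketch: the patterns and the function name `extract_stats` are hypothetical and cover just the two phrasings that appear in the items above.

```python
import re

# Hypothetical patterns; they handle "N=12", "12 patients", and "p=0.04" only.
N_EQUALS = re.compile(r"\bN\s*=\s*(\d+)")
N_WORDS = re.compile(r"\b(\d+)\s+(?:patients|participants|subjects)\b")
P_VALUE = re.compile(r"\bp\s*=\s*(\d*\.\d+)")


def extract_stats(sentence):
    """Return a dict with the sample size ("N") and p-value ("p") found in the sentence."""
    stats = {}
    n_match = N_EQUALS.search(sentence) or N_WORDS.search(sentence)
    if n_match:
        stats["N"] = int(n_match.group(1))
    p_match = P_VALUE.search(sentence)
    if p_match:
        stats["p"] = float(p_match.group(1))
    return stats


item_1 = "This pilot study with 12 patients demonstrates that Treatment X is highly effective."
item_2 = ("Treatment group showed 50% reduction in symptom scores compared to 15% "
          "in control group (N=12, p=0.04). Effect size was very large (Cohen's d=2.8).")

print(extract_stats(item_1))
print(extract_stats(item_2))
```

A production extractor would need to handle many more forms, such as inequalities (p<0.05), per-group sample sizes, and confidence intervals.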

Appendix: Methodology

How This Report Was Generated

1. Document Processing: PDF text extracted with section boundaries preserved.
2. Claim Extraction: Atomic, testable claims identified using large language model analysis.
3. Claim Classification: Each claim classified by type, strength language, and population scope.
4. Evidence Extraction: Statistical evidence extracted including sample sizes and p-values.
5. Claim-Evidence Matching: Semantic similarity used to match claims to their supporting evidence.
6. Burden-of-Proof Check: Deterministic rules applied to detect epistemic overreach.
7. Risk Scoring: Epistemic risk score computed based on failure modes.
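
The claim-evidence matching step can be sketched as follows. The report does not say which similarity model it uses, so this example swaps in a plain bag-of-words cosine similarity as a stand-in for semantic similarity. The function names and the 0.2 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word and number counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_evidence(claim, evidence_items, threshold=0.2):
    """Return evidence items whose similarity to the claim meets the threshold, best first."""
    claim_vec = bag_of_words(claim)
    scored = [(cosine(claim_vec, bag_of_words(item)), item) for item in evidence_items]
    return [item for score, item in sorted(scored, reverse=True) if score >= threshold]


claim = ("Treatment group showed 50% reduction in symptom scores compared to 15% "
         "in control group (N=12, p=0.04).")
evidence = [
    "This pilot study with 12 patients demonstrates that Treatment X is highly effective.",
    "Treatment group showed 50% reduction in symptom scores compared to 15% "
    "in control group (N=12, p=0.04). Effect size was very large (Cohen's d=2.8).",
]

for item in match_evidence(claim, evidence):
    print(item)
```

Word overlap only rewards shared vocabulary, so it will miss paraphrases that an embedding-based similarity measure would catch. It is used here only to keep the example dependency-free.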

Failure Mode Definitions

Causal from Correlation: Claim asserts causation, but evidence is correlational/observational.
Overgeneralization: Claim makes broad assertions from a narrow or small sample.
Underpowered: Claim makes strong assertions with inadequate sample size.
Insufficient Evidence: No matching evidence found to evaluate this claim.
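
The definitions above can be read as deterministic rules. The sketch below is one possible encoding, not the tool's actual rule set: the sample-size cutoff, the term lists, the `Evidence` fields, and the function name `failure_modes` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_SAMPLE_SIZE = 30                                   # hypothetical cutoff, not from the report
CAUSAL_TERMS = ("causes", "leads to", "results in")    # hypothetical term list
BROAD_TERMS = ("breakthrough", "all patients", "in general")  # hypothetical term list


@dataclass
class Evidence:
    sample_size: Optional[int]   # None when no sample size was extracted
    randomized: bool             # hypothetical flag for experimental vs. observational design


def failure_modes(claim: str, evidence: List[Evidence]) -> List[str]:
    """Return the failure modes a claim triggers under the illustrative rules."""
    if not evidence:
        return ["Insufficient Evidence"]
    text = claim.lower()
    sizes = [e.sample_size for e in evidence if e.sample_size is not None]
    small_sample = bool(sizes) and max(sizes) < MIN_SAMPLE_SIZE
    modes = []
    # Causal wording with no randomized evidence behind it.
    if any(term in text for term in CAUSAL_TERMS) and not any(e.randomized for e in evidence):
        modes.append("Causal from Correlation")
    # Broad wording resting on a small sample.
    if small_sample and any(term in text for term in BROAD_TERMS):
        modes.append("Overgeneralization")
    if small_sample:
        modes.append("Underpowered")
    return modes


pilot = [Evidence(sample_size=12, randomized=True)]  # randomization is assumed, not stated
print(failure_modes(
    "The 50% improvement demonstrates that Treatment X represents a breakthrough "
    "in managing this condition.", pilot))
print(failure_modes(
    "The large effect size confirms that this treatment is superior to existing options.",
    pilot))
```

With these illustrative term lists the sketch flags the same failure modes the report lists for claims 2 and 3, but that agreement comes from how the lists were chosen and says nothing about the rules the tool applies.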