ProbCOPA Interactive Explorer

Browse individual ProbCOPA items and compare the distribution of human likelihood scores with the distribution of model responses. Each item asks: given a premise, how likely is the hypothesis to be its effect?

Compare the differential entropy (a measure of response variability) of human and model responses for each item. Points above the diagonal indicate that the model's responses are more variable than the human responses; points below the diagonal indicate that they are less variable.
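
The page does not say which entropy estimator is behind this plot. As a minimal sketch, the example below assumes a Gaussian fit, for which the differential entropy has the closed form 0.5 * ln(2 * pi * e * variance). The function name, the 0-100 score scale, and the sample scores are all hypothetical and only illustrate how one item would land above or below the diagonal.

```python
import math
from statistics import pvariance


def gaussian_differential_entropy(scores):
    """Differential entropy (in nats) of a Gaussian fitted to `scores`.

    Assumption: a Gaussian fit stands in for whatever estimator the
    explorer actually uses, which the page does not specify.
    """
    var = pvariance(scores)
    if var == 0:
        # All responses identical: the fitted density collapses to a point.
        return float("-inf")
    return 0.5 * math.log(2 * math.pi * math.e * var)


# Hypothetical likelihood scores for one item (assumed 0-100 scale).
human_scores = [10, 40, 60, 90]
model_scores = [48, 50, 50, 52]

h_human = gaussian_differential_entropy(human_scores)
h_model = gaussian_differential_entropy(model_scores)

# With human entropy on the x-axis and model entropy on the y-axis,
# h_model < h_human places this item below the diagonal.
below_diagonal = h_model < h_human
```

Under this assumption, an item where the model always gives the same score has entropy of negative infinity, so a real plot would need a floor or a different estimator for such items.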

The Wasserstein distance measures how far the model's response distribution is from the human response distribution for each item. Higher values mean the model's responses diverge more from the human responses. Dot color shows human entropy.
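
For one-dimensional samples such as likelihood scores, the Wasserstein-1 distance equals the area between the two empirical cumulative distribution functions. The sketch below computes it that way using only the standard library. The function name and the example scores are hypothetical, and the page does not state which order of Wasserstein distance or which implementation it uses.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two empirical 1-D distributions.

    Computed as the integral of |CDF_a(x) - CDF_b(x)| over x.
    The two samples may have different sizes.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    i = j = 0  # number of values <= current left edge in a and b
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        total += abs(i / len(a) - j / len(b)) * (right - left)
    return total


# Hypothetical scores for one item: the model is shifted up by 10 points.
human_scores = [10, 20, 30]
model_scores = [20, 30, 40]
distance = wasserstein_1d(human_scores, model_scores)
```

A pure shift of every response by a constant gives a distance equal to that constant, which makes the metric easy to read in the units of the score scale.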

Compare the median human likelihood score with the median model response for each item. Points on the diagonal indicate perfect agreement.

Explore the relationship between a model's mean reasoning chain length (average number of reasoning tokens across 30 sampled responses per item) and various outcome metrics.
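
As a rough illustration of the quantities involved, the sketch below averages reasoning-token counts per item and then summarizes their relationship to an outcome metric with a Pearson correlation. The correlation is an assumption about one way to read the scatter plot, not something the page reports. All names and numbers are hypothetical, and only three samples per item are shown instead of 30 to keep the example short.

```python
import math
from statistics import fmean


def mean_chain_length(token_counts):
    """Mean number of reasoning tokens across one item's sampled responses."""
    return fmean(token_counts)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-item data: reasoning-token counts for each sampled
# response, plus one outcome metric (here, a distance to the human scores).
reasoning_tokens = {
    "item_1": [100, 200, 300],
    "item_2": [400, 500, 600],
    "item_3": [700, 800, 900],
}
outcome_metric = {"item_1": 4.0, "item_2": 9.0, "item_3": 12.5}

items = sorted(reasoning_tokens)
x = [mean_chain_length(reasoning_tokens[i]) for i in items]
y = [outcome_metric[i] for i in items]
r = pearson(x, y)
```

A rank correlation such as Spearman's may be a better fit if chain lengths are heavily skewed, since a few very long reasoning chains can dominate a Pearson coefficient.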

How do temperature and reasoning effort / thinking budget settings affect model behavior? Lines show aggregate metrics across all items.
