Singularity


submitted 1 year ago by bOt@zerobytes.monster to c/singularity@zerobytes.monster


The original was posted on /r/singularity by /u/ConsiderationNo3558 on 2024-01-22 12:34:09+00:00.

The data comes from Chatbot Arena and is based on pairwise chatbot battles.
My own notebook with more code and analysis can be referenced at
The original notebook analysis can be found in the link below, which has more aggregate stats.
The focus of this notebook is to analyze each model's win ratio over time by capturing daily and weekly data.
It does not tell us whether a model's performance degraded or improved over time, because the win ratio is a relative comparison against other models.
The win ratio is calculated excluding ties.
Note that this is not an evaluation of one model battling against another.
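For anyone who wants to reproduce the idea, here is a minimal sketch of a weekly win ratio that excludes ties. It is not the code from either notebook. The record fields `model_a`, `model_b`, `winner` and `tstamp` are assumptions modeled on how pairwise battle logs are commonly laid out, and `weekly_win_rates` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timezone


def weekly_win_rates(battles):
    """Return {(iso_year, iso_week): {model: win_rate}}, ignoring ties.

    Each battle is a dict with the assumed keys:
      model_a, model_b : model names
      winner           : "model_a", "model_b", or anything else for a tie
      tstamp           : Unix timestamp in seconds
    """
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))

    for b in battles:
        # Anything other than a clear winner is treated as a tie and skipped.
        if b["winner"] not in ("model_a", "model_b"):
            continue
        ts = datetime.fromtimestamp(b["tstamp"], tz=timezone.utc)
        iso = ts.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)

        # Both models played one decisive game; only the winner gets a win.
        for model in (b["model_a"], b["model_b"]):
            games[week][model] += 1
        wins[week][b[b["winner"]]] += 1

    return {
        week: {m: wins[week][m] / n for m, n in models.items()}
        for week, models in games.items()
    }


if __name__ == "__main__":
    # Made-up battles for illustration only, all in the same ISO week.
    t = datetime(2024, 1, 15, tzinfo=timezone.utc).timestamp()
    sample = [
        {"model_a": "A", "model_b": "B", "winner": "model_a", "tstamp": t},
        {"model_a": "A", "model_b": "B", "winner": "tie", "tstamp": t},
        {"model_a": "B", "model_b": "A", "winner": "model_a", "tstamp": t},
        {"model_a": "A", "model_b": "C", "winner": "model_a", "tstamp": t},
    ]
    for week, rates in weekly_win_rates(sample).items():
        print(week, rates)
```

Because every decisive battle adds one game to both models and one win to only one of them, a model's ratio can fall simply because stronger opponents joined the pool, which is why the numbers are relative and not a measure of absolute quality.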

The results show each model's weekly win rate against other models, excluding ties.

Some interesting comparisons (weekly)

The win rate of gpt-3.5-turbo-0613 is going down consistently, most likely due to increased competition.
gpt-4-turbo is maintaining the lead with a win rate between 0.7 and 0.8.
gemini pro replaced palm-2 in Bard starting in December 2023.
gemini pro is better than palm-2 but slightly worse than gpt-3.5-turbo-0613.
The recent trend shows claude-1 having better win rates than the other claude family models.
In the llama family of models, llama-2-70b-chat has the best win ratios, while llama-2-13b-chat comes close in some weeks.
The recently launched mistral-medium is doing well compared to other models, both across families and within its own family.
