this post was submitted on 22 Jan 2024
This is an automated archive.

The original was posted on /r/singularity by /u/ConsiderationNo3558 on 2024-01-22 12:34:09+00:00.


  • The data comes from Chatbot Arena and is based on pairwise chatbot battles.
  • My own notebook with more code and analysis can be referenced at
  • The original notebook analysis, which has more aggregate stats, can be found in the link below
  • The focus of this notebook is to analyze each model's win ratio over time, using daily and weekly data.
  • It does not tell us whether a model's performance degraded or improved over time, since the win ratio is a relative comparison against other models.
  • The win ratio is calculated excluding ties.
  • Note that this is not an evaluation of one specific model battling against another.
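
For anyone who wants to reproduce the idea, here is a minimal sketch of a weekly win ratio that excludes ties. It is not the code from the notebook. The column names (`model_a`, `model_b`, `winner`, `tstamp`), the `winner` labels, and the function name `weekly_win_ratio` are assumptions about how the battle log is laid out, so adjust them to the actual data.

```python
import pandas as pd


def weekly_win_ratio(battles: pd.DataFrame) -> pd.DataFrame:
    """Weekly win ratio per model, ignoring tied battles.

    Assumed (hypothetical) columns:
      model_a, model_b : names of the two models in the battle
      winner           : "model_a", "model_b", or a tie label
      tstamp           : Unix timestamp in seconds
    """
    # Keep only battles with a clear winner; any tie label is dropped.
    decided = battles[battles["winner"].isin(["model_a", "model_b"])].copy()

    # Bucket each battle into its calendar week (weeks start on Monday).
    decided["week"] = (
        pd.to_datetime(decided["tstamp"], unit="s").dt.to_period("W").dt.start_time
    )

    # Reshape to one row per (battle, participant) with a 0/1 "won" flag.
    side_a = pd.DataFrame({
        "week": decided["week"],
        "model": decided["model_a"],
        "won": (decided["winner"] == "model_a").astype(int),
    })
    side_b = pd.DataFrame({
        "week": decided["week"],
        "model": decided["model_b"],
        "won": (decided["winner"] == "model_b").astype(int),
    })
    per_side = pd.concat([side_a, side_b], ignore_index=True)

    # Win ratio = wins / non-tied battles, per model and week.
    out = (
        per_side.groupby(["week", "model"])["won"]
        .agg(wins="sum", battles="count")
        .reset_index()
    )
    out["win_ratio"] = out["wins"] / out["battles"]
    return out
```

Because each model's ratio is pooled over every opponent it met that week, a change in the ratio can come from a change in who it was matched against, not only from the model itself.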

The results show the weekly win rate of each model against all other models, excluding ties.

Some interesting comparisons (weekly)

  • The gpt-3.5-turbo-0613 win rate is going down consistently, most likely due to increased competition

  • gpt-4-turbo is maintaining the lead, with a win rate between 0.7 and 0.8

  • gemini pro replaced palm 2 in bard starting in December 2023.

  • gemini pro is better than palm-2 but slightly worse than gpt-3.5-turbo-0613

  • The recent trend shows claude-1 having better win rates than other claude family models

  • In the llama family of models, llama-2-70b-chat has the best win ratios, while llama-2-13b-chat comes close in some weeks.

  • The recently launched mistral-medium is doing well compared with other models, both from other families and from its own family
