🏆 PropertyEval Leaderboard 🏆

PropertyEval benchmarks LLM code generation with property-based tests for more thorough evaluation.
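
To illustrate what a property-based test looks like, the sketch below checks a HumanEval-style `unique` function (return the sorted unique elements of a list) with the Hypothesis library. The task, the properties, and all names are illustrative assumptions; they are not necessarily the checks PropertyEval itself runs.

```python
# Minimal sketch of a property-based test (illustrative; not PropertyEval's own harness).
from hypothesis import given, strategies as st


def unique(l: list) -> list:
    """Candidate solution under test: return the sorted unique elements of l."""
    return sorted(set(l))


@given(st.lists(st.integers()))
def test_unique_properties(l):
    out = unique(l)
    assert out == sorted(out)          # output is sorted
    assert len(out) == len(set(out))   # output has no duplicates
    assert set(out) == set(l)          # output contains exactly the input's elements
```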

Pass@1 on HumanEval at temperature 0 (a short scoring sketch follows the tables)

PropertyEval

| Rank | Model | Size | pass@1 (%) |
|-----:|-----------|------|-----------:|
| 1 | GPT-4 | N/A | 76.22 |
| 2 | ChatGPT | N/A | 60.37 |
| 3 | StarCoder | N/A | 27.44 |
| 4 | CodeGen | 16B | 24.39 |
| 5 | CodeGen | 6B | 24.39 |
| 6 | CodeGen | 2B | 21.34 |
| 7 | CodeGen2 | 7B | 16.46 |
| 8 | CodeGen2 | 16B | 15.85 |
| 9 | Vicuna | 13B | 14.02 |
| 10 | CodeGen2 | 3B | 13.41 |
| 11 | SantaCoder | N/A | 12.81 |
| 12 | InCoder | 6B | 11.59 |
| 13 | InCoder | 1B | 10.98 |
| 14 | GPT-J | N/A | 10.37 |
| 15 | Vicuna | 7B | 9.76 |
| 16 | CodeGen2 | 1B | 9.15 |
| 17 | GPT-Neo | 2B | 6.71 |
| 18 | PolyCoder | N/A | 4.88 |
| 19 | StableLM | 7B | 1.83 |

Original (base HumanEval tests)

| Rank | Model | Size | pass@1 (%) |
|-----:|-----------|------|-----------:|
| 1 | GPT-4 | N/A | 88.41 |
| 2 | ChatGPT | N/A | 73.17 |
| 3 | StarCoder | N/A | 34.15 |
| 4 | CodeGen | 16B | 32.93 |
| 5 | CodeGen | 6B | 29.27 |
| 6 | CodeGen | 2B | 24.39 |
| 7 | CodeGen2 | 16B | 19.51 |
| 8 | CodeGen2 | 7B | 18.29 |
| 9 | Vicuna | 13B | 16.46 |
| 10 | CodeGen2 | 3B | 15.85 |
| 11 | InCoder | 6B | 15.85 |
| 12 | SantaCoder | N/A | 14.63 |
| 13 | GPT-J | N/A | 12.20 |
| 14 | InCoder | 1B | 12.20 |
| 15 | Vicuna | 7B | 11.59 |
| 16 | CodeGen2 | 1B | 10.98 |
| 17 | GPT-Neo | 2B | 7.93 |
| 18 | PolyCoder | N/A | 6.10 |
| 19 | StableLM | 7B | 2.44 |
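
At temperature 0 decoding is greedy, so each model contributes a single deterministic sample per problem and pass@1 reduces to the percentage of the 164 HumanEval problems whose sample passes every test. The sketch below shows this arithmetic; the function and data names are illustrative assumptions, not PropertyEval's actual evaluation code.

```python
# Minimal sketch of pass@1 scoring at temperature 0 (one greedy sample per problem).
# Names and data are illustrative, not PropertyEval's evaluation code.

def pass_at_1(passed: dict) -> float:
    """Percentage of problems whose single greedy sample passed all tests."""
    return 100.0 * sum(passed.values()) / len(passed)


# Example: if 125 of the 164 HumanEval problems pass, pass@1 = 76.22,
# consistent with the GPT-4 row in the PropertyEval table above.
results = {f"HumanEval/{i}": i < 125 for i in range(164)}
print(f"pass@1 = {pass_at_1(results):.2f}")
```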

Webpage designed by Manyata Pawagi