# 🏆 PropertyEval Leaderboard 🏆
Property-based tests for thorough benchmarking of LLM code generation.
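The test suites themselves aren't reproduced on this page, but as an illustration of what "property-based tests" means here: instead of a handful of fixed input/output assertions, each candidate solution is checked against properties that must hold for many randomly generated inputs. The sketch below uses the `hypothesis` library; the task (`sort_numbers`) and both properties are hypothetical examples, not PropertyEval's actual suite.

```python
# Illustrative sketch only: `sort_numbers` and these properties are
# hypothetical, not taken from PropertyEval's test suite.
from hypothesis import given
from hypothesis import strategies as st


def sort_numbers(numbers: list[int]) -> list[int]:
    """A candidate completion under test (here, a trivially correct one)."""
    return sorted(numbers)


@given(st.lists(st.integers()))
def test_sort_numbers_properties(numbers: list[int]) -> None:
    out = sort_numbers(numbers)
    # Property 1: the output is a permutation of the input.
    assert sorted(out) == sorted(numbers)
    # Property 2: the output is non-decreasing.
    assert all(a <= b for a, b in zip(out, out[1:]))
```

Because properties are exercised on many generated inputs per task, they catch solutions that slip past the original hand-written tests, which is consistent with the PropertyEval scores below being uniformly lower than the Original ones.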
*Pass@1 on HumanEval at temperature 0 (greedy decoding).*
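At temperature 0, decoding is greedy and deterministic, so a single sample is drawn per task and pass@1 reduces to the percentage of tasks whose one completion passes every test. A minimal sketch of that arithmetic (the helper name is ours, not part of any PropertyEval tooling):

```python
def pass_at_1(passed: list[bool]) -> float:
    """pass@1 under greedy decoding: the percentage of tasks whose single
    completion passed all tests. `passed[i]` is the outcome for task i."""
    return 100.0 * sum(passed) / len(passed)


# Example: solving 125 of HumanEval's 164 tasks gives 100 * 125 / 164,
# which rounds to 76.22, matching GPT-4's PropertyEval row below.
print(pass_at_1([True] * 125 + [False] * 39))  # 76.21951219512195
```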
## PropertyEval

| # | Model | Size | pass@1 (%) |
|---|-------|------|------------|
| 1 | GPT-4 | N/A | 76.22 |
| 2 | ChatGPT | N/A | 60.37 |
| 3 | StarCoder | N/A | 27.44 |
| 4 | CodeGen | 16B | 24.39 |
| 5 | CodeGen | 6B | 24.39 |
| 6 | CodeGen | 2B | 21.34 |
| 7 | CodeGen2 | 7B | 16.46 |
| 8 | CodeGen2 | 16B | 15.85 |
| 9 | Vicuna | 13B | 14.02 |
| 10 | CodeGen2 | 3B | 13.41 |
| 11 | SantaCoder | N/A | 12.81 |
| 12 | InCoder | 6B | 11.59 |
| 13 | InCoder | 1B | 10.98 |
| 14 | GPT-J | N/A | 10.37 |
| 15 | Vicuna | 7B | 9.76 |
| 16 | CodeGen2 | 1B | 9.15 |
| 17 | GPT-Neo | 2B | 6.71 |
| 18 | PolyCoder | N/A | 4.88 |
| 19 | StableLM | 7B | 1.83 |
## Original

| # | Model | Size | pass@1 (%) |
|---|-------|------|------------|
| 1 | GPT-4 | N/A | 88.41 |
| 2 | ChatGPT | N/A | 73.17 |
| 3 | StarCoder | N/A | 34.15 |
| 4 | CodeGen | 16B | 32.93 |
| 5 | CodeGen | 6B | 29.27 |
| 6 | CodeGen | 2B | 24.39 |
| 7 | CodeGen2 | 16B | 19.51 |
| 8 | CodeGen2 | 7B | 18.29 |
| 9 | Vicuna | 13B | 16.46 |
| 10 | CodeGen2 | 3B | 15.85 |
| 11 | InCoder | 6B | 15.85 |
| 12 | SantaCoder | N/A | 14.63 |
| 13 | GPT-J | N/A | 12.20 |
| 14 | InCoder | 1B | 12.20 |
| 15 | Vicuna | 7B | 11.59 |
| 16 | CodeGen2 | 1B | 10.98 |
| 17 | GPT-Neo | 2B | 7.93 |
| 18 | PolyCoder | N/A | 6.10 |
| 19 | StableLM | 7B | 2.44 |