AI Friends

> o3 finds the Kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it in 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs. So on this benchmark at least we have a 2x-3x improvement in o3 over Claude Sonnet 3.7. Not bad: more compute means more findings, as long as verification is part of the loop.
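To make the arithmetic concrete, here is a rough Python sketch using only the counts quoted above (note that 8 + 66 + 28 adds up to 102, not 100, so the figures are taken as stated). The function names are made up for illustration, and the "at least one hit in n runs" formula assumes the runs are independent, which the quote does not claim.

```python
def hit_rate(found: int, runs: int) -> float:
    """Fraction of runs that found the real bug."""
    return found / runs


def precision(true_pos: int, false_pos: int) -> float:
    """Share of bug reports that point at the real bug."""
    return true_pos / (true_pos + false_pos)


def prob_at_least_one(p: float, n: int) -> float:
    """Chance of at least one hit in n runs, ASSUMING independent runs."""
    return 1 - (1 - p) ** n


o3_rate = hit_rate(8, 100)        # figures from the quote
sonnet37_rate = hit_rate(3, 100)  # figures from the quote
ratio = o3_rate / sonnet37_rate   # the "2x-3x" improvement

report_precision = precision(8, 28)  # real finds vs. false positives

if __name__ == "__main__":
    print(f"o3 vs Sonnet 3.7: {ratio:.2f}x")
    print(f"o3 report precision: {report_precision:.0%}")
    for n in (1, 10, 50):
        print(f"P(at least one hit in {n} runs): {prob_at_least_one(o3_rate, n):.2f}")
```

Under that independence assumption, more runs push the chance of at least one real find up quickly, but roughly three out of four reports would still be false alarms, which is why the verification step matters.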
CodeSafetyResearch