Occasionally Secure: A Comparative Analysis of Code Generation Assistants

Paper: Occasionally Secure: A Comparative Analysis of Code Generation Assistants

This repository accompanies the paper and contains model outputs and evaluation scripts for benchmarking code generation assistants on software security and reliability tasks.

Overview

We evaluate a range of web-facing LLMs on a set of security-related programming tasks. Each model is tested under two personas: a security-minded engineer and a general-purpose software engineer. Performance is measured by:

Number of revisions required to fix model-generated code
Cyclomatic complexity
Syntax, functionality, and semantic reliability

Model responses were manually collected via public web interfaces and are organized by model and persona type. This repository provides the data and scripts used for analysis.

Repository Structure

.
├── Complexity/
│   └── Scripts for measuring cyclomatic complexity
├── Consistency/
│   └── Scripts for evaluating reliability (syntax, functionality, semantics)
├── Gemini/
│   ├── Gemini/
│   │   ├── Security Persona/
│   │   ├── Software Engineer Persona/
│   │   ├── Security Reliability/
│   │   └── Software Reliability/
│   └── Gemini_reasoning/
│       └── [Organized similarly to Gemini by persona and reliability type]
├── [Other LLM folders]/
│   └── [Organized similarly by persona and reliability type]
├── README.md
└── [Additional scripts and model outputs]
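As a rough illustration of how this layout can be traversed, the sketch below collects the files under each persona and reliability folder of one model directory. The function name list_samples is hypothetical and is not one of the repository's scripts; the folder names are taken from the tree above, and other model folders may use different names.

```python
from pathlib import Path

# Category folder names as shown for Gemini/Gemini in the tree above.
# Other model folders are only described as "organized similarly",
# so treat these names as an assumption for them.
CATEGORIES = (
    "Security Persona",
    "Software Engineer Persona",
    "Security Reliability",
    "Software Reliability",
)


def list_samples(model_dir):
    """Map each category folder found under model_dir to its sorted file paths."""
    root = Path(model_dir)
    samples = {}
    for category in CATEGORIES:
        folder = root / category
        if folder.is_dir():
            # Collect files at any depth below the category folder.
            samples[category] = sorted(p for p in folder.rglob("*") if p.is_file())
    return samples
```

Calling list_samples("Gemini/Gemini") from the repository root would return a dictionary keyed by category name. Categories whose folder is missing are simply left out.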

Data Collection

Model generations were collected manually through browser interactions with LLMs. Evaluation data is stored in tabular form (currently as a Google Sheet) and includes:

Number of revisions required
Reliability ratings (syntax, functionality)

An export of the cleaned dataset will be added to this repository in a future update.
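Until the export is published, the following is only a sketch of how such a file might be summarized. It assumes a CSV with columns named model, persona, and revisions; these column names are placeholders and may not match the actual spreadsheet.

```python
import csv
from collections import defaultdict
from statistics import mean


def mean_revisions(csv_file):
    """Return {(model, persona): mean revisions} from an open CSV file object.

    The column names "model", "persona", and "revisions" are assumptions;
    adjust them to match the real export once it is available.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[(row["model"], row["persona"])].append(int(row["revisions"]))
    return {key: mean(values) for key, values in grouped.items()}
```

The function takes a file object rather than a path, so it can be used with open(...) on a downloaded export or with an in-memory buffer when testing.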

Evaluation

Cyclomatic Complexity

To compute the complexity of model-generated code samples, use the scripts in the Complexity/ directory.

Example usage:

python Complexity/com.py
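For readers unfamiliar with the metric, here is a minimal sketch of how cyclomatic complexity can be approximated for Python source using only the standard library. It is not the logic of Complexity/com.py, which may use a different tool or counting rule, and it assumes the sample being measured is Python.

```python
import ast

# Node types that each add one decision point. This is an approximation
# of McCabe's metric, not a reference implementation.
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.AsyncFor,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
    ast.comprehension,
)


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity of Python source: 1 + decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity
```

Straight-line code scores 1, and each branch, loop, exception handler, or boolean operator adds to the count. Dedicated tools apply slightly different rules, so absolute values from this sketch should not be compared with numbers produced by other tools.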

Reliability Evaluation

The Consistency/ directory contains scripts to assess code reliability across syntax and functionality. These evaluations are currently manual or semi-automated and follow the schema described in the paper.
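The syntax part of such a check is the easiest to automate. The sketch below shows one possible way to do it for Python samples; the function name check_python_syntax is hypothetical and does not come from the Consistency/ scripts, and samples in other languages would need a different parser.

```python
import ast


def check_python_syntax(source):
    """Return (True, None) if source parses as Python, else (False, message)."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        # Report where parsing failed so a reviewer can locate the problem.
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, None
```

A passing result only means the code parses. Functionality still has to be judged by running or reviewing the sample.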

Future Additions

Prompt templates and persona definitions
CSV export of the evaluation spreadsheet
Additional aggregate analysis scripts
