
Gemini Image Generation for Claude Code

A CLI skill that pairs Claude's prompt intelligence with Gemini's image generation. Claude handles the conversation; Gemini handles the pixels.

Why this exists

Gemini generates good images. Getting good results means writing good prompts: subject details, lighting, camera settings, composition, and style in structured JSON. Most people don't want to do that. They want to say "headshot" and get a headshot.

Claude understands intent, asks follow-up questions, remembers context, and structures detailed prompts from vague requests. It can't generate images. This skill connects them. You describe what you want in plain language, Claude builds the prompt, Gemini renders it, and you iterate through conversation.

What it does

Three modes, auto-detected by input image count.

Generate (no images):

python3 .claude/skills/gemini/scripts/generate_image.py "a friendly robot in a sunlit workshop"

Edit (one image):

python3 .claude/skills/gemini/scripts/generate_image.py "make the lighting warmer" \
  --image ./generated_images/robot.png

Reference (two or more images) for face swaps, pose transfer, style transfer:

python3 .claude/skills/gemini/scripts/generate_image.py \
  "put the face from image 1 on the body in image 2" \
  --image face.png --image body.png --model gemini-3-pro-image-preview

Widescreen, Pro model with thinking budget:

python3 .claude/skills/gemini/scripts/generate_image.py \
  "medieval marketplace with merchants and goods" \
  --aspect 16:9 --model gemini-3-pro-image-preview --thinking 16384

Or open Claude Code in this repo and ask directly. The skill triggers automatically.

An interactive approval workflow sits on top of all three modes. Claude detects the capability, asks clarifying questions, presents a structured breakdown with the exact command, and waits for approval. Say "just do it" to skip the ceremony.

Why not DALL-E, Midjourney, or Stable Diffusion

Multi-image reference operations work from the CLI. Face swaps, pose transfers, style transfers: pass two images and a prompt, get a result. No web UI, no Discord bot, no inpainting mask editor.

Conversational iteration happens naturally. Generate an image, say "add a coffee cup," and Claude feeds the previous output back to Gemini with the edit instruction. Context carries across the session.

Prompt enhancement fills the gap between what you type and what the model needs. You say "headshot." Claude structures a JSON prompt with subject details, camera specs, lighting setup, and background description.

Setup

Get a free API key at https://aistudio.google.com/apikey

cp .env.sample .env
# Add your key: gemini_api = YOUR_KEY_HERE

Zero dependencies. One Python file, stdlib only. No pip install, no virtual environment.
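The zero-dependency claim extends to reading the key: a flat `key = value` .env file can be parsed with the stdlib alone. A minimal sketch of that idea (`load_env_key` is a hypothetical name, not necessarily what the script does internally):

```python
from pathlib import Path

def load_env_key(env_path=".env", key="gemini_api"):
    """Parse a flat `key = value` .env file using only the stdlib."""
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Tolerate optional quoting around the value.
            return value.strip().strip("'\"")
    return None
```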

How it works

The script (generate_image.py, 439 lines) auto-detects mode by image count. It loads the API key from .env, builds the request with base64-encoded images, calls the Gemini API, retries on rate limits with backoff, and saves output to generated_images/. It accepts local paths and URLs.
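The mode dispatch reduces to counting `--image` arguments. A sketch of that rule (`detect_mode` is a hypothetical helper, not the script's actual code):

```python
def detect_mode(image_paths):
    """Map the number of input images to an operation mode."""
    n = len(image_paths)
    if n == 0:
        return "generate"   # text-to-image
    if n == 1:
        return "edit"       # modify the single input image
    return "reference"      # face swap, pose transfer, style transfer
```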

The skill layer wraps the script in conversation. Four files, each with a distinct job:

.claude/skills/gemini/
├── SKILL.md              # Four-step workflow: detect, clarify, confirm, execute
├── config.md             # Model IDs, rate limits, parameters (single source of truth)
├── capabilities.md       # 10 operations with trigger keywords and prompt templates
├── scripts/
│   └── generate_image.py # CLI script, zero dependencies
└── examples/
    ├── prompts.md        # 50+ templates across 13 categories
    └── advanced-techniques.md

Models

Flash (gemini-2.5-flash-image) generates in 2-5 seconds. Good for drafts and iteration. This is the default.

Pro (gemini-3-pro-image-preview) takes 10-30 seconds, produces higher quality at 2K resolution. Supports a thinking budget (--thinking 4096 through 32768) for complex compositions and text rendering. Use Pro for final output, face swaps, and precision work.

Limitations

The free tier caps Flash at roughly 100 requests per day and Pro at roughly 10. The script retries on rate limits, but sustained heavy use needs a paid tier.
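The retry behavior amounts to exponential backoff on HTTP 429 responses. A sketch under the assumption that requests go through stdlib `urllib` (the real script's retry count and delays may differ):

```python
import time
import urllib.error

def call_with_backoff(request_fn, max_retries=5, base_delay=2.0):
    """Retry request_fn on HTTP 429, doubling the delay each attempt."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except urllib.error.HTTPError as err:
            # Only rate-limit errors are retried; anything else propagates.
            if err.code != 429 or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```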

Text rendering works but not every time. Gemini sometimes misspells words or breaks letter spacing. The Pro model with a high thinking budget improves consistency; expect to retry occasionally.

Face swaps produce good results on most inputs. Some source image combinations need a second attempt or a tweaked prompt. Lighting mismatches and unusual angles are the usual culprits.

License

MIT
