A CLI skill that pairs Claude's prompt intelligence with Gemini's image generation. Claude handles the conversation; Gemini handles the pixels.
Gemini generates excellent images, but getting good results means writing detailed prompts: subject details, lighting, camera settings, composition, and style, often in structured JSON. Most people don't want to do that. They want to say "headshot" and get a headshot.
Claude understands intent, asks follow-up questions, remembers context, and structures detailed prompts from vague requests. It can't generate images. This skill connects them. You describe what you want in plain language, Claude builds the prompt, Gemini renders it, and you iterate through conversation.
Three modes, auto-detected by input image count.
Generate (no images):

```shell
python3 .claude/skills/gemini/scripts/generate_image.py "a friendly robot in a sunlit workshop"
```

Edit (one image):

```shell
python3 .claude/skills/gemini/scripts/generate_image.py "make the lighting warmer" \
  --image ./generated_images/robot.png
```

Reference (two or more images) for face swaps, pose transfer, style transfer:

```shell
python3 .claude/skills/gemini/scripts/generate_image.py \
  "put the face from image 1 on the body in image 2" \
  --image face.png --image body.png --model gemini-3-pro-image-preview
```

Widescreen, Pro model with a thinking budget:

```shell
python3 .claude/skills/gemini/scripts/generate_image.py \
  "medieval marketplace with merchants and goods" \
  --aspect 16:9 --model gemini-3-pro-image-preview --thinking 16384
```

Or open Claude Code in this repo and ask directly. The skill triggers automatically.
An interactive approval workflow sits on top of all three modes. Claude detects the capability, asks clarifying questions, presents a structured breakdown with the exact command, and waits for approval. Say "just do it" to skip the ceremony.
Multi-image reference operations work from the CLI. Face swaps, pose transfers, style transfers: pass two images and a prompt, get a result. No web UI, no Discord bot, no inpainting mask editor.
Conversational iteration happens naturally. Generate an image, say "add a coffee cup," and Claude feeds the previous output back to Gemini with the edit instruction. Context carries across the session.
Prompt enhancement fills the gap between what you type and what the model needs. You say "headshot." Claude structures a JSON prompt with subject details, camera specs, lighting setup, and background description.
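As a rough illustration of what that expansion looks like (the field names here are assumptions for the sketch, not the skill's actual schema — that lives in capabilities.md), a bare "headshot" request might become something like:

```python
import json

# Illustrative only: field names are assumptions, not the skill's actual schema.
prompt = {
    "subject": "professional headshot of a woman in her 30s, navy blazer",
    "camera": {"lens": "85mm portrait", "aperture": "f/2.0"},
    "lighting": "soft key light from camera left, subtle rim fill",
    "background": "neutral gray studio backdrop, slightly out of focus",
    "style": "corporate photography, natural skin tones",
}
print(json.dumps(prompt, indent=2))
```

The structured form gives Gemini concrete values for every dimension it would otherwise have to guess.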
Get a free API key at https://aistudio.google.com/apikey, then:

```shell
cp .env.sample .env
# Add your key: gemini_api = YOUR_KEY_HERE
```

Zero dependencies. One Python file, stdlib only. No pip install, no virtual environment.
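The `gemini_api = YOUR_KEY_HERE` format suggests a simple `key = value` file. A minimal stdlib parser in that spirit (the function name is hypothetical, not the script's actual code):

```python
from pathlib import Path

def load_env(path=".env"):
    """Parse simple `key = value` lines; a hypothetical sketch, not generate_image.py's code."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```

No external dotenv package needed, which keeps the zero-dependency promise intact.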
The script (generate_image.py, 439 lines) auto-detects mode by image count. It loads the API key from .env, builds the request with base64-encoded images, calls the Gemini API, retries on rate limits with backoff, and saves output to generated_images/. It accepts local paths and URLs.
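The retry-with-backoff behavior can be sketched with stdlib pieces like this (function names and the exact schedule are assumptions, not the script's actual logic):

```python
import time
import urllib.error

def backoff_delay(attempt, base=2.0):
    """Exponential backoff schedule: 2s, 4s, 8s, ... (assumed values)."""
    return base * (2 ** attempt)

def call_with_retry(make_request, max_retries=5):
    """Retry a request callable on HTTP 429; a sketch, not generate_image.py's exact code.

    `make_request` returns a response or raises urllib.error.HTTPError."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries - 1:
                raise  # non-rate-limit error, or retries exhausted
            time.sleep(backoff_delay(attempt))
```

Backing off only on 429 means genuine errors (bad key, malformed request) surface immediately instead of burning retries.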
The skill layer wraps the script in conversation. Four files, each with a distinct job:
```
.claude/skills/gemini/
├── SKILL.md                 # Four-step workflow: detect, clarify, confirm, execute
├── config.md                # Model IDs, rate limits, parameters (single source of truth)
├── capabilities.md          # 10 operations with trigger keywords and prompt templates
├── scripts/
│   └── generate_image.py    # CLI script, zero dependencies
└── examples/
    ├── prompts.md           # 50+ templates across 13 categories
    └── advanced-techniques.md
```
Flash (gemini-2.5-flash-image) generates in 2-5 seconds. Good for drafts and iteration. This is the default.
Pro (gemini-3-pro-image-preview) takes 10-30 seconds and produces higher quality at 2K resolution. It supports a thinking budget (--thinking 4096 through 32768) for complex compositions and text rendering. Use Pro for final output, face swaps, and precision work.
The free tier caps Flash at roughly 100 requests per day and Pro at roughly 10. The script retries on rate limits, but sustained heavy use needs a paid tier.
Text rendering works but not every time. Gemini sometimes misspells words or breaks letter spacing. The Pro model with a high thinking budget improves consistency; expect to retry occasionally.
Face swaps produce good results on most inputs. Some source image combinations need a second attempt or a tweaked prompt. Lighting mismatches and unusual angles are the usual culprits.
MIT