mirror of
https://github.com/zhayujie/chatgpt-on-wechat.git
synced 2026-04-22 08:55:39 +08:00
# OpenAI Image Vision Skill

This skill enables image analysis using OpenAI's Vision API (GPT-4 Vision models).

## Features

- ✅ Analyze images from local files or URLs
- ✅ Support for multiple image formats (JPEG, PNG, GIF, WebP)
- ✅ Automatic base64 encoding for local files
- ✅ Direct URL passing for remote images
- ✅ Configurable model selection
- ✅ Custom API base URL support
- ✅ Pure bash/curl implementation (no Python dependencies)

## Quick Start

1. **Set up API credentials using env_config:**
   ```bash
   env_config(action="set", key="OPENAI_API_KEY", value="sk-your-api-key-here")
   # Optional: custom API base
   env_config(action="set", key="OPENAI_API_BASE", value="https://api.openai.com/v1")
   ```

2. **Analyze an image:**
   ```bash
   bash scripts/vision.sh "/path/to/photo.jpg" "What's in this image?"
   ```

3. **Analyze from URL:**
   ```bash
   bash scripts/vision.sh "https://example.com/image.jpg" "Describe this image"
   ```

## Usage Examples

### Basic image analysis
```bash
bash scripts/vision.sh "photo.jpg" "What objects can you see?"
```

### Text extraction (OCR)
```bash
bash scripts/vision.sh "document.png" "Extract all text from this image"
```

### Detailed description
```bash
bash scripts/vision.sh "scene.jpg" "Describe this scene in detail, including colors, mood, and composition"
```

### Using different models
```bash
# Use gpt-4.1-mini (default, latest mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1-mini"

# Use gpt-4.1 (most capable, latest model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1"

# Use gpt-4o-mini (previous mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4o-mini"
```

## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `OPENAI_API_KEY` | Yes | - | Your OpenAI API key |
| `OPENAI_API_BASE` | No | `https://api.openai.com/v1` | Custom API base URL |

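
The required/optional split in the table can be sketched in bash as follows. This is a minimal illustration, not the actual contents of `scripts/vision.sh`; the function name `resolve_openai_env` and the `API_BASE` variable are hypothetical names chosen for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from scripts/vision.sh.
# Fails if OPENAI_API_KEY is unset or empty; otherwise sets API_BASE,
# falling back to the documented default when OPENAI_API_BASE is unset.
resolve_openai_env() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY environment variable is not set" >&2
    return 1
  fi
  API_BASE="${OPENAI_API_BASE:-https://api.openai.com/v1}"
  # Strip one trailing slash so "$API_BASE/chat/completions" stays well-formed.
  API_BASE="${API_BASE%/}"
}
```

Calling `resolve_openai_env` before any request lets a script stop early with a clear message instead of sending an unauthenticated call.
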
## Response Format

Success response:
```json
{
  "model": "gpt-4.1-mini",
  "content": "The image shows a beautiful sunset over mountains...",
  "usage": {
    "prompt_tokens": 1234,
    "completion_tokens": 567,
    "total_tokens": 1801
  }
}
```

Error response:
```json
{
  "error": "Error description",
  "details": "Additional information"
}
```

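
A caller may want to tell the two shapes apart before using the output. The sketch below is one rough way to do that in plain bash, assuming the key layout shown above. The function name `is_error_response` is hypothetical, and the check is a text heuristic rather than a JSON parser; a tool such as `jq` would be more robust where it is available.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the JSON text on stdin contains an
# unescaped "error" key (the error shape above), 1 otherwise.
# This is a text match, not a real JSON parse.
is_error_response() {
  grep -Eq '"error"[[:space:]]*:'
}

# Example with a canned error payload (no API call is made).
response='{"error": "Error description", "details": "Additional information"}'
if printf '%s' "$response" | is_error_response; then
  echo "vision call failed" >&2
fi
```

Quotes inside the `content` string are escaped in JSON (`\"error\"`), so ordinary prose that mentions the word does not trigger the match.
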
## Supported Models

- `gpt-4.1-mini` (default) - Latest mini model, fast and cost-effective
- `gpt-4.1` - Latest GPT-4 variant, most capable
- `gpt-4o-mini` - Previous generation mini model
- `gpt-4-turbo` - Previous generation turbo model

## Supported Image Formats

- JPEG (`.jpg`, `.jpeg`)
- PNG (`.png`)
- GIF (`.gif`)
- WebP (`.webp`)

## Technical Details

- **Implementation**: Pure bash script using curl and base64
- **Timeout**: 60 seconds for API calls
- **Max tokens**: 1000 tokens for responses
- **Image handling**:
  - Local files are automatically base64-encoded
  - URLs are passed directly to the API
  - MIME types are auto-detected from file extensions

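
The image handling rules above can be sketched as two small bash functions. This is an illustration under stated assumptions, not the code in `scripts/vision.sh`: the names `mime_type_for` and `image_ref_for` are hypothetical, and local files are assumed to be sent as `data:` URLs, which is the form OpenAI's API accepts for inline base64 images.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the image handling described above.

# Map a file extension (case-insensitive) to a MIME type.
mime_type_for() {
  local ext="${1##*.}"
  ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    jpg|jpeg) echo "image/jpeg" ;;
    png)      echo "image/png" ;;
    gif)      echo "image/gif" ;;
    webp)     echo "image/webp" ;;
    *)        echo "Unsupported image format: .$ext" >&2; return 1 ;;
  esac
}

# Print the value to send as the image URL: remote URLs pass through
# unchanged, local files become a base64 data URL.
image_ref_for() {
  local input="$1" mime
  case "$input" in
    http://*|https://*) printf '%s\n' "$input"; return 0 ;;
  esac
  [ -f "$input" ] || { echo "Image file not found: $input" >&2; return 1; }
  mime="$(mime_type_for "$input")" || return 1
  # tr removes the line wrapping that some base64 implementations add.
  printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$input" | tr -d '\n')"
}
```

For example, `image_ref_for "https://example.com/image.jpg"` prints the URL as-is, while `image_ref_for "photo.jpg"` prints a string starting with `data:image/jpeg;base64,`.
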
## Error Handling

The script handles various error cases:
- Missing required parameters
- Missing API key
- File not found
- Unsupported image formats
- API errors
- Network timeouts
- Invalid JSON responses

## Integration with Agent System

When loaded by the agent system, this skill will appear in `<available_skills>` with a `<base_dir>` path. Use it like:

```bash
bash "<base_dir>/scripts/vision.sh" "image.jpg" "What's in this image?"
```

The agent will automatically:
- Load environment variables from `~/.cow/.env`
- Provide the correct `<base_dir>` path
- Handle skill discovery and registration

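
When running the script by hand, outside the agent, the environment loading step has to be done manually. The sketch below shows one way to do that, assuming `~/.cow/.env` holds shell-compatible `KEY=value` lines; that file format and the function name `load_env_file` are assumptions made for this example, not documented behavior of the agent.

```bash
#!/usr/bin/env bash
# Hypothetical helper; assumes the file holds shell-compatible KEY=value lines.
# Sourcing executes the file as shell code, so only use it on files you trust.
load_env_file() {
  local env_file="${1:-$HOME/.cow/.env}"
  [ -f "$env_file" ] || return 0   # nothing to load
  set -a          # export every variable assigned while sourcing
  . "$env_file"
  set +a
}
```

After `load_env_file`, variables such as `OPENAI_API_KEY` are exported and visible to `bash scripts/vision.sh ...` started from the same shell.
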
## Notes

- Images are sent to OpenAI's servers for processing
- Large images may be automatically resized by the API
- Rate limits depend on your OpenAI API plan
- Token usage includes both the image and text in the prompt
- Base64 encoding increases the size of local images by ~33%

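
The ~33% figure follows from how base64 works: every 3 input bytes become 4 output characters. A quick way to confirm this locally is sketched below; the helper name `encoded_size` is hypothetical and the input is just zero bytes standing in for image data.

```bash
#!/usr/bin/env bash
# Print the base64-encoded length, in characters, of the data on stdin.
# Newlines added by line wrapping are not counted.
encoded_size() {
  base64 | tr -d '\n' | wc -c | tr -d ' '
}

raw_bytes=3000
encoded_bytes="$(head -c "$raw_bytes" /dev/zero | encoded_size)"
echo "raw=$raw_bytes encoded=$encoded_bytes"   # prints: raw=3000 encoded=4000
```

The same ratio applies to the request body, so a 3 MB local image adds roughly 4 MB to the upload.
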
## Troubleshooting

**"OPENAI_API_KEY environment variable is not set"**
- Set the key with the agent's `env_config` tool, as shown in Quick Start

**"Image file not found"**
- Check that the file path is correct
- Use an absolute path or a path relative to the current directory

**"Unsupported image format"**
- Only JPEG, PNG, GIF, and WebP are supported
- Check that the file extension matches the actual format

**"Failed to call OpenAI API"**
- Check your internet connection
- Verify that the API key is valid
- If you set a custom API base URL, check that it is correct

## License

Part of the chatgpt-on-wechat project.