Introduction to GPT-4o and GPT-4o Mini
GPT-4o and its lightweight counterpart, GPT-4o Mini, are advanced multimodal models that accept combinations of text, audio, image, and video as input and generate outputs in text, audio, and image form. GPT-4o Mini is a smaller model optimized for speed and cost efficiency while retaining strong accuracy.
Background on GPT-4o
Prior to GPT-4o, interacting with ChatGPT via Voice Mode required three separate models. GPT-4o consolidates these capabilities into one unified model that handles text, vision, and audio inputs, providing a cohesive experience across different input types.
API Capabilities and Future Updates
Currently, the API for GPT-4o Mini supports text and image inputs with text outputs, similar to GPT-4 Turbo. However, additional modalities like audio will soon be available, further enhancing its multimodal processing capabilities.
This guide will walk you through getting started with GPT-4o Mini for tasks involving text, image, and video understanding.
Getting Started with GPT-4o Mini
To begin, you’ll need to install the OpenAI SDK for Python and configure the OpenAI client with an API key. Once set up, you can start by sending a simple text input to the model and receive responses based on that input.
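If the OpenAI Python SDK isn't already installed, it can typically be added with pip:

pip install --upgrade openai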
Below is an example of a basic setup and a sample request to the GPT-4o Mini model:
from openai import OpenAI
import os

MODEL = "gpt-4o-mini"

# Configure the client; the API key is read from the environment if set.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))

# Send a simple text-only chat completion request.
completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},
        {"role": "user", "content": "Hello! Could you solve 2+2?"}
    ]
)

print("Assistant: " + completion.choices[0].message.content)
The assistant would respond with something like, "Of course! 2 + 2 = 4."
Image Processing with GPT-4o Mini
GPT-4o Mini can process images directly, allowing it to analyze visual content and provide insights or take actions based on the image provided. Images can be sent to the model either as Base64 encoded strings or as URLs.
Here’s an example of how to process an image using Base64 encoding:
import base64

def encode_image(image_path):
    # Read the image file and return its contents as a Base64-encoded string.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)  # IMAGE_PATH points to your local image file

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
The model will respond with the steps and calculation to find the area of the triangle based on the image provided.
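The same request also works with a hosted image: instead of a Base64 data URL, pass the image's address in the image_url field. A minimal sketch, using a placeholder URL you would replace with your own:

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/triangle.png"}}  # placeholder URL
        ]}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)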
Video Processing and Future Modalities
While the API doesn't yet accept video directly, GPT-4o can still understand video content: you sample individual frames and send them as images, and transcribe the accompanying audio with a model such as Whisper.
By integrating these components, you can create applications for video summarization or Q&A, combining visual and audio information for comprehensive analysis.
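A rough sketch of that pattern follows, reusing the client and MODEL from earlier. It assumes OpenCV (cv2) is available for frame extraction, that the video and audio file paths are placeholders, and that the audio track has already been extracted to its own file (for example with ffmpeg) before being transcribed with Whisper:

import base64
import cv2  # assumption: opencv-python is installed

def extract_frames(video_path, seconds_per_frame=2):
    # Sample one frame every `seconds_per_frame` seconds and Base64-encode it.
    frames = []
    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    frame_step = int(fps * seconds_per_frame)
    frame_index = 0
    while True:
        success, frame = video.read()
        if not success:
            break
        if frame_index % frame_step == 0:
            _, buffer = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buffer).decode("utf-8"))
        frame_index += 1
    video.release()
    return frames

frames = extract_frames("my_video.mp4")  # placeholder path

# Transcribe the separately extracted audio track with Whisper.
with open("my_video_audio.mp3", "rb") as audio_file:  # placeholder path
    transcription = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# Ask the model to summarize the video using the sampled frames plus the transcript.
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are generating a video summary. Respond in Markdown."},
        {"role": "user", "content": [
            {"type": "text", "text": "These are frames from the video and its audio transcription."},
            *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame}"}} for frame in frames],
            {"type": "text", "text": f"Audio transcription: {transcription.text}"}
        ]}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)

Sampling only one frame every couple of seconds keeps the request small while still giving the model enough visual context to combine with the transcript.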
Conclusion
GPT-4o and GPT-4o Mini represent a significant step forward in multimodal AI, offering robust capabilities across text and image processing today, with audio to follow. This unified model approach simplifies development and enhances interaction, making it a valuable tool for a wide range of applications.