Exploring GPT-4o: Delivering Human-like AI Interaction Across Text, Audio, and Vision
Ryan Daws is a senior editor at TechForge Media with over a decade of experience in tech journalism. He has identified the latest technological trends, dissected complex topics, and woven compelling narratives around the most cutting-edge developments. His articles and interviews with leading industry figures have earned him recognition as a key influencer from organisations such as Onalytica, and publications under his direction have been recognised by leading analyst houses like Forrester for their performance. You can find him on X (@gadget_ry) or Mastodon (@gadgetry@techhub.social).
OpenAI has launched its new flagship model, GPT-4o, which integrates text, audio, and visual inputs and outputs, aiming to enhance the naturalness of machine interactions.
GPT-4o, where the “o” stands for “omni,” is designed to cater to a wider range of input and output modalities. “It accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs,” OpenAI announced.
The model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response times in conversation.
GPT-4o’s debut represents a significant advance over previous models because a single neural network handles all inputs and outputs end to end. This unified approach lets the model retain information and context that were often lost in the separate-model pipeline of earlier versions.
Before GPT-4o, ‘Voice Mode’ had audio interaction latencies of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. The old configuration chained three separate models: one for transcribing audio to text, another for producing the text response, and a third for converting that text back into audio. This segmentation lost subtle elements such as tone, multi-speaker situations, and ambient noise.
As an all-in-one solution, GPT-4o delivers significant improvements in vision and audio comprehension. It can handle intricate tasks such as harmonising songs, translating in real time, and generating output with expressive elements like laughter and singing. Its broad abilities extend to areas such as interview preparation, on-the-spot language translation, and drafting customer service replies.
Nathaniel Whittemore, Founder and CEO of Superintelligent, remarked: “Product announcements inherently have a more divisive nature than technology announcements. It’s difficult to determine if a product will truly be unique until there’s actual interaction. Particularly with a different mode of human-computer interaction, there’s much more space for varying opinions on its usefulness.”
“Despite the absence of an announcement about GPT-4.5 or GPT-5, it should not distract us from the groundbreaking technological advancement that this represents as a natively multimodal model. It seamlessly integrates text, voice, and image features, laying ground for an extensive range of use cases that will gradually gain awareness over time.”
The GPT-4o model matches GPT-4 Turbo’s performance on English text and coding tasks while performing markedly better in non-English languages, broadening its accessibility. It also marks a new milestone in reasoning capability, scoring 88.7% on 0-shot CoT MMLU (general knowledge questions) and 87.2% on 5-shot no-CoT MMLU.
Furthermore, the model excels in audio and translation benchmarks, surpassing previous state-of-the-art models such as Whisper-v3. It also displays superior proficiency in multilingual and vision evaluations, advancing OpenAI’s capabilities across multilingual, audio, and visual domains.
GPT-4o integrates robust safety measures by design, with methods to filter training data and refine the model’s behaviour through post-training safeguards. OpenAI’s Preparedness Framework assessment found the model consistent with the company’s voluntary commitments: evaluations of cybersecurity, persuasion, and model autonomy place GPT-4o no higher than a ‘Medium’ risk level in any category.
Additional safety testing involved more than 70 external experts in fields such as social psychology, bias and fairness, and misinformation, with the aim of mitigating risks introduced by GPT-4o’s new capabilities.
Starting today, GPT-4o’s text and image capabilities are being integrated into ChatGPT, with a free tier available and expanded features for Plus users. A new Voice Mode powered by GPT-4o will enter alpha testing within ChatGPT Plus in the coming weeks.
Developers can access GPT-4o through the API for a variety of text and vision tasks, taking advantage of double the speed, half the price, and higher rate limits compared to GPT-4 Turbo.
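For illustration, here is a minimal sketch (not an official OpenAI example) of what a combined text-and-vision request might look like using the openai Python package (v1+); the prompt and image URL are placeholders, and an OPENAI_API_KEY environment variable is assumed:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # A single chat completion request mixing text and an image input
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe what is in this image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/sample.jpg"},  # placeholder URL
                    },
                ],
            }
        ],
    )

    print(response.choices[0].message.content)

The sketch sticks to text and vision because, as noted below, the API’s audio and video capabilities are initially limited to select partners.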
Looking ahead, OpenAI plans to extend GPT-4o’s audio and video capabilities to a small group of trusted partners through the API, with a broader release to follow. This phased rollout is intended to ensure thorough safety and usability testing before all capabilities are made publicly available.
“It’s massively significant that they’ve made this model free for all, alongside making the API 50% more affordable. This provides a tremendous increase in accessibility,” remarked Whittemore.
OpenAI encourages feedback from the community to continuously improve GPT-4o, highlighting the important role of user input in identifying tasks where GPT-4 Turbo may still outperform the new model.
(Image Credit: OpenAI)
Read also: OpenAI actions to enhance transparency of AI-generated content