ChatGPT is Only the Beginning

With the September announcement that ChatGPT has become multimodal (“ChatGPT Can Now See, Hear and Speak”), ChatGPT now supports both voice prompts and image uploads from users. These capabilities offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you’re talking about.

The term “multimodal” means “having or involving several modes, modalities, or maxima.” For example, a multimodal project might include a combination of text, images, video, files, code, speech or audio.

A variety of modalities are possible and have been explored with increasing frequency, because the same basic concepts that drive ChatGPT can be applied to any type of input or output.

Examples of different modalities:

Text-to-Text (OpenAI ChatGPT)
Text-to-Image (Stable Diffusion)
Image-to-Text (OpenAI CLIP)
Image-to-Image (img2img)
Text-to-Audio (Meta MusicGen)
Speech-to-Text (OpenAI Whisper)
Text-to-Speech (Meta’s Massively Multilingual Speech)
Text-to-Video (Synthesia, Picsart, HeyGen)
Video-to-Text (HappyScribe)
Text-to-Code (OpenAI Codex / GitHub Copilot)
Code-to-Text (ChatGPT)
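
The list above can be read as a lookup table from (input, output) pairs to example tools. Here is a minimal Python sketch of that idea. The names `MODALITY_EXAMPLES`, `tools_for`, and `outputs_from` are illustrative assumptions and do not come from any library.

```python
# Illustrative only: maps (input modality, output modality) pairs to the
# example tools named in the list above.
MODALITY_EXAMPLES = {
    ("text", "text"): ["OpenAI ChatGPT"],
    ("text", "image"): ["Stable Diffusion"],
    ("image", "text"): ["OpenAI CLIP"],
    ("image", "image"): ["img2img"],
    ("text", "audio"): ["Meta MusicGen"],
    ("speech", "text"): ["OpenAI Whisper"],
    ("text", "speech"): ["Meta's Massively Multilingual Speech"],
    ("text", "video"): ["Synthesia", "Picsart", "HeyGen"],
    ("video", "text"): ["HappyScribe"],
    ("text", "code"): ["OpenAI Codex", "GitHub Copilot"],
    ("code", "text"): ["ChatGPT"],
}


def tools_for(source: str, target: str) -> list[str]:
    """Return example tools for a source-to-target pair, or an empty list."""
    return MODALITY_EXAMPLES.get((source.lower(), target.lower()), [])


def outputs_from(source: str) -> list[str]:
    """List every output modality the table reaches from one input modality."""
    wanted = source.lower()
    return sorted({dst for (src, dst) in MODALITY_EXAMPLES if src == wanted})
```

For example, `outputs_from("text")` returns `["audio", "code", "image", "speech", "text", "video"]`, which shows how much of today's tooling still starts from a text prompt.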

The next frontier in AI is combining these modalities in interesting ways, using innovative UI/UX (not just a chat interface!). Explain what is in a photo. Debug a program with your voice. Generate music from an image.

ChatGPT’s present text-to-text, text-to-image, image-to-text and text-to-video functions will seem like simple demos compared with what is coming next:

Up Next: Multimodal Neural Networks

Researchers are developing multimodal neural networks that can handle multiple types of data, such as text, PDFs, files, images, videos, speech, audio and more. These neural networks are called “multimodal” because they can process different modes of information.

This “multi-modality” will soon take center stage as programs accept as input, or produce as output, text, images, “point clouds” of physical space, speech audio, video, and entire computer functions as smart applications.

The magic happens when more modalities are combined together.

One example of this is the “High-Modality Multimodal Transformer”, which was proposed by Paul Liang and his team at Carnegie Mellon University in 2023.

This neural network can deal with 10 different modes of data, including database tables and time series. The researchers found that adding more modes improved the performance and transferability of the neural network.

Another example is the “Meta-Transformer”, which was developed by Yiyuan Zhang and his colleagues at the Multimedia Lab of the Chinese University of Hong Kong and the Shanghai AI Laboratory in 2023.

The Meta-Transformer points toward a future of generative AI in which data of all different kinds is fused to give the model a richer sense of what it is producing as output. It is explored in the team’s 2023 paper.

This neural network can handle 12 different modes of data. The Meta-Transformer is a unified framework for multimodal learning that can generate rich and diverse outputs.

Finally, NExT-GPT is an “Any-to-Any Multimodal Large Language Model.”

NExT-GPT is a novel system that can perform any-to-any multimodal large language modeling, meaning it can accept and produce content in any combination of text, image, video, and audio modalities.

The system consists of three stages:

Multimodal encoding
LLM understanding and reasoning
Multimodal generation

The system leverages existing pre-trained models for each modality and connects them with projection layers that are fine-tuned with a small number of parameters. It also introduces a modality-switching instruction tuning (MosIT) technique, along with a curated dataset for it, which enables the system to handle complex cross-modal semantic understanding and generation tasks.
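
The three stages and the projection layers described above can be pictured as a chain of frozen and trainable modules. The following is a toy Python sketch of that data flow only, not NExT-GPT’s actual code. Every name in it (`FrozenModule`, `Projection`, `build_pipeline`, and so on) is hypothetical, the weights are random, and the dimensions are arbitrary.

```python
import random
from dataclasses import dataclass


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random values in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


@dataclass
class FrozenModule:
    """Stand-in for a pre-trained encoder, LLM, or decoder (never updated)."""
    weights: list
    trainable: bool = False

    def __call__(self, vector):
        return linear(self.weights, vector)


@dataclass
class Projection:
    """Small trainable layer that maps one feature space onto another."""
    weights: list
    trainable: bool = True

    def __call__(self, vector):
        return linear(self.weights, vector)


def build_pipeline(seed=0, feat_dim=8, llm_dim=16):
    """Assemble encoder -> projection -> LLM -> projection -> decoder."""
    rng = random.Random(seed)
    return [
        FrozenModule(random_matrix(feat_dim, feat_dim, rng)),  # stage 1: multimodal encoding
        Projection(random_matrix(llm_dim, feat_dim, rng)),     # map features into LLM space
        FrozenModule(random_matrix(llm_dim, llm_dim, rng)),    # stage 2: LLM reasoning
        Projection(random_matrix(feat_dim, llm_dim, rng)),     # map LLM output to decoder space
        FrozenModule(random_matrix(feat_dim, feat_dim, rng)),  # stage 3: multimodal generation
    ]


def run(pipeline, features):
    """Pass a feature vector through every module in order."""
    for module in pipeline:
        features = module(features)
    return features


def trainable_fraction(pipeline):
    """Share of all weights that belong to trainable modules."""
    total = sum(len(row) for m in pipeline for row in m.weights)
    tuned = sum(len(row) for m in pipeline if m.trainable for row in m.weights)
    return tuned / total
```

Calling `run(build_pipeline(), [1.0] * 8)` sends one feature vector through all five modules. In this toy the projections hold a large share of the weights because the frozen stand-ins are tiny. With real pre-trained models the frozen parts are far larger, which is why only a small number of parameters needs tuning.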

These three examples suggest that future versions of tools like ChatGPT could understand and use far more information at once. They may be able to process entire books, movies, and even 3D structures.

The Evolution of AI: Integrating Multimodalities and the UX Challenge

The realm of artificial intelligence is constantly evolving, and one of the most exciting developments is the integration of various modalities. Imagine the possibilities: describing the contents of a photograph, troubleshooting a software issue using voice commands, or even creating music inspired by an image. While the technical aspects of merging these modalities are undeniably intricate, the real challenge lies in crafting the perfect user experience (UX).

Why Traditional Chat Interfaces Fall Short

Chat interfaces have long been the go-to when first introducing users to novel technological concepts. Their intuitive nature makes them an obvious choice, especially when releasing new AI advancements. However, as we move into the next phase of multimodal AI, the limitations of chat interfaces become evident. Embedding images, audio, and other modalities within a chat can quickly lead to a cluttered and overwhelming experience for the user.

Chat interfaces often fall short of being the best tool for any specific task. The challenge, then, is to strike a balance between versatility and specialization.

The Future of UX in Multimodal Generative AI

The integration of different modalities presents a vast opportunity in the UI/UX domain. The key lies in determining the most effective way to present diverse outputs—be it audio, text, images, or code—to users. Moreover, it’s crucial to develop interfaces that not only display these outputs but also allow users to interact with, modify, and provide feedback on them. For instance, when considering the fine-tuning of a multimodal model, what mechanisms can we introduce to make this process intuitive and effective for the user?

Instead of relying on a chat interface alone, you might display more dynamic elements: input boxes, sliders, forms, or other interactive UX elements.

In conclusion, as AI continues to break boundaries by integrating multiple modalities, the onus is on UX designers to create interfaces that not only showcase these advancements but also provide a seamless and intuitive experience for users. The future of AI isn’t just about technological prowess; it’s about crafting experiences that resonate with and empower users.

Full Examples of Multimodal Generative AI inputs and outputs:

Text-to-text: (OpenAI ChatGPT) This type of generative AI takes text as input and produces text as output. For example, a text summarizer can take a long article as input and generate a shorter summary as output. A chatbot can take a natural language query or command as input and generate a natural language response as output. A large language model can take a text prompt as input and generate a coherent text continuation as output.
Text-to-image: (Stable Diffusion) This type of generative AI takes text as input and produces an image as output. For example, a text-to-image generator can take a textual description of an object or a scene as input and generate a realistic image that matches the description as output. A text-to-image art system can take a creative or abstract text prompt as input and generate an artistic image that reflects the prompt as output.
Image-to-text: (OpenAI CLIP) This type of generative AI takes an image as input and produces text as output. For example, an image captioner can take an image as input and generate a textual description of the image content as output. An image classifier can take an image as input and generate a textual label or category for the image as output. An optical character recognition (OCR) system can take an image of printed or handwritten text as input and generate a textual transcription of the text as output.
Image-to-image: (img2img or pix2pix) This type of generative AI takes an image as input and produces another image as output. For example, an image style transfer system can take an image and a style reference as input and generate a new image that has the same content but a different style as output. An image super-resolution system can take a low-resolution image as input and generate a high-resolution image as output. An image inpainting system can take an incomplete or corrupted image as input and generate a complete or restored image as output.
Audio-to-text: This type of generative AI takes audio as the input and produces text as the output. For example, a speech recognition system can take podcast audio as input and generate a textual transcription of the speech content as output. A music transcription system can take music audio as input and generate a textual notation of the music score as output. A sound classifier can take sound audio as input and generate a textual label or category for the sound source or event as output.
Text-to-audio: (Meta MusicGen) This type of generative AI takes text as the input and produces audio as the output. For example, a text-to-audio generator can take a textual script or narration as input and generate realistic speech or music audio that matches the script or narration as output. A text-to-audio art system can take a creative or abstract text prompt as input and generate artistic speech or music audio that reflects the prompt as output.
Audio-to-audio: This type of generative AI takes audio as the input and produces other audio as the output. For example, a voice conversion system can take speech audio as input and generate speech audio with a different voice, accent, or emotion as output. A music synthesis system can take music audio as input and generate music audio with different instruments, genres, or styles as output. A sound enhancement system can take noisy or distorted audio as input and generate clean or improved audio as output.
Speech-to-text: (OpenAI Whisper) This type of generative AI takes spoken language as the input and produces written text as the output. For example, a transcription system can take an audio recording of a lecture and generate a written transcript of the spoken content. A voice command recognition system can take verbal commands as input and produce corresponding textual instructions for a device or software to execute. A podcast summarization system can take long audio episodes and generate concise written summaries of the main topics discussed. This technology is fundamental for applications like voice assistants, real-time captioning, and audio indexing.
Text-to-speech: (Meta’s MMS) This type of generative AI takes written text as the input and produces spoken language as the output. For example, an audiobook generation system can take a written novel and produce an audio version narrated in a human-like voice. A reading assistant tool can take digital text from articles, emails, or documents and vocalize it for users with visual impairments or for those who prefer auditory learning. A voice response system can take textual data or scripted responses and convert them into verbal feedback for user interactions in call centers or virtual assistants. This technology bridges the gap between written content and auditory experiences, enhancing accessibility and user engagement.
Speech-to-speech: This type of generative AI takes spoken language as the input and produces altered or translated spoken language as the output. For example, a real-time translation system can take a sentence spoken in English and produce its equivalent in Spanish audibly. A voice modulation system can take a user’s voice and alter its tone, pitch, or accent to produce a different vocal characteristic or mimic another person’s voice. A speech enhancement system can take unclear or noisy speech as input and generate a clearer, noise-reduced version as output. This technology facilitates cross-lingual communication, voice customization, and improved auditory experiences in challenging environments.
Text-to-video: (Synthesia, Picsart, HeyGen) This type of generative AI takes text as the input and produces video as the output. For example, a text-to-video generator can take a textual script or storyboard as input and generate a realistic video that matches the script or storyboard as output. A text-to-video art system can take a creative or abstract text prompt as input and generate an artistic video that reflects the prompt as output.
Video-to-text: (HappyScribe) This type of generative AI takes video as the input and produces text as the output. For example, a video captioner can take video as input and generate a textual description of the video content as output. A video summarizer can take video as input and generate a shorter summary of the video content as output. A video classifier can take video as input and generate a textual label or category for the video genre, topic, or sentiment as output.
Video-to-video: This type of generative AI takes video as the input and produces another video as the output. For example, a video style transfer system can take video and a style reference as input and generate a new video that has the same content but a different style as output. A video super-resolution system can take low-resolution video as input and generate high-resolution video as output. A video inpainting system can take incomplete or corrupted video as input and generate complete or restored video as output.

Text-to-code: (OpenAI Codex / GitHub Copilot) This type of generative AI takes text as input and produces code as output. For example, a text-to-code generator can take a textual description of a function or a program as input and generate the corresponding code in a specific programming language as output. A text-to-code art system can take a creative or abstract text prompt as input and generate an artistic code that reflects the prompt as output.
Code-to-text: (ChatGPT, etc.) This type of generative AI takes code as the input and produces textual descriptions or explanations as the output. For example, a code documentation system can take a segment of code and generate a detailed description or comment explaining its functionality. A code summarization system can take a lengthy code block as input and produce a concise summary of its main operations. A code-to-comment system can take uncommented or poorly documented code as input and generate relevant and informative comments for each segment, enhancing the readability and understanding of the code.
PDF-to-text: This type of generative AI takes a PDF as the input and produces text as the output. For example, a PDF-to-text converter can take a PDF document as input and extract its textual content as output. A PDF-to-text summarizer can take a PDF document as input and generate text that summarizes the main points or highlights from the document as output.
Text-to-PDF: This type of generative AI takes text as the input and produces a PDF as the output. For example, a text-to-PDF converter can take a text document as input and generate a PDF document that preserves the formatting, layout, and style of the text document as output. A text-to-PDF generator can take text data as input and generate a PDF document that visualizes the data with charts, graphs, or tables as output.
File-to-text: This type of generative AI takes a file as the input and produces text as the output. For example, a file-to-text converter can take a file of any format (such as Word, Excel, PowerPoint, etc.) as input and extract its textual content as output. A file-to-text summarizer can take a file of any format (such as Word, Excel, PowerPoint, etc.) as input and generate text that summarizes the main points or highlights from the file as output.
Text-to-file: This type of generative AI takes text as input and produces a file as output. For example, a text-to-file converter can take a text document as input and generate a file of any format (such as Word, Excel, PowerPoint, etc.) that preserves the formatting, layout, and style of the text document as output. A text-to-file generator can take text data as input and generate a file of any format (such as Word, Excel, PowerPoint, etc.) that visualizes the data with charts, graphs, or tables as output.
Point cloud-to-text: This type of generative AI takes a point cloud as input and produces text as output. A point cloud is a way of representing a physical object or space in digital form. It is made up of many points, each with a location in three dimensions (x, y, and z). For example, a point cloud-to-text converter can take a point cloud representation of a 3D object or scene as input and generate text that describes the shape, size, color, or texture of the object or scene as output. A point cloud-to-text classifier can take a point cloud representation of a 3D object or scene as input and generate text that labels or categorizes the object or scene as output.
Text-to-point cloud: This type of generative AI takes text as the input and produces a point cloud as the output. For example, a text-to-point cloud generator can take a textual description of a 3D object or scene as input and generate a point cloud representation of the object or scene that matches the description as output. A text-to-point cloud art system can take a creative or abstract text prompt as input and generate a point cloud representation of an artistic 3D object or scene that reflects the prompt as output.
Data table-to-text: This type of generative AI takes a data table as input and produces text as output. For example, a data table-to-text converter can take a data table containing numerical or categorical values as input and generate text that extracts the information from the table as output. A data table-to-text summarizer can take a data table containing numerical or categorical values as input and generate text that summarizes the main trends, patterns, or insights from the table as output.
Text-to-data table: This type of generative AI takes text as input and produces a data table as output. For example, a text-to-data table converter can take text containing numerical or categorical information as input and generate a data table that organizes the information into rows and columns as output. A text-to-data table generator can take text containing natural language queries or commands as input and generate a data table that answers the queries or executes the commands using external data sources as output.
Infrared-to-text: This type of generative AI takes infrared as input and produces text as output. For example, an infrared-to-text converter can take an infrared image or video as input and generate text that extracts the thermal information from the image or video as output. An infrared-to-text classifier can take an infrared image or video as input and generate text that labels or categorizes the objects or events based on their thermal signatures as output.
Text-to-infrared: This type of generative AI takes text as input and produces infrared as output. For example, a text-to-infrared generator can take a textual description of an object or event with thermal information as input and generate an infrared image or video that matches the description as output. A text-to-infrared art system can take a creative or abstract text prompt with thermal information as input and generate an infrared image or video that reflects the prompt as output.
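
To make the point cloud entries above more concrete, here is a small Python sketch that stores a point cloud as a list of (x, y, z) tuples and produces a one-line description of it. This is a hand-written, rule-based stand-in, not a generative model. The names `Point`, `bounding_box`, `describe`, and `cube_corners` are assumptions made for illustration.

```python
from typing import List, Tuple

# A point is a location in three dimensions: (x, y, z).
Point = Tuple[float, float, float]


def bounding_box(points: List[Point]) -> Tuple[Point, Point]:
    """Return the minimum and maximum corner of the box enclosing all points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def describe(points: List[Point]) -> str:
    """Produce a short textual summary of the cloud's size and extent."""
    low, high = bounding_box(points)
    size = [h - l for l, h in zip(low, high)]
    return (
        f"{len(points)} points spanning "
        f"{size[0]:g} x {size[1]:g} x {size[2]:g} units"
    )


# The eight corners of a 1 x 1 x 2 box, used as a tiny example cloud.
cube_corners: List[Point] = [
    (x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 2.0)
]
```

Here `describe(cube_corners)` returns `"8 points spanning 1 x 1 x 2 units"`. A real point cloud-to-text model would learn such descriptions from data instead of computing them from fixed rules.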

October 11, 2023 David Cronshaw