In this blog post, I will summarize recent advancements in generative AI for text and images. Is generative AI as significant as the advent of the internet, or just the latest shiny toy? I’ll highlight the implications for business and society, and check in on our progress toward The Singularity.
First, the advancements. In the past few years, we’ve seen several breakthroughs in text generation that allow computers to generate long-form cohesive text that is indistinguishable from human writing. Think of it as the ultimate “auto-complete.” Amazingly, through the process of “prompt engineering,” you can prompt these algorithms to answer questions with well-reasoned responses and even hold intelligent conversations (see character.ai). Because these models are trained on a huge corpus of human writing and can be applied to an equally wide range of scenarios, they are called “foundation models.”
The primary model in this category is GPT-3 (the third version of Generative Pre-trained Transformer, created by the OpenAI research lab primarily funded by Microsoft). GPT-3 and other models like it are trained to predict what comes next in a series of text inputs. GPT-3 uses a deep neural network architecture, some key innovations in attention (see next paragraph), and an unprecedented number of parameters (175 billion connected neuron weights). GPT-3 was trained on hundreds of billions of words from a crawl of the internet, and tens of billions of words from books.
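To make "predict what comes next" concrete, here is a toy sketch of the same training objective: a bigram model that counts which word follows each word in a tiny corpus, then predicts the most frequent follower. GPT-3 does this same job with a 175-billion-parameter transformer instead of a lookup table (the corpus and function names here are made up for illustration):

```python
from collections import Counter, defaultdict

# Tiny "training corpus" and a lookup table of follower counts.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in training."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' — it followed "the" twice, beating "mat" and "fish"
```

Chaining such predictions word after word is, at cartoon scale, how these models generate long passages of text.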
In a neural network, attention means that certain input elements are given more “weight” or importance than others when the network makes predictions. This allows the network to focus on the most relevant information when making predictions. GPT-3 uses a novel attention mechanism that allows it to better handle long-term dependencies in language. This is important because language often contains complex concepts that can only be understood by considering the context of the entire sentence or paragraph.
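The weighting idea above can be shown in a few lines. This is a minimal sketch of scaled dot-product attention (the core operation inside transformers) for a single query, using made-up toy vectors; real models run this over thousands of learned, high-dimensional vectors in parallel:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Score each key against the query, turn scores into weights
    that sum to 1, and return the weighted average of the values."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Three "tokens": the query matches the first key best,
# so the output leans toward the first value vector.
keys   = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out = attention([1.0, 0.0], keys, values)
print(out)
```

The softmax weights are what let the network "focus": relevant tokens get large weights and dominate the output, regardless of how far away they sit in the sentence.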
OK, so how about generating images? You’ve probably heard of DALL-E. DALL-E (also from OpenAI) uses a transformer-based model similar to GPT-3, but with a few important modifications. It is trained to generate images from textual descriptions, rather than to predict the next word in a sequence (as GPT-3 does). DALL-E 2 uses 3.5 billion parameters (small compared to its predecessor) and was trained on 400 million pairs of images with text captions scraped from the internet.
Here is my cocktail party description of the process: DALL-E 2 encodes (or maps) the textual description of an image using a transformer (the thing that was trained on the labeled images). This text representation is then “decoded” by a decoder that produces an image. The decoding process involves generating an image from noise (this is the diffusion part), guided by text encodings from a CLIP model (Contrastive Language-Image Pre-training; note this is not a generative adversarial architecture). Phew, my friends went to get more drinks already.
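For the friends who stayed, here is a cartoon of that diffusion decoding step in code: start from pure noise and repeatedly denoise toward a sample that matches the conditioning vector. In the real model, each denoising step is a huge learned neural network; here it is just a toy nudge toward a pretend text embedding (all names and numbers below are made up for illustration):

```python
import random

random.seed(0)

def diffusion_decode(text_embedding, steps=50):
    """Toy diffusion decoder: begin with random noise and
    iteratively denoise it toward the conditioning signal."""
    x = [random.gauss(0, 1) for _ in text_embedding]
    for _ in range(steps):
        # Move 20% of the way toward the target each step,
        # standing in for one learned denoising step.
        x = [xi + 0.2 * (ti - xi) for xi, ti in zip(x, text_embedding)]
    return x

target = [1.0, -2.0, 0.5]          # pretend CLIP text embedding
image = diffusion_decode(target)   # "image" ends up matching the text
print(image)
```

The key intuition survives the simplification: generation is many small steps from noise toward something consistent with the text, not one giant leap.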
OK, so what are the ramifications of these amazing research breakthroughs? Obviously, writing becomes easier for humans with an ultimate auto-complete. Generating and editing images becomes much easier for artists with this new toolbox.
I mentioned chatbots and question-answering. I’ve been waiting a lifetime for a chatbot that passes the Turing test, and we’re pretty much there. It’s easy to imagine human customer support agents being replaced by carefully trained AI models. Or, perhaps customer-like chatbots could be used to test and train human agents.
Because GPT-3 was trained by “reading” computer code (as well as English text), it’s also an excellent assistant for computer programming. In fact, it may be just a few months away from writing entire programs given basic human instruction. It is currently feasible to write an English description and have Copilot autocomplete several lines of code.
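A typical session looks like this: the programmer writes only the comment and the `def` line, and the model fills in the body. The completion below is one plausible output I wrote by hand, not an actual Copilot transcript:

```python
# Human writes the comment and signature; the model autocompletes the rest.

# Return the n most common words in a text, ignoring case.
def most_common_words(text, n):
    from collections import Counter
    words = text.lower().split()
    return [word for word, _ in Counter(words).most_common(n)]

print(most_common_words("To be or not to be", 2))  # ['to', 'be']
```

The English comment is effectively the prompt, and the code is the “next token” the model predicts, which is why code completion fell out of the same training recipe as text completion.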
Other unexpected applications of GPT-3 include auto-completion in spreadsheets, automatically writing database queries, understanding UI layouts, creating and checking legal contracts, writing music, and creating cooking recipes. The list is endless. Today the technology is surprisingly reasonable and mostly correct, but it is only capable of basic reasoning and still prone to occasionally spewing 100%-confident nonsensical answers.
Here are two specific B2B ideas.
- Clippy 2.0: AI-enabled Workflow Automation. UiPath is a giant company that helps businesses automate their employees’ repetitive tasks. Tools like OpenAI’s Codex (the backend of GitHub Copilot) will soon generate the scripts and macros that humans create or record by hand today. Because the scripts will be generated and refined using English, they will be far less fragile. “Clippy, delete all the temporary files off my computer.”
- Realms: GPT-based algorithms still need to be fine-tuned for your specific business. Realms will be a public/private space for generating, storing, and forking example sets so that when you use AI text or photo generation in your business, the output comes out with the style of your business. Realms is an online catalog of style guides for your marketing communications, contracts, web design, and stock photography.
The tricky part: All of these ideas are potentially enabled by this new technology, but it’s not quite there yet. As I said above, GPT-3 often writes something that sounds completely plausible but is actually untrue. DALL-E 2 images frequently look great as thumbnails but are nightmarish when you zoom in.
How long do humans have left?
Lastly, I want to check in on where we are relative to the singularity. How do we get from here to sentient computers or artificial general intelligence? How do these new deep-learning models accelerate things?
The singularity is the idea that once artificial intelligence surpasses human intelligence, it will be able to design even better AI, leading to a rapid exponential increase in intelligence. The result is an AI “singularity” where machines become superintelligent and humans are unable to understand or control them.
Kurzweil’s famous chart depicted the y-axis as the “number of neurons” (or the number of parameters in the models above). Over time, I’ve come to think of it more as a measure of the overall capability (reasoning ability and autonomy) of some imagined AI.
The ultimate auto-complete and art generation algorithms discussed above demonstrate the new ability of computers to represent complex concepts, perform simple logical reasoning, and even encode common sense. Obviously, researchers can employ these techniques like programmers and artists to do more and better experiments. What we need for AGI is to begin empowering the software to actually take actions – do things based on the goals we give it (and it chooses).
You can imagine a process in the cloud that accepts requests, then plans and performs a set of actions, executing them with whatever capabilities we give it. “Hal, watch for new mentions of my name on the Web and email me when you find them.” Internally, Hal auto-completes a prompt like “the specific steps to find new mentions on the web and email someone are…” and then executes those steps. There will be an amazing inflection point when the process is given the ability to modify itself. We’ve learned from reinforcement learning that in a digital environment an algorithm (given a fitness function) can fail quickly and repeat as many times as required to get it right.
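The Hal loop above can be sketched in a few lines. Everything here is hypothetical: `call_llm` stands in for a real completion API (and just returns a hard-coded plan for the demo), and the “tools” are toy functions representing capabilities the process has been granted:

```python
def call_llm(prompt):
    # Stand-in for a GPT-style completion API. A real agent would
    # send the prompt to a model; we hard-code a plausible plan.
    return ["search_web('my name')", "email('new mentions found')"]

def search_web(query):
    return f"results for {query}"

def email(body):
    return f"sent: {body}"

TOOLS = {"search_web": search_web, "email": email}

def run_agent(goal):
    """Plan with the model, then execute each step with granted tools."""
    steps = call_llm(f"The specific steps to {goal} are...")
    log = []
    for step in steps:
        name, arg = step.split("(", 1)
        arg = arg.rstrip(")").strip("'")
        log.append(TOOLS[name](arg))
    return log

print(run_agent("find new mentions of my name and email me"))
```

The inflection point in the text corresponds to adding `call_llm` itself to `TOOLS` — letting the planner revise its own plans and code.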
Generative technologies are the biggest thing since the iPhone (possibly since the internet). Critically, researchers (even those funded by big tech companies) are sharing their learnings, accelerating advancement at a pace we’ve never seen before. We’ve never seen computers converse cogently or imagine images before. Still, I’m cognizant of the hype curve (it was just a year ago that “Web3” was coined for apps built on a blockchain!), the complexity of the research, and the current limitations. What a time to be alive, hold on to your papers!