Large language models (LLMs) have taken the tech industry by storm, powering experiences that can only be described as magical—from writing a week’s worth of code in seconds to generating conversations that feel even more empathetic than the ones we have with humans. Trained on trillions of tokens of data with clusters of thousands of GPUs, LLMs demonstrate remarkable natural language understanding and have transformed fields like copywriting and coding, propelling us into the new and exciting generative era of AI. As with any emerging technology, generative AI has been met with some criticism. Though some of this criticism does reflect the current limits of LLMs’ capabilities, we see these roadblocks not as fundamental flaws in the technology, but as opportunities for further innovation.

To better understand the near-term technological breakthroughs for LLMs and prepare founders and operators for what’s around the bend, we spoke to some of the leading generative AI researchers who are actively building and training some of the largest and most cutting-edge models: Dario Amodei, CEO of Anthropic; Aidan Gomez, CEO of Cohere; Noam Shazeer, CEO of Character.AI; and Yoav Shoham of AI21 Labs. These conversations identified four key innovations on the horizon: steering, memory, “arms and legs,” and multimodality. In this piece, we discuss how these key innovations will evolve over the next 6 to 12 months and how founders curious about integrating AI into their own businesses might leverage these new advances.

Steering

Many founders are understandably wary of implementing LLMs in their products and workflows because of these models’ potential to hallucinate and reproduce bias. To address these concerns, several of the leading model companies are working on improved steering—a way to place better controls on LLM outputs—to focus model outputs and help models better understand and execute on complex user demands. Noam Shazeer draws a parallel between LLMs and children in this regard: “it’s a question of how to direct [the model] better… We have this problem with LLMs that we just need the right ways of telling them to do what we want. Small children are like this as well—they make things up sometimes and don’t have a firm grasp of fantasy versus reality.” Though there has been notable progress in steerability among the model providers as well as the emergence of tools like Guardrails and LMQL, researchers are continuing to make advancements, which we believe is key to better productizing LLMs among end users.
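Tools in this space generally share the same core pattern: validate the model’s raw output against a machine-checkable contract and re-prompt on failure. Here is a minimal, library-free sketch of that loop; the function names and the JSON contract are illustrative assumptions, not the actual APIs of Guardrails or LMQL:

```python
import json

def validate_output(text, required_keys):
    """Return parsed JSON if it contains every required key, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    return data if all(key in data for key in required_keys) else None

def steered_generate(llm_call, prompt, required_keys, max_retries=3):
    """Re-prompt until the model emits output matching the contract.

    `llm_call` is a stand-in for any completion API with signature str -> str.
    """
    for _ in range(max_retries):
        raw = llm_call(prompt)
        parsed = validate_output(raw, required_keys)
        if parsed is not None:
            return parsed
        # Feed the failure back so the model can self-correct on the next turn.
        prompt += f"\n\nYour last reply was not valid JSON with keys {required_keys}. Try again."
    raise ValueError("model never produced a valid response")

# Usage with a fake model that fails once, then complies:
replies = iter(['not json at all', '{"sentiment": "positive", "score": 0.9}'])
result = steered_generate(lambda p: next(replies), "Classify: 'great product!'",
                          ["sentiment", "score"])
print(result["sentiment"])  # positive
```

The retry-with-feedback step is the crude ancestor of what the model providers are building natively: the more the model itself understands the contract, the fewer retries (and the less prompt engineering) the application layer needs.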

Improved steering becomes especially important in enterprise companies where the consequences of unpredictable behavior can be costly. Amodei notes that the unpredictability of LLMs “freaks people out” and, as an API provider, he wants to be able to “look a customer in the eye and say ‘no, the model will not do this,’ or at least does it rarely.” By refining LLM outputs, founders can have greater confidence that the model’s performance will align with customer demands. Improved steering will also pave the way for broader adoption in other industries with higher accuracy and reliability requirements, like advertising, where the stakes of ad placement are high. Amodei also sees use cases ranging from “legal use cases, medical use cases, storing financial information and managing financial bets, [to] where you need to preserve the company brand. You don’t want the tech you incorporate to be unpredictable or hard to predict or characterize.” With better steering, LLMs will also be able to do more complex tasks with less prompt engineering, as they will be able to better understand overall intent.

Advances in LLM steering also have the potential to unlock new possibilities in sensitive consumer applications where users expect tailored and accurate responses. While users might be willing to tolerate less accurate outputs from LLMs when engaging with them for conversational or creative purposes, users want more accurate outputs when using LLMs to assist them in daily tasks, advise them on major decisions, or augment professionals like life coaches, therapists, and doctors. Some have pointed out that LLMs are poised to unseat entrenched consumer applications like search, but we likely need better steering to improve model outputs and build user trust before this becomes a real possibility.

Key unlock: users can better tailor the outputs of LLMs.

Memory

Copywriting and ad-generating apps powered by LLMs have already seen great results, leading to quick uptake among marketers, advertisers, and scrappy entrepreneurs. Currently, however, most LLM outputs are relatively generalized, which makes it difficult to leverage them for use cases requiring personalization and contextual understanding. While prompt engineering and fine-tuning can offer some level of personalization, prompt engineering is less scalable and fine-tuning tends to be expensive, since it requires some degree of re-training and often partnering closely with the providers of mostly closed-source LLMs. It’s often not feasible or desirable to fine-tune a model for every individual user.

In-context learning, where the LLM draws from the content your company has produced, your company’s specific jargon, and your specific context, is the holy grail—creating outputs that are more refined and tailored to your particular use case. In order to unlock this, LLMs need enhanced memory capabilities. There are two primary components to LLM memory: context windows and retrieval. Context windows are the text that the model can process and use to inform its outputs in addition to the data corpus it was trained on. Retrieval refers to retrieving and referencing relevant information and documents from a body of data outside the model’s training data corpus (“contextual data”). Currently, most LLMs have limited context windows and aren’t able to natively retrieve additional information, and so generate less personalized outputs. With bigger context windows and improved retrieval, however, LLMs can directly offer much more refined outputs tailored to individual use cases.
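The retrieval half can be sketched in a few lines. This toy version uses a bag-of-words similarity in place of a learned embedding model, and the helper names and example documents are hypothetical:

```python
import math
import string
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector.
    A production system would use a learned embedding model instead."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Rank contextual documents by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, documents):
    """Stuff the retrieved documents into the context window ahead of the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days.",
    "The company was founded in 2012 in Austin.",
    "Refunds are issued to the original payment method.",
]
prompt = build_prompt("What is the refund policy?", docs)
print(prompt)
```

Only the most relevant documents end up in the prompt, reserving the limited context window for the contextual data that actually bears on the question.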

With expanded context windows in particular, models will be able to process larger amounts of text and better maintain context, including maintaining continuity through a conversation. This will, in turn, significantly enhance models’ ability to carry out tasks that require a deeper understanding of longer inputs, such as summarizing lengthy articles or generating coherent and contextually accurate responses in extended conversations. We’re already seeing significant improvement with context windows—GPT-4 offers both 8k- and 32k-token context windows, up from the 4k- and 16k-token context windows of GPT-3.5 and ChatGPT, and Claude recently expanded its context window to an astounding 100k tokens.
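Even a 100k-token window is finite, so applications typically keep a rolling conversation buffer trimmed to the model’s budget. A sketch of that bookkeeping, assuming a crude whitespace token count in place of the model’s real tokenizer:

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer (e.g. BPE): count whitespace words."""
    return len(text.split())

def fit_to_context(messages, budget, system_prompt=""):
    """Keep the most recent messages that fit within the token budget.

    Drops the oldest turns first, always preserving the system prompt,
    so the conversation keeps continuity at its most recent end.
    """
    used = count_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):  # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return ([system_prompt] if system_prompt else []) + list(reversed(kept))

history = ["turn one is quite a long message indeed",
           "turn two", "turn three", "turn four"]
window = fit_to_context(history, budget=8, system_prompt="be concise")
print(window)  # ['be concise', 'turn two', 'turn three', 'turn four']
```

Dropping the oldest turns is the simplest policy; real assistants often summarize evicted turns instead, or combine trimming with the retrieval approach above so that old context can be recalled on demand.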