Language model benchmarks are standardized tests designed to evaluate the performance of language models on various natural language processing tasks....
97 KB (10,592 words) - 05:53, 15 June 2025
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing...
115 KB (11,926 words) - 02:40, 16 June 2025
A language model is a model of the human brain's ability to produce natural language. Language models are useful for a variety of tasks, including speech...
17 KB (2,413 words) - 04:41, 19 June 2025
Claude is a family of large language models developed by Anthropic. The first model was released in March 2023. The Claude 3 family, released in March...
27 KB (2,313 words) - 01:34, 16 June 2025
Reasoning language models (RLMs) are large language models that have been further trained to solve multi-step reasoning tasks. These models perform better...
24 KB (2,862 words) - 09:59, 13 June 2025
Humanity's Last Exam (category Large language models)
Humanity's Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the...
7 KB (478 words) - 23:04, 13 June 2025
Llama (Large Language Model Meta AI, formerly stylized as LLaMA) is a family of large language models (LLMs) released by Meta AI starting in February 2023...
53 KB (4,940 words) - 20:25, 13 June 2025
MMLU (redirect from Measuring Massive Multitask Language Understanding)
Measuring Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several...
6 KB (746 words) - 20:00, 11 May 2025
variety of industry benchmarks, while Gemini Pro was said to have outperformed GPT-3.5. Gemini Ultra was also the first language model to outperform human...
54 KB (4,386 words) - 05:10, 18 June 2025
written. List of chatbots List of language model benchmarks This is the date that documentation describing the model's architecture was first released....
64 KB (3,353 words) - 19:38, 17 June 2025
for Transformer-based Masked Language-models, arXiv:2106.10199 "Papers with Code - MMLU Benchmark (Multi-task Language Understanding)". paperswithcode...
52 KB (5,397 words) - 20:13, 15 June 2025
Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent...
31 KB (3,568 words) - 19:15, 25 May 2025
average accuracy of 67.5% on the Measuring Massive Multitask Language Understanding (MMLU) benchmark, which is 7% higher than Gopher's performance. Chinchilla...
8 KB (615 words) - 19:51, 6 December 2024
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance...
23 KB (2,684 words) - 20:25, 1 June 2025
Qwen (category Large language models)
family of large language models developed by Alibaba Cloud. In July 2024, it was ranked as the top Chinese language model in some benchmarks and third globally...
20 KB (1,429 words) - 06:59, 19 June 2025
in 2016, and of the paper that introduced the language model benchmark MMLU (Massive Multitask Language Understanding) in 2020. In February 2022, Hendrycks...
10 KB (860 words) - 05:42, 11 June 2025
Generative artificial intelligence (category CS1 Japanese-language sources (ja))
language model benchmarks. Yann LeCun has advocated open-source models for their value to vertical applications and for improving AI safety. Language...
174 KB (15,078 words) - 04:09, 19 June 2025
DeepSeek (category Articles containing Chinese-language text)
is a Chinese artificial intelligence company that develops large language models (LLMs). Based in Hangzhou, Zhejiang, DeepSeek is owned and funded by...
63 KB (6,074 words) - 09:28, 18 June 2025
evaluating and aligning large language models (LLMs), including through initiatives such as Humanity's Last Exam, a benchmark designed to assess advanced...
20 KB (1,926 words) - 17:17, 17 June 2025
Retrieval-augmented generation (category Large language models)
Retrieval-augmented generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information. With RAG, LLMs...
23 KB (2,451 words) - 17:44, 2 June 2025
Stochastic parrot (redirect from On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?)
the claim that large language models, though able to generate plausible language, do not understand the meaning of the language they process. The term...
22 KB (2,364 words) - 00:13, 12 June 2025
Prompt engineering (redirect from In-context learning (natural language processing))
intelligence (AI) model. A prompt is natural language text describing the task that an AI should perform. A prompt for a text-to-text language model can be a query...
40 KB (4,472 words) - 15:50, 19 June 2025
Gemini Ultra, in benchmark tests at the time. Sonnet and Haiku are Anthropic's medium- and small-sized models, respectively. All three models can accept image...
32 KB (2,936 words) - 06:57, 10 June 2025
LangChain (category Large language models)
announcing a $10 million seed investment from Benchmark. In the third quarter of 2023, the LangChain Expression Language (LCEL) was introduced, which provides...
18 KB (748 words) - 12:14, 12 June 2025
The term benchmark, bench mark, or survey benchmark originates from the chiseled horizontal marks that surveyors made in stone structures, into which an...
10 KB (1,053 words) - 15:42, 10 February 2025
OpenAI o3 (category Large language models)
Diamond benchmark, which contains expert-level science questions not publicly available online. On SWE-bench Verified, a software engineering benchmark assessing...
9 KB (846 words) - 12:30, 11 June 2025
PaLM (redirect from Pathways Language Model)
PaLM (Pathways Language Model) is a 540 billion-parameter dense decoder-only transformer-based large language model (LLM) developed by Google AI. Researchers...
13 KB (807 words) - 13:21, 13 April 2025
GPT-4o (category Large language models)
Massive Multitask Language Understanding (MMLU) benchmark compared to 86.5 for GPT-4. Unlike GPT-3.5 and GPT-4, which rely on other models to process sound...
25 KB (2,434 words) - 10:12, 19 June 2025
president Donald Trump. January 23 – Humanity's Last Exam, a benchmark for large language models, is published. The dataset consists of 3,000 challenging...
8 KB (782 words) - 14:11, 25 May 2025
Transformer (deep learning architecture) (redirect from Transformer model)
variations have been widely adopted for training large language models (LLMs) on large (language) datasets. The modern version of the transformer was proposed...
106 KB (13,107 words) - 11:55, 19 June 2025