Language model benchmarks are standardized tests designed to evaluate the performance of language models on various natural language processing tasks....
89 KB (9,773 words) - 15:03, 3 May 2025
A language model is a model of natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation,...
16 KB (2,382 words) - 00:06, 17 April 2025
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language...
114 KB (11,942 words) - 05:35, 30 April 2025
Humanity's Last Exam (category Large language models)
Humanity's Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the...
7 KB (478 words) - 19:07, 3 May 2025
Llama (Large Language Model Meta AI, formerly stylized as LLaMA) is a family of large language models (LLMs) released by Meta AI starting in February 2023...
53 KB (4,940 words) - 16:55, 22 April 2025
variety of industry benchmarks, while Gemini Pro was said to have outperformed GPT-3.5. Gemini Ultra was also the first language model to outperform human...
52 KB (4,226 words) - 20:15, 19 April 2025
Claude is a family of large language models developed by Anthropic. The first model was released in March 2023. The Claude 3 family, released in March...
21 KB (1,894 words) - 20:08, 19 April 2025
MMLU (redirect from Measuring Massive Multitask Language Understanding)
Measuring Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several...
6 KB (746 words) - 20:41, 29 April 2025
written. List of chatbots; List of language model benchmarks. This is the date that documentation describing the model's architecture was first released....
64 KB (3,361 words) - 09:20, 29 April 2025
for Transformer-based Masked Language-models, arXiv:2106.10199 "Papers with Code - MMLU Benchmark (Multi-task Language Understanding)". paperswithcode...
44 KB (4,714 words) - 05:57, 6 March 2025
Reasoning language models are artificial intelligence systems that combine natural language processing with structured reasoning capabilities. These models are...
24 KB (2,960 words) - 18:31, 16 April 2025
Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent...
31 KB (3,528 words) - 01:20, 29 April 2025
average accuracy of 67.5% on the Measuring Massive Multitask Language Understanding (MMLU) benchmark, which is 7% higher than Gopher's performance. Chinchilla...
8 KB (615 words) - 19:51, 6 December 2024
Qwen (category Large language models)
family of large language models developed by Alibaba Cloud. In July 2024, it was ranked as the top Chinese language model in some benchmarks and third globally...
20 KB (1,430 words) - 12:39, 2 May 2025
Mistral AI (section Models)
code-focused model on the HumanEval FIM benchmark. Mathstral 7B achieved a score of 56.6% on the MATH benchmark and 63.47% on the MMLU benchmark. On 17 March...
27 KB (1,716 words) - 03:22, 29 April 2025
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance...
22 KB (2,614 words) - 17:29, 2 April 2025
Generative artificial intelligence (category CS1 Japanese-language sources (ja))
language model benchmarks. Yann LeCun has advocated open-source models for their value to vertical applications and for improving AI safety. Language...
163 KB (13,826 words) - 19:09, 30 April 2025
OpenAI o3 (category Large language models)
Diamond benchmark, which contains expert-level science questions not publicly available online. On SWE-bench Verified, a software engineering benchmark assessing...
8 KB (744 words) - 06:49, 29 April 2025
DeepSeek (category Articles containing Chinese-language text)
is a Chinese artificial intelligence company that develops large language models (LLMs). Based in Hangzhou, Zhejiang, it is owned and funded by the...
62 KB (6,059 words) - 16:53, 1 May 2025
in 2016, and of the paper that introduced the language model benchmark MMLU (Massive Multitask Language Understanding) in 2020. In February 2022, Hendrycks...
10 KB (860 words) - 19:15, 22 March 2025
OpenAI o1 (category Large language models)
with rumors suggesting that this experimental model had shown promising results on mathematical benchmarks. In July 2024, Reuters reported that OpenAI was...
13 KB (1,349 words) - 01:41, 28 March 2025
PaLM (redirect from Pathways Language Model)
PaLM (Pathways Language Model) is a 540 billion-parameter dense decoder-only transformer-based large language model (LLM) developed by Google AI. Researchers...
13 KB (807 words) - 13:21, 13 April 2025
Gemini Ultra, in benchmark tests at the time. Sonnet and Haiku are Anthropic's medium- and small-sized models, respectively. All three models can accept image...
31 KB (2,841 words) - 09:41, 26 April 2025
GPT-4o (category Large language models)
Massive Multitask Language Understanding (MMLU) benchmark compared to 86.5 for GPT-4. Unlike GPT-3.5 and GPT-4, which rely on other models to process sound...
23 KB (2,244 words) - 21:49, 3 May 2025
The term benchmark, bench mark, or survey benchmark originates from the chiseled horizontal marks that surveyors made in stone structures, into which an...
10 KB (1,053 words) - 15:42, 10 February 2025
Stochastic parrot (redirect from On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?)
the theory that large language models, though able to generate plausible language, do not understand the meaning of the language they process. The term...
22 KB (2,397 words) - 07:34, 27 March 2025
Perplexity (category Language modeling)
q = \tilde{p}. In natural language processing (NLP), a corpus is a structured collection of texts or documents, and a language model is a probability distribution...
12 KB (1,865 words) - 13:50, 11 April 2025
Reflection (artificial intelligence) (redirect from Large reasoning model)
artificial intelligence, notably used in large language models, specifically in Reasoning Language Models (RLMs), is the ability for an artificial neural...
18 KB (1,937 words) - 09:11, 21 April 2025
Prompt engineering (redirect from In-context learning (natural language processing))
intelligence (AI) model. A prompt is natural language text describing the task that an AI should perform. A prompt for a text-to-text language model can be a query...
43 KB (4,790 words) - 18:46, 21 April 2025
Grok (chatbot) (redirect from Aurora (text-to-image model))
artificial intelligence chatbot developed by xAI. Based on the large language model (LLM) of the same name, it was launched in 2023 as an initiative by...
47 KB (4,189 words) - 11:15, 29 April 2025