Chinchilla (language model)
Chinchilla is a family of large language models (LLMs) developed by the research team at Google DeepMind and presented in March 2022. It is named "Chinchilla" because it is a further development of a previous model family named Gopher. Both model families were trained in order to investigate the scaling laws of large language models. It was claimed to outperform GPT-3, and it considerably simplifies downstream use because it requires much less compute for inference and fine-tuning. Based on the training of previously employed language models, DeepMind determined that if one doubles the model size, one must also double the number of training tokens; this hypothesis was used to train Chinchilla. For roughly the same training cost as Gopher, Chinchilla has 70B parameters and four times as much data. Chinchilla has an average accuracy of 67.5% on the Measuring Massive Multitask Language Understanding (MMLU) benchmark, which is 7% higher ...
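The doubling rule above amounts to keeping the training-token count proportional to the parameter count. A minimal sketch of that trade-off, assuming the commonly cited figure of roughly 20 tokens per parameter and the rough estimate of about 6 FLOPs per parameter per token (neither number is stated in this excerpt):

```python
# Sketch of the compute-optimal trade-off described above: token count scales
# linearly with parameter count. The 20-tokens-per-parameter ratio and the
# C ~= 6*N*D FLOPs rule of thumb are commonly cited approximations, not values
# quoted in this excerpt.

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Roughly compute-optimal number of training tokens for a model of n_params."""
    return n_params * tokens_per_param

def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

if __name__ == "__main__":
    n = 70e9                              # Chinchilla-sized model: 70B parameters
    d = chinchilla_optimal_tokens(n)      # ~1.4e12 tokens under the heuristic
    print(f"tokens: {d:.2e}, training FLOPs: {approx_training_flops(n, d):.2e}")
```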



Large Language Models
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pretrained transformers (GPTs), which are largely used in generative chatbots such as ChatGPT or Gemini. LLMs can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on. History: Before the emergence of transformer-bas ...


Google DeepMind
DeepMind Technologies Limited, trading as Google DeepMind or simply DeepMind, is a British–American artificial intelligence research laboratory which serves as a subsidiary of Alphabet Inc. Founded in the UK in 2010, it was acquired by Google in 2014 and merged with Google AI's Google Brain division to become Google DeepMind in April 2023. The company is headquartered in London, with research centres in the United States, Canada, France, Germany, and Switzerland. DeepMind introduced neural Turing machines (neural networks that can access external memory like a conventional Turing machine), resulting in a computer that loosely resembles short-term memory in the human brain. DeepMind has created neural network models to play video games and board games. It made headlines in 2016 after its AlphaGo program beat a human professional Go player Lee Sedol, a world champion, in a five-game match, which was the subject of a documentary film. A more general program, AlphaZero, ...



Chinchilla
Chinchilla refers to either of two species (Chinchilla chinchilla and Chinchilla lanigera) of crepuscular rodents of the parvorder Caviomorpha, native to the Andes mountains in South America. They live at high elevations in colonies called "herds". Historically, chinchillas lived in an area that included parts of Bolivia, Peru and Chile, but today, colonies in the wild are known only in Chile. Along with their relatives, viscachas, they make up the family Chinchillidae. They are also related to the chinchilla rat. The chinchilla has the densest fur of all extant terrestrial mammals, with around 20,000 hairs per square centimeter and 50 hairs growing from each follicle. The chinchilla is named after the Chincha people of the Andes, who once wore its dense, velvet-like fur and ate its meat. By the end of the 19th century, chinchillas had become quite rare after being hunted for their notably soft fur. Most chinchillas currently used by the fur ind ...



Neural Scaling Law
In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost. Introduction: In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate (e.g., the test set error rate). Each of these variables can be defined as a real number, usually written as N, D, C, L (respectively: parameter count, dataset size, computing cost, and loss). A neural scaling law is a theoretical or empirical statistical law between these parameters. There are also other parameters with other scaling laws. Size of the model: In most cases, the model's size is simply the number of parameters. However, ...
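As a concrete illustration, a minimal sketch of fitting a simple power-law relationship L(N) = a · N^(-alpha) between parameter count and loss; the synthetic data points and the pure power-law form are illustrative assumptions (published scaling laws typically include an irreducible-loss term):

```python
# Minimal sketch of fitting an empirical power-law scaling curve L(N) = a * N^(-alpha).
# The data and the simple log-log fit are illustrative, not taken from this excerpt.
import numpy as np

# Hypothetical (parameter count, loss) measurements from training runs.
N = np.array([1e7, 1e8, 1e9, 1e10])
L = np.array([4.2, 3.5, 2.9, 2.4])

# Fit log L = log a - alpha * log N with ordinary least squares.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha, a = -slope, np.exp(intercept)
print(f"alpha ~= {alpha:.3f}, a ~= {a:.3f}")

# Extrapolate the fitted law to a larger model size.
print(f"predicted loss at N=1e11: {a * (1e11) ** (-alpha):.3f}")
```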


GPT-3
Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor, GPT-2, it is a decoder-only transformer model, a deep neural network that supersedes recurrence- and convolution-based architectures with a technique known as "attention". This attention mechanism allows the model to focus selectively on segments of input text it predicts to be most relevant. GPT-3 has 175 billion parameters, each with 16-bit precision, requiring 350 GB of storage since each parameter occupies 2 bytes. It has a context window size of 2048 tokens, and has demonstrated strong "zero-shot" and "few-shot" learning abilities on many tasks. On September 22, 2020, Microsoft announced that it had licensed GPT-3 exclusively. Others can still receive output from its public API, but only Microsoft has access to the underlying model. Background: According to The Economist, improved algorithms, more powerful computers, and a recent increase i ...
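The storage figure quoted above is simple arithmetic (parameter count times bytes per parameter). A quick check, using decimal gigabytes (1 GB = 1e9 bytes), which matches the article's round figure:

```python
# Check of the storage figure quoted above: 175 billion 16-bit parameters
# at 2 bytes each.
n_params = 175e9
bytes_per_param = 2                     # 16-bit (half) precision
total_bytes = n_params * bytes_per_param
print(f"{total_bytes / 1e9:.0f} GB")    # -> 350 GB
```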


Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several other versions and spin-offs, such as MMLU-Pro, MMMLU and MMLU-Redux. Overview: MMLU consists of 15,908 multiple-choice questions, with 1,540 of them used to select and assess optimal settings for models, such as temperature, batch size and learning rate. The questions span 57 subjects, from highly complex STEM fields and international law to nutrition and religion. It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024. The benchmark was released by Dan Hendrycks and a team of researchers on 7 September 2020. It was purpose-made to be more challenging than existing benchmarks at the time, such as General Language Understanding Evaluation (GLUE), as models began outperforming humans in easier tests. When MMLU was relea ...
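Scoring on a benchmark of this kind reduces to comparing each model pick against the keyed option and averaging. A minimal sketch with a hypothetical two-question toy set and a stub pick_answer function, not the benchmark's actual data or evaluation harness:

```python
# Illustrative MMLU-style scoring: each question has four options and the
# model's pick is compared with the keyed answer. The tiny dataset and the
# pick_answer stub are hypothetical stand-ins.
from typing import Callable

questions = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "H2O is commonly called?", "choices": ["salt", "water", "air", "sand"], "answer": 1},
]

def accuracy(pick_answer: Callable[[str, list], int]) -> float:
    correct = sum(pick_answer(q["question"], q["choices"]) == q["answer"] for q in questions)
    return correct / len(questions)

# A trivial "model" that always picks option B (index 1), just to exercise the loop.
print(accuracy(lambda question, choices: 1))   # -> 1.0 on this toy set
```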



Vision-language Model
In artificial intelligence (AI), a foundation model (FM), also known as large X model (LxM), is a machine learning or deep learning model trained on vast datasets so that it can be applied across a wide range of use cases (Competition and Markets Authority (2023), AI Foundation Models: Initial Report, available at: https://assets.publishing.service.gov.uk/media/65081d3aa41cc300145612c0/Full_report_.pdf). Generative AI applications like large language models (LLMs) are common examples of foundation models. Building foundation models is often highly resource-intensive, with the most advanced models costing hundreds of millions of dollars to cover the expenses of acquiring, curating, and processing massive datasets, as well as the compute power required for training. These costs stem from the need for sophisticated infrastructure, extended training times, and advanced hardware, such as GPUs. In contrast, adapting an existing foundation model for a specific task or using it directly i ...



Transformer (machine learning model)
The transformer is a deep learning architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM). Later variations have been widely adopted for training large language models (LLMs) on large (language) datasets. The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Google. Transformers were first developed as an improvement ...
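A minimal sketch of the scaled dot-product attention step described above, in NumPy; the single head, the absence of masking, and the absence of learned projections are simplifying assumptions:

```python
# Minimal scaled dot-product attention: each token attends to every other token
# and is re-expressed as a weighted mix of value vectors.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays; returns contextualized token vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise token-to-token relevance
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))             # 4 tokens, 8-dim embeddings
print(attention(x, x, x).shape)         # -> (4, 8)
```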



GPT-2
Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was partially released in February 2019, followed by full release of the 1.5-billion-parameter model on November 5, 2019. GPT-2 was created as a "direct scale-up" of GPT-1 with a ten-fold increase in both its parameter count and the size of its training dataset. It is a general-purpose learner and its ability to perform the various tasks was a consequence of its general ability to accurately predict the next item in a sequence, which enabled it to translate texts, answer questions about a topic from a text, summarize passages from a larger text, and generate text output on a level sometimes indistinguishable from that of humans; however, it ...
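The "predict the next item, then feed it back in" loop mentioned above is the core of autoregressive generation. A toy sketch, with a hand-written lookup table as a hypothetical stand-in for a trained network:

```python
# Toy illustration of the next-item prediction loop: downstream abilities come
# from repeatedly predicting the next token and appending it to the context.
def next_token(context: list) -> str:
    # Stand-in predictor: a bigram table instead of a neural network.
    table = {"the": "cat", "cat": "sat", "sat": "down"}
    return table.get(context[-1], "<eos>")

def generate(prompt: list, max_new_tokens: int = 5) -> list:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # predict the next item in the sequence
        if tok == "<eos>":
            break
        tokens.append(tok)         # feed the prediction back in (autoregression)
    return tokens

print(generate(["the"]))           # -> ['the', 'cat', 'sat', 'down']
```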




RMSNorm
In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely data normalization and activation normalization. Data normalization (or feature scaling) includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is min-max normalization, where each feature is transformed to have the same range (typically [0, 1] or [-1, 1]). This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers. Activation normalization, on the other hand, is specific to deep learning, and includes methods that rescale the activation of hidden neurons inside neural networks. Normalization is often used to: * increase the speed of training convergence, * reduce sensitivity to variations and featur ...
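Since this excerpt is linked from the RMSNorm entry, here is a minimal sketch of both flavours: min-max feature scaling as described above, and RMSNorm-style activation normalization using the standard definition y = x / rms(x) · gamma (the RMSNorm formula is an assumption; the truncated text does not spell it out):

```python
# Minimal sketches of data normalization (min-max) and activation
# normalization (RMSNorm-style).
import numpy as np

def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Data normalization: rescale each feature (column) to the [0, 1] range."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def rms_norm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Activation normalization: divide by the root mean square of the activations."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

features = np.array([[1.0, 2000.0], [3.0, 8000.0], [2.0, 5000.0]])
print(min_max_scale(features))                  # each column now spans [0, 1]

activations = np.array([[0.5, -1.2, 3.0, 0.1]])
print(rms_norm(activations, gamma=np.ones(4)))  # unit-RMS activations
```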



Adam Optimizer
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems, this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning. Background: Both statistical estimation and ma ...
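A minimal sketch of the idea: at each step, estimate the gradient from a random subset (minibatch) of the data rather than the full dataset. The linear-regression objective and fixed learning rate are illustrative choices, not part of this excerpt:

```python
# Minibatch SGD on a least-squares problem: the gradient is estimated from a
# random subset of the data at every step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # full dataset
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)   # random subset of the data
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size     # minibatch gradient estimate
    w -= lr * grad                                   # gradient step
print(w)    # close to [2.0, -1.0, 0.5]
```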