Landscape of LLMs¶
Author: Hetul Patel | Published on: 6 Dec, 2023
Pretrained LLMs¶
Open-source LLMs¶
References
-
Can we stop relying on proprietary LLMs to evaluate open LLMs?
Evaluation
Open LLM
Proprietary LLM
GPT-4
Feedback Collection dataset
Prometheus model
Using proprietary LLMs like GPT-4 to evaluate open LLMs has limitations. The Feedback Collection dataset and the Prometheus model aim to close the gap between open and closed models by providing a way to evaluate open LLMs without relying on proprietary models.
-
MosaicML releases MPT-30B, a 30 billion parameter LLM that outperforms GPT-3
LLM
Open Source
Machine Learning
Artificial Intelligence
MosaicML has released MPT-30B, a 30 billion parameter LLM that outperforms the original GPT-3 175 billion parameter model. It is fully open source for commercial use and comes with two fine-tuned variants: MPT-30B-Instruct and MPT-30B-Chat. MPT-30B-Chat is available to play with on HuggingFace, powered by MosaicML Inference. If you want to start using MPT-30B in production, you can customize and deploy it using MosaicML Training and MosaicML Inference.
-
OpenChat Surpasses ChatGPT Performance With Open-Source Model
LLM
OpenAI
ChatGPT
NLP
Machine Learning
OpenChat has developed a new language model, Orca, that outperforms ChatGPT on the Vicuna benchmark. Orca was trained on a smaller dataset than ChatGPT, but achieved better performance by using a more efficient training method. OpenChat has made Orca open-source, so that other researchers can build on its success.
-
The Latest Advancements in Large Language Models: Unveiling Llama 2, Code Llama, and More
LLM
Llama 2
Code Llama
GPT-4
OpenAI
Finetuning
Transformer-based LLMs
NeurIPS LLM Efficiency Challenge
The article discusses the latest advancements in large language models (LLMs), including the release of Meta's Llama 2 and Code Llama models, the leaked GPT-4 model details, OpenAI's new finetuning API, and the NeurIPS LLM Efficiency Challenge. It provides a comprehensive overview of the key features, capabilities, and potential applications of these models, while also highlighting ongoing challenges and debates in the field of LLMs.
-
Announcing Mistral 7B: The Most Powerful Language Model For Its Size
language-models
machine-learning
artificial-intelligence
The Mistral AI team has released Mistral 7B, a 7.3B parameter language model that outperforms Llama 2 13B on all metrics. It is easy to fine-tune on any task and is released under the Apache 2.0 license.
-
Hugging Face Unveils Zephyr-7b: A State-of-the-Art 7B Chatbot
LLM
Chatbot
Natural Language Processing
Artificial Intelligence
Hugging Face has released Zephyr-7b, a 7B chatbot that outperforms other models in its class on the MT Bench and Open LLM Leaderboard. The model was trained using a combination of instruction fine-tuning and Direct Preference Optimization on publicly available datasets. It is available to try out on the Hugging Face website.
-
LaMini-LM: Can Small Language Models Compete with Large Ones?
language models
parameter scale
computational requirements
LaMini-LM
distilled instructions
LaMini-LM is a small language model trained on a large collection of distilled instructions, designed to achieve impressive results while running locally. This article delves into the details of LaMini-LM and shows how modest its computational requirements are.
-
Open Source LLaMA 13B Released with Full Commercial Usage Rights
Open Source LLaMA
RedPajama Dataset
SlimPajama Dataset
Code Generation
Commercial Usage
Energy Efficiency
OpenLM Research has released a fully open source version of the LLaMA 13B model, trained on the RedPajama dataset. The model weights are available in both JAX and PyTorch. The model is not ideal for code generation due to its treatment of empty spaces, but it remains one of the best open source models to build on top of. The authors are considering training future releases on the SlimPajama dataset, a cleaned version of the RedPajama dataset that is 49% smaller.
-
Meet Notus-7B: Data Curation and Open Science go a long way in shaping AI's future
Open Source LLM
RLHF
DPO
Llama 1 & 2 opened the floodgates of open source LLMs. MistralAI, loosely inspired by the success of Llama 2, released the most powerful 7B base LLM. The Hugging Face H4 team released Zephyr, trained on a mix of publicly available, synthetic datasets using DPO. TsinghuaNLP released the UltraChat dataset, a large-scale, multi-round dialogue dataset. OpenBMB released the UltraFeedback dataset, a large-scale, fine-grained, diverse preference dataset for RLHF and DPO. The Hugging Face H4 team fine-tuned Zephyr using UltraChat (supervised fine-tuning) and UltraFeedback (DPO for alignment). ArgillaIO fixed some data issues and improved on Zephyr to release Notus-7B.
Prompt Engineering¶
References
-
Prompt Engineering
Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help to better understand the capabilities and limitations of large language models (LLMs).
Training LLMs¶
References
-
Efficient Deep Learning Optimization Libraries for Large Language Model Training
DeepSpeed
Megatron-DeepSpeed
FairScale
Megatron-LM
Colossal-AI
BMTrain
Mesh TensorFlow
max text
Alpa
GPT-NeoX
This article provides an overview of various deep learning optimization libraries that can simplify and optimize the training process for large language models. These libraries offer features such as distributed training, model parallelism, and efficient training algorithms, enabling researchers and practitioners to achieve better results with less effort.
-
Training vs Prompt Engineering
Task Diversity for OOD Robustness
Self-Instruction for Dataset Generation
Self-Consistency for Higher Performance
Evaluation
This MLOps Community podcast with Mark Huang discusses various LLM training techniques, including training vs prompt engineering, task diversity for OOD robustness, self-instruction for dataset generation, self-consistency for higher performance, and evaluation.
-
Deploying RLHF with 0 Annotations: A Case Study
real-world case-study
reducing manual effort
RLHF
translation quality
reward model
user-designated pair
regression model
Allen AI's library RL4LMs
T5/Flan-T5
HF Trainer
Sentence Transformers Cross-Encoders
This article presents a real-world case study of deploying RLHF with 0 annotations. It describes the challenges faced by a large translation company in SE Asia, and how RLHF was used to reduce manual effort in producing domain-specific vocabulary and robotic translations. The article also discusses the tools and libraries used, and provides a key takeaway for readers.
-
X-LLM: A Framework for Training Multimodal Language Models
Multimodal Language Models
X-LLM
Image Captioning
Text-to-Speech
Multimodal Question Answering
The paper proposes a new framework, X-LLM, for training multimodal language models. X-LLM consists of three main components: single-modal encoders, X2L interfaces, and a large language model (LLM). The authors evaluate X-LLM on a variety of tasks and show that it achieves state-of-the-art results.
-
TRL: A Full-Stack Transformer Language Model with Reinforcement Learning
Reinforcement Learning
Transformer Language Models
Supervised Fine-tuning
Reward Modeling
Proximal Policy Optimization
TRL is a full-stack library that provides tools for training transformer language models and stable diffusion models with Reinforcement Learning. It is built on top of the transformers library by 🤗 Hugging Face and supports most decoder and encoder-decoder architectures.
-
Is Reinforcement Learning Really Necessary for Large Language Models?
language models
reinforcement learning
direct preference optimization
The paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" introduces a novel algorithm that gets rid of the two stages of RL, namely - fitting a reward model, and training a policy to optimize the reward via sampling. This new algorithm, called Direct Preference Optimization (DPO), trains the LLM using a new loss function which encourages it to increase the likelihood of the better completion and decrease the likelihood of the worse completion. DPO has been shown to achieve comparable performance to RL-based methods, but is much simpler to implement and scale.
Supervised Finetuning¶
References
-
Fine-tuning Llama-2 on your own data
LLM
Fine-tuning
Natural Language Processing
The new script allows for fine-tuning Llama-2 on your own data in just a few lines of code. It handles single- and multi-GPU setups and can even be used to train the 70B model on a single A100 GPU by leveraging 4-bit quantization.
-
Fine-tuning LLMs for specific tasks
LLM
fine-tuning
performance
The author of the ReAct paper explores the effects of fine-tuning LLMs on specific tasks. They found that fine-tuning significantly improves performance when using the LLM as an agent. The key is to fine-tune each module to tailor it to specific tasks.
-
A discussion on various LLM fine-tuning techniques
lora
adapter
prompt tuning
rl based policy finetuning
The post discusses various LLM fine-tuning techniques. It covers LORA, adapters, prompt tuning and RL based policy finetuning. The discussion revolves around the advantages and disadvantages of each technique and the scenarios where they are most suitable.
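To make the LoRA option concrete, below is a toy PyTorch sketch of the core idea: the pretrained weight stays frozen while a low-rank update scaled by alpha/r is learned on top. Names, rank, and dimensions are illustrative only:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 768])
```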
-
Fine-tuning Mistral-7b with QLoRA on Google Colab
LLM
Mistral-7b
QLoRA
Hugging Face
TRL
PEFT
The article describes how to fine-tune the Mistral-7b language model using QLoRA on Google Colab. This can be done using the TRL and PEFT tools from the Hugging Face ecosystem. The article also includes links to the Google Colab notebook and a GitHub thread with more information.
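As a rough sketch of what such a setup typically looks like with the Hugging Face stack (transformers, bitsandbytes, and PEFT), the snippet below loads a base model in 4-bit NF4 and attaches LoRA adapters; the checkpoint name, target modules, and hyperparameters are placeholders, and exact APIs can differ across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # example checkpoint

# Load the base model quantized to 4-bit NF4 (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach small trainable LoRA adapters on the attention projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, training proceeds with TRL's SFTTrainer or a plain Trainer loop.
```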
-
InstructGPT
T0
The Turking Test
FLAN
Natural Instructions
Instruction-tuning is a method for improving the performance of language models on a given task by providing them with additional instructions. This can be done by either fine-tuning the model on a dataset of instructions or by using a pre-trained model and providing it with instructions at inference time. Instruction-tuning has been shown to be effective for a variety of tasks, including text summarization, question answering, and machine translation.
-
LLM Reasoning Capabilities Improve with Increased Parameters
reasoning
structured outputs
fine-tuning
A survey of llama2-chat models shows that reasoning capabilities improve as the number of parameters increases. However, structured outputs remain a challenge. This suggests that fine-tuning for better structured data extraction could potentially help.
-
Finetuning
In-context learning
Retrieval augmentation
Embedding finetuning
LLM finetuning
LlamaIndex integrations
Finetuning a model involves updating the model itself over a set of data to improve the model in various ways. This can include improving the quality of outputs, reducing hallucinations, memorizing more data holistically, and reducing latency/cost. The core of our toolkit revolves around in-context learning / retrieval augmentation, which involves using the models in inference mode and not training the models themselves. While finetuning can be also used to “augment” a model with external data, finetuning can complement retrieval augmentation in a variety of ways.
-
T-Few Finetuning: Efficient Training and Scalable Serving of Large Language Models
large language models
finetuning
T-Few
training efficiency
serving scalability
T-Few finetuning is a technique that selectively updates only a fraction of the model's weights, thus reducing training time and computational resources. It also enables model stacking, which allows for the concurrent inference of multiple finetunes, maximizing GPU utilization and improving serving scalability.
-
How to Fine-tune Llama 2 Embeddings for Better Retrieval Performance
LLM
RAG
Embedding Finetuning
LlamaIndex
This article provides a step-by-step guide on how to fine-tune Llama 2 embeddings for better retrieval performance in RAG systems. The guide includes instructions on how to generate training data, fine-tune the embedding model, and evaluate the performance of the fine-tuned model.
-
RL4LMs: A Modular RL Library for Fine-Tuning Language Models to Human Preferences
language models
reinforcement learning
natural language processing
RL4LMs is a modular RL library for fine-tuning language models to human preferences. It provides easily customizable building blocks for training language models, including implementations of on-policy algorithms, reward functions, metrics, datasets, and LM-based actor-critic policies.
-
Exploring Alternatives to RLHF for Fine-Tuning Large Language Models
Large Language Models
Supervised Fine-Tuning
Reinforcement Learning from Human Feedback
Direct Preference Optimization
Chain of Hindsight
This blog explores alternatives to Reinforcement Learning from Human Feedback (RLHF) for fine-tuning large language models. The alternatives discussed include supervised fine-tuning and direct preference optimization. The blog also provides a hands-on guide to preparing human preference data and using the Transformers Reinforcement Learning library to fine-tune a large language model using direct preference optimization.
Evaluating LLMs¶
References
-
LMFlow Benchmark: An Automatic Evaluation Framework for Open-Source LLMs
LLM Evaluation
Chatbot Arena
GPT-4
LMFlow Benchmark
The paper introduces LMFlow Benchmark, a cheap and easy-to-use automatic evaluation framework that can help reflect different aspects of LLMs.
-
LLM Evaluation
RAG
Hallucinations
Metrics
This article discusses various techniques for evaluating LLM performance, including hallucination detection and metrics-based approaches. It also provides a framework for optimizing LLM performance using RAG and fine-tuning.
-
A Metrics-First Approach to LLM Evaluation
LLM Evaluation
Human Evaluation
Traditional Metrics
Galileo Metrics
The industry has started adopting LLMs for various applications, but evaluating their performance is challenging. Human evaluation is costly and prone to errors, traditional metrics have poor correlations with human judgment, and reliable benchmarks are absent. Galileo has built metrics to help evaluate LLMs in minutes instead of days.
-
Evaluation Driven Development for LLM Apps
Evaluation Driven Development
LLM
EDD
Stochastic nature of LLMs
LlamaIndex
Retrieval methods
Comparing LLMs
The article discusses the importance of Evaluation Driven Development (EDD) for building LLM apps. It provides a step-by-step guide to EDD, including defining evaluation metrics, defining an evaluation dataset, and trying out different approaches. The article also highlights the importance of EDD for mitigating the risks associated with the stochastic nature of LLMs. Finally, the article provides links to additional resources on EDD.
-
How to Evaluate Chatbots with Large Language Models
Chatbots
LLM
RAG
Evaluation
MLflow
This article explores how to evaluate chatbots with large language models (LLMs). It discusses the use of LLMs as judges for automated evaluation, and provides best practices for using LLM judges. The article also discusses the importance of using use-case-specific benchmarks for evaluation.
-
How to Monitor NDCG for Ranking Models in Production
Ranking models
NDCG
ML observability
Model monitoring
Machine learning
This article provides a comprehensive guide to monitoring Normalized Discounted Cumulative Gain (NDCG) for ranking models in production. It covers the intuition behind NDCG, its calculation, and how it can be used to evaluate the performance of ranking models. Additionally, the article discusses the challenges of maintaining ranking models in production and how ML observability can help.
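For reference, the core NDCG calculation can be written in a few lines; the relevance labels below are made up:

```python
import numpy as np

def dcg(relevances):
    """Discounted Cumulative Gain for a ranked list of relevance scores."""
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return np.sum((2 ** relevances - 1) / np.log2(positions + 1))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of items in the order the model ranked them.
print(ndcg([3, 2, 3, 0, 1, 2]))  # ~0.95: close to the ideal ordering
```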
-
Index Metrics for Evaluating Recommender System Performance
Recommender Systems
Evaluation Metrics
Hit Ratio
MRR
Precision
Recall
MAP
NDCG
Recommender systems output a ranking list of items. Hit ratio, MRR, Precision, Recall, MAP, NDCG are commonly used metrics to evaluate the performance of recommender systems.
LLMs Deployment¶
References
-
Model Serving Frameworks for 2023
Model Serving
AI
Machine Learning
MLOps
The article provides a comprehensive list of model serving frameworks for AI applications in 2023. It highlights the benefits and features of each framework, including BentoML, Jina, and Torchserve, and emphasizes their importance in the MLOps process.
-
vLLM: A High-Throughput Library for Large Language Model Serving
LLM
machine learning
artificial intelligence
natural language processing
vLLM is an open-source library for fast LLM inference and serving. It utilizes PagedAttention, a new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.
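A minimal offline-inference sketch with vLLM's Python API is shown below; the checkpoint name and sampling settings are placeholders:

```python
from vllm import LLM, SamplingParams

# Load a model once; PagedAttention manages the KV cache behind the scenes.
llm = LLM(model="facebook/opt-125m")  # placeholder checkpoint

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching improve throughput?",
]

# Prompts are batched automatically across requests.
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```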
-
How to Optimize Latency for Open Source Language Models
Optimization
Latency
LLM
Model Serving
Inference
This study explores various approaches to optimizing latency for open-source LLMs. The author evaluates the effectiveness of different tools and techniques, including CTranslate2, TGI, bitsandbytes, AutoGPTQ, ExLlama, vLLM, and HuggingFace's hosted inference platform. The results show that vLLM is currently the fastest solution for distributed inference, while HuggingFace's hosted inference platform offers the best performance for single-GPU inference.
-
How to Optimize Large Language Model (LLM) Inference
Large Language Model
LLM
Inference
Optimization
Machine Learning
This article provides best practices for optimizing LLM inference, including identifying the optimization target, paying attention to the components of latency, utilizing memory bandwidth, batching, and exploring deeper systems optimizations. It also discusses hardware configurations and the importance of data-driven decisions.
-
HuggingFace
LLM
Rust
Python
gRPC
Docker
CUDA
NCCL
OpenTelemetry
quantization
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). It implements many features such as optimized models, tensor parallelism, and distributed tracing. TGI can be installed locally or used as a Docker container.
-
Machine Learning
Natural Language Processing
Text Embeddings
Serverless Computing
TEI is a new server for sentence or document embedding that is optimized for speed and efficiency. It is based on the candle Rust backend and does not require torch, making it very small and lightweight. TEI is a step towards real ML serverless and has the potential to make it easier to use multimodal embeddings in production.
-
Text Generation Inference: A Rust, Python, and gRPC toolkit
HuggingFace
Hugging Chat
Inference API
Inference Endpoint
Large Language Models (LLMs)
Llama
Falcon
StarCoder
BLOOM
GPT-NeoX
Open Telemetry
Prometheus
Tensor Parallelism
Server-Sent Events (SSE)
transformers.LogitsProcessor
Custom Prompt Generation
Fine-tuning Support
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). It supports many features such as simple launcher, production readiness, tensor parallelism, token streaming, continuous batching, optimized transformers code, quantization, watermarking, logits warper, stop sequences, log probabilities, custom prompt generation, and fine-tuning support.
-
LoRAX: The LLM Inference Server that Speaks for the GPUs
LLM
LoRA
GPU
Cloud
Predibase
LoRAX is a new kind of LLM inference solution designed to make it cost effective and scalable to serve many fine-tuned models in production at once, conserving precious GPUs by dynamically exchanging in and out fine-tuned LoRA models within a single LLM deployment.
Running LLMs Locally¶
References
-
Run Large Language Models on Your CPU with Llama.cpp
LLM
Inference
CPU
GPU
ChatGPT
Vicuna
GPT4ALL
Alpaca
ggml
This article explains how to set up llama.cpp on your computer to run large language models on your CPU. It focuses on Vicuna, a chat model behaving like ChatGPT, but also shows how to run llama.cpp for other language models.
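The article works with the llama.cpp command-line tools; as an alternative sketch, the same model files can be used from Python through the llama-cpp-python bindings, roughly as shown below (the model path is a placeholder for a quantized GGML/GGUF file you have downloaded):

```python
from llama_cpp import Llama

# Path to a locally downloaded, quantized model file (placeholder).
llm = Llama(model_path="./models/vicuna-7b.q4_0.gguf", n_ctx=2048, n_threads=8)

out = llm(
    "Q: What is llama.cpp used for?\nA:",
    max_tokens=64,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```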
-
h2oGPT - 100% Private, 100% Local Chat with a GPT
LLM
h2oGPT
Open Source
Private
Local
This video shows how to install and use h2oGPT, an open-source large language model (LLM), on a local computer for private, local chat with a GPT.
-
Run Large Language Models on Your Own Computer with llama.cpp
Large Language Models
Llama.cpp
NVIDIA CUDA
Ubuntu 22.04
This blog post provides a step-by-step guide for running the Llama-2 7B model using llama.cpp, with NVIDIA CUDA and Ubuntu 22.04.
-
Get up and running with Llama 2 and other large language models locally
LLM
Ollama
Modelfile
Docker
REST API
This article provides instructions on how to get up and running with Llama 2 and other large language models locally. It covers topics such as installing Docker, downloading models, customizing prompts, and using the REST API.
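As a small illustration, Ollama's local REST API can be called from Python roughly as follows (assuming the Ollama server is running on its default port and the llama2 model has already been pulled):

```python
import json
import requests

# Ollama's local server listens on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?"},
    stream=True,
)

# The endpoint streams one JSON object per line until "done" is true.
for line in resp.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
```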
-
GPT4All: A Free, Local, Privacy-Aware Chatbot
privacy
local
chatbot
GPT4All is a free-to-use, locally running chatbot that does not require a GPU or internet connection. It is designed to be privacy-aware and does not collect or store any user data.
-
LocalAI: An Open Source OpenAI Alternative
LLM
OpenAI
gpt-3
localai
LocalAI is a free, open-source alternative to OpenAI that allows you to run LLMs, generate images, audio, and more locally or on-prem with consumer-grade hardware. It does not require a GPU and supports multiple model families that are compatible with the ggml format.
-
LocalGPT: Chat with your documents on your local device using GPT models
localgpt
gpt-3
language-models
privacy
security
LocalGPT is an open-source initiative that allows you to converse with your documents without compromising your privacy. With everything running locally, you can be assured that no data ever leaves your computer.
-
Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac)
GPU
CPU
Linux
Windows
Mac
Llama 2
gradio UI
Generative Agents/Apps
This project enables users to run any Llama 2 model locally with a gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). It uses llama2-wrapper as the local llama2 backend for Generative Agents/Apps.
Semantic Cache for LLMs¶
References
-
GPTCache: Semantic Cache for LLMs
LLM
Semantic Caching
LangChain
Llama Index
GPTCache is a semantic cache for LLMs that helps reduce the cost and latency of LLM API calls. It uses embedding algorithms to convert queries into embeddings and uses a vector store for similarity search on these embeddings. This allows GPTCache to identify and retrieve similar or related queries from the cache storage, thereby increasing cache hit probability and enhancing overall caching efficiency.
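The snippet below is not the GPTCache API itself, but a bare-bones sketch of the underlying idea under stated assumptions: embed the incoming query, look for a cached query above a similarity threshold, and only call the LLM on a miss. The embedding model and threshold are arbitrary choices, and a real deployment would use a vector store instead of a Python list:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (embedding, answer) pairs; a vector store in practice

def cached_llm_call(query, llm_call, threshold=0.85):
    q = embedder.encode(query, normalize_embeddings=True)
    # Similarity search over cached query embeddings (cosine via dot product).
    for emb, answer in cache:
        if float(np.dot(q, emb)) >= threshold:
            return answer            # cache hit: skip the expensive API call
    answer = llm_call(query)         # cache miss: call the LLM
    cache.append((q, answer))
    return answer

answer1 = cached_llm_call("What is semantic caching?", lambda q: "stub answer")
answer2 = cached_llm_call("what is semantic caching", lambda q: "stub answer")  # near-duplicate: likely a cache hit
```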
LLMs Inference Optimisation¶
LLM Quantization¶
References
-
BitNet: Scaling 1-bit Transformers for Large Language Models
Transformers
Quantization
LLM
BitNet is a scalable and stable 1-bit Transformer architecture designed for large language models. It achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
-
HuggingFace: An Overview of Natively Supported Quantization Schemes in Transformers
HuggingFace
Transformers
Quantization
The article provides an overview of natively supported quantization schemes in Transformers, including bitsandbytes and GPTQ. It also discusses the relation between bitsandbytes and GPTQ, and compares the performance of GPTQ with bitsandbytes nf4.
-
Hugging Face Optimum GPTQ Quantization
Hugging Face
Optimum
GPTQ
Quantization
LLM
NLP
This blog post introduces GPTQ quantization, a method to compress GPT models by reducing the number of bits needed to store each weight. It also provides a step-by-step tutorial on how to quantize a GPT model using the Hugging Face Optimum library.
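A hedged sketch of the load-time quantization flow using the GPTQConfig integration in transformers (which relies on Optimum and AutoGPTQ under the hood) might look like this; the checkpoint and calibration dataset are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"  # placeholder; any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a small calibration dataset to estimate quantization error.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization happens at load time (requires optimum and auto-gptq installed).
quantized = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
quantized.save_pretrained("opt-350m-gptq-4bit")
```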
-
SqueezeLLM: Efficient LLM Serving with Dense-and-Sparse Quantization
Model Compression
Quantization
Efficient Serving
SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving. This method splits weight matrices into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse part that preserves sensitive and outlier parts of the weight matrices. With this approach, SqueezeLLM is able to serve larger models with smaller memory footprint, the same latency, and yet higher accuracy and quality.
-
SqueezeLLM: Achieving 3-bit Quantization for LLM Inference Acceleration
Post-Training Quantization (PTQ)
Non-Uniform Quantization
Dense and Sparse Quantization
Memory Bottlenecked Operations
GPU Memory Optimization
Model Compression
LLM Inference Acceleration
The paper proposes SqueezeLLM, a novel Post-Training Quantization (PTQ) technique that achieves 3-bit quantization for LLM inference acceleration. It introduces non-uniform quantization and dense-and-sparse quantization to address memory bottlenecks and achieve a 230% speedup in inference. The paper also compares SqueezeLLM with other quantization techniques and demonstrates its superior performance in terms of compression and accuracy.
-
New Research Paper: Sparse Quantized Representation for Efficient Large Language Model Compression
LLM Compression
SpQR
Quantization
Falcon
LLaMA
A new research paper introduces Sparse Quantized Representation (SpQR), a new compression format and quantization technique that enables near-lossless compression of LLMs down to 3-4 bits per parameter. This technique works by recognizing and isolating outlier weights that cause large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits. The authors claim that SpQR can achieve relative accuracy losses of less than 1% in perplexity for highly accurate LLMs like Falcon and LLaMA.
-
Two Cool Releases from Last Week in the LLM Domain
RedPajama Dataset
LLM Model Family
HELM Benchmark
Cerebras Systems has released a cleaned and de-duplicated version of the RedPajama Dataset, reducing its size by 49%. Additionally, RedPajama has released a model family of 7B size, including chat, instruction fine-tuned, and base models. The instruction fine-tuned model shows promising performance on the HELM benchmark.
LLMs with Large Context Window¶
References
-
How to use 100K context window in LLMs
LLM Training
Model Size
Attention Mechanisms
This article explores techniques for speeding up training and inference of LLMs so they can use large context windows of up to 100K input tokens. It covers ALiBi positional embedding, Sparse Attention, FlashAttention, Multi-Query attention, conditional computation, and the use of 80GB A100 GPUs.
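To illustrate one of these techniques, here is a small sketch of ALiBi's linear attention bias, assuming a power-of-two number of heads; the bias is simply added to the raw attention logits in place of positional embeddings:

```python
import torch

def alibi_slopes(n_heads):
    """Geometric per-head slopes from the ALiBi paper (power-of-two head counts)."""
    start = 2 ** (-8 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    """Linear penalty on attention scores, growing with query-key distance."""
    slopes = alibi_slopes(n_heads).view(n_heads, 1, 1)
    # distance[q, k] = k - q: zero on the diagonal, negative for past tokens.
    distance = torch.arange(seq_len).view(1, 1, seq_len) - torch.arange(seq_len).view(1, seq_len, 1)
    return slopes * distance.clamp(max=0)

bias = alibi_bias(n_heads=8, seq_len=5)
print(bias.shape)  # torch.Size([8, 5, 5]); added to the raw attention logits
```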
-
XGen: A New State-of-the-Art 7B LLM with Standard Dense Attention on Up to 8K Sequence Length
LLM
NLP
Machine Learning
Artificial Intelligence
XGen is a new state-of-the-art 7B LLM with standard dense attention on up to 8K sequence length. It achieves comparable or better results than other open-source LLMs of similar model size on standard NLP benchmarks. XGen also shows benefits on long sequence modeling benchmarks and achieves great results on both text and code tasks.
Challenges with LLMs¶
References
-
Challenges in Building LLM Applications for Production
Consistency
Hallucinations
Privacy
Context Length
Data Drift
Model Updates and Compatibility
LM on the Edge
Model Size
Non-English Languages
Chat vs. Search as an Interface
Data Bottleneck
Hype Cycles and the Importance of Data
This talk discusses the challenges in building LLM applications for production. These challenges include consistency, hallucinations, privacy, context length, data drift, model updates and compatibility, LM on the edge, model size, non-English languages, chat vs. search as an interface, data bottleneck, and hype cycles and the importance of data.
-
Open challenges in LLM research
hallucinations
context learning
multimodality
new architecture
GPU alternatives
agent usability
learning from human preference
chat interface efficiency
non-English language support
The article discusses the ten major research directions in the field of LLMs, including reducing and measuring hallucinations, optimizing context length and construction, incorporating other data modalities, making LLMs faster and cheaper, designing new model architectures, developing GPU alternatives, making agents usable, improving learning from human preference, improving the efficiency of the chat interface, and building LLMs for non-English languages.
-
The Perils of Blindly Reusing Pre-trained Language Models
NLP
Transfer Learning
Model Analysis
WeightWatchers
Reusing pre-trained language models without careful consideration can lead to negative impacts on downstream tasks due to issues such as over-training, under-training, or over-parameterization. WeightWatchers is an open-source diagnostic tool that can be used to analyze DNNs without access to training or test data, helping to identify potential issues before deployment.
Large vs Small Language Models¶
References
-
Small language models can outperform LLMs in specific domains
LLM
NLP
Machine Learning
A new LLM trained by Microsoft Research achieves 51% on HumanEval with only 1.3B parameters and a 7B-token training dataset, outperforming much larger LLMs. This suggests that smaller language models can be more effective in specific domains, such as Python code generation.
-
Are Large Language Models All We Need?
LLM
Model Size
Data Quality
The author discusses the recent trend of focusing on model sizes in the field of LLMs and argues that data quality is often overlooked. They cite the example of phi-1, a 1.3B parameter Transformer-based model by Microsoft, which achieved surprisingly good results. The author concludes that we should pay more attention to data quality when developing LLMs.
LLM Applications¶
LLMs for Translation¶
References
-
ParroT: Enhancing and Regulating Translation Abilities in Chatbots with Open-Source LLMs
LLM
Translation
Chatbots
ParroT
The ParroT framework enhances and regulates the translation abilities of chatbots by leveraging open-source LLMs and human-written translation and evaluation data.
LLMs For Mobile App Developers¶
References
-
Hugging Face releases tools for Swift developers to incorporate language models in their apps
Hugging Face
Swift
transformers
Core ML
Llama
Falcon
Hugging Face has released a package and tools to help Swift developers incorporate language models in their apps, including swift-transformers, swift-chat, transformers-to-coreml, and ready-to-use LLMs such as Llama 2 7B and Falcon 7B.
LLM Assistants¶
References
-
Rust
coding assistants
best practices
The author asks for advice on how to compare coding assistants. They are concerned about using an assistant for Rust because they are not savvy enough to catch certain bugs. Kalyan KS suggests that the author try out Falcoder, a coding assistant that uses the Falcon-7B model and instruction tuning.
-
GPT-Engineer: An AI Agent That Can Write Entire Codebases
Artificial Intelligence
Machine Learning
Natural Language Processing
Programming
GPT-Engineer is an AI agent that can write entire codebases with a prompt and learn how you want your code to look. It asks clarifying questions, generates technical specifications, writes all necessary code, and lets you easily add your own reasoning steps, modify, and experiment. With GPT-Engineer, you can finish a coding project in minutes.
-
Introducing AssistGPT: A General Multi-modal Assistant
Multimodality
Language and Code
ReAct Agent
Planning and Execution
The paper introduces AssistGPT, a general multi-modal assistant that can plan, execute, inspect, and learn. It combines many of the latest trends in AI, including multimodality, language and code, and ReAct agents. The paper also includes a cool demo and discusses the latency of the system.
-
GPTeam: Building Human-like Social Behavior in Language Models
Multi-agent simulation
Human-like social behavior
Language models
Generative agents
GPTeam is a completely customizable open-source multi-agent simulation, inspired by Stanford’s ground-breaking “Generative Agents” paper. Every agent within a GPTeam simulation has their own unique personality, memories, and directives, leading to interesting emergent behavior as they interact.
-
Large Language Model
Planning
Memory
Tool Use
The article provides a comprehensive overview of building Large Language Model powered agents, including relevant papers, practical applications, and case studies.
-
Best write-up ever on LLM Agents
LLM
NLP
OpenAI
The article provides a comprehensive overview of LLM agents, including their capabilities, limitations, and potential applications. It also discusses the challenges involved in developing and deploying LLM agents, and the ethical considerations that need to be taken into account.
Retrieval Augmented Generation¶
References
-
RAG & Enterprise: A Match Made in Heaven
RAG
LLM
Enterprise Search
Information Retrieval
RAG (Retrieval-Augmented Generation) models are a powerful tool for enterprise search, as they offer flexibility, practicality, broader coverage, and interpretability. Additionally, with the help of tools like LangChain and Google Vertex, it is now easier than ever to implement RAG solutions.
-
HNSW-FINGER: Approximate Nearest Neighbor Search with Locality-Sensitive Hashing
locality-sensitive hashing
approximate nearest neighbor search
HNSW
HNSW-FINGER is a new approximate nearest neighbor search algorithm that uses locality-sensitive hashing to project the query and candidate nodes onto a center node. This allows HNSW-FINGER to achieve better accuracy and efficiency than existing approximate nearest neighbor search algorithms.
-
Vector Databases and Hierarchical Navigable Small World
Vector Databases
Machine Learning
Artificial Intelligence
Data Science
Generative AI
The article discusses the rise of vector databases in the era of generative AI and introduces Hierarchical Navigable Small World (HNSW) as an efficient indexing method. HNSW builds multiple graph layers with varying densities to optimize the search process and reduce the number of iterations required to find approximate nearest neighbors.
-
RAG-Fusion: A New Retrieval Technique for LLM
LLM
Retrieval
MultiQueryRetrieval
Reciprocal Rank Fusion
RAG-Fusion is a new retrieval technique that builds upon the idea of MultiQueryRetrieval. It generates multiple sub queries based on a user question, retrieves documents for each sub query, and merges the retrieved documents together using Reciprocal Rank Fusion.
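Reciprocal Rank Fusion itself is simple enough to sketch directly; the document ids below are placeholders:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one fused ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant from the original RRF paper (commonly 60).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results retrieved for three sub-queries generated from one user question.
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
print(fused)  # doc_b first: it ranks highly in every list
```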
-
Question Answering over Documents with Retrieval-Augmented Generation
rag
question answering
information retrieval
llm
This article describes how to build a question-answering over documents application using LLMs. The article covers the use of retrieval-augmented generation (RAG) for this task, and provides a walkthrough of how to build such an application.
-
Reordering retrieved documents to improve performance
long context
performance degradation
retrieval
When models must access relevant information in the middle of long contexts, they tend to ignore the provided documents. This issue can be avoided by reordering documents after retrieval to avoid performance degradation.
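A minimal sketch of such a reordering, assuming the retriever returns documents sorted from most to least relevant, is shown below:

```python
def reorder_for_long_context(docs_by_relevance):
    """Interleave docs so the most relevant ones sit at the start and the end."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):   # most relevant first
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]                     # least relevant end up in the middle

docs = ["most relevant", "2nd", "3rd", "4th", "least relevant"]
print(reorder_for_long_context(docs))
# ['most relevant', '3rd', 'least relevant', '4th', '2nd']
```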
-
How to improve the performance of your LLM search engine with Retrieve & Re-Rank
LLM
Semantic Search
Information Retrieval
Question Answering
This article explains how to improve the performance of your LLM search engine with Retrieve & Re-Rank. It covers lexical search, dense retrieval, semantic search and cross-encoders.
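A compact sketch of the retrieve-and-re-rank pattern with sentence-transformers is shown below; the corpus, query, and model checkpoints are illustrative choices:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "Cross-encoders jointly encode query and passage for precise scoring.",
    "Bi-encoders embed queries and passages independently for fast retrieval.",
    "BM25 is a classic lexical ranking function.",
]

# Stage 1: fast dense retrieval with a bi-encoder.
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query = "How do bi-encoders differ from cross-encoders?"
hits = util.semantic_search(bi_encoder.encode(query, convert_to_tensor=True),
                            corpus_emb, top_k=3)[0]

# Stage 2: re-rank the candidates with a slower but more accurate cross-encoder.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(query, corpus[h["corpus_id"]]) for h in hits])
for score, hit in sorted(zip(scores, hits), key=lambda x: -x[0]):
    print(round(float(score), 2), corpus[hit["corpus_id"]])
```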
-
passage retrieval
query expansion
query reranking
open-domain question answering
EAR is a query Expansion And Reranking approach for improving passage retrieval, with the application to open-domain question answering. EAR first applies a query expansion model to generate a diverse set of queries, and then uses a query reranker to select the ones that could lead to better retrieval results.
-
MS MARCO: A Large Scale Information Retrieval Corpus
information retrieval
semantic search
TREC-DL 2019
MS Marco Passage Retrieval
BM25
ElasticSearch
electra-base-model
cross-encoder
MS MARCO is a large scale information retrieval corpus created from real user search queries issued to the Bing search engine. It can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant to the query. Performance is evaluated on the TREC-DL 2019 and MS MARCO Passage Retrieval datasets, with lexical search using BM25 in ElasticSearch shown as a baseline.
-
Self-querying retriever: A new way to search for information
LLM
VectorStore
information retrieval
A self-querying retriever is a new way to search for information that uses a query-constructing LLM chain to write a structured query and then applies that structured query to its underlying VectorStore. This allows the retriever to not only use the user-input query for semantic similarity comparison with the contents of stored documents but to also extract filters from the user query on the metadata of stored documents and to execute those filters.
-
GenQ: Training Effective Dense Retrieval Models with Synthetic Queries
dense retrieval
bi-encoders
sentence transformers
text generation
synthetic data
asymmetric semantic search
query generation
T5
MNR loss
Pinecone
GenQ is a method for training effective dense retrieval models using synthetic queries. It uses a text generation model to generate queries for unlabeled passages of text, which are then used to fine-tune a bi-encoder model. GenQ can achieve performances approaching models trained with supervised methods, and it is particularly useful when we have limited labeled data.
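A rough sketch of the second stage, fine-tuning a bi-encoder on (synthetic query, passage) pairs with MNR loss using the classic sentence-transformers fit loop, could look like this; the pairs and starting checkpoint are tiny placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# (synthetic query, passage) pairs, e.g. produced by a T5 query generator.
pairs = [
    ("what is dense retrieval", "Dense retrieval embeds queries and passages..."),
    ("how are synthetic queries used", "Generated queries let us train without labels..."),
]
train_examples = [InputExample(texts=[q, p]) for q, p in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("distilbert-base-uncased")  # starting checkpoint
# MNR loss treats the other in-batch passages as negatives for each query.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
```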
-
InPars-v2: Efficient Dataset Generation for Information Retrieval with Open-Source Language Models
information retrieval
large language models
dataset generation
open-source
InPars-v2 is a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. It achieves new state-of-the-art results on the BEIR benchmark.
-
Qdrant: A Vector Database & Vector Similarity Search Engine
Vector database
Vector similarity search
Approximate nearest neighbor search
Machine learning
Artificial intelligence
Qdrant is a vector database and vector similarity search engine that can be used for building applications such as matching, searching, and recommending. It is easy to use and provides a variety of features such as support for additional payload associated with vectors, payload filtering conditions, and dynamic query planning.
-
AutoMergingRetriever: A New Algorithm for Better Retrieval and RAG
LLM
Retrieval
RAG
ChatGPT
Dynamic Retrieval
Semantic Relatedness
The AutoMergingRetriever algorithm dynamically retrieves less disparate / larger contiguous blobs of context only when you need it. This helps the LLM synthesize better results, but avoids always cramming in as much context as you can.
-
Optimizing RAG With LLMS: Exploring Chunking Techniques and Reranking for Enhanced Results
LLM
Chunking
Ranking
Retrieval Augmented Generation (RAG)
This article explores chunking techniques and reranking for enhanced results in the context of optimizing RAG with LLMs. The key points covered include strategies for optimizing RAG, using chunking techniques to streamline processing, and implementing ranking models to enhance search quality.
-
Dynamic chunk length in AutoMergingRetriever
language-models
retrieval
summarization
The AutoMergingRetriever dynamically chooses the chunk length when retrieving information, resulting in better semantic meaning and context.
-
Multi-Document Agents for Building LLM-Powered QA Systems
RAG
LLM
QA
summarization
multi-document agents
The article introduces a new approach for building LLM-powered QA systems that can scale to large numbers of documents and question types. The approach uses multi-document agents, which are able to answer a broad set of questions, including fact-based QA over single documents, summarization over single documents, fact-based comparisons over multiple documents, and holistic comparisons across multiple documents.
-
How to Improve Your RAG App: Adjusting Chunk Size
RAG
chunk size
retrieval
ranking
evaluation
Adjusting chunk size is an important step in improving the performance of a RAG app. More chunks do not always lead to better results, and reranking retrieved chunks may not necessarily improve results either. To find the optimal chunk size, it is necessary to define an evaluation benchmark and perform a sweep over chunk sizes and top-k values. The Arize AI team has provided a comprehensive Colab notebook and slides that demonstrate how to run chunk size sweeps and perform retrieval and Q&A evaluations with Phoenix and LlamaIndex.
-
How to Choose the Right Chunk Size for Your RAG System
RAG system
chunk size
response time
faithfulness
relevancy
Choosing the right chunk size for a RAG system is critical for efficiency and accuracy. The optimal chunk size strikes a balance between capturing essential information and speed. The article provides a practical evaluation setup to determine the right chunk size for a specific use case and dataset.
-
RAG-Fusion: The Next Frontier of Search Technology
Reciprocal Rank Fusion
Query Generation
Retrieval Augmented Generation
Vector Search
RAG-Fusion is a search methodology that aims to bridge the gap between traditional search paradigms and the multifaceted dimensions of human queries. It employs multiple query generation and Reciprocal Rank Fusion to re-rank search results, with the goal of unearthing transformative knowledge that often remains hidden behind top search results.
-
RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
language-models
retrieval-augmentation
compression
abstractive-summarization
extractive-summarization
We propose a method to improve the performance of retrieval-augmented language models (LMs) by compressing the retrieved documents into textual summaries. Our method, RECOMP, achieves a compression rate of as low as 6% with minimal loss in performance for both language modeling and open domain question answering tasks. We also show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
-
Optimizing Retrieval and Generation Performance in Large Language Models
RAG
Machine Learning
Knowledge Retrieval
AI
This article discusses various techniques for optimizing retrieval and generation performance in large language models, including decoupling chunks for retrieval and synthesis, using structured retrieval techniques, dynamically retrieving chunks based on tasks, and optimizing context embeddings.
-
Scaling Retrieval-Augmented LLM to 48B
LLM Scaling
Retrieval-Augmented LLM
Instruction Tuning
NVIDIA introduces Retro 48B, the largest LLM pretrained with retrieval. It shows significant perplexity improvement over GPT 43B and can be instruction-tuned more effectively, achieving +7% improvement on zero-shot question-answering tasks.
-
Parsing complex documents with embedded tables using unstructured.io and LlamaIndex
unstructured.io
LlamaIndex
SEC filings
research papers
invoices
Parsing complex documents with embedded tables can be done using unstructured.io and LlamaIndex. This is especially relevant for SEC filings, research papers, invoices, and more.
-
LLM
RAG
Best Practices
This talk discusses best practices for building production-ready RAG systems with LLMs.
-
Joint Tabular/Semantic QA over Tesla 10K
LLM
NLP
Information Retrieval
Question Answering
This article demonstrates how to ask questions over Tesla's 10K report with understanding of both the unstructured text as well as embedded tables. It utilizes Unstructured to parse out the tables and LlamaIndex recursive retrieval to index and retrieve tables if necessary given the user question.
-
New Fine-Tuning Features in LlamaIndex
fine-tuning
retrieval augmentation
structured outputs
This week, LlamaIndex added a lot of new fine-tuning features, including fine-tuning with retrieval augmentation and fine-tuning for better structured outputs.
-
SuperKnowa: Building Reliable RAG Pipelines for Enterprise LLM Applications
RAG
LLM
NLP
Generative AI
Enterprise AI
This article introduces SuperKnowa, a framework for building reliable and scalable RAG pipelines for enterprise LLM applications. It discusses the challenges of taking a RAG PoC to production and how SuperKnowa addresses these challenges. The article also provides an overview of the SuperKnowa framework and its features, including data indexing, context-aware queries, model evaluation, and debugging.
-
SEC Insights: A real-world full-stack application using LlamaIndex
LLM
RAG
SEC Insights
Tutorial
Open Source
This repository contains the code for SEC Insights, a real-world full-stack application that uses the Retrieval Augmented Generation (RAG) capabilities of LlamaIndex to answer questions about SEC 10-K & 10-Q documents. The application is open source and available on GitHub. A tutorial video is also available on YouTube.
-
Text Ranking with Pretrained Transformers
Text Ranking
Transformers
BERT
Self-supervised Learning
Natural Language Processing
Information Retrieval
This survey provides an overview of text ranking with neural network architectures known as transformers. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly.
-
8 Key Considerations for Building Production-Grade LLM Apps
LLM
RAG
Embeddings
Data Pipelines
Scalability
Retrieval
Entity Lookup
This article discusses 8 key considerations for building production-grade LLM apps over your data. These considerations include using different chunks for retrieval and synthesis, using embeddings that live in a different latent space than the raw text, dynamically loading/updating the data, designing the pipeline for scalability, storing data in a hierarchical fashion, using robust data pipelines, and using hybrid search for entity lookup.
Embeddings for Retrieval¶
References
-
How to use Aleph Alpha's semantic embeddings
embeddings
semantic embeddings
Aleph Alpha
There are two ways to use Aleph Alpha's semantic embeddings: asymmetric embeddings and symmetric embeddings.
-
TaylorAI/gte-tiny: A 45MB Tiny Model That Beats Existing Sentence-Transformer Embeddings
Vector Search
Sentence Transformer
Embedding
VectorDB
MTEB Leaderboard
The paper introduces TaylorAI/gte-tiny, a 45MB tiny model that beats existing sentence-transformer embedders. The model is based on BERT and distilled from thenlper/gte-small. It achieves comparable performance to larger models while being much smaller and faster. The model ranks 28th out of 126 models on the MTEB leaderboard.
-
LLM
Sentence Embeddings
HuggingFaceEmbeddings
SentenceTransformerEmbeddings
Sentence-BERT
This article introduces a new way to generate sentence embeddings using LLM. The method is based on the HuggingFaceEmbeddings integration, which allows users to use SentenceTransformers embeddings directly. The article also provides an example of how to use the new method.
Evaluating RAGs¶
References
-
Ragas: A Framework for Evaluating Retrieval Augmented Generation Pipelines
LLM
RAG
NLP
Machine Learning
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. It provides you with the tools based on the latest research for evaluating LLM-generated text to give you insights about your RAG pipeline. Ragas can be integrated with your CI/CD to provide continuous checks to ensure performance.
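A hedged sketch of a minimal Ragas evaluation is shown below; note that the required dataset columns and metric names vary between ragas versions, and the LLM-based metrics assume an OpenAI API key is configured:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One evaluated RAG interaction; in practice this comes from your pipeline logs.
samples = {
    "question": ["What does the 10-K say about revenue growth?"],
    "answer": ["Revenue grew 12% year over year, driven by services."],
    "contexts": [["Total revenue increased 12% compared to the prior year ..."]],
}
dataset = Dataset.from_dict(samples)

# Each sample is scored with LLM-based metrics such as faithfulness.
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(results)
```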
Integrating LLMs with Knowledge Graphs¶
References
-
Knowledge Graphs
LLMs
RAG
Vector Databases
ChromaDB
This article discusses the advantages and disadvantages of using Knowledge Graphs (KGs) with LLMs. It also provides a link to a Colab notebook and a video tutorial on the topic.
LLM Watermarking¶
References
-
AI generated text? New research shows watermark removal is harder than one thinks!
LLM
Watermarking
Text Generation
AI Ethics
Researchers from the University of Maryland have found that it is much harder to remove watermarks from AI-generated text than previously thought. This has implications for the use of watermarks to detect machine-generated content, such as spam and harmful content.
LLM Application Development¶
LLM SaaS Apps¶
References
-
Introducing LLM Studio: A Powerful Platform for Building and Deploying Language Models
LLM
NLP
Machine Learning
LLM Studio is a powerful platform that enables developers to easily build, train, and deploy language models. With its user-friendly interface and comprehensive set of features, LLM Studio makes it easy to create and deploy state-of-the-art language models for a variety of applications.
-
Verba: The Open-Source LLM-Based Search Engine
LLM
Open Source
Search Engine
Verba is an open-source LLM-based search engine that supports a broad spectrum of open-source libraries and custom features. It is easy to install and use, and it does not require users to give away any of their data.
LLM Courses¶
References
-
LLM
NLP
AI
The article recommends some short courses on LLM. The author also recommends some YouTube channels and videos on LLM.