Landscape of LLMs¶
Author: Hetul Patel | Published on: 6 Dec, 2023

Pretrained LLMs¶

Open-Source LLMs¶
References
- Can we stop relying on proprietary LLMs to evaluate open LLMs?
  Tags: Evaluation, Open LLM, Proprietary LLM, GPT-4, Feedback Collection dataset, Prometheus model
  Using proprietary LLMs like GPT-4 to evaluate open LLMs has limitations. The Feedback Collection dataset and the Prometheus model aim to close the gap between open and closed models by providing a way to evaluate open LLMs without relying on proprietary models.
- MosaicML releases MPT-30B, a 30 billion parameter LLM that outperforms GPT-3
  Tags: LLM, Open Source, Machine Learning, Artificial Intelligence
  MosaicML has released MPT-30B, a 30 billion parameter LLM that outperforms the original 175 billion parameter GPT-3. It is fully open source for commercial use and comes with two fine-tuned variants: MPT-30B-Instruct and MPT-30B-Chat. MPT-30B-Chat is available to play with on HuggingFace, powered by MosaicML Inference. To use MPT-30B in production, you can customize and deploy it with MosaicML Training and MosaicML Inference.
- OpenChat Surpasses ChatGPT Performance With Open-Source Model
  Tags: LLM, OpenAI, ChatGPT, NLP, Machine Learning
  OpenChat has developed a new language model, Orca, that outperforms ChatGPT on the Vicuna benchmark. Orca was trained on a smaller dataset than ChatGPT but achieved better performance by using a more efficient training method. OpenChat has made Orca open source so that other researchers can build on its success.
- The Latest Advancements in Large Language Models: Unveiling Llama 2, Code Llama, and More
  Tags: LLM, Llama 2, Code Llama, GPT-4, OpenAI, Finetuning, Transformer-based LLMs, NeurIPS LLM Efficiency Challenge
  The article discusses the latest advancements in large language models (LLMs), including the release of Meta's Llama 2 and Code Llama models, the leaked GPT-4 model details, OpenAI's new finetuning API, and the NeurIPS LLM Efficiency Challenge. It provides a comprehensive overview of the key features, capabilities, and potential applications of these models, while also highlighting ongoing challenges and debates in the field of LLMs.
- Announcing Mistral 7B: The Most Powerful Language Model For Its Size
  Tags: language-models, machine-learning, artificial-intelligence
  The Mistral AI team has released Mistral 7B, a 7.3B parameter language model that outperforms Llama 2 13B on all metrics. It is easy to fine-tune on any task and is released under the Apache 2.0 license.
- Hugging Face Unveils Zephyr-7b: A State-of-the-Art 7B Chatbot
  Tags: LLM, Chatbot, Natural Language Processing, Artificial Intelligence
  Hugging Face has released Zephyr-7b, a 7B chatbot that outperforms other models in its class on MT-Bench and the Open LLM Leaderboard. The model was trained using a combination of instruction fine-tuning and Direct Preference Optimization on publicly available datasets. It is available to try out on the Hugging Face website; a minimal usage sketch appears after this list.
- LaMini-LM: Can Small Language Models Compete with Large Ones?
  Tags: language models, parameter scale, computational requirements, LaMini-LM, distilled instructions
  LaMini-LM is a small language model trained on a large set of distilled instructions, designed to achieve impressive results locally with a much smaller model. The article delves into the details of LaMini-LM and shows how modest its computational requirements are.
- Open Source LLaMA 13B Released with Full Commercial Usage Rights
  Tags: Open Source LLaMA, RedPajama Dataset, SlimPajama Dataset, Code Generation, Commercial Usage, Energy Efficiency
  OpenLM Research has released a fully open source version of the LLaMA 13B model, trained on the RedPajama dataset. The model weights are available in both JAX and PyTorch. The model is not ideal for code generation due to its treatment of empty spaces, but it remains one of the best open source models to build on top of. The authors are considering training future releases on the SlimPajama dataset, a cleaned version of the RedPajama dataset that is 49% smaller.
- Meet Notus-7B: Data Curation and Open Science go a long way in shaping AI's future
  Tags: Open Source LLM, RLHF, DPO
  Llama 1 and 2 opened the floodgates of open source LLMs. Mistral AI released the most powerful 7B base LLM, remotely inspired by the success of Llama 2. Hugging Face's H4 team released Zephyr, trained on a mix of publicly available, synthetic datasets using DPO. TsinghuaNLP released UltraChat, a large-scale, multi-round dialogue dataset, and OpenBMB released UltraFeedback, a large-scale, fine-grained, diverse preference dataset for RLHF and DPO. The H4 team fine-tuned Zephyr using UltraChat (supervised fine-tuning) and UltraFeedback (DPO for alignment). Argilla fixed some data issues and improved on Zephyr to release Notus-7B.
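
Most of the open chat models above (Zephyr, MPT-30B-Chat, OpenChat, Notus) can be tried locally through the Hugging Face transformers pipeline. A minimal sketch, assuming a GPU with enough memory; the model id and generation settings are illustrative:

```python
# Minimal sketch: running an open chat model with the transformers pipeline.
# Model id and sampling settings are illustrative, not prescriptive.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise, helpful assistant."},
    {"role": "user", "content": "Summarize what DPO is in two sentences."},
]
# Build the model's expected chat prompt from the message list.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = pipe(prompt, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.95)
print(output[0]["generated_text"])
```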
Prompt Engineering¶

References
- Tags: Prompt Engineering
  Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help to better understand the capabilities and limitations of large language models (LLMs). A small few-shot prompt example follows this list.
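
A minimal, self-contained few-shot prompt sketch in Python; the classification task and example reviews are made up for illustration:

```python
# A toy few-shot prompt: an instruction plus a handful of worked examples,
# followed by the new input the model should complete.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts two full days."
Sentiment: positive

Review: "The screen cracked after a week."
Sentiment: negative

Review: "{review}"
Sentiment:"""

def build_prompt(review: str) -> str:
    return FEW_SHOT_PROMPT.format(review=review)

print(build_prompt("Setup was painless and the docs are great."))
```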
Training LLMs¶

References
- Efficient Deep Learning Optimization Libraries for Large Language Model Training
  Tags: DeepSpeed, Megatron-DeepSpeed, FairScale, Megatron-LM, Colossal-AI, BMTrain, Mesh TensorFlow, MaxText, Alpa, GPT-NeoX
  This article provides an overview of various deep learning optimization libraries that can simplify and optimize the training process for large language models. These libraries offer features such as distributed training, model parallelism, and efficient training algorithms, enabling researchers and practitioners to achieve better results with less effort.
- Tags: Training vs Prompt Engineering, Task Diversity for OOD Robustness, Self-Instruction for Dataset Generation, Self-Consistency for Higher Performance, Evaluation
  This MLOps Community podcast with Mark Huang discusses various LLM training techniques, including training vs. prompt engineering, task diversity for OOD robustness, self-instruction for dataset generation, self-consistency for higher performance, and evaluation.
- Deploying RLHF with 0 Annotations: A Case Study
  Tags: real-world case study, reducing manual effort, RLHF, translation quality, reward model, user-designated pair, regression model, Allen AI's library RL4LMs, T5/Flan-T5, HF Trainer, Sentence Transformers Cross-Encoders
  This article presents a real-world case study of deploying RLHF with zero annotations. It describes the challenges faced by a large translation company in SE Asia and how RLHF was used to reduce the manual effort spent fixing domain-specific vocabulary and robotic translations. The article also discusses the tools and libraries used and provides a key takeaway for readers.
- X-LLM: A Framework for Training Multimodal Language Models
  Tags: Multimodal Language Models, X-LLM, Image Captioning, Text-to-Speech, Multimodal Question Answering
  The paper proposes a new framework, X-LLM, for training multimodal language models. X-LLM consists of three main components: single-modal encoders, X2L interfaces, and a large language model (LLM). The authors evaluate X-LLM on a variety of tasks and show that it achieves state-of-the-art results.
- TRL: A Full-Stack Transformer Language Model with Reinforcement Learning
  Tags: Reinforcement Learning, Transformer Language Models, Supervised Fine-tuning, Reward Modeling, Proximal Policy Optimization
  TRL is a full-stack library that provides tools for training transformer language models and stable diffusion models with Reinforcement Learning. It is built on top of the transformers library by 🤗 Hugging Face and supports most decoder and encoder-decoder architectures.
- Is Reinforcement Learning Really Necessary for Large Language Models?
  Tags: language models, reinforcement learning, direct preference optimization
  The paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" introduces a novel algorithm that removes the two stages of RLHF, namely fitting a reward model and training a policy to optimize the reward via sampling. The new algorithm, Direct Preference Optimization (DPO), trains the LLM with a loss function that encourages it to increase the likelihood of the better completion and decrease the likelihood of the worse completion. DPO achieves performance comparable to RL-based methods while being much simpler to implement and scale. A worked sketch of the DPO loss follows this list.
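
To make the DPO objective concrete, here is a minimal sketch of the per-example loss described above, assuming the summed log-probabilities of the chosen and rejected completions under the policy and a frozen reference model are already available (the numbers below are toy values):

```python
# A minimal, self-contained sketch of the DPO per-example loss.
# In practice the log-probabilities come from summing per-token log-probs
# of the policy model and a frozen reference model over each completion.
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the scaled preference margin."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1 / (1 + math.exp(-logits)))  # -log(sigmoid(logits))

# Toy numbers: the policy already prefers the chosen completion slightly.
print(dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```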
Supervised Finetuning¶
References
- Fine-tuning Llama-2 on your own data
  Tags: LLM, Fine-tuning, Natural Language Processing
  The new script allows fine-tuning Llama-2 on your own data in just a few lines of code. It handles single- and multi-GPU setups and can even train the 70B model on a single A100 GPU by leveraging 4-bit quantization.
- Fine-tuning LLMs for specific tasks
  Tags: LLM, fine-tuning, performance
  The author of the ReAct paper explores the effects of fine-tuning LLMs on specific tasks. They found that fine-tuning significantly improves performance when using the LLM as an agent. The key is to fine-tune each module to tailor it to specific tasks.
- A discussion on various LLM fine-tuning techniques
  Tags: LoRA, adapters, prompt tuning, RL-based policy finetuning
  The post discusses various LLM fine-tuning techniques, covering LoRA, adapters, prompt tuning, and RL-based policy fine-tuning. The discussion revolves around the advantages and disadvantages of each technique and the scenarios where they are most suitable.
- Fine-tuning Mistral-7b with QLoRA on Google Colab
  Tags: LLM, Mistral-7b, QLoRA, Hugging Face, TRL, PEFT
  The article describes how to fine-tune the Mistral-7b language model using QLoRA on Google Colab. This can be done using the TRL and PEFT tools from the Hugging Face ecosystem; a minimal QLoRA training sketch appears after this list. The article also includes links to the Google Colab notebook and a GitHub thread with more information.
- Tags: InstructGPT, T0, The Turking Test, FLAN, Natural Instructions
  Instruction-tuning is a method for improving the performance of language models on a given task by providing them with additional instructions. This can be done either by fine-tuning the model on a dataset of instructions or by using a pre-trained model and providing it with instructions at inference time. Instruction-tuning has been shown to be effective for a variety of tasks, including text summarization, question answering, and machine translation.
- LLM Reasoning Capabilities Improve with Increased Parameters
  Tags: reasoning, structured outputs, fine-tuning
  A survey of llama2-chat models shows that reasoning capabilities improve as the number of parameters increases. However, structured outputs remain a challenge. This suggests that fine-tuning for better structured data extraction could potentially help.
- Tags: Finetuning, In-context learning, Retrieval augmentation, Embedding finetuning, LLM finetuning, LlamaIndex integrations
  Finetuning a model involves updating the model itself over a set of data to improve it in various ways, including improving the quality of outputs, reducing hallucinations, memorizing more data holistically, and reducing latency/cost. The core of the LlamaIndex toolkit revolves around in-context learning / retrieval augmentation, which uses the models in inference mode without training them. While finetuning can also be used to “augment” a model with external data, finetuning can complement retrieval augmentation in a variety of ways.
- T-Few Finetuning: Efficient Training and Scalable Serving of Large Language Models
  Tags: large language models, finetuning, T-Few, training efficiency, serving scalability
  T-Few finetuning is a technique that selectively updates only a fraction of the model's weights, reducing training time and computational resources. It also enables model stacking, which allows concurrent inference of multiple finetunes, maximizing GPU utilization and improving serving scalability.
- How to Fine-tune Llama 2 Embeddings for Better Retrieval Performance
  Tags: LLM, RAG, Embedding Finetuning, LlamaIndex
  This article provides a step-by-step guide on how to fine-tune Llama 2 embeddings for better retrieval performance in RAG systems. The guide includes instructions on how to generate training data, fine-tune the embedding model, and evaluate the performance of the fine-tuned model.
- RL4LMs: A Modular RL Library for Fine-Tuning Language Models to Human Preferences
  Tags: language models, reinforcement learning, natural language processing
  RL4LMs is a modular RL library for fine-tuning language models to human preferences. It provides easily customizable building blocks for training language models, including implementations of on-policy algorithms, reward functions, metrics, datasets, and LM-based actor-critic policies.
- Exploring Alternatives to RLHF for Fine-Tuning Large Language Models
  Tags: Large Language Models, Supervised Fine-Tuning, Reinforcement Learning from Human Feedback, Direct Preference Optimization, Chain of Hindsight
  This blog explores alternatives to Reinforcement Learning from Human Feedback (RLHF) for fine-tuning large language models, including supervised fine-tuning and direct preference optimization. It also provides a hands-on guide to preparing human preference data and using the Transformers Reinforcement Learning library to fine-tune a large language model using direct preference optimization.
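
A minimal QLoRA supervised fine-tuning sketch with TRL and PEFT, in the spirit of the Mistral-7b Colab entry above. The model id, dataset, and hyperparameters are placeholders, and argument names vary a bit across TRL versions, so treat this as a sketch rather than a recipe:

```python
# Sketch: 4-bit QLoRA supervised fine-tuning with TRL's SFTTrainer.
# Assumes a CUDA GPU; dataset and hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
dataset = load_dataset("imdb", split="train[:1%]")  # placeholder text dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora,
    dataset_text_field="text",   # column containing raw training text
    max_seq_length=512,
    args=TrainingArguments(output_dir="qlora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=4, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=10),
)
trainer.train()
```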
Evaluating LLMs¶

References
- LMFlow Benchmark: An Automatic Evaluation Framework for Open-Source LLMs
  Tags: LLM Evaluation, Chatbot Arena, GPT-4, LMFlow Benchmark
  The paper introduces the LMFlow benchmark, a cheap and easy-to-use evaluation framework that helps reflect different aspects of LLM capability.
- Tags: LLM Evaluation, RAG, Hallucinations, Metrics
  This article discusses various techniques for evaluating LLM performance, including hallucination detection and metrics-based approaches. It also provides a framework for optimizing LLM performance using RAG and fine-tuning.
- A Metrics-First Approach to LLM Evaluation
  Tags: LLM Evaluation, Human Evaluation, Traditional Metrics, Galileo Metrics
  The industry has started adopting LLMs for various applications, but evaluating their performance is challenging: human evaluation is costly and prone to errors, traditional metrics correlate poorly with human judgment, and reliable benchmarks are absent. Galileo has built metrics to help evaluate LLMs in minutes instead of days.
- Evaluation Driven Development for LLM Apps
  Tags: Evaluation Driven Development, LLM, EDD, Stochastic nature of LLMs, LlamaIndex, Retrieval methods, Comparing LLMs
  The article discusses the importance of Evaluation Driven Development (EDD) for building LLM apps. It provides a step-by-step guide to EDD, including defining evaluation metrics, defining an evaluation dataset, and trying out different approaches, and it highlights how EDD mitigates the risks associated with the stochastic nature of LLMs. It also links to additional resources on EDD.
- How to Evaluate Chatbots with Large Language Models
  Tags: Chatbots, LLM, RAG, Evaluation, MLflow
  This article explores how to evaluate chatbots with large language models (LLMs). It discusses the use of LLMs as judges for automated evaluation, provides best practices for using LLM judges, and stresses the importance of use-case-specific benchmarks for evaluation.
- How to Monitor NDCG for Ranking Models in Production
  Tags: Ranking models, NDCG, ML observability, Model monitoring, Machine learning
  This article provides a comprehensive guide to monitoring Normalized Discounted Cumulative Gain (NDCG) for ranking models in production. It covers the intuition behind NDCG, its calculation (a small worked example follows this list), and how it can be used to evaluate the performance of ranking models. It also discusses the challenges of maintaining ranking models in production and how ML observability can help.
- Index Metrics for Evaluating Recommender System Performance
  Tags: Recommender Systems, Evaluation Metrics, Hit Ratio, MRR, Precision, Recall, MAP, NDCG
  Recommender systems output a ranked list of items. Hit ratio, MRR, precision, recall, MAP, and NDCG are commonly used metrics for evaluating their performance.
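
A small worked example of NDCG@k, using one common formulation of DCG (relevance divided by the log of the position); the graded relevance scores are made up:

```python
# A small, self-contained NDCG@k calculation; relevance grades are invented.
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k graded relevances, in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the items a ranker returned, in the order it returned them.
ranked_relevances = [3, 2, 0, 1, 2]
print(round(ndcg_at_k(ranked_relevances, k=5), 3))
```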
LLMs Deployment¶

References
- Model Serving Frameworks for 2023
  Tags: Model Serving, AI, Machine Learning, MLOps
  The article provides a comprehensive list of model serving frameworks for AI applications in 2023. It highlights the benefits and features of each framework, including BentoML, Jina, and TorchServe, and emphasizes their importance in the MLOps process.
- vLLM: A High-Throughput Library for Large Language Model Serving
  Tags: LLM, machine learning, artificial intelligence, natural language processing
  vLLM is an open-source library for fast LLM inference and serving. It utilizes PagedAttention, a new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the state of the art in LLM serving, delivering up to 24x higher throughput than HuggingFace Transformers without requiring any model architecture changes. A minimal usage sketch appears after this list.
- How to Optimize Latency for Open Source Language Models
  Tags: Optimization, Latency, LLM, Model Serving, Inference
  This study explores various approaches to optimizing latency for open-source LLMs. The author evaluates the effectiveness of different tools and techniques, including CTranslate2, TGI, bitsandbytes, AutoGPTQ, ExLlama, vLLM, and HuggingFace's hosted inference platform. The results show that vLLM is currently the fastest solution for distributed inference, while HuggingFace's hosted inference platform offers the best performance for single-GPU inference.
- How to Optimize Large Language Model (LLM) Inference
  Tags: Large Language Model, LLM, Inference, Optimization, Machine Learning
  This article provides best practices for optimizing LLM inference, including identifying the optimization target, paying attention to the components of latency, utilizing memory bandwidth, batching, and exploring deeper systems optimizations. It also discusses hardware configurations and the importance of data-driven decisions.
- Tags: HuggingFace, LLM, Rust, Python, gRPC, Docker, CUDA, NCCL, OpenTelemetry, quantization
  Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). It implements many features such as optimized models, tensor parallelism, and distributed tracing. TGI can be installed locally or used as a Docker container.
- Tags: Machine Learning, Natural Language Processing, Text Embeddings, Serverless Computing
  TEI is a new server for sentence or document embeddings that is optimized for speed and efficiency. It is based on the candle Rust backend and does not require torch, making it very small and lightweight. TEI is a step towards real ML serverless and has the potential to make it easier to use multimodal embeddings in production.
- Text Generation Inference: A Rust, Python, and gRPC toolkit
  Tags: HuggingFace, Hugging Chat, Inference API, Inference Endpoint, Large Language Models (LLMs), Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, OpenTelemetry, Prometheus, Tensor Parallelism, Server-Sent Events (SSE), transformers.LogitsProcessor, Custom Prompt Generation, Fine-tuning Support
  Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). It supports many features such as a simple launcher, production readiness, tensor parallelism, token streaming, continuous batching, optimized transformers code, quantization, watermarking, a logits warper, stop sequences, log probabilities, custom prompt generation, and fine-tuning support.
- LoRAX: The LLM Inference Server that Speaks for the GPUs
  Tags: LLM, LoRA, GPU, Cloud, Predibase
  LoRAX is a new kind of LLM inference solution designed to make it cost-effective and scalable to serve many fine-tuned models in production at once, conserving precious GPUs by dynamically exchanging fine-tuned LoRA models in and out within a single LLM deployment.
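
A minimal vLLM offline-inference sketch in the spirit of the vLLM entry above; the model id and sampling settings are illustrative, and a CUDA GPU with enough memory is assumed:

```python
# Minimal offline inference with vLLM; model id and sampling settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")  # downloads weights on first run
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = [
    "Explain PagedAttention in two sentences.",
    "List three benefits of continuous batching.",
]
for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text.strip())
```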
Running LLMs Locally¶
References
- Run Large Language Models on Your CPU with Llama.cpp
  Tags: LLM, Inference, CPU, GPU, ChatGPT, Vicuna, GPT4All, Alpaca, ggml
  This article explains how to set up llama.cpp on your computer to run large language models on your CPU. It focuses on Vicuna, a chat model behaving like ChatGPT, but also shows how to run llama.cpp for other language models.
- h2oGPT - 100% Private, 100% Local Chat with a GPT
  Tags: LLM, h2oGPT, Open Source, Private, Local
  This video shows how to install and use h2oGPT, an open-source large language model (LLM), on a local computer for private, local chat with a GPT.
- Run Large Language Models on Your Own Computer with llama.cpp
  Tags: Large Language Models, llama.cpp, NVIDIA CUDA, Ubuntu 22.04
  This blog post provides a step-by-step guide for running the Llama-2 7B model using llama.cpp, with NVIDIA CUDA and Ubuntu 22.04.
- Get up and running with Llama 2 and other large language models locally
  Tags: LLM, Ollama, Modelfile, Docker, REST API
  This article provides instructions on how to get up and running with Llama 2 and other large language models locally. It covers topics such as installing Docker, downloading models, customizing prompts, and using the REST API; a minimal REST call sketch appears after this list.
- GPT4All: A Free, Local, Privacy-Aware Chatbot
  Tags: privacy, local, chatbot
  GPT4All is a free-to-use, locally running chatbot that does not require a GPU or internet connection. It is designed to be privacy-aware and does not collect or store any user data.
- LocalAI: An Open Source OpenAI Alternative
  Tags: LLM, OpenAI, gpt-3, localai
  LocalAI is a free, open-source alternative to OpenAI that allows you to run LLMs and generate images, audio, and more locally or on-prem with consumer-grade hardware. It does not require a GPU and supports multiple model families that are compatible with the ggml format.
- LocalGPT: Chat with your documents on your local device using GPT models
  Tags: localgpt, gpt-3, language-models, privacy, security
  LocalGPT is an open-source initiative that allows you to converse with your documents without compromising your privacy. With everything running locally, you can be assured that no data ever leaves your computer.
- Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac)
  Tags: GPU, CPU, Linux, Windows, Mac, Llama 2, gradio UI, Generative Agents/Apps
  This project enables users to run any Llama 2 model locally with a gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). It uses llama2-wrapper as the local Llama 2 backend for Generative Agents/Apps.
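
A minimal sketch of calling a locally running Ollama server over its REST API, following the pattern described in the Ollama entry above; it assumes `ollama serve` is running and the llama2 model has already been pulled:

```python
# Minimal call to a local Ollama server's REST API; endpoint and fields follow
# the Ollama docs but are shown here only as a sketch.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```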
Semantic Cache for LLMs¶
References
- GPTCache: Semantic Cache for LLMs
  Tags: LLM, Semantic Caching, LangChain, Llama Index
  GPTCache is a semantic cache for LLMs that helps reduce the cost and latency of LLM API calls. It uses embedding algorithms to convert queries into embeddings and uses a vector store for similarity search on these embeddings. This allows GPTCache to identify and retrieve similar or related queries from the cache storage, thereby increasing cache-hit probability and enhancing overall caching efficiency. A simplified sketch of the idea follows this list.
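
The sketch below is not GPTCache's actual API; it is a simplified, self-contained illustration of the mechanism described above: embed each query, look for a cached query above a similarity threshold, and only call the LLM on a cache miss. The model id and threshold are arbitrary:

```python
# Toy semantic cache: embed queries, reuse answers for sufficiently similar queries.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (embedding, answer) pairs; a real cache would use a vector store

def cached_answer(query: str, llm_call, threshold: float = 0.8) -> str:
    q = encoder.encode(query, normalize_embeddings=True)
    for emb, answer in cache:
        if float(np.dot(q, emb)) >= threshold:  # cosine similarity on normalized vectors
            return answer  # cache hit: skip the LLM call
    answer = llm_call(query)
    cache.append((q, answer))
    return answer

print(cached_answer("What is semantic caching?", llm_call=lambda q: "stubbed LLM response"))
# A close paraphrase is likely (not guaranteed) to be served from the cache.
print(cached_answer("Explain semantic caching", llm_call=lambda q: "stubbed LLM response"))
```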
LLM Inference Optimisation¶

LLM Quantization¶
References
- BitNet: Scaling 1-bit Transformers for Large Language Models
  Tags: Transformers, Quantization, LLM
  BitNet is a scalable and stable 1-bit Transformer architecture designed for large language models. It achieves competitive performance while substantially reducing memory footprint and energy consumption compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
- HuggingFace: An Overview of Natively Supported Quantization Schemes in Transformers
  Tags: HuggingFace, Transformers, Quantization
  The article provides an overview of natively supported quantization schemes in Transformers, including bitsandbytes and GPTQ. It also discusses the relation between bitsandbytes and GPTQ and compares the performance of GPTQ with bitsandbytes NF4; a minimal 4-bit loading sketch appears after this list.
- Hugging Face Optimum GPTQ Quantization
  Tags: Hugging Face, Optimum, GPTQ, Quantization, LLM, NLP
  This blog post introduces GPTQ quantization, a method to compress GPT models by reducing the number of bits needed to store each weight. It also provides a step-by-step tutorial on how to quantize a GPT model using the Hugging Face Optimum library.
- SqueezeLLM: Efficient LLM Serving with Dense-and-Sparse Quantization
  Tags: Model Compression, Quantization, Efficient Serving
  SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving. This method splits weight matrices into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse part that preserves sensitive and outlier parts of the weight matrices. With this approach, SqueezeLLM is able to serve larger models with a smaller memory footprint, the same latency, and yet higher accuracy and quality.
- SqueezeLLM: Achieving 3-bit Quantization for LLM Inference Acceleration
  Tags: Post-Training Quantisation (PTQ), Non-Uniform Quantization, Dense and Sparse Quantization, Memory-Bottlenecked Operations, GPU Memory Optimization, Model Compression, LLM Inference Acceleration
  The paper proposes SqueezeLLM, a novel post-training quantisation (PTQ) technique that achieves 3-bit quantization for LLM inference acceleration. It introduces non-uniform quantization and dense-and-sparse quantization to address memory bottlenecks and achieve a 230% speedup in inference. The paper also compares SqueezeLLM with other quantization techniques and demonstrates its superior performance in terms of compression and accuracy.
- New Research Paper: Sparse Quantized Representation for Efficient Large Language Model Compression
  Tags: LLM Compression, SpQR, Quantization, Falcon, LLaMA
  A new research paper introduces Sparse Quantized Representation (SpQR), a new compression format and quantization technique that enables near-lossless compression of LLMs down to 3-4 bits per parameter. It works by recognizing and isolating outlier weights that cause large quantization errors and storing them in higher precision, while compressing all other weights to 3-4 bits. The authors claim that SpQR can achieve relative accuracy losses of less than 1% in perplexity for highly accurate LLMs like Falcon and LLaMA.
- Two Cool Releases from Last Week in the LLM Domain
  Tags: RedPajama Dataset, LLM Model Family, HELM Benchmark
  Cerebras Systems has released a cleaned and de-duplicated version of the RedPajama dataset, reducing its size by 49%. Additionally, RedPajama has released a 7B model family, including chat, instruction fine-tuned, and base models. The instruction fine-tuned model shows promising performance on the HELM benchmark.
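
A minimal sketch of 4-bit NF4 loading via transformers + bitsandbytes, one of the natively supported schemes discussed above; the model id is illustrative and a CUDA GPU is assumed:

```python
# Sketch: load a causal LM with 4-bit NF4 quantization via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

inputs = tokenizer("Quantization reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True))
```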
LLMs with Large Context Window¶

References
- How to use 100K context window in LLMs
  Tags: LLM Training, Model Size, Attention Mechanisms
  This article explores techniques for speeding up training and inference of LLMs so they can use large context windows of up to 100K input tokens. It covers ALiBi positional embedding (a small sketch follows this list), sparse attention, FlashAttention, multi-query attention, conditional computation, and the use of 80GB A100 GPUs.
- XGen: A New State-of-the-Art 7B LLM with Standard Dense Attention on Up to 8K Sequence Length
  Tags: LLM, NLP, Machine Learning, Artificial Intelligence
  XGen is a new state-of-the-art 7B LLM with standard dense attention on up to 8K sequence length. It achieves comparable or better results than other open-source LLMs of similar model size on standard NLP benchmarks. XGen also shows benefits on long-sequence modeling benchmarks and achieves great results on both text and code tasks.
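
A simplified sketch of the ALiBi idea mentioned above: instead of positional embeddings, each attention head adds a linear penalty proportional to query-key distance to its pre-softmax attention scores. The slope schedule follows the geometric sequence from the ALiBi paper; the rest is illustrative:

```python
# Simplified ALiBi bias: a per-head linear distance penalty added to attention scores.
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    slopes = np.array([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]   # key index minus query index
    distance = np.minimum(distance, 0)                   # future positions get 0 (causally masked anyway)
    return slopes[:, None, None] * distance[None, :, :]  # shape (heads, seq, seq), non-positive

bias = alibi_bias(seq_len=5, num_heads=2)
print(bias[0])  # add this to the pre-softmax attention scores of head 0
```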
Challenges with LLMs¶

References
- Challenges in Building LLM Applications for Production
  Tags: Consistency, Hallucinations, Privacy, Context Length, Data Drift, Model Updates and Compatibility, LM on the Edge, Model Size, Non-English Languages, Chat vs. Search as an Interface, Data Bottleneck, Hype Cycles and the Importance of Data
  This talk discusses the challenges in building LLM applications for production, including consistency, hallucinations, privacy, context length, data drift, model updates and compatibility, LMs on the edge, model size, non-English languages, chat vs. search as an interface, data bottlenecks, and hype cycles and the importance of data.
- Open challenges in LLM research
  Tags: hallucinations, context learning, multimodality, new architectures, GPU alternatives, agent usability, learning from human preference, chat interface efficiency, non-English language support
  The article discusses the ten major research directions in the field of LLMs, including reducing and measuring hallucinations, optimizing context length and construction, incorporating other data modalities, making LLMs faster and cheaper, designing new model architectures, developing GPU alternatives, making agents usable, improving learning from human preference, improving the efficiency of the chat interface, and building LLMs for non-English languages.
- The Perils of Blindly Reusing Pre-trained Language Models
  Tags: NLP, Transfer Learning, Model Analysis, WeightWatchers
  Reusing pre-trained language models without careful consideration can lead to negative impacts on downstream tasks due to issues such as over-training, under-training, or over-parameterization. WeightWatchers is an open-source diagnostic tool that can analyze DNNs without access to training or test data, helping to identify potential issues before deployment.
Large vs Small Language Models¶
References
- Small language models can outperform LLMs in specific domains
  Tags: LLM, NLP, Machine Learning
  A new LLM trained by Microsoft Research achieves 51% on HumanEval with only 1.3B parameters and a 7B-token training dataset, outperforming much larger LLMs. This suggests that smaller language models can be more effective in specific domains, such as Python code generation.
- Are Large Language Models All We Need?
  Tags: LLM, Model Size, Data Quality
  The author discusses the recent trend of focusing on model sizes in the field of LLMs and argues that data quality is often overlooked. They cite the example of phi-1, a 1.3B parameter Transformer-based model by Microsoft, which achieved surprisingly good results. The author concludes that we should pay more attention to data quality when developing LLMs.
LLM Applications¶
LLMs for Translation¶
References
- ParroT: Enhancing and Regulating Translation Abilities in Chatbots with Open-Source LLMs
  Tags: LLM, Translation, Chatbots, ParroT
  The ParroT framework enhances and regulates the translation abilities of chatbots by leveraging open-source LLMs and human-written translation and evaluation data.
LLMs For Mobile App Developers¶
References
- Hugging Face releases tools for Swift developers to incorporate language models in their apps
  Tags: Hugging Face, Swift, transformers, Core ML, Llama, Falcon
  Hugging Face has released a package and tools to help Swift developers incorporate language models in their apps, including swift-transformers, swift-chat, transformers-to-coreml, and ready-to-use LLMs such as Llama 2 7B and Falcon 7B.
LLM Assistants¶
References
- Tags: Rust, coding assistants, best practices
  The author asks for advice on how to compare coding assistants. They are concerned about using an assistant for Rust because they are not savvy enough to catch certain bugs. Kalyan KS suggests that the author try out Falcoder, a coding assistant that uses the Falcon-7B model and instruction tuning.
- GPT-Engineer: An AI Agent That Can Write Entire Codebases
  Tags: Artificial Intelligence, Machine Learning, Natural Language Processing, Programming
  GPT-Engineer is an AI agent that can write entire codebases from a prompt and learn how you want your code to look. It asks clarifying questions, generates technical specifications, writes all necessary code, and lets you easily add your own reasoning steps, modify, and experiment. With GPT-Engineer, you can finish a coding project in minutes.
- Introducing AssistGPT: A General Multi-modal Assistant
  Tags: Multimodality, Language and Code, ReAct Agent, Planning and Execution
  The paper introduces AssistGPT, a general multi-modal assistant that can plan, execute, inspect, and learn. It combines many of the latest trends in AI, including multimodality, language and code, and ReAct agents. The paper also includes a cool demo and discusses the latency of the system.
- GPTeam: Building Human-like Social Behavior in Language Models
  Tags: Multi-agent simulation, Human-like social behavior, Language models, Generative agents
  GPTeam is a completely customizable open-source multi-agent simulation, inspired by Stanford’s ground-breaking “Generative Agents” paper. Every agent within a GPTeam simulation has their own unique personality, memories, and directives, leading to interesting emergent behavior as they interact.
- Tags: Large Language Model, Planning, Memory, Tool Use
  The article provides a comprehensive overview of building Large Language Model powered agents, including relevant papers, practical applications, and case studies; a toy sketch of the plan-act-observe loop such agents use appears after this list.
- Best write-up ever on LLM Agents
  Tags: LLM, NLP, OpenAI
  The article provides a comprehensive overview of LLM agents, including their capabilities, limitations, and potential applications. It also discusses the challenges involved in developing and deploying LLM agents, and the ethical considerations that need to be taken into account.
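
To make the agent pattern concrete, here is a toy, self-contained plan-act-observe loop in the spirit of the write-ups above. The stubbed `llm` function stands in for a real model call, and the single calculator tool is purely illustrative:

```python
# Toy agent loop: the "LLM" picks a tool, the tool result is fed back as an observation.
import re

def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # toy tool: arithmetic only

TOOLS = {"calculator": calculator}

def llm(prompt: str) -> str:
    # Stub policy: call the calculator on arithmetic, otherwise return the last observation.
    match = re.search(r"[-\d+*/. ()]+\d", prompt)
    if "Observation:" not in prompt and match:
        return f"Action: calculator[{match.group(0).strip()}]"
    return "Final Answer: " + prompt.split("Observation:")[-1].strip()

def run_agent(question: str, max_steps: int = 3) -> str:
    transcript = question
    for _ in range(max_steps):
        step = llm(transcript)
        if step.startswith("Final Answer:"):
            return step
        tool, arg = re.match(r"Action: (\w+)\[(.+)\]", step).groups()
        transcript += f"\nObservation: {TOOLS[tool](arg)}"
    return "Final Answer: (gave up)"

print(run_agent("What is 17 * 24?"))
```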
Retrieval Augmented Generation¶
References
- RAG & Enterprise: A Match Made in Heaven
  Tags: RAG, LLM, Enterprise Search, Information Retrieval
  RAG (Retrieval-Augmented Generation) models are a powerful tool for enterprise search, as they offer flexibility, practicality, broader coverage, and interpretability. Additionally, with the help of tools like LangChain and Google Vertex, it is now easier than ever to implement RAG solutions.
- HNSW-FINGER: Approximate Nearest Neighbor Search with Locality-Sensitive Hashing
  Tags: locality-sensitive hashing, approximate nearest neighbor search, HNSW
  HNSW-FINGER is a new approximate nearest neighbor search algorithm that uses locality-sensitive hashing to project the query and candidate nodes onto a center node. This allows HNSW-FINGER to achieve better accuracy and efficiency than existing approximate nearest neighbor search algorithms.
- Vector Databases and Hierarchical Navigable Small World
  Tags: Vector Databases, Machine Learning, Artificial Intelligence, Data Science, Generative AI
  The article discusses the rise of vector databases in the era of generative AI and introduces Hierarchical Navigable Small World (HNSW) as an efficient indexing method. HNSW builds multiple graph layers with varying densities to optimize the search process and reduce the number of iterations required to find approximate nearest neighbors.
- RAG-Fusion: A New Retrieval Technique for LLM
  Tags: LLM, Retrieval, MultiQueryRetrieval, Reciprocal Rank Fusion
  RAG-Fusion is a new retrieval technique that builds upon the idea of MultiQueryRetrieval. It generates multiple sub-queries based on a user question, retrieves documents for each sub-query, and merges the retrieved documents together using Reciprocal Rank Fusion; a small fusion sketch appears after this list.
- Question Answering over Documents with Retrieval-Augmented Generation
  Tags: rag, question answering, information retrieval, llm
  This article describes how to build a question-answering-over-documents application using LLMs. It covers the use of retrieval-augmented generation (RAG) for this task and provides a walkthrough of how to build such an application.
- Reordering retrieved documents to improve performance
  Tags: long context, performance degradation, retrieval
  When models must access relevant information in the middle of long contexts, they tend to ignore the provided documents. This issue can be avoided by reordering documents after retrieval to avoid performance degradation.
- How to improve the performance of your LLM search engine with Retrieve & Re-Rank
  Tags: LLM, Semantic Search, Information Retrieval, Question Answering
  This article explains how to improve the performance of your LLM search engine with Retrieve & Re-Rank. It covers lexical search, dense retrieval, semantic search, and cross-encoders.
- Tags: passage retrieval, query expansion, query reranking, open-domain question answering
  EAR is a query Expansion And Reranking approach for improving passage retrieval, with application to open-domain question answering. EAR first applies a query expansion model to generate a diverse set of queries, and then uses a query reranker to select the ones that could lead to better retrieval results.
- MS MARCO: A Large Scale Information Retrieval Corpus
  Tags: information retrieval, semantic search, TREC-DL 2019, MS MARCO Passage Retrieval, BM25, Elasticsearch, electra-base-model, cross-encoder
  MS MARCO is a large-scale information retrieval corpus created from real user search queries on the Bing search engine. It can be used for semantic search: given keywords, a search phrase, or a question, a model finds passages that are relevant to the query. Performance is evaluated on TREC-DL 2019 and the MS MARCO Passage Retrieval dataset, with lexical BM25 search via Elasticsearch shown as a baseline.
- Self-querying retriever: A new way to search for information
  Tags: LLM, VectorStore, information retrieval
  A self-querying retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to its underlying VectorStore. This allows the retriever not only to use the user-input query for semantic similarity comparison with the contents of stored documents, but also to extract filters on the metadata of stored documents from the user query and execute those filters.
- GenQ: Training Effective Dense Retrieval Models with Synthetic Queries
  Tags: dense retrieval, bi-encoders, sentence transformers, text generation, synthetic data, asymmetric semantic search, query generation, T5, MNR loss, Pinecone
  GenQ is a method for training effective dense retrieval models using synthetic queries. It uses a text generation model to generate queries for unlabeled passages of text, which are then used to fine-tune a bi-encoder model. GenQ can approach the performance of models trained with supervised methods, and it is particularly useful when labeled data is limited.
- InPars-v2: Efficient Dataset Generation for Information Retrieval with Open-Source Language Models
  Tags: information retrieval, large language models, dataset generation, open-source
  InPars-v2 is a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. It achieves new state-of-the-art results on the BEIR benchmark.
- Qdrant: A Vector Database & Vector Similarity Search Engine
  Tags: Vector database, Vector similarity search, Approximate nearest neighbor search, Machine learning, Artificial intelligence
  Qdrant is a vector database and vector similarity search engine that can be used for building applications such as matching, searching, and recommending. It is easy to use and provides features such as support for additional payload associated with vectors, payload filtering conditions, and dynamic query planning.
- AutoMergingRetriever: A New Algorithm for Better Retrieval and RAG
  Tags: LLM, Retrieval, RAG, ChatGPT, Dynamic Retrieval, Semantic Relatedness
  The AutoMergingRetriever algorithm dynamically retrieves less disparate, larger contiguous blobs of context only when you need them. This helps the LLM synthesize better results while avoiding always cramming in as much context as possible.
- Optimizing RAG With LLMs: Exploring Chunking Techniques and Reranking for Enhanced Results
  Tags: LLM, Chunking, Ranking, Retrieval Augmented Generation (RAG)
  This article explores chunking techniques and reranking for enhanced results in the context of optimizing RAG with LLMs. The key points covered include strategies for optimizing RAG, using chunking techniques to streamline processing, and implementing ranking models to enhance search quality.
- Dynamic chunk length in AutoMergingRetriever
  Tags: language-models, retrieval, summarization
  The AutoMergingRetriever dynamically chooses the chunk length when retrieving information, resulting in better semantic meaning and context.
- Multi-Document Agents for Building LLM-Powered QA Systems
  Tags: RAG, LLM, QA, summarization, multi-document agents
  The article introduces a new approach for building LLM-powered QA systems that can scale to large numbers of documents and question types. The approach uses multi-document agents, which can answer a broad set of questions, including fact-based QA over single documents, summarization over single documents, fact-based comparisons over multiple documents, and holistic comparisons across multiple documents.
- How to Improve Your RAG App: Adjusting Chunk Size
  Tags: RAG, chunk size, retrieval, ranking, evaluation
  Adjusting chunk size is an important step in improving the performance of a RAG app. More chunks do not always lead to better results, and reranking retrieved chunks may not necessarily improve results either. To find the optimal chunk size, it is necessary to define an evaluation benchmark and perform a sweep over chunk sizes and top-k values. The Arize AI team has provided a comprehensive Colab notebook and slides that demonstrate how to run chunk size sweeps and perform retrieval and Q&A evaluations with Phoenix and LlamaIndex.
- How to Choose the Right Chunk Size for Your RAG System
  Tags: RAG system, chunk size, response time, faithfulness, relevancy
  Choosing the right chunk size for a RAG system is critical for efficiency and accuracy. The optimal chunk size strikes a balance between capturing essential information and speed. The article provides a practical evaluation setup to determine the right chunk size for a specific use case and dataset.
- RAG-Fusion: The Next Frontier of Search Technology
  Tags: Reciprocal Rank Fusion, Query Generation, Retrieval Augmented Generation, Vector Search
  RAG-Fusion is a search methodology that aims to bridge the gap between traditional search paradigms and the multifaceted dimensions of human queries. It employs multiple query generation and Reciprocal Rank Fusion to re-rank search results, with the goal of unearthing transformative knowledge that often remains hidden behind top search results.
- RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
  Tags: language-models, retrieval-augmentation, compression, abstractive-summarization, extractive-summarization
  The authors propose a method to improve the performance of retrieval-augmented language models (LMs) by compressing the retrieved documents into textual summaries. Their method, RECOMP, achieves a compression rate as low as 6% with minimal loss in performance for both language modeling and open-domain question answering tasks. They also show that compressors trained for one LM transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
- Optimizing Retrieval and Generation Performance in Large Language Models
  Tags: RAG, Machine Learning, Knowledge Retrieval, AI
  This article discusses various techniques for optimizing retrieval and generation performance in large language models, including decoupling chunks for retrieval and synthesis, using structured retrieval techniques, dynamically retrieving chunks based on tasks, and optimizing context embeddings.
- Scaling Retrieval-Augmented LLM to 48B
  Tags: LLM Scaling, Retrieval-Augmented LLM, Instruction Tuning
  NVIDIA introduces Retro 48B, the largest LLM pretrained with retrieval. It shows significant perplexity improvement over GPT 43B and can be instruction-tuned more effectively, achieving a +7% improvement on zero-shot question-answering tasks.
- Parsing complex documents with embedded tables using unstructured.io and LlamaIndex
  Tags: unstructured.io, LlamaIndex, SEC filings, research papers, invoices
  Parsing complex documents with embedded tables can be done using unstructured.io and LlamaIndex. This is especially relevant for SEC filings, research papers, invoices, and more.
- Tags: LLM, RAG, Best Practices
  This talk discusses best practices for creating production-ready RAG systems with LLMs.
- Joint Tabular/Semantic QA over Tesla 10K
  Tags: LLM, NLP, Information Retrieval, Question Answering
  This article demonstrates how to ask questions over Tesla's 10-K report with understanding of both the unstructured text and the embedded tables. It uses Unstructured to parse out the tables and LlamaIndex recursive retrieval to index and retrieve tables when necessary given the user question.
- New Fine-Tuning Features in LlamaIndex
  Tags: fine-tuning, retrieval augmentation, structured outputs
  LlamaIndex added a number of new fine-tuning features this week, including fine-tuning with retrieval augmentation and fine-tuning for better structured outputs.
- SuperKnowa: Building Reliable RAG Pipelines for Enterprise LLM Applications
  Tags: RAG, LLM, NLP, Generative AI, Enterprise AI
  This article introduces SuperKnowa, a framework for building reliable and scalable RAG pipelines for enterprise LLM applications. It discusses the challenges of taking a RAG PoC to production and how SuperKnowa addresses them, and gives an overview of the framework and its features, including data indexing, context-aware queries, model evaluation, and debugging.
- SEC Insights: A real-world full-stack application using LlamaIndex
  Tags: LLM, RAG, SEC Insights, Tutorial, Open Source
  This repository contains the code for SEC Insights, a real-world full-stack application that uses the Retrieval Augmented Generation (RAG) capabilities of LlamaIndex to answer questions about SEC 10-K and 10-Q documents. The application is open source and available on GitHub, and a tutorial video is available on YouTube.
- Text Ranking with Pretrained Transformers
  Tags: Text Ranking, Transformers, BERT, Self-supervised Learning, Natural Language Processing, Information Retrieval
  This survey provides an overview of text ranking with neural network architectures known as transformers. It covers a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures, and dense retrieval techniques that perform ranking directly.
- 8 Key Considerations for Building Production-Grade LLM Apps
  Tags: LLM, RAG, Embeddings, Data Pipelines, Scalability, Retrieval, Entity Lookup
  This article discusses 8 key considerations for building production-grade LLM apps over your data: using different chunks for retrieval and synthesis, using embeddings that live in a different latent space than the raw text, dynamically loading/updating the data, designing the pipeline for scalability, storing data in a hierarchical fashion, using robust data pipelines, and using hybrid search for entity lookup.
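
A small, self-contained Reciprocal Rank Fusion sketch, the merging step used by the RAG-Fusion entries above; the document ids, rankings, and the smoothing constant k=60 are illustrative:

```python
# Reciprocal Rank Fusion: each document scores sum(1 / (k + rank)) over all ranked lists.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Rankings returned for three generated sub-queries of the same user question.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
print(reciprocal_rank_fusion(rankings))  # doc_b, appearing near the top everywhere, wins
```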
Embeddings for Retrieval¶
References
- How to use Aleph Alpha's semantic embeddings
  Tags: embeddings, semantic embeddings, Aleph Alpha
  There are two ways to use Aleph Alpha's semantic embeddings: asymmetric embeddings and symmetric embeddings.
- TaylorAI/gte-tiny: A 45MB Tiny Model That Beats Existing Sentence-Transformer Embeddings
  Tags: Vector Search, Sentence Transformer, Embedding, VectorDB, MTEB Leaderboard
  TaylorAI/gte-tiny is a 45MB model that beats existing sentence-transformer embedders. The model is based on BERT and distilled from thenlper/gte-small. It achieves performance comparable to larger models while being much smaller and faster, and ranks 28th out of 126 models on the MTEB leaderboard.
- Tags: LLM, Sentence Embeddings, HuggingFaceEmbeddings, SentenceTransformerEmbeddings, Sentence-BERT
  This article introduces a new way to generate sentence embeddings with LLMs. The method is based on the HuggingFaceEmbeddings integration, which allows users to use SentenceTransformers embeddings directly. The article also provides an example of how to use the new method; a minimal sentence-transformers sketch appears after this list.
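
A minimal sketch of generating sentence embeddings and pairwise cosine similarities with the sentence-transformers library; the model id and example sentences are illustrative:

```python
# Encode sentences and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "How do I fine-tune an embedding model?",
    "Steps for finetuning embeddings for retrieval",
    "Best pizza places in Naples",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(util.cos_sim(embeddings, embeddings))  # pairwise cosine similarity matrix
```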
Evaluating RAGs¶
References
- Ragas: A Framework for Evaluating Retrieval Augmented Generation Pipelines
  Tags: LLM, RAG, NLP, Machine Learning
  Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. It provides tools based on the latest research for evaluating LLM-generated text, giving you insights about your RAG pipeline, and it can be integrated with your CI/CD to provide continuous performance checks. A minimal evaluation sketch appears after this list.
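
A minimal Ragas evaluation sketch, loosely following its documented usage; column names and the available metrics vary across Ragas versions, and the default metrics call an LLM judge (an OPENAI_API_KEY is assumed), so treat the details as illustrative:

```python
# Sketch: score a single RAG example with Ragas metrics (illustrative data and setup).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

samples = {
    "question": ["What does PagedAttention manage?"],
    "contexts": [["PagedAttention manages attention keys and values in paged memory blocks."]],
    "answer": ["It manages attention keys and values."],
}
result = evaluate(Dataset.from_dict(samples), metrics=[faithfulness, answer_relevancy])
print(result)
```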
Integrating LLMs with Knowledge Graphs¶
References
- Tags: Knowledge Graphs, LLMs, RAG, Vector Databases, ChromaDB
  This article discusses the advantages and disadvantages of using Knowledge Graphs (KGs) with LLMs. It also provides a link to a Colab notebook and a video tutorial on the topic.
LLM Watermarking¶
References
- AI generated text? New research shows watermark removal is harder than one thinks!
  Tags: LLM, Watermarking, Text Generation, AI Ethics
  Researchers from the University of Maryland have found that it is much harder to remove watermarks from AI-generated text than previously thought. This has implications for the use of watermarks to detect machine-generated content, such as spam and harmful content.
LLM Application Development¶

LLM SaaS Apps¶
References
- Introducing LLM Studio: A Powerful Platform for Building and Deploying Language Models
  Tags: LLM, NLP, Machine Learning
  LLM Studio is a powerful platform that enables developers to easily build, train, and deploy language models. With its user-friendly interface and comprehensive set of features, LLM Studio makes it easy to create and deploy state-of-the-art language models for a variety of applications.
- Verba: The Open-Source LLM-Based Search Engine
  Tags: LLM, Open Source, Search Engine
  Verba is an open-source LLM-based search engine that supports a broad spectrum of open-source libraries and custom features. It is easy to install and use, and it does not require users to give away any of their data.
LLM Courses¶

References
- Tags: LLM, NLP, AI
  The article recommends some short courses on LLMs, along with YouTube channels and videos on the topic.