Understanding the Observability of Large Language Models

ully
9 min read · May 29, 2024

Large-model applications are not mysterious; they are still a kind of software system. Just as we evaluate, monitor, and trace the libraries, web services, SaaS products, and cloud platforms we depend on, we need to do the same for large-model applications; taken together, these practices constitute their observability.

1. Evaluation of Large Model Applications

When evaluating a traditional machine learning model, we generally check how correct its outputs or predictions are, using well-known metrics such as accuracy, precision, recall, RMSE, or AUC. For time series data, we might use domain-specific metrics like MAE or MAPE. For natural language processing tasks, metrics such as BLEU, ROUGE, or perplexity may be more appropriate.
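As a minimal sketch of the metrics above, the classification metrics (accuracy, precision, recall) and the regression metric RMSE can be computed directly from paired predictions and ground truth; the toy labels here are illustrative, not from any real model:

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the ground truth.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    # Of everything predicted positive, how much was truly positive.
    true_for_pred_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    return sum(t == positive for t in true_for_pred_pos) / len(true_for_pred_pos)

def recall(y_true, y_pred, positive=1):
    # Of everything truly positive, how much was predicted positive.
    pred_for_true_pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in pred_for_true_pos) / len(pred_for_true_pos)

def rmse(y_true, y_pred):
    # Root mean squared error for numeric (regression) predictions.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy example: 5 binary labels, 3 predicted correctly.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))          # 3 of 5 correct -> 0.6
print(precision(y_true, y_pred))         # 2 of 3 predicted positives are right
print(recall(y_true, y_pred))            # 2 of 3 actual positives are found
print(rmse([2.0, 3.0], [2.5, 2.5]))      # sqrt(mean of 0.25, 0.25) -> 0.5
```

In practice, libraries such as scikit-learn provide vetted implementations of these metrics; the point here is only that each one reduces to a simple aggregation over (prediction, ground truth) pairs.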

However, evaluating large models is considerably more complex. Here are some common approaches:

1.1 Classification and Regression Metrics

Although large models differ from traditional machine learning models in several respects, their evaluation can still borrow from established machine learning practice.

Large models can generate numerical predictions or classification labels, making the…
