Enhancing Observability in RAG Solutions with Oracle Cloud.
Introduction.
Over the past year, interest in solutions based on Generative AI services has surged. Many developers have been actively exploring the development of so-called RAG solutions. The RAG paradigm is increasingly recognized as one of the most effective methods for overcoming the current limitations of Large Language Models (LLMs).
RAG stands for Retrieval Augmented Generation. This approach uses embeddings to search a Vector Store for the documents most relevant to a request. The retrieved documents are then provided as input to the LLM, together with the original question and the chat history. By using these retrieved documents as context, well-known issues such as hallucinations are controlled and significantly reduced.
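As a purely illustrative sketch (function and variable names here are ours, not a specific library API), the flow can be summarized as:

# illustrative sketch of the RAG flow; names are ours, not a library API
def rag_answer(question: str, chat_history: str, vector_store, llm) -> str:
    # 1. retrieval: find the documents most similar to the question
    docs = vector_store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. generation: answer grounded on the retrieved context
    prompt = (
        "Answer using only the following context.\n"
        f"Context:\n{context}\n\n"
        f"Chat history:\n{chat_history}\n\n"
        f"Question: {question}"
    )
    return llm.invoke(prompt).content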
The complexity of developing this kind of solution has been greatly simplified through the adoption of Python orchestration libraries like LangChain or LlamaIndex. However, since these solutions involve multiple, often distributed, components, deploying them in real-world production environments remains challenging.
This is why integrating with an Application Performance Monitoring (APM) service is crucial for gaining insights into the primary factors affecting overall response time and providing effective support in resolving production issues.
In this article, we provide a detailed explanation, with code examples, of how a RAG solution built with Python and Oracle Generative AI Cloud Service can be integrated with OCI APM. The solution has been developed in collaboration with Erika Sciunzi, from the EMEA Observability Specialists team.
The scenario.
The solution we’re going to describe is general and can be used to integrate any Python RAG application, with only these constraints:
- the application is deployed in OCI;
- it is built using LangChain;
- it leverages OCI Generative AI Service for the embeddings and chat models;
- it utilizes Oracle Database 23ai, with AI Vector Search, as the vector store.
To integrate with the APM service, we have used the py-zipkin Python library.
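py-zipkin is available on PyPI and can be installed with pip:

pip install py_zipkin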
The solution presented here has been developed for LangChain but, as you will easily see, it can be quickly adapted to LlamaIndex.
The overall architecture and the main concept.
In general, any RAG application developed according to the scenario described above involves the orchestration of three main components:
- an Embedding Model
- a Vector Store
- a Chat Model
LangChain provides abstractions for these three components, with well-defined methods called in the chain. Providers like Oracle have developed compatible implementations, available as part of the langchain-community library. As part of Oracle's contribution, we have the following Python components: OCIGenAIEmbeddings, OracleVS, and ChatOCIGenAI.
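In code, these are the imports you would typically use (module paths from langchain-community, at the time of writing):

from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores.oraclevs import OracleVS
from langchain_community.chat_models import ChatOCIGenAI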
To incorporate the Zipkin-based Python code needed to send trace and span information to Oracle APM, we applied solid Object-Oriented principles: we created a subclass of each of the three classes and overrode the core methods to add the code based on the Zipkin annotation.
In summary, we have developed the following three classes, OCIGenAIEmbeddingsWithBatch, OracleVS4APM, and ChatOCIGenAI4APM, overriding the following methods:
- embed_documents
- similarity_search
- invoke and stream
The addition in the subclasses is simple: each overridden method is decorated with the Zipkin annotation and then calls the corresponding method of the superclass, as in the following example:
@zipkin_span(service_name=SERVICE_NAME, span_name="similarity_search")
def similarity_search(
    self, query: str, k: int = 4, filter: Dict[str, Any] | None = None, **kwargs
) -> List[Document]:
    """
    Perform a similarity search with APM tracing.

    Args:
        query (str): The query string for the search.
        k (int, optional): The number of top results to return. Defaults to 4.
        filter (Dict[str, Any], optional): A filter to apply to the search. Defaults to None.
        **kwargs: Additional keyword arguments.

    Returns:
        List[Document]: A list of documents that match the search criteria.
    """
    return super().similarity_search(query, k=k, filter=filter, **kwargs)
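As another illustration of the same pattern, here is a minimal sketch of what the chat-model subclass can look like (the actual implementation is in chatocigenai_4_apm.py in the repository; SERVICE_NAME and the span names are our choices):

# hypothetical sketch of the chat-model subclass, following the same pattern
from langchain_community.chat_models import ChatOCIGenAI
from py_zipkin.zipkin import zipkin_span

SERVICE_NAME = "rag-demo"  # assumption: the APM service name you choose


class ChatOCIGenAI4APM(ChatOCIGenAI):
    """ChatOCIGenAI instrumented for OCI APM tracing via py-zipkin."""

    @zipkin_span(service_name=SERVICE_NAME, span_name="chat_invoke")
    def invoke(self, input, config=None, **kwargs):
        # the decorator opens a child span around the LLM call,
        # then we simply delegate to the superclass implementation
        return super().invoke(input, config=config, **kwargs)

    # stream is overridden the same way, with span_name="chat_stream"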
To integrate your Python RAG application with OCI APM, you need to make the following replacements in your code:
- OCIGenAIEmbeddings → OCIGenAIEmbeddingsWithBatch
- OracleVS → OracleVS4APM
- ChatOCIGenAI → ChatOCIGenAI4APM
and add some configurations (see details at the end).
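For example, for the vector store the change is just the class name (the constructor arguments below are illustrative):

# before:
# from langchain_community.vectorstores.oraclevs import OracleVS
# v_store = OracleVS(client=connection, embedding_function=embed_model, table_name="DOCS")

# after: same constructor arguments, instrumented subclass
from oraclevs_4_apm import OracleVS4APM

v_store = OracleVS4APM(
    client=connection,  # an open database connection (illustrative)
    embedding_function=embed_model,
    table_name="DOCS",
)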
Transport configuration.
One additional piece of code is needed: the transport configuration. A working example is contained in the transport.py file:
"""
This file contains the pluggable component for the transport
"""
import requests
from utils import load_configuration
from config_private import APM_PUBLIC_KEY
#
# general configs
#
config = load_configuration()
#
# configs for zipkin transport
# see notes here:
# https://docs.oracle.com/en-us/iaas/application-performance-monitoring/doc/configure-open-source-tracing-systems.html
#
BASE_URL = config["apm_tracing"]["base_url"]
APM_CONTENT_TYPE = config["apm_tracing"]["apm_content_type"]
# this is the public endpoint we're calling
APM_UPLOAD_ENDPOINT_URL = f"{BASE_URL}/observations/public-span?dataFormat=zipkin&dataFormatVersion=2&dataKey={APM_PUBLIC_KEY}"
# in config.toml we can enable/disable globally tracing
ENABLE_TRACING = config["apm_tracing"]["enable_tracing"]
def http_transport(encoded_span):
"""
This function gives the pluggable transport
to communicate with OCI APM using py-zipkin
"""
result = None
if ENABLE_TRACING:
result = requests.post(
APM_UPLOAD_ENDPOINT_URL,
data=encoded_span,
headers={"Content-Type": APM_CONTENT_TYPE},
timeout=10,
)
return result
The code takes care of sending the encoded spans, over HTTP, to the OCI APM endpoint. A flag (ENABLE_TRACING) allows tracing to be globally enabled or disabled.
One important detail is the definition of APM_UPLOAD_ENDPOINT_URL. The needed information can be retrieved from the configuration of the APM domain, in the OCI Console:
- the BASE_URL (the data upload endpoint)
- the private and public data keys (here we use the public key)
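As an illustration, the apm_tracing section of config.toml could look like the following (the base_url below is a placeholder; use the data upload endpoint of your own APM domain):

[apm_tracing]
# data upload endpoint of your APM domain (placeholder value)
base_url = "https://<your-apm-domain>.apm-agt.<region>.oci.oraclecloud.com/20200101"
apm_content_type = "application/json"
enable_tracing = true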
How to start a trace.
In the following code, you find an example of how to start a trace. The zipkin_span context manager is configured with the function wrapping the HTTP transport (http_transport, defined in transport.py).
# needed imports (at the top of the module):
# from py_zipkin.encoding import Encoding
# from py_zipkin.zipkin import zipkin_span
# from transport import http_transport

with zipkin_span(
    service_name=SERVICE_NAME,
    span_name="rag_invoke",
    transport_handler=http_transport,
    encoding=Encoding.V2_JSON,
    binary_annotations={"conv_id": conv_id},
    sample_rate=100,
) as span:
    logger.info("Conversation id: %s", conv_id)

    if detailed_tracing:
        # add the input to the span
        span.update_binary_annotations({"genai-chat-input": request.query})

    try:
        response = handle_request(request, conv_id)
        # keep only the text of the response
        answer = response["answer"]

        if detailed_tracing:
            # add the output to the span
            span.update_binary_annotations({"genai-chat-output": answer})
    except Exception as e:
        # signal the error in the trace
        span.update_binary_annotations({"error": "true"})
        span.update_binary_annotations({"ErrorMessage": str(e)})
        answer = f"Error: {str(e)}"

return Response(content=answer, media_type=MEDIA_TYPE_TEXT)
Examples of Trace details.
This picture comes from the OCI APM Trace Explorer, showing the list of traces from the last 8 hours.
The following image shows the detail of the first trace:
We can see, for example, that the chat model is called twice: first, to condense the chat history and the new question into a single standalone question, and second, to generate the answer.
The repository with all the code.
The code for a complete example is available in this GitHub repository.
Conclusion.
A RAG application based on cloud services is inherently a distributed system. Observability is crucial for gaining insights into the primary factors affecting overall response time and obtaining information that can help reduce the time needed to identify and resolve issues.
In this article, we have presented a solution enabling the seamless integration of a Python RAG application with OCI APM and provided a complete example of the code.
The recipe for the integration.
Here we provide the list of simple steps needed to integrate your LangChain RAG application with OCI APM.
- Create an APM domain in your tenancy. To start testing, you can use an Always Free domain.
- Add to your source code the three files oci_embeddings_4_apm.py, oraclevs_4_apm.py, and chatocigenai_4_apm.py, which you can find in the GitHub repository.
- Add the transport.py file and define the BASE_URL.
- Replace the three classes OCIGenAIEmbeddings, OracleVS, and ChatOCIGenAI in your code with the versions instrumented with the Zipkin code.
- Provide the needed configurations. In the example, all the configurations are in the config.toml file, but you can use the format you prefer. Private configurations, like COMPARTMENT_OCID and APM_PUBLIC_KEY, are in config_private.py (not included in the repository).
- You can use test_streaming01.py for a simple test of the chat model alone, along the lines of the sketch below.
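A hypothetical minimal test could look like this (model_id and service_endpoint are placeholders; use the values for your tenancy):

# hypothetical minimal test, in the spirit of test_streaming01.py
from py_zipkin.encoding import Encoding
from py_zipkin.zipkin import zipkin_span

from chatocigenai_4_apm import ChatOCIGenAI4APM
from config_private import COMPARTMENT_OCID
from transport import http_transport

chat = ChatOCIGenAI4APM(
    model_id="cohere.command-r-plus",  # placeholder: any OCI GenAI chat model
    service_endpoint="https://inference.generativeai.eu-frankfurt-1.oci.oraclecloud.com",
    compartment_id=COMPARTMENT_OCID,
)

# the root span carries the transport handler; the instrumented
# classes create child spans inside it
with zipkin_span(
    service_name="rag-demo",
    span_name="test_chat",
    transport_handler=http_transport,
    encoding=Encoding.V2_JSON,
    sample_rate=100,
):
    for chunk in chat.stream("Tell me a joke about observability."):
        print(chunk.content, end="", flush=True)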
Some best practices.
Delivering a robust solution in Python also means providing high-quality Python code. All the code in the repository has been developed following time-tested Object-Oriented principles, formatted using Black, and checked with Pylint.
However, there’s always room for improvement. If you find any mistakes (which are entirely my responsibility), please let us know.
Oh, by the way: we have used Generative AI to check the content of this article. Why not?