Unleash the Power of AI: Develop a Question Answering Service with OpenAI and Ballerina
This article was written using Ballerina 2201.4.0 (Swan Lake Update 4)
Introduction
The new wave of Large Language Models (LLMs) has taken the world by storm and become the talk of the town. Everyone is trying to build their own use cases around these powerful LLMs, and question answering based on document data is one scenario that many people are interested in.
Typically, when we need an answer to something in a particular area, we have to search for it, and finding exactly what we’re looking for isn’t always easy or straightforward. It can take forever to skim through all the information we find, and we might still end up empty-handed. This is where LLMs come in handy! They can answer all kinds of questions in real time and even provide explanations and sources if we need them. However, there’s a catch: LLMs do not have up-to-date information on recent developments in various domains, and they are known for generating plausible but false responses, also known as “hallucinations”. As a result, if the questions asked are highly specific to a relatively new field, the LLM may generate inaccurate responses based on the related but outdated or incorrect information it was trained on.
There are ways to make LLMs better suited for what we want to achieve. One such approach is fine-tuning, where we use a dataset that exhibits the outcomes we expect so that the model learns to behave similarly. Unfortunately, this approach is not very useful for our question-answering scenario because fine-tuning helps the model remember patterns, not knowledge.
Prompt engineering offers a more promising solution by leveraging LLMs’ ability for in-context learning. In-context learning is the LLM’s ability to learn from the information given in the prompt itself, be it instructions, examples, or even knowledge. Therefore, we can provide relevant knowledge to the model in the prompt along with the question, and it will give us an accurate answer. However, if we have to provide the relevant information ourselves, have we come full circle back to searching for relevant content on our own? Not quite. This is where embeddings come in.
Embeddings are numerical representations of text that allow us to compare the similarity of two texts. They help provide relevant information to the model for question answering: by comparing the embedding of a question with the embeddings of the documents we have, we can identify the most similar documents and add them to the prompt for the model to use. We can do this comparison programmatically on our own, or we can use vector databases such as Pinecone and Weaviate, which do this job for us and return the most similar content. We simply add this content to the prompt and send it to the LLM.
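To make this concrete, similarity between two embeddings is commonly measured with cosine similarity: the dot product of the two vectors divided by the product of their magnitudes. The helper below is a minimal Ballerina sketch of that calculation, purely for illustration; in our implementation, the vector database performs this comparison for us.
// Illustrative only: cosine similarity between two embedding vectors.
// Assumes both vectors are non-zero and have the same length.
function cosineSimilarity(float[] a, float[] b) returns float {
    float dot = 0.0;
    float magA = 0.0;
    float magB = 0.0;
    foreach int i in 0 ..< a.length() {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (float:sqrt(magA) * float:sqrt(magB));
}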
Even though there are many articles and tutorials that discuss LLM-based question answering in languages such as Python, in this article, we want to showcase how effortless it is to build AI use cases with Ballerina and its newly introduced support for AI. We will see how Ballerina, being a language specialized for integration, is an ideal candidate for implementing AI use cases that require communicating with multiple APIs. Ballerina makes it easy to connect with hundreds of public APIs through its built-in connectors, which is a great benefit when building end-to-end interactions.
For example, let’s think of this use case as a chatbot service with the following integrations.
- Data retrieval — connect to Google Sheets and load data from them using the ballerinax/googleapis.sheets connector
- Embedding search — connect to hosted vector DBs such as Pinecone (ballerinax/pinecone.vector) and Weaviate (ballerinax/weaviate)
- Answer generation — connect to the OpenAI GPT-3 (ballerinax/openai.text) or ChatGPT (ballerinax/openai.chat) APIs (alternatively, Ballerina also supports the Azure OpenAI APIs) to generate answers to questions
- Receiving questions and responding to users — this entire process can be implemented easily as a Ballerina service.
A high-level view of our implementation of the question-answering use case is shown below.

During initialization, the service connects to a Google Sheet specified by its URL and loads its contents. Then, it uses the OpenAI embeddings model to obtain the embeddings for each content row, and the content, along with its corresponding embeddings, is stored in the Pinecone vector database.
When the service receives a request to answer a question, it first obtains the embedding of the question. It then passes the question embedding to the Pinecone vector database, which performs the similarity comparison and fetches the most relevant content. The Ballerina service constructs the prompt by combining the retrieved content, the question, and an instruction, and sends it to the OpenAI GPT-3 model. The model responds with the answer, which the service forwards to the user.
Prerequisites
To get started with the example, we will first need to set up the following prerequisites.
- OpenAI API Key
- Google Sheets access token (can be obtained using Google API Console)
- A Vector DB (we will use Pinecone in this example)
- An IDE (VS Code is preferred with the Ballerina extension installed)
For a complete guide on how to fulfill the prerequisites, refer to the sample Question Answering based on Context using OpenAI GPT-3 and Pinecone. We store all the keys and tokens in a Config.toml file in the project folder (which we will create in the next section) so that we can access them via configurable variables.
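As a reference, a minimal Config.toml could look like the following. The values shown are placeholders, and the keys match the configurable variables used in the implementation later in this article.
openAIToken = "<OPENAI_API_KEY>"
sheetsAccessToken = "<GOOGLE_SHEETS_ACCESS_TOKEN>"
sheetId = "<GOOGLE_SHEET_ID>"
sheetName = "<GOOGLE_SHEET_NAME>"
pineconeKey = "<PINECONE_API_KEY>"
pineconeServiceUrl = "<PINECONE_SERVICE_URL>"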
Once we have obtained the keys and access tokens, we can create a new Google Sheet and populate it with some data. Make sure that you create the new sheet using the account for which the access token was obtained. For the purpose of this example, we use some sample data obtained from the Choreo documentation, but you can use content from any preferred domain.
Implementation
Now that all the prerequisites are set up, we can start building our service, which will take a question as input and provide an accurate answer based on up-to-date information.
Create and initialize the service
As our first step, we will look at how to initialize the Ballerina service to read the data from a Google Sheet and insert it into the Pinecone vector database.
First and foremost, let us create a new Ballerina project to hold our service implementation. We can do this by executing the following command in the desired location. This will generate a folder with all the necessary artifacts to create and run the service.
bal new <service name>
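For example, assuming we name the package qa_service (any name will do), the command and the key generated artifacts would look roughly like this.
bal new qa_service
# qa_service/
# ├── Ballerina.toml   (package descriptor)
# └── main.bal         (where we will write the service code)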
Then, we will create the Ballerina service in the main.bal file, which will answer users’ questions by referring to the document content. We will initialize an HTTP service, which listens on port 8080.
import ballerina/http;

service / on new http:Listener(8080) {
    // implement the service
}
Load the data from Google Sheets
As mentioned earlier, in this example, we will load our document content from the Google Sheet that we created and populated with sample data when setting up the prerequisites. We will load this data, along with its embeddings, into our vector database so that we can easily fetch relevant content for a given question to construct the context.
To ensure that data loading from the Google Sheet to the Pinecone vector database happens only once during the service initialization, we will implement this logic in the init function of the service.
To store our content in the Pinecone vector database, we need to obtain the embeddings for the content. We can compute the embeddings using the OpenAI embeddings model with the help of Ballerina’s openai.embeddings connector. For this, we will create a client object for OpenAI embeddings by providing the API key, which the service reads from the Config.toml file into a configurable variable. Then, we send a request to the model via the client, providing the text and the model name, to obtain the embedding vector.
import ballerinax/openai.embeddings;

configurable string openAIToken = ?;

final embeddings:Client openaiEmbeddings = check new ({auth: {token: openAIToken}});

function getEmbedding(string text) returns float[]|error {
    embeddings:CreateEmbeddingRequest embeddingRequest = {
        input: text,
        model: "text-embedding-ada-002"
    };
    embeddings:CreateEmbeddingResponse embeddingRes = check openaiEmbeddings->/embeddings.post(embeddingRequest);
    return embeddingRes.data[0].embedding;
}
To read from the Google Sheet that contains our data, we first need to initialize a Google Sheets client object by providing the credentials. Then, we can fetch the content from the sheet by querying a range of columns. In our case, we fetch columns A and B, which contain the titles and the content, respectively. We can use the googleapis.sheets connector provided by Ballerina to fetch the data from the sheet.
import ballerinax/googleapis.sheets;
configurable string sheetsAccessToken = ?;
configurable string sheetId = ?;
configurable string sheetName = ?;
final sheets:Client gSheets = check new ({auth: {token: sheetsAccessToken}});
sheets:Range range = check gSheets->getRange(sheetId, sheetName, "A2:B");
Notice that we fetch data starting from row 2 (A2:B), assuming that the first row contains the headers “Title” and “Content”.
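For illustration, the sheet is assumed to have a layout along these lines, with one document (or documentation page) per row; the actual titles and content depend on the domain you choose.
A1: Title              B1: Content
A2: <document title>   B2: <document text>
A3: <document title>   B3: <document text>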
Upload the data to the Pinecone vector database
In order to access our Pinecone database, we will initialize a Pinecone client object using the API key and service URL. Then, we will initialize an empty array of Pinecone vectors to hold our data.
import ballerinax/pinecone.vector as pinecone;
configurable string pineconeKey = ?;
configurable string pineconeServiceUrl = ?;
final pinecone:Client pineconeClient = check new ({apiKey: pineconeKey}, serviceUrl = pineconeServiceUrl);
pinecone:Vector[] vectors = [];
We will iterate through the rows fetched from the Google Sheet to get each title and content and to obtain the corresponding embedding. Then, we will store all this information in the array of Pinecone vectors.
foreach any[] row in range.values {
    string title = <string>row[0];
    string content = <string>row[1];
    float[] vector = check getEmbedding(string `${title} ${"\n"} ${content}`);
    vectors[vectors.length()] = {id: title, values: vector, metadata: {"content": content}};
}
Finally, we insert the data vectors into the Pinecone database. We can do this using the Pinecone connector client by invoking the /vectors/upsert.post method. The namespace indicates the location of the collection of data. We named our namespace “ChoreoDocs”, indicating that it contains content from the Choreo documentation.
const NAMESPACE = "ChoreoDocs";
pinecone:UpsertResponse response = check pineconeClient->/vectors/upsert.post({vectors, namespace: NAMESPACE});
That completes the initialization of the question-answering service. The complete implementation of the init function is given below.
import ballerina/io;

function init() returns error? {
    sheets:Range range = check gSheets->getRange(sheetId, sheetName, "A2:B");
    pinecone:Vector[] vectors = [];
    foreach any[] row in range.values {
        string title = <string>row[0];
        string content = <string>row[1];
        float[] vector = check getEmbedding(string `${title} ${"\n"} ${content}`);
        vectors[vectors.length()] = {id: title, values: vector, metadata: {"content": content}};
    }
    pinecone:UpsertResponse response = check pineconeClient->/vectors/upsert.post({vectors, namespace: NAMESPACE});
    if response.upsertedCount != range.values.length() {
        return error("Failed to insert embedding vectors to pinecone.");
    }
    io:println("Successfully inserted embedding vectors to pinecone.");
}
Answer questions
Construct the prompt with context
In order to use an OpenAI model to answer questions, we must consider the limitations of these models. As we previously discussed, the accuracy and quality of answers may be diminished for specific domains since the model does not have recent knowledge. To address this limitation, we will provide the model with relevant context extracted from our stored content to help it give accurate answers.
However, OpenAI models also have token size restrictions, limiting the amount of information that can be included in the prompt. To ensure the most relevant information is included, we will utilize the Pinecone vector database to fetch a subset of the data that is closely related to the question. This is done by comparing the similarity of the content embeddings stored in the database with the embedding of the question, which is obtained through the OpenAI embeddings model. By passing the question embedding to the Pinecone client along with other meta-information, we can fetch the rows most similar to the question, ordered by similarity. In this example, we will fetch the top 10 most similar rows.
const MAXIMUM_NO_OF_DOCS = 10;

float[] questionEmbedding = check getEmbedding(question);
pinecone:QueryRequest req = {
    namespace: NAMESPACE,
    topK: MAXIMUM_NO_OF_DOCS,
    vector: questionEmbedding,
    includeMetadata: true
};
pinecone:QueryResponse res = check pineconeClient->/query.post(req);
pinecone:QueryMatch[]? rows = res.matches;
Now that we have fetched all the related data, it is time to construct the prompt by providing the data as context. However, even though we have fetched only a subset of the content from the vector database, we still need to be mindful of the token limit; if we try to include all the retrieved content, the prompt may still exceed it. To address this, we iteratively add the most relevant content to the prompt until the token limit is reached, while leaving some room for the answer. This ensures that we provide the model with the most pertinent context without exceeding the token limit.
import ballerina/regex;

function countWords(string text) returns int => regex:split(text, " ").length();

string context = "";
int contextLen = 0;
int maxLen = 1125; // approx equivalence between word and token count
foreach pinecone:QueryMatch row in rows {
    pinecone:VectorMetadata? rowMetadata = row.metadata;
    if rowMetadata is () {
        return error("No metadata found for the given document.");
    }
    string content = check rowMetadata["content"].ensureType();
    contextLen += countWords(content);
    if contextLen > maxLen {
        break;
    }
    context += "\n*" + content;
}
Once we have the context ready, we need to combine it with an instruction prompt, which will indicate to the model that it should refer to the context and answer the question.
string instruction = "Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say \"I don't know.\"\n\n";
string prompt = string `${instruction}Context:${"\n"} ${context} ${"\n\n"} Q: ${question} ${"\n"} A:`;
That completes the prompt construction with the relevant context to answer a question. The complete function that constructs the prompt is given below.
function constructPrompt(string question) returns string|error {
    float[] questionEmbedding = check getEmbedding(question);
    pinecone:QueryRequest req = {
        namespace: NAMESPACE,
        topK: MAXIMUM_NO_OF_DOCS,
        vector: questionEmbedding,
        includeMetadata: true
    };
    pinecone:QueryResponse res = check pineconeClient->/query.post(req);
    pinecone:QueryMatch[]? rows = res.matches;

    string context = "";
    int contextLen = 0;
    int maxLen = 1125; // approx equivalence between word and token count

    if rows is () {
        return error("No documents found for the given query.");
    }
    foreach pinecone:QueryMatch row in rows {
        pinecone:VectorMetadata? rowMetadata = row.metadata;
        if rowMetadata is () {
            return error("No metadata found for the given document.");
        }
        string content = check rowMetadata["content"].ensureType();
        contextLen += countWords(content);
        if contextLen > maxLen {
            break;
        }
        context += "\n*" + content;
    }

    string instruction = "Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say \"I don't know.\"\n\n";
    return string `${instruction}Context:${"\n"} ${context} ${"\n\n"} Q: ${question} ${"\n"} A:`;
}
Generate the answer
Now, it is time to put everything together and answer a question that comes as a request to the service. For this, we will create a GET resource function called answer within the service, which accepts the user’s question as a parameter. Next, we will construct the prompt by extracting the relevant context from the previously fetched data, as discussed earlier. Finally, we will generate the answer using the OpenAI text-davinci-003 model, which we access through Ballerina’s openai.text connector client.
import ballerinax/openai.text;

final text:Client openAIText = check new ({auth: {token: openAIToken}});

resource function get answer(string question) returns string?|error {
    string prompt = check constructPrompt(question);
    text:CreateCompletionRequest completionReq = {
        prompt: prompt,
        model: "text-davinci-003",
        max_tokens: 2000
    };
    text:CreateCompletionResponse completionRes = check openAIText->/completions.post(completionReq);
    return completionRes.choices[0].text;
}
By now, we have implemented a Ballerina service, which can answer questions in a specific domain by referring to a set of documents. For the complete implementation, refer to the sample in the Ballerina ai-samples GitHub repository.
Run the Service
Now, we can run the service and send a request to see it in action. To run the service, navigate to the project directory, and execute the following command.
bal run
The command starts the service on localhost, listening on port 8080 for requests. We can now send a GET request to the service by providing the question as a query parameter. For example, we can execute the following curl command to ask the question “What is Choreo?”.
curl -G 'http://127.0.0.1:8080/answer' --data-urlencode 'question=what is choreo?'
The response below shows how the service answers a question provided via the GET request.
(base) jayanihewavitharana@jayanih ai-samples % curl -G 'http://127.0.0.1:8080/answer' --data-urlencode 'question=what is choreo?'
Choreo is a versatile and comprehensive platform for low-code, cloud-native engineering. It provides an AI-assisted, low-code application development environment, API management capabilities, realistic DevOps with versions, environments, CI/CD pipelines and automation tests, and deep observability to trace executions
Conclusion
In this article, we explored how to develop a question-answering service in Ballerina that leverages the capabilities of OpenAI and the Pinecone vector database using the newly released connectors. The example demonstrates how to load data from a Google Sheet into the Pinecone database and use that data as a reference when constructing the prompt for the OpenAI model, by providing the context obtained from the content most similar to a given question.
As AI integration becomes more important in modern applications, we can see how Ballerina, a language specialized in integration, makes it simple to implement AI use cases that involve multiple externally hosted models and services. This demonstrates that we can build powerful and intelligent AI applications by combining the strengths of Ballerina, OpenAI, and Pinecone.