In a recent blog post, we outlined six guiding principles for artificial intelligence (AI) at Onna. Our aim was not only to assist others in navigating the evolving AI landscape but also to promote transparency about our approach to AI. That approach spans our practices, technology, vendors, and data usage, and we hold ourselves to it with our team, customers, and partners.
Today, I’ll take you through Onna’s AI journey: where we began, where we currently stand, where we’re headed, and how we apply our principles at every step. Because ultimately, we believe that building trust in AI requires more than just defining principles; we must actively put those principles into practice.
Before we delve into the evolution of our AI, let me provide a high-level overview of how Onna solves the fragmented data problem. Onna was built to help organizations gain control over their data. Typically, this involves unstructured data that proliferates quickly over time and across multiple digital platforms. While some companies choose to tackle this challenge internally, they soon discover the steep costs and complexities associated with such an endeavor.
We’ve developed a highly automated, scalable, resilient, and extensible system to address this challenge. At its core, our system is based on the following key concepts:
With these capabilities, we initially entered the market with a focus on the eDiscovery use case, empowering legal teams to find the proverbial needles in the haystack. However, we didn't limit ourselves to developing just an eDiscovery tool; instead, we created a powerful data management platform that seamlessly transfers information from point A to B, extracts valuable insights, and enables actionable outcomes.
As mentioned earlier, we have designed our processing engine to be extensible, with each step in the pipeline functioning as a distinct microservice that performs a specific task and appends its output to each piece of content. To accomplish this, we have developed microservices for the following functions:
As you can see, some of these applications serve a broad purpose, while others address very specific problems.
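To make this design concrete, here’s a minimal Python sketch of the pattern: each step receives a piece of content, performs its one task, and appends its output. All names here are hypothetical illustrations rather than Onna’s actual services, and in production each step would run as an independent microservice rather than in one process.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable

@dataclass
class ContentItem:
    """A piece of ingested content plus the outputs appended by pipeline steps."""
    raw_text: str
    enrichments: Dict[str, Any] = field(default_factory=dict)

class LanguageDetectionStep:
    """One hypothetical step: detect the language and append the result."""
    name = "language_detection"

    def process(self, item: ContentItem) -> ContentItem:
        # Stand-in result; a real service would call a detection model.
        item.enrichments[self.name] = {"language": "en"}
        return item

def run_pipeline(item: ContentItem, steps: Iterable) -> ContentItem:
    # Each step works independently and appends its output to the item,
    # mirroring the microservice-per-step design described above.
    for step in steps:
        item = step.process(item)
    return item

doc = run_pipeline(ContentItem("Hello from a Slack export"), [LanguageDetectionStep()])
print(doc.enrichments)  # {'language_detection': {'language': 'en'}}
```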
When Generative AI and Large Language Models (LLMs) started making global strides in late 2022, we at Onna quickly recognized the immense potential these technologies held. Through conversations with our customers, we found that they too were interested in leveraging these technologies to elevate the information they could extract from their data.
As chatbots became increasingly prevalent, people began using them to tap a wealth of public knowledge. However, these chatbots could not answer queries grounded in closed, domain-specific enterprise data.
This sparked an idea: What if Onna could empower its customers to develop domain-specific LLMs using their own data?
Fortunately, we had a significant advantage: the content we had already ingested from various data sources on behalf of our customers. We extract all the text and index it for search, the content is continuously refreshed as it changes at the source, and a robust security layer ensures that interactions are limited to the data each user has access to.
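As a minimal sketch of that security model (assuming the opensearch-py client and hypothetical index and field names such as content, extracted_text, and allowed_groups; this is not Onna’s production query), every search can be filtered by the caller’s entitlements:

```python
from opensearchpy import OpenSearch

# Hypothetical local endpoint; a real deployment would use its own cluster.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def search_as_user(query: str, user_groups: list) -> dict:
    """Run a full-text search restricted to documents the caller may access."""
    body = {
        "query": {
            "bool": {
                # Match the query against the extracted text...
                "must": [{"match": {"extracted_text": query}}],
                # ...but only within documents whose ACL overlaps the
                # caller's groups, so no response can leak restricted data.
                "filter": [{"terms": {"allowed_groups": user_groups}}],
            }
        }
    }
    return client.search(index="content", body=body)
```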
Every year, Onna hosts a Hackathon where all employees are invited to work together and quickly prototype innovative ideas. The participants then present their concepts to the entire company for evaluation. One of the most popular ideas this year, and the eventual winner of the Hackathon, was a domain-specific chatbot trained to answer queries specific to Onna’s platform.
After the Hackathon, we formed a volunteer squad comprising engineers and product managers to carry the project forward. During the planning phase, we identified two primary use cases that we wanted to focus on:
While we’ve discussed additional use cases that may be explored in the future, our team has prioritized these specific areas to maintain a balance with our existing product strategy and fulfill our commitments to our customers.
We designed the chatbot as a document retrieval system using a RAG (Retrieval-Augmented Generation) workflow that capitalizes on Onna’s existing internal text data in OpenSearch, as well as tools that were new to us: the LangChain framework and the Chroma embedding database. The workflow can be summarized as follows:
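In broad strokes: already-extracted text is embedded into a Chroma collection; at question time, the most relevant chunks are retrieved and passed, along with the question, to an LLM that answers from those passages. Here’s a minimal sketch of that flow using the 2023-era LangChain API (the library has evolved since); the sample texts, the OpenAI backend, and the parameter choices are illustrative assumptions rather than our production pipeline:

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# 1. Embed document text (already extracted and indexed upstream) into Chroma.
texts = [
    "To connect a new data source, open Settings and choose Integrations...",
    "Exports can be scheduled daily, weekly, or on demand...",
]
vectorstore = Chroma.from_texts(texts, OpenAIEmbeddings())

# 2. At query time, retrieve the most relevant chunks and "stuff" them into
#    the prompt so the LLM answers from the retrieved context.
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
)

print(qa.run("How do I connect a new data source?"))
```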
It’s important to address the elephant in the room: the data we submit to the LLM. In our experimentation, we utilized data from Onna’s publicly available Helpdesk articles and the well-known Enron dataset. Recognizing the critical privacy concerns of a production deployment, any use of a model is strictly opt-in. Additionally, we either fully air-gap the model or shield the data through legal contracts with a trusted partner to protect privacy.
We evaluated various models, both open-source and closed-source, to assess how well they responded to our instructions. At the time of writing, these models included Google’s Flan-T5 and PaLM 2, Vicuna, and OpenAI’s ChatGPT.
Each model exhibited its own strengths and weaknesses, and we couldn’t crown a definitive winner. As is widely known, OpenAI’s models generally provide the most accurate responses and superior performance, but they score low on the privacy scale: we lack assurances about how data is handled once it is submitted to their APIs.
We’ve also seen impressive results from Google’s PaLM 2 API via Vertex AI. This model is shaping up to be a strong contender, as Google’s terms and conditions provide more favorable stipulations around data privacy and whether submitted data is used for further training.
Open-source models also show significant potential. In several instances, our results have rivaled those of commercial counterparts, while in others, they’ve left us scratching our heads. Outputs that are hallucinations or outright gibberish are common, so we're hesitant to put them into production just yet. Nonetheless, the progress observed here is highly encouraging, especially considering the relatively short time elapsed since the release of the first LLaMA-based model.
There is no denying that preparing data for machine learning training can be time-consuming and costly for both engineering teams building data pipelines and data scientists shaping the training format.
We've already covered the data pipeline aspect, so let's look at how we quickly developed a method to export data from Onna into a format compatible with Vertex AI.
Suppose we're handling a classification task where a customer needs to build a training set consisting of numerous label:text pairs. This dataset will be used to train a model that can classify unseen documents.
Using Onna, the entire workflow can be accomplished swiftly as follows:
In this example, Vertex AI expects the dataset format to look like this:
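For single-label text classification, the JSONL records pair a label with the document text. As a hedged illustration (the field names follow Google’s documentation for Vertex AI text classification at the time, while the labels and text are invented, so check the current docs), a few lines of Python produce records in that shape:

```python
import json

examples = [
    {"classificationAnnotation": {"displayName": "contract"},
     "textContent": "This agreement is entered into by and between..."},
    {"classificationAnnotation": {"displayName": "invoice"},
     "textContent": "Amount due upon receipt of this statement..."},
]

# One JSON object per line -- the JSONL layout Vertex AI ingests for training.
with open("training_data.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")
```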
In Onna, the user creates an export with an export_schema setting to map the tag and extracted_text fields from Onna to Vertex AI.
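Conceptually, the mapping might look like the sketch below. The actual shape of Onna’s export_schema is internal to the product, so treat this purely as an illustration:

```python
# Hypothetical illustration of an export_schema-style mapping, pairing each
# Vertex AI field with the Onna field whose value should populate it.
export_schema = {
    "output_format": "jsonl",
    "field_mapping": {
        "classificationAnnotation.displayName": "tag",
        "textContent": "extracted_text",
    },
}
```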
In this simplified example, we create an export, resulting in a JSONL file that conforms to Vertex AI’s training format. This file is then ready to be used for whatever business purpose the client requires.
What does all of this mean for Onna’s future? We’ve learned a lot from our experimentation, experiencing definite successes as well as identifying opportunities for improvement. Along the way, we’ve discovered various use cases that will enhance the experience within the Onna platform. These include semantic search, natural language query processing, smart actions, and cross-data-source summarization, among others. We will keep our blog updated with our progress as we continue.
We also realized that we can offer these same advantages to our customers as they develop their own AI applications. Our goal is to build tooling that allows our customers to leverage the enterprise data they have in Onna to power their use cases, requiring minimal investment on the data side.
And while the technology in areas such as language models, search, and natural language processing is not entirely new, innovation and advancements are currently happening at a rapid pace. The Onna R&D team will continue to invest in this field, both from a learning and delivery standpoint. We will also partner with customers and vendors who have expressed interest in this field and the capabilities we have already demonstrated.
There’s still quite a journey ahead of us, but personally, I’m excited to explore where it takes us and the new possibilities we unlock.
If your company is interested in building domain-specific LLMs and is considering how to tackle data preparation, please get in touch.