AI Engineer
December 23, 2025


Small Bets, Big Impact: Building GenBI at a Fortune 100

Asaf Bord, Northwestern Mutual

Date: October 2023

This summary explores how a legacy financial giant integrates generative AI into business intelligence without compromising its risk-averse DNA. It provides a blueprint for builders to bridge the gap between experimental demos and hardened production tools.

  • How does messy legacy data become a competitive advantage for AI agents?
  • What is the "crawl, walk, run" strategy for building executive trust?
  • Why will the 10x productivity of AI workers kill the per-seat SaaS model?

Top 3 Ideas

SHIP EARLY, SHIP OFTEN

  • Real Data First: Using messy production data instead of clean synthetic sets. This ensures that if a tool works in the lab it actually survives the chaos of reality.
  • The Feedback Loop: Involving end users in the research phase rather than just the rollout. This creates immediate buy-in and turns skeptics into internal champions.
  • Risk Mitigation: Breaking projects into six-week sprints with tangible deliverables. This allows leadership to fund innovation without the fear of a million dollar sunk cost.

THE MULTI-AGENT ARCHITECTURE

  • Modular Agent Design: Breaking the system into metadata, RAG, and SQL agents. This allows the team to productize individual components like a "data finder" before the full system is ready.
  • Metadata Is King: Proving that LLM performance scales with documentation quality. This turns "boring" data hygiene into a high ROI technical priority.

THE DEATH OF SEATS

  • Value Based Pricing: Moving away from per-user licenses toward usage or outcome metrics. This reflects the reality that AI augmented workers produce vastly more value per head.
  • Efficiency Gains: Automating 80% of routine BI requests. This frees up elite talent to focus on strategic reasoning rather than manual data retrieval.

Actionable Takeaways

  • The Macro Shift: The transition from "Human-in-the-loop" to "Agent-as-the-interface" for enterprise data.
  • The Tactical Edge: Audit your metadata quality now because LLM accuracy is a direct function of your documentation.
  • The Bottom Line: Success in enterprise AI is not about the biggest model but about the smallest, most frequent wins that build institutional trust.


Doesn't this look like something's going to drop from the ceiling? Like a ground zero type thing? Be honest. It's like, who has the buzzer that, if I really suck, they press it and everything falls down through the trap door? No. Be careful. Yeah. Okay. Who was it? Okay. You tell me if I'm doing okay or if I should take a couple of steps back. Right. So, hi everyone. I'm Asaf, and I'm here to talk about GenBI. And kind of a first disclaimer, this presentation was not created with gen AI. To be honest, I actually started doing it with GPT o3 back in August. I did kind of a first draft, and then a couple of weeks back I wanted to come in and refresh it before the conference, and then GPT-5 took over and completely messed up my slides. So I ended up doing it manually, kind of old-fashioned. So if I'm missing an em dash somewhere in the middle, let me know after. Okay.

Asaf: So first of all, a bit of housekeeping. What's GenBI? It's a fusion of gen AI and BI. It's basically an agent that helps people answer business questions with data, like a business intelligence person would do in real life. The reason we're pursuing GenBI is really the data democratization it can bring, right? Having access to data at your fingertips without having to be reliant on a BI team that helps you find a report, figures out what it means, and has to understand your world before they can even give you any kind of input. So that's GenBI. A bit about Northwestern Mutual, which is where I work. We're a financial services company, in life insurance and wealth management. We've been around for 160 years.

Some very impressive numbers there. But first of all, I want to say why is Northwestern Mutual a great place to do Gen AI. We got a lot of data, we got a lot of money, we got a lot of use cases, and we got access to some of the best talent anyone can dream of. Really truly humbled by the people that I get to work with.

But on the flip side, why is it hard to do gen AI at Northwestern Mutual? Because it is a very risk-averse company, right? If you think about it, our main motto is generational responsibility. I call it "don't f up." Because what we end up selling to people is a decades-long commitment, right? You buy life insurance now. If you stay with us until it comes to term, so to speak, that can be 20, 40, 80 years down the line, depending on when you buy it and how long you get to live. And so stability is something that's very important for us because it's important for our clients. So how do we balance stability with innovation? That's what I want to talk about today.

And really, these were the four main challenges we had when we even came up with the idea, a kind of pie-in-the-sky GenBI concept. First of all, no one's done it before, right? Truly, no one's done GenBI in this fashion in the past. Secondly, and this was really a preference for us, we wanted to use actual data that's messy, because we knew that's where the real challenges were going to be, right? Understanding actual messy data from a 160-year-old company, and figuring out how we can perform well within that ecosystem.

The third was kind of a blind trust bias. The trust that we had to build was both with the users and with the leadership of the company. How can we bring accurate information, accurate answers, to people when all of the concerns that we know about, and everyone's talked about, are just out there? No one's blind to the trust barriers; no one's blind to the accuracy barriers. So how do we convince people that this is actually something we can trust in the company? And lastly, but really firstly, when we approach this from an enterprise perspective: budget and impact. How do we convince someone in a leadership organization, where risk aversion is ingrained in the DNA, to even invest in something like this, when no one's done it before, we don't really know how we would do it, and we're not even sure what it would look like when it's done? So I'll go kind of one by one, and first of all really talk about why we chose to use actual data and not synthesized or cleansed data.

So really, it's about making sure that we understand the actual complexities we will have to face when we eventually want to go to production. We know that building POCs and demos is so easy, but the gap from POC to production is so broad, especially in this gen AI space, and especially because we don't know upfront how to design the system or how we'd expect it to behave. So making sure that we operate with real data just gave us that extra confidence that when something works in the lab, it's very likely to also work in reality.

But also, and maybe no less important, is that we got to work with actual people who work with the data day in and day out, and that gave us two things. First of all, subject matter expertise, which is super critical for us to be able to validate that the system is actually working. It gave us a lot of real-life examples of what people actually ask in a corporation and what answers they've been given. So basically the evals, right, and all the testing and so on. But at the end of the day, it also brought the business in as part of the research project itself, and they became bought into the idea as part of the process. So we didn't just test something in the lab and then have to convince someone to go ahead and use it. The end users were part of the research process itself.

And so when it eventually matured enough that we could take some of it to production, they were already there, and they were actually pulling for it. They told us: we want to take this; how can we wrap it, how can we package it quickly enough so we can put it into practice? The next part was really about building trust, first of all with our management team. Now, I don't know about you, but the last time I got a million dollars to do a research project on a pie-in-the-sky idea I wanted to try, I woke up from the dream and realized that this is not how things work in reality. You don't just get a million dollars and go ahead and try something out. You have to show that you know what you're doing.

Part of what we did is kind of listed out here, but obviously we did all the regular stuff, right? We worked in a sandbox environment. We made sure that we're not using actual client data. We made sure to address all the security and risk considerations. But one of the first approaches we said we were going to take is that we're not just going to build a tool that gets released to everyone, right? We understood very quickly that how people interact with the tool, their ability to verify that what they're getting is right, and also to give us feedback, changes dramatically depending on their expertise and understanding of the data. So we took that crawl, walk, run approach: we're first going to release it to actual BI experts, people who would be able to do the work on their own and know what good looks like when they get it. We're just going to expedite the process for them, kind of like a GitHub Copilot.

The next phase would be to bring it to business managers, again people who are close to the BI team. When they see a mistake, they can pretty much figure out that what they're seeing is wrong, because they're used to seeing this data on a day-to-day basis, and they might be less sensitive to these types of mistakes and more inclined to give us that feedback instead of just, you know, dumping the tool aside and never using it again. Giving this type of tool to executives in the company, I don't even know when we're going to get there, right? An executive wants clear, concise answers that they know they can trust. We're definitely not there yet. I think that's the vision at some point in time, but the system is not accurate enough for us to get there. Maybe it never will be.

Another lever that we used to build inherent trust into the system is that we said, from the get-go, we're not even going to try to build SQL, right? This is very complex. This is very hard even for a person. So we said, step number one, let's just bring information that is already in the ecosystem and already verified. We have a lot of certified reports and dashboards. And actually, in the conversations we had with some of the BI teams we worked with, they told us: guys, like 80% of the work that we do is basically sending people to the right report and helping them figure out how to use it. So the report is already there. And that again built some inherent trust into how we architected the system, because we said we're not going to make up information; we're just going to deliver you the same asset that you would have gotten anyway, just in a much faster, much more interactive way. That was the alignment of expectations that we set very upfront with the users and also with the management team.

Now, the most important approach we took when going to our leadership team and convincing them that we wanted to do this was to create a very gradual, incremental process that gave them a lot of visibility and control. It was very important for us to build incremental deliveries throughout that process, so that not only did they have visibility into what we were funding now and what we'd get out of it, they actually had business deliverables they could realize value from throughout the process. And at any point in time they could pull the plug and say, okay, it's not working well, or we got enough out of it, or the next phase is so unknown and long that we don't want to invest further in it. This is how we basically broke it down.

So phase one was just pure research, right? We kind of did the shift from natural language to SQL. We figured out how to write responses. We figured out how to understand the questions coming in. Just setting the stage. Phase two was about really understanding what good metadata and good context look like from the perspective of a BI agent, right? It looks very different from just chatting with something, or from doing RAG with unstructured data like documents, business knowledge, and so on. This phase on its own already had impact on the business, because when we defined what good metadata looks like for an LLM, we could immediately apply that to the whole ecosystem of data users across the enterprise.
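The talk doesn't show the exact metadata format the team settled on, but a minimal sketch of what an LLM-friendly catalog entry of the kind being described might contain could look like the following Python snippet. The table, columns, owner, and caveats are invented for illustration only.

    # Hypothetical example of an LLM-friendly catalog entry: plain-language
    # descriptions, join hints, and caveats alongside the raw schema.
    policy_snapshot_metadata = {
        "table": "policy_snapshot_daily",            # hypothetical table name
        "description": ("One row per in-force policy per day; "
                        "used for retention and lapse reporting."),
        "owner": "bi-insurance-team@example.com",    # placeholder contact
        "refresh": "daily, 6:00 AM CT",
        "columns": {
            "policy_id":   "Unique policy identifier; joins to client_master.policy_id.",
            "status_cd":   "Policy status code: A=active, L=lapsed, S=surrendered.",
            "face_amount": "Death benefit in USD; null for annuity products.",
        },
        "certified_reports": ["Policy Retention Dashboard", "Lapse Trend Report"],
        "caveats": "Excludes group policies; restated quarterly for reinsurance.",
    }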

And by understanding how to extract metadata for an LLM from the information, sorry, here's where the trap door comes into play, right? We could also project that onto what good metadata looks like for humans interacting with the data. We have another initiative around a semantic layer going on, which tries to model exactly that, and this provided very valuable input to that initiative as well. But the immediate next step was basically doing this kind of multi-context semantic search: people coming in asking different questions, and the system figuring out what the right context is and what information we need to bring them. This is something that could already be packaged as its own product and delivered, basically a data finder and data owner finder, which covers something that could take anywhere between two and four weeks in an enterprise like Northwestern Mutual: just finding what data exists and who owns it, so I can start the conversation with them.
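As a rough illustration of that data finder idea, here is a minimal semantic-search sketch over catalog entries. The embed function, the catalog fields, and the ranking are assumptions, not the team's actual implementation.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder: call whatever embedding model is available internally."""
        raise NotImplementedError

    def find_data_and_owner(question: str, catalog: list[dict], top_k: int = 3) -> list[dict]:
        """Rank catalog entries (each with a description and an owner)
        by cosine similarity to the user's question."""
        q = embed(question)
        scored = []
        for entry in catalog:
            doc = embed(entry["description"])
            score = float(np.dot(q, doc) / (np.linalg.norm(q) * np.linalg.norm(doc)))
            scored.append((score, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [{"asset": e["table"], "owner": e["owner"], "score": round(s, 3)}
                for s, e in scored[:top_k]]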

The next layer was really about pulling in information and trying to do some light pivoting around the data. Each of these steps, as you can see, also created an input to the following step, so the research itself was kind of self-propelling, and there were incremental outcomes coming out of each phase. The next one is more about setting it up for enterprise-level usage: understanding the roles of different users coming in, what they may be asking about, what type of access we want to give them, and so on. And eventually, and this still has some way to go, building a fully fledged GenBI agent that doesn't only quote information from existing reports but can actually run SQL queries on its own, pull in more data, and do more sophisticated joins between different data sets, so it can answer more complex questions. So that's the roadmap, the high-level plan.

Now, why did that work? To quickly summarize: we get value early and we get value often. Each of these was a six-week sprint, at the end of which we had a very tangible deliverable coming back to the business that we could decide to productize. And at any point in time, we could decide how we wanted to move forward. There was transparent progress. There was incremental business value. Each of these steps allowed us to learn something that fed the next step. And maybe the most important part, and that's the bottom line here, the part that executives really look at: how do we control the risk in continuing to invest in this type of research project? This is really about eliminating things like sunk cost bias: we already paid, you know, whatever, a million dollars, so let's just get through the project and see what we get at the end. It also eliminates the fear around competitors. Everyone in the industry is researching GenBI, and there are solutions like Databricks Genie that are coming up and getting better and better. Maybe at some point it's better for us as an organization to actually adopt Databricks Genie, but at that point, first, it's much easier for us to pull the plug and the funding, and second, we already have a good understanding of what good looks like. We have benchmarks that we used when testing our own system that we can test a third-party solution with. And we know what to expect: we know what works, we know what doesn't, we know what a fluffy demo from a vendor would look like, and we know where to drill in to ask the tough questions.

So let's see what it looks like under the hood and how we productized different elements of this architecture. And maybe very quickly, why can't we just do it with ChatGPT? Well, just dumping a schema into ChatGPT doesn't work. Usually schemas are very messy, and it's not easy to understand the context and the meaning of things. And ultimately, governance is super important. There was a lot of governance built into the architecture that would have been very hard to apply to ChatGPT from the outside. Even solutions like Databricks Genie, as a third party, are much harder to govern from the outside than from the inside. But still TBD.
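The governance layer isn't detailed in the talk, but the point about governing from the inside can be pictured as a gate that every retrieval passes through before anything reaches the model. The fields and rules below are hypothetical.

    def apply_governance(user_roles: set[str], candidates: list[dict]) -> list[dict]:
        """Keep only certified assets the requesting user is entitled to see.
        Runs inside the pipeline, before any metadata or data reaches the LLM."""
        allowed = []
        for asset in candidates:
            if not asset.get("certified", False):
                continue                                 # never surface uncertified reports
            required = asset.get("required_role")
            if required and required not in user_roles:
                continue                                 # enforce role-based access
            if asset.get("contains_client_data", False):
                continue                                 # sandbox rule: no client data
            allowed.append(asset)
        return allowed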

So the stack looks like this. We have a data and metadata layer that we produced. We have four different agents running across the pipeline: a metadata agent that understands the context, a RAG agent that finds the relevant reports, an SQL agent that can pull more data if we need it, and then what we call a BI agent that takes all that information and delivers an answer to the question that was asked. On top of that, we slap on governance and trust, orchestration, and eventually some kind of contextual UI.
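One minimal way to picture that stack is as four narrow interfaces wired together by an orchestrator. The class and method names below are illustrative stand-ins, not the actual codebase.

    from typing import Protocol

    class MetadataAgent(Protocol):
        def resolve_context(self, question: str) -> dict:
            """Return catalog hits, column descriptions, and glossary terms."""
            ...

    class RagAgent(Protocol):
        def find_certified_report(self, question: str, context: dict) -> dict | None:
            """Return the closest certified report, or None if nothing fits."""
            ...

    class SqlAgent(Protocol):
        def build_query(self, question: str, context: dict, seed_sql: str | None) -> str:
            """Write (or extend) a SQL query, optionally seeded with a report's query."""
            ...

    class BiAgent(Protocol):
        def compose_answer(self, question: str, rows: list[dict], context: dict) -> str:
            """Turn raw rows into a business-language answer."""
            ...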

And this is how the flow goes. When a business question comes in, we push it into the orchestrator, which decides how to facilitate the process. The first thing we do is understand the context. That's where the metadata agent comes in: it works with the catalog and with all the documentation we have across the system to understand what we're being asked about and what information is relevant to share. Then we go to the RAG agent, which tries to find an existing report out of a list of certified reports that we know people are allowed to use and that people have spent a lot of time fine-tuning and making as accurate as possible. If we can't find a report, or if it's not exactly what we need, that's where we go to the SQL agent, which tries to create a more exact or more elaborate query. Even if the report we have is not usable as is, it gives us an initial seed of a query that we can expand on rather than building one from scratch. So it's kind of like a few-shot example, but in this case the example we give is very, very close to the actual result we're expecting to get. We then execute it against the database, pull the data, and push it into the BI agent, which translates that into a business answer rather than just dumping data back on the user, and this is what goes into the final answer. Now, there's obviously a loop that says if I'm in the same conversation, I'm probably talking about the same data, so we don't have to do this again and again. And each of these three components, each of these three agents, can be packaged as its own product and delivered to production with a very tangible impact on business metrics. Okay.
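Using the interfaces sketched above, the flow just described could look roughly like this. The fallback logic, field names, and helpers are assumptions based on the description, not the production orchestrator.

    def answer_business_question(question: str, metadata_agent: MetadataAgent,
                                 rag_agent: RagAgent, sql_agent: SqlAgent,
                                 bi_agent: BiAgent, run_sql) -> str:
        # 1. Understand the context of the question.
        context = metadata_agent.resolve_context(question)

        # 2. Prefer an existing certified report over generating anything new.
        report = rag_agent.find_certified_report(question, context)
        if report is not None and report.get("answers_question", False):
            rows = run_sql(report["query"])
        else:
            # 3. Fall back to the SQL agent, seeding it with the closest
            #    report's query when one exists (the "few-shot" idea).
            seed = report["query"] if report else None
            rows = run_sql(sql_agent.build_query(question, context, seed_sql=seed))

        # 4. Turn raw rows into a business answer instead of dumping data.
        return bi_agent.compose_answer(question, rows, context)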

And that's the beauty of this approach: after we productized each of these, we could say stop, or let's move forward. To give some bottom-line numbers around some of them: just the RAG agent that pulls the right report addressed about 20% of the overall capacity of the BI team, who basically told us that all of that work is sharing the right report with the right person. We were able to automate around 80% of those 20%, and we're talking about a team of 10 people, so roughly two people's full-time job was just finding the right report and sending it to the right person. The metadata understanding that we got from learning how to interact with the data through an LLM allowed us to run an A/B test in the semantic layer project, and that let us prove to senior leadership that there is tangible, measurable value in enriching metadata. We did that by running a battery of questions against a database with good metadata and one without, and we showed how much better an LLM performs when it has the right metadata in place: basically proving the value of something that can sound very fluffy, like "hey, let's bring more documentation into the code." Right now we're experimenting with a data pivoting bot: once you have a dashboard or a report, you can change the time horizon, the views, the segmentations and groupings of the data, again in near real time, without a person doing that for the business stakeholder. Some of the next steps are evaluating the GenBI tools that are out there, like Databricks Genie, and going into a much more rigorous process of enriching our catalog with metadata and documentation, which also builds on the learnings from the research we've done. So even if we don't end up writing a full-fledged, end-to-end GenBI agent, we already got a lot of value back from this, and that's really what allowed our senior leadership team to continuously invest in this project quarter over quarter.
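The metadata A/B test mentioned above can be pictured as a simple eval loop: the same battery of questions runs against an enriched catalog and a bare one, and accuracy is compared. The ask_genbi and is_correct helpers are hypothetical stand-ins, not the team's tooling.

    def run_metadata_ab_test(questions, rich_catalog, bare_catalog, ask_genbi, is_correct):
        """questions: list of {"question": str, "expected": str} with known-good answers."""
        results = {}
        for name, catalog in [("rich_metadata", rich_catalog), ("bare_metadata", bare_catalog)]:
            correct = sum(
                1 for q in questions
                if is_correct(ask_genbi(q["question"], catalog=catalog), q["expected"])
            )
            results[name] = correct / len(questions)   # accuracy per catalog variant
        return results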

One thing I want to wrap up with is a couple of thoughts about the future. I think we talk a lot about how to prepare data; that's going to be a huge area in the market, and there are probably going to be a lot of companies and tools to help us with that. Building very specific, task-specific models and applications: I think a lot of startups are going to come out of that area. Copilots are really about making sure that we meet users where they are. And securing models is obviously a very big thing. The last one is the one I want to focus on the most, because it's a recent thought that came to me a couple of weeks ago: how we price SaaS in the gen AI era. This is really about the fact that one individual person today can be 10x more effective than they used to be. So do we price software based on seats, based on how much it's used, or based on the value that people get out of it? Salesforce is already experimenting with that: the Data Cloud product at Salesforce is starting to be usage-priced rather than seat-priced. I think this is going to have a big impact on SaaS economics worldwide, and it doesn't even matter if the product itself is gen AI. It's really about what the person using the product can do, what they can do with the rest of their time, and whether it still makes sense to price by how many employees you have or by how much work you get done with the employees that you have.
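As a back-of-the-envelope illustration of that pricing question (all numbers invented): if one AI-augmented analyst does the work that ten seats used to do, per-seat revenue collapses while usage-based revenue roughly tracks the work delivered.

    # Toy numbers, purely illustrative.
    seat_price = 100        # $ per seat per month
    usage_price = 0.50      # $ per query

    # Before: 10 analysts, each running ~200 queries a month.
    before_seat_revenue = 10 * seat_price            # 1,000
    before_usage_revenue = 10 * 200 * usage_price    # 1,000

    # After: 1 AI-augmented analyst doing the same ~2,000 queries a month.
    after_seat_revenue = 1 * seat_price              # 100, vendor revenue drops 90%
    after_usage_revenue = 2000 * usage_price         # 1,000, tracks the work delivered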

That's it from me. Thank you very much for listening, and thanks for not opening the trap door on me.
