
How we used AI to lift the voices of California state employees

By Summer Mothwood and Barb Cosio Moreno
[Publish date]

Engaged California surrounded by bubbles representing areas of feedback

Engaged California is a new state program managed by the Office of Data and Innovation (ODI). It gives Californians a way to share their thoughts on topics that are important to them. They can connect with other people through respectful discussions and tell their government how they feel.

We launched our second engagement last fall. This time, we asked state employees to tell us their ideas on how to make government more efficient. We expected a lot of unstructured, qualitative input.

What we didn’t expect was how much the analysis process itself would teach us about working with AI. We learned 2 key things: where AI can help analyze large amounts of text data, and where it can steer you wrong if you’re not careful.

Over 10 weeks, 1,469 employees participated, leaving 2,477 comments. They shared what they felt was necessary to make the work of government more efficient. 

This post will explain:

  • How we worked through the analysis
  • Where AI helped
  • Where humans had to stay in the loop to catch errors and build trust in the results

We used AI for 2 different jobs

AI played 2 distinct roles for us: one during the engagement, and another after it closed.

During the engagement, we built a tool powered by a large language model (LLM) that we could prompt to explore the comments. Our team could ask plain-language questions about the live stream of comments, such as:

  • What themes are popping up this week?
  • How often are people bringing up training?

That gave us an early signal of what was in the data. It also helped us build some grounded hypotheses.
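
Conceptually, each question was paired with the comment data and sent to the model. Here is a minimal sketch of that pattern, with a hypothetical `build_query_prompt` helper and made-up sample comments (not our production tool):

```python
def build_query_prompt(question: str, comments: list[str]) -> str:
    """Assemble a plain-language question and the current comments into one
    prompt for an LLM. The actual model call is out of scope here."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(comments))
    return (
        "You are analyzing feedback from California state employees.\n\n"
        f"Comments:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer using only the comments above."
    )

# Made-up example comments.
comments = [
    "We need better training on new software.",
    "Approvals take too long.",
]
prompt = build_query_prompt("How often are people bringing up training?", comments)
```

The useful property of this pattern is that the model's answer is grounded in the comments you hand it, rather than its general knowledge.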

After the engagement closed, the real work started. We synthesized the comments into a final report designed for readers to navigate.

The pivot: from “top solutions” to “themes”

Our initial goal sounded straightforward: identify the top solutions employees proposed.

But employees wrote long, layered responses. They included lots of:

  • Context
  • Feelings
  • Examples
  • Acronyms and internal names for tools and systems

Many people shared more than one idea.

To find the solutions people agreed on, we wrote a custom AI prompt that extracted problem and solution statements from each comment. The goal was then to count which solutions occurred most often and find our winners.
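
As a sketch of what that extraction step can look like: the prompt template and `parse_extraction` helper below are illustrative, not the exact prompt we ran, and the model call itself is omitted.

```python
import json

# Illustrative template; a real extraction prompt would be more detailed.
EXTRACTION_PROMPT = """\
Read the employee comment below. Extract:
- "problems": short problem statements, in the commenter's own terms
- "solutions": short solution statements, one per distinct idea
Return JSON only, shaped like {"problems": [...], "solutions": [...]}.

Comment:
<comment>
"""

def build_prompt(comment: str) -> str:
    """Insert one comment into the extraction template."""
    return EXTRACTION_PROMPT.replace("<comment>", comment)

def parse_extraction(raw: str) -> dict:
    """Parse the model's JSON reply, falling back to empty lists on bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"problems": [], "solutions": []}
    return {
        "problems": list(data.get("problems", [])),
        "solutions": list(data.get("solutions", [])),
    }
```

Defensive parsing matters here: at thousands of comments, some model replies will not be clean JSON, and the pipeline has to degrade gracefully instead of crashing.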

But the data didn’t cooperate. We couldn’t get to clean, matchable items that were easy to count. There was just too much context and nuance within each comment.

So we changed the goal. Instead of ranking specific ideas, we focused on organizing what employees said. We identified 10 themes and 65 subthemes present in the responses. Then we labelled each comment and presented the entire, organized dataset to readers, who could browse, explore, and act on it.
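
Once every comment carries theme labels, organizing the dataset for browsing is straightforward. A hypothetical sketch (the field names like `themes` are ours for illustration, not our actual schema):

```python
from collections import defaultdict

def group_by_theme(labelled_comments: list[dict]) -> dict[str, list[str]]:
    """Organize labelled comments so readers can browse them theme by theme."""
    by_theme: dict[str, list[str]] = defaultdict(list)
    for item in labelled_comments:
        for theme in item["themes"]:
            # A comment with several themes appears under each of them.
            by_theme[theme].append(item["comment"])
    return dict(by_theme)

# Made-up labelled comments.
labelled = [
    {"comment": "More hands-on training, please.", "themes": ["Training"]},
    {"comment": "Fewer paper forms.", "themes": ["Process", "Technology"]},
]
grouped = group_by_theme(labelled)
```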

Where humans stayed in the loop

AI did the heavy lifting at scale, but it didn’t run unattended. We built in human checkpoints on purpose.

First, our research team hand-categorized a random sample of comments. From there we built the initial list of categories.

Then we used AI to apply the categories to the rest of the comments. Data engineers ran quality assurance tests that:

  • Compared AI’s labels to the researchers’
  • Did random sampling checks
  • Analyzed the AI-labelled dataset

Their tests looked for outliers, such as themes that got over-applied or themes that overlapped heavily with each other.
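
As a simplified illustration of the label-comparison step (not our exact QA code), one common way to score how closely the AI's labels match the researchers' labels for the same comment is Jaccard similarity, the overlap between the two label sets:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two label sets: 1.0 = identical, 0.0 = disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def agreement_report(human: dict[str, set[str]], ai: dict[str, set[str]]) -> dict[str, float]:
    """Per-comment agreement between researcher labels and AI labels."""
    return {cid: jaccard(labels, ai.get(cid, set())) for cid, labels in human.items()}

# Made-up labels for two comments.
human = {"c1": {"Training", "Technology"}, "c2": {"Hiring"}}
ai = {"c1": {"Training", "Technology"}, "c2": {"Hiring", "Pay"}}
report = agreement_report(human, ai)
```

Low-agreement comments are exactly the ones worth reading by hand, since they point at either an ambiguous comment or a gap in the prompt.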

We noted when AI labelled data incorrectly and looked for patterns to figure out why. We used this feedback to improve the prompt.

One improvement was making the theme names and definitions more descriptive. Most comments received 4 or fewer subthemes. But sometimes, when the AI couldn’t figure out what to do with a comment, it applied all 65 subthemes to it. While we can’t know exactly why that happened, we noted what those comments had in common with each other and added more detail and context to the LLM prompt.
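
That kind of over-labelling is easy to catch mechanically. A hypothetical guardrail, using the 4-subtheme pattern noted above as a threshold:

```python
MAX_EXPECTED_SUBTHEMES = 4  # most comments received 4 or fewer

def flag_over_labelled(labels: dict[str, list[str]], limit: int = MAX_EXPECTED_SUBTHEMES) -> list[str]:
    """Return IDs of comments whose subtheme count exceeds the expected limit,
    e.g. the failure mode where all 65 subthemes were applied at once."""
    return [cid for cid, subthemes in labels.items() if len(subthemes) > limit]

# Made-up labels: "c2" exhibits the all-65-subthemes failure mode.
labels = {
    "c1": ["Training", "Technology"],
    "c2": [f"subtheme-{i}" for i in range(65)],
}
flagged = flag_over_labelled(labels)
```

Flagged comments can then go back to a human reviewer, or into the pile of examples used to tighten the prompt.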

At the very end, we found there were still some edge cases. We decided to hand-label them.

Again, every stage of this work went through a round of human-led QA. This process helped us learn to use new AI tools better, built trust in the results, and informed how we build a framework for future analyses like this.

Learnings

Using AI on open-ended qualitative data at this scale was new territory for us. A few lessons we’re carrying forward:

  • Prompt scope matters. The messier the input data, the more specific you have to be with the LLM prompts and settings to get back what you want.
  • When labelling text data, start with humans. Having researchers build the taxonomy first gave the AI the context it needed.
  • Be honest about what the data can’t support. The pivot away from rankings was the right call for the data we had. The result is a rich dataset that highlights prominent themes while preserving detail.
  • Put AI analysis in version control. The LLM prompts we used live in our version-controlled codebase. Among other benefits, we know exactly which prompts (and which model versions) we used, which makes the work reusable and auditable in the future.
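
Beyond checking prompt files into the repo, it helps to record model and prompt metadata alongside each run. A hypothetical sketch (the `PromptRecord` structure is illustrative, not our actual codebase):

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class PromptRecord:
    """One versioned prompt run: enough metadata to audit or rerun it later."""
    name: str   # e.g. "theme-labelling"
    text: str   # the full prompt text, as committed
    model: str  # the exact model version string used for the run

    @property
    def digest(self) -> str:
        """Stable fingerprint of the prompt text, for change tracking."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]
```

Hashing the prompt text gives a short fingerprint you can attach to every labelled dataset, so any result can be traced back to the prompt and model that produced it.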

The ideas employees shared are already informing real programs. Examples include the Governor’s Innovation Fellows, results.ca.gov, and case studies happening across the state.

The next conversation

Engaged California is the first-in-the-nation digital democracy effort. On March 30, Governor Newsom announced that the next Engaged California conversation is coming soon. It will be the first statewide effort, open to all Californians. The topic: AI and its impact on the workforce.