
In Environmental Due Diligence, Trust in AI Has to Be Earned

Eric Bollens
February 19, 2026 · 9 min read

Note to Readers: This article is part of an ongoing series by Eric Bollens, Chief Technology Officer at LightBox. It draws from his presentation at the Environmental Bankers Association conference in February 2026, where he shared his perspective on integrating artificial intelligence into environmental due diligence workflows.

At the Environmental Bankers Association conference last week, I had the opportunity to speak on a panel about the growing role of artificial intelligence in environmental due diligence. AI has stopped being something we encounter only in product demos and conference keynotes. It has become ordinary. People use it at home, they encounter it in the products they buy, and they increasingly bring it into the workplace, sometimes long before their organizations have made deliberate decisions about governance or standards. In a comparison of public and AI expert views, Pew Research Center found that AI has become a mainstream topic and that questions of trust, responsible development, and oversight now shape the conversation.

Environmental due diligence is not insulated from this. In late 2025, LightBox surveyed environmental professionals and environmental risk managers to understand how AI is being used, what outcomes are being seen, and what risks are top-of-mind. The picture that emerged matches what many leaders are sensing operationally: adoption is real but uneven; personal use often precedes professional use; and governance, training, and shared norms are lagging behind use.

In a field dependent on defensible conclusions, this matters. Environmental due diligence is an accuracy-driven business. When our work informs transactions, financing, risk transfer, and compliance strategy, speed is only valuable when paired with trust.

The Risk Is Not the Model. It Is the Operator.

When teams talk about AI risk, it is easy to focus on large language models (LLMs) as if they were themselves the hazard. In practice, though, most failures are a matter of uninformed or irresponsible use. Used well, AI is an accelerator and an aide, but used carelessly, it produces junk and undermines trust. That is exactly what the LightBox benchmark respondents worried about: inaccuracy was the leading concern.

One reason these concerns resonate is that hallucination is not a rare edge case. It is a predictable behavior in systems trained to produce answers. OpenAI’s research on why language models hallucinate describes how training and evaluation incentives can reward guessing, which in turn produces plausible statements that are not grounded in truth. This does not mean AI is unusable, but it does mean that professional workflows must be designed to manage this reality.

When I talk about the responsible use of AI, I frequently use an aviation analogy. Modern aircraft rely heavily on automation, but no one boards a plane without a pilot. In air travel, the autopilot assists, but accountability remains with the humans in the cockpit. In our industry, we should pursue much the same. AI can help with execution, but responsibility for the deliverable must remain with the professionals whose names sit behind it.

A Junior Teammate, Not an Expert Witness

A useful mental model is to treat AI as a junior teammate, not an expert witness. Junior teammates can do meaningful work. They grind through boilerplate, create first drafts, extract structured facts, and compare sources. But they can be wrong, and they do not always signal uncertainty in a way that matches the confidence of the prose they produce. If you treat AI output as a draft that must be reviewed, validated, and refined, you get leverage from its speed without surrendering the standards that make your work defensible.

Five Tested Practices for Better AI Outcomes

The practical question is how to operationalize this posture. At LightBox, we have identified five practices that consistently produce better outcomes, and we have incorporated them into how we leverage AI in both our workforce and our products. They are not exotic, but they are disciplined. They also map to what the industry is asking for implicitly when it names inaccuracy and a lack of transparency as core risks.

1. Task-by-task workflows, because models perform best in short iterative cycles.

It is not an accident that model providers introduced LLMs to the world through interactive chat. ChatGPT was launched as a conversational interface designed to take feedback, reveal strengths and weaknesses, and improve through iterative interaction; Anthropic introduced Claude as an AI assistant accessible through a chat interface as well as an API, reflecting the same idea that the model’s value emerges through dialogue and iteration rather than one-shot generation; and Google framed Bard as an early experiment that lets users collaborate with generative AI in a conversational format.

The common point is not branding. It is workflow. LLMs tend to be most reliable when the work is broken into small tasks, each output reviewed and corrected before moving on. This matters because these systems are optimized to produce answers, and the incentives in training and evaluation can lead models to guess rather than admit uncertainty.

In practice, task-by-task means resisting the urge to ask the model for a finished narrative when the underlying facts are still unsettled. In environmental due diligence, that might mean using AI first for bounded, checkable work: extracting entities and dates, listing contradictions between sources, drafting clarifying questions, or normalizing regulatory results into a consistent schema. It might also mean extracting clauses, reconciling tables, identifying missing fields, or drafting the write-up for one discrete concern. Across all these items of work, each individual request is small enough that a reviewer can quickly verify it, and correction is an explicit step rather than an implicit hope.
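
To make this concrete, here is a minimal sketch of what a task-by-task loop can look like. It assumes a generic run_model helper standing in for whatever LLM API a team actually uses; the task list and the interactive review step are illustrative, not a prescription or a description of any particular product.

```python
# Minimal sketch of a task-by-task loop. Each request is small and bounded,
# and an explicit human review step gates progression to the next task.
# `run_model` is a placeholder for whatever LLM call your team uses.

def run_model(task: str, context: str) -> str:
    """Placeholder for an LLM call; swap in your provider's client here."""
    return f"[model draft for task: {task}]"

TASKS = [
    "List every entity and date mentioned in the attached records.",
    "List contradictions between the historical sources, quoting each one.",
    "Draft clarifying questions to send to the site contact.",
]

def run_tasks(context: str) -> list[dict]:
    results = []
    for task in TASKS:
        draft = run_model(task, context)
        # Correction is an explicit step: a reviewer accepts, edits, or
        # rejects each small output before the next task begins.
        print(f"\n--- {task}\n{draft}")
        verdict = input("accept / edit / reject? ")
        results.append({"task": task, "draft": draft, "review": verdict})
    return results
```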

2. Section-by-section drafting, because zero-shot reports fail in predictable ways.

If you have spent any time around due diligence teams experimenting with AI, you have seen a familiar temptation: upload documents full of records and ask an AI system to draft the report. It is the natural extension of what people do manually, and it becomes especially tempting when deadlines compress. This is the aspiration of the zero-shot report: generate a large, polished deliverable in a single pass, assuming that review will catch whatever the model gets wrong. But it won’t, at least not consistently.

The problem is that this workflow fails in ways that should feel familiar. Even when the first pages are accurate, errors tend to appear as scope expands, just as a reviewer’s ability to reliably detect them declines with increasing cognitive load and fatigue. The longer the output, the more likely it becomes that a subtle but material mistake survives because the reviewer has been forced into endurance reading.

There is also a model-side constraint that makes the zero-shot report risky. Long-context comprehension within LLMs is still uneven. Research on long-context behavior has shown that models can perform worse when relevant information is positioned in the middle of large contexts, a phenomenon described as “lost in the middle.”

Put together, this means risk increases in two directions at once: the model may blur or misweight details as context grows, while the reviewer may miss the resulting errors because the review task has become too onerous to perform reliably.

The practical alternative is section-by-section drafting, and in accuracy-driven work, the “section” should be interpreted as the smallest defensible unit of truth in your workflow, not necessarily the traditional headings of a final report. The most reliable unit might be a single concern, a discrete risk hypothesis, a table, a dataset slice, an evidentiary thread, or a decision point. For example, in an ESA, it could be working through a specific potential condition or a specific thread of historical use evidence.

Only after you have worked through all the discrete elements of your workflow should you engage the model to produce a higher-level synthesis, because LLMs are most likely to get it right when summarizing validated truth rather than imposing coherence over uncertainty.
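
As a rough illustration of how that sequencing can be enforced, the sketch below treats each unit as its own record and refuses to synthesize until every unit has been validated by a reviewer. The Section class and the draft_unit placeholder are assumptions made for the example, not part of any specific workflow or product.

```python
# Sketch of section-by-section drafting. Each "section" is the smallest
# defensible unit (a single concern, table, or evidentiary thread), and
# synthesis only runs over sections a reviewer has already validated.

from dataclasses import dataclass, field

@dataclass
class Section:
    name: str                                         # e.g. "historical dry-cleaning use, 1962-1988"
    facts: list[str] = field(default_factory=list)    # validated facts the draft must stay within
    draft: str = ""
    validated: bool = False

def draft_unit(section: Section) -> str:
    """Placeholder for a model call that drafts one bounded section."""
    return f"[draft for '{section.name}' grounded in {len(section.facts)} validated facts]"

def synthesize(sections: list[Section]) -> str:
    """Higher-level synthesis, allowed only once every unit is validated."""
    unresolved = [s.name for s in sections if not s.validated]
    if unresolved:
        raise ValueError(f"cannot synthesize; unvalidated sections: {unresolved}")
    return "\n\n".join(s.draft for s in sections)
```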

3. Structured and tagged data, because compiled documents invite blurring of detail.

Another common failure mode is treating an AI like a magical reader of large, compiled documents such as a stitched PDF or folders of records. In such cases, the model typically produces something coherent, but coherence is not the same thing as fidelity. When the model is forced to infer structure from compilation, details blur, and those are exactly the kinds of errors that are hardest to catch because they so often look reasonable at a glance.

Structured and tagged data is a means to resolve this issue. Instead of asking the model to discover the structure of your evidence, you provide structure up front so inference stays tight. This is the intuition that makes approaches like retrieval-augmented generation (RAG) successful, where a model is paired with an explicit retrieval mechanism so its output can be grounded in specific sources rather than in generalized pattern completion.

In environmental due diligence, structure might mean transforming evidence into discrete labeled records: a timeline with source pointers, a normalized set of regulatory hits, atomic claims tagged with source and date, or a table of entities linked to geospatial and relational contexts. The specific schema changes in other domains, but the effect is consistent. The model does not have to guess what is important or what is related. Instead, it can reason over a set of tagged elements, and reviewers can verify outputs quickly because each claim has a short path back to an explicit input record.
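
As one hedged example of what "structure up front" can look like, the snippet below models atomic claims as tagged records with stable identifiers, source pointers, and record dates. The schema and the sample records are illustrative assumptions; your own fields would follow whatever evidence types your workflow actually handles.

```python
# Sketch of structured, tagged evidence: each claim is an atomic statement
# with a stable identifier, a source pointer, and a record date, so later
# model outputs can cite explicit inputs instead of inferring structure.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Claim:
    claim_id: str       # stable identifier a drafted sentence can cite
    text: str           # one atomic statement
    source: str         # document or database the claim came from
    source_date: date   # date of the underlying record

claims = [
    Claim("C-001", "Underground storage tank removed from the parcel",
          "tank closure report, p. 3", date(1994, 6, 2)),
    Claim("C-002", "Site appears in the state LUST listing",
          "regulatory database export", date(2021, 11, 15)),
]
```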

Structuring and tagging can feel like overhead at first, but it almost always pays back by reducing rework, tightening review, and preventing the late-stage surprises that turn a fast first draft into a slow, expensive cleanup.

4. Traceability and lineage, because defensibility requires the ability to backtrack.

In environmental due diligence, the ability to explain how you reached a conclusion is not a luxury. It is the entire point. And if AI is contributing to drafting or analysis, traceability becomes even more important because fluent prose can otherwise become an authority detached from the underlying evidence.

One of the core motivations behind retrieval-grounded approaches like RAG is provenance: the ability to connect generated outputs back to explicit sources rather than relying on the model’s internal parameters as an opaque store of knowledge. The broader AI governance ecosystem is moving in this direction too. NIST’s AI Risk Management Framework treats traceability and lineage as important practices for making AI systems more trustworthy and governable in real deployments.

Operationally, traceability and lineage mean AI-assisted workflows where citations or source pointers are the default, not an optional add-on. If the model drafts a sentence about historical use, it should carry a reference to the underlying record; if it summarizes a regulatory hit, it should include the record identifier; and if it proposes a conclusion, it should enumerate the supporting facts and point to each fact’s origin.

This changes review in a meaningful way. Review cannot be effective if it is limited to whether something sounds right. It must allow the reviewer to verify that the underlying sources support the claim, and it needs to be tightly integrated into the workflow to avoid context switching.
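
One lightweight way to make that kind of verification mechanical rather than aspirational is to check citations against the evidence set before a drafted statement enters the deliverable. The record identifiers below are hypothetical and assume the kind of tagged evidence records described in the previous practice.

```python
# Sketch of traceability as a gate: an AI-drafted statement is accepted only
# if every citation it carries resolves to a known evidence record.

KNOWN_RECORDS = {"C-001", "C-002", "REG-117"}  # ids from the structured evidence set

def unresolved_citations(citations: list[str]) -> list[str]:
    """Return citations that do not resolve, or a flag if none were provided."""
    if not citations:
        return ["<no citation provided>"]
    return [c for c in citations if c not in KNOWN_RECORDS]

# A reviewer (or a pipeline step) rejects any drafted sentence whose
# citations cannot be traced back to the evidence it was supposed to use.
print(unresolved_citations(["C-001", "C-999"]))  # ['C-999']
print(unresolved_citations([]))                  # ['<no citation provided>']
```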

5. Deterministic validation, because probabilistic systems should be checked outside the model.

Even when you use iterative tasks, structured inputs, and traceable outputs, it is still important to recognize what an LLM actually is. At its core, it is a probabilistic generator. That is why, wherever possible, the final validation step should be deterministic. You do not want your verification process to rely on another round of probabilistic output.

In practice, deterministic validation means tie-outs, rules, reconciliations, or external checks. If an AI system classifies records into categories, validate the mapping rules against a controlled taxonomy; if it extracts fields, run completeness and consistency checks; if it aggregates numbers, tie totals; and if it proposes logical implications, validate premises against the cited records. This hybrid instinct is also reflected in research directions that combine learning systems with symbolic elements to support verification and validation, rather than trusting generation alone.
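
The sketch below shows two deterministic checks of that kind: validating AI-assigned categories against a controlled taxonomy, and tying component values to a reported total. The taxonomy values and the tolerance are assumptions chosen for illustration.

```python
# Sketch of deterministic validation outside the model: a controlled taxonomy
# check for classifications and a simple numeric tie-out for aggregations.

CONTROLLED_TAXONOMY = {"LUST", "UST", "VCP", "NPL", "RCRA"}

def invalid_classifications(records: list[dict]) -> list[dict]:
    """Flag any AI-assigned category that falls outside the controlled taxonomy."""
    return [r for r in records if r.get("category") not in CONTROLLED_TAXONOMY]

def totals_tie(components: list[float], reported_total: float, tol: float = 0.01) -> bool:
    """Deterministically confirm that component values tie to the reported total."""
    return abs(sum(components) - reported_total) <= tol

print(invalid_classifications([{"id": "R-1", "category": "LUST"},
                               {"id": "R-2", "category": "Leaky Tank"}]))  # flags R-2
print(totals_tie([10.0, 5.5], 15.5))  # True
```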

The practical takeaway is straightforward: let probabilistic generation accelerate work but insist on deterministic checks for anything that would be costly to get wrong.

Governance and Responsible Use Are What Make Adoption Durable

One of the most striking themes in the benchmark survey is the gap between usage and oversight. Many respondents reported adoption, while many also reported limited training and limited policy guidance.

That is not a stable equilibrium. Without governance and responsible practices, the industry will eventually experience visible failures, and the response will be restriction through client prohibitions, internal bans, or external standards tightening.

NIST’s AI RMF ecosystem is useful here not as a prescriptive checklist for environmental work, but as a reminder of what mature adoption tends to require: structures, transparency, evaluation, and validation practices that make AI usage reviewable and accountable. In accuracy-driven industries, these elements are not bureaucracy. They are how you preserve trust while you scale new tools.

AI can make environmental due diligence faster, and in some cases, it may even make it more effective, but the mistake would be to treat acceleration as free. In an accuracy-driven business, acceleration has to be earned through process. Task-by-task workflows, small-scope drafting, structured inputs, traceable outputs, and deterministic tie-outs are not academic preferences. They are the practical conditions under which AI becomes a durable advantage rather than a recurring risk event.

The tools are powerful, but the responsibility remains ours.
