When we built the data ingestion pipeline for Tenantvein, we ran a test that shaped how we think about AI underwriting: we fed the same property's rent roll to our extraction engine in three different formats — a structured Excel export from Yardi, a scanned PDF of a printed report, and a manually assembled spreadsheet. The output should have been identical. It wasn't. The scanned PDF produced tenant name variants, missing CAM fields, and two misread suite numbers. The manual spreadsheet had three tenants with inconsistent date formats and one lease end date that was clearly a data entry error — 2039 instead of 2029. The Yardi export was clean.
That test taught us the first rule of CRE data quality: the source system determines the data quality floor. Everything downstream — normalization, AI extraction, underwriting model — inherits whatever errors exist at the input stage. Garbage in, garbage out isn't a slogan. It's a pipeline reality.
Where CRE Data Quality Problems Actually Come From
The CRE industry runs on data that was never designed to be machine-readable. Rent rolls are produced for human review — property managers format them to be printable and scannable by eye, not parseable by software. Lease abstracts are written in legal language optimized for enforceability, not structured data extraction. Operating statements are formatted for accountant review, not for NOI normalization algorithms.
The result is a sector where the fundamental unit of underwriting data — the rent roll — arrives in hundreds of different formats, with different column structures, different naming conventions for the same fields, and different treatments for standard lease elements. In our data processing work, we've seen rent rolls with the lease commencement date labeled as "Start," "Commencement," "Lease Start," "Begin Date," and "From." All mean the same thing. All require different parsing logic if you're trying to normalize them automatically.
This isn't a technology failure — it's a legacy of how CRE transactions were conducted before data interoperability became an expectation. The problem is that AI underwriting tools are now expected to operate on that legacy data without a cleanup step, and most of them struggle with it silently — producing outputs that look correct but are built on misread source data.
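The header variants listed above can be handled with an alias table that maps each raw label to one canonical field name. The following is a minimal sketch, not our production parser: `HEADER_ALIASES`, `normalize_header`, `map_columns`, and the canonical name `lease_start` are hypothetical names chosen for illustration.

```python
from __future__ import annotations

import re

# Hypothetical alias table: the label variants mentioned above, mapped to one
# canonical field name. A real table would cover every field, not just this one.
HEADER_ALIASES = {
    "start": "lease_start",
    "commencement": "lease_start",
    "lease start": "lease_start",
    "begin date": "lease_start",
    "from": "lease_start",
}


def normalize_header(raw: str) -> str | None:
    """Return the canonical field name for a raw column header, or None if unknown."""
    # Lowercase, turn punctuation and underscores into spaces, trim the ends.
    key = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
    return HEADER_ALIASES.get(key)


def map_columns(headers: list[str]) -> tuple[dict[str, str], list[str]]:
    """Split headers into a raw-to-canonical mapping and a list of unrecognized headers."""
    mapped: dict[str, str] = {}
    unknown: list[str] = []
    for header in headers:
        canonical = normalize_header(header)
        if canonical is None:
            unknown.append(header)  # surfaced for review, never guessed
        else:
            mapped[header] = canonical
    return mapped, unknown
```

Returning unrecognized headers separately, rather than dropping them, keeps a new label variant visible instead of letting a column disappear silently.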
The Three Categories of CRE Data Quality Failure
In building our pipeline, we categorized data quality failures into three types with different detection and remediation profiles:
Structural errors. These are formatting inconsistencies: date formats that don't parse, numeric fields containing text values ("NNN" in a rent field), merged cells in Excel that confuse column alignment. Structural errors are detectable with validation rules before any AI analysis runs. They should never reach the underwriting model. We see structural errors in roughly 30% of the rent rolls we ingest: not in every row, but often enough that every file needs systematic scanning rather than an assumption of clean input.
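A structural check of this kind might look like the sketch below. The accepted date layouts in `DATE_FORMATS` and the field names (`lease_start`, `lease_end`, `base_rent`) are assumptions for illustration; a real pipeline would carry a longer list of both.

```python
from __future__ import annotations

import math
from datetime import date, datetime

# Assumed set of accepted date layouts; extend as new source formats appear.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d")


def parse_date(text: str) -> date | None:
    """Try each known layout in turn; return None when nothing parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text: str) -> float | None:
    """Parse a currency-like string; return None for text such as 'NNN'."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    # float() accepts 'nan' and 'inf'; neither is a usable rent figure.
    return value if math.isfinite(value) else None


def structural_errors(row: dict[str, str]) -> list[str]:
    """Return one message per field that fails to parse; empty list means clean."""
    errors = []
    for field in ("lease_start", "lease_end"):
        if parse_date(row.get(field, "")) is None:
            errors.append(f"{field}: unparseable date {row.get(field)!r}")
    if parse_amount(row.get("base_rent", "")) is None:
        errors.append(f"base_rent: non-numeric value {row.get('base_rent')!r}")
    return errors
```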
Semantic errors. These are values that are technically valid but contextually wrong: a lease end date in the past (an expired lease still shown as active), a rent per SF that is clearly a typo (0.45 instead of 45.00), a suite number that doesn't correspond to any suite on the building's floor plan. Semantic errors are harder to detect because they pass format validation but fail business logic checks. Catching them requires domain knowledge: knowing that a $0.45/SF rent on a 10,000 SF office suite is implausible, even though 0.45 is a valid decimal value.
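The three semantic examples above translate into simple business-logic checks. This is a sketch under stated assumptions: `semantic_warnings` is a hypothetical helper, the bounds in `OFFICE_RENT_PSF_RANGE` are illustrative only, and the inputs are assumed to have already passed format validation.

```python
from __future__ import annotations

from datetime import date

# Illustrative bounds only; plausible ranges depend on asset class and submarket.
OFFICE_RENT_PSF_RANGE = (5.0, 200.0)


def semantic_warnings(
    rent_psf: float,
    lease_end: date,
    as_of: date,
    suite: str,
    known_suites: set[str],
) -> list[str]:
    """Flag values that parse correctly but fail a plausibility check."""
    warnings = []
    low, high = OFFICE_RENT_PSF_RANGE
    if not low <= rent_psf <= high:
        warnings.append(f"rent_psf {rent_psf} outside plausible range {low}-{high}")
    if lease_end < as_of:
        warnings.append(f"lease_end {lease_end} is before as-of date {as_of}")
    if suite not in known_suites:
        warnings.append(f"suite {suite!r} not found on floor plan")
    return warnings
```

These are warnings rather than hard failures by design: an expired lease may be a real holdover tenant, so the check routes the row to a person instead of rejecting it.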
Completeness errors. Missing data is in some ways the hardest category to handle because absence is ambiguous. A blank CAM field might mean the tenant has no separate CAM obligation (as under a gross or full-service lease, where operating costs are built into base rent), or it might mean the field wasn't filled in during data entry, or it might mean the lease is still being negotiated. An AI system that silently applies one interpretation to every blank will be wrong in a different way depending on which interpretation it picked.
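One way to avoid collapsing those cases is to record why a value is missing alongside the value itself. The sketch below is an assumption about how that could be modeled, not a description of any particular system: `MissingReason`, `CamField`, `interpret_cam`, and the `"Gross"` and `"negotiating"` labels are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class MissingReason(Enum):
    NOT_APPLICABLE = "not_applicable"  # lease structure implies no separate charge
    PENDING = "pending"                # lease terms are still being negotiated
    UNKNOWN = "unknown"                # blank with no explanation, possibly skipped at entry


@dataclass(frozen=True)
class CamField:
    amount: float | None
    missing_reason: MissingReason | None = None


def interpret_cam(raw: str, lease_type: str | None, lease_status: str | None) -> CamField:
    """Keep a blank CAM value distinct from a zero, and record why it is blank."""
    text = raw.strip()
    if text:
        # Assumes the value already passed format validation upstream.
        return CamField(amount=float(text.replace("$", "").replace(",", "")))
    if lease_status == "negotiating":
        return CamField(amount=None, missing_reason=MissingReason.PENDING)
    if lease_type == "Gross":
        return CamField(amount=None, missing_reason=MissingReason.NOT_APPLICABLE)
    # No supporting evidence either way: leave it for human review.
    return CamField(amount=None, missing_reason=MissingReason.UNKNOWN)
```

The point of the `UNKNOWN` case is that it never becomes a zero by default; a downstream model has to handle it explicitly.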
Why Garbage In Produces Wrong Outputs Silently
The specific problem with AI underwriting tools isn't that data quality errors cause system crashes; they don't. The problem is that they produce plausible-looking wrong answers. An AI system that misreads a lease commencement date will calculate the wrong annualized income for a new tenant, and if the misread date is earlier than the true one, it will report a higher NOI figure than the asset will actually deliver. That figure will pass a sanity check because it's in the right range. It will populate the DCF model and produce an IRR. Nothing in the model will flag the error.
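To put numbers on the misread commencement date, the sketch below prorates a tenant's first-year rent by days. The scenario is hypothetical: a 10,000 SF suite at $45.00/SF ($450,000 per year) whose commencement date of July 1, 2025, written as 07/01/2025, is read day-first as January 7, 2025. `first_year_rent` is an illustrative helper that ignores free rent, escalations, and recoveries.

```python
from datetime import date


def first_year_rent(annual_rent: float, commencement: date, year: int) -> float:
    """Rent collected in `year`, prorated by days from commencement to year end."""
    year_start, year_end = date(year, 1, 1), date(year, 12, 31)
    if commencement > year_end:
        return 0.0
    start = max(commencement, year_start)
    days_in_year = (year_end - year_start).days + 1
    days_paying = (year_end - start).days + 1  # inclusive of the start day
    return annual_rent * days_paying / days_in_year


# Hypothetical scenario: same lease, two readings of the same date string.
true_income = first_year_rent(450_000, date(2025, 7, 1), 2025)
misread_income = first_year_rent(450_000, date(2025, 1, 7), 2025)
overstatement = misread_income - true_income
```

In this scenario the misread date overstates 2025 income by roughly $216,000, yet both figures are plausible for a suite of that size, so a range check on the total would not catch it.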
The most dangerous data quality errors are the ones that produce results within normal range. Out-of-range errors get caught. In-range errors get underwritten.
This is why data quality validation needs to happen before the AI analysis layer, not after. Validating the output of a model built on flawed inputs doesn't help you. By that point, the error has propagated through every downstream calculation. You need to catch it at the source.
Building a Data Quality Gate for CRE Underwriting
A practical data quality gate for rent roll ingestion covers these checks in sequence:
- Field completeness scan. Every required field (tenant name, suite, lease start, lease end, base rent, rent escalation schedule) is present and non-null. Missing fields are flagged before processing continues.
- Format validation. Date fields parse as valid dates. Numeric fields contain numeric values. Fields with controlled vocabularies (lease type: NNN / MG / Gross) match the expected value set.
- Business logic checks. Lease end date is after lease start date. Rent per SF is within a plausible range for the asset class and market (e.g., $5-$200/SF for office depending on submarket). On multifamily, total rent is consistent with unit count multiplied by average rent per unit.
- Cross-field consistency. Tenant names are consistent with the master tenant list if one exists. Suite numbers don't duplicate unless intentional. Lease term implied by start/end dates is consistent with stated term months field if present.
None of these checks are individually sophisticated. Together, they catch the majority of structural and semantic errors before they reach the AI analysis layer. The investment is building the checks once and applying them consistently to every ingest — which is exactly the type of mechanical rigor that's hard to maintain in a manual process but straightforward to enforce systematically.
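The cross-field checks in the list above are the ones most easily skipped by hand, so here is a minimal sketch of them. It assumes rows have already been parsed into dates and integers; the field names (`suite`, `lease_start`, `lease_end`, `term_months`) and the one-month tolerance are assumptions for illustration.

```python
from __future__ import annotations

from collections import Counter
from datetime import date, timedelta


def implied_term_months(start: date, end: date) -> int:
    """Whole months in the lease term, counting the end date as the last day of the term."""
    day_after = end + timedelta(days=1)
    months = (day_after.year - start.year) * 12 + (day_after.month - start.month)
    if day_after.day < start.day:
        months -= 1  # final month is incomplete
    return months


def cross_field_issues(rows: list[dict], tolerance_months: int = 1) -> list[str]:
    """Check duplicate suites, date ordering, and stated versus implied term."""
    issues = []
    for suite, count in Counter(r["suite"] for r in rows).items():
        if count > 1:
            issues.append(f"suite {suite} appears {count} times")
    for r in rows:
        start, end = r["lease_start"], r["lease_end"]
        if end <= start:
            issues.append(f"suite {r['suite']}: lease end {end} is not after start {start}")
            continue
        stated = r.get("term_months")
        if stated is None:
            continue  # nothing to compare against
        implied = implied_term_months(start, end)
        if abs(implied - stated) > tolerance_months:
            issues.append(f"suite {r['suite']}: stated term {stated} mo, dates imply {implied} mo")
    return issues
```

A check like the last one is what would catch a 2039-for-2029 lease end date when the rent roll also carries a stated term: the dates imply a term ten years longer than the one written down.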
What AI Can and Can't Fix
AI extraction can recover legible data from unstructured formats — scanned PDFs, inconsistent column headers, variant date formats. That's genuine value. What AI extraction cannot do is invent data that isn't there, verify that a value is correct rather than merely present, or apply contextual business judgment to ambiguous fields.
A well-designed AI underwriting system is honest about these limits. It surfaces confidence scores on extracted fields, flags fields where extraction confidence is below threshold, and presents ambiguous cases for human review rather than silently applying a default value. The combination of automated extraction with transparent uncertainty quantification is more useful than either confident-but-wrong automation or pure manual review.
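The routing described above can be sketched as a simple threshold split. Everything here is hypothetical: `ExtractedField`, `route_fields`, and the 0.85 cutoff are illustrative choices, and the confidence value is assumed to be whatever score the extraction step reports on a 0.0 to 1.0 scale.

```python
from __future__ import annotations

from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # assumed cutoff; tune against reviewed samples


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str | None
    confidence: float  # 0.0 to 1.0, as reported by the extraction step


def route_fields(
    fields: list[ExtractedField], threshold: float = REVIEW_THRESHOLD
) -> tuple[dict[str, str], list[ExtractedField]]:
    """Accept confident values; queue missing or low-confidence ones for human review."""
    accepted: dict[str, str] = {}
    needs_review: list[ExtractedField] = []
    for field in fields:
        if field.value is None or field.confidence < threshold:
            needs_review.append(field)  # no default value is substituted
        else:
            accepted[field.name] = field.value
    return accepted, needs_review
```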
The CRE data quality problem is real, it's persistent, and it won't be solved by any single technology. It can, however, be managed by a data pipeline that treats input validation as a first-class concern rather than an afterthought. The underwriting model is only as reliable as the data it runs on. That's been true in spreadsheets. It remains true in AI.