Your Customer Just Asked About Your Training Data. Here's Exactly How to Answer the Data Governance Section.
The questionnaire arrived at 9 AM. You've answered the easy parts — company name, product description, which regulation applies to you.
Then you hit the data governance section.
"Describe the data used to train your AI system, including data sources, data quality measures applied, and steps taken to identify and address bias in the training dataset."
Your AI model was trained on a dataset your ML team assembled two years ago. You're not entirely sure what's in it. You don't know how to phrase "data quality measures" for a legal audience. And you've never written this down for a customer before.
Here's how to answer this question — and the five others like it that almost always follow.
Why This Section Appears in Every Questionnaire
For high-risk AI systems under the EU AI Act, Article 10 imposes specific requirements on training, validation, and test data. It requires that:
- Training data is subject to data governance practices
- Data is relevant, representative, and free from errors "to the best extent possible"
- Data is examined for potential biases that could lead to discriminatory outcomes
- Data covers the specific geographic, contextual, or behavioral settings where the system operates
HR tech products — resume screeners, candidate ranking tools, interview scoring systems — almost universally fall under Annex III, point 4(a) as high-risk systems. Article 10 applies in full.
When your enterprise buyer asks about training data, they are not conducting academic research. They need answers they can include in their own compliance documentation, which they will show to their DPO, their legal team, and potentially their regulator. Your answer has to be usable, not vague.
The Six Questions You'll Typically Get
Most data governance sections follow a similar structure. Here are the six most common questions and how to answer each.
Question 1: What data did you use to train your AI?
Describe the source type, not the raw dataset. Enterprise buyers understand that training data is often proprietary. They want to know:
- What category of data (anonymized application records, public job board data, recruiter-labeled outcomes, synthetic data)
- Approximate scale ("over 500,000 anonymized hiring event records")
- Whether data was licensed, collected under consent, or drawn from public repositories
Example answer:
"[Product] was trained on a proprietary dataset of [X]+ anonymized hiring-event records, including application data, recruiter decisions, and candidate outcome labels. No personally identifiable information was retained in the training dataset. Data was sourced from customers who participated in our model development program under applicable data processing agreements."
Question 2: What steps did you take to ensure data quality?
Article 10(3) requires training data to be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete." Map your answer directly to this language.
Example answer:
"Prior to training, the dataset was reviewed for completeness, duplicate records, and systematic anomalies. Records originating from markets or roles with fewer than [X] observations were excluded to prevent low-sample overfitting. A held-out test set representing [Y]% of the full dataset was reserved to evaluate model performance before deployment. Quality review is repeated on each model update."
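The steps in that answer map to a concrete preprocessing routine. Here is a minimal sketch of what they might look like in code; the function, field names (`market`), and thresholds are illustrative assumptions, not a description of any specific product's pipeline:

```python
import random
from collections import Counter

def prepare_training_data(records, min_market_count=3, test_fraction=0.2, seed=42):
    """Deduplicate, drop low-sample markets, and reserve a held-out test set.

    records: list of dicts, each with at least a "market" field (illustrative).
    Returns (train, test) lists.
    """
    # 1. Remove exact duplicate records.
    seen, deduped = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(r)

    # 2. Exclude records from markets with too few observations,
    #    to prevent low-sample overfitting.
    counts = Counter(r["market"] for r in deduped)
    filtered = [r for r in deduped if counts[r["market"]] >= min_market_count]

    # 3. Reserve a held-out test set before any training happens.
    rng = random.Random(seed)
    rng.shuffle(filtered)
    split = int(len(filtered) * (1 - test_fraction))
    return filtered[:split], filtered[split:]
```

The point of writing it this way is auditability: each numbered step corresponds to a sentence in the questionnaire answer, so the answer and the pipeline can be checked against each other on every model update.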
Question 3: How did you test for bias?
This is the question that causes the most hesitation. Founders often worry that acknowledging bias in their training data will sink the deal. It won't. The absence of a testing process is what sinks deals.
If you ran bias testing, describe it specifically. If you didn't run formal testing, describe what you did do — and what your roadmap looks like.
Example answer:
"The training dataset was evaluated for statistical representation across protected characteristics including age, gender, and national origin [specify what applies]. We applied [re-sampling / re-weighting / fairness-aware training — describe what you did] to reduce differential performance across demographic groups. Bias evaluation is conducted on a [quarterly / per-release] basis using [equalized odds / disparate impact ratio / specify metric]. Results are documented in our internal model card."
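If you cite a metric like the disparate impact ratio, be prepared to explain what it measures. A minimal sketch of the computation, assuming decisions are recorded as (group, selected) pairs; real evaluations would use dedicated fairness tooling and proper statistical treatment:

```python
def disparate_impact_ratio(decisions):
    """Selection rate of each group divided by the highest group's rate.

    decisions: iterable of (group, selected) pairs, selected is a bool.
    Under the common "four-fifths rule", a ratio below 0.8 for any
    group is flagged for review.
    """
    totals, selected = {}, {}
    for group, sel in decisions:
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + (1 if sel else 0)
    rates = {g: selected[g] / totals[g] for g in totals}
    top = max(rates.values())
    return {g: rates[g] / top for g in rates}
```

A ratio of 1.0 means a group is selected at the same rate as the best-performing group; documenting the threshold you flag at, and what happens when it trips, is exactly the "bias evaluation methodology" buyers are asking about.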
Question 4: Was the training data representative of our use case?
This question asks whether your model will generalize to their specific context — their industry, their geography, their role types.
Answer it specifically. Vague answers here generate follow-up questions that slow down the deal.
Example answer:
"[Product] was trained on data drawn primarily from [specify: industry vertical, company size range, geographies represented]. Customers deploying in significantly different contexts — for example, roles requiring highly domain-specific credentials not well-represented in the training data — are advised to evaluate model performance in our configuration interface during onboarding. Our implementation team can support this assessment."
Question 5: How long do you retain training data?
Example answer:
"Training data used in model development is stored in a secured, access-controlled environment and retained for [X years] in accordance with our data retention policy. Model artifacts do not contain retrievable training records. Customer data processed through [product] is not used to retrain the shared model without explicit opt-in from the customer."
Question 6: Can you provide documentation of your data governance practices?
This is often the final question in the data section, and the one founders dread most — because it assumes formal documentation exists.
If you have a model card, data card, or internal data governance policy, reference it and offer to share a summary. If you don't have formal documentation yet, describe the practices you follow and note that formal documentation is in progress.
Example answer:
"[Product] maintains internal data governance documentation covering dataset composition, quality assurance processes, and bias evaluation methodology. A summary data card is available to enterprise customers upon request under NDA. We are in the process of formalizing this into a full Annex IV technical documentation package, targeted for [date]."
The Part Founders Skip — And Shouldn't
Buyers don't just read your answers. They compare them.
If your answer to "describe your training data" in one questionnaire is three paragraphs long, and in the next questionnaire it is two sentences that mention different features, the discrepancy is noticed. Legal teams flag it. It becomes a negotiation issue.
Your Article 10 answers should be identical across every questionnaire you submit. Not similar — identical. That requires writing down your canonical answers once, documenting them somewhere consistent, and copying from them every time.
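One lightweight way to enforce that consistency is a single source of truth with named placeholders, filled in per questionnaire. A minimal sketch; the answer keys and placeholder names here are illustrative, not a prescribed schema:

```python
# Canonical Article 10 answers live in one place; only the
# bracketed facts vary per questionnaire.
CANONICAL_ANSWERS = {
    "training_data_sources": (
        "{product} was trained on a proprietary dataset of {record_count}+ "
        "anonymized hiring-event records. No personally identifiable "
        "information was retained in the training dataset."
    ),
}

def render_answer(key, **facts):
    """Fill a canonical answer with this questionnaire's specifics.

    Raises KeyError if a required placeholder is missing, which is
    preferable to silently submitting an incomplete answer.
    """
    return CANONICAL_ANSWERS[key].format(**facts)
```

Whether the store is a Python dict, a YAML file, or a shared document matters less than the discipline: one canonical text per question, edited in one place, copied everywhere.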
Most HR tech founders are re-answering the same training data questions from scratch, on deadline, in slightly different ways every time. The inconsistencies accumulate. Trust erodes slowly, then all at once.
What If You Have Gaps?
Article 10 compliance is not binary. The regulation uses language like "to the best extent possible" deliberately. Buyers understand that no model is perfectly bias-free and no dataset is perfectly representative.
What they do not accept is silence. If your bias testing was informal, describe it. If your training data was from a narrow geographic region, acknowledge it and explain what it means for deployment. If your documentation is incomplete, say so and give a timeline.
The buyers asking these questions are building compliance files they will stand behind when their regulator reviews their AI vendor list. They need enough detail to make a defensible record. Give them that detail.
August 2, 2026 is the deadline for high-risk AI obligations. It is less than four months away. The questionnaires are arriving now.
Try Complizo free — paste your first questionnaire and get your Article 10 answers drafted in minutes.