"include_confidence_score": true to your extract structured request.
confidence_score object contains the confidence level for each field.
It includes the level and score values.
The score is a value between 0 and 1, and the level is one of the following:
LOWMEDIUMHIGH
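For illustration, a minimal request body and the shape of the returned object might look like the following Python sketch; the file ID and the `invoice_total` field are hypothetical placeholders, not values from this page.

```python
# A minimal sketch of an extract structured request body that asks for
# confidence scores. The file ID and the "invoice_total" field are
# hypothetical placeholders.
request_body = {
    "items": [{"id": "1234567890", "type": "file"}],
    "fields": [{"key": "invoice_total", "type": "float"}],
    "include_confidence_score": True,
}

# Assumed shape of the confidence_score object returned for a field:
confidence_score = {
    "level": "HIGH",  # one of LOW, MEDIUM, HIGH
    "score": 0.94,    # a value between 0 and 1
}
```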
Confidence level evaluation
Box provides the following suggested thresholds:

| Score range | Confidence level | Recommended action |
|---|---|---|
| >= 0.90 | High | Accept the response |
| 0.70 - 0.89 | Medium | Verify the response |
| < 0.70 | Low | Review the response |
The suggested thresholds are not strict, and they may evolve over time as more testing data becomes available.
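As a sketch, the thresholds above can be encoded directly; the function name is arbitrary and the cutoffs are the suggested values from the table, which you can tune for your own documents.

```python
def recommended_action(score: float) -> str:
    """Map a confidence score to the suggested action from the table above."""
    if score >= 0.90:
        return "accept"  # High confidence
    if score >= 0.70:
        return "verify"  # Medium confidence
    return "review"      # Low confidence
```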
Limitations
The confidence level is not a guarantee of the accuracy of the extracted field value, and it has some limitations.

Confidence score is not a guarantee

A high score indicates that the extracted field value is likely to be correct, but errors can still happen. Even a very high score doesn't guarantee that the extraction is correct. Validate critical data regardless of the confidence score.

Context is important
The confidence score is based on the model's understanding of the data, but it doesn't consider business-specific details that a human reviewer can identify. For example, suppose you create a field called `company_name` for an invoice extraction. The model can struggle to determine whether you mean the vendor or the customer, and it may lower the confidence score accordingly.
It's crucial to provide clear, specific field descriptions. The more context you provide, the better the model can assess the extracted data.
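For example, a field definition with a clear description might look like the following sketch; the `description` wording is illustrative, so check the extract structured request schema for the exact field options.

```python
# Vague: the model may not know whether "company" means vendor or customer.
vague_field = {"key": "company_name", "type": "string"}

# Specific: the extra context helps the model assess the extraction
# and produce a more meaningful confidence score.
specific_field = {
    "key": "company_name",
    "type": "string",
    "description": "The name of the vendor issuing the invoice, "
                   "usually printed in the header. Not the customer.",
}
```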
Implementing human review workflows
Human-in-the-loop workflows are necessary to ensure the accuracy of the extracted data. Instead of trusting all extractions or reviewing all scores manually, you can programmatically route low-confidence fields for human verification. Such workflows require custom implementation; a sketch follows the lists below. You need to:

- Parse the `confidence_score` object from the response.
- Compare each field's score against the thresholds.
- Route low-confidence extractions to a review queue.
- Implement a mechanism for humans to correct and confirm values.
You can also use confidence scores to:

- Prioritize review queues (handle low-confidence extractions first).
- Filter data sets (exclude extractions below a certain threshold from automated processing).
- Create conditional workflows (automatically approve high-confidence extractions while flagging low-confidence ones for manual review).
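A minimal routing sketch, assuming the response can be parsed into a mapping from field keys to a value plus its `confidence_score` object; adapt the parsing and the bucket names to your own workflow.

```python
ACCEPT_THRESHOLD = 0.90  # suggested cutoffs from the table above
VERIFY_THRESHOLD = 0.70

def route_fields(extraction: dict) -> dict:
    """Split extracted fields into accept, verify, and review buckets.

    `extraction` is assumed to map field keys to
    {"value": ..., "confidence_score": {"level": ..., "score": ...}};
    adjust this to the actual response shape you receive.
    """
    approved, verify, review = {}, {}, {}
    for key, field in extraction.items():
        score = field["confidence_score"]["score"]
        if score >= ACCEPT_THRESHOLD:
            approved[key] = field["value"]  # high confidence: accept
        elif score >= VERIFY_THRESHOLD:
            verify[key] = field             # medium confidence: verify
        else:
            review[key] = field             # low confidence: full review
    return {"approved": approved, "verify": verify, "review": review}
```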
Best practices
- Provide clear field descriptions.
- Be specific about what you’re asking for and include context about where the data typically appears.
- Test and iterate. Monitor confidence patterns across your specific document types and use cases.
- Track how often high-confidence extractions are actually correct, and adjust your thresholds based on the accuracy of the data from your workflows; a small sketch of this follows the list.
- Use scores to prioritize, not replace, human judgment.
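One way to do that tracking, sketched with hypothetical review records produced by your human-review step (this record shape is not a Box API structure):

```python
from collections import defaultdict

def accuracy_by_level(records: list[dict]) -> dict[str, float]:
    """Compute how often each confidence level was actually correct.

    Each record is assumed to look like {"level": "HIGH", "correct": True},
    collected from your own review workflow.
    """
    hits: defaultdict[str, int] = defaultdict(int)
    seen: defaultdict[str, int] = defaultdict(int)
    for record in records:
        seen[record["level"]] += 1
        hits[record["level"]] += int(record["correct"])
    return {level: hits[level] / seen[level] for level in seen}
```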
Model support
Confidence estimation currently works with the following Google Gemini models:

- `gemini-2.5-flash`
- `gemini-2.5-pro`

To verify which model processed your request, check `ai_agent_info.models` in the response.
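A quick check, assuming `response` is the parsed JSON of an extract structured call; the model identifier string shown is illustrative and may differ in practice.

```python
# Hypothetical parsed response fragment; the model identifier string
# shown here is illustrative, not a confirmed value.
response = {"ai_agent_info": {"models": [{"name": "google__gemini_2_5_flash"}]}}

models = [m.get("name") for m in response.get("ai_agent_info", {}).get("models", [])]
print(models)
```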
