DocExtract

FAQ

Quick decisions and concrete examples for choosing the right workflow, setting up the first run, and verifying results.

Which workflow should I use?

Use Extraction when you want named fields back in a table from one or many PDFs, such as findings, dates, metrics, or conclusions.

Use Gap analysis when you want differences between two whole documents, such as future-vs-current requirements or version-to-version changes.

How should I set up my first extraction?

Keep the first run small. Start with 2-3 fields you can verify quickly, then expand once the output looks right.

Use representative PDFs for the first run rather than the biggest batch. The goal is to validate the setup, not maximize volume on the first attempt.

How do I describe my document set?

Data type tells the extractor what kind of records the PDFs represent, such as safety reports, endpoint tables, study summaries, or requirement documents.

Write it the way you would describe the set to another person. This helps instruction generation and relational template inference stay on the right track.

Good Data type examples
post-market surveillance findings
clinical study summaries
contract payment terms

How should I use Document profile?

Leave Document profile on Auto for the first run. It is the safest default when you are still learning how the job behaves.

Use Structured Numeric for row-heavy or amount-heavy PDFs like invoices, contracts, or estimate tables. Use Research for journals, studies, and narrative evidence extraction.

How should I choose columns for the first run?

Columns define the exact fields you want returned in the output table.

For the first run, choose fields that are easy to spot and easy to verify. Add interpretation hints only when a field name could be read more than one way.

When should I use Relational mode?

Use Relational mode when one PDF contains repeated linked entities, such as multiple devices, endpoints, cohorts, or adverse events.

Relational output can contain multiple records per PDF, and each record carries the same required field keys.

Relational mode is currently in Beta: record-template inference is still being refined, so please flag any unexpected groupings to help us tune it.

Example output: one trial PDF, two device rows
PDF       | Record   | Device name         | Manufacturer | Risk class
trial.pdf | device-1 | CardioSense Monitor | ACME Med     | Class IIb
trial.pdf | device-2 | NeuroTrack Sensor   | ACME Med     | Class III
Example template for repeated devices
record=device
fields=device_name,udi,manufacturer,risk_class
cardinality=multi
Example output (multiple records from one PDF)
{
  "records": [
    {
      "record_type": "device",
      "cells": {
        "device_name": { "value": "CardioSense Monitor" },
        "udi": { "value": "00812345678901" },
        "manufacturer": { "value": "ACME Med" },
        "risk_class": { "value": "Class IIb" }
      }
    },
    {
      "record_type": "device",
      "cells": {
        "device_name": { "value": "NeuroTrack Sensor" },
        "udi": { "value": "00812345678944" },
        "manufacturer": { "value": "ACME Med" },
        "risk_class": { "value": "Class III" }
      }
    }
  ]
}
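
If you post-process the relational output yourself, a minimal sketch like the one below can turn the records into flat rows and confirm that each record carries the same required field keys. It assumes the `records` / `cells` / `value` layout shown in the example above; the `flatten_records` helper and the `REQUIRED` list are hypothetical names for illustration, not part of DocExtract.

```python
import json

# Sample payload copied from the example output above (assumed shape).
SAMPLE = """
{
  "records": [
    {"record_type": "device",
     "cells": {"device_name": {"value": "CardioSense Monitor"},
               "udi": {"value": "00812345678901"},
               "manufacturer": {"value": "ACME Med"},
               "risk_class": {"value": "Class IIb"}}},
    {"record_type": "device",
     "cells": {"device_name": {"value": "NeuroTrack Sensor"},
               "udi": {"value": "00812345678944"},
               "manufacturer": {"value": "ACME Med"},
               "risk_class": {"value": "Class III"}}}
  ]
}
"""

# Hypothetical list of required field keys, mirroring the example template.
REQUIRED = ["device_name", "udi", "manufacturer", "risk_class"]


def flatten_records(payload, required_fields):
    """Return one flat dict per record; raise if a required key is absent."""
    rows = []
    for index, record in enumerate(payload["records"], start=1):
        cells = record["cells"]
        missing = [name for name in required_fields if name not in cells]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
        row = {"record_type": record["record_type"]}
        for name in required_fields:
            row[name] = cells[name]["value"]
        rows.append(row)
    return rows


rows = flatten_records(json.loads(SAMPLE), REQUIRED)
for row in rows:
    print(row)
```

Raising on a missing key, rather than filling in a blank, keeps a malformed record from silently passing review.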

When should I use Flat mode?

Use Flat mode when you need one consolidated row per PDF.

This is best for document-level summaries where repeated entities do not need separate rows.

Example output: 3 invoices in, 3 rows out
PDF             | Invoice # | Total     | Due date
invoice_apr.pdf | INV-1042  | $4,820.00 | 2026-06-01
invoice_may.pdf | INV-1078  | $1,200.00 | 2026-07-01
invoice_jun.pdf | INV-1095  | $9,150.50 | 2026-07-15
Example flat output (one row per file)
PDF file      | study_id | primary_endpoint         | adverse_events
trial_a.pdf   | ST-101   | 12-month MACE rate 4.3%  | Not found
trial_b.pdf   | ST-102   | 6-month restenosis 2.1%  | Mild bleeding (3 events)

When should I care about the Record template (DSL)?

Most users can rely on the auto-generated "What we’ll extract" summary shown in the new-job form, which describes the same template in plain English.

The DSL only matters when you are doing relational extraction and need precise control over how repeated entities are grouped. Open the advanced editor in the form to access it. `record` is the entity anchor, `fields` lists the required field keys, and `cardinality` sets whether you expect one record (`single`) or many (`multi`).
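
To make the three keys concrete, here is a rough sketch of how a template in this `key=value` form could be read and sanity-checked before you paste it into the advanced editor. The `RecordTemplate` class and `parse_template` function are hypothetical helpers written for this illustration; DocExtract's own parser may accept or reject input differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecordTemplate:
    record: str        # entity anchor, e.g. "device"
    fields: List[str]  # required field keys
    cardinality: str   # "single" or "multi"


def parse_template(text: str) -> RecordTemplate:
    """Parse a three-line key=value template (assumed format)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {line!r}")
        values[key.strip()] = value.strip()

    missing = {"record", "fields", "cardinality"} - values.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if values["cardinality"] not in ("single", "multi"):
        raise ValueError("cardinality must be 'single' or 'multi'")

    fields = [name.strip() for name in values["fields"].split(",") if name.strip()]
    if not fields:
        raise ValueError("fields must list at least one key")
    return RecordTemplate(values["record"], fields, values["cardinality"])


template = parse_template(
    "record=device\n"
    "fields=device_name,udi,manufacturer,risk_class\n"
    "cardinality=multi"
)
print(template)
```

Checking `cardinality` up front catches the most likely typo, since only `single` and `multi` are described here.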

Example A: device-level extraction (multi)
record=device
fields=device_name,udi,manufacturer,risk_class
cardinality=multi
How that appears in review table
IFU_2025.pdf | device-1 | device | CardioSense Monitor | 00812345678901 | ACME Med | Class IIb
IFU_2025.pdf | device-2 | device | NeuroTrack Sensor   | 00812345678944 | ACME Med | Class III
Example B: trial summary (single)
record=trial
fields=trial_id,primary_endpoint,safety_summary
cardinality=single

When should I add Rules?

Rules are optional setup instructions that influence how values are interpreted and normalized.

Keep them concrete. For example: use endpoint table first, normalize percentages, return Not found when unsure.

What am I checking during instruction approval?

Check three things: the objective matches the document set, the field plans match the outputs you expect, and the missing-value behavior is strict enough.

You do not need to rewrite everything. If something is off, revise with one narrow sentence such as "Prefer endpoint table wording" or "Treat score as numeric plus unit when present."

How do I verify a result?

Click a value in the results table to open the evidence view. Review the quote, the page number, and the highlight before trusting the value.

If the document does not support the result, use the feedback actions to flag it or suggest a correction.

What files are allowed?

Upload PDF files only. Extraction jobs accept up to 10 PDFs per run, and per-file size limits are enforced at upload.
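
If you assemble batches with a script, a small local pre-check along these lines can catch obvious problems before upload. This is only a sketch: `check_batch` is a hypothetical helper, it looks at file extensions rather than file contents, and it does not check per-file size because this FAQ does not state the limit. The upload step remains the authoritative check.

```python
from pathlib import Path

MAX_PDFS_PER_RUN = 10  # per-run limit stated in this FAQ


def check_batch(paths):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Extension check only; this does not inspect the file contents.
    non_pdf = [str(p) for p in paths if Path(p).suffix.lower() != ".pdf"]
    if non_pdf:
        problems.append(f"not PDF files: {non_pdf}")

    if len(paths) > MAX_PDFS_PER_RUN:
        problems.append(
            f"{len(paths)} files selected; the limit is {MAX_PDFS_PER_RUN} per run"
        )
    return problems


print(check_batch(["trial_a.pdf", "trial_b.pdf"]))
print(check_batch(["notes.docx", "trial_a.pdf"]))
```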

Do not upload confidential or sensitive documents.