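Extract API¶

Odoo provides a service to automate the processing of documents of type invoices, expenses, or resumes.

The service scans documents using an OCR engine and then uses AI-based algorithms to extract fields of interest: the total, due date, or invoice lines for invoices; the total or date for expenses; and the name, email, or phone number for resumes.

This is a paid service. Each document processing costs one credit. Credits can be bought on iap.odoo.com.

You can use this service directly in the Accounting, Expenses, or Recruitment app, or through the API. The Extract API, detailed in the next section, lets you integrate our service directly into your own projects.

Overview¶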

The extract API uses the JSON-RPC2 protocol; its endpoint routes are located at https://extract.api.odoo.com.

Version¶

The version of the extract API is specified in the route.

The latest versions are:

invoices: 122
expenses: 132
applicant: 102

Flow¶

The flow is the same for each document type.

Call /parse to submit your documents (one call for each document). On success, you receive a document_token in the response.
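You then need to regularly poll /get_result to get the document's parsing status.

Alternatively, you can provide a webhook_url when calling /parse, and you will be notified (via a POST request) when the result is ready.

All of these calls should use the HTTP POST method. A Python implementation of the full flow for invoices can be found here, and a token for integration testing is provided in the integration testing section.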
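The flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the official implementation: the helper names `post_json`, `call`, and `extract_invoice` are hypothetical, the polling interval is arbitrary, and any status other than `success` or an `error_*` code is assumed to mean that the result is not ready yet.

```python
import json
import time
import urllib.request
import uuid

BASE_URL = "https://extract.api.odoo.com"


def post_json(url, body):
    """POST a JSON body and return the decoded JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def call(route, params, post=post_json):
    """Wrap params in a JSON-RPC 2.0 envelope and return the "result" object."""
    body = {"jsonrpc": "2.0", "method": "call", "params": params, "id": uuid.uuid4().hex}
    return post(BASE_URL + route, body)["result"]


def extract_invoice(account_token, version, document, post=post_json, max_polls=30, delay=2.0):
    """Submit one document to /parse, then poll /get_result until it is done."""
    parsed = call(
        "/api/extract/invoice/2/parse",
        {"account_token": account_token, "version": version, "documents": [document]},
        post,
    )
    if parsed["status"] != "success":
        raise RuntimeError(parsed["status_msg"])

    params = {
        "version": version,
        "document_token": parsed["document_token"],
        "account_token": account_token,
    }
    for _ in range(max_polls):
        result = call("/api/extract/invoice/2/get_result", params, post)
        if result["status"] == "success":
            return result["results"]
        if result["status"].startswith("error_"):
            raise RuntimeError(result["status_msg"])
        # Assumption: any other status means the document is still being processed.
        time.sleep(delay)
    raise TimeoutError("no result after %d polls" % max_polls)
```

Passing the transport as the `post` argument keeps the sketch testable without network access; a webhook-based client would replace the polling loop with a handler for the notification request.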

Parse¶

Request the processing of a document from the OCR. The route returns a document_token, which you can use to obtain the result of your request.

Routes¶

/api/extract/invoice/2/parse

/api/extract/expense/2/parse

/api/extract/applicant/2/parse

Request¶

jsonrpc (required)

See JSON-RPC2

method (required)

See JSON-RPC2

id (required)

See JSON-RPC2

params

account_token (required)

The token of the account from which credits will be taken. Every successful call requires a token.

version (required)

The version will determine the format of your requests and the format of the server response. You should use the latest version available.

documents (required)

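The document must be provided as an ASCII-encoded string. The list should contain only one string. If several strings are provided, only the first string that corresponds to a pdf is processed. If no pdf is found, the first string is processed. This field is a list only for legacy reasons. The supported extensions are pdf, png, jpg, and bmp.

dbuuid (optional)

Unique identifier of the Odoo database.

webhook_url (optional)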

A webhook URL can be provided. An empty POST request will be sent to webhook_url/document_token when the result is ready.

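user_infos (optional)

Information concerning the person who sends the document to the extract service. It can be the client or the supplier (depending on the perspective). This information is not required for the service to work, but it greatly improves the quality of the result.

user_company_vat (optional): VAT number of the user.
user_company_name (optional): Name of the user's company.
user_company_country_code (optional): Country code of the user. Format: ISO3166 alpha-2 (https://www.iban.com/country-codes).
user_lang (optional): The user language. Format: language_code + _ + locale (e.g. fr_FR, en_US).
user_email (optional): The user email.
purchase_order_regex (optional): Regex for purchase order identification. Defaults to the Odoo purchase order format if not provided.
perspective (optional): Can be client or supplier. This field is useful for invoices only. client means that the user information provided relates to the client of the invoice. supplier means that it relates to the supplier. If not provided, client is used.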

{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        "account_token": string,
        "version": int,
        "documents": [string],
        "dbuuid": string,
        "webhook_url": string,
        "user_infos": {
            "user_company_vat": string,
            "user_company_name": string,
            "user_company_country_code": string,
            "user_lang": string,
            "user_email": string,
            "purchase_order_regex": string,
            "perspective": string,
        },
    },
    "id": string,
}

Note

The user_infos parameter is optional but it greatly improves the quality of the result, especially for invoices. The more information you can provide, the better.
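As an illustration of the request format above, here is a minimal sketch that assembles a /parse body for a file's bytes. The helper name `build_parse_request` is hypothetical, and base64 is an assumption about how to turn the file bytes into an ASCII string; the text above only requires an ASCII-encoded string. The sample bytes and user information are made up.

```python
import base64
import uuid


def build_parse_request(account_token, version, file_bytes, user_infos=None, webhook_url=None):
    """Return a JSON-RPC 2.0 body for a /parse call (illustrative helper)."""
    params = {
        "account_token": account_token,
        "version": version,
        # Assumption: the document bytes are base64-encoded to get an ASCII string.
        # The field is a list for legacy reasons and should hold a single string.
        "documents": [base64.b64encode(file_bytes).decode("ascii")],
    }
    # Optional fields are only sent when the caller provides them.
    if user_infos:
        params["user_infos"] = user_infos
    if webhook_url:
        params["webhook_url"] = webhook_url
    return {"jsonrpc": "2.0", "method": "call", "params": params, "id": uuid.uuid4().hex}


# Example with made-up values; version 122 is the invoice version listed above.
body = build_parse_request(
    "integration_token",
    122,
    b"%PDF-1.4 example bytes",
    user_infos={"user_company_country_code": "BE", "user_lang": "fr_FR", "perspective": "client"},
)
```

Leaving optional keys out of `params` entirely, rather than sending empty values, keeps the request close to the schema shown above.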

Response¶

jsonrpc

See JSON-RPC2

id

See JSON-RPC2

result

status: The code indicating the status of the request. See the table below.
status_msg: A string giving verbose details about the request status.
document_token: Only present if the request is successful.

status	status_msg
`success`	Success
`error_unsupported_version`	Unsupported version
`error_internal`	An error occurred
`error_no_credit`	You don't have enough credit
`error_unsupported_format`	Unsupported file format
`error_maintenance`	Server is currently under maintenance, please try again later

{
    "jsonrpc": "2.0",
    "id": string,
    "result": {
        "status": string,
        "status_msg": string,
        "document_token": string,
    }
}

Note

The API does not actually use the JSON-RPC error scheme. Instead, the API bundles its own error scheme inside a successful JSON-RPC result.
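Because errors arrive inside a successful JSON-RPC result, a client has to inspect the status field itself. The following sketch shows one way to do that; the names `ExtractError` and `unwrap_result` are hypothetical, and the sample responses are made up to match the schema above.

```python
class ExtractError(Exception):
    """Raised when the API reports a non-success status (hypothetical name)."""

    def __init__(self, status, message):
        super().__init__("%s: %s" % (status, message))
        self.status = status


def unwrap_result(response):
    """Return response["result"] if its status is "success", otherwise raise."""
    result = response["result"]
    if result["status"] != "success":
        raise ExtractError(result["status"], result.get("status_msg", ""))
    return result


# Made-up sample response following the schema above.
ok = unwrap_result({
    "jsonrpc": "2.0",
    "id": "1",
    "result": {"status": "success", "status_msg": "Success", "document_token": "abc"},
})
```

Keeping the status code on the exception lets callers branch on, for example, `error_no_credit` without parsing the message text.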

Get results¶

Routes¶

/api/extract/invoice/2/get_result

/api/extract/expense/2/get_result

/api/extract/applicant/2/get_result

Request¶

jsonrpc (required)

See JSON-RPC2

method (required)

See JSON-RPC2

id (required)

See JSON-RPC2

params

version (required): The version should match the version passed to the /parse request.
document_token (required): The document_token for which you want to get the current parsing status.
account_token (required): The token of the account that was used to submit the document.

{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        "version": int,
        "document_token": string,
        "account_token": string,
    },
    "id": string,
}

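Response¶

When getting the results from the parse, the detected fields vary greatly depending on the type of document. Each response is a list of dictionaries, one for each document. The keys of the dictionary are the names of the fields and the values are the values of the fields.

jsonrpc

See JSON-RPC2

id

See JSON-RPC2

result

status

The code indicating the status of the request. See the table below.

status_msg

A string giving verbose details about the request status.

results

Only present if the request is successful.

full_text_annotation: Contains the unprocessed full result from the OCR for the document.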

status	status_msg
`success`	Success
`error_unsupported_version`	Unsupported version
`error_internal`	An error occurred
`error_maintenance`	Server is currently under maintenance, please try again later
`error_document_not_found`	The document could not be found
`error_unsupported_size`	The document has been rejected because it is too small
`error_no_page_count`	Unable to get page count of the PDF file
`error_pdf_conversion_to_images`	Couldn’t convert the PDF to images
`error_password_protected`	The PDF file is protected by a password
`error_too_many_pages`	The document contains too many pages

{
    "jsonrpc": "2.0",
    "id": string,
    "result": {
        "status": string,
        "status_msg": string,
        "results": [
            {
                "full_text_annotation": string,
                "feature_1_name": feature_1_result,
                "feature_2_name": feature_2_result,
                ...
            },
            ...
        ]
    }
}
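To illustrate the response layout, the sketch below separates the raw OCR text from the extracted features of one document. The helper name `split_document_result` is hypothetical and the sample result is made up; only the key names come from the schema above.

```python
def split_document_result(document_result):
    """Separate the raw OCR text from the extracted features of one document."""
    features = dict(document_result)  # copy so the caller's dict is left untouched
    full_text = features.pop("full_text_annotation", "")
    return full_text, features


# Made-up sample following the schema above.
result = {
    "status": "success",
    "status_msg": "Success",
    "results": [
        {
            "full_text_annotation": "INVOICE 42 ...",
            "total": {"selected_value": {"content": 330.0}},
        }
    ],
}
full_text, features = split_document_result(result["results"][0])
```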

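Common fields¶

`feature_result`¶

Each field of interest we want to extract from the document, such as the total or the due date, is also called a feature. An exhaustive list of all the extracted features associated with a type of document can be found in the sections below.

For each feature, we return a list of candidates and we spotlight the candidate our model predicts to be the best fit for the feature.

selected_value (optional): The best candidate for this feature.
selected_values (optional): The best candidates for this feature.
candidates (optional): List of all the candidates for this feature, ordered by decreasing confidence score.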

"feature_name": {
    "selected_value": candidate_12,
    "candidates": [candidate_12, candidate_3, candidate_4, ...]
}
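A client usually only needs the selected content of a feature. The sketch below reads it from either `selected_value` or `selected_values`; the helper name `selected_contents` is hypothetical, the sample data is made up, and it is assumed that each selected entry is a candidate dictionary with a `content` key.

```python
def selected_contents(feature_result):
    """Return the content of the selected candidate(s) as a list (possibly empty)."""
    if "selected_values" in feature_result:
        chosen = feature_result["selected_values"]
    elif "selected_value" in feature_result:
        chosen = [feature_result["selected_value"]]
    else:
        # All three keys are optional, so a feature may carry no selection at all.
        chosen = []
    return [candidate["content"] for candidate in chosen]


# Made-up samples: a single-valued feature and a multi-valued one.
total = {
    "selected_value": {"content": 330.0, "coords": [0.5, 0.5, 0.1, 0.02, 0.0], "page": 0},
    "candidates": [{"content": 330.0, "coords": [0.5, 0.5, 0.1, 0.02, 0.0], "page": 0}],
}
purchase_order = {"selected_values": [{"content": "PO001"}, {"content": "PO002"}]}
```

Returning a list in both cases lets callers treat single-valued and multi-valued features the same way.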

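`candidate`¶

For each candidate, we give its representation and position in the document. Candidates are sorted by decreasing order of suitability.

content: Representation of the candidate.
coords: [center_x, center_y, width, height, rotation_angle]. The position and dimensions are relative to the size of the page and are therefore between 0 and 1. The angle is a clockwise rotation measured in degrees.
page: Page of the original document on which the candidate is located (starts at 0).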

"candidate": [
    {
        "content": string|float,
        "coords": [float, float, float, float, float],
        "page": int
    },
    ...
]
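Since coords are relative to the page size, drawing a candidate on a rendered page requires scaling them. The following sketch converts coords to a pixel box. The helper name `coords_to_pixel_box` is hypothetical, the rotation angle is deliberately ignored, and the origin is assumed to be the top-left corner of the page, which the description above does not state.

```python
def coords_to_pixel_box(coords, page_width, page_height):
    """Convert relative candidate coords to (left, top, right, bottom) in pixels.

    The rotation angle is ignored, so the box is only exact for unrotated text.
    """
    center_x, center_y, width, height, _angle = coords
    left = (center_x - width / 2) * page_width
    top = (center_y - height / 2) * page_height
    right = (center_x + width / 2) * page_width
    bottom = (center_y + height / 2) * page_height
    return left, top, right, bottom


# Made-up candidate centered at (0.5, 0.25) on a 1000 x 2000 pixel page.
box = coords_to_pixel_box([0.5, 0.25, 0.25, 0.125, 0.0], page_width=1000, page_height=2000)
```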

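Invoices¶

Invoices are complex and can have a lot of different fields. The following table gives an exhaustive list of all the fields we can extract from an invoice.

Feature name	Specificities
`SWIFT_code`	`content` is a dictionary encoded as a string. It contains information about the detected SWIFT code (or BIC). Keys: `bic` the detected BIC (string). `name` (optional) bank name (string). `country_code` ISO3166 alpha-2 country code of the bank (string). `city` (optional) city of the bank (string). `verified_bic` True if the BIC has been found in our database (bool). Name and city are present only if verified_bic is true.
`iban`	`content` is a string
`aba`	`content` is a string
`VAT_Number`	`content` is a string. Depending on the value of perspective in user_infos, this is the VAT number of the supplier or of the client. If perspective is client, it is the supplier's VAT number. If it is supplier, it is the client's VAT number.
`qr-bill`	`content` is a string
`payment_ref`	`content` is a string
`purchase_order`	`content` is a string. Uses `selected_values` instead of `selected_value`
`country`	`content` is a string
`currency`	`content` is a string
`date`	`content` is a string. Format: YYYY-MM-DD
`due_date`	Same as for `date`
`total_tax_amount`	`content` is a float
`invoice_id`	`content` is a string
`subtotal`	`content` is a float
`total`	`content` is a float
`supplier`	`content` is a string
`client`	`content` is a string
`email`	`content` is a string
`website`	`content` is a string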
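The `SWIFT_code` feature is the one entry in this table whose content needs a second decoding step. The sketch below assumes that the string holds JSON, which the table does not specify; the helper name `decode_swift_code` is hypothetical and the sample value is made up.

```python
import json


def decode_swift_code(content):
    """Decode the SWIFT_code content string into a dict (assumes JSON encoding)."""
    info = json.loads(content)
    # name and city are only present when verified_bic is true, so fill them
    # with None otherwise to give callers a stable set of keys.
    if not info.get("verified_bic"):
        info.setdefault("name", None)
        info.setdefault("city", None)
    return info


# Made-up sample content for an unverified BIC.
swift = decode_swift_code('{"bic": "GEBABEBB", "country_code": "BE", "verified_bic": false}')
```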

`feature_result` for the `invoice_lines` feature¶

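It follows a more specific structure. It is basically a list of dictionaries where each dictionary represents an invoice line. Each value follows a feature_result structure.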

"invoice_lines": [
    {
        "description": feature_result,
        "discount": feature_result,
        "product": feature_result,
        "quantity": feature_result,
        "subtotal": feature_result,
        "total": feature_result,
        "taxes": feature_result,
        "unit": feature_result,
        "unit_price": feature_result
    },
    ...
]
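The nested structure above can be flattened into plain rows for further processing. This is a minimal sketch: the helper name `invoice_lines_to_rows` and the default field selection are hypothetical, the sample line is made up, and each selected value is assumed to be a candidate dictionary with a `content` key.

```python
def invoice_lines_to_rows(invoice_lines, fields=("description", "quantity", "unit_price", "subtotal", "total")):
    """Flatten invoice_lines into plain dicts holding each field's selected content."""
    rows = []
    for line in invoice_lines:
        row = {}
        for field in fields:
            # A field may be missing from a line, or present without a selection.
            selected = line.get(field, {}).get("selected_value")
            row[field] = selected["content"] if selected else None
        rows.append(row)
    return rows


# Made-up sample with one invoice line and no subtotal detected.
lines = [
    {
        "description": {"selected_value": {"content": "Desk"}},
        "quantity": {"selected_value": {"content": 2.0}},
        "unit_price": {"selected_value": {"content": 150.0}},
        "total": {"selected_value": {"content": 300.0}},
    }
]
rows = invoice_lines_to_rows(lines)
```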

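Expense¶

Expenses are less complex than invoices. The following table gives an exhaustive list of all the fields we can extract from an expense report.

Feature name	Specificities
`description`	`content` is a string
`country`	`content` is a string
`date`	`content` is a string
`total`	`content` is a float
`currency`	`content` is a string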

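Applicant¶

This third type of document is meant for processing resumes. The following table gives an exhaustive list of all the fields we can extract from a resume.

Feature name	Specificities
`name`	`content` is a string
`email`	`content` is a string
`phone`	`content` is a string
`mobile`	`content` is a string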

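Integration Testing¶

You can test your integration by using integration_token as account_token in the /parse request.

Using this token puts you in test mode and lets you simulate the entire flow without really parsing a document and without being billed one credit for each successful document parsing.

The only technical differences in test mode are that the document you send is not parsed by the system and that the response you get from /get_result is hard-coded.

A Python implementation of the full flow for invoices can be found here.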
