About data extraction

Hubdoc Support

June 05, 2024 23:45

Overview

This article explains the data extraction process that new documents in Hubdoc go through.

The extraction process

How it works

Every document imported into Hubdoc goes through a data extraction process, unless the organisation is in a non-paying state or data extraction is disabled. The extraction process usually happens within seconds, but can sometimes take up to 24 hours, depending on the document.

You can import different file types into Hubdoc, but data only gets extracted from DOC, PDF, GIF, JPG, PNG, HTML, TXT, HEIC and HEIF files.
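As a rough illustration, you could check a file before importing it to see whether its type is on the list above. This is a minimal sketch, not Hubdoc's own logic: the function name `is_extractable` is hypothetical, and matching on the file extension is an assumption, since the article doesn't say how Hubdoc detects file types.

```python
from pathlib import Path

# File types the article lists as supported for data extraction.
EXTRACTABLE_EXTENSIONS = {
    "doc", "pdf", "gif", "jpg", "png", "html", "txt", "heic", "heif",
}


def is_extractable(filename: str) -> bool:
    """Return True if the file extension is on the article's extraction list.

    Assumption: the type is judged by extension only, case-insensitively.
    """
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in EXTRACTABLE_EXTENSIONS
```

For example, `is_extractable("receipt.PDF")` returns `True`, while `is_extractable("statement.csv")` returns `False`. A file like the second one can still be imported, but no data is extracted from it.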

New documents waiting to have data extracted show on the Processing tab. Once extraction is complete, documents move to the Review tab. If Hubdoc can’t extract any data, documents move to the Failed tab.

To successfully complete data extraction, a document must include a date, supplier name and total amount. These details are extracted and stored in the organisation, along with the document image. Hubdoc also extracts any invoice number or due date showing on the document.
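The required and optional fields described above can be pictured as a simple record. This is an illustrative sketch only: the class `ExtractedData`, its field names and the function `extraction_complete` are hypothetical and don't reflect how Hubdoc stores data internally.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class ExtractedData:
    # Required for extraction to complete, per the article.
    document_date: Optional[date] = None
    supplier_name: Optional[str] = None
    total_amount: Optional[Decimal] = None
    # Extracted only if shown on the document.
    invoice_number: Optional[str] = None
    due_date: Optional[date] = None


def extraction_complete(data: ExtractedData) -> bool:
    """Return True when the date, supplier name and total amount are all present."""
    return (
        data.document_date is not None
        and bool(data.supplier_name)
        and data.total_amount is not None
    )
```

A document with a date and supplier name but no total amount wouldn't count as complete in this sketch, even if it has an invoice number.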

Hubdoc doesn't automatically extract line item data from a document. You need to enter this information manually or save the line items in the supplier rules you've set up.

The extracted data shows in the data toolbar, to the right of the document. Hubdoc creates a folder based on the supplier's name to store the document, or adds it to an existing folder for that supplier.

You can turn data extraction off and on to suit your needs.

Potential duplicate documents

When you email or manually upload a document, Hubdoc checks to see if it's a duplicate of an existing document.

Hubdoc identifies a new document as a potential duplicate if it has the same date, supplier name, and total amount as one or more existing documents.

Hubdoc also checks any invoice numbers on the documents. Two documents must have the same invoice number to show as potential duplicates.
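The matching rule above can be sketched as a comparison of four fields. This is an assumption-laden illustration rather than Hubdoc's implementation: the names `Doc`, `duplicate_key` and `find_potential_duplicates` are hypothetical, supplier names are compared exactly, and two documents that both lack an invoice number are treated as matching on that field, which the article doesn't confirm.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Doc:
    doc_id: str
    document_date: date
    supplier_name: str
    total_amount: Decimal
    invoice_number: Optional[str] = None


def duplicate_key(doc: Doc) -> Tuple[date, str, Decimal, Optional[str]]:
    """Fields that must all match for two documents to be potential duplicates."""
    return (doc.document_date, doc.supplier_name, doc.total_amount, doc.invoice_number)


def find_potential_duplicates(new_doc: Doc, existing: List[Doc]) -> List[Doc]:
    """Return every existing document whose key matches the new document's key."""
    key = duplicate_key(new_doc)
    return [doc for doc in existing if duplicate_key(doc) == key]
```

In this sketch, a new document with the same date, supplier and total as an existing one but a different invoice number isn't returned as a potential duplicate.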

Hubdoc flags a potential duplicate with a duplicate document icon in the data toolbar. Click Show Duplicates to open a panel showing the document alongside all other potential duplicate documents. Choose to mark a document as Not a duplicate, or click Move to Trash to delete it.

You can turn off duplicate detection for a particular supplier if you regularly get duplicate documents from the same supplier and want to keep them.

If you've set up automatic publishing for a particular supplier, any potential duplicate documents from that supplier aren't automatically published.

Dates

Hubdoc identifies an organisation’s region based on the currency selected when the organisation is set up. You can enter a date in any format, but the region determines how Hubdoc extracts and displays the date.

For organisations in the US and Canada, the date format is assumed to be MM/DD/YYYY.
For organisations in the UK, AU, NZ and the rest of the world, the date format is assumed to be DD/MM/YYYY.

On some documents, the date can be ambiguous if the correct date format isn’t identified. If the date format based on the organisation’s currency results in a future date, we’ll automatically use the other format. If you need to change the currency selected for your organisation, you can do this in the organisation settings.
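To see how the region and the future-date check interact, here is a minimal sketch for dates written with slashes. It isn't Hubdoc's code: the function `parse_slash_date` and the region codes are hypothetical, the region is passed in directly rather than derived from the currency, and the behaviour when the assumed format doesn't produce a valid date is an assumption the article doesn't cover.

```python
from datetime import date, datetime
from typing import Optional

# Regions where MM/DD/YYYY is assumed, per the article (US and Canada).
MONTH_FIRST_REGIONS = {"US", "CA"}


def _try_parse(text: str, fmt: str) -> Optional[date]:
    """Parse text with the given format, or return None if it doesn't fit."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def parse_slash_date(text: str, region: str, today: date) -> Optional[date]:
    """Interpret an ambiguous date using the region's format first.

    If the regional format gives a future date, use the other format instead.
    Assumption: the other format is also tried when the regional one is invalid.
    """
    if region in MONTH_FIRST_REGIONS:
        primary_fmt, other_fmt = "%m/%d/%Y", "%d/%m/%Y"
    else:
        primary_fmt, other_fmt = "%d/%m/%Y", "%m/%d/%Y"

    primary = _try_parse(text, primary_fmt)
    other = _try_parse(text, other_fmt)

    if primary is not None and primary <= today:
        return primary
    if other is not None:
        return other
    return primary  # may be a future date or None if nothing else fits
```

With `today` set to 5 June 2024, `parse_slash_date("03/04/2024", "US", today)` gives 4 March 2024, while the same text for a UK organisation gives 3 April 2024. For `"12/05/2024"` in a US organisation, the regional reading (5 December 2024) is in the future, so the sketch switches to 12 May 2024.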

Currencies

Hubdoc automatically recognises a wide range of currencies in your documents. If we can't determine the currency of the original document, we'll use the base currency of the organisation. You can change the currency extracted from a document in the edit data toolbar.

Sometimes Hubdoc misreads amounts where figures are separated by commas. To fix this, change the amount in the edit data toolbar.

If you’re publishing multicurrency documents to your cloud accounting platform, make sure the currency is set up in your cloud accounting platform first, then select the currency on the document.

Tax extraction

For organisations in the UK, CA, AU and NZ, if your organisation is connected to your cloud accounting platform and you’ve enabled tax data to be published, you can select the tax rate that applies to the document, or select Extracted Amount to manually adjust the tax amount.

For organisations in the UK, AU and NZ, you can also turn on auto-tax extraction. When this is enabled, Hubdoc extracts the sales tax amount in addition to the supplier name, date, and total amount. If the sales tax amount can’t be found on the document, we use the tax rate selected for the supplier or the default tax rate for the organisation.
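The fallback described above can be sketched as a simple order of preference. This is an illustration under assumptions: the function `choose_tax_source` and its labels are hypothetical, tax rates are represented as plain text names, and checking the supplier's rate before the organisation's default is inferred from the wording rather than stated outright.

```python
from decimal import Decimal
from typing import Optional, Tuple, Union


def choose_tax_source(
    extracted_tax: Optional[Decimal],
    supplier_tax_rate: Optional[str],
    org_default_tax_rate: str,
) -> Tuple[str, Union[Decimal, str]]:
    """Pick where the tax for a document comes from when auto-tax extraction is on."""
    # Prefer the sales tax amount found on the document itself.
    if extracted_tax is not None:
        return ("extracted_amount", extracted_tax)
    # Assumed order: the rate selected for the supplier comes next.
    if supplier_tax_rate is not None:
        return ("supplier_rate", supplier_tax_rate)
    # Otherwise use the organisation's default tax rate.
    return ("organisation_default_rate", org_default_tax_rate)
```

For a document with no tax amount showing and no rate set for the supplier, the sketch returns the organisation's default rate.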

However, selecting a single tax rate from the tax rate field doesn't always result in the correct tax amount for the document, for two reasons:

Sometimes, the tax on a document doesn't reflect a flat application of a single tax rate to the total. Examples include restaurant meals where the subtotal is taxed but the tip isn't, and grocery bills where some food items are taxed and others aren't.
The amount that Hubdoc extracts is rounded to two decimal places, so depending on how the numbers were rounded on the document, the tax calculation might be off by a cent or two.
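A worked example with made-up figures shows why a flat rate can give the wrong tax amount. The numbers, the 10% rate and the function `tax_from_inclusive_total` are all hypothetical, and treating the total as tax-inclusive is an assumption, since the article doesn't say how the tax is calculated from the selected rate.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def tax_from_inclusive_total(total: Decimal, rate: Decimal) -> Decimal:
    """Tax portion of a tax-inclusive total at one flat rate, rounded to cents."""
    return (total * rate / (1 + rate)).quantize(CENT, rounding=ROUND_HALF_UP)


# A restaurant bill where the meal is taxed but the tip isn't.
subtotal = Decimal("50.00")
rate = Decimal("0.10")
tip = Decimal("8.00")

actual_tax = (subtotal * rate).quantize(CENT, rounding=ROUND_HALF_UP)
total = subtotal + actual_tax + tip

# Applying the same rate flat across the whole total, tip included.
flat_tax = tax_from_inclusive_total(total, rate)

print(f"Total on document: {total}")
print(f"Tax on document:   {actual_tax}")
print(f"Flat-rate tax:     {flat_tax}")
```

Here the document shows 5.00 in tax on a total of 63.00, but applying 10% across the whole total gives 5.73, because the untaxed tip is included in the calculation. In cases like this, the tax amount needs to be adjusted manually.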

To adjust the calculated tax amount manually, you can use an automatic calculator tool. You can't use the automatic calculator if you're only publishing a single line item. To publish a single line item, you need to change the tax rate to a specific rate.

What's next?

Identify the best way to get documents into your organisation or resolve issues with data extraction.