Import

This describes the state of the import code as of 5.51. This is for a common understanding of the code in core.

The Import framework consists of:

The form layer
The datasource object
The parser object
The user job
The import table
The queue

The form layer

There are 6 imports in core (contact, participant, contribution, membership, activity and multiple-field custom data) and each has the following forms:

DataSource - responsible for initiating the datasource. During postProcess
- the userJob is created
- the datasource is initialised with the data in it being moved to the import table
MapField - responsible for storing the mapping to be used to map the datasource to the import fields. During postProcess the validateValues function on the Parser is called and the import table is updated with the results.
Preview - responsible for presenting the results of validating the data, permitting download of rows that failed validation. On postProcess this sets off the import
Summary - this used to be the last step in the flow. However, the summary is now a standalone form that the user lands on, loading the data from the UserJob and import table. (only the contact summary form is now used and as of writing removing the others is a todo).

DataSource

The Datasource is responsible for getting the managing the source data and the import table. Currently there are two classes CRM_Import_DataSource_CSV and CRM_Import_DataSource_SQL and only the Contact import offers a choice between them. Each datasource needs to offer the following functions:

getInfo() - declare the source title and permissions
buildQuickForm build the ajax snippet that adds the fields specific to this dataSource
getSubmittableFieldsget a list of the fields added via the quickForm. (in practice the submittableFields property is set and the parent getter is used)
initialize() - this function takes the submitted datasource information and transfers it to the import table.

The parent class offers additional functions setUserJobID getUserJobID getColumnHeaders getColumnCount getRow - retrieve a row from the import table - can be adjusted with setLimit, setOffset and setStatuses updateStatus upddates the status for the given row in the import table. purge

For developers whom whish to support a custom source e.g. a JSON feed or similar the main you can do so in an extension by creating a class that extends CRM_Import_DataSource and importantly implements the Civi\Import\DataSource\DataSourceInterface. Your extension will need to ensure that the scan-classes mixin is enabled as well.

As part of the initialize function you want to call updateUserJobDataSource to set the import table name, the column headers and the number of columns that were created.

The import table The import framework expects the data to be import to be transferred to a table. Some tracking fields are added to this table. The tracking fields all have leading underscore and are not returned using getColumnHeaders:

_id - row number
_status - see further down the page
_status_message
_entity_id - eg. the imported contact ID - not yet exposed in the UI.
potentially other import-specific fields. This is used in contact import to store the number of related contacts created - but not displayed in the UI.

The import table persists beyond the import which potentially can allow troubleshooting or auditing. The intention is to add a cleanup process for the import table and user jobs but for now they are dropped during cache clear if more than two days old, or during upgrade.

The following shows interrogation of the datasource during a test to check on the outcome

    $dataSource = new CRM_Import_DataSource_CSV($this->userJobID);
    $dataSource->setOffset(4);
    $dataSource->setLimit(1);
    $row = $dataSource->getRow();
    $this->assertEquals(1, $row['_entity_id']);
    $this->assertEquals(1,  $dataSource->getRowCount([CRM_Import_Parser::VALID]));

Statuses The statuses stored to the table are not quite the same as the constant. This is part of the refactoring transition and could reasonably be further cleaned up. However, for now the statuses are per the function on the CRM_Import_DataSource class. The keys are the constants and the values are the values stored in the _status field. Note that a status might mean 'valid' and also mean 'soft_credit_imported'.

  protected function getStatusMapping(): array {
    return [
      CRM_Import_Parser::VALID => ['imported', 'new', 'soft_credit_imported', 'pledge_payment_imported'],
      CRM_Import_Parser::ERROR => ['error', 'invalid', 'soft_credit_error', 'pledge_payment_error'],
      CRM_Import_Parser::DUPLICATE => ['duplicate'],
      CRM_Import_Parser::NO_MATCH => ['invalid_no_match'],
      CRM_Import_Parser::UNPARSED_ADDRESS_WARNING => ['warning_unparsed_address'],
      CRM_Contribute_Import_Parser_Contribution::SOFT_CREDIT_ERROR => ['soft_credit_error'],
      CRM_Contribute_Import_Parser_Contribution::SOFT_CREDIT => ['soft_credit_imported'],
      CRM_Contribute_Import_Parser_Contribution::PLEDGE_PAYMENT => ['pledge_payment_imported'],
      CRM_Contribute_Import_Parser_Contribution::PLEDGE_PAYMENT_ERROR => ['pledge_payment_error'],
      'new' => ['new'],
    ];
  }

User Job

The user job corresponds to a row in civicrm_user_job and is used to track the import job. It holds metadata about the job including

submitted_values - what was submitted on the forms
dataSource - dataSource metadata
type_id - used to determine the Parser class.

Parser

This class does the importing. A lot of the work is done by the parent class, based on the metadata for the class. There is a bit of ambiguity currently around when it is set but at this stage both calling setMetaData and init should ensure it is set. The key functions called from outside the class are

Field presentation functions:

getAvailableFields() - get a [name => label] array of fields that can be imported to in this import
getDataPatterns() - gets an array like [name => dataPattern] (a re-keying of the metadata) for attempting to guess the field if column headers are not used.
getHeaderPatterns()- gets an array like [name => dataPattern] (a re-keying of the metadata) for attempting to guess the field if column headers are used.
getMappedFieldLabel($mappedField) - gets the label to present on the preview screen from the mapped field. This may include location, phone type.
validate - this is called during PostProcess from MapField. It runs through all the fields in the dataSource and runs validation on them. The validation is limited to what can be checked without row-specific database queries. ie. it will validate the option value or date is valid but it will not do a lookup on contact_id. Validate sets the status
runImport - imports all rows with a status of 'new', updating the import table with the outcome.

Internally the validate and import functions rely a lot of these functions getMappedRow(array $values) - this receives a row from the DataSource and maps it to and api-ish $params array. It does this by retrieving the saved mapper submitted on the MapField form from the UserJob and passing it through the parser-specific getFieldMappings (for Contact the values are parsed into locations and relationship, the others are mostly generic).

getTransformedValue Before returning getMappedRow Each field is parsed through getTransformedValue which validates the value against field metadata. This function explodes serialized values to arrays and converts any option values to their id. Invalid values are set to invalid_import_value. Note that option values are matched in a case-insensitive way with labels, names or ids.
getFieldMetadata - gets the metadata for the specific field. Does some extra loading in some cases.

Obsolete methods

run
summary
preview
encloseScrub
errorFileName

And see https://github.com/civicrm/civicrm-core/pull/23721#issuecomment-1149616759

Thoughts/ todos:

Make skipColumnHeaders the default
Move fieldSeparator from the DataSource form to the form declared by CRM_Import_Datasource_CSV
Why not allow creation of contacts from (e.g) contribution import
Add batch size
Would be really good to be able to be able to specify defaults (e.g default source ie 'my spreadsheet'). It might be in the csv or not (e.g not have to add a currency column to say 'all of these are AUD')
The UserJob type id is currently hard coded as an option list in the UserJob BAO. It includes the ParserClass so the runImport function cannot currently be used with a non-core parser but that feels like it will be needed for the two existing import extensions.