Import

This describes the state of the import code as of 5.51. This is for a common understanding of the code in core.

The Import framework consists of:

  • The form layer
  • The datasource object
  • The parser object
  • The user job
  • The import table
  • The queue

The form layer

There are 6 imports in core (contact, participant, contribution, membership, activity and multiple-field custom data) and each has the following forms:

  • DataSource - responsible for initiating the datasource. During postProcess
    • the userJob is created
    • the datasource is initialised with the data in it being moved to the import table
  • MapField - responsible for storing the mapping to be used to map the datasource to the import fields. During postProcess the validateValues function on the Parser is called and the import table is updated with the results.
  • Preview - responsible for presenting the results of validating the data, permitting download of rows that failed validation. On postProcess this sets off the import
  • Summary - this used to be the last step in the flow. However, the summary is now a standalone form that the user lands on, loading the data from the UserJob and import table. (only the contact summary form is now used and as of writing removing the others is a todo).

DataSource

The Datasource is responsible for getting the managing the source data and the import table. Currently there are two classes CRM_Import_DataSource_CSV and CRM_Import_DataSource_SQL and only the Contact import offers a choice between them. Each datasource needs to offer the following functions:

  • getInfo() - declare the source title and permissions
  • buildQuickForm build the ajax snippet that adds the fields specific to this dataSource
  • getSubmittableFieldsget a list of the fields added via the quickForm. (in practice the submittableFields property is set and the parent getter is used)
  • initialize() - this function takes the submitted datasource information and transfers it to the import table.

The parent class offers additional functions setUserJobID getUserJobID getColumnHeaders getColumnCount getRow - retrieve a row from the import table - can be adjusted with setLimit, setOffset and setStatuses updateStatus upddates the status for the given row in the import table. purge

The import table The import framework expects the data to be import to be transferred to a table. Some tracking fields are added to this table. The tracking fields all have leading underscore and are not returned using getColumnHeaders:

  • _id - row number
  • _status - see further down the page
  • _status_message
  • _entity_id - eg. the imported contact ID - not yet exposed in the UI.
  • potentially other import-specific fields. This is used in contact import to store the number of related contacts created - but not displayed in the UI.

The import table persists beyond the import which potentially can allow troubleshooting or auditing. The intention is to add a cleanup process for the import table and user jobs but for now they are dropped during cache clear if more than two days old, or during upgrade.

The following shows interrogation of the datasource during a test to check on the outcome

    $dataSource = new CRM_Import_DataSource_CSV($this->userJobID);
    $dataSource->setOffset(4);
    $dataSource->setLimit(1);
    $row = $dataSource->getRow();
    $this->assertEquals(1, $row['_entity_id']);
    $this->assertEquals(1,  $dataSource->getRowCount([CRM_Import_Parser::VALID]));
Statuses The statuses stored to the table are not quite the same as the constant. This is part of the refactoring transition and could reasonably be further cleaned up. However, for now the statuses are per the function on the CRM_Import_DataSource class. The keys are the constants and the values are the values stored in the _status field. Note that a status might mean 'valid' and also mean 'soft_credit_imported'.

  protected function getStatusMapping(): array {
    return [
      CRM_Import_Parser::VALID => ['imported', 'new', 'soft_credit_imported', 'pledge_payment_imported'],
      CRM_Import_Parser::ERROR => ['error', 'invalid', 'soft_credit_error', 'pledge_payment_error'],
      CRM_Import_Parser::DUPLICATE => ['duplicate'],
      CRM_Import_Parser::NO_MATCH => ['invalid_no_match'],
      CRM_Import_Parser::UNPARSED_ADDRESS_WARNING => ['warning_unparsed_address'],
      CRM_Contribute_Import_Parser_Contribution::SOFT_CREDIT_ERROR => ['soft_credit_error'],
      CRM_Contribute_Import_Parser_Contribution::SOFT_CREDIT => ['soft_credit_imported'],
      CRM_Contribute_Import_Parser_Contribution::PLEDGE_PAYMENT => ['pledge_payment_imported'],
      CRM_Contribute_Import_Parser_Contribution::PLEDGE_PAYMENT_ERROR => ['pledge_payment_error'],
      'new' => ['new'],
    ];
  }

User Job

The user job corresponds to a row in civicrm_user_job and is used to track the import job. It holds metadata about the job including

  • submitted_values - what was submitted on the forms
  • dataSource - dataSource metadata
  • type_id - used to determine the Parser class.

Parser

This class does the importing. A lot of the work is done by the parent class, based on the metadata for the class. There is a bit of ambiguity currently around when it is set but at this stage both calling setMetaData and init should ensure it is set. The key functions called from outside the class are

Field presentation functions:

  • getAvailableFields() - get a [name => label] array of fields that can be imported to in this import
  • getDataPatterns() - gets an array like [name => dataPattern] (a re-keying of the metadata) for attempting to guess the field if column headers are not used.
  • getHeaderPatterns()- gets an array like [name => dataPattern] (a re-keying of the metadata) for attempting to guess the field if column headers are used.
  • getMappedFieldLabel($mappedField) - gets the label to present on the preview screen from the mapped field. This may include location, phone type.

  • validate - this is called during PostProcess from MapField. It runs through all the fields in the dataSource and runs validation on them. The validation is limited to what can be checked without row-specific database queries. ie. it will validate the option value or date is valid but it will not do a lookup on contact_id. Validate sets the status

  • runImport - imports all rows with a status of 'new', updating the import table with the outcome.

Internally the validate and import functions rely a lot of these functions getMappedRow(array $values) - this receives a row from the DataSource and maps it to and api-ish $params array. It does this by retrieving the saved mapper submitted on the MapField form from the UserJob and passing it through the parser-specific getFieldMappings (for Contact the values are parsed into locations and relationship, the others are mostly generic).

  • getTransformedValue Before returning getMappedRow Each field is parsed through getTransformedValue which validates the value against field metadata. This function explodes serialized values to arrays and converts any option values to their id. Invalid values are set to invalid_import_value. Note that option values are matched in a case-insensitive way with labels, names or ids.
  • getFieldMetadata - gets the metadata for the specific field. Does some extra loading in some cases.

Obsolete methods

  • run
  • summary
  • preview
  • encloseScrub
  • errorFileName

And see https://github.com/civicrm/civicrm-core/pull/23721#issuecomment-1149616759

Thoughts/ todos:

  • Make skipColumnHeaders the default
  • Move fieldSeparator from the DataSource form to the form declared by CRM_Import_Datasource_CSV
  • Why not allow creation of contacts from (e.g) contribution import
  • Add batch size
  • Would be really good to be able to be able to specify defaults (e.g default source ie 'my spreadsheet'). It might be in the csv or not (e.g not have to add a currency column to say 'all of these are AUD')
  • The UserJob type id is currently hard coded as an option list in the UserJob BAO. It includes the ParserClass so the runImport function cannot currently be used with a non-core parser but that feels like it will be needed for the two existing import extensions.