Data quality criteria and indicators

Quality criteria can be used to describe and assess the quality of data sets. The criteria are part of the Data Quality Framework. They help users evaluate whether a data set is of sufficient quality for its intended purpose. In the long term, quality criteria support the improvement of the quality of data sets and data repositories.

Quality criteria are intended to be a flexible tool: not all criteria, and in particular not all indicators, are relevant in every situation or for every data set. It is also important to note that the intended use determines what level of quality is sufficient under each criterion. Quality criteria, and especially their indicators, focus on structured data.

The quality criteria and their indicators form a hierarchical structure, but they are also interrelated. Improving quality with respect to one criterion can even weaken quality with respect to another. For example, if the goal is to achieve comprehensive coverage or particularly high accuracy of attribute data in a data set, data timeliness generally declines.

The data quality criteria and indicators (proposal for recommendation) are described in the attached file. A summary of the content is also provided below.

Summary of the proposed data quality criteria

The data quality criteria are organized into three categories from the perspective of the information user:

  1. How well does information describe reality?
  2. How has the information been described?
  3. How can I use the information?

How well does information describe reality?

Correctness

Synonym: accuracy

Description: Correctness describes how the data in the dataset correspond to reality. It also helps to identify systematic distortions in the dataset.

Example: The data leading to an operational decision represent the best understanding of the accurate data. For example, the data are considered accurate when the salary declared for tax purposes corresponds to the salary paid.

Indicators: methodically produced values, incorrect values, misclassification

Accuracy

Synonym: unbiasedness

Description: Accuracy describes how well the data in the dataset correspond to what is being sought. It describes how well the data hit the mark.

Examples: Accuracy describes the dispersion of indicator values, the proportion of outliers in the dataset, the accuracy of the classification and the scale of measurement (e.g. decimals, time, coordinates).

Indicators: standard deviation, outliers
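
The two indicators above can be computed directly from a numeric characteristic. The following is a minimal sketch, not a prescribed method: the function name and the sample values are hypothetical, and the 1.5 x IQR rule used to flag outliers is an assumption, since the criteria do not define what counts as an outlier.

```python
import statistics


def dispersion_and_outlier_share(values, k=1.5):
    """Return (sample standard deviation, share of outliers) for numeric values.

    Assumption: a value is an outlier if it lies more than k * IQR
    below the first quartile or above the third quartile.
    """
    stdev = statistics.stdev(values)
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return stdev, len(outliers) / len(values)


# Hypothetical measurements with one suspicious value (50).
stdev, outlier_share = dispersion_and_outlier_share([10, 11, 12, 11, 10, 12, 11, 50])
```

A different outlier rule (for example, a fixed threshold agreed for the dataset) can be substituted without changing the structure of the check.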

Consistency

Synonyms: regularity, logical integrity of data

Description: Consistency indicates that the data are coherent and free of contradictions. The criterion can also be used to describe the consistency between different datasets.

Examples: An inconsistency exists when, for example, a residential building contains no dwellings or a person’s date of marriage is earlier than their date of birth. Data consistency can be checked by means of validation/qualification rules.

Indicators: logic of data reviewed
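
The two inconsistencies mentioned in the examples can be expressed as simple validation rules. This is a sketch under stated assumptions: the field names (date_of_birth, date_of_marriage, building_type, dwelling_count) and the function names are hypothetical and would need to match the actual data model.

```python
from datetime import date


def check_person(record):
    """Return a list of rule violations for one person record."""
    errors = []
    born = record.get("date_of_birth")
    married = record.get("date_of_marriage")
    # Rule: a marriage cannot precede the person's birth.
    if born is not None and married is not None and married < born:
        errors.append("date_of_marriage is earlier than date_of_birth")
    return errors


def check_building(record):
    """Return a list of rule violations for one building record."""
    errors = []
    # Rule: a residential building should contain at least one dwelling.
    if record.get("building_type") == "residential" and not record.get("dwelling_count"):
        errors.append("residential building has no dwellings")
    return errors


def inconsistency_rate(records, rule):
    """Share of records that violate the given rule function."""
    if not records:
        return 0.0
    failed = sum(1 for r in records if rule(r))
    return failed / len(records)


# Hypothetical records: the second one violates the marriage rule.
persons = [
    {"date_of_birth": date(1980, 5, 1), "date_of_marriage": date(2005, 6, 15)},
    {"date_of_birth": date(1980, 5, 1), "date_of_marriage": date(1975, 1, 1)},
]
rate = inconsistency_rate(persons, check_person)
```

Keeping each rule as a separate function makes it possible to report which rule failed, not only how many records failed.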

Currentness

Description: Currentness describes the timeframe of the data in the dataset. The closer the data baseline period is to the present, the more current the data are. The baseline period is the point in time to which the data apply.

Examples: The baseline period associated with the dataset is provided with the data. It can be used to determine the freshness of the data. The baseline period can be the period between the beginning and the end of the year or a particular day, for example. In data production, it is also important to check the data review and change periods.

Indicators: baseline period, creation period, review period, change period
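
As a rough illustration of how the baseline period can be used to determine freshness, the sketch below computes the age of the data in days. The function name is hypothetical, and treating the end of the baseline period as the reference point is an assumption; for a baseline that is a single day, that day is used as is.

```python
from datetime import date


def data_age_days(baseline_end, today=None):
    """Days elapsed between the end of the baseline period and today.

    Assumption: freshness is measured from the END of the baseline period.
    """
    today = today or date.today()
    return (today - baseline_end).days


# Hypothetical dataset whose baseline period ended on 31 December 2023,
# assessed on 31 January 2024.
age = data_age_days(date(2023, 12, 31), today=date(2024, 1, 31))
```

Whether a given age is acceptable depends on the intended use of the dataset.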

Completeness

Synonym: coverage

Description: Completeness describes the temporal and regional target coverage of the data, as well as the coverage of the target units and their characteristics. It also indicates the degree to which the dataset contains the desired data.

Examples: The dataset covers all units in a defined area, e.g. all enterprises in Finland. Regional coverage indicates whether all the target regions are included in the dataset (e.g. all Finnish municipalities), and whether the dataset also covers Åland. Over-coverage indicates that the dataset includes units that do not belong in it. Under-coverage indicates that units that belong in the dataset are missing. Non-response is also included in under-coverage. Completeness also indicates whether the dataset contains all the characteristics specified for the target units, for example, the population and area of each Finnish municipality in the dataset, or whether address or turnover data have been provided for all enterprises in the dataset.

Indicators: temporal target coverage, regional target coverage, target units, shortcomings in characteristics, missing units, additional units, incomplete units, incomplete characteristics
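
Over-coverage, under-coverage and incomplete characteristics can be sketched as set comparisons and null counts. The example below assumes that a reference list of target unit identifiers is available; the identifiers, field names and function names are hypothetical.

```python
def coverage(target_ids, dataset_ids):
    """Compare the units in the dataset against the target units."""
    target, actual = set(target_ids), set(dataset_ids)
    missing = target - actual      # under-coverage: target units not in the dataset
    additional = actual - target   # over-coverage: dataset units outside the target
    return {
        "missing_units": sorted(missing),
        "additional_units": sorted(additional),
        "under_coverage_rate": len(missing) / len(target) if target else 0.0,
        "over_coverage_rate": len(additional) / len(actual) if actual else 0.0,
    }


def incomplete_characteristics(records, required):
    """Count, per required characteristic, the records with no value."""
    return {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required
    }


# Hypothetical enterprise identifiers: C and D are missing, E does not belong.
result = coverage(["A", "B", "C", "D"], ["A", "B", "E"])

# Hypothetical enterprise records with gaps in address and turnover.
gaps = incomplete_characteristics(
    [{"address": "Main St 1", "turnover": None}, {"address": "", "turnover": 5000}],
    required=["address", "turnover"],
)
```

The same comparison can be applied to regions (for example, municipality codes) instead of unit identifiers.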

How has the information been described?

Traceability

Synonym: non-repudiation

Description: Traceability indicates that changes made to the dataset and its data can be traced. The origin of the data is known.

Examples: The origin of the data and the change history are described, and time stamps of the changes are available. The data can be shown to be indisputable, and the data in the dataset can be verified.

Indicators: data source, data lifecycle, change management

Understandability

Synonyms: interpretability, comprehensibility

Description: Understandability describes the degree to which a dataset contains metadata that help users understand the data being used.

Examples: The dataset and data characteristics are described in the metadata descriptions at a sufficient level to facilitate understanding of the data content and its significance. The code lists used for the data characteristics have been recorded and are consistent with the data. The descriptions of the code lists are available e.g. via links. Essential concepts are described, and links to the necessary glossaries are included in the metadata descriptions.

Indicators: dataset descriptions, definitions of concepts, data descriptions of characteristics, customer feedback on comprehensibility

Compliance

Synonyms: compatibility, semantic conformity, conformity

Description: Compliance indicates that the dataset and its characteristics comply with known standards, practices and regulations, and that they are specified in the dataset description.

Examples: National conformity can be supported by using uniform national terminology and code lists when planning datasets. International conformity can be supported by using, for example, standard classifications adopted by the EU and ISO language codes.

Indicators: regulations and standards to be complied with

How can I use the information?

Portability

Description: Portability describes whether the dataset is structured so that the data can be processed in an automated manner and in different information systems.

Examples: The dataset is in a structured format (e.g. .csv, .json or .xml). The structure of the dataset is described by using a schema, for example.

Indicators: the dataset data model, permanent identifier of the target unit, customer feedback on portability
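
One simple, automated portability check is to confirm that a structured file exposes the columns its described structure promises. The sketch below does this for CSV text using only the standard library; the column names and the function name are hypothetical, and a real schema would typically also describe data types and constraints.

```python
import csv
import io


def missing_columns(csv_text, expected_columns):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [c for c in expected_columns if c not in header]


# Hypothetical dataset: the 'area' column promised by the description is absent.
sample = "municipality_code,name,population\n091,Helsinki,650000\n"
absent = missing_columns(sample, ["municipality_code", "name", "population", "area"])
```

An empty result means the header matches the expected structure; it says nothing yet about the validity of the values.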

User rights

Description: User rights describe the rights that users have to the data and how the data can be used (i.e. for what purposes).

Examples: A dataset is available for scientific research, subject to certain restrictions. Open data are licensed.

Indicators: access rights, restrictions on use

Punctuality

Synonym: timeliness

Description: Punctuality means that the dataset is released at the indicated time and updated with sufficient frequency to reflect changes in the dataset.

Examples: The time and frequency of publication.

Indicators: compliance with due dates, frequency of updates, values changed in the update
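
Compliance with due dates can be summarized as the share of releases published on or before their announced date. This is a minimal sketch: the function name and the release dates are hypothetical, and counting a release on the scheduled day as on time is an assumption.

```python
from datetime import date


def on_time_share(releases):
    """Share of releases published on or before the scheduled date.

    releases: list of (scheduled_date, actual_date) pairs.
    Returns None when there are no releases to assess.
    """
    if not releases:
        return None
    on_time = sum(1 for scheduled, actual in releases if actual <= scheduled)
    return on_time / len(releases)


# Hypothetical quarterly releases: the third one was two days late.
releases = [
    (date(2024, 1, 15), date(2024, 1, 15)),
    (date(2024, 4, 15), date(2024, 4, 12)),
    (date(2024, 7, 15), date(2024, 7, 17)),
    (date(2024, 10, 15), date(2024, 10, 15)),
]
share = on_time_share(releases)
```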