Table 1 Data quality indicators

From: Review of data quality indicators and metrics, and suggestions for indicators and metrics for structural health monitoring

Intrinsic

| Indicator | Definition | Reference |
| --- | --- | --- |
| Correctness | The free-of-error dimension represents data correctness | (Pipino et al. 2002) |
| Accuracy | Conformity with the real world | (Parssian et al. 2004) |
| | Indicates the percentage of cells in a dataset that have correct values according to the domain and the type of information of the dataset | (Vetrò et al. 2016) |
| | Indicates the ratio between the error in aggregation and the scale of data representation | (Vetrò et al. 2016) |
| | Data are correct, reliable, and certified free of error | (Wand and Wang 1996); (Färber et al. 2017) |
| | The degree of closeness of a datum value v to some value v′ in the attribute domain that is considered correct for the entity e and the attribute a | (Fox et al. 1994) |
| | How accurate do our data need to be? | (Herzog et al. 2007) |
| | The recorded value is in conformity with the actual value | (Ballou and Pazer 1985) |
| | The value of the data is close to some value considered correct | (Fox et al. 1994) |
| | Is correct according to a reference value | (Rodríguez and Servigne 2013) |
| | A value v is close to a value v′ considered the correct representation of the real-world phenomenon v aims to represent | (Batini and Scannapieco 2006) |
| Precision | Data can represent a small quantity (resolution) | (Fox et al. 1994) |
| Veracity | Refers to the accuracy of the data, and relates to the vernacular garbage-in, garbage-out description | (NIST 2019) |
| Trustworthiness | Collective term for believability, reputation, objectivity, and verifiability | (Färber et al. 2017); (Wang and Strong 1996); (Naumann 2002) |
| | The information is accepted to be correct, true, real, and credible | (Zaveri et al. 2016) |
| Consistency | Used when two or more values in a database are required to agree in some way | (Date 1983) |
| | Data is consistent if it satisfies all the constraints in the set | (Elmasri and Navathe 1989) |
| | Data has attributes that are free from contradiction and are coherent with other data in a specific context of use | (ISO/IEC 25012 2008) |
| | Two or more values do not conflict with each other, i.e., the data is free of conflicting information | (Bizer 2007) |
| | Two or more values [in a dataset] do not conflict with each other | (Färber et al. 2017) |
| | The representation of the data value is the same in all cases | (Ballou and Pazer 1985) |
| | Data are always presented in the same format and are compatible with previous data | (Wang and Strong 1996) |
| | Data satisfy specified constraints | (Fox et al. 1994) |
| | Data is free from internal contradiction with regard to a rule | (Heinrich et al. 2018) |
| | Consistency at the schema level means that the schema of a dataset should be free of contradictions; consistency at the data level focuses on the degree to which the format and the value of the data conform to the predefined schema of a given dataset | (Behkamal et al. 2014) |
| Compliance | Indicates the percentage of standardized columns in a dataset | (Vetrò et al. 2016) |
| | Indicates the degree to which a dataset follows the e-GMS standard | (Vetrò et al. 2016) |
| | Indicates the level of the 5-star Open Data model at which the dataset sits and the advantage this level offers | (Vetrò et al. 2016) |
| Uniqueness | Data is free of redundancies, in breadth, depth, and scope | (Behkamal et al. 2014); (Fürber and Hepp 2010) |
| Uniqueness in breadth | The ontology is free of redundancies regarding its represented classes and properties | (Fürber and Hepp 2010) |
| Uniqueness in scope | A knowledge base has multiple different instances to represent the same object | (Fürber and Hepp 2010) |
| Uniqueness in depth | Values of a property are unique | (Fürber and Hepp 2010) |
| Duplication | The dataset contains distinct values for the same attribute of the same entity | (Fox et al. 1994) |
| Free of error | Data is correct and reliable; synonym of accuracy | (Färber et al. 2017); (Gitzel 2016) |
| | Data is correct and reliable | (Pipino et al. 2002) |
| Reliability | The data is accurate | (Heinrich et al. 2018) |
| | The data is correct and has integrity | (Brodie 1980) |
| Integrity | Synonym of accuracy, correctness, security, and concurrency control | (Brodie 1980) |
| Variety of data and data sources | Data are available from several different data sources | (Pipino et al. 2002); (Wang and Strong 1996) |
| Redundancy | Multiple sets of the same data are available | (Jianwen and Feng 2015) |
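Several of the intrinsic indicators listed above are directly computable, for instance the cell-level accuracy of Vetrò et al. (2016), i.e., the percentage of cells whose values are valid for the domain of their column. The following Python sketch is not taken from the reviewed papers; the record fields and domain rules are hypothetical and only illustrate the idea.

```python
# Illustrative sketch (not from the reviewed papers): cell-level accuracy in the
# spirit of Vetro et al. (2016) -- the share of cells whose values fall inside
# the declared domain of their column. Column names and domain rules below are
# hypothetical monitoring-data examples.

from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

def cell_accuracy(rows: List[Row], domains: Dict[str, Callable[[Any], bool]]) -> float:
    """Return the fraction of checked cells whose value satisfies its domain rule."""
    checked = correct = 0
    for row in rows:
        for column, is_valid in domains.items():
            checked += 1
            try:
                if is_valid(row.get(column)):
                    correct += 1
            except Exception:
                pass  # a rule that cannot be evaluated counts the cell as incorrect
    return correct / checked if checked else 1.0

# Hypothetical sensor records and domain rules
records = [
    {"sensor_id": "S01", "strain_microstrain": 120.5, "temperature_c": 18.2},
    {"sensor_id": "S01", "strain_microstrain": None,  "temperature_c": 95.0},
]
rules = {
    "strain_microstrain": lambda v: v is not None and -2000.0 <= v <= 2000.0,
    "temperature_c": lambda v: v is not None and -40.0 <= v <= 60.0,
}

print(f"cell-level accuracy: {cell_accuracy(records, rules):.2f}")  # 0.50
```

The same structure can be reused for constraint-based consistency definitions (Fox et al. 1994; Heinrich et al. 2018) by swapping the per-column domain rules for cross-column rules.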

Representational

| Indicator | Definition | Reference |
| --- | --- | --- |
| Representational consistency | Data are always presented in the same format and are compatible with previous data | (Färber et al. 2017); (Pipino et al. 2002); (Wang and Strong 1996) |
| Understandability | Indicates the percentage of columns in a dataset that are represented in a format that can be easily understood by users and is also machine-readable | (Vetrò et al. 2016) |
| | Indicates the percentage of columns in a dataset that have associated descriptive metadata | (Vetrò et al. 2016) |
| | Data is easily comprehended | (Pipino et al. 2002) |
| Ease of understanding | Data are clear, without ambiguity, and easily comprehended | (Färber et al. 2017); (Wang and Strong 1996) |
| Interoperability | A dimension that includes the aspects of interpretability, representational consistency, and concise representation | (Färber et al. 2017) |
| Concise | Data are compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point) | (Wang and Strong 1996) |
| Concise representation | Data are compactly represented without being overwhelming | (Färber et al. 2017); (Pipino et al. 2002) |
| Representational consistency | Data are always presented in the same format and are compatible with previous data | (Wang and Strong 1996) |
| Consistent representation | Data are always presented in the same format | (Pipino et al. 2002) |
| Interpretability | Data are in an appropriate language and units and the data definitions are clear | (Färber et al. 2017); (Pipino et al. 2002); (Wang and Strong 1996) |
| Ease of operation | Data are easily managed and manipulated (i.e., updated, moved, aggregated, reproduced, customized) | (Wang and Strong 1996) |
| Ease of manipulation | Data is easy to manipulate and apply to different tasks | (Pipino et al. 2002) |
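Of the representational indicators above, the metadata-based understandability metric of Vetrò et al. (2016) lends itself to a simple computation: the percentage of columns with a descriptive annotation. The sketch below is illustrative only; the column descriptions are hypothetical.

```python
# Illustrative sketch (not from the reviewed papers): understandability in the
# spirit of Vetro et al. (2016) -- the percentage of columns carrying descriptive
# metadata. The column descriptions below are hypothetical.

from typing import Dict, Optional

def understandability(column_descriptions: Dict[str, Optional[str]]) -> float:
    """Fraction of columns that have a non-empty textual description."""
    if not column_descriptions:
        return 1.0
    documented = sum(1 for text in column_descriptions.values() if text and text.strip())
    return documented / len(column_descriptions)

# Hypothetical metadata for a monitoring dataset
descriptions = {
    "sensor_id": "Unique identifier of the strain gauge",
    "strain_microstrain": "Measured strain, in microstrain",
    "temperature_c": None,  # undocumented column
}
print(f"understandability: {understandability(descriptions):.2f}")  # 0.67
```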

Accessibility

| Indicator | Definition | Reference |
| --- | --- | --- |
| Accessibility | Data are available or easily and quickly retrievable | (Pipino et al. 2002); (Färber et al. 2017); (Wang and Strong 1996) |
| Access security | Access to data can be restricted and hence kept secure | (Wang and Strong 1996) |
| Security | Access to data is restricted appropriately to maintain its security | (Pipino et al. 2002) |
| Believability | Data are accepted or regarded as true, real, and credible | (Färber et al. 2017) |
| | Data is regarded as true, real, and credible | (Pipino et al. 2002) |
| Reputation | Data are trusted or highly regarded in terms of their source or content | (Wang and Strong 1996); (Pipino et al. 2002) |
| Objectivity | Data is unbiased, unprejudiced, and impartial | (Pipino et al. 2002); (Wang and Strong 1996) |
| Traceability | Indicates the presence or absence of metadata associated with the process of creation of a dataset | (Vetrò et al. 2016) |
| | Data are well documented, verifiable, and easily attributed to a source | (Pipino et al. 2002); (Wang and Strong 1996) |
| Availability | Data are accessible for an intended use | (Rodríguez and Servigne 2013) |
| License | The granting of permission for a consumer to re-use a dataset under defined conditions | (Färber et al. 2017) |
| Interlinking | Entities that represent the same concept are linked to each other, be it within or between two or more data sources | (Färber et al. 2017) |
| Comparability | Data fields present within the databases allow individuals to be easily linked across the databases | (Herzog et al. 2007) |

Contextual

| Indicator | Definition | Reference |
| --- | --- | --- |
| Timeliness | The recorded value is not out of date | (Heinrich and Klier 2009); (Ballou et al. 1985); (Ballou et al. 1998) |
| | The age of the data is appropriate for the task at hand | (Heinrich and Klier 2009); (Wang et al. 1996); (Färber et al. 2017) |
| | The property that the attributes or tuples of a data product correspond to the current state of the discourse world, i.e., they are not outdated | (Heinrich and Klier 2009); (Hinrichs 2002) |
| | Expresses how current data are for the task at hand | (Heinrich and Klier 2009); (Batini et al. 2006) |
| | Interpreted as the probability that an attribute value is still up-to-date | (Heinrich and Klier 2009); (Heinrich et al. 2007b); (Heinrich et al. 2007a) |
| | The availability of information for decision making | (Fox et al. 1994); (Kleijnen 1980) |
| | Expresses how "current" the information needs to be to predict which subsets of customers are more likely to purchase certain products | (Herzog et al. 2007) |
| | Data is sufficiently up-to-date for the task at hand | (Pipino et al. 2002) |
| | The quantization of data transmission delay | (Jianwen and Feng 2015) |
| | Data is available for decision making | (Fox et al. 1994) |
| Age | Defined as a function of the processing delay necessary to generate and deliver information, and the reporting interval used in the system | (Fox et al. 1994); (Davis and Olson) |
| Currentness | Indicates the percentage of rows in a dataset that have current values and no values referring to a previous or a following period of time | (Vetrò et al. 2016) |
| | Indicates the ratio between the delay in publication (number of days between the moment in which the information is available and the publication of the dataset) and the period of time covered by the dataset (week, month, year) | (Vetrò et al. 2016) |
| | Data is correct at the time of evaluation | (Fox et al. 1994) |
| | The data value is up-to-date | (Fox et al. 1994) |
| | Data is current or updated | (Rodríguez and Servigne 2013) |
| | The value corresponds to its real-world counterpart | (Hinrichs 2002) |
| Expiration | Indicates the ratio between the delay in the publication of a dataset after the expiration of its previous version and the period of time covered by the dataset (week, month, year) | (Vetrò et al. 2016) |
| Currency | The age of the data is appropriate to its use | (Heinrich and Klier 2009); (Price et al. 2005) |
| Completeness | The availability of all relevant data to satisfy the user requirement | (Gardyn 1997); (Parssian et al. 2004) |
| | At the schema level, completeness means that all of the required classes and properties are represented; at the data level, it refers to missing values of properties with respect to the schema | (Behkamal et al. 2013) |
| | The data collection has values for all attributes of all entities that are supposed to have values | (Fox et al. 1994) |
| | All values for a certain variable are recorded | (Ballou and Pazer 1985) |
| | Data are of sufficient breadth, depth, and scope for the task at hand | (Wang and Strong 1996); (Färber et al. 2017); (Pipino et al. 2002) |
| | Data is not missing and is of sufficient breadth and depth for the task at hand | (Pipino et al. 2002) |
| | Indicates the percentage of complete cells in a dataset | (Vetrò et al. 2016) |
| | Indicates the percentage of complete rows in a dataset | (Vetrò et al. 2016) |
| | No records are missing and no records have missing data elements | (Herzog et al. 2007) |
| Appropriate amount of data | The quantity or volume of available data is appropriate for the task at hand | (Pipino et al. 2002) |
| | The quantity or volume of available data is appropriate | (Wang and Strong 1996) |
| Relevancy | Data are applicable and helpful for the task at hand | (Pipino et al. 2002); (Färber et al. 2017); (Wang and Strong 1996) |
| | The data meet the basic needs for which they were collected, placed in a database, and used; data can be used for additional and several different purposes | (Herzog et al. 2007) |
| Value-added | Data are beneficial and provide advantages from their use | (Wang and Strong 1996); (Pipino et al. 2002) |
| Cost-effectiveness | The cost of collecting appropriate data is reasonable | (Pipino et al. 2002); (Wang and Strong 1996) |
| Flexibility | Data are expandable, adaptable, and easily applied to other needs | (Pipino et al. 2002); (Wang and Strong 1996) |
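Two of the contextual indicators above are commonly operationalised: completeness as the percentage of complete cells or rows (Vetrò et al. 2016), and timeliness as the probability that a value is still up-to-date (Heinrich et al.; Heinrich and Klier). The sketch below is not taken from the reviewed papers; the records are hypothetical, and the exponential form with a fixed decline rate is only one possible formalisation of the probabilistic timeliness interpretation, with the rate treated as an assumed parameter.

```python
# Illustrative sketch (not from the reviewed papers): completeness as the share
# of complete cells / rows (in the spirit of Vetro et al. 2016) and timeliness
# as the probability that a value of a given age is still up-to-date, sketched
# here as an exponential decline with an assumed rate.

import math
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]

def cell_completeness(rows: List[Row]) -> float:
    """Fraction of cells that are not missing (None)."""
    cells = [value for row in rows for value in row.values()]
    return sum(v is not None for v in cells) / len(cells) if cells else 1.0

def row_completeness(rows: List[Row]) -> float:
    """Fraction of rows with no missing cell."""
    if not rows:
        return 1.0
    return sum(all(v is not None for v in row.values()) for row in rows) / len(rows)

def timeliness(age_in_days: float, decline_rate_per_day: float) -> float:
    """Probability that a value of this age is still up-to-date, assuming an
    exponentially distributed shelf life with the given (assumed) decline rate."""
    return math.exp(-decline_rate_per_day * age_in_days)

# Hypothetical monitoring records
records = [
    {"sensor_id": "S01", "strain_microstrain": 120.5, "temperature_c": 18.2},
    {"sensor_id": "S02", "strain_microstrain": None,  "temperature_c": 17.9},
]
print(f"cell completeness: {cell_completeness(records):.2f}")  # 0.83
print(f"row completeness:  {row_completeness(records):.2f}")   # 0.50
print(f"timeliness (30 d): {timeliness(30.0, 0.01):.2f}")       # ~0.74
```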