Data Quality

This section describes data quality rules & parameters. They are tightly linked to the schema described in the previous section.

Data quality rules support different levels/stages of data quality attributes:

Text: A human-readable text that describes the quality of the data.
Library : A maintained library of commonly used quality metrics such as rowCount, nullValues, invalidValues, and more.
SQL: An individual SQL query that returns a value that can be compared.
Custom: Quality attributes that are vendor-specific, such as Soda, Great Expectations, dbt tests, dbx, or Montecarlo monitors.

Back to TOC

Text

A human-readable text that describes the quality of the data. Later in the development process, these might be translated into an executable check (such as sql), a library metric, or checked through an AI engine.

quality:
  - id: email_verified_text
    type: text
    description: The email address was verified by the system.

Library

ODCS provides a set of predefined metrics commonly used in data quality checks, designed to be compatible with all major data quality engines. This simplifies the work for data engineers by eliminating the need to manually write SQL queries. This section has been improved in ODCS v3.1.0.

The type for library metrics is library, which can be omitted, if a metric property is defined.

These metrics return a numeric value come with an operator to compare if the metric is valid and in the expected range.

Some metrics require additional parameters, which can be defined in the arguments property.

Example:

properties:
  - name: order_id
    quality:
      - id: order_id_no_nulls
        type: library
        metric: nullValues
        mustBe: 0
        unit: rows
        description: "There must be no null values in the column."

is equal to:

properties:
  - name: order_id
    quality:
      - id: order_id_no_nulls_simplified
        metric: nullValues
        mustBe: 0
        description: "There must be no null values in the column."

Metrics

Metric	Level	Description	Arguments	Arguments Example
`nullValues`	Property	Counts null values in a column/field	None
`missingValues`	Property	Counts values considered as missing (empty strings, N/A, etc.)	`missingValues`: Array of values considered missing	`missingValues: [null, '', 'N/A']`
`invalidValues`	Property	Counts values that don't match valid criteria	`validValues`: Array of valid values `pattern`: Regex pattern	`validValues: ['pounds', 'kg']` `pattern: '^[A-Z]{2}[0-9]{2}...'`
`duplicateValues`	Property	Counts duplicate values in a column	None
`duplicateValues`	Schema	Counts duplicate values across multiple columns	`properties`: Array of property names	`properties: ['tenant_id', 'order_id']`
`rowCount`	Schema	Counts total number of rows in a table/object store	None

Null Values

Check that the count of null values is within range.

properties:
  - name: customer_id
    quality:
    - id: customer_id_no_nulls
      metric: nullValues
      mustBe: 0
      description: "There must be no null values in the column."

Example with percent:

properties:
  - name: order_status
    quality:
    - id: order_status_null_percent
      metric: nullValues
      mustBeLessThan: 1
      unit: percent
      description: "There must be less than 1% null values in the column."

Missing Values

Check that the missing values are within range.

In the argument missingValues, a list of values that are considered to be missing.

properties:
  - name: email_address
    quality:
    - id: email_missing_values
      metric: missingValues
      arguments:
        missingValues: [null, '', 'N/A', 'n/a']
      mustBeLessThan: 100
      unit: rows # rows (default) or percent

Invalid Values

Check that the value is within a defined set or matching a pattern.

properties:
  - name: line_item_unit
    quality:
      - id: line_item_unit_valid_values
        metric: invalidValues
        arguments:
          validValues: ['pounds', 'kg']
        mustBeLessThan: 5
        unit: rows

Using a pattern:

properties:
  - name: iban
    quality:
    - id: iban_pattern_check
      metric: invalidValues
      mustBe: 0
      description: "The value must be an IBAN."
      arguments:
      pattern: '^[A-Z]{2}[0-9]{2}[A-Z0-9]{4}[0-9]{7}([A-Z0-9]?){0,16}$'

Duplicate Values

No more than 10 duplicate names.

properties:
  - name: email_address
    quality:
    - id: email_duplicate_values
      metric: duplicateValues
      mustBeLessThan: 10
      unit: rows
      description: "There must be less than 10 duplicate values in the column."

Duplicates should be less than 1%.

properties:
  - name: phone_number
    quality:
    - id: phone_duplicate_percent
      metric: duplicateValues
      mustBeLessThan: 1
      unit: percent

Row count (Schema-Level)

Calculates the number of rows (usually in a table) and compares it to an absolute operator.

schema:
  - name: orders
    quality:
      - id: orders_row_count
        metric: rowCount
        mustBeBetween: [100, 120]

Duplicates (Schema-Level)

Checks for duplicate rows based on a combination of properties. This is useful for validating compound keys where uniqueness is defined not by a single column but by multiple columns together.

schema:
  - name: orders
    quality:
      - id: orders_unique_tenant_order
        description: The combination of tenant_id and order_id must be unique
        metric: duplicateValues
        mustBe: 0
        arguments:
          properties: # Properties refer to the property in the schema.
            - tenant_id
            - order_id

SQL

A single SQL query that returns either a numeric or boolean value for comparison. The query must be written in the SQL dialect specific to the provided server. {object} and {property} are automatically replaced by the current object (in the case of SQL on a relational database, the table or view name) and the current property name (in the case of SQL on a relational database, the column).

quality:
  - id: sql_count_not_null
    type: sql
    query: |
      SELECT COUNT(*) FROM {object} WHERE {property} IS NOT NULL
    mustBeLessThan: 3600

Custom

Custom rules allow for vendor-specific checks, including tools like Soda, Great Expectations, dbt-tests, Montecarlo, and others. Any format for properties is acceptable, whether it's written in YAML, JSON, XML, or even uuencoded binary. They are an intermediate step before the vendor accepts ODCS natively.

Soda Example

quality:
- id: soda_duplicate_percent
  type: custom
  engine: soda
  implementation: |
        type: duplicate_percent  # Block
        columns:                 # passed as-is
          - carrier              # to the tool
          - shipment_numer       # (Soda in this situation)
        must_be_less_than: 1.0   #

Great Expectation Example

quality:
- id: row_count_btwn_10_50
  type: custom
  engine: greatExpectations
  implementation: |
    type: expect_table_row_count_to_be_between # Block
    kwargs:                                    # passed as-is
      minValue: 10000                          # to the tool
      maxValue: 50000                          # (Great Expectations in this situation)

Scheduling

The data contract can contain scheduling information for executing the rules. You can use schedule and scheduler for those operation. In previous versions of ODCS, the only allowed scheduler was cron and its syntax was scheduleCronExpression.

quality:
  - id: count_less_than_3600
    type: sql
    query: |
      SELECT COUNT(*) FROM {object} WHERE {property} IS NOT NULL
    mustBeLessThan: 3600
    scheduler: cron
    schedule: 0 20 * * *

Definitions

Acronyms:

DQ: data quality.

Key	UX label	Required	Description
quality	Quality	No	Quality tag with all the relevant information for rule setup and execution.
quality.id	ID	No	A unique identifier for the element used to create stable, refactor-safe references. Recommended for elements that will be referenced. See References for more details.
quality.name	Name	No	A short name for the rule.
quality.description	Description	No	Describe the quality check to be completed.
quality.type	Type	No	Type of DQ rule. Valid values are `library` (default), `text`, `sql`, and `custom`.
quality.metric	Metric name	No	Required for `library`: the name of the metric to be calculated and compared.
quality.rule	Rule name	No	Deprecated, use `metric` instead.
quality.arguments	Arguments	No	Additional arguments for the metric, if needed.
quality.\	See below	No	Multiple values are allowed for the property, the value is the one to compare to.
quality.unit	Unit	No	Unit the rule is using, popular values are `rows` or `percent`.
quality.query	SQL Query	No	Required for `sql` DQ rules: the SQL query to be executed. Note that it should match the target SQL engine/database, no transalation service are provided here.
quality.engine	Third-party DQ Engine	No	Required for `custom` DQ rule: name of the third-party engine being used. Any value is authorized here but common values are `soda`, `greatExpectations`, `montecarlo`, etc.
quality.implementation	Third-party Implementation	No	A text (non-parsed) block of code required for the third-party DQ engine to run.
quality.dimension	Dimension	No	The key performance indicator (KPI) or dimension for data quality. Valid values are listed after the table.
quality.method	Method	No	Values are open and include `reconciliation`.
quality.severity	Severity	No	The severity of the DQ rule.
quality.businessImpact	Business Impact	No	Consequences of the rule failure.
quality.customProperties	Custom Properties	No	Additional properties required for rule execution. Follows the same structure as any custom properties block.
quality.tags	Tags	No	Tags. Follows the same structure as any tags property.
quality.authoritativeDefinitions	Authoritative Definitions	No	Authoritative definitions indicate the link to external definition. Follows the same structure as any authoritative definitions block.
quality.scheduler	Scheduler	No	Name of the scheduler, can be `cron` or any tool your organization support.
quality.schedule	Scheduler Configuration	No	Configuration information for the scheduling tool, for `cron` a possible value is `0 20 * * *`.

Valid Values for Dimension

Those data quality dimensions are used for classification and reporting in data quality. Valid values are:

accuracy (synonym ac),
completeness (synonym cp),
conformity (synonym cf),
consistency (synonym cs),
coverage (synonym cv),
timeliness (synonym tm),
uniqueness (synonym uq).

Valid Properties for Operator

The operator specifies the condition to validate a metric or result of a SQL query.

Operator	Expected Value	Math Symbol	Example
`mustBe`	number	`=`	`mustBe: 5`
`mustNotBe`	number	`<>`, `≠`	`mustNotBe: 3.14`
`mustBeGreaterThan`	number	`>`	`mustBeGreaterThan: 59`
`mustBeGreaterOrEqualTo`	number	`>=`, `≥`	`mustBeGreaterOrEqualTo: 60`
`mustBeLessThan`	number	`<`	`mustBeLessThan: 1000`
`mustBeLessOrEqualTo`	number	`<=`, `≤`	`mustBeLessOrEqualTo: 999`
`mustBeBetween`	list of two numbers	`∈`	`mustBeBetween: [0, 100]`
`mustNotBeBetween`	list of two numbers	`∉`	`mustNotBeBetween: [0, 100]`

mustBeBetween is the equivalent to mustBeGreaterThan and mustBeLessThan.

quality:
  - type: sql
    query: |
      SELECT COUNT(*) FROM {table} WHERE {column} IS NOT NULL
    mustBeBetween: [0, 100]

is equivalent to:

quality:
  - type: sql
    query: |
      SELECT COUNT(*) FROM {table} WHERE {column} IS NOT NULL
    mustBeGreaterThan: 0
    mustBeLessThan: 100

Back to TOC