Incremental models

Incremental models are built as tables in your data warehouse. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table (the table that has already been built).

Often, the rows you filter for on an incremental run will be the rows in your source data that have been created or updated since the last time dbt ran. As such, on each dbt run, your model gets built incrementally.

Using an incremental model limits the amount of data that needs to be transformed, vastly reducing the runtime of your transformations. This improves warehouse performance and reduces compute costs.

Using incremental materializations

Like the other materializations built into dbt, incremental models are defined with select statements, with the materialization defined in a config block.

{{
    config(
        materialized='incremental'
    )
}}

select ...

To use incremental models, you also need to tell dbt:

  • How to filter the rows on an incremental run
  • The unique key of the model (if any)

Filtering rows on an incremental run

To tell dbt which rows it should transform on an incremental run, wrap valid SQL that filters for these rows in the is_incremental() macro.

Often, you'll want to filter for "new" rows, as in, rows that have been created since the last time dbt ran this model. The best way to find the timestamp of the most recent run of this model is by checking the most recent timestamp in your target table. dbt makes it easy to query your target table by using the "{{ this }}" variable.

Also common is wanting to capture both new and updated records. For updated records, you'll need to define a unique key to ensure you don't bring in modified records as duplicates. Your is_incremental() code will check for rows created or modified since the last time dbt ran this model.
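A minimal sketch of such a filter, assuming your source data carries an updated_at timestamp (a hypothetical column name) that is set whenever a record is created or modified, might look like this:

{{
    config(
        materialized='incremental',
        unique_key='id'
    )
}}

-- 'id', 'updated_at', and raw_app_data.users are assumed names for illustration
select * from raw_app_data.users

{% if is_incremental() %}

  -- captures rows created or modified since the last run of this model
  where updated_at > (select max(updated_at) from {{ this }})

{% endif %}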

For example, a model that includes a computationally slow transformation on a column can be built incrementally, as follows:

models/stg_events.sql
{{
    config(
        materialized='incremental'
    )
}}

select
    *,
    my_slow_function(my_column)

from raw_app_data.events

{% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  -- (uses > to include records whose timestamp occurred since the last run of this model)
  where event_time > (select max(event_time) from {{ this }})

{% endif %}

Optimizing your incremental model

For more complex incremental models that make use of Common Table Expressions (CTEs), you should consider the impact of the position of the is_incremental() macro on query performance. In some warehouses, filtering your records early can vastly improve the run time of your query!
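As an illustration, the stg_events example above could be restructured so that the incremental filter is applied in the first CTE, meaning later CTEs only ever see new rows (the config block is unchanged and omitted here for brevity):

with new_events as (

    select * from raw_app_data.events

    {% if is_incremental() %}
    -- filtering as early as possible keeps downstream CTEs small
    where event_time > (select max(event_time) from {{ this }})
    {% endif %}

),

transformed as (

    select
        *,
        my_slow_function(my_column)
    from new_events

)

select * from transformed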

Defining a unique key (optional)

A unique_key enables updating existing rows instead of just appending new rows. If new information arrives for an existing unique_key, that new information can replace the current information instead of being appended to the table. If a duplicate row arrives, it can be ignored. Refer to strategy specific configs for more options on managing this update behavior, like choosing only specific columns to update.

Not specifying a unique_key will result in append-only behavior, which means dbt inserts all rows returned by the model's SQL into the preexisting target table without regard for whether the rows represent duplicates.

The optional unique_key parameter specifies a field (or combination of fields) that define the grain of your model. That is, the field(s) identify a single unique row. You can define unique_key in a configuration block at the top of your model, and it can be a single column name or a list of column names.

The unique_key should be supplied in your model definition as a string representing a single column, or as a list of single-quoted column names that can be used together (for example, ['col1', 'col2', …]). Columns used in this way should not contain any nulls, or the incremental model run may fail. Either ensure that each column has no nulls (for example with coalesce(COLUMN_NAME, 'VALUE_IF_NULL')), or define a single-column surrogate key (for example with dbt_utils.generate_surrogate_key).

tip

In cases where you need multiple columns in combination to uniquely identify each row, we recommend you pass these columns as a list (unique_key = ['user_id', 'session_number']), rather than a string expression (unique_key = 'concat(user_id, session_number)').

By using the first syntax, which is more universal, dbt can ensure that the columns will be templated into your incremental model materialization in a way that's appropriate to your database.

When you pass a list in this way, please ensure that each column does not contain any nulls, or the incremental model run may fail.

Alternatively, you can define a single-column surrogate key, for example with dbt_utils.generate_surrogate_key.
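For instance, a sketch of a list-based unique_key using the column names from the tip above (the source table and the surrogate-key alternative in the comments are illustrative only):

{{
    config(
        materialized='incremental',
        unique_key=['user_id', 'session_number']
    )
}}

-- alternatively, build a single column with dbt_utils.generate_surrogate_key(['user_id', 'session_number'])
-- and set unique_key to that column instead

select
    user_id,
    session_number,
    ...

from raw_app_data.sessions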

When you define a unique_key, you'll see this behavior for each row of "new" data returned by your dbt model:

  • If the same unique_key is present in the "new" and "old" model data, dbt will update/replace the old row with the new row of data. The exact mechanics of how that update/replace takes place will vary depending on your database, incremental strategy, and strategy specific configs.
  • If the unique_key is not present in the "old" data, dbt will insert the entire row into the table.

Please note that if there's a unique_key with more than one row in either the existing target table or the new incremental rows, the incremental model may fail depending on your database and incremental strategy. If you're having issues running an incremental model, it's a good idea to double check that the unique key is truly unique in both your existing database table and your new incremental rows. You can learn more about surrogate keys here.
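If you are debugging a failure, an ad-hoc query such as the following, run directly against your warehouse, will surface any duplicated key values in the existing table (analytics.fct_daily_active_users and date_day are placeholder names taken from the example below; substitute your own):

-- returns key values that appear more than once in the existing target table
select
    date_day,
    count(*) as row_count
from analytics.fct_daily_active_users
group by 1
having count(*) > 1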

info

While common incremental strategies, such as delete+insert and merge, might use unique_key, others don't. For example, the insert_overwrite strategy does not use unique_key, because it operates on partitions of data rather than individual rows. For more information, see About incremental_strategy.

unique_key example

Consider a model that calculates the number of daily active users (DAUs), based on an event stream. As source data arrives, you will want to recalculate the number of DAUs for both the day that dbt last ran, and any days since then. The model would look as follows:

models/staging/fct_daily_active_users.sql
{{
    config(
        materialized='incremental',
        unique_key='date_day'
    )
}}

select
    date_trunc('day', event_at) as date_day,
    count(distinct user_id) as daily_active_users

from raw_app_data.events

{% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  -- (uses >= to include records arriving later on the same day as the last run of this model)
  where date_day >= (select max(date_day) from {{ this }})

{% endif %}

group by 1

Building this model incrementally without the unique_key parameter would result in multiple rows in the target table for a single day – one row for each time dbt runs on that day. With the unique_key parameter, the existing row for that day is updated instead.

How do I rebuild an incremental model?

If your incremental model logic has changed, the transformations on your new rows of data may diverge from the historical transformations, which are stored in your target table. In this case, you should rebuild your incremental model.

To force dbt to rebuild the entire incremental model from scratch, use the --full-refresh flag on the command line. This flag will cause dbt to drop the existing target table in the database before rebuilding it for all-time.

$ dbt run --full-refresh --select my_incremental_model+

It's also advisable to rebuild any downstream models, as indicated by the trailing +.

For detailed usage instructions, check out the dbt run documentation.

Understanding incremental models

When should I use an incremental model?

Building models as tables in your data warehouse is often preferred for better query performance. However, using table materialization can be computationally intensive, especially when:

  • Source data has millions or billions of rows.
  • Data transformations on the source data are computationally expensive (take a long time to execute) and complex, like using Regex or UDFs.

Incremental models offer a balance: they are more complex than view and table materializations, but they can substantially improve the performance of your dbt runs.

In addition to these considerations for incremental models, it's important to understand their limitations and challenges, particularly with large datasets. For more insights into efficient strategies, performance considerations, and the handling of late-arriving data in incremental models, refer to the On the Limits of Incrementality Discourse discussion.

Understanding the is_incremental() macro

The is_incremental() macro will return True if all of the following conditions are met:

  • The destination table already exists in the database
  • dbt is not running in full-refresh mode
  • The running model is configured with materialized='incremental'

Note that the SQL in your model needs to be valid whether is_incremental() evaluates to True or False.
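To make that concrete, this is roughly how the stg_events example above renders in each case, assuming {{ this }} resolves to a relation such as analytics.stg_events (a hypothetical schema and table name):

-- first run, or a --full-refresh run (is_incremental() is False): no filter is rendered
select *, my_slow_function(my_column)
from raw_app_data.events

-- subsequent incremental runs (is_incremental() is True): the filter is included
select *, my_slow_function(my_column)
from raw_app_data.events
where event_time > (select max(event_time) from analytics.stg_events)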

How do incremental models work behind the scenes?

dbt's incremental materialization works differently on different databases. Where supported, a merge statement is used to insert new records and update existing records.

On warehouses that do not support merge statements, a merge is implemented by first using a delete statement to delete records in the target table that are to be updated, and then an insert statement.
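As a rough illustration only (not the exact SQL dbt generates), a delete+insert for the daily active users example might look like the following, assuming the new rows have first been staged in a temporary table (named here for illustration):

-- remove existing rows that are about to be replaced
delete from analytics.fct_daily_active_users
where date_day in (select date_day from fct_daily_active_users__tmp);

-- insert the new and updated rows
insert into analytics.fct_daily_active_users (date_day, daily_active_users)
select date_day, daily_active_users
from fct_daily_active_users__tmp;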

Transaction management, a process used in certain data platforms, ensures that a set of actions is treated as a single unit of work (or task). If any part of the unit of work fails, dbt will roll back open transactions and restore the database to a good state.

What if the columns of my incremental model change?

Incremental models can be configured to include an optional on_schema_change parameter to enable additional control when incremental model columns change. These options enable dbt to continue running incremental models in the presence of schema changes, resulting in fewer --full-refresh scenarios and saving query costs.

You can configure the on_schema_change setting as follows.

dbt_project.yml
models:
  +on_schema_change: "sync_all_columns"

models/staging/fct_daily_active_users.sql
{{
    config(
        materialized='incremental',
        unique_key='date_day',
        on_schema_change='fail'
    )
}}

The possible values for on_schema_change are:

  • ignore: Default behavior (see below).
  • fail: Triggers an error message when the source and target schemas diverge.
  • append_new_columns: Append new columns to the existing table. Note that this setting does not remove columns from the existing table that are not present in the new data.
  • sync_all_columns: Adds any new columns to the existing table, and removes any columns that are now missing. Note that this is inclusive of data type changes. On BigQuery, changing column types requires a full table scan; be mindful of the trade-offs when implementing.

Note: None of the on_schema_change behaviors backfill values in old records for newly added columns. If you need to populate those values, we recommend running manual updates, or triggering a --full-refresh.
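For example, if you added a column with append_new_columns enabled and want historical rows populated, a manual statement run directly in your warehouse might look like this (the table is from the example above; new_column and the backfill value are hypothetical):

update analytics.fct_daily_active_users
set new_column = 0   -- whatever value is correct for historical rows
where new_column is null;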

on_schema_change tracks top-level changes

Currently, on_schema_change only tracks top-level column changes. It does not track nested column changes. For example, on BigQuery, adding, removing, or modifying a nested column will not trigger a schema change, even if on_schema_change is set appropriately.

Default behavior

This is the behavior when on_schema_change is set to ignore (the default), and on older versions of dbt.

If you add a column to your incremental model, and execute a dbt run, this column will not appear in your target table.

Similarly, if you remove a column from your incremental model, and execute a dbt run, this column will not be removed from your target table.

Instead, whenever the logic of your incremental model changes, execute a full-refresh run of both your incremental model and any downstream models.

About incremental_strategy

There are various ways (strategies) to implement the concept of incremental materializations. The value of each strategy depends on:

  • the volume of data,
  • the reliability of your unique_key, and
  • the support of certain features in your data platform

Some adapters provide an optional incremental_strategy config that controls the code dbt uses to build incremental models.

Supported incremental strategies by adapter

Refer to your adapter's documentation for details on the incremental strategies it supports.

The merge strategy is available in dbt-postgres and dbt-redshift beginning in dbt v1.6.

Configuring incremental strategy

The incremental_strategy config can either be specified in specific models, or for all models in your dbt_project.yml file:

dbt_project.yml
models:
  +incremental_strategy: "insert_overwrite"

or:

models/my_model.sql
{{
    config(
        materialized='incremental',
        unique_key='date_day',
        incremental_strategy='delete+insert',
        ...
    )
}}

select ...

Built-in strategies

Before diving into custom strategies, it's important to understand the built-in incremental strategies in dbt and their corresponding macros:

incremental_strategy    Corresponding macro
append                  get_incremental_append_sql
delete+insert           get_incremental_delete_insert_sql
merge                   get_incremental_merge_sql
insert_overwrite        get_incremental_insert_overwrite_sql

For example, the built-in append strategy can be defined and used with the following files:

macros/append.sql
{% macro get_incremental_append_sql(arg_dict) %}

  {% do return(some_custom_macro_with_sql(arg_dict["target_relation"], arg_dict["temp_relation"], arg_dict["unique_key"], arg_dict["dest_columns"], arg_dict["incremental_predicates"])) %}

{% endmacro %}


{% macro some_custom_macro_with_sql(target_relation, temp_relation, unique_key, dest_columns, incremental_predicates) %}

  {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}

  insert into {{ target_relation }} ({{ dest_cols_csv }})
  (
      select {{ dest_cols_csv }}
      from {{ temp_relation }}
  )

{% endmacro %}

Define a model models/my_model.sql:

{{ config(
    materialized="incremental",
    incremental_strategy="append",
) }}

select * from {{ ref("some_model") }}

Custom strategies
