Starter project template for all new projects
- Within a conda env, use the package manager pip to install the dev requirements outlined in requirements.txt.
- To create a new conda env, run `conda create -n ENV_NAME python=3.9`, then run `conda activate ENV_NAME`, and finally `pip install -r requirements.txt`.
- Run `pre-commit install` to set up your git hook scripts. This sets up your hooks so that the next time you commit, pre-commit will be invoked (note: on its first invocation, pre-commit will need to install its own dependencies, which may take a minute; these dependencies are installed outside of your project and will be available from that moment onwards).

    pip install -r requirements.txt
    pre-commit install

To use the template, add the following to your ~/.dbt/profiles.yml file, taking care to replace the values of the project, dataset and location parameters with appropriate values. Note the profile provided is called sandbox - rename this and update dbt_project.yml line 8 to match.
If you don't already have a profiles file, create a new one.
    sandbox:
      target: dev
      outputs:
        dev:
          type: bigquery
          method: oauth
          project: data-sandbox-266217
          dataset: dbt_tl
          threads: 8
          location: EU
Once everything has been set up, try running the following commands:
- dbt debug (if you're having issues)
- dbt deps
- dbt seed
- dbt run
- dbt test
- Pre-commit is configured to run various checks automatically when you attempt to commit your code.
- Pre-commit will only run against changed files to keep its execution as quick as possible.
- On its first execution, pre-commit will install any dependencies it needs into a virtual environment (located outside of this repo); this may take a few minutes on its first run, but every following run will reuse that env and as a result will be much quicker.
- Remember to keep the pinned versions in .pre-commit-config.yaml and requirements.txt appropriately up to date.
- SQLFluff is used to enforce formatting guidelines.
- SQLFluff lint is configured as a pre-commit hook that runs on commit, so in most cases no explicit commands are needed. It will only list errors; it will not fix them.
- If you would like to run SQLFluff lint manually, you can do so with the following command, which runs it through pre-commit:

    pre-commit run --hook-stage commit sqlfluff-lint --all-files

| SQLFluff Rule | Description | Configuration |
|---|---|---|
| L002 | Mixed Tabs and Spaces in single whitespace. This rule will fail if a single section of whitespace contains both tabs and spaces. | None |
| L005 | Commas should not have whitespace directly before them. Unless it’s an indent. Trailing/leading commas are dealt with in a different rule. | None |
| L006 | Operators should be surrounded by a single whitespace. | None |
| L010 | Inconsistent capitalisation of keywords. | capitalisation_policy = lower |
| L011 | Implicit/explicit aliasing of table. Aliasing of table to follow preference, e.g. explicit AS. | aliasing = explicit |
| L012 | Implicit/explicit aliasing of columns. Aliasing of columns to follow preference, e.g. explicit AS. | aliasing = explicit |
| L013 | Column expression without alias. Use explicit AS clause. | None |
| L014 | Inconsistent capitalisation of unquoted identifiers. | extended_capitalisation_policy = lower |
| L015 | DISTINCT used with parentheses. | None |
| L017 | Function name not immediately followed by parenthesis. | None |
| L018 | WITH clause closing bracket should be on a new line. | None |
| L019 | Leading/Trailing comma enforcement. | trailing |
| L020 | Table aliases should be unique within each clause. | None |
| L021 | Ambiguous use of DISTINCT in a SELECT statement with GROUP BY. When using GROUP BY, a DISTINCT clause should not be necessary as every non-distinct SELECT clause must be included in the GROUP BY clause. | None |
| L022 | Blank line expected but not found after CTE closing bracket. | None |
| L023 | Single whitespace expected after AS in WITH clause. | None |
| L024 | Single whitespace expected after USING in JOIN clause. | None |
| L025 | Tables should not be aliased if that alias is not used. | None |
| L027 | References should be qualified if select has more than one referenced table/view. | None |
| L030 | Inconsistent capitalisation of function names. | extended_capitalisation_policy = lower |
| L035 | Do not specify else null in a case when statement (redundant). | None |
| L036 | Select targets should be on a new line unless there is only one select target. | None |
| L037 | Ambiguous ordering directions for columns in order by clause. | None |
| L038 | Trailing commas within select clause. | select_clause_trailing_comma = forbid |
| L040 | Inconsistent capitalisation of boolean/null literal. | capitalisation_policy = lower |
| L041 | SELECT modifiers (e.g. DISTINCT) must be on the same line as SELECT. | None |
| L042 | Join/From clauses should not contain subqueries. Use CTEs instead. | None |
| L044 | Query produces an unknown number of result columns. | None |
| L045 | Query defines a CTE (common-table expression) but does not use it. | None |
| L046 | Jinja tags should have a single whitespace on either side. | None |
| L047 | Use consistent syntax to express “count number of rows”. | None |
| L049 | Comparisons with NULL should use “IS” or “IS NOT”. | None |
| L050 | Files must not begin with newlines or whitespace. | None |
| L051 | Join clauses should be fully qualified. | fully_qualify_join_types = both |
| L054 | Inconsistent column references in GROUP BY/ORDER BY clauses. | group_by_and_order_by_style = consistent |
| L055 | Use LEFT JOIN instead of RIGHT JOIN. | None |
| L058 | Nested CASE statement in ELSE clause could be flattened. | None |
| L060 | Use COALESCE instead of IFNULL or NVL. | None |
| L065 | Set operators should be surrounded by newlines. | None |
| L066 | Enforce table alias lengths in from clauses and join conditions. | min_alias_length = 3, max_alias_length = 25 |
| L071 | Parenthesis blocks should be surrounded by whitespaces. | None |
- dbt-gloss is used to enforce dbt project structure.
- dbt-gloss is configured as a pre-commit hook that runs on commit, so in most cases no explicit commands are needed. It will only list errors; it will not fix them.
- If you would like to run a dbt-gloss hook manually, you can do so with the following command (replacing {{dbt-gloss hook_id}} with the hook you want to run), which runs it through pre-commit:

    pre-commit run {{dbt-gloss hook_id}} --hook-stage commit --all-files

| dbt-gloss hook | Description | Configuration |
|---|---|---|
| dbt-run | Ensures that changed files pass dbt run. | None |
| check-source-has-loader | Ensures that the source has a loader option in the properties file (usually sources.yml). | None |
| check-source-has-all-columns | Ensures that all columns in the database are also specified in the properties file (usually sources.yml). | None |
| check-source-table-has-description | Ensures that the source table has a description in the properties file (usually sources.yml). | None |
| check-model-has-properties-file | Ensures that the model has a properties file (models.yml). | None |
| check-model-has-all-columns | Ensures that all columns in the database are also specified in the properties file (usually models.yml). | None |
| check-model-has-description | Ensures that the model has a description in the properties file (usually models.yml). | None |
- All SQL reserved keywords should be lowercase, e.g. select and not SELECT.
- Do not use select * from. An exception to this is if you are selecting all columns from a CTE that is already selecting specific columns, i.e. the CTE is acting as a pass-through.

-- bad
select * from product;

-- OK
with product_dependency as (
    select
        product_id,
        product_name
    from product
    where 1 = 1
        and product_id is not null
)

select * from product_dependency;

-- preferred
with product_dependency as (
    select
        product_id,
        product_name
    from product
    where 1 = 1
        and product_id is not null
)

select
    product_id,
    product_name
from product_dependency
- Within a select statement, place commas at the end of the line.

select
    1 as col1,
    2 as col2
from test
- Each column in a select statement should be on a new line. Moreover, do not include any columns on the same line as the select or the from.
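For example (a minimal sketch using hypothetical column and table names):

```sql
-- bad
select order_id, order_name from orders

-- good
select
    order_id,
    order_name
from orders
```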
- When aliasing columns be consistent at which column val
- The select, from, where, qualify, group by, having and limit clauses should be at the same indentation.
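To illustrate (a minimal sketch with hypothetical columns; all the clauses sit at the same indentation):

```sql
select
    product_type,
    count(*) as product_count
from test
where 1 = 1
    and product_type is not null
group by
    product_type
having count(*) > 1
limit 100
```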
- where: the first clause should be on the same line as the where statement itself. If you are using multiple clauses, the first clause should be where 1 = 1.

select
    product
from test
where 1 = 1
    and product not like '%xbox%'
- group by: if grouping by positions, keep them all on the same line as the group by clause; otherwise put each column on its own line, indented by 1.

group by 1, 2, 3

vs

group by
    product_type,
    sales_id,
    product_id
- When using case statements, each component of the case statement should be on a new line, and when/then should be a single indentation from the outer components.

select
    case
        when 1 = 1 then true
        else false
    end as col1
from x
- Try to name CTEs with their purpose in mind; try not to use names like ‘cte1’, ‘cte2’. A good example would be product_dependency.
- Use CTEs to break up complex code.
- Use CTEs to capture model dependencies and apply necessary filtering within the CTE (select needed columns and filter early - this is the way!).

with orders_dependency as (
    select
        order_id,
        order_name,
        ...
    from orders
    where 1 = 1
        and meta_is_valid = 1
        and order_id is not null
),

products_dependency as (
    select
        product_id,
        product_name,
        ...
    from product
    where 1 = 1
        and meta_is_valid = 1
        and product_id is not null
)
- When formatting CTEs (see the sketch after this list):
- Do not indent CTE name.
- Keep the open bracket on the same line as the CTE name. Put the closing bracket on its own line at the end with the indentation the same as the CTE name.
- The select statement should be indented by 1 tab from the CTE name, all other formatting/indentation rules then apply.
- Keep a 1 line gap between end of CTE and start of next CTE/final select.
- New lines are cheap, brain power is expensive.
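Pulling these rules together, a minimal sketch (the CTE, table and column names are hypothetical):

```sql
with orders_dependency as (
    select
        order_id,
        order_amount
    from orders
    where 1 = 1
        and order_id is not null
),

order_totals as (
    select
        order_id,
        sum(order_amount) as total_order_amount
    from orders_dependency
    group by
        order_id
)

select
    order_id,
    total_order_amount
from order_totals
```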
- Use exists and not exists rather than in and not in
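A hedged sketch of this preference (the orders and product tables are hypothetical):

```sql
-- avoid
select
    order_id
from orders
where 1 = 1
    and product_id in (select product_id from product)

-- prefer
select
    order_id
from orders
where 1 = 1
    and exists (
        select 1
        from product
        where product.product_id = orders.product_id
    )
```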
We have these standards in place to ensure we maintain consistency across our projects, so that if, for example, you have to switch to help on a different project, you will already be familiar with the overall structure, naming etc. and can quickly become productive!
English should be used for all table/view names, dbt schema column descriptions and comments in dbt models. This also applies to variable names in Jinja where applicable.
We use the convention of suffixing dbt project names with _data, e.g. s24_data. Project names are lowercase and contain underscores.
Every dbt project should be linked to a designated Github repo. These should be named <dbt_project_name>-dbt → e.g. s24-data-dbt
Note that hyphens ("-") are used instead of underscores.
In BigQuery, we separate the dwh into 5 logical layers. Each layer is optimised for data ingestion, storage, transformation or consumption. The 5 layers are:
- raw (ingestion)
- archive (storage)
- clean (storage)
- staging (transformation)
- presentation (consumption)
Our approach closely follows the factory model, where raw materials (source data) are delivered, checked (cleaned), stored (archived), and finally manufactured (transformed) into high-value products (data assets).
| Layer | Template | Example |
|---|---|---|
| Raw | dw_raw_ | dw_raw_fs24_creditscout |
| Archive | dw_archive_ | N/A |
| Clean | dw_clean_ | dw_clean_fs24_creditscout |
| Stage | dw_pl__stage | dw_pl_financescout24_stage |
| Presentation | dw_pl_ | dw_pl_financescout24 |
Notes
- pl = Presentation Layer
- dw = Data Warehouse
The presentation layer is split into logical zones. Each zone represents a business process or product. In addition, we have a designated zone called dw_pl_reference that contains assets that are shared across zones (for example, dim_date is in dw_pl_reference).
To ensure horizontal scalability, we pay careful attention to how we organise the model folders in the dbt project. We have standards for both the structure and folder names. The key folders are:
| Folder | Overview |
|---|---|
| analysis | Supporting ad-hoc files, relating to Apex24, for example useful dbt commands and shell scripts. |
| analysis/cloud functions | Sometimes we need Cloud Functions to support more advanced dbt use cases. |
| analysis/sql queries | Any helpful ad-hoc SQL. |
| analysis/utilities | Mostly Python utilities we have written to support various use cases. Note the magnificent_model_maker and the magnificent_schema_maker are in this folder. |
| data | This folder contains dbt seeds. These are csvs that dbt will convert to a physical table. Useful for static data that changes very little. |
| dbt_packages | Installed packages reside here. |
| macros | Standard dbt macros together with custom macros we have developed e.g. clean_numeric, scd2_history. |
| models/00_dw_utils | Any functionality specific to running the dwh itself resides here. For example, we have an example table called cloud_function_error (note this is disabled) that we use on projects to trigger a cloud function when a row is inserted. |
| models/01_dw_raw_backfill | Raw layer containing tables used to backfill legacy data. e.g. backfill_autoscout24_vehicleidperday - BQ datasets: dw_raw_backfill & dw_clean_backfill |
| models/02_dw_raw/ | Raw and clean layers. Note we have a sub folder for each source. Raw contains tables landed into BQ. e.g. raw_mssql_crm, raw_reference. Clean contains tables that have the following transformations applied from raw: 1) Data type conversion/cleaning. 2) Column renaming to follow standards. 3) We include a meta_is_valid and meta_is_valid_reason so we can mark records as invalid, for example if they are missing a key. e.g. table dw_clean_crm.contactbase_clean - BQ datasets: dw_raw_ |
| models/03_dw_raw_stage | Staging layer to help complex transformations from raw to clean. e.g. crm_account_clean - BQ dataset: dw_raw_ |
| models/04_pl__stage | Staging layer to help complex transformations from clean to the pl. For pl objects where the transformation from clean is simple, a staging object is not needed. e.g. 04_pl_crm_stage.salesorders_detail_stage -BQ dataset: dw_pl__stage |
| models/05_pl_ | Presentation Layer, serving high-value facts and dimensions for consumption. Look consumes data from the PL. e.g. 05_pl_fs24 |
| scratch | Any ad-hoc files relating to the project go here. For example, we sometimes put legacy code if migrating a data warehouse. |
- Schema, table and column names should be in snake_case
- Dimensions should be prefixed dim_ e.g. dim_customer
- Facts (transactional) should be prefixed fact_ e.g. fact_order
- Facts (snapshots) should be suffixed with the snapshot period e.g. fact_stock_level_daily
- Facts (accumulating) should be suffixed _summary e.g. fact_order_summary
- Table names should be singular e.g. dim_account, fact_order
- Date columns should be suffixed _date e.g. birth_date, month_date
- Time columns should be suffixed _time e.g. order_time
- Timestamp data type should be used for UTC only. Therefore, it is assumed a column of timestamp type in clean and beyond is displaying UTC. To encode a local time use a datetime type with the naming order_time_
- Booleans should be prefixed with is_ e.g. is_cancelled
- Monetary fields should be suffixed _amount e.g. order_amount
- Where a non-local currency is needed, these should be provided as a column suffixed with the currency code, e.g. order_amount_<currency_code>.
- Consistency is key! Use the same field names across models where possible.
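To make these conventions concrete, a hedged sketch of a hypothetical fact_order model's select list (all names are illustrative):

```sql
select
    order_id,
    customer_id,
    order_date,          -- date column, suffixed _date
    order_time,          -- time column, suffixed _time
    is_cancelled,        -- boolean, prefixed is_
    order_amount,        -- monetary field, suffixed _amount
    order_amount_eur     -- non-local currency, suffixed with the currency code
from order_stage
```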
Meta columns describe the data warehouse itself and aid in lineage, auditing and troubleshooting.
All tables should have the following meta columns:
| Meta field | Description |
|---|---|
| meta_process_time | The start time of the ETL process that loaded this record. This can be in the past where we are replaying a previous run. In dbt this is set by {{dbt_macros.meta_process_time()}} and is in UTC. |
| meta_delivery_time | When data arrived in the raw layer. |
| meta_source | Which source system the record was sourced from e.g. salesforce. This helps with data lineage and aids troubleshooting. |
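As an illustration, a hedged sketch of a model adding these columns (the source table, the _loaded_at column and the source() reference are assumptions; the macro is the one referenced above):

```sql
select
    contact_id,
    contact_name,
    {{ dbt_macros.meta_process_time() }} as meta_process_time,  -- start time of the load, UTC
    _loaded_at as meta_delivery_time,                            -- when the row arrived in the raw layer
    'salesforce' as meta_source                                  -- originating source system
from {{ source('raw_salesforce', 'contact') }}
```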
Tables in the clean layer should include:
| Meta field | Description |
|---|---|
| meta_is_valid | Denotes if the record is usable. For example, set to 0 if the primary key is missing. |
| meta_is_valid_reason | This is used to describe why a record is not valid e.g. missing id |
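A minimal sketch of how a clean-layer model might populate these two columns (the table and column names are hypothetical):

```sql
select
    contact_id,
    contact_name,
    case
        when contact_id is not null then 1
        else 0
    end as meta_is_valid,
    case
        when contact_id is null then 'missing id'
    end as meta_is_valid_reason
from raw_contact
```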
SCD2 events tables in the PL should include:
| Meta field | Description |
|---|---|
| meta_scd_action | Denotes the event that triggered this record. Options are: insert, update, delete, reactivate. |
| meta_start_time | Represents when this change is effective from. |
SCD2 dimension tables in the PL should include:
| Meta field | Description |
|---|---|
| meta_scd_action | Denotes the event that triggered this record. Options are: insert, update, delete, reactivate. |
| meta_start_time | Represents when this version of the dimension is effective from. |
| meta_end_time | Represents when this version of the dimension is effective to. Note if the record is active, this is set to a large timestamp value. |
| meta_is_latest | Represents if this is the most recent version of the record. Note a record can be deleted (meta_end_time in the past) and still have meta_is_latest = 1. |
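For example, to read only the latest version of each record from an SCD2 dimension carrying these columns (a hedged sketch; dim_customer and its columns are hypothetical):

```sql
select
    customer_id,
    customer_name,
    meta_scd_action,
    meta_start_time,
    meta_end_time
from dim_customer
where 1 = 1
    and meta_is_latest = 1
```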