Skip to content

Building a dataset

We can now move on to implementing some of the QOF business rules.

In this tutorial, we'll look at rule DM006, which is about finding the percentage of patients on the register, with a diagnosis of clinical proteinuria or micro-albuminuria who are currently treated with angiotensin-converting enzymes (ACEs) or angiotensin II receptor blockers (ARBs).

The QOF rules define "currently treated" to mean having a medication within 180 days of the index date.

To do this, we'll needs some more codelists. The diagnosis codelists (proteinuria and micro-albuminuria) are published by QOF; the medication codeslists aren't, so we'll use ones developed for OpenSAFELY studies (ACEs and ARBs).

Medications🔗

We have already seen how to determine whether a patient has a clinical event matching a code in a codelist. Medications are recorded in a similar way to clinical events, and can be queried using the same kinds of ehrQL.

from ehrql import codelist_from_csv, days, show
from ehrql.tables.core import clinical_events, medications


index_date = "2024-03-31"

proteinuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-prt_cod.csv", column="code")
microalbuminuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-mal_cod.csv", column="code")
ace_codes = codelist_from_csv("codelists/opensafely-ace-inhibitor-medications.csv", column="code")
arb_codes = codelist_from_csv("codelists/opensafely-angiotensin-ii-receptor-blockers-arbs.csv", column="code")

previous_events = clinical_events.where(clinical_events.date.is_on_or_before(index_date))
recent_meds = medications.where(medications.date.is_on_or_between(index_date - days(180), index_date))

has_proteinuria_diagnosis = (
    previous_events
    .where(clinical_events.snomedct_code.is_in(proteinuria_codes))
    .exists_for_patient()
)

has_microalbuminuria_diagnosis = (
    previous_events
    .where(clinical_events.snomedct_code.is_in(microalbuminuria_codes))
    .exists_for_patient()
)

has_arb_treatment = (
    recent_meds
    .where(medications.dmd_code.is_in(arb_codes))
    .exists_for_patient()
)

has_ace_treatment = (
    recent_meds
    .where(medications.dmd_code.is_in(ace_codes))
    .exists_for_patient()
)

show(
    has_proteinuria_diagnosis,
    has_microalbuminuria_diagnosis,
    has_arb_treatment,
    has_ace_treatment,
)

Notice that we've added a couple of intermediate variables (previous_events and recent_meds) to reduce duplication in our code.

Datasets🔗

So far, we've been using show() to show us what our queries return when running against dummy data.

But we can't use show() to extract data for a real study!

Instead, we need to define a dataset. As a reminder: a dataset is a new table containing one row per patient, and a dataset definition consists of a population definition and a set of column definitions.

We then use dataset = create_dataset() to create a new dataset object, dataset.define_population(...) to define the population, and dataset.column_name = ... to define the columns:

from ehrql import create_dataset, codelist_from_csv, days, show
from ehrql.tables.core import patients, practice_registrations, clinical_events, medications

index_date = "2024-03-31"

diabetes_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-dm_cod.csv", column="code")
resolved_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-dmres_cod.csv", column="code")
proteinuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-prt_cod.csv", column="code")
microalbuminuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-mal_cod.csv", column="code")
ace_codes = codelist_from_csv("codelists/opensafely-ace-inhibitor-medications.csv", column="code")
arb_codes = codelist_from_csv("codelists/opensafely-angiotensin-ii-receptor-blockers-arbs.csv", column="code")

previous_events = clinical_events.where(clinical_events.date.is_on_or_before(index_date))
recent_meds = medications.where(medications.date.is_on_or_between(index_date - days(180), index_date))

aged_17_or_older = (index_date - patients.date_of_birth).years >= 17
was_alive = patients.date_of_death.is_null() | (patients.date_of_death < index_date)
was_registered = (
    practice_registrations.where(practice_registrations.start_date <= index_date)
    .except_where(practice_registrations.end_date < index_date)
    .exists_for_patient()
)

last_diagnosis_date = (
    previous_events
    .where(clinical_events.snomedct_code.is_in(diabetes_codes))
    .sort_by("date")
    .last_for_patient()
    .date
)
last_resolved_date = (
    previous_events
    .where(clinical_events.snomedct_code.is_in(resolved_codes))
    .sort_by("date")
    .last_for_patient()
    .date
)

has_unresolved_diabetes = last_diagnosis_date.is_not_null() & (
    last_resolved_date.is_null() | (last_resolved_date < last_diagnosis_date)
)

on_register = aged_17_or_older & was_alive & was_registered & has_unresolved_diabetes

has_proteinuria_diagnosis = (
    previous_events
    .where(clinical_events.snomedct_code.is_in(proteinuria_codes))
    .exists_for_patient()
)

has_microalbuminuria_diagnosis = (
    previous_events
    .where(clinical_events.snomedct_code.is_in(microalbuminuria_codes))
    .exists_for_patient()
)

has_arb_treatment = (
    recent_meds
    .where(medications.dmd_code.is_in(arb_codes))
    .exists_for_patient()
)

has_ace_treatment = (
    recent_meds
    .where(medications.dmd_code.is_in(ace_codes))
    .exists_for_patient()
)

dataset = create_dataset()
dataset.define_population(on_register)

dataset.prt_or_mal = has_proteinuria_diagnosis | has_microalbuminuria_diagnosis
dataset.ace_or_arb = has_arb_treatment | has_ace_treatment

show(dataset)

Note that it is essential that the dataset you create is given the name dataset.

Question: what's the relationship between the number of rows in the dataset and the on_register series?

Question: what happens if you try to set the dataset's population to something that is not a boolean patient series?

Question: what happens if you try to set a dataset column to something that is not a patient series?

Question: what happens if you try to reuse a name for a dataset's column?

Next: Using ehrQL as part of a study