Scripted actions
The most common kind of action is a scripted action.
Generally speaking, you can write whatever code you like, as long as it runs successfully on the server; you can test this locally.
However, note the following restrictions and guidance:
- Write analyses in Python, R, or Stata. You can use more than one language in a single project if necessary. You can find more information about the available libraries below.
- Do not write code that requires an internet connection to run. Any research objects (datasets, libraries, etc.) that are retrieved via the internet should be imported to the repo locally first. If this is not possible (for instance, if the object is too large to be transferred via GitHub), then get in touch.
- Avoid code that consumes a lot of time or memory. The server is not an infinite resource. We can advise on code optimisation if run-times become problematic. A good strategy is to split your processing into separate project pipeline actions; the job runner can then choose to run them in parallel if sufficient resources are available.
- Write code that runs in the OpenSAFELY platform. Code will be run within a Linux-based Docker environment. In practice, this just means ensuring you use forward slashes (`/`) for directories.
- Structure your code into discrete chunks, both within scripts and by splitting into different pipeline actions. This helps with:
  - readability
  - bug-finding
  - parallelisation via the project pipeline
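As a minimal sketch of the last two points, the script below splits one action's work into small functions and builds its paths with forward slashes. The file names, the function names, and the `age` column are all hypothetical, and only the Python standard library is used so that the sketch does not depend on which packages are installed.

```python
import csv
import gzip
from pathlib import Path

# Hypothetical locations; forward slashes work in the Linux-based Docker environment.
INPUT_PATH = Path("output/dataset.csv.gz")
OUTPUT_PATH = Path("output/age_counts.csv")


def load_rows(path):
    """Read a gzip-compressed CSV into a list of dicts."""
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.DictReader(f))


def count_by_age_band(rows, width=10):
    """Count rows per age band (e.g. 40-49), using a hypothetical `age` column."""
    counts = {}
    for row in rows:
        lower = (int(row["age"]) // width) * width
        band = f"{lower}-{lower + width - 1}"
        counts[band] = counts.get(band, 0) + 1
    return counts


def write_counts(counts, path):
    """Write the aggregated counts as a small plain CSV, ordered by age band."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["age_band", "count"])
        for band in sorted(counts, key=lambda b: int(b.split("-")[0])):
            writer.writerow([band, counts[band]])


def main(input_path=INPUT_PATH, output_path=OUTPUT_PATH):
    """Run the three steps in order."""
    rows = load_rows(input_path)
    write_counts(count_by_age_band(rows), output_path)
```

Calling `main()` at the end of the script runs the steps in order. Because each step is its own function, each can also be checked separately against a small made-up file.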
Reading and Writing Outputs
Scripted actions can read and write output files that are saved in the workspace. These generally fall into two categories:
* large pseudonymised patient-level files of `highly_sensitive` data for use by other actions
* smaller `moderately_sensitive` files of aggregated patient data (this should never be patient-level data) for review and release
Large `highly_sensitive` output files
Outputs labelled `highly_sensitive` will not be visible to researchers. This is a deliberate design feature of OpenSAFELY, intended to reduce the risk of disclosure of sensitive information. Outputs should always be classed as `highly_sensitive` if they are:
- Pseudonymised patient-level outputs derived from queries run against Level 1 and 2 data, i.e., a specific study dataset generated by a dataset definition.
- Pseudonymised patient-level intermediate outputs for a study, derived from queries run against Level 3 data, i.e., a processed study dataset with certain filters/formatting applied.
These types of outputs are considered potentially highly-disclosive, should not be pushed to Level 4, and are never intended for publishing outside the secure environment.
Pseudonymised patient-level outputs tend to be large, so it is important to use the right file formats for them. The wrong formats can waste disk space, execution time, and server memory. The specific formats used vary with language ecosystem, but they should always be compressed.
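To get a feel for how much disk space compression can save, the sketch below builds a synthetic patient-level table in memory and compares its size as plain CSV and as gzipped CSV. The columns and values are made up for illustration, and the ratio for a real dataset will differ.

```python
import csv
import gzip
import io

# Build a synthetic patient-level table in memory (made-up values, not real data).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["patient_id", "age", "sex", "region"])
for i in range(10_000):
    writer.writerow([i, 18 + i % 70, "FM"[i % 2], f"region_{i % 9}"])

raw = buffer.getvalue().encode("utf-8")
compressed = gzip.compress(raw)

print(f"plain CSV:   {len(raw):>8} bytes")
print(f"gzipped CSV: {len(compressed):>8} bytes")
print(f"ratio:       {len(raw) / len(compressed):.1f}x smaller")
```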
Note
The template sets up the `ehrql` command to produce `csv.gz` outputs.
This is the current recommended output format, as CSV files compress well,
which both reduces storage requirements and improves job execution times
on the backend.
If you need to view the raw CSV data locally, you can unzip with `opensafely unzip dataset.csv.gz`.
Python:

import pandas as pd

# read compressed CSV output from ehrql
df = pd.read_csv("output/dataset.csv.gz")
# write compressed feather file
df.to_feather("output/model.feather", compression="zstd")
# read feather file, decompressed automatically
df = pd.read_feather("output/dataset.feather")

R:

# read compressed CSV output from ehrql
df <- readr::read_csv("output/dataset.csv.gz")
# write a compressed feather file
arrow::write_feather(df, "output/model.feather", compression = "zstd")
# read a feather file, decompressed automatically
df <- arrow::read_feather("output/dataset.feather")

Stata:

// Stata cannot handle compressed CSV files directly, so unzip first to a plain CSV file
// the unzipped file will be discarded when the action finishes.
!gunzip output/dataset.csv.gz
// now import the uncompressed CSV using delimited
import delimited using output/dataset.csv
// save in compressed dta.gz format
gzsave output/model.dta.gz
// load a compressed .dta.gz file
gzload output/dataset.dta.gz
Smaller `moderately_sensitive` output files
Files that are labelled `moderately_sensitive` should only ever be aggregated data such as summary tables, images, and the outputs from statistical models. These files will be available to view with Level 4 access. These (and the corresponding automatically created log files of each action/script) will be the only output files that users will have access to; users do not have unfettered access to any patient-level data and only see aggregated outputs derived from their analysis code, which satisfies the GDPR principle of confidentiality.
File type restrictions for `moderately_sensitive` outputs
There are restrictions on the types of file that are transferred to Level 4. This is to reduce the risk of making pseudonymised patient-level data available for researchers to view.
If a file labelled as `moderately_sensitive` is not one of the allowed file types below, it will be replaced on Level 4 with a `.txt` file with the same filename, which explains why the file was not allowed on Level 4.
File format
Formats are restricted to types of file that are likely to contain summary data rather than patient-level data, so that reviewers can properly examine the outputs on the secure server.
Type | Formats |
---|---|
Text | .txt , .log , .md |
Data | .csv , .json |
Images | .png , .jpeg , .svgz |
Reports | .html , .pdf |
File size
There is a maximum file size of 16 MB to:
- prevent large patient-level data files being accessed via Level 4
- allow a thorough review of the outputs in a reasonable time
Files with `patient_id` in the header
Any CSV file with a `patient_id` header will not be made available in Level 4.
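The three restrictions above can be approximated locally before requesting a release. The following is only a rough sketch, not how the platform itself performs these checks: the function name `level4_problems` is hypothetical, the suffix match is a simple case-sensitive comparison, and reading "16 MB" as 16 * 1024 * 1024 bytes is an assumption.

```python
import csv
from pathlib import Path

# Allowed extensions, copied from the table above.
ALLOWED_SUFFIXES = {
    ".txt", ".log", ".md",
    ".csv", ".json",
    ".png", ".jpeg", ".svgz",
    ".html", ".pdf",
}
# Assumption: "16 MB" is read here as 16 * 1024 * 1024 bytes.
MAX_BYTES = 16 * 1024 * 1024


def level4_problems(path):
    """Return a list of reasons why `path` might be withheld from Level 4."""
    path = Path(path)
    problems = []
    if path.suffix not in ALLOWED_SUFFIXES:
        problems.append(f"file type {path.suffix!r} is not in the allowed list")
    if path.stat().st_size > MAX_BYTES:
        problems.append("file is larger than 16 MB")
    if path.suffix == ".csv":
        # Only the first row (the header) is inspected.
        with open(path, newline="") as f:
            header = next(csv.reader(f), [])
        if "patient_id" in (name.strip() for name in header):
            problems.append("CSV header contains patient_id")
    return problems
```

An empty list means none of the three checks fired; it does not mean the file is safe to release, which is still a matter for output review.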
Execution environments
OpenSAFELY currently supports Stata, Python, and R for statistical analysis.
For security reasons, available libraries are restricted to those provided by the framework, though you can request additions.
The framework executes your scripts using Docker images which have been preloaded with a fixed set of libraries. These Docker images have yet to be optimised; if you have skills in creating Dockerfiles and would like to help, get in touch!
Stata
We currently package version 16.1, with the `datacheck`, `safetab`, and `safecount` libraries installed; when installed, new libraries will appear in the stata-docker GitHub repository.
Note
Stata can only produce a very limited set of image formats on Linux, none of which are in the approved list above. To output an image from Stata, you can export it as eps and then use the convert tool:
graph export "output/graph.eps", replace
! convert output/graph.eps output/graph.png
As Stata is a commercial product, a license key is needed to use it.
If you are a member of the `opensafely` GitHub organisation
- If you are using Windows, you must have the GitHub Desktop app installed and be logged into it. Then the `opensafely` command line software will use that app to obtain the OpenSAFELY Stata license automatically.
- If you are using macOS:
  - Download and install GitHub's command-line tool (`gh`)
  - Run `gh auth login --web`. Select the "HTTPS" option, and follow the instructions
  - The `opensafely` command line software will now automatically use the OpenSAFELY Stata license
All other external users
If you are not a member of the `opensafely` GitHub organisation, you must provide your own Stata/MP license. Unfortunately other Stata flavours are not yet supported; let us know if this is a problem.
- Locate your Stata license string as follows:
  - Locate a text file called `STATA.LIC` (on Windows) or `stata.lic` (macOS and Linux), which is usually at the top level of the folder of your Stata installation:
    - On a Windows machine, it's usually somewhere like `C:\Program Files\Stata17`
    - On Linux, somewhere like `/usr/local/stata17/`
    - On macOS, it's usually in `/Applications/Stata/`
  - Within that file, locate a license string of the format `SerialNumber!Code!Authorization!User!Organisation!VersionCode`.
- Set it as an environment variable using a method appropriate to your operating system. The name of the environment variable should be `STATA_LICENSE`, and its contents should be the entire license string. The `opensafely` command line software should now automatically use this Stata license after opening a new terminal.
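Before running a Stata action locally, it can help to confirm that the environment variable is visible in a new terminal and has the expected shape. The sketch below is hypothetical and not part of the `opensafely` tooling: it only counts the `!`-separated fields described above and cannot tell whether the license itself is valid.

```python
import os

# Field names taken from the license string format described above.
EXPECTED_FIELDS = [
    "SerialNumber", "Code", "Authorization",
    "User", "Organisation", "VersionCode",
]


def check_stata_license(value):
    """Return a list of problems with a STATA_LICENSE value (empty if none found)."""
    if not value:
        return ["STATA_LICENSE is not set or is empty"]
    problems = []
    parts = value.split("!")
    if len(parts) != len(EXPECTED_FIELDS):
        problems.append(
            f"expected {len(EXPECTED_FIELDS)} '!'-separated fields, found {len(parts)}"
        )
    return problems


# Check the value visible to this process; the license string itself is never printed.
problems = check_stata_license(os.environ.get("STATA_LICENSE"))
print("looks plausible" if not problems else "; ".join(problems))
```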
Python
There are two versions of the Docker image.
- `python:v1`, which for historical reasons is the same as `python:latest`, contains Python 3.8. It has this list of packages installed.
- `python:v2` contains Python 3.10. It has this list of packages installed.
R
The R image provided is R 4.0, with this list of libraries installed.