Research Data Management: RDM Ethics

Ethical Considerations

Your research data may require special handling if it contains personal or sensitive information. Use your data management plan to address the following questions:

Is the data anonymous? If not, remove or redact sensitive information
Who will have access to the data? Ensure appropriate access restrictions are in place (dark archives, embargo periods, etc.)
What are the intellectual property rights? Ensure that copyright, ownership, and licenses for the use of the data are clear
Have participants given informed consent? Consent forms can include language that allows data sharing; ICPSR provides a page of suggested wording

Be sure to check Memorial University's Ethics of Research Involving Human Participants policy.

Ethical Management of Sensitive Data

Consider the ethical implications of your data when formulating your data management plan. The contents of your data may obligate you to store and share your data with special care. Consult the table below to determine the risk level of your data and best practices for protection, storage, access, and transfer. As a best practice, if in doubt, assume that your data is sensitive and poses a high risk. The table below is adapted with permission from Chandra Kavanagh, Ethics Officer for the NL Health Research Ethics Authority (HREA).

	Low Risk	Medium Risk	High Risk
Types of Data	- Research data that does not contain any sensitive or identifiable information (e.g. data which has been de-identified). NB If in doubt, assume that data is sensitive. - Non-sensitive research documentation - Publicly available information	- Research data that may or does contain sensitive or identifiable information - Some sensitive research-related documentation - Personally identifiable information - De-identified records of compensation - Data and research protocols related to private or sensitive intellectual property	- Research data that contains confidential, restricted or highly sensitive information - Personal health information - Personal financial information - Data and research protocols related to highly sensitive intellectual property
Examples	- Completely de-identified or anonymous data - Blank consent forms and information sheets - Information gathered from a public-facing website	- Video or audio recorded interviews depending on the content - Identification keys and signed consent forms - De-identified financial information associated with research payments	- Information with regard to racial or ethnic origin; political opinions; religious beliefs or other beliefs of a similar nature; trade union membership; physical or mental health or condition; sexual life; the commission or alleged commission by the data subject of any offence
Data Protection	- Research data must always be stored according to protocols approved by the appropriate Research Ethics Board.	- Collect and store data on password-protected devices, preferably static devices in a secure location, such as a desktop computer in a locked office or an appropriately protected server. - All research data is subject to the TCPS2, which states “identifiable data obtained through research that is kept on a computer and connected to the Internet should be encrypted.” - See below for more information about secure data storage, access and transfer.	- Collect and store data on password-protected and encrypted devices, preferably static devices in a secure location, such as a desktop computer in a locked office or an appropriately protected server. - All research data is subject to the TCPS2, which states “identifiable data obtained through research that is kept on a computer and connected to the Internet should be encrypted.” - All public cloud services (Google Drive, Dropbox, iCloud, OneDrive, etc.) for data storage or transfer are strictly prohibited. Use of private cloud services for data storage and transfer is subject to the restrictions detailed below.
Data Storage	- All storage devices, file shares and cloud services are allowed, including public cloud services (Google Drive, Dropbox, iCloud, OneDrive, etc.) as well as institutional cloud services.	- A computer or external electronic storage device that meets the data protection requirements. - Public cloud services (Google Drive, Dropbox, iCloud, OneDrive, etc.) for data storage or transfer are strictly prohibited. Institutional cloud services might be suitable if specified in the REB protocol. Privacy and security risks are the reasons for preferring internal services over external ones, particularly external services for which there is no enterprise agreement. - Central, departmental and lab file shares that meet data protection requirements and have been identified in the REB protocol.	- A computer or external electronic storage device that meets the data protection requirements. - Central, departmental and lab file shares that meet data protection requirements and have been identified in the REB protocol.
Data Access	- No special handling required.	- Access to confidential information must be restricted to authorized individuals who have been identified in the REB protocol.	- Access to confidential information must be restricted to authorized individuals who have been identified in the REB protocol. Note for reviewers: Access should be restricted to the fewest individuals possible.
Data Transfer	- Can be shared via email and all cloud services, including public cloud services (Google Drive, Dropbox, iCloud, OneDrive, etc.).	- Encrypted and password-protected files can be shared via approved cloud services.	- Restricted data should be shared hand to hand on a password-protected and encrypted data storage device. - Files may be shared using properly encrypted, password-protected, expiring links.

Alliance Sensitive Data Tools

Glossary of terms for Sensitive Data used for Research Purposes
Human Participant Research Data Risk Matrix
Research Data Management Language for Informed Consent
This resource provides examples of language that can be used for informed consent

Anonymization of Research Data

If your research data contains personal or otherwise sensitive information, it will have to be anonymized before you upload it to an open repository and share it.

The following guidelines, from the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan, outline a number of considerations to take into account when anonymizing your data. These guidelines provide a general overview only; it is the responsibility of the researcher to ensure that their anonymization procedures are sufficient for their particular datasets.

Why anonymize research data?

“Once data are released to the public, it is impossible to monitor use to ensure that other researchers respect respondent confidentiality. Thus, it is common practice in preparing public-use datasets to alter the files so that information that could imperil the confidentiality of research subjects is removed or masked before the dataset is made public. At the same time, care must be used to make certain that the alterations do not unnecessarily reduce the researcher’s ability to reproduce or extend the original study findings.”

What kinds of data can identify individuals?

“Direct identifiers: These are variables that point explicitly to particular individuals or units. They may have been collected in the process of survey administration and are usually easily recognized. For instance, in the United States, Social Security numbers uniquely identify individuals who are registered with the Social Security Administration. Any variable that functions as an explicit name can be a direct identifier -- for example, a license number, phone number, or mailing address. Data depositors should carefully consider the analytic role that such variables fulfill and should remove any identifiers not necessary for analysis.

“Indirect identifiers: Data depositors should also carefully consider a second class of problematic variables -- indirect identifiers. Such variables make unique cases visible. For instance, a United States ZIP code field may not be troublesome on its own, but when combined with other attributes like race and annual income, a ZIP code may identify unique individuals (e.g., extremely wealthy or poor) within that ZIP code, which means that answers the respondent thought would be private are no longer private. Some examples of possible indirect identifiers are detailed geography (e.g., state, county, or census tract of residence), organizations to which the respondent belongs, educational institutions from which the respondent graduated (and year of graduation), exact occupations held, places where the respondent grew up, exact dates of events, detailed income, and offices or posts held by the respondent. Indirect identifiers often are items that are useful for statistical analysis. The data depositor must carefully assess their analytic importance. Do analysts need the ZIP code, for example, or will data aggregated to the county or state levels suffice?

“Geographic identifiers: Some projects collect data containing direct geographic identifiers such as coordinates that can be used with a mapping application. These data can be classified and displayed with geographic information system (GIS) software. Direct geographic identifiers are actual addresses (e.g., of an incident, a business, a public agency, etc.). As described above, the role of these variables should be considered and only included if necessary for analysis. Indirect geographic identifiers include location information such as state, county, census tract, census block, telephone area codes, and place where the respondent grew up.”
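
The sketch below is not part of the ICPSR guide. It is a minimal illustration, in Python, of how one might check whether a combination of indirect identifiers makes individual cases unique. The function name `flag_unique_cases`, the field names, and the four respondent records are all hypothetical and invented for illustration; a real assessment would use your own variables and would likely need a stricter threshold than "appears only once".

```python
from collections import Counter


def flag_unique_cases(records, indirect_identifiers):
    """Return the records whose combination of indirect-identifier values
    occurs exactly once in the dataset (i.e. the case is unique)."""
    def combination(record):
        return tuple(record[name] for name in indirect_identifiers)

    counts = Counter(combination(record) for record in records)
    return [record for record in records if counts[combination(record)] == 1]


# Hypothetical respondents; values are invented for illustration only.
respondents = [
    {"id": 1, "zip": "48104", "race": "A", "income_band": "50-75k"},
    {"id": 2, "zip": "48104", "race": "A", "income_band": "50-75k"},
    {"id": 3, "zip": "48104", "race": "B", "income_band": "250k+"},
    {"id": 4, "zip": "48109", "race": "A", "income_band": "50-75k"},
]

# With ZIP code included, respondents 3 and 4 are each unique.
at_risk = flag_unique_cases(respondents, ["zip", "race", "income_band"])
print([record["id"] for record in at_risk])

# Dropping ZIP code removes one unique case, but respondent 3 still stands out.
still_at_risk = flag_unique_cases(respondents, ["race", "income_band"])
print([record["id"] for record in still_at_risk])
```

A check like this only covers the variables you list; it says nothing about information an outsider could bring from other sources, so it should be read as a first screen rather than proof of anonymity.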

What can I do to anonymize my data?

“If, in the judgment of the principal investigator, a variable might act as an indirect identifier (and thus could be used to compromise the confidentiality of a research subject), the investigator should treat that variable in a special manner when preparing a public-use dataset. Commonly used types of treatment are as follows:

Removal -- eliminating the variable from the dataset entirely.
Top-coding -- restricting the upper range of a variable.
Collapsing and/or combining variables -- combining values of a single variable or merging data recorded in two or more variables into a new summary variable.
Sampling -- rather than providing all of the original data, releasing a random sample of sufficient size to yield reasonable inferences.
Swapping -- matching unique cases on the indirect identifier, then exchanging the values of key variables between the cases. This retains the analytic utility and covariate structure of the dataset while protecting subject confidentiality. Swapping is a service that archives may offer to limit disclosure risk. (For more in-depth discussion of this technique, see O’Rourke, 2003 and 2006.)
Disturbing -- adding random variation or stochastic error to the variable. This retains the statistical properties between the variable and its covariates, while preventing someone from using the variable as a means for linking records.”
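
The sketch below is not part of the ICPSR guide. It shows, under simplifying assumptions, what two of the treatments listed above (sampling and disturbing) might look like in Python. The function names, the sampling fraction, and the noise scale are hypothetical choices made for illustration. Note that adding independent zero-mean noise, as done here, keeps the mean roughly unchanged but inflates the variance, so it is a cruder method than the covariate-preserving disturbance the guide describes.

```python
import random


def release_sample(records, fraction, seed=None):
    """Sampling: return a simple random sample containing roughly
    `fraction` of the records, drawn without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    sample_size = min(len(records), max(1, round(len(records) * fraction)))
    return rng.sample(records, sample_size)


def disturb(values, noise_sd, seed=None):
    """Disturbing: add zero-mean Gaussian noise with standard deviation
    `noise_sd` to each numeric value."""
    rng = random.Random(seed)
    return [value + rng.gauss(0, noise_sd) for value in values]


# Hypothetical ages; release half of them with small random perturbations.
ages = [34, 51, 29, 62, 45, 38, 57, 41, 33, 48]
public_ages = disturb(release_sample(ages, fraction=0.5, seed=7), noise_sd=2.0, seed=7)
print(public_ages)
```

Fixing `seed` makes a release reproducible for the data custodian; the seed itself should not be published, since it could help someone reverse the perturbation.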

Example

“An example from a national survey of physicians (containing many details of each doctor’s practice patterns, background, and personal characteristics) illustrates some of these categories of treatment of variables to protect confidentiality. Variables identifying the school from which the physician’s medical degree was obtained and the year graduated should probably be removed entirely, due to the ubiquity of publicly available rosters of college and university graduates. The state of residence of the physician could be bracketed into a new 'Region' variable (substituting more general geographic categories such as 'East,' 'South,' 'Midwest,' and 'West'). The upper end of the range of the 'Physician’s Income' variable could be top-coded (e.g., '$150,000 or More') to avoid identifying the most highly paid individuals. Finally, a series of variables documenting the responding physician’s certification in several medical specialties could be collapsed into a summary indicator (with new categories such as 'Surgery,' 'Pediatrics,' 'Internal Medicine,' and 'Two or More Specialties').”
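
The sketch below is not part of the ICPSR guide. It restates the physician example in Python to show how removal, bracketing, top-coding, and collapsing might be applied to a single record. The field names, the partial state-to-region lookup, and the "None Recorded" category are assumptions made for illustration; the income cap of 150,000 and the region and specialty labels follow the example above.

```python
# Hypothetical, deliberately incomplete lookup; a real project would cover every state.
REGION_BY_STATE = {"NY": "East", "GA": "South", "OH": "Midwest", "CA": "West"}

# Incomes at or above this value are all reported as the cap ("$150,000 or More").
INCOME_TOP_CODE = 150_000

# Hypothetical certification flags mapped to the summary categories in the example.
SPECIALTY_FLAGS = {
    "cert_surgery": "Surgery",
    "cert_pediatrics": "Pediatrics",
    "cert_internal_medicine": "Internal Medicine",
}


def prepare_public_record(record):
    """Build a public-use version of one survey record."""
    certified = [label for flag, label in SPECIALTY_FLAGS.items() if record.get(flag)]
    if len(certified) >= 2:
        specialty = "Two or More Specialties"   # collapsing
    elif certified:
        specialty = certified[0]
    else:
        specialty = "None Recorded"             # assumed category, not in the guide

    # Removal: medical school and graduation year are simply not copied over.
    return {
        "region": REGION_BY_STATE.get(record["state"], "Unknown"),         # bracketing
        "income_top_coded": min(record["income"], INCOME_TOP_CODE),        # top-coding
        "specialty": specialty,
    }


# Hypothetical respondent, invented for illustration only.
physician = {
    "medical_school": "Example University",
    "graduation_year": 1998,
    "state": "OH",
    "income": 210_000,
    "cert_surgery": True,
    "cert_pediatrics": False,
    "cert_internal_medicine": True,
}

print(prepare_public_record(physician))
# → {'region': 'Midwest', 'income_top_coded': 150000, 'specialty': 'Two or More Specialties'}
```

Building the public record from an explicit list of allowed fields, rather than deleting fields from a copy of the original, means that any variable added to the survey later is excluded by default.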

Reproduced from:

Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (5th ed.). Ann Arbor, MI. ISBN 978-0-89138-800-5. https://doi.org/10.3886/GuideToSocialScienceDataPreparationAndArchiving