Copyright Guide: Licensing research data at Aalto University

Why license research data?

While practice varies from discipline to discipline, there is an increasing trend towards the planned release of research data. The need for data licensing arises directly from such releases, so the first question to ask is why research data should be released at all.

A significant number of research funders now require that data produced in the course of the research they fund be made available for other researchers to discover, examine and build upon. The rationale given is that opening up the data allows new knowledge to be discovered through comparative studies, data mining and so on; it also allows greater scrutiny of how research conclusions have been reached, potentially driving up research quality. Many journals require that authors deposit their supporting data either with the journal itself or with a recognised data repository.

There are many additional reasons why releasing data can be in a researcher’s interests. The discipline of working up data for eventual release helps in ensuring that a full and clear record is preserved of how the conclusions were reached from the data, protecting the researcher from potential challenges. A culture of openness deters fraud, encourages learning from mistakes as well as from successes, and breaks down barriers to interdisciplinary and ‘citizen science’ research. The availability of the data, alongside associated tools and protocols, increases the efficiency of research by reducing both data collection costs and the possibility of duplication. It also has the potential to increase the impact of the research, not only academically, but also economically and socially.

Merely releasing data without making clear their terms of use can be somewhat counter-productive, though. The default legal position on how data may be used in any given context is hard to untangle, not least because different jurisdictions apply different standards of creativity, skill, labour and expense when judging whether copyright or similar rights pertain. The situation is complicated by the fact that different aspects of a database – field values (i.e. the data themselves), field names, the structure and data model for the database, data entry interfaces, visualisations and reports derived from the data – may be treated quite differently.

Within the EU, the act of compiling a database attracts copyright insofar as the compiler has exercised intellectual judgement in selecting or arranging the data. In Finland, this database copyright vests in the author or authors, that is, the individual researchers, if they are doing independent research. If they are not doing independent research, the copyright in the database vests by law in the university. Aalto University uses an appendix to the employment agreement. Under this appendix, the copyright in databases, together with the copyright and other intellectual property rights that result from a research project receiving outside funding, is transferred to the university.

The EU also has a separate sui generis database right that applies to the contents of a database where a substantial investment was made to obtain, verify or present them. By law, the sui generis database right always belongs to the employer. In the Nordic countries, databases can also be protected as catalogues; this catalogue right is also owned by the employer.

Another source of confusion is the variation between jurisdictions in what can be done with copyright material. While the Berne Convention provides a level of consistency among its signatories, there are still variations in the exemptions that each jurisdiction provides, and subtle differences concerning, for example, which acts count as copying and what constitutes an insubstantial use or extract of a work. The latter is an important point because the exemptions to copyright and database rights permit a dataset to be compiled from insubstantial extracts from a number of other datasets, but whether the extracts are indeed insubstantial might be contested.

With all these complexities and ambiguities surrounding the rights of database compilers in national laws, re-users need a license and Rules of the Road in order to obtain clear guidance from compilers on what they are allowed to do with the research data.

Licensing concepts

The ways of communicating permissions to potential re-users of data are licenses and waivers. A license is a legal instrument for a rights holder to permit a second party to do things that would otherwise infringe on the rights held. Only the rights holder can grant a license; it is therefore imperative that the ownership of intellectual property rights (IPR) in the data is established before any licensing takes place. A waiver is a legal instrument for giving up one’s rights to a resource.

Licenses grant permissions on condition that certain terms are met. Three conditions commonly found in licenses are attribution, copyleft, and non-commerciality.

  • An attribution requirement means that the licensor must be given due credit for the work when it is distributed, displayed, performed, or used to derive a new work.
  • A copyleft or share-alike requirement means that any new works derived from the licensed one must be released under the same license, and only that license.
  • The intent of a non-commercial license is to prevent the licensee from exploiting the work commercially. Such licenses are often used as part of a dual-licensing arrangement, where the alternative license allows commercial uses but requires payment to the licensor.

While these all have their uses, they can cause problems in the context of datasets:

Datasets are particularly prone to attribution stacking, where a derivative work must acknowledge all contributors to each work from which it is derived, no matter how distantly. If a dataset is at the end of a long chain of derivations, or if large teams of contributors were involved, the list of credits might well be considered too unwieldy. The problem is magnified if different sets of contributors have to be credited in different ways, especially if automated methods are used to assemble the dataset: some of the benefits of automation are lost if attribution conditions have to be inspected manually. Licensors can tackle this problem by using the CC0 waiver (explained below), which does not require attribution, and by giving Rules of the Road recommendations on how to state the source of the research data.
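Attribution stacking can be illustrated with a minimal Python sketch. The dataset names, contributor names and the `required_credits` helper are all hypothetical, invented only to show how the credit list grows as a derivation chain gets longer.

```python
from typing import Dict, List, Set

# Hypothetical derivation graph: each dataset maps to the datasets it was derived from.
DERIVED_FROM: Dict[str, List[str]] = {
    "survey_2010": [],
    "survey_2012": [],
    "merged_surveys": ["survey_2010", "survey_2012"],
    "cleaned_merge": ["merged_surveys"],
    "regional_model": ["cleaned_merge"],
}

# Hypothetical contributor lists for each dataset.
CONTRIBUTORS: Dict[str, List[str]] = {
    "survey_2010": ["A. Author", "B. Builder"],
    "survey_2012": ["C. Collector"],
    "merged_surveys": ["D. Developer"],
    "cleaned_merge": ["E. Editor"],
    "regional_model": ["F. Finisher"],
}


def required_credits(dataset: str) -> List[str]:
    """Collect every contributor along the whole derivation chain."""
    seen: Set[str] = set()
    credits: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a dataset reached by two routes is only counted once
        seen.add(current)
        credits.update(CONTRIBUTORS[current])
        stack.extend(DERIVED_FROM[current])
    return sorted(credits)
```

In this toy graph, `required_credits("regional_model")` has to name all six contributors, although only one of them worked on that dataset directly.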

The problem with copyleft or share-alike licenses is that they prevent the licensed data from being combined with data released under a different license: the derived dataset would not be able to satisfy both sets of license terms simultaneously. Some copyleft licenses, however, show a small amount of flexibility by allowing derivative works to be released under a compatible license, that is, one that applies approximately the same conditions.
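The combination problem can be sketched in Python under a deliberately simplified model. The `SHARE_ALIKE` set and the `license_for_combination` helper are hypothetical, and the sketch ignores compatible-license provisions and all conditions other than share alike.

```python
from typing import Iterable, Optional

# Simplified, hypothetical model: the licenses treated here as share-alike.
SHARE_ALIKE = {"CC BY-SA 4.0", "CC BY-NC-SA 4.0"}


def license_for_combination(source_licenses: Iterable[str]) -> Optional[str]:
    """Return the license a share-alike source forces on the combined dataset.

    Returns None when no source carries a share-alike condition.
    Raises ValueError when two different share-alike licenses meet,
    because the result cannot be released under both at once.
    """
    forced = {lic for lic in source_licenses if lic in SHARE_ALIKE}
    if len(forced) > 1:
        raise ValueError(f"conflicting share-alike terms: {sorted(forced)}")
    return next(iter(forced), None)
```

Combining a CC BY 4.0 source with a CC BY-SA 4.0 source returns `"CC BY-SA 4.0"` in this model, while mixing the two share-alike licenses raises an error.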

Non-commercial licenses reduce the ways datasets can be used because it is ambiguous what constitutes a commercial use. The EU, the G8 countries and the Finnish government want the opening up of data to create new businesses, growth and employment; restricting datasets to non-commercial use works against these goals. However, if dual licensing is the goal, it can be achieved by allowing use under a non-commercial license and licensing commercial uses separately.

Creative Commons standard licenses

Below is a selection of standard licenses available, along with reasons for and against using each one. Please note that these licenses can be terminated only by expiry of the licensor’s IPR or, for a particular licensee, through breach of terms.

Creative Commons is a non-profit corporation set up in 2001 for the purpose of producing simple licenses for creative works. These licenses give the creators of such works finer-grained control over how they may be used than simply declaring them public domain or reserving all rights. As well as the legal text, the licenses all have quick, clear summaries and a canonical URL for use in HTML, RDF and other code. A rights expression language is also provided for use with RDF. While originally aimed at works such as music, images and video, Creative Commons licenses are widely used for most forms of original content, including research data. Over one billion documents have been licensed using Creative Commons licenses.
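As an illustration of using a canonical license URL in HTML, the following minimal Python sketch builds a link carrying the `rel="license"` attribute. The `license_link` helper is hypothetical; the URL shown is the canonical address of CC BY 4.0.

```python
from html import escape


def license_link(license_url: str, license_name: str) -> str:
    """Build an HTML anchor that marks the linked page as the work's license."""
    return (
        f'<a rel="license" href="{escape(license_url, quote=True)}">'
        f"{escape(license_name)}</a>"
    )


print(license_link(
    "https://creativecommons.org/licenses/by/4.0/",
    "Creative Commons Attribution 4.0 International",
))
```

A snippet like this could be placed on a dataset's landing page so that both people and software can find the terms of use.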

There are six main Creative Commons licenses. Each license includes the Attribution condition. There are three other conditions that licensors can add, and the various possible combinations produce the six licenses. Using just the Attribution condition is known as the CC BY license.

There is a Non-Commercial condition, where commercial is defined as ‘primarily intended for or directed toward commercial advantage or monetary compensation’.

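The Share Alike condition inserts a strong copyleft clause into the license. Finally, the No Derivatives condition (in version 4.0) allows the licensee to make derivative works for private use, but prevents the licensee from sharing them. The six permutations are therefore:

  • CC BY (Attribution)
  • CC BY-SA (Attribution, Share Alike)
  • CC BY-ND (Attribution, No Derivatives)
  • CC BY-NC (Attribution, Non-Commercial)
  • CC BY-NC-SA (Attribution, Non-Commercial, Share Alike)
  • CC BY-NC-ND (Attribution, Non-Commercial, No Derivatives)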
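The way the optional conditions combine can be sketched in a few lines of Python. Share Alike and No Derivatives are never combined in a single license, which is why three optional conditions yield six licenses rather than eight. The `cc_licenses` function name is hypothetical.

```python
from itertools import product
from typing import List


def cc_licenses() -> List[str]:
    """Enumerate the six main Creative Commons licenses.

    Attribution (BY) is always present. Non-Commercial (NC) is optional,
    and at most one of Share Alike (SA) or No Derivatives (ND) may be added.
    """
    licenses = []
    for non_commercial, derivative_rule in product(["", "NC"], ["", "SA", "ND"]):
        parts = ["BY"] + [p for p in (non_commercial, derivative_rule) if p]
        licenses.append("CC " + "-".join(parts))
    return licenses


for name in cc_licenses():
    print(name)
```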

Versions of the licenses prior to version 4.0 present problems. The most significant is that the older versions do not cover the sui generis database rights in force in the European Union. The version 4.0 licenses, however, do explicitly include sui generis database rights. The 4.0 versions were translated into Finnish by Aalto University project Services (with allocated Ministry of Education funding), as Aalto University is the affiliate organization of Creative Commons in Finland. The older Creative Commons licenses were originally created at Stanford and Harvard, and universities are usually the affiliate organizations in each jurisdiction.

The 4.0 versions were created in global co-operation to ensure that the licenses are well adapted to different jurisdictions. The legal teams involved in this co-creation included the Aalto University legal team.

The CC BY 4.0 license was adopted by the Ministry of Finance as the legal tool for opening publicly funded data (JHS 189) and by the Ministry of Education as the license recommended for open access publishing of research data.

The licenses do not distinguish between using data as part of a new collection or database and using them to generate content (graphs, models, maps, etc.). This means the Share Alike and No Derivatives conditions greatly reduce the possibilities for using the data. The No Derivatives condition disallows most substantive types of reuse and should therefore be avoided.

In addition to the licenses, Creative Commons provides the waiver CC0, a tool for waiving all rights without any terms; for example, attribution is not required. CC0 should be used with Rules of the Road, in which the authors of a dataset clarify how they want to be cited, although citing the authors is not a condition of reuse.

Creative Commons licensing for FAIR data

  • Wide societal impact is achieved by using CC BY 4.0, the license that enables legal interoperability for research data. This license is recommended for FAIR data.
  • The NC (non-commercial) condition can be used if the goal is dual licensing, meaning keeping open the possibility of commercial licensing later. The license text must include the URL of a website where further licenses can be obtained. For databases and software owned by Aalto University, commercial licenses are negotiated by Aalto University Innovations Services.

The following terms can be problematic for the FAIR data goals of interoperability and reusability:

  • The SA (share alike) condition reduces interoperability and forces users always to license under the same license. It can be useful in building a community effort: such copyleft licenses are used to create software developer communities. For example, the GNU General Public License (GPL) is a copyleft license, and it is used for the Linux kernel.
  • The ND (no derivatives) condition severely restricts reuse, but can in some cases be used with restricted-access data, when use of the data is already restricted because they contain personal data.

This text by senior legal counsel Maria Rehbinder is a derivative work based on: Ball, A. (2014). ‘How to License Research Data’. DCC How-to Guides. Edinburgh: Digital Curation Centre, made available under the Creative Commons Attribution 4.0 International license. Changes were made to reflect Aalto University, Finland, and commercial licensing.