I Have A Cool Dataset, Can I Publish It?

Added by Maarja Toots 2019/04/30

You have scraped some interesting data, downloaded it, perhaps modified it a bit or combined it with other data, and what you now have is a really cool dataset that you would love to share with others. So here’s the question: can an organization or an individual share a dataset on opendata.riik.ee that is based on processing or combining someone else’s data?

The issue is real: not all interesting datasets are (yet) available on the portal, and some are not even on the data holder’s website. Can an eager user help a data holder publish their data? Under what conditions? As an example, the need for more clarity recently came up in relation to election data, which the data holder has not released under an open license but in which there seems to be considerable public interest (see the GitHub conversation).

Since no rules or requirements exist to regulate the publication of secondary data, the participants in the legal issues workshop at the Open Data Forum of 18 April took the first steps in formulating some as a basis for further discussion. After a heated debate, the following proposals were made:

Question: Can secondary data be published on the portal?

The short answer: Yes.

The correct answer: Yes, but under certain conditions.

What conditions should be met?

  1. The most important: the content and origin of the data should be transparent for users.
  2. Transparency can be created by providing proper metadata. If the metadata is detailed and correct, users can assess the credibility of the data. As a rule, the more detailed the metadata, the more trustworthy the data (unless the data holder has issued an explicit warning about possible errors in the data, which, of course, would be nice of them).
  3. Requirements on data publishers should not be so strict as to make them lose interest in publishing the data: publishers should be required to do as little as possible and as much as necessary.
  4. In order to find a good balance, we should agree on standard compulsory metadata that any dataset should always have. In addition to that, it is advisable to agree on a set of recommended but not mandatory metadata which could help users assess the quality of the dataset. Any metadata should always follow an agreed structure.
  5. The owner of the original data should only be responsible for the accuracy of the metadata provided with the original dataset. The owner of the original dataset is not liable for any indirect damages caused by users of the data or its derivatives.

What metadata should be mandatory?

  1. Data source and publisher’s name. This includes the source of the secondary dataset and the original data that the dataset is derived from.
  2. Data collection method.
  3. Data processing method.
  4. Time of collecting the data.
  5. Time period covered in the data.
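
As a rough illustration, the five mandatory items above could be captured in a small structured record and checked before a dataset is uploaded. The sketch below is hypothetical: the class `SecondaryDatasetMetadata`, its field names and the `missing_fields` helper are assumptions made for this example, not an agreed portal schema.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class SecondaryDatasetMetadata:
    """Hypothetical record for the mandatory metadata proposed above."""
    publisher_name: Optional[str] = None      # who publishes the secondary dataset
    original_source: Optional[str] = None     # where the original data came from
    collection_method: Optional[str] = None   # e.g. "scraped from a public web page"
    processing_method: Optional[str] = None   # e.g. "combined with another dataset"
    collected_on: Optional[date] = None       # time of collecting the data
    period_start: Optional[date] = None       # start of the period the data covers
    period_end: Optional[date] = None         # end of the period the data covers


def missing_fields(meta: SecondaryDatasetMetadata) -> List[str]:
    """Return the names of mandatory fields that are unset or blank."""
    missing = []
    for f in fields(meta):
        value = getattr(meta, f.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing


# Example: a record with some items still to be filled in (values are made up).
example = SecondaryDatasetMetadata(
    publisher_name="Example Publisher",
    original_source="Example data holder website",
    collection_method="scraped",
    collected_on=date(2019, 4, 18),
)
print(missing_fields(example))
```

A check like this keeps the burden on publishers low: it only reports what is missing and leaves the wording of each item up to the publisher.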

However, this may not be enough. In the discussion, participants raised a familiar problem: you have built a brilliant service on someone else’s data, you wake up one morning, open your computer and… it’s all broken! The cause may be a change in the data collection method, update frequency or the composition of the data, or perhaps a change in a process, regulation or law due to which the data is no longer available in the same format. Such situations may be more common if the data is collected and published not because of a long-term legal obligation but on the data provider’s own initiative. In other words, it is crucial for the provider of an open-data-driven service to know whether and for how long the data they use will continue to be published. It would therefore be extremely helpful if the data holder gave advance notice of any changes in the availability of the data. To this end, the metadata could also include information on:

  1. The frequency of release or the policy for updating the data (as a small remark, a lot of the datasets currently in the portal provide no information on their update frequency).
  2. The storage period of the data, if set by law (e.g. the time after which the data would be archived).
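
If the update frequency were declared in the metadata, a service built on the data could check for itself whether a dataset looks overdue. The following is a minimal sketch under stated assumptions: the function `is_overdue` and its parameters `last_updated`, `update_interval_days` and `grace_days` are hypothetical names for this example and do not correspond to existing portal fields.

```python
from datetime import date, timedelta


def is_overdue(
    last_updated: date,
    update_interval_days: int,
    today: date,
    grace_days: int = 0,
) -> bool:
    """Return True if the next expected release (plus grace period) has passed."""
    expected = last_updated + timedelta(days=update_interval_days + grace_days)
    return today > expected


# Example: a dataset declared as updated every 30 days, last updated on 1 January 2019.
print(is_overdue(date(2019, 1, 1), 30, today=date(2019, 3, 1)))
```

Without a declared frequency, no such check is possible, which is one practical reason to include this item in the metadata.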

All of that may be feasible if the data is published by the data owner. However, where should this information come from for data that is scraped and uploaded by another party? What if the holder of the original data does not wish the data to be published on the portal? Should publication be allowed only for data that the original data holder has provided with a clear license? What if the license is not specified?

These and many other questions remain open, so this, dear readers, is where we invite you to join the discussion on GitHub! Based on your input, guidelines will be formulated to outline the data publisher's obligations and set clearer responsibilities for data holders, users and any intermediaries.

The Open Data Portal's content is created as part of the EU structural funds' programme 'Raising Public Awareness about the Information Society' financed through the EU Regional Development Fund. The project is implemented by Open Knowledge Estonia.
