Data organisation and documentation

Choosing the right format

As technology changes, researchers must plan for both hardware and software obsolescence and consider the longevity of their file format choices to ensure long-term readability and access.

The file formats most likely to be accessible in the future have the following characteristics:

  • Non-proprietary
  • Open and documented
  • Widely used by the research community
  • Standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed

Examples of preferred FAIR archive formats for preservation are listed below:

  • Containers: TAR, GZIP, ZIP
  • Databases: XML, CSV, JSON
  • Geospatial: SHP, DBF, GeoTIFF, NetCDF
  • Video: MPEG, AVI, MXF, MKV
  • Sounds: WAVE, AIFF, MP3, MXF, FLAC
  • Statistics: DTA, POR, SAS, SAV
  • Images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP, SVG
  • Tabular data: CSV, TXT
  • Text: XML, PDF/A, HTML, JSON, TXT, RTF
  • Web archive: WARC

Consider migrating your data to a format with the above characteristics, in addition to keeping a copy in the original software format. Be aware that in some cases the migration of data to an open format may result in the loss of data/metadata.
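
The migration advice above can be illustrated with a minimal Python sketch. It assumes the source is a delimited text file in a legacy encoding; the function name `migrate_to_utf8_csv` and its parameters are hypothetical. It re-reads the migrated copy to confirm that no cells were lost. A check of this kind only detects structural loss, not metadata (variable labels, data types) that the original software format may have carried.

```python
import csv
import io


def migrate_to_utf8_csv(raw, source_encoding, delimiter):
    """Return (csv_text, intact) for a delimited file given as bytes.

    csv_text is comma-separated text ready to be written out as UTF-8;
    intact is True when the migrated copy reads back identical to the source.
    """
    text = raw.decode(source_encoding)
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    csv_text = out.getvalue()

    # Round-trip check: re-read the migrated text and compare cell by cell.
    reread = list(csv.reader(io.StringIO(csv_text, newline="")))
    return csv_text, reread == rows
```

The returned text can then be saved with `open(path, "w", encoding="utf-8", newline="")`, alongside the untouched original file.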

If you deposit your data in a repository, your files may be migrated to newer formats so that they can be used by future researchers.

File structure

File names, folder structure and version control should make the data easy to find and understand. It is therefore very important to plan them well.

Recommendations for naming files:

  • Give files short and relevant names.
  • Do not use special characters: ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " |.
  • Use underscores instead of spaces.
  • Be consistent with the conventions you choose, such as uppercase or lowercase and the date format (YYYY-MM-DD or YYYY-MM).
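
As an illustration of these naming recommendations, the following Python sketch cleans a proposed file name and adds an ISO date prefix. The function names `clean_name` and `dated_name` are hypothetical, and the rule applied is an assumption that is stricter than the list above: it keeps only unaccented letters, digits, underscores, hyphens and dots.

```python
import re
from datetime import date

# Anything outside letters, digits, underscore, dot and hyphen is dropped.
# This covers the special characters listed above (assumed rule, stricter).
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.-]")


def clean_name(name):
    """Replace spaces with underscores and strip disallowed characters."""
    cleaned = name.strip().replace(" ", "_")
    cleaned = _DISALLOWED.sub("", cleaned)
    # Collapse runs of underscores left behind by removed characters.
    return re.sub(r"_+", "_", cleaned)


def dated_name(label, day, extension):
    """Build a name such as 2024-03-05_field_notes.txt (YYYY-MM-DD first)."""
    return f"{day.isoformat()}_{clean_name(label)}.{extension}"
```

Putting the date first in YYYY-MM-DD form means that an alphabetical listing of the files is also a chronological one.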

More recommendations are available in this document prepared by the Library Service of the University of A Coruña.

Recommendations for organising folders:

  • Consider the best hierarchy for files: deep or shallow;
  • Organise folders and files systematically;
  • Limit the folder hierarchy to three or four levels deep;
  • Separate completed work from work in progress.
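
A folder plan can be checked against these recommendations before any files are moved. The Python sketch below is only an illustration: it reads the limit of three or four as a nesting depth, which is an assumption, and the function name `too_deep` is hypothetical. It works on relative paths written with forward slashes.

```python
from pathlib import PurePosixPath


def too_deep(paths, max_levels=4):
    """Return the paths whose folder nesting exceeds max_levels."""
    flagged = []
    for path in paths:
        # parent.parts holds the folders only, without the file name itself.
        folders = PurePosixPath(path).parent.parts
        if len(folders) > max_levels:
            flagged.append(path)
    return flagged
```

For example, `project/data/raw/2024/site_a/readings.csv` sits five folders deep and would be flagged, while `project/docs/readme.txt` would not.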
 
Find out more: UK Data Service
 

Version control recommendations:

  • If there are several versions, name them by number (e.g. v01, v02, etc.);
  • The final version can be called FINAL;
  • Determine how many and which versions of a file are kept, and for how long;
  • Keep a record of the changes made to a file when a new version is created;
  • Keep track of where files are located if they're stored in multiple locations;
  • Choose a single location for important or final versions.
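
The numbering scheme recommended above can be applied automatically. The following Python sketch assumes file names of the form stem_vNN.ext (for example report_v02.docx); both this pattern and the function name `next_version` are assumptions made for illustration.

```python
import re

# Assumed naming pattern: <stem>_v<number><extension>, e.g. report_v02.docx
_VERSION = re.compile(r"^(?P<stem>.+)_v(?P<num>\d+)(?P<ext>\.[^.]+)$")


def next_version(existing, stem, ext):
    """Return the next zero-padded version name for the given stem."""
    highest = 0
    for name in existing:
        match = _VERSION.match(name)
        if match and match["stem"] == stem and match["ext"] == ext:
            highest = max(highest, int(match["num"]))
    return f"{stem}_v{highest + 1:02d}{ext}"
```

Zero-padding (v01 rather than v1) keeps the versions in the right order when the folder is sorted by name.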
 
Find out more: UK Data Service
 
Documenting the data

To make the data easy to reuse and replicate, it is essential to document it by adding a readme.txt file. The file should contain the necessary information, such as description, methodology, coverage, rights of use, and privacy. Several guides and templates are available to help create the file, including the readme.txt template (created by the Economics and Business Library, UDC) and the Guide to Creating a Readme File from Cornell University.
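
As a rough illustration, a readme.txt covering the sections named above could be assembled with a short Python function. The layout and the names `README_FIELDS` and `build_readme` are hypothetical; this sketch does not reproduce the UDC template or the Cornell guide.

```python
# Sections taken from the paragraph above; the layout itself is an assumption.
README_FIELDS = ("Description", "Methodology", "Coverage", "Rights of use", "Privacy")


def build_readme(title, entries):
    """Return readme text, refusing to proceed if a section is empty."""
    missing = [f for f in README_FIELDS if not entries.get(f, "").strip()]
    if missing:
        raise ValueError("readme is missing: " + ", ".join(missing))

    lines = [title, "=" * len(title), ""]
    for field in README_FIELDS:
        lines += [field.upper(), entries[field].strip(), ""]
    return "\n".join(lines)
```

The result can be saved next to the data with `open("readme.txt", "w", encoding="utf-8")`. Raising an error on empty sections is a design choice that keeps incomplete documentation from being deposited unnoticed.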
 

In addition to this explanatory documentation, the data itself must be described so that information about each item can be identified, organised and collected. This facilitates access to and reuse of the data in other systems and supports its long-term preservation. The description is provided through metadata.

Several metadata standards are currently available for describing data. Each knowledge area typically has its own metadata standards and associated tools, so it is essential to select the most appropriate ones. The Metadata Standards Catalog, the DCC and FAIRsharing.org each offer a good selection.
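
To give an idea of what a metadata record looks like in practice, the Python sketch below builds one using element names from the Dublin Core Metadata Element Set. Dublin Core is chosen here purely as an illustration and is not a recommendation drawn from this guide; the standard suited to a given discipline may differ. The function name `metadata_record` is hypothetical.

```python
import json

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def metadata_record(fields):
    """Serialise a dict of Dublin Core elements as JSON text."""
    unknown = sorted(set(fields) - DC_ELEMENTS)
    if unknown:
        raise ValueError("not Dublin Core elements: " + ", ".join(unknown))
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(fields, indent=2, ensure_ascii=False, sort_keys=True)
```

Rejecting unknown element names catches typing mistakes such as "author" in place of "creator" before the record is shared.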