Code and Data Review at JCLS


General remarks

In JCLS’s review process, there is a basic and an extended code and data review.

  • The basic review is mandatory and is part of the normal reviewing process. Basic code quality is a requirement for acceptance. The goal of the basic code review is to make sure that code and data are readily available and can be inspected, beyond what is possible in the text or appendices of a paper.
  • The extended review is an optional step that authors can request for their article once it has been accepted in principle (e.g. for the annual conference). The goal of the extended code review is to help authors bring the code and data repositories accompanying their paper to a level where their work becomes truly replicable. This means making it as easy as possible for others to repeat the entire data analysis process using the data and code provided by the author(s) of a paper. If the author(s) of a paper under review decide to take advantage of the extended code and data review, a reviewer dedicated specifically to code reviewing will be assigned to the paper and will make proposals for how to revise the code and data repository.

Upon successful completion of the review, the corresponding code and data repository (CDR for short) will receive a certification notice and a badge attesting to the successful (basic or extended) code and data review.

Review Criteria

Each criterion below is marked as pertinent to the basic code review (basic), to the extended code review only (extended), or as optional (optional). Regardless of the reviewing option they choose, all authors are invited to consider the criteria when designing their code and data.

Findability and long-term availability

  • URL (basic). The URL of the CDR has been correctly included in the paper. (Before acceptance, this repository is anonymized. After acceptance, the regular repository is used.)
  • Archive (extended). There is an archive copy of the CDR, e.g. on Zenodo or Figshare, and its DOI is also mentioned in the paper. (This is checked after acceptance.)

Data availability

  • Data (basic). The key datasets, scripts and outputs are available in the CDR, or are freely available from places linked in the CDR.
  • Data (extended). All data required for running the scripts is provided in the CDR itself and can be found in dedicated folders, or is freely available from places linked in the CDR. Similarly, all data produced by the scripts is available.
  • Additional resources (extended). If any additional datasets, tools or other resources are required for running the code in the CDR, their URLs and (if available) DOIs are mentioned in the top-level README. (This is checked after acceptance.)
  • Large files (extended). If data cannot be shared within the CDR itself because the files are too large, solutions like Git Large File Storage (Git LFS) on GitHub or an external deposit location, such as Zenodo or Figshare, should be used. Links to the external files need to be provided in the top-level README.
  • Copyright issues (extended). If data cannot be shared because of copyright issues, two solutions are feasible. Either the materials are transformed into a derived format that is sufficiently information-rich for the analysis to still work (e.g. a document-term matrix for stylometric experiments), or the full copyrighted materials are deposited on Zenodo under the “restricted access” modalities.
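
As an illustration of the derived format mentioned under Copyright issues, the following is a minimal sketch of how a document-term matrix could be built from full texts using only the Python standard library. The function names, the simple word tokenizer and the CSV layout are assumptions made for this example and are not prescribed by JCLS; a real project may prefer an established library and a tokenizer suited to its language.

```python
from collections import Counter
import csv
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_document_term_matrix(documents):
    """Build a document-term matrix from a dict mapping document id to full text.

    Returns (vocabulary, rows): a sorted list of terms and a dict mapping each
    document id to its list of raw term counts, in vocabulary order.
    Word order is discarded, so only the counts are shared, not the running text.
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {doc_id: [c[term] for term in vocabulary] for doc_id, c in counts.items()}
    return vocabulary, rows


def write_matrix(vocabulary, rows, handle):
    """Write the matrix as CSV: one header row, then one row per document."""
    writer = csv.writer(handle)
    writer.writerow(["document"] + vocabulary)
    for doc_id, values in rows.items():
        writer.writerow([doc_id] + values)
```

Only the resulting CSV file would then be placed in the CDR, while the README states how the matrix was derived and from which (non-shareable) sources.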

Documentation

  • Top-level README (basic). There is a README at the top level of the repo that clearly states the purpose, structure and contents of the repository, including (after acceptance) a reference to the article the repository pertains to.
  • Top-level README (extended). The README explains how to proceed in order to reproduce the key results of the paper and includes, if applicable, a statement of any limitations to the replicability of the code.
  • Explanatory text in scripts (extended). Each script in the repository has a short description of its purpose, strategy and requirements. Each function in each script has a short description of its purpose. Use docstrings and comments as appropriate.
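
The criterion on explanatory text in scripts could be met along the following lines. This is only a sketch: the script content and the function name are invented for illustration, and the exact wording and layout of docstrings are left to the authors.

```python
"""Compute relative word frequencies from raw counts.

Purpose: turn raw term counts into relative frequencies.
Strategy: divide each count by the total number of tokens.
Requirements: Python 3, standard library only.
"""


def relative_frequencies(counts):
    """Return each term's share of the total count as a dict of floats."""
    total = sum(counts.values())
    if total == 0:
        # An empty input yields an empty result instead of a division by zero.
        return {}
    return {term: count / total for term, count in counts.items()}
```

The docstring at the top covers the purpose, strategy and requirements of the script, and the one-line docstring covers the purpose of the function; comments are reserved for decisions that are not obvious from the code itself.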

Interoperability

  • Requirements (extended). The requirements for the code to run are clearly stated in the README and documented in a machine-readable manner. For example, a requirements.txt file is used.
  • Paths (extended). The code is written in such a way as to allow executing it on computers running different operating systems with minimal adjustments. For example, relative paths and ‘os.path.join’ should be used for file paths in Python.
  • Encoding (extended). The code is written in such a way as to manage file encoding issues as well as possible. For instance, files are encoded in Unicode / UTF-8 whenever possible and the code makes it explicit that files are to be opened with this encoding.
  • Packaging (optional). The code and data are made available either using a live environment (such as Binder) or by packaging them up entirely (e.g. as a Docker image including all dependencies and requirements).
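
The Paths and Encoding criteria can be illustrated with a short Python sketch. The helper names below are hypothetical; the point is only that paths are assembled with ‘os.path.join’ rather than hard-coded separators, and that the encoding is passed explicitly whenever a file is opened.

```python
import os


def build_path(*parts):
    """Join path components using the separator of the current operating system."""
    return os.path.join(*parts)


def read_text(path):
    """Read a text file, stating the expected encoding explicitly."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def write_text(path, text):
    """Write a text file as UTF-8, regardless of the platform default encoding."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

Without the explicit encoding argument, Python may fall back on a platform-dependent default, so the same script can behave differently on different machines.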

Replication support

  • Paper and repository (basic). A mapping from the tables and/or figures in the paper to the paths of the corresponding matrices, files and/or figures in the CDR is provided in the README. (This mapping will need to be checked again at the copyediting stage.)
  • Completeness (extended). Based solely on the input data provided in the repository (or in other places linked to in the README) and on the code, all results available in the repository can be generated again.
  • Parameters (extended). Any settings that users performing a replication of the research will need to adapt in order to run the code (e.g. base paths), or may want to adapt in order to test the influence of certain parameters (e.g. clustering or classification settings) are separated out from the body of the code as variables and clearly marked as such (e.g. by comments).
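
One possible way to satisfy the Parameters criterion is to collect all adjustable settings in a clearly marked block at the top of a script. The variable names and default values below are invented for illustration and do not correspond to any particular study.

```python
import os

# === Parameters: adapt these before running ==================================
BASE_DIR = "."                                   # base path of the repository
INPUT_DIR = os.path.join(BASE_DIR, "data")       # where the input data is read from
RESULTS_DIR = os.path.join(BASE_DIR, "results")  # where the results are written to
MOST_FREQUENT_WORDS = 1000                       # number of features used in the analysis
# ==============================================================================


def select_features(term_counts, limit=MOST_FREQUENT_WORDS):
    """Return the `limit` most frequent terms, with ties broken alphabetically."""
    ranked = sorted(term_counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]
```

Someone replicating the analysis then only needs to edit the marked block, for example to point BASE_DIR at their own copy of the repository or to test how a different value of MOST_FREQUENT_WORDS affects the results.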

General code quality

  • Repository structure (basic). The repository is usefully structured into folders, in particular separating input data, data or results generated by the code, the code itself, and any visualizations.
  • Code structure (extended). The code is usefully structured into scripts and functions, creating an appropriate level of modularity.
  • Folders (extended). The code is written in such a way as to automatically create all folders and subfolders required to save results to file. No manual creation of folders should be necessary.
  • Naming conventions (extended). The scripts, functions and variables have names that are concise, meaningful and readily understandable. For instance, they avoid excessive use of abbreviations as well as purely functional names (such as ‘dataframe’).
  • Coding style (optional). The code follows the relevant coding style conventions, such as PEP8 for Python or tidyverse for R.
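
For the Folders criterion, a minimal Python sketch is given below. The function name and the folder layout used in the usage note are assumptions for this example; the relevant part is the call to ‘os.makedirs’ with ‘exist_ok=True’, which creates any missing folders and does nothing if they already exist.

```python
import os


def save_result(text, *path_parts):
    """Write `text` to the file at the joined path, creating missing folders first."""
    path = os.path.join(*path_parts)
    folder = os.path.dirname(path)
    if folder:
        # Creates all intermediate folders; no error if they already exist.
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

A call such as save_result(table_text, "results", "tables", "table1.csv") would then work on a fresh copy of the repository without anyone having to create the results folder by hand.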