GitHub for research documentation

July 7, 2021

See the slides here

A tutorial written by Emma Hudgins

1. Getting set up

2. Using GitHub Desktop

2.1 Creating a repo

  • In the left pane, select Add to create a new repository
  • Choose Create new repository and give the path to a new project folder (or fork this one via Clone, or add an existing folder).
  • I keep all my repos in my OneDrive, in a folder called Github. (Some people say not to do this because it causes OneDrive to constantly sync small Git files, but it doesn’t seem to cause me any trouble, and it helps me avoid accidentally deleting things when I make mistakes with GitHub Desktop. See here for at least one other person who does this.)
  • Choose a License that reflects the reuse conditions you’d like for your project (see here for a description of licenses available)
  • Initialize the repo with a README, which you will fill in following the Metadata tips in section 4 to ensure reproducibility
  • If necessary, change the privacy settings of your repo on the GitHub website (storage limits are lower for private repositories).
  • If the repo does not yet exist online, make sure you Publish it.
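
As a starting point for that README, the sketch below writes a skeleton with the minimum fields listed under Metadata in section 4 (Title, Authors, Description, Date, License). The folder name `example_project`, the temporary location, the `README.md` file name, and every placeholder value are assumptions made for illustration; GitHub Desktop can also create an empty README for you.

```shell
# Hypothetical project folder, created in a temporary location for this sketch;
# in practice this would be the repository folder you made above.
project_dir="$(mktemp -d)/example_project"
mkdir -p "$project_dir"

# Write a README skeleton containing the minimum metadata fields.
# Every value in <angle brackets> is a placeholder to replace.
cat > "$project_dir/README.md" <<'EOF'
# <Project title>

Authors: <names>
Date: <date of last update>
License: <license name, e.g. GPL or CC-BY>

## Description

<What the project does, what each folder contains, and how the
scripts and input data files interact so the analyses can be reproduced.>
EOF

# Show the result.
cat "$project_dir/README.md"
```

Once the placeholders are filled in, the file can be committed like any other change (see section 2.2).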

2.2 Your first commit!

  • You can manually open your files as usual, but you can also use the Repository tab to open your files all at once in a text editor (I set my default to Sublime)
  • Try making a small change to one of your files, or create a new file.
  • If you navigate back to GitHub Desktop, the changed or new file should be listed there
  • If you’re satisfied with your changes, type a commit message (or use the default for a single file change) and press Commit to main
  • Send your changes to GitHub’s server by pressing Push origin
  • Your changes should now be reflected on the online version of your repo!
  • NOTE: There should be a Git pane in the top right of newer versions of RStudio when you’re working in an R Project contained within a GitHub repo, and using it to Commit and Push works similarly to GitHub Desktop.

2.3 Pulling online changes to your machine

  • Now, try changing a file on the online repo (e.g. using the pencil icon to update the README)

  • Here’s an example of the Commit menu after I edited the README (screenshot not reproduced here)

  • You may need to click ‘Fetch origin’ to see your changes

  • Now press Pull origin

  • Now your local computer should be up to date with the online repo!

  • NOTE: There should be a Git pane in the top right of newer versions of RStudio when you’re working in an R Project contained within a GitHub repo, and using it to Pull works similarly to GitHub Desktop.
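
For readers curious about what the Commit, Push, Fetch and Pull buttons correspond to on the command line, here is a minimal sketch of the same round trip. It is an illustration under assumptions, not a description of what GitHub Desktop literally runs: a local bare repository stands in for GitHub, and the folder names (`laptop`, `server`), the file contents, and the identity (`Your Name`, `you@example.com`) are placeholders.

```shell
# Everything happens in a throwaway folder, so no real repo is touched.
workdir="$(mktemp -d)"

# A bare repository plays the role of the online GitHub repo ("origin").
git init -q --bare "$workdir/origin.git"

# Two clones stand in for two machines. Cloning an empty repo may print
# a harmless warning. "git -C <folder>" runs git inside that folder.
git clone -q "$workdir/origin.git" "$workdir/laptop"
git clone -q "$workdir/origin.git" "$workdir/server"

# Placeholder identity, stored only in the throwaway clone.
git -C "$workdir/laptop" config user.name "Your Name"
git -C "$workdir/laptop" config user.email "you@example.com"

# Section 2.2: change a file, commit it, and push it to origin.
echo "My project" > "$workdir/laptop/README.md"
git -C "$workdir/laptop" add README.md
git -C "$workdir/laptop" commit -q -m "Add README"
git -C "$workdir/laptop" push -q origin HEAD

# Look up the branch name rather than assuming "main" or "master".
branch="$(git -C "$workdir/laptop" rev-parse --abbrev-ref HEAD)"

# Section 2.3: on the other machine, fetch to see what changed, then pull.
git -C "$workdir/server" fetch -q origin
git -C "$workdir/server" pull -q origin "$branch"

# The second clone now contains the file committed on the first.
cat "$workdir/server/README.md"
```

Using a bare repository as the stand-in keeps the sketch offline; with a real GitHub repo, only the URL passed to `git clone` changes.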

3. Linking GitHub with OSF

  • Create a new OSF project
  • Or use a template (My lab has started following this basic template)
  • Use a program like DMP Assistant to add a Research Data Management Plan to the RDMP section of the OSF project
  • Add GitHub as an Add-on in your OSF profile in Settings » Select add-ons
  • Link GitHub with your OSF account in the Add-ons section of the relevant project components (e.g. Analysis)
    • Select Import Account from Profile
    • Select the corresponding repo for the project

4. Some reproducibility tips (largely taken from the Bennett Lab manual compiled by Jaimie Vincent)

  • Data management and storage

    • Starting any research project with an RDMP provides direction for conducting research in line with Open Science/FAIR practices.
    • Data and code should be backed up regularly, using the “3-2-1” rule as a rule of thumb: keep three copies of your data (your working copy and two backups) on two different types of storage media (e.g., cloud storage and disk storage), with at least one off-site copy for disaster recovery.
    • Update your RDMP as necessary to include information about where these files will be permanently stored in addition to your storage, backup, security and archiving protocols.
    • Your code and/or analyses, interpreted data, and other outputs (e.g., figures, tables) should be continuously backed up and securely stored. Be sure to consider data privacy when making backups.
    • Store data and code in an organized file system (for instance, using a breakdown of scripts, raw data, derived data, and outputs within a large project directory)
    • Do not alter the raw data (consider making it read-only), so that a stable, separate copy always exists
  • Intellectual property

    • This is particularly important to discuss when the work belongs to students whose association with a particular research group may be temporary. The RDMP can help make clear whose intellectual property this work represents. It can be useful to name a single data steward who is responsible for maintaining the code and data throughout their lifecycle.
  • Metadata

    • There should be adequate metadata documentation. Metadata provides information about code and data function and usage, and often takes the form of ‘README’ files in research projects. GitHub Desktop asks if you want to initialize a README file whenever you create a new repo
    • Consider having, at a minimum, a single README for the project that outlines all scripts and input data files and how they interact, so that the analyses can be reproduced.
    • At a minimum, this file should contain the project Title, Authors, Description (including all subdirectories and how they relate to each other), Date, and License. GitHub allows easy association of a variety of license types with repositories; consider a license like GPL or CC-BY to make clear that others may reuse your code
    • Cryptic naming conventions in data files should be described, as well as any units and geographic transformations.
    • If the project includes external data sources, download dates should be provided as well as any relevant filters selected.
    • Update the metadata following any changes to the workflow
  • Code

    • Provide software and package version information in either the metadata or in a commented header section of any script
    • Provide annotated code with comments describing all steps taken in the analysis
    • All figures and tables should be entirely reproducible with the code and data provided (data privacy restrictions permitting). For sensitive data, plan for appropriate anonymization and secure storage
    • Consider using packages that locate the project’s root directory automatically (e.g. the here package for R), or using project files like .Rproj, so that file paths keep working when the data and code are shared
  • Hosting

    • Link the project with a platform that can provide a persistent link to the published version of the data (e.g. Zenodo link with GitHub, Dryad) in order to ensure the published results can be reproduced even as the workflow evolves. See here for how to create a Release and link with Zenodo
  • Naming (OSF naming guidelines)

    • Consider adopting a standard file naming convention, e.g. using dashes or underscores to separate name components (and avoiding special characters and, especially, spaces)
    • Use the most informative names possible within all project components (including variable names in code).
    • Number or date scripts so that they order themselves meaningfully (e.g. by order of use or version number)
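
Several of the tips above (an organized file system, read-only raw data, and numbered or dated file names) can be sketched from the command line. The folder names follow the breakdown suggested under Data management and storage; the project name, file names, and file contents are hypothetical placeholders, and this is one possible layout rather than a required one.

```shell
# Hypothetical project folder in a temporary location; in practice this
# would be your repository folder.
project="$(mktemp -d)/example_project"

# One folder per kind of file: scripts, raw data, derived data, outputs.
mkdir -p "$project/scripts" "$project/raw_data" \
         "$project/derived_data" "$project/outputs"

# Placeholder raw data file; the name and contents are made up.
raw_file="$project/raw_data/2021-07-07_raw-data.csv"
printf 'placeholder contents\n' > "$raw_file"

# Make the raw data read-only so it cannot be edited by accident.
chmod 444 "$raw_file"

# Number the scripts so they sort in the order they are run,
# using dashes and underscores instead of spaces.
touch "$project/scripts/01_clean-data.R" "$project/scripts/02_fit-model.R"

# List the resulting structure.
find "$project" | sort
```

Scripts would then read from `raw_data` and write only to `derived_data` and `outputs`, leaving the raw files untouched.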

Appendix 1. Connecting to a remote server via SSH (in case you want to sync your GitHub repo with a server for backup or computing power)

  • Open a terminal (Ctrl+Alt+T on Ubuntu, Command+Space on Mac to bring up Spotlight and search for Terminal, or use the Windows Start menu and look for Windows PowerShell or Command Prompt)
  • Type ssh user@hostname for the remote machine you want to connect to
EmmaH:~/$ ssh ehudgins@nature-vm04.carleton.ca
  • Enter your password when prompted (and be prepared to say Yes to yet another message about a fingerprint)
  • Navigate to the directory where your script is stored (maybe you transferred it there with FileZilla?) using cd

Appendix 2. Configuring git on a remote machine (this section also shows you how to embed code chunks in Markdown)

  • If you want to sync an existing GitHub repository, use git in the terminal (see here for more info on working with git in various ways)

  • To clone a GitHub repo to a remote machine, first install git

    • in the terminal (for Ubuntu), type
    ehudgins@nature-vm04:~/$ sudo apt install git
    
    • see here for other OS instructions
    • Configure git for your GitHub account
    ehudgins@nature-vm04:~/$ git config --global user.email "you@example.com"
    ehudgins@nature-vm04:~/$ git config --global user.name "Your Name"
    
  • Clone using

ehudgins@nature-vm04:~/$ git clone https://github.com/emmajhudgins/example_github_osf
  • If you update the repo on your local computer (and push the changes to GitHub), pull from inside the repo folder on the remote machine
ehudgins@nature-vm04:~/example_github_osf$ git pull 
  • If you update the repo on this remote machine, commit changes and push to origin
ehudgins@nature-vm04:~/example_github_osf$ git add <changed file>
ehudgins@nature-vm04:~/example_github_osf$ git commit -m "<message>"
ehudgins@nature-vm04:~/example_github_osf$ git push