Getting Started with Jetstream2 for Capstone Projects

INFO 698 Capstone - Spring 2025

Preparation: Sign up for ACCESS

🔗 ACCESS ID

  1. Register without an existing identity.
  2. Send the instructor your ACCESS ID via this Google Form.
  3. Once you have been added to the course allocation, proceed with the next steps.

🚀 Option 1: Use Linux VMs on Jetstream2

This guide walks you through launching your first virtual machine (VM), connecting to it, transferring files, installing software, and using storage, all within Jetstream2.


✅ Step 1: Log in to Jetstream2

🔗 Login Guide

  1. Go to the Jetstream2 Exosphere portal.
  2. Log in using your ACCESS ID (provided when added to the capstone project).
  3. You should see your allocated project under "Projects" in the dashboard.

๐Ÿ” Step 2: SSH Key Setup

You need an SSH key to connect:

🔹 Option A: Create your own

ssh-keygen -t rsa -b 4096 -f ~/.ssh/jetstream-key

Upload jetstream-key.pub to Exosphere under SSH Keys.

🔹 Option B: Use a Jetstream2-generated key

Save the .pem file securely and restrict its permissions so SSH will accept it. You'll use it to connect like this:

chmod 600 ~/Downloads/your-key.pem
ssh -i ~/Downloads/your-key.pem ubuntu@<your-instance-ip>

💻 Step 3: Launch Your First Instance (Virtual Machine)

🔗 First Instance Guide

  1. Click "Launch Instance".
  2. Choose a base image:
    • For Python + Jupyter: Try Ubuntu 22.04 or a prebuilt AI/ML image if available.
  3. Choose instance type:
    • Use CPU instances for general tasks.
    • Use GPU instances for training deep learning models.
  4. Assign a volume if needed (for storage persistence).
  5. Name your instance and click Launch.

Here's a summary of commonly used instance types on Jetstream2, depending on your project needs:

| Use Case | Instance Type | vCPUs | RAM | GPU? | Notes |
|---|---|---|---|---|---|
| Lightweight dev/testing | m3.small | 2 | 6 GB | ❌ | Quick prototyping, notebooks |
| Medium data processing | m3.medium | 8 | 30 GB | ❌ | ETL, pandas, scikit-learn |
| Heavy feature engineering | m3.large | 16 | 60 GB | ❌ | PostgreSQL joins, Dask |
| Multi-threaded CPU training | m3.xl / m3.2xl | 32–64 | 125–250 GB | ❌ | XGBoost, random forest, NLP on CPU |
| Large-memory workloads | r3.large / r3.xl | 64–128 | 500–977 GB | ❌ | In-memory joins, large DataFrames |
| Lightweight GPU | g3.medium | 8 | 30 GB | A100 (partial) | Torch, TensorFlow experiments |
| Full GPU training | g3.xl | 32 | 117 GB | A100 (full) | Deep learning, transformers |
| Visualization, Jupyter | Any + web desktop | - | - | - | Enable Web Desktop option |

💡 You can resize or relaunch as your project scales.


🔌 Step 4: Access Your Instance

🔗 Access Guide

  • Once your VM is running:
    • Click "Access" → "Open Shell" to use the built-in terminal (no extra tools needed).
    • Alternatively, connect using SSH (for advanced users).
    ssh -i ~/.ssh/your-key.pem ubuntu@<your-instance-ip>
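
If you connect often, an entry in ~/.ssh/config saves retyping the key path and IP every time. A minimal sketch; the alias name jetstream, the key path, and the IP placeholder are all examples to substitute with your own values:

```shell
# Append a host alias so `ssh jetstream` replaces the full ssh command.
# The alias, key path, and <your-instance-ip> are placeholders.
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host jetstream
    HostName <your-instance-ip>
    User ubuntu
    IdentityFile ~/.ssh/jetstream-key
EOF
chmod 600 ~/.ssh/config
```

After this, ssh jetstream (and scp/rsync with the same alias) will pick up the user and key automatically.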

๐Ÿ“ Step 5: Set Up Storage (Volumes)

🔗 Volume Guide

  • Volumes are persistent storage devices that can be attached to your instances.
  • Steps:
    1. Create a new volume.

    2. Attach it to your instance.

    3. Mount the volume in your Linux filesystem (check the device name with lsblk first; /dev/sdb is typical but not guaranteed, and a brand-new volume needs a filesystem created once with sudo mkfs.ext4, which erases it):

      sudo mkdir /mnt/data
      sudo mount /dev/sdb /mnt/data
    4. Now you can store data here, and it will persist even if the instance is deleted.
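
To have the volume remount automatically after a reboot, you can also add a line to /etc/fstab. A minimal sketch, assuming the device is /dev/sdb with an ext4 filesystem (confirm both with lsblk and blkid first; the nofail option lets the VM boot even if the volume is detached):

```
/dev/sdb  /mnt/data  ext4  defaults,nofail  0  2
```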


🔄 Step 6: Transfer Files (Data, Code, Models)

🔗 File Transfer Guide

  • Use Exosphere's web UI to upload:

    • Python notebooks, datasets, or pre-trained models.
  • Or use scp from the command line:

    scp -i ~/.ssh/your-key.pem myfile.csv ubuntu@<your-instance-ip>:/mnt/data/

In-depth example

Transfer files to Jetstream2 using rsync:

rsync -avz -e "ssh -i ~/.ssh/your-key.pem" \
  ~/your-local-folder/ \
  ubuntu@<your-instance-ip>:/home/ubuntu/project-data/

💡 For large files (e.g., >10GB), compress before transferring:

gzip largefile.csv
# transfer, then decompress:
gunzip largefile.csv.gz
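
For whole folders, tar and gzip in one step is handy before an scp or rsync transfer. A sketch; the folder and archive names are illustrative (the demo folder is created here only so the commands run end to end):

```shell
# Create a small demo folder standing in for your real project data.
mkdir -p project-data && echo "sample" > project-data/sample.txt

# Bundle and compress the folder into a single archive for transfer.
tar -czf project-data.tar.gz project-data/

# On the VM (after transferring the archive), unpack it:
tar -xzf project-data.tar.gz
```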

🔧 Step 7: Install Software on Your VM

🔗 Software Install Guide

  • Use apt to install packages:

    sudo apt update
    sudo apt install python3-pip git htop
  • Install Python libraries:

    pip install numpy pandas scikit-learn torch transformers
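
To keep project dependencies isolated from the system Python, consider creating a virtual environment first. A sketch; the environment name capstone-env is just an example:

```shell
# Create and activate an isolated Python environment, then install into it.
python3 -m venv ~/capstone-env
source ~/capstone-env/bin/activate
python -m pip --version   # pip now points inside the venv
# pip install numpy pandas scikit-learn torch transformers
```

Re-run the source line in each new shell session (or add it to ~/.bashrc) so installs and notebooks use the same environment.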

🧠 Use Case Examples for Capstone Projects

  • Data Processing & EDA: Use Jetstream2 CPU instances + mounted volume.
  • Deep Learning: Use Jetstream2 GPU instance with PyTorch or TensorFlow.
  • Collaboration: Team members can share volumes or copy project files via the UI.

๐Ÿ” Step 8: Manage Your Instance

🔗 Instance Management Guide

  • Stop your instance when not in use to conserve SUs (service units).
  • Terminate only when you're done permanently.
  • Monitor usage and billing under your project dashboard.

🛠 Pro Tips

  • Keep notebooks and code in /mnt/data to persist across instance shutdowns.
  • Use screen or tmux to run long processes that won't stop if you disconnect.
  • Back up large models or results periodically.
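
If screen or tmux isn't installed, plain nohup (part of coreutils, so always available) also keeps a job alive after you disconnect; this is a simpler substitute for the tools named above, and the one-line python3 command below is a stand-in for your real training script:

```shell
# Run a long job immune to hangups, logging its output to a file.
nohup python3 -c "print('long job output')" > job.log 2>&1 &

# In practice you would log out here and come back later;
# `wait` just lets this demo finish before inspecting the log.
wait $!
cat job.log   # check progress later
```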

📬 Need Help?

Use:

  • Jetstream2 Documentation
  • ACCESS Support

Or ask your instructor/project lead for help accessing shared resources.


๐ŸŒ Option 2: Use CACAO + Google Colab to Run on Jetstream2

Using CACAO (Cloud Automation and Continuous Analysis Orchestration, a CyVerse service), you can launch a Jetstream2 VM and connect it directly to Google Colab. This allows you to:

✅ Work in the familiar Google Colab environment
✅ Use Jetstream2 GPU/CPU power for compute
✅ Avoid needing a personal SSH setup


🧰 What You Need

  • Your ACCESS ID and project membership (already done by your instructor)
  • CACAO installed in your browser
  • A Google account (for Colab)
  • A running Jetstream2 VM (we'll walk through this below)

✅ Step-by-Step: Set Up CACAO + Colab

See the webinar version of this tutorial via this YouTube video on CACAO + Google Colab integration by CyVerse.

1. Launch a Jetstream2 VM

Just like in the regular Jetstream2 setup:

  • Go to https://jetstream2.exosphere.app
  • Launch an instance (Ubuntu 22.04 recommended)
  • Choose the project you're part of
  • Use GPU or CPU depending on your project
  • Optionally attach a volume for storage
  • Launch!

Let the instance boot up fully before proceeding.


2. Install CACAO in Google Chrome

  • Visit the CACAO Chrome Extension
  • Install the extension
  • It allows Colab to "tunnel" into your ACCESS resources

3. Set Up CACAO Configuration

In the extension:

  • Sign in with your ACCESS ID
  • Select your Jetstream2 instance
  • Enable port forwarding (default is port 22)

Make sure the instance is Running and accessible.


4. Connect from Google Colab

In your Colab notebook, run:

!pip install colab_ssh --upgrade
from colab_ssh import launch_ssh

# Replace with your instance's public IP and username (typically 'ubuntu')
launch_ssh("your-jetstream2-public-ip", "ubuntu", password="your-password")

If you prefer to use key-based login, you can also:

from colab_ssh import launch_ssh
launch_ssh("your-jetstream2-public-ip", "ubuntu", key="your-private-ssh-key")

💻 You're Now Running Code on Jetstream2 via Colab!

You can:

  • Install PyTorch, Transformers, and other libraries
  • Run GPU training loops
  • Access data stored on Jetstream2 (e.g., in mounted volumes)

🧼 Tear Down

When done:

  • Save any model files to /mnt/data or your mounted volume
  • Shut down your Jetstream2 instance to avoid burning SUs
  • Disconnect from Colab or revoke CACAO permissions as needed

โš ๏ธ Notes

  • CACAO setup avoids direct SSH, which is great for beginners
  • CACAO and Colab may time out; use screen or tmux for long processes
  • Colab can still save to Google Drive for convenience
