Getting Started with Jetstream2 for Capstone Projects
INFO 698 Capstone - Spring 2025
Preparation: Sign up for ACCESS
ACCESS ID
- Register without an existing identity.
- Send the instructor your ACCESS ID via this Google Form.
- Once you have been added to the course allocation, proceed with the next steps.
Option 1: Use Linux VMs on Jetstream2
This guide walks you through launching your first virtual machine (VM), connecting to it, transferring files, installing software, and using storage, all in Jetstream2.
Step 1: Log in to Jetstream2
Login Guide
- Go to the Jetstream2 Exosphere portal.
- Log in using your ACCESS ID (provided when added to the capstone project).
- You should see your allocated project under "Projects" in the dashboard.
Step 2: SSH Key Setup
You need an SSH key to connect:
Option A: Create your own
ssh-keygen -t rsa -b 4096 -f ~/.ssh/jetstream-key
Upload jetstream-key.pub to Exosphere under SSH Keys.
Option B: Use a Jetstream2-generated key
Save the .pem file securely. You'll use it to connect like this:
ssh -i ~/Downloads/your-key.pem ubuntu@<your-instance-ip>
Step 3: Launch Your First Instance (Virtual Machine)
First Instance Guide
- Click "Launch Instance".
- Choose a base image:
  - For Python + Jupyter: Try Ubuntu 22.04 or a prebuilt AI/ML image if available.
- Choose instance type:
- Use CPU instances for general tasks.
- Use GPU instances for training deep learning models.
- Assign a volume if needed (for storage persistence).
- Name your instance and click Launch.
Here's a summary of commonly used instance types on Jetstream2, depending on your project needs:
| Use Case | Instance Type | vCPUs | RAM | GPU? | Notes |
|---|---|---|---|---|---|
| Lightweight dev/testing | m3.small | 2 | 6 GB | No | For quick prototyping, notebooks |
| Medium data processing | m3.medium | 8 | 30 GB | No | ETL, pandas, scikit-learn |
| Heavy feature engineering | m3.large | 16 | 60 GB | No | PostgreSQL joins, Dask |
| Multi-threaded CPU training | m3.xl / m3.2xl | 32-64 | 125-250 GB | No | XGBoost, random forest, NLP CPU |
| Large-memory workloads | r3.large / r3.xl | 64-128 | 500-977 GB | No | In-memory joins, large DataFrames |
| Lightweight GPU | g3.medium | 8 | 30 GB | A100 (partial) | Torch, TensorFlow experiments |
| Full GPU training | g3.xl | 32 | 117 GB | A100 (full) | Deep learning, transformers |
| Visualization, Jupyter | Any + web desktop | - | - | - | Enable Web Desktop option |
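One way to read the table is to pick the smallest type that meets your CPU and memory needs. The sketch below encodes the rows above in a hypothetical `FLAVORS` list with a hypothetical helper, `smallest_fit`. Splitting the paired rows (for example, m3.xl and m3.2xl as 32 and 64 vCPUs) is an assumption about how the ranges map to each type, so check the figures against the Jetstream2 documentation before relying on them.

```python
from typing import NamedTuple, Optional


class Flavor(NamedTuple):
    name: str
    vcpus: int
    ram_gb: int
    gpu: bool


# Figures copied from the table above. For rows that list two types,
# the low/high ends of each range are ASSUMED to map to the two types in order.
FLAVORS = [
    Flavor("m3.small", 2, 6, False),
    Flavor("m3.medium", 8, 30, False),
    Flavor("m3.large", 16, 60, False),
    Flavor("m3.xl", 32, 125, False),
    Flavor("m3.2xl", 64, 250, False),
    Flavor("r3.large", 64, 500, False),
    Flavor("r3.xl", 128, 977, False),
    Flavor("g3.medium", 8, 30, True),
    Flavor("g3.xl", 32, 117, True),
]


def smallest_fit(min_vcpus: int, min_ram_gb: int, need_gpu: bool = False) -> Optional[Flavor]:
    """Return the smallest flavor meeting the requirements, or None if none fits."""
    candidates = [
        f for f in FLAVORS
        if f.vcpus >= min_vcpus and f.ram_gb >= min_ram_gb and f.gpu == need_gpu
    ]
    # "Smallest" here means fewest vCPUs, then least RAM.
    return min(candidates, key=lambda f: (f.vcpus, f.ram_gb), default=None)


choice = smallest_fit(min_vcpus=4, min_ram_gb=16)
print(choice.name if choice else "no matching type")  # m3.medium
```

GPU types are only returned when `need_gpu=True`, so a CPU-only request never lands on a GPU instance by accident.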
Tip: You can resize or relaunch as your project scales.
Step 4: Access Your Instance
Access Guide
- Once your VM is running:
- Click "Access", then "Open Shell" to use the built-in terminal (no extra tools needed).
- Alternatively, connect using SSH (for advanced users).
ssh -i ~/.ssh/your-key.pem ubuntu@<your-instance-ip>
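If the SSH command hangs or is refused, it can help to check whether the instance is accepting connections on the SSH port yet. The following is a minimal sketch using only the Python standard library; the helper name `ssh_port_open` is hypothetical, and it assumes the instance listens on the default port 22.

```python
import socket


def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `ssh_port_open("<your-instance-ip>")` from your own computer returns `False` if no connection can be made within the timeout, which is common while an instance is still booting. A `True` result only shows that the port is reachable; it does not confirm that your key is accepted.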
Step 5: Set Up Storage (Volumes)
Volume Guide
- Volumes are persistent storage devices that can be attached to your instances.
- Steps:
  1. Create a new volume.
  2. Attach it to your instance.
  3. Mount the volume in your Linux filesystem:
sudo mkdir -p /mnt/data
sudo mount /dev/sdb /mnt/data
The device name may differ from /dev/sdb; run lsblk to confirm it before mounting. Now you can store data here, and it will persist even if the instance is deleted.
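Before writing data, it is worth confirming that the volume is really mounted; otherwise files silently land on the instance's own disk. The sketch below is one way to check from Python, assuming the mount point `/mnt/data` used above. The helper name `volume_status` is hypothetical.

```python
import os
import shutil


def volume_status(path: str = "/mnt/data") -> dict:
    """Report whether path exists, is a mount point, and how much space is free."""
    if not os.path.isdir(path):
        return {"path": path, "exists": False, "mounted": False, "free_gb": None}
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return {
        "path": path,
        "exists": True,
        "mounted": os.path.ismount(path),
        "free_gb": round(usage.free / 1e9, 1),
    }


print(volume_status())
```

If `mounted` is `False` while `exists` is `True`, the directory was created but the volume is not attached to it, and anything saved there will not persist with the volume.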
Step 6: Transfer Files (Data, Code, Models)
File Transfer Guide
Use Exosphere's web UI to upload:
- Python notebooks, datasets, or pre-trained models.
Or use scp from the command line:
scp myfile.csv your-access-id@your-instance-address:/mnt/data/
In-depth example
Transfer files to Jetstream2 using rsync:
rsync -avz -e "ssh -i ~/.ssh/your-key.pem" \
~/your-local-folder/ \
ubuntu@<your-instance-ip>:/home/ubuntu/project-data/
Tip: For large files (e.g., >10 GB), compress before transferring:
gzip largefile.csv
# transfer, then decompress:
gunzip largefile.csv.gz
Step 7: Install Software on Your VM
Use apt to install packages:
sudo apt update
sudo apt install python3-pip git htop
Install Python libraries:
pip install numpy pandas scikit-learn torch transformers
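After installing, a quick check can confirm that the libraries are importable and, on a GPU instance, that PyTorch can see the GPU. This is a minimal sketch: the helper name `check_environment` is hypothetical, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names for the packages installed above (scikit-learn imports as "sklearn").
PACKAGES = ["numpy", "pandas", "sklearn", "torch", "transformers"]


def check_environment(packages=PACKAGES) -> dict:
    """Map each package name to True/False (installed or not), plus CUDA visibility."""
    report = {name: importlib.util.find_spec(name) is not None for name in packages}
    if report.get("torch"):
        import torch  # imported lazily so the check also works without torch
        report["cuda_available"] = torch.cuda.is_available()
    else:
        report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

On a CPU instance, `cuda_available` is expected to be `False` even when `torch` is installed.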
Use Case Examples for Capstone Projects
- Data Processing & EDA: Use Jetstream2 CPU instances + mounted volume.
- Deep Learning: Use Jetstream2 GPU instance with PyTorch or TensorFlow.
- Collaboration: Team members can share volumes or copy project files via the UI.
Step 8: Manage Your Instance
Instance Management Guide
- Stop the instance when not in use to conserve service units (SUs).
- Terminate only when you're done permanently.
- Monitor usage and billing under your project dashboard.
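To see why stopping an idle instance matters, a rough estimate of SU consumption can help. The sketch below assumes SUs are charged per vCPU-hour while the instance is running and that nothing is charged while it is stopped. Both the rate and that billing model are assumptions for illustration, so confirm the actual charge rates in the Jetstream2 documentation. The helper name `estimate_sus` is hypothetical.

```python
def estimate_sus(vcpus: int, hours_running: float, su_per_vcpu_hour: float) -> float:
    """Estimate SUs consumed, ASSUMING a flat charge per vCPU-hour while running."""
    return vcpus * hours_running * su_per_vcpu_hour


# Hypothetical rate of 1.0 SU per vCPU-hour on an 8-vCPU instance, over one week.
always_on = estimate_sus(vcpus=8, hours_running=24 * 7, su_per_vcpu_hour=1.0)
work_hours_only = estimate_sus(vcpus=8, hours_running=8 * 5, su_per_vcpu_hour=1.0)

print(always_on)        # 1344.0
print(work_hours_only)  # 320.0
```

Under these assumptions, running only during working hours uses less than a quarter of the SUs of leaving the instance on all week.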
Pro Tips
- Keep notebooks and code in /mnt/data to persist across instance shutdowns.
- Use screen or tmux to run long processes that won't stop if you disconnect.
- Back up large models or results periodically.
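The periodic backups mentioned above can be as simple as copying a file to the mounted volume under a timestamped name. The following is a minimal sketch: the helper name `backup_file` and the `/mnt/data/backups` directory are hypothetical choices, not part of Jetstream2.

```python
import shutil
import time
from pathlib import Path


def backup_file(src, backup_dir="/mnt/data/backups") -> Path:
    """Copy src into backup_dir with a timestamp in the file name; return the new path."""
    src = Path(src)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves modification times
    return dest


# Example call (hypothetical file name):
# backup_file("results/model.pt")
```

Calling this at the end of each training epoch or analysis run keeps earlier versions available if a later run overwrites or corrupts the working copy.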
Need Help?
Use:
- Jetstream2 Documentation
- ACCESS Support
Or ask your instructor/project lead for help accessing shared resources.
Option 2: Use CACAO + Google Colab to Run on Jetstream2
Using CACAO (Cloud Automation and Continuous Analysis Orchestration), you can launch a Jetstream2 VM and connect it directly to Google Colab. This allows you to:
- Work in the familiar Google Colab environment
- Use Jetstream2 GPU/CPU power for compute
- Avoid needing a personal SSH setup
What You Need
- Your ACCESS ID and project membership (already done by your instructor)
- CACAO installed in your browser
- A Google account (for Colab)
- A running Jetstream2 VM (we'll walk through this below)
Step-by-Step: Set Up CACAO + Colab
See the webinar version of this tutorial via this YouTube Video on CACAO + Google Colab integration by CyVerse.
1. Launch a Jetstream2 VM
Just like in the regular Jetstream2 setup:
- Go to https://jetstream2.exosphere.app
- Launch an instance (Ubuntu 22.04 recommended)
- Choose the project you're part of
- Use GPU or CPU depending on your project
- Optionally attach a volume for storage
- Launch!
Let the instance boot up fully before proceeding.
2. Install CACAO in Google Chrome
- Visit the CACAO Chrome Extension
- Install the extension
- It allows Colab to "tunnel" into your ACCESS resources
3. Set Up CACAO Configuration
In the extension:
- Sign in with your ACCESS ID
- Select your Jetstream2 instance
- Enable port forwarding (default is port 22)
Make sure the instance is Running and accessible.
4. Connect from Google Colab
In your Colab notebook, run:
!pip install colab_ssh --upgrade
from colab_ssh import launch_ssh
# Replace with your instance's public IP and username (typically 'ubuntu')
launch_ssh("your-jetstream2-public-ip", "ubuntu", password="your-password")
If you prefer key-based login, you can use:
from colab_ssh import launch_ssh
launch_ssh("your-jetstream2-public-ip", "ubuntu", key="your-private-ssh-key")
You're Now Running Code on Jetstream2 via Colab!
You can:
- Install PyTorch, Transformers, and other libraries
- Run GPU training loops
- Access data stored on Jetstream2 (e.g., in mounted volumes)
Tear Down
When done:
- Save any model files to /mnt/data or your mounted volume
- Shut down your Jetstream2 instance to avoid burning SUs
- Disconnect from Colab or revoke CACAO permissions as needed
Notes
- CACAO setup avoids direct SSH, which is great for beginners
- CACAO and Colab may time out; use screen or tmux for long processes
- Colab can still save to Google Drive for convenience