System Status

Real-Time Monitoring

For real-time system metrics, visit the Grafana dashboard at roselab1.ucsd.edu/grafana.

Current Status

Last Updated: October 2025

Server Status

| Server | Status | Notes |
|---|---|---|
| roselab1 | 🟢 Online | All systems operational |
| roselab2 | 🟢 Online | All systems operational |
| roselab3 | 🟢 Online | All systems operational |
| roselab4 | 🟢 Online | All systems operational |
| roselab5 | 🟢 Online | All systems operational |
| rosedata | 🟢 Online | All systems operational |

Legend: 🟢 Online, 🟡 Degraded Performance, 🔴 Offline, 🔵 Maintenance

Storage Status

Storage Update

The lab's shared storage had previously been at high utilization. We've freed up approximately 20TB of space and new drives are on the way to expand capacity. Please continue to be mindful of storage usage.

| Storage Pool | Total | Used | Available | Usage |
|---|---|---|---|---|
| Shared HDD Cluster | ~140TB | Varies | ~20TB+ | Manageable |
| Per-user /data | 5TB | Varies | Varies | Check with df -h /data |
| Per-user /public | 5TB (shared) | Varies | Varies | Check with df -h /public |

Best practices for storage management:

  1. Store large datasets on /data (synchronized across servers)
  2. Keep environments and code on local SSD
  3. Archive or compress old datasets
  4. Clean up intermediate experiment results regularly
  5. Use /utilities/common-utilities.py to clean pip cache if needed
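Practices 3 and 4 start with finding what is worth archiving. The following is a minimal sketch, not a lab-provided tool: the function name `stale_files` and its default thresholds are hypothetical, and it assumes GNU `find` (for `-printf`). It only lists candidates and never deletes anything.

```bash
#!/usr/bin/env bash
# List files under a directory that are both large and untouched for a while,
# as candidates for archiving or cleanup. Nothing is modified or deleted.
# Usage: stale_files DIR [MIN_AGE_DAYS] [MIN_SIZE]
#   MIN_SIZE uses find(1) syntax, e.g. 100M or 1G.
stale_files() {
    local dir="$1" days="${2:-90}" size="${3:-1G}"
    # Output: size in bytes, a tab, then the path; largest files first.
    find "$dir" -xdev -type f -mtime +"$days" -size +"$size" \
        -printf '%s\t%p\n' 2>/dev/null | sort -rn
}
```

For example, `stale_files /data/my_project 90 1G | head -n 20` would show the 20 largest files not modified in 90 days (the path is a placeholder for your own directory).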

Current NVIDIA Driver Version

Version: 580.95.05

Important

Do NOT install nvidia-driver through your package manager (apt, yum, etc.). This will break GPU passthrough in containers.

If you accidentally corrupt your NVIDIA driver, run /utilities/nvidia-upgrade.sh and then sudo reboot to restore it.
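To see whether your container reports the driver version listed above, a quick comparison can help before you reach for the repair script. This is a rough sketch under stated assumptions: the function names `driver_matches` and `check_driver` are hypothetical, and the expected version is copied from this page, so it needs updating whenever the page changes.

```bash
#!/usr/bin/env bash
# Compare the driver version reported by nvidia-smi with the one on this page.
EXPECTED_DRIVER="580.95.05"   # copied from this status page (assumption: still current)

# Succeeds (exit 0) only when the two version strings match exactly.
driver_matches() {
    local actual="$1" expected="${2:-$EXPECTED_DRIVER}"
    [ "$actual" = "$expected" ]
}

# Query nvidia-smi and print a one-line verdict; non-zero exit on any problem.
check_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found"
        return 2
    fi
    local actual
    # nvidia-smi prints one line per GPU; the first line is enough here.
    actual=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    if driver_matches "$actual"; then
        echo "OK: driver $actual"
    else
        echo "MISMATCH: found '${actual:-none}', expected $EXPECTED_DRIVER"
        return 1
    fi
}
```

Running `check_driver` inside a container prints either the `OK` or the `MISMATCH` line. A mismatch is only a hint that something changed, not a diagnosis.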

Recent System Updates

October 2025

  • Server Migration Complete: All five roselab servers migrated to new network architecture
    • Server room move completed (wave 4, Oct 2)
    • Improved data loading speed between servers
    • Network upgraded to 100Gbps
    • roselab5 now has all 8x H200 GPUs online
  • roselab2 & roselab3 Back Online: All NVIDIA drivers upgraded to version 580.95.05
  • rosedata Recovered: Storage server back online after data recovery
  • Storage Update: Freed up ~20TB of space, new drives on the way

Scheduled Maintenance

No Scheduled Maintenance

There is currently no scheduled maintenance. This page will be updated if maintenance is planned.

Known Issues

No Active Issues

There are currently no major known issues. All servers and services are operational.

Service Status

| Service | Status | URL | Notes |
|---|---|---|---|
| Grafana | 🟢 Online | roselab1.ucsd.edu/grafana | Real-time metrics |
| Seafile | 🟢 Online | roselab1.ucsd.edu/seafile | File sharing |
| MinIO | 🟢 Online | rosedata.ucsd.edu | S3 object storage |
| HedgeDoc | 🟢 Online | roselab1.ucsd.edu/hedgedoc | Markdown collaboration |
| WandB | 🟢 Online | rosewandb.ucsd.edu | Experiment tracking |
| RoseLibreChat | 🟢 Online | roselab1.ucsd.edu:3407 | AI chat interface (API-based) |

Monitoring Your Resources

Check Your Container Status

bash
# Check CPU usage
htop

# Check GPU usage
nvidia-smi

# Check storage usage
df -h /
df -h /data

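# Check network storage write speed
# (conv=fdatasync flushes to storage before dd exits, so the page cache does not inflate the result)
dd if=/dev/zero of=/data/testfile bs=1M count=1024 conv=fdatasync
# Should see ~300+ MB/s write speed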
rm /data/testfile
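If you run the write-speed check often, it may be convenient to wrap it so that the result is a single number you can compare against the ~300 MB/s figure. The sketch below is illustrative only: `write_speed_mbps` is a hypothetical helper, it assumes GNU `date` and `dd`, and it reports decimal MB/s (10^6 bytes per second) computed from wall-clock time.

```bash
#!/usr/bin/env bash
# Time a synced write of SIZE_MIB mebibytes into DIR and print throughput
# as a whole number of MB/s. The temporary file is removed afterwards.
# Usage: write_speed_mbps DIR [SIZE_MIB]
write_speed_mbps() {
    local dir="$1" size_mib="${2:-1024}" file start end
    file=$(mktemp "$dir/speedtest.XXXXXX") || return 1
    start=$(date +%s%N)                      # nanoseconds since epoch (GNU date)
    # conv=fdatasync makes dd flush data to storage before it exits.
    if ! dd if=/dev/zero of="$file" bs=1M count="$size_mib" conv=fdatasync 2>/dev/null; then
        rm -f "$file"
        return 1
    fi
    end=$(date +%s%N)
    rm -f "$file"
    awk -v mib="$size_mib" -v ns="$((end - start))" \
        'BEGIN { if (ns <= 0) exit 1; printf "%.0f\n", (mib * 1048576 / 1e6) / (ns / 1e9) }'
}

# Example (writes 1 GiB to /data, as in the manual check):
#   speed=$(write_speed_mbps /data)
#   [ "$speed" -lt 300 ] && echo "Write speed ${speed} MB/s is below the expected ~300 MB/s"
```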

Grafana Dashboards

Visit Grafana to monitor:

  • GPU utilization per container
  • CPU and memory usage
  • Network throughput
  • Storage I/O

Reporting Issues

If you encounter any issues:

  1. Check this status page first for known issues
  2. Check Grafana for resource utilization
  3. Contact the admin: Zihao Zhou (ziz244@ucsd.edu)

When reporting issues, please include:

  • Your container name
  • The server(s) affected (roselab1-5)
  • Error messages (if any)
  • Steps to reproduce the issue
  • Output of relevant commands (nvidia-smi, df -h, etc.)
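To make a report easier to assemble, the command output listed above could be gathered into one file. This is a minimal sketch, not an official lab script: `collect_report` and the default file name are hypothetical, and it assumes the hostname printed by `uname -n` is enough to identify your container (add the container name by hand if it differs).

```bash
#!/usr/bin/env bash
# Gather basic diagnostics into one text file to attach to an issue report.
# Usage: collect_report [OUTPUT_FILE]   (prints the path of the file written)
collect_report() {
    local out="${1:-issue-report-$(date +%Y%m%d-%H%M%S).txt}" path
    {
        echo "== date ==";     date
        echo "== hostname =="; uname -n
        echo "== nvidia-smi =="
        if command -v nvidia-smi >/dev/null 2>&1; then
            nvidia-smi 2>&1 || true
        else
            echo "nvidia-smi not available"
        fi
        echo "== df -h =="
        # A missing mount is recorded as an error line instead of aborting.
        for path in / /data /public; do
            df -h "$path" 2>&1 || true
        done
    } > "$out"
    echo "$out"
}
```

Calling `collect_report` with no arguments writes a timestamped file in the current directory. Error messages and reproduction steps still need to be added manually.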

Subscribe to Updates

Stay Informed

Important system updates are posted in the lab Slack channel #server. Make sure you're subscribed to receive notifications about:

  • Scheduled maintenance
  • System outages
  • Driver updates
  • Storage alerts

Historical Status

2025

October 2025:

  • Oct 2-4: Server room migration (wave 4)
  • Oct 4: roselab2, roselab3 back online; all systems operational
  • Oct 11: Power outage for room finalization (mid-afternoon recovery)
  • Driver upgrade to 580.95.05
  • Network upgraded to 100Gbps
  • roselab5 H200 GPUs online
  • Storage freed up ~20TB

Page Maintenance

This page is manually updated by the RoseLab admin. For the most current real-time metrics, always check Grafana.

If you notice this page is out of date, please contact the admin.

Released under the MIT License.