Getting Started
Overview
The RoseLab servers are the primary machine learning servers owned and managed by the UCSD CSE Rose Lab. They offer a versatile platform for machine learning researchers to develop and run their models within Linux containers. In addition, the RoseLab servers provide access to Grafana for real-time machine metrics tracking, Seafile for convenient data sharing and backup, MinIO for hosting S3 datasets, and HedgeDoc for online Markdown collaboration. More web applications are planned to support researchers' needs.
Hardware
The RoseLab servers are located in Rack C05 of the CSE server room 1215 and consist of two primary components:
- Gigabyte G292 4x A100 GPU server
- Supermicro 12-bay Storage server, equipped with 6x 20TB hard drives.
Note
Please note that the RoseLab servers are still in the early stages of development, and any feedback on the user experience is highly appreciated. More hardware is planned for the future. For more information about the rationale behind the servers, please refer to the Why RoseLab section.
Quick Start
Apply for a RoseLab server container using the Account Request Form.
- Don't worry too much about your selections, as port mappings and resource quotas can be easily adjusted.
- You will have root permission to install or remove any software other than the NVIDIA driver.
- If you need to switch to a different software image, such as for a new project on reinforcement learning, you can request a migration. However, this process requires you to back up all personal files to /data.
WARNING
Please be aware that the host and container NVIDIA driver versions must match, because GPU tasks within the container communicate with the host driver kernel module. If you see the message "Failed to initialize NVML: Driver/library version mismatch", please contact the admin.
Once your request is approved, you will receive an email containing two tables:
<id> is a 3-digit number.

| Container Port | Host Port |
| --- | --- |
| 22 (SSH) | <id>00 |
| 3389 (RDP) | <id>01 |
| 5900 (VNC) | <id>02 |
| 80 (HTTP) | <id>03 |
| 443 (HTTPS) | <id>04 |
| 8080 (Web server, e.g. Tomcat) | <id>05 |
| 8888 (Jupyter) | <id>06 |
| 8889 (Backup) | <id>07 |
| 8890 (Backup) | <id>08 |

| Description | Account | Password |
| --- | --- | --- |
| System, Remote Desktop | ubuntu | <token1> |
| S3 Object Storage | <name> | <token2> |
| Seafile | <email> | <token3> |
| HedgeDoc | <email> | <token4> |
| Jupyter | | <token5> |
| SSH | ubuntu | <keyfile> |
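The port mapping in the first table follows a fixed pattern: the host port is your <id> followed by a two-digit index. As a minimal sketch, the helper below reproduces that lookup. The function name host_port and the id 123 are hypothetical and used only for illustration; the values in your own email take precedence.

```shell
# Hypothetical helper: compute the host port for a container port,
# following the mapping table (host port = <id> followed by a two-digit index).
host_port() {
  id="$1"
  container_port="$2"
  case "$container_port" in
    22)   idx=00 ;;  # SSH
    3389) idx=01 ;;  # RDP
    5900) idx=02 ;;  # VNC
    80)   idx=03 ;;  # HTTP
    443)  idx=04 ;;  # HTTPS
    8080) idx=05 ;;  # Web server
    8888) idx=06 ;;  # Jupyter
    8889) idx=07 ;;  # Backup
    8890) idx=08 ;;  # Backup
    *) echo "unmapped container port: $container_port" >&2; return 1 ;;
  esac
  printf '%s%s\n' "$id" "$idx"
}

host_port 123 22     # prints 12300 (SSH for the made-up id 123)
host_port 123 8888   # prints 12306 (Jupyter)
```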
If the Jupyter password is not provided, it is at line 991 of ~/.jupyter/jupyter_lab_config.py. If the S3 credentials are not provided, the name is the same as your container name, and the password is the same as your Seafile password. Keep the tables in a secure place and do not share them with others.
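To look up the Jupyter password without opening an editor, you can print a single line of the config file with sed. This is a small sketch: show_line is a hypothetical helper name, and line 991 is taken from the text above, so it may not hold if the config file has been regenerated or edited.

```shell
# Hypothetical helper: print line number $2 of file $1.
show_line() {
  sed -n "${2}p" "$1"
}

config="$HOME/.jupyter/jupyter_lab_config.py"
# Line 991 is the location given in this guide; it is an assumption that
# your copy of the file has not been changed since the container was created.
if [ -f "$config" ]; then
  show_line "$config" 991
fi
```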
Note
OS-level virtualization makes each isolated container look like a dedicated machine from the inside, so everyone's username is ubuntu. Containers are distinguished by their hostnames. Changing the username is not recommended, as it can cause problems with static configurations.
SSH Login
Move the downloaded private key file to your ~/.ssh folder. Then, change the file permissions so that it is not readable by others.
chmod 600 ~/.ssh/keyfile
Run the SSH command with your designated port (-p).
ssh ubuntu@roselab1.ucsd.edu -p [id]00 -i ~/.ssh/keyfile
(base) ubuntu@account:~$
SSH connections are sometimes blocked on the UCSD-GUEST network. Switch to another Wi-Fi network if this occurs.
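If you log in often, an entry in ~/.ssh/config saves retyping the port and key path. The sketch below only prints such an entry. The alias roselab, the function name, and the id 123 are hypothetical; replace them with your own values.

```shell
# Print an ssh_config entry for a container. "roselab" is a hypothetical
# alias; the id and key path passed below are placeholders, not real values.
roselab_ssh_config() {
  id="$1"
  keyfile="$2"
  printf 'Host roselab\n'
  printf '    HostName roselab1.ucsd.edu\n'
  printf '    User ubuntu\n'
  printf '    Port %s00\n' "$id"           # SSH maps to host port <id>00
  printf '    IdentityFile %s\n' "$keyfile"
}

roselab_ssh_config 123 '~/.ssh/keyfile'
```

After appending the printed block to ~/.ssh/config, the login command shortens to ssh roselab.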
Know Your Container
Now let's check the resources assigned to you. First, use lscpu to check the CPU cores. Although the CPU indices may differ, you should see 12 online CPU cores. Here's an example output:
$ lscpu
...
CPU(s): 56
On-line CPU(s) list: 4,6,11,13,18,19,23,29,34,38,41,42
Off-line CPU(s) list: 0-3,5,7-10,12,14-17,20-22,24-28,30-33,35-37,39,40,43-55
...
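Counting the entries of the on-line list by eye is error-prone when it mixes single indices and ranges. As a rough sketch, the hypothetical helper count_cpus below counts the CPUs in a list written in the lscpu style; the input is the on-line list from the example output above.

```shell
# Count CPUs in an lscpu-style list such as "4,6,11" or "0-3,5,7-10".
count_cpus() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F- '
    NF == 1 { n += 1 }             # single index
    NF == 2 { n += $2 - $1 + 1 }   # inclusive range
    END     { print n + 0 }'
}

count_cpus "4,6,11,13,18,19,23,29,34,38,41,42"   # prints 12
```

Inside the container, nproc is a quicker check and would typically report the same number, assuming the CPU limit is enforced through CPU affinity.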
Next, you can inspect the memory assigned to you using the /proc/meminfo file. You should see around 128 GB of total RAM.
$ cat /proc/meminfo
MemTotal: 125000000 kB
MemFree: 96093828 kB
MemAvailable: 96883860 kB
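The values in /proc/meminfo are reported in kB, where one kB is 1024 bytes, so the MemTotal figure above corresponds to about 128 GB in decimal units. A minimal sketch of the conversion, using a hypothetical helper named mem_total_gb:

```shell
# Read /proc/meminfo-style text on stdin and print MemTotal in decimal GB,
# rounded to the nearest integer. The kB unit in /proc/meminfo is 1024 bytes.
mem_total_gb() {
  awk '$1 == "MemTotal:" { printf "%d\n", int($2 * 1024 / 1e9 + 0.5) }'
}

printf 'MemTotal: 125000000 kB\n' | mem_total_gb   # prints 128
```

On the container itself, the same helper can read the real file with mem_total_gb < /proc/meminfo.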
To see the file system, run df -H. You should see:
- the system SSD with around 256 GB of available space,
- a 5 TB private data HDD mounted under /data that is only accessible to you, and
- a 5 TB public data HDD mounted under /public that is accessible to everyone.
$ df -H
Filesystem Size Used Avail Use% Mounted on
zfs-pool/containers/account 290G 43G 248G 15% /
...
data/account-vol 5.0T 263k 5.0T 1% /data
data/public 5.0T 263k 5.0T 1% /public
It is recommended to use soft links to access your data files on the /data HDD. For example, instead of downloading your data files to /data/project1/sample...pt and hard-coding their absolute paths, you can create a soft link under the code folder using the ln -s /data/project1/ /home/ubuntu/project1/data/ command. Then, you can refer to the data files as if they and the code were in the same project structure.
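The soft-link setup can be wrapped in a small helper so that it is repeatable across projects. This is a sketch under assumptions: link_data is a hypothetical name, and the link path is written without a trailing slash so that the link itself is created at that path, which is a variant of the command above.

```shell
# Hypothetical helper: link a data directory into a project tree.
link_data() {
  src="$1"
  link="$2"
  mkdir -p "$(dirname "$link")"   # make sure the parent of the link exists
  ln -sfn "$src" "$link"          # -n: replace an existing link instead of following it
}

# Example with the paths from the text (run inside the container):
# link_data /data/project1 /home/ubuntu/project1/data
# ls /home/ubuntu/project1/data/   # lists the files stored under /data/project1
```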
TIP
If your dataset is smaller than 200 GB, it is recommended to load the dataset directly from the system disk, as the SSD is faster than the HDD. You can use the HDD to store data from completed projects.
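To decide where a dataset should live, you can measure it with du and compare it against the 200 GB threshold from the tip. The helper below is a sketch only: fits_on_ssd is a hypothetical name, and the threshold is interpreted as decimal gigabytes.

```shell
# Succeeds if a directory is smaller than a limit given in decimal GB.
# The 200 GB default mirrors the tip in this guide.
fits_on_ssd() {
  dir="$1"
  limit_gb="${2:-200}"
  used_kib=$(du -sk "$dir" | awk '{ print $1 }')    # size in 1024-byte units
  limit_kib=$((limit_gb * 1000000000 / 1024))       # decimal GB to 1024-byte units
  [ "$used_kib" -lt "$limit_kib" ]
}

# Example (run inside the container):
# if fits_on_ssd /data/project1; then echo "small enough for the system SSD"; fi
```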
Check Credentials
To check your webapp credentials and use the webapps, refer to the Seafile, HedgeDoc and MinIO documentation. If you requested Jupyter Lab or Remote Desktop, refer to the corresponding pages to check if you can log in successfully.
What's next?
Congratulations! You are now ready to use your resources. You may have noticed that using a RoseLab container is like using a dedicated server. However, there are still some differences that you may want to review. You may want to set your own passwords and private key by following the guide in the Security section. If you notice that your model is running slowly on the server, refer to the Troubleshooting section for possible solutions.
Support
If you have questions or need help, reach out to the admin Zihao Zhou (ziz244@ucsd.edu).