General Usage · deep-spin/wiki Wiki
This document outlines the best practices for using SARDINE servers.
🚨🚨🚨 Please make sure to read this entire document before running any experiment. 🚨🚨🚨 A short version of our guidelines is also available in slide format, but keep in mind that it might be incomplete or outdated.
Servers
Server | IP/Hostname | Location |
---|---|---|
🪽 Hermes | 193.136.223.55 | IT building, -1 floor |
🦉 Athena | 193.136.223.41 | IT building, -1 floor |
⚡ Zeus | 193.136.223.42 | IT building, -1 floor |
🏠 Hera | 193.136.223.39 | IT building, -1 floor |
🔱 Poseidon | 193.136.166.82 | Taguspark |
🏹 Artemis | 193.136.166.83 | Taguspark |
🔮 Maia | maia.hlt.inesc-id.pt | INESC facility |
To access the servers located in the IT building, exit the elevator and take the corridor to the left until the end. The door is on the right wall (room 01.20). The key to open the door is usually in the lab, on the “mini-table” between Tsvety’s and Ben’s desks.
Physical locations:
- ⚪️ Hermes is in a white computer case
- ⚫️ Athena is in a black case
- 🧱 Zeus is next to the left wall
- 🪑 Hera is under the desk close to Zeus
- Poseidon and Artemis can be accessed remotely through a nohup connection (we can turn it on and off as we wish).
Maia is located at the INESC facility and is shared with other groups. Contact the Maia server admin at hlt-admin@inesc-id.pt to get access. Add André in cc in your email and make sure to say that you are his student.
Specifications
Server | GPUs | CPU | RAM |
---|---|---|---|
Hermes | 4 × Titan Xp - 12GB | 16 × AMD Ryzen 1950X @ 3.40GHz | 128GB |
Athena | 4 × GTX 1080 Ti - 12GB | 8 × Intel i7-9800X @ 3.80GHz | 128GB |
Zeus | 3 × RTX 2080 Ti - 12GB | 12 × AMD Ryzen 2920X @ 3.50GHz | 128GB |
Hera | 3 × RTX 2080 Ti - 12GB | 12 × AMD Ryzen 2920X @ 3.50GHz | 128GB |
Poseidon | 8 × NVIDIA RTX A6000 - 46GB | 12 × Intel Xeon Gold 6348 @ 2.60GHz | 1TB |
Artemis | 8 × NVIDIA RTX A6000 - 46GB | 12 × Intel Xeon Gold 6348 @ 2.60GHz | 1TB |
Maia | 4 × Quadro RTX 6000 - 24GB and 4 × RTX 2080 Ti - 12GB | 12 × Intel Xeon Silver 4214 @ 2.20GHz | 256GB |
Login
If you don’t have a user account, please send us a message on Slack. Someone with sudo powers will create one for you. The instructions to create a consistent user across all servers are here: https://github.com/deep-spin/wiki/wiki/Creating-users-in-the-servers. To create a user account on the Maia server, send an email to the Maia server admin.
Note that password-based authentication is disabled, so all users need to authenticate via SSH keys. You can find a helpful guide on how to generate SSH keys here. After that, check also how to create an SSH config file.
Disks
All servers have one disk dedicated to the operating system (`home disks`), while some servers are equipped with extra disks to store data, like saved models, datasets, etc. (`data disks`). The guidelines for disk usage are here: https://github.com/deep-spin/wiki/wiki/Disk-Usage-Guidelines-for-SARDINE-Servers. Make sure to read them as well.
CPU/GPU/RAM
We use `slurm` to manage the usage of our resources on Artemis and Poseidon. Please read this guide for more information: https://github.com/deep-spin/wiki/wiki/Slurm-Usage-Guidelines-for-SARDINE-Servers.
🚨🚨🚨 IMPORTANT: if you want to use GPUs on Artemis or Poseidon, please make sure to use slurm. Otherwise your process will be killed. 🚨🚨🚨
In addition, we strongly recommend following these guidelines:
- Monitor RAM usage using the command `htop`.
- Monitor GPU usage using the command `nvidia-smi` or `nvitop`.
- Monitor disk usage of your home folder using the command `ncdu`.
- Monitor disk usage of all disks using the command `duf` or the usual `df -h`.
- Store large data, such as datasets, models, and checkpoints on the shared data disks located at `/mnt/data*`.
- Use a tool such as `tmux` to manage processes remotely via ssh.
- Always use virtual environments for your projects.
- Avoid performing installations with `sudo` unless necessary.
  - We have global CUDA installations on the servers, so try to use those instead. If a specific CUDA version is required, install it locally (see how to do it here).
  - If you still believe you need a `sudo` installation, talk to us on Slack.
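The monitoring commands above can be combined into a quick one-shot check. The sketch below is only an illustration: the helper name `run_if_available` is hypothetical, and `free -h` and `du -sh` are stand-ins for the interactive `htop` and `ncdu`, which do not fit well in a script.

```shell
# Run a command only if it is installed; otherwise report that it was skipped.
# (run_if_available is a hypothetical helper, not something on the servers.)
run_if_available() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@" || echo "failed: $1 exited with an error"
    else
        echo "skipped: $1 is not installed"
    fi
}

# One-shot RAM summary (stand-in for the interactive htop).
run_if_available free -h
# GPU utilization and memory table.
run_if_available nvidia-smi
# Free space on every mounted disk.
run_if_available df -h
# Total size of your home folder (stand-in for ncdu; can be slow).
run_if_available du -sh "$HOME"
```

Running this before and after launching a job gives a rough picture of what the job consumes.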
Basic tmux usage:
- `tmux`: create a tmux session
- `CTRL+B` → `D`: exit from the session without killing it
- `CTRL+B` → `[`: enable scrolling (ESC to return)
- `CTRL+D`: exit and kill the current session
- `tmux list-sessions`: show a list of your sessions
- `tmux attach -t ID`: recover a specific session
Tips
SSH:
- Create an SSH config file on your computer to label IP addresses for easy access. For example, instead of typing Artemis’ full IP address, you can define a short alias such as `artemis` and connect with `ssh artemis`.
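A minimal sketch of such a config entry, written from the shell. The IP is the one listed for Artemis in the server table; `your_username` is a placeholder, and the `SSH_CONFIG` variable is only an illustrative convenience so the path can be overridden.

```shell
# Path of the per-user OpenSSH client configuration file.
SSH_CONFIG="${SSH_CONFIG:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$SSH_CONFIG")"

# Append an alias for Artemis. Replace your_username (a placeholder)
# with your account name on the server.
cat >> "$SSH_CONFIG" <<'EOF'
Host artemis
    HostName 193.136.166.83
    User your_username
EOF

# OpenSSH rejects config files that other users can write to.
chmod 600 "$SSH_CONFIG"
```

After this, `ssh artemis` is equivalent to `ssh your_username@193.136.166.83`. Run the snippet only once, since it appends to the file each time.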
Virtual envs:
- A handy way to create a virtualenv is via `python3 -m venv env`. This will create a virtualenv named `env` in your current folder.
- If you want a smart solution to manage multiple virtual envs along with Python versions, check pyenv-virtualenv.
- Primarily create virtual environments on your home disk (e.g., `/home/your_username`).
- Read this guide for more information on how to manage virtualenv.
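As a rough sketch, the full create/activate/deactivate cycle could look like the following. The project folder `my_project` is a hypothetical name; everything else uses the standard `venv` module that ships with Python 3.

```shell
# Hypothetical project folder under your home directory.
PROJECT_DIR="${PROJECT_DIR:-$HOME/my_project}"
mkdir -p "$PROJECT_DIR"
cd "$PROJECT_DIR"

# Create a virtualenv named "env" inside the project folder.
python3 -m venv env

# Activate it for the current shell ("." is the portable form of "source").
. env/bin/activate

# While active, python resolves to env/bin/python;
# this prints the path of the env folder.
python -c 'import sys; print(sys.prefix)'

# Leave the virtualenv when done.
deactivate
```

Packages installed with `pip install` while the environment is active end up inside `env`, so no `sudo` is needed.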
GPUs in PyTorch
- In PyTorch, specify the device to use by calling `.to()`, e.g., `tensor.to(gpu_id)`.
- Alternatively, you can use the environment variable `CUDA_VISIBLE_DEVICES` to set specific GPU ids for the entire program, e.g., `CUDA_VISIBLE_DEVICES=0 python3 ...`
- Note that with `slurm` this may be obsolete.
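The difference between setting `CUDA_VISIBLE_DEVICES` for a single command and for the whole shell can be seen without a GPU. In this sketch, `sh -c 'echo ...'` is just a stand-in for your own training command.

```shell
# Prefixing a command sets the variable for that one process only.
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "job sees GPUs: $CUDA_VISIBLE_DEVICES"'
# prints: job sees GPUs: 1

# export sets it for every later command started from this shell.
export CUDA_VISIBLE_DEVICES=0,2
sh -c 'echo "job sees GPUs: $CUDA_VISIBLE_DEVICES"'
# prints: job sees GPUs: 0,2
```

Inside the process, the visible devices are renumbered starting from 0, so with `CUDA_VISIBLE_DEVICES=1` the physical GPU 1 is addressed as `cuda:0`.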
Jupyter Notebooks
- If you want to use Jupyter Notebooks online:
  - start an ssh connection with the `-L` flag: `ssh -L 8888:127.0.0.1:8888 username@ip_address`
  - run jupyter on a specific port: `jupyter-lab --port 8888`
  - launch 127.0.0.1:8888 on your browser
- In Jupyter, you can add the following magic command to the first cell of your notebook to set the visible CUDA devices: `%env CUDA_VISIBLE_DEVICES=0`
File upload/download
- If you use PyCharm, I highly recommend its deployment tool to quickly upload/download files.
Q&A
Q: I got a message saying the disk is out of space. What should I do?
A: Notify all members in the `sardine-servers` Slack channel for assistance.
Q: The server was rebooted and I can no longer access the `/mnt/data` disks. What should I do?
A: Check if the IPs are correct and if the disks are mounted properly: https://github.com/deep-spin/wiki/wiki/Sharing-filesystem-across-servers
Q: I can see a data disk but I am unable to see the contents in it. What happened?
A: If you can still access the server’s home disk, the most likely explanation is that the data disk is full. In this case, someone will need to go to the server room, connect a keyboard, and free up space manually. Let’s try to avoid this situation by monitoring disk usage regularly 🙃.
Q: We just received a new disk! How can I “install” it?
A: Follow this guide: https://github.com/deep-spin/wiki/wiki/Partitioning-and-mounting-new-disks