General Usage · deep-spin/wiki Wiki


This document outlines the best practices for using SARDINE servers.

🚨🚨🚨 Please make sure to read this entire document before running any experiment. 🚨🚨🚨 You can also find a short version of our guidelines here in slide format, but keep in mind that it might be incomplete or outdated.

Servers

| Server | IP/Hostname | Location |
| --- | --- | --- |
| 🪽 Hermes | 193.136.223.55 | IT building, -1 floor |
| 🦉 Athena | 193.136.223.41 | IT building, -1 floor |
| ⚡ Zeus | 193.136.223.42 | IT building, -1 floor |
| 🏠 Hera | 193.136.223.39 | IT building, -1 floor |
| 🔱 Poseidon | 193.136.166.82 | Taguspark |
| 🏹 Artemis | 193.136.166.83 | Taguspark |
| 🔮 Maia | maia.hlt.inesc-id.pt | INESC facility |

To access the servers located in the IT building, exit the elevator and follow the corridor to the left until the end. The door is on the right wall (room 01.20). The key to the door is usually in the lab, on the “mini-table” between Tsvety’s and Ben’s desks. Physical locations:

  • ⚪️ Hermes is in a white computer case
  • ⚫️ Athena is in a black case
  • 🧱 Zeus is next to the left wall
  • 🪑 Hera is under the desk close to Zeus
  • Poseidon and Artemis can be accessed remotely through a nohup connection (we can turn it on and off as we wish).

Maia is located at the INESC facility and is shared with other groups. Contact the Maia server admin at hlt-admin@inesc-id.pt to get access. CC André on your email and mention that you are his student.

Specifications

| Server | GPUs | CPU | RAM |
| --- | --- | --- | --- |
| Hermes | 4 × Titan Xp - 12GB | 16 × AMD Ryzen 1950X @ 3.40GHz | 128GB |
| Athena | 4 × GTX 1080 Ti - 12GB | 8 × Intel i7-9800X @ 3.80GHz | 128GB |
| Zeus | 3 × RTX 2080 Ti - 12GB | 12 × AMD Ryzen 2920X @ 3.50GHz | 128GB |
| Hera | 3 × RTX 2080 Ti - 12GB | 12 × AMD Ryzen 2920X @ 3.50GHz | 128GB |
| Poseidon | 8 × NVIDIA RTX A6000 - 46GB | 12 × Intel Xeon Gold 6348 @ 2.60GHz | 1TB |
| Artemis | 8 × NVIDIA RTX A6000 - 46GB | 12 × Intel Xeon Gold 6348 @ 2.60GHz | 1TB |
| Maia | 4 × Quadro RTX 6000 - 24GB + 4 × RTX 2080 Ti - 12GB | 12 × Intel Xeon Silver 4214 @ 2.20GHz | 256GB |

Login

If you don’t have a user account, please send us a message on Slack. Someone with sudo powers will create one for you. The instructions for creating a consistent user across all servers are here: https://github.com/deep-spin/wiki/wiki/Creating-users-in-the-servers. To create a user account on the Maia server, send an email to the Maia server admin.

Note that password-based authentication is disabled, so all users need to authenticate with SSH keys. You can find a helpful guide on how to generate SSH keys here. After that, also check how to create an SSH config file.
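As a rough sketch of the key-generation step (the linked guide remains the reference), the commands below create an Ed25519 key pair with ssh-keygen and print the public key, which is the half that has to be installed on the servers. The key path, the email label, and the empty passphrase (`-N ""`) are assumptions chosen so that the snippet runs without prompts; drop `-N ""` to be asked for a passphrase instead.

```shell
# Assumed key location; ~/.ssh/id_ed25519 is the OpenSSH default for Ed25519 keys.
key_file="${KEY_FILE:-$HOME/.ssh/id_ed25519}"
mkdir -p "$(dirname "$key_file")"

# Only generate a key if none exists yet, so an existing key is never overwritten.
if [ ! -f "$key_file" ]; then
  # -t: key type, -C: free-form label, -f: output file,
  # -N "": empty passphrase (assumption, keeps the run non-interactive).
  ssh-keygen -q -t ed25519 -C "your_email@example.com" -f "$key_file" -N ""
fi

# The public half is the one that goes on the servers; never share the private key.
cat "${key_file}.pub"
```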

Disks

All servers have one disk dedicated to the operating system (home disks), while some servers are equipped with extra disks to store data such as saved models and datasets (data disks). The guidelines for disk usage are here: https://github.com/deep-spin/wiki/wiki/Disk-Usage-Guidelines-for-SARDINE-Servers. Make sure to read them as well.

CPU/GPU/RAM

We use slurm to manage the usage of our resources on Artemis and Poseidon. Please read this guide for more information: https://github.com/deep-spin/wiki/wiki/Slurm-Usage-Guidelines-for-SARDINE-Servers.

🚨🚨🚨 IMPORTANT: if you want to use GPUs on Artemis or Poseidon, please make sure to use slurm. Otherwise your process will be killed. 🚨🚨🚨

In addition, we strongly recommend that you follow these guidelines:

  1. Monitor RAM usage using the command htop.
  2. Monitor GPU usage using the command nvidia-smi or nvitop.
  3. Monitor disk usage of your home folder using the command ncdu.
  4. Monitor disk usage of all disks using the command duf or the usual df -h.
  5. Store large data, such as datasets, models, and checkpoints on the shared data disks located at /mnt/data*.
  6. Use a tool such as tmux to manage processes remotely via ssh.
  7. Always use virtual environments for your projects.
  8. Avoid performing installations with sudo unless necessary.
    • We have global CUDA installations on the servers, so try to use those instead. If a specific CUDA version is required, install it locally (see how to do it here).
    • If you still believe you need a sudo installation, talk to us on Slack.
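
Not every tool named in the guidelines above is guaranteed to be installed on every machine. The following sketch, which is only an illustration and not part of the official setup, reports which of them are on your PATH on the server you are logged into. The function name `check_tools` is made up for this example.

```shell
# Report which of the given tools can be found on the PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: available"
    else
      echo "$tool: not found"
    fi
  done
}

# Tools mentioned in the guidelines above.
check_tools htop nvidia-smi nvitop ncdu duf tmux
```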

Basic tmux usage:

  • tmux: create a tmux session
  • CTRL+B D: exit from the session without killing it
  • CTRL+B [: enable scrolling (ESC to return)
  • CTRL+D: exit and kill the current session
  • tmux list-sessions: show a list of your sessions
  • tmux attach -t ID: recover a specific session

Tips

SSH:

  • Create an SSH config file on your computer to give each server a short alias. For example, instead of typing Artemis’s full IP address, you can then connect with ssh artemis.
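
A minimal sketch of such a config entry, written here as a shell snippet that appends it to ~/.ssh/config. The IP address is the one listed for Artemis in the servers table; `your_username` and the key path are placeholders to replace with your own values.

```shell
# Append a host alias to your SSH config.
# Run this only once: running it again would add a duplicate entry.
config_file="${SSH_CONFIG_FILE:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$config_file")"

cat >> "$config_file" <<'EOF'
Host artemis
    HostName 193.136.166.83
    User your_username
    IdentityFile ~/.ssh/id_ed25519
EOF

# SSH expects the config file to be writable only by its owner.
chmod 600 "$config_file"

# Afterwards, connect with:  ssh artemis
```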

Virtual envs:

  • A handy way to create a virtualenv is via python3 -m venv env. This will create a virtualenv named env in your current folder.
  • If you want a smart solution to manage multiple virtual envs along with python versions, check pyenv-virtualenv.
  • Primarily create virtual environments on your home disk (e.g., /home/your_username).
  • Read this guide for more information on how to manage virtualenv.
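
For reference, a typical session with the venv approach mentioned above might look like the sketch below. It assumes a POSIX-style shell and that the venv module is installed for your python3.

```shell
# Create a virtualenv named "env" in the current folder.
python3 -m venv env

# Activate it ("source env/bin/activate" is equivalent in bash).
. env/bin/activate

# "python" now resolves to the copy inside env/.
command -v python

# Leave the environment when you are done.
deactivate
```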

GPUs in PyTorch

  • In PyTorch, specify the device to use by calling .to(), e.g., tensor.to(gpu_id).
  • Alternatively, you can set the environment variable CUDA_VISIBLE_DEVICES to restrict the entire program to specific GPU ids, e.g., CUDA_VISIBLE_DEVICES=0 python3 ...
  • Note that with slurm this may be unnecessary, since slurm typically restricts each job to the GPUs it was allocated.
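
To illustrate how the environment variable approach is scoped, the sketch below sets CUDA_VISIBLE_DEVICES for a single command and prints what that process sees. It deliberately uses a tiny inline Python command instead of a real training script, so it runs even without PyTorch or a GPU; the device id 1 is an arbitrary example.

```shell
# Prefixing a command with VAR=value sets the variable for that process only.
visible="$(CUDA_VISIBLE_DEVICES=1 python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])')"
echo "the process sees CUDA_VISIBLE_DEVICES=$visible"
```

Keep in mind that CUDA renumbers the visible devices: with CUDA_VISIBLE_DEVICES=1, the program sees a single GPU, which it addresses as cuda:0.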

Jupyter Notebooks

  • If you want to use Jupyter Notebooks on a server from your local browser:
    1. start an SSH connection with the -L flag: ssh -L 8888:127.0.0.1:8888 username@ip_address
    2. run Jupyter on a specific port: jupyter-lab --port 8888
    3. open 127.0.0.1:8888 in your browser
  • In Jupyter, you can add the following magic command to the first cell of your notebook to select the visible CUDA devices: %env CUDA_VISIBLE_DEVICES=0

File upload/download

  • If you use PyCharm, I highly recommend its deployment tool for quickly uploading and downloading files.

Q&A

Q: I got a message saying the disk is out of space. What should I do?

A: Notify all members in the sardine-servers slack channel for assistance.

Q: The server was rebooted and I can no longer access the /mnt/data disks. What should I do?

A: Check if the IPs are correct and if the disks are mounted properly: https://github.com/deep-spin/wiki/wiki/Sharing-filesystem-across-servers

Q: I can see a data disk but I am unable to see the contents in it. What happened?

A: If you can still access the server’s home disk, the most likely explanation is that the data disk is full. In this case, someone will need to go to the server room, connect a keyboard, and free up space manually. Let’s try to avoid this situation by monitoring disk usage regularly 🙃.
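
One possible way to make this monitoring routine is a small helper that warns when a filesystem is nearly full. The sketch below is an illustration only: the function name `warn_full_disks` and the 90% threshold are assumptions, not part of the server setup.

```shell
# Warn about every mounted filesystem whose usage is at or above a threshold.
warn_full_disks() {
  limit="${1:-90}"   # threshold in percent (90 is an arbitrary default)
  # df -P prints one line per filesystem in POSIX format:
  # filesystem, blocks, used, available, capacity (e.g. 42%), mount point.
  df -P | awk -v limit="$limit" 'NR > 1 {
    usage = $5
    sub(/%/, "", usage)
    if (usage + 0 >= limit) print "WARNING: " $6 " is " $5 " full"
  }'
}

warn_full_disks 90
```

The function prints nothing when every filesystem is below the threshold. Mount points that contain spaces are not handled by this sketch.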

Q: We just received a new disk! How can I “install” it?

A: Follow this guide: https://github.com/deep-spin/wiki/wiki/Partitioning-and-mounting-new-disks