Disk Usage Guidelines for SARDINE Servers

This document outlines the best practices for managing disk space across the SARDINE servers. It aims to ensure efficient and responsible use of the available disk resources, providing a smooth experience for all users.

Servers

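| Server | Type | Filesystem | Local Mount Path | Global Mount Path | Size |
|---|---|---|---|---|---|
| Hermes | home | /dev/sda1 | / | - | 916G |
| Hermes | data | /dev/sdb1 | /mnt/data | /mnt/data | 2.7T |
| Athena | home | /dev/sda1 | / | - | 916G |
| Athena | home | /dev/sdb1 | /home | - | 1.8T |
| Zeus | home | /dev/nvme0n1p2 | / | - | 228G |
| Zeus | data | /dev/sda1 | /media/hdd1 | /mnt/data-zeus1 | 5.5T |
| Zeus | data | /dev/sdb1 | /media/hdd2 | /mnt/data-zeus2 | 5.5T |
| Hera | home | /dev/nvme0n1p2 | / | - | 228G |
| Hera | data | /dev/sda1 | /media/hdd1 | /mnt/data-hera1 | 5.5T |
| Hera | data | /dev/sdb1 | /media/hdd2 | /mnt/data-hera2 | 5.5T |
| Maia | home | /dev/sda3 | / | - | 2.8T |
| Poseidon | home | /dev/sdb2 | / | - | 877G |
| Poseidon | data | /dev/sda1 | /media/hdd1 | /mnt/data-poseidon | 3.4T |
| Artemis | home | /dev/sdb2 | / | - | 870G |
| Artemis | data | /dev/sda | /media/hdd1 | /mnt/data-artemis | 4.4T |
| Artemis | data | /dev/nvme*n1 | /media/scratch | /mnt/scratch | 46T |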

Disk Types

Home Disks

The home disk (/home) is intended for personal use, including user configuration files, scripts, and source code. Keep in mind the storage quota and use this space for items that require regular, direct access.

Data Disks

Data disks (/mnt/data, /mnt/data-zeus1, /mnt/data-hera1, etc.) are designed for storing larger, more critical datasets, models, and checkpoints. These disks have a larger capacity and are shared across servers, making them ideal for collaboration and large-scale projects.

Scratch Disks

The scratch disk (/mnt/scratch-artemis) is a high-speed, temporary storage solution that uses RAID 0 for increased performance. RAID 0 provides no redundancy, so the failure of a single drive loses the whole array, and the data is not backed up. Store large, temporary files here, especially those needed for high-performance computations, and be prepared to lose this data in the event of disk failure.


Usage Quota per User

To maintain optimal server performance and to ensure fair resource allocation among all users, we have implemented a usage quota system. Each user is allocated a specific amount of disk space on the home, data, and scratch disks:

  • Home Disk Quota: Max of 50GB per user. This space should be used primarily for user-specific files, such as scripts, code, and small datasets.
  • Data Disk Quota: Max of 250GB per user. This space is intended for larger files, such as datasets, models, and checkpoints.
  • Scratch Disk Quota: Max of 4TB per user, with the understanding that data stored here is temporary and can be deleted without notice.

Please be mindful of your disk usage and regularly clean up unnecessary files. Users exceeding their quotas may experience restrictions on their ability to save new data.
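
As a rough way to check your own usage against the quotas above, the following sketch wraps du in a small shell function. The function name check_quota is hypothetical, and it assumes that 1 GB means 1024^3 bytes; the servers' actual quota accounting may differ.

```shell
# Hypothetical helper: compare the size of a directory tree with a quota in GB.
# Assumption: 1 GB = 1024 * 1024 KB, which may not match the real quota accounting.
# Usage: check_quota <directory> <quota_gb>
check_quota() {
    local dir="$1" quota_gb="$2" used_kb quota_kb
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 2
    fi
    # du -sk prints the total size of the tree in kilobytes as the first field.
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    quota_kb=$((quota_gb * 1024 * 1024))
    if [ "$used_kb" -gt "$quota_kb" ]; then
        echo "OVER: $dir uses $((used_kb / 1024 / 1024)) GB of $quota_gb GB"
        return 1
    fi
    echo "OK: $dir uses $((used_kb / 1024 / 1024)) GB of $quota_gb GB"
}
```

For example, check_quota "$HOME" 50 checks the home quota, and check_quota /mnt/data/your_username 250 would check a data directory, assuming that is where your data lives.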


Maintenance

Caches

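Many applications and libraries, such as Hugging Face, store large amounts of data (for example, downloaded models) in the ~/.cache directory, which can quickly consume disk space. To manage this:

  1. Choose a dedicated location for such applications on a data disk, e.g., /mnt/data/huggingface. Do not create the folder beforehand: if it already exists, mv places the cache inside it (as /mnt/data/huggingface/huggingface) and the symbolic link then points one level too high.
  2. Move the cache there and replace the original cache directory with a symbolic link pointing to the new location, using the following commands:
mv ~/.cache/huggingface /mnt/data/huggingface
ln -s /mnt/data/huggingface ~/.cache/huggingface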

This approach conserves space on the home disk and leverages the larger capacity of the data disks.
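
The move-and-link steps can also be wrapped in a function that refuses to run when the target already exists. This is only a sketch: relocate_cache is a hypothetical name, not an existing tool on the servers.

```shell
# Hypothetical helper: move a cache directory to a data disk and leave a symlink behind.
# Usage: relocate_cache <source_dir> <target_dir>
relocate_cache() {
    local src="$1" dst="$2"
    # Nothing to do if the source has already been replaced by a symlink.
    if [ -L "$src" ]; then
        echo "already linked: $src -> $(readlink "$src")"
        return 0
    fi
    # Refuse to continue if the target exists: mv would nest the cache inside it.
    if [ -e "$dst" ]; then
        echo "target already exists: $dst" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")" || return 1
    if [ -d "$src" ]; then
        mv "$src" "$dst" || return 1
    else
        mkdir -p "$dst" || return 1   # no cache yet, start with an empty one
    fi
    mkdir -p "$(dirname "$src")" || return 1
    ln -s "$dst" "$src"
}
```

A call such as relocate_cache ~/.cache/huggingface /mnt/data/huggingface reproduces the two commands above. For Hugging Face specifically, setting the HF_HOME environment variable (for example, export HF_HOME=/mnt/data/huggingface in your shell profile) is an alternative that avoids the symbolic link.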

Virtual Environments

Effective management of virtual environments is crucial for maintaining project-specific dependencies and ensuring the smooth operation of our server infrastructure. Here’s how to best manage your virtual environments on the SARDINE servers:

  • Creation: Primarily create virtual environments on your home disk (/home/your_username) to ensure compatibility with server-specific libraries and CUDA versions. This practice helps avoid issues that can arise from server-to-server variations in the environment. For projects requiring extensive libraries or dependencies that significantly exceed home disk space limitations, consider using data disks with caution (more below).

  • Maintenance: Regardless of where your virtual environment is located, it’s important to regularly clean up unused environments and remove unnecessary packages. Tools like venv for Python virtual environments or conda for managing environments that may include Python and non-Python packages can help manage these environments efficiently. We also suggest using ncdu to investigate the size of each folder in your home directory.

Data Disk Caution: If you opt to use data disks (/mnt/data/your_username/envs), be mindful of potential path dependencies. Virtual environments on data disks may introduce complexities when accessed from different servers. Always ensure that paths in scripts and executables are relative or correctly mapped to the environment’s current location to mitigate issues related to library dependencies, Python versions, etc.
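
To find environments worth cleaning up, a sketch like the one below lists environments created with venv and their sizes. It relies on the pyvenv.cfg file that python -m venv writes at the top of each environment; the function name list_venvs and the search depth of three levels are arbitrary choices for illustration, and conda environments are not detected.

```shell
# Hypothetical helper: list Python virtual environments under a directory, smallest first.
# Usage: list_venvs <root_dir>
list_venvs() {
    local root="$1"
    # Each environment created by "python -m venv" contains a pyvenv.cfg file at its top level.
    find "$root" -maxdepth 3 -type f -name pyvenv.cfg 2>/dev/null |
    while IFS= read -r cfg; do
        du -sh "$(dirname "$cfg")"
    done | sort -h
}
```

Running list_venvs "$HOME" prints one line per environment with its size, so unused ones can be removed with rm -rf after checking that no project still depends on them.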


Management and Use of Shared Resources

To optimize our server storage and facilitate collaborative work, we have established shared directories in /mnt/data-shared (physically in /mnt/scratch-artemis/shared). This section outlines the organization, access, and usage guidelines for these shared resources.

Shared Directories Structure

The /mnt/data-shared directory is organized into the following main subdirectories:

  • Datasets: /mnt/data-shared/datasets/ - A central repository for shared datasets
  • Models: /mnt/data-shared/models/ - Contains machine learning models

Access Permissions

  • Read Access: All users have read access to the shared directories, allowing them to use the datasets and models in their projects without duplication.
  • Write Access: Write access is restricted to admins. This control ensures the integrity and organization of the shared resources.
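
One way to use a shared dataset or model without duplicating it is to link it into your project instead of copying it. The sketch below assumes this workflow is acceptable on the servers; link_shared is a hypothetical name, and the dataset name in the usage note is a placeholder.

```shell
# Hypothetical helper: link a shared dataset or model into a project instead of copying it.
# Usage: link_shared <shared_path> <link_path>
link_shared() {
    local shared="$1" link="$2"
    if [ ! -r "$shared" ]; then
        echo "not readable: $shared" >&2
        return 1
    fi
    # Never replace a real file or directory, only a previous symlink.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "refusing to overwrite: $link" >&2
        return 1
    fi
    mkdir -p "$(dirname "$link")" || return 1
    # -s: symbolic link, -f: replace an existing link, -n: do not follow an existing link
    ln -sfn "$shared" "$link"
}
```

For instance, link_shared /mnt/data-shared/datasets/some_dataset ~/project/data/some_dataset makes the dataset visible inside the project while the data itself stays in the shared space.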

Using Shared Resources

For more information, check the Shared Space page.


Deleted Users Data Management

When a user leaves the lab, it is crucial to manage their data on the servers efficiently to ensure that valuable disk space is not wasted. Here’s our strategy for managing the data of deleted users:

Data Retention and Deletion Policy

  • Initial Step: Upon a user’s departure, the user or their direct supervisor should inform the admins via the sardine-servers Slack channel.

  • Data Review: The user should identify any data that should be saved for ongoing or future projects, and move it to /mnt/scratch/retained/username.

  • Retention Period: Data transferred to the “retained” directory will be stored for a period of 3 months.

  • Final Deletion: After the 3-month period, the data will be permanently deleted.

Guidelines for Review and Transfer

  1. Collaborative Projects: For data associated with collaborative projects, consult with all project members before making decisions on data retention or deletion.

  2. Archiving: If certain datasets or project outputs are deemed valuable for long-term preservation, consider archiving them in the shared space /mnt/data-shared.
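
For admins, a sketch like the following could list retained directories that are past the retention period. It is not the admins' actual procedure: list_expired is a hypothetical name, 3 months is approximated as 90 days, and the modification time stands in for the transfer date, which is not reliable if a directory is touched after being moved.

```shell
# Hypothetical helper: list entries in a "retained" directory older than a number of days.
# Assumption: the modification time approximates the date the data was transferred.
# Usage: list_expired <retained_dir> [days]
list_expired() {
    local retained="$1" days="${2:-90}"
    # -mtime +N matches entries last modified more than N whole days ago.
    find "$retained" -mindepth 1 -maxdepth 1 -mtime +"$days" 2>/dev/null | sort
}
```

Calling list_expired /mnt/scratch/retained only prints candidates; deleting them stays a separate, manual step.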


Troubleshooting

Inaccessible /mnt/data Disks

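If you have difficulty accessing the /mnt/data disks, first check that the disk is mounted on the server you are logged into and that the network connection to the server hosting the disk is working (verify IP addresses and mount status). If the problem persists, notify the sardine-servers Slack channel to get assistance.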
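
A quick mount check can be done with df, as in this sketch. The function name check_mount is hypothetical, and the check assumes the mount path contains no whitespace. Note that df itself can hang when a network mount is unresponsive.

```shell
# Hypothetical helper: report whether a path is itself a mount point.
# Prints "mounted", "not mounted", or "missing".
# Usage: check_mount <path>
check_mount() {
    local path="${1%/}" mounted_on
    [ -n "$path" ] || path=/
    if [ ! -d "$path" ]; then
        echo "missing"
        return 2
    fi
    # df -P prints one line per filesystem; the last column is where it is mounted.
    mounted_on=$(df -P "$path" 2>/dev/null | awk 'NR==2 {print $6}')
    if [ "$mounted_on" = "$path" ]; then
        echo "mounted"
    else
        echo "not mounted"
        return 1
    fi
}
```

For example, check_mount /mnt/data prints "not mounted" when the directory exists but the data disk is not attached to it, which is worth including in your message to the Slack channel.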

Full Disks

To prevent and address full disks:

  • Regularly monitor disk usage and clean up large, unnecessary files.
  • If a disk reaches capacity, causing access issues, manual cleanup may be required at the server location. Alert the sardine-servers Slack channel to coordinate a response.

Maintaining awareness of disk usage and adhering to quotas will help prevent these issues.
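
When a disk is filling up, listing the largest files is often the fastest way to find what to clean. The sketch below assumes GNU find (for -printf), which is standard on Linux; largest_files is a hypothetical name.

```shell
# Hypothetical helper: list the largest files under a directory, biggest first.
# Output format: size in bytes, a tab, then the path.
# Usage: largest_files <directory> [count]
largest_files() {
    local dir="$1" count="${2:-10}"
    # -xdev keeps find on a single filesystem; -printf is a GNU find extension.
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null |
        sort -rn | head -n "$count"
}
```

For example, largest_files "$HOME" 20 shows the twenty largest files in your home directory.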


Tips for Efficient Disk Use

Monitoring with ncdu

ncdu offers an interactive way to view and manage disk space usage, helping users identify and delete unnecessary files efficiently. To start it, type ncdu in your terminal. Here’s an example of how ncdu looks:

[Screenshot: ncdu]

This command provides a detailed, navigable interface to review file sizes and directories, making it easier to manage disk space usage.

Alternative: You can use the good old du -hs * | sort -h in each folder to get a similar result.

Monitoring with duf

duf is a user-friendly command-line tool for disk usage monitoring, offering a visually appealing and intuitive interface to view information about your hard drive, mounted filesystems, and available disk space. Unlike traditional commands, duf provides a more readable and organized output, including graphs and color-coded displays.

[Screenshot: duf]

Alternative: You can use the good old df -h to get a similar result.
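
The df output can also be checked from a script, for instance to warn before a disk fills up. This is a minimal sketch: the function names and the default 90% threshold are assumptions, not an existing monitoring setup on the servers.

```shell
# Hypothetical helper: print the usage percentage (without "%") of the filesystem holding a path.
disk_use_percent() {
    # df -P: column 5 is the capacity in use, e.g. "42%"; strip the percent sign.
    df -P "$1" 2>/dev/null | awk 'NR==2 {sub(/%/, "", $5); print $5}'
}

# Hypothetical helper: warn when usage reaches a threshold (default 90, an arbitrary choice).
# Usage: warn_if_full <path> [limit_percent]
warn_if_full() {
    local path="$1" limit="${2:-90}" used
    used=$(disk_use_percent "$path")
    if [ -z "$used" ]; then
        echo "cannot read usage for $path" >&2
        return 2
    fi
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full"
        return 1
    fi
    echo "$path is ${used}% full"
}
```

For example, warn_if_full /mnt/data 85 returns a non-zero status once the data disk is at least 85% full.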