Disk Usage Guidelines for SARDINE Servers
This document outlines the best practices for managing disk space across the SARDINE servers. It aims to ensure efficient and responsible use of the available disk resources, providing a smooth experience for all users.
Table of Contents
- Servers and Disk Types
- Usage Quota per User
- Disk Types
- Maintenance
- Management and Use of Shared Resources
- Deleted Users Data Management
- Troubleshooting
- Tips for Efficient Disk Use
Servers
Server | Type | Filesystem | Local Mount Path | Global Mount Path | Size |
---|---|---|---|---|---|
Hermes | home | /dev/sda1 | / | - | 916G |
Hermes | data | /dev/sdb1 | /mnt/data | /mnt/data | 2.7T |
Athena | home | /dev/sda1 | / | - | 916G |
Athena | home | /dev/sdb1 | /home | - | 1.8T |
Zeus | home | /dev/nvme0n1p2 | / | - | 228G |
Zeus | data | /dev/sda1 | /media/hdd1 | /mnt/data-zeus1 | 5.5T |
Zeus | data | /dev/sdb1 | /media/hdd2 | /mnt/data-zeus2 | 5.5T |
Hera | home | /dev/nvme0n1p2 | / | - | 228G |
Hera | data | /dev/sda1 | /media/hdd1 | /mnt/data-hera1 | 5.5T |
Hera | data | /dev/sdb1 | /media/hdd2 | /mnt/data-hera2 | 5.5T |
Maia | home | /dev/sda3 | / | - | 2.8T |
Poseidon | home | /dev/sdb2 | / | - | 877G |
Poseidon | data | /dev/sda1 | /media/hdd1 | /mnt/data-poseidon | 3.4T |
Artemis | home | /dev/sdb2 | / | - | 870G |
Artemis | data | /dev/sda | /media/hdd1 | /mnt/data-artemis | 4.4T |
Artemis | data | /dev/nvme*n1 | /media/scratch | /mnt/scratch | 46T |
Disk Types
Home Disks
The home disk (/home) is intended for personal use, including user configuration files, scripts, and source code. Keep in mind the storage quota and use this space for items that require regular, direct access.
Data Disks
Data disks (/mnt/data, /mnt/data-zeus1, /mnt/data-hera1, etc.) are designed for storing larger, more critical datasets, models, and checkpoints. These disks have a larger capacity and are shared across servers, making them ideal for collaboration and large-scale projects.
Scratch Disks
The scratch disk (/mnt/scratch-artemis) is a high-speed, temporary storage solution that uses RAID 0 for increased performance. RAID 0 provides no redundancy, so the failure of a single disk loses the data on the whole array, and nothing stored here is backed up automatically. Store large, temporary files here, especially those needed for high-performance computations, and be prepared to lose this data in the event of disk failure.
Usage Quota per User
To maintain optimal server performance and to ensure fair resource allocation among all users, we have implemented a usage quota system. Each user is allocated a specific amount of disk space on the home, data, and scratch disks:
- Home Disk Quota: Max of 50GB per user. This space should be used primarily for user-specific files, such as scripts, code, and small datasets.
- Data Disk Quota: Max of 250GB per user. This space is intended for larger files, such as datasets, models, and checkpoints.
- Scratch Disk Quota: Max of 4TB per user, with the understanding that data stored here is temporary and can be deleted without notice.
Please be mindful of your disk usage and regularly clean up unnecessary files. Users exceeding their quotas may experience restrictions on their ability to save new data.
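The quotas above can be checked by hand with a small helper. This is a minimal sketch, not an official tool: the function name check_quota is hypothetical, and it assumes the quotas are counted in GiB and that your data lives under a single directory per disk.

```shell
# Compare the size of a directory against a quota given in GiB.
# Usage: check_quota DIR LIMIT_GIB   (hypothetical helper, not part of the servers)
check_quota() (
    dir=$1
    limit_gb=$2
    # du -sk prints the total size in KiB; unreadable entries are ignored
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    if [ -z "$used_kb" ]; then
        echo "cannot read $dir" >&2
        return 2
    fi
    limit_kb=$((limit_gb * 1024 * 1024))
    if [ "$used_kb" -gt "$limit_kb" ]; then
        echo "OVER: $dir uses $((used_kb / 1024)) MiB of ${limit_gb} GiB"
        return 1
    fi
    echo "OK: $dir uses $((used_kb / 1024)) MiB of ${limit_gb} GiB"
)
```

For example, `check_quota "$HOME" 50` checks the home quota, and `check_quota /mnt/data/your_username 250` checks a data-disk folder, assuming you keep your data in a per-user folder like that.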
Maintenance
Caches
Many applications, such as Hugging Face models, store large amounts of data in the ~/.cache directory, which can quickly consume disk space. To manage this:
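- Choose a dedicated folder for such applications on a data disk, e.g., /mnt/data/huggingface.
- Replace the original cache directory with a symbolic link pointing to the new location, using the following commands:
mv ~/.cache/huggingface /mnt/data/huggingface
ln -s /mnt/data/huggingface ~/.cache/huggingface
Note that if /mnt/data/huggingface already exists as a directory, mv places the cache inside it (as /mnt/data/huggingface/huggingface), so run these commands before that final directory exists or adjust the paths accordingly.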
This approach conserves space on the home disk and leverages the larger capacity of the data disks.
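The two commands above can be wrapped in a helper that refuses to run when the result would be ambiguous. This is a minimal sketch under stated assumptions: the function name relocate_cache is hypothetical, and the per-user target path in the usage note is only an illustration.

```shell
# Move a cache directory to a data disk and leave a symlink in its place.
# Usage: relocate_cache SRC DEST   (hypothetical helper)
relocate_cache() (
    src=$1
    dest=$2
    if [ -L "$src" ]; then
        echo "$src is already a symlink; nothing to do" >&2
        return 0
    fi
    if [ ! -d "$src" ]; then
        echo "$src is not a directory" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        # mv into an existing directory would nest SRC inside DEST
        echo "$dest already exists; choose another target" >&2
        return 1
    fi
    # Create only the parent, then move and link; stop at the first failure
    mkdir -p "$(dirname "$dest")" &&
    mv "$src" "$dest" &&
    ln -s "$dest" "$src"
)
```

A call might look like `relocate_cache ~/.cache/huggingface /mnt/data/your_username/huggingface`. As an alternative to a symlink, the Hugging Face libraries also read the HF_HOME environment variable to locate their cache, so exporting it in your shell profile may achieve the same effect.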
Virtual Environments
Effective management of virtual environments is crucial for maintaining project-specific dependencies and ensuring the smooth operation of our server infrastructure. Here’s how to best manage your virtual environments on the SARDINE servers:
- Creation: Primarily create virtual environments on your home disk (/home/your_username) to ensure compatibility with server-specific libraries and CUDA versions. This practice helps avoid issues that can arise from server-to-server variations in the environment. For projects requiring extensive libraries or dependencies that significantly exceed home disk space limitations, consider using data disks with caution (more below).
- Maintenance: Regardless of where your virtual environment is located, it’s important to regularly clean up unused environments and remove unnecessary packages. Tools like venv for Python virtual environments or conda for managing environments that may include Python and non-Python packages can help manage these environments efficiently. We also suggest the use of ncdu to investigate the size of each folder in your home.
Data Disk Caution: If you opt to use data disks (/mnt/data/your_username/envs), be mindful of potential path dependencies. Virtual environments on data disks may introduce complexities when accessed from different servers. Always ensure that paths in scripts and executables are relative or correctly mapped to the environment’s current location to mitigate issues related to library dependencies, Python versions, etc.
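To decide which environments to clean up, it can help to list them with their sizes. The sketch below is an illustration only: the function name list_envs is hypothetical, and it recognises environments solely by the pyvenv.cfg file that `python -m venv` writes, so conda environments are not covered (for those, `conda env list` shows the locations).

```shell
# List venv-style environments under ROOT with their sizes in KiB, largest first.
# Usage: list_envs [ROOT]   (hypothetical helper; defaults to $HOME)
list_envs() (
    root=${1:-$HOME}
    # pyvenv.cfg sits at the top of every environment created by `python -m venv`
    find "$root" -maxdepth 4 -name pyvenv.cfg 2>/dev/null |
    while IFS= read -r marker; do
        du -sk "$(dirname "$marker")" 2>/dev/null
    done | sort -rn
)
```

Running `list_envs` with no argument scans your home directory up to four levels deep; pass a folder such as /mnt/data/your_username/envs to scan a data disk instead.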
Management and Use of Shared Resources
To optimize our server storage and facilitate collaborative work, we have established shared directories in /mnt/data-shared (physically in /mnt/scratch-artemis/shared). This section outlines the organization, access, and usage guidelines for these shared resources.
Shared Directories Structure
The /mnt/data-shared directory is organized into the following main subdirectories:
- Datasets: /mnt/data-shared/datasets/ - A central repository for shared datasets
- Models: /mnt/data-shared/models/ - Contains machine learning models
Access Permissions
- Read Access: All users have read access to the shared directories, allowing them to utilize the datasets and models in their projects without duplication.
- Write Access: Write access is restricted to admins. This control ensures the integrity and organization of the shared resources.
Using Shared Resources
For more information, check the Shared Space page.
Deleted Users Data Management
When a user leaves the lab, it is crucial to manage their data on the servers efficiently to ensure that valuable disk space is not wasted. Here’s our strategy for managing the data of deleted users:
Data Retention and Deletion Policy
- Initial Step: Upon a user’s departure, the user or their direct supervisor should inform the admins via the sardine_servers Slack channel.
- Data Review: The user should identify any data that should be saved for ongoing or future projects and move it to /mnt/scratch/retained/username.
- Retention Period: Data transferred to the “retained” directory will be stored for 3 months.
- Final Deletion: After the 3-month period, the data will be permanently deleted.
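As an illustration of the retention period, the sketch below lists retained folders that look older than the retention window. It is an assumption-laden example, not the admins' actual procedure: the function name list_expired is hypothetical, 90 days stands in for "3 months", and a folder's modification time is only a rough proxy for when it was moved. It lists candidates and deletes nothing.

```shell
# List entries directly under RETAINED_ROOT not modified in the last DAYS days.
# Usage: list_expired RETAINED_ROOT [DAYS]   (hypothetical helper; lists only)
list_expired() (
    retained_root=$1
    days=${2:-90}
    # -mtime +N matches entries last modified more than N days ago
    find "$retained_root" -mindepth 1 -maxdepth 1 -mtime +"$days"
)
```

For example, `list_expired /mnt/scratch/retained` prints the per-user folders that would be due for review.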
Guidelines for Review and Transfer
- Collaborative Projects: For data associated with collaborative projects, consult with all project members before making decisions on data retention or deletion.
- Archiving: If certain datasets or project outputs are deemed valuable for long-term preservation, consider archiving them in the shared space /mnt/data-shared.
Troubleshooting
Inaccessible /mnt/data Disks
If you experience difficulties accessing /mnt/data disks, ensure proper mounting and network settings. Verify IP addresses and mount statuses. Notify all members in the sardine-servers Slack channel for assistance.
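Mount status can be verified quickly from a terminal. The following is a minimal sketch that assumes the mountpoint utility from util-linux is installed; the function name check_mounts is hypothetical.

```shell
# Report whether each given path is currently a mount point.
# Usage: check_mounts PATH...   (hypothetical helper)
check_mounts() (
    status=0
    for path in "$@"; do
        # mountpoint -q exits 0 only when PATH is a mount point
        if mountpoint -q "$path" 2>/dev/null; then
            echo "mounted: $path"
        else
            echo "NOT mounted: $path"
            status=1
        fi
    done
    return "$status"
)
```

For example, `check_mounts /mnt/data /mnt/data-zeus1 /mnt/data-hera1` prints one line per disk and returns a non-zero status if any of them is missing, which is useful information to include when asking for help.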
Full Disks
To prevent and address full disks:
- Regularly monitor disk usage and clean up large, unnecessary files.
- If a disk reaches capacity, causing access issues, manual cleanup may be required at the server location. Alert the sardine-servers Slack channel to coordinate a response.
Maintaining awareness of disk usage and adhering to quotas will help prevent these issues.
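When looking for large, unnecessary files, a short listing of the biggest files is often enough to start. This is a sketch only; the function name largest_files is hypothetical, and sizes are reported in KiB as measured by du.

```shell
# Print the COUNT largest files under DIR, biggest first (size in KiB, then path).
# Usage: largest_files [DIR] [COUNT]   (hypothetical helper)
largest_files() (
    dir=${1:-.}
    count=${2:-10}
    # -xdev keeps find on a single filesystem; du -k prints sizes in KiB
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null |
    sort -rn | head -n "$count"
)
```

For example, `largest_files "$HOME" 20` shows the twenty largest files in your home directory.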
Tips for Efficient Disk Use
Monitoring with ncdu
The ncdu tool (just type ncdu in your terminal) offers an interactive way to view and manage disk space usage, helping users identify and delete unnecessary files efficiently. It provides a detailed, navigable interface to review file sizes and directories, making it easier to manage disk space usage.
Alternative: You can use the good old du -hs * | sort -h in each folder to get a similar result.
Monitoring with duf
duf is a user-friendly command-line tool for disk usage monitoring, offering a visually appealing and intuitive interface to view information about your hard drive, mounted filesystems, and available disk space. Unlike traditional commands, duf provides a more readable and organized output, including graphs and color-coded displays.
Alternative: You can use the good old df -h to get a similar result.
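The df output can also be turned into a simple warning, for instance before starting a job that writes a lot of data. This is a minimal sketch; the function names disk_usage_percent and warn_if_full are hypothetical, and the 90% default threshold is an arbitrary choice rather than a lab policy.

```shell
# Print the used-space percentage (without the % sign) of the filesystem holding PATH.
disk_usage_percent() {
    # df -P prints one line per filesystem; column 5 is the capacity in percent
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem holding PATH is at or above THRESHOLD percent.
# Usage: warn_if_full PATH [THRESHOLD]   (hypothetical helper)
warn_if_full() (
    path=$1
    threshold=${2:-90}
    used=$(disk_usage_percent "$path")
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: $path is ${used}% full"
        return 1
    fi
    echo "$path is ${used}% full"
)
```

For example, `warn_if_full /mnt/data 85` prints a warning and returns a non-zero status once the data disk is at least 85% full.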