Shared space · deep-spin/wiki Wiki


Table of Contents

  1. Introduction
  2. Adding a Shared Model (for admins)
  3. Using a Shared Model
  4. Adding and Using a Shared Dataset
  5. Contributing to Shared Directories
  6. Maintenance and Updates
  7. Best Practices

Introduction

The shared space is located at /mnt/data-shared (physically at /mnt/scratch-artemis/shared) and contains:

  • Datasets: /mnt/data-shared/datasets/ - A central repository for shared datasets
  • Models: /mnt/data-shared/models/ - Contains machine learning models (e.g., from the Hugging Face Hub)

Note that all users have read access to the shared directories, but write access is restricted to admins.

Adding a shared model (for admins)

  1. Download the model into /mnt/data-shared/models/. For instance, for a model available on the Hugging Face Hub (Git LFS must be installed, otherwise the weight files are cloned as small pointer files):
cd /mnt/data-shared/models/
sudo git clone https://huggingface.co/Unbabel/TowerInstruct-7B-v0.2
  2. Remove the leftover git metadata:
sudo rm -rf TowerInstruct-7B-v0.2/.git
  3. Fix permissions so everyone can read and execute:
sudo chmod -R 775 TowerInstruct-7B-v0.2
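
After step 3, an admin may want to confirm that nothing in the new directory is still unreadable to other users. The following is a minimal sketch and not part of the official procedure: the function name `find_unreadable` is hypothetical, and it only inspects the "other" permission bits (it does not consider ACLs or group membership).

```python
import stat
from pathlib import Path


def find_unreadable(root):
    """Return paths under `root` that other users cannot read
    (or, for directories, cannot enter)."""
    root = Path(root)
    problems = []
    for path in [root, *root.rglob("*")]:
        mode = path.lstat().st_mode
        if stat.S_ISLNK(mode):
            continue  # permissions of the link itself are not meaningful
        if not mode & stat.S_IROTH:
            # "other" users lack read permission
            problems.append(str(path))
        elif stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
            # directory is readable but cannot be traversed by "other" users
            problems.append(str(path))
    return sorted(problems)
```

Calling `find_unreadable("/mnt/data-shared/models/TowerInstruct-7B-v0.2")` should return an empty list once the permissions are fixed.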

Using a shared model

If you use Hugging Face transformers or vLLM, you can pass the path to the model's directory in the shared space instead of the usual model ID. For example:

  • Before, with a model ID, you would do this:
from transformers import AutoTokenizer, AutoModelForCausalLM
 
tokenizer = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
model = AutoModelForCausalLM.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
  • Now, with the shared space, you should do this:
from transformers import AutoTokenizer, AutoModelForCausalLM
 
tokenizer = AutoTokenizer.from_pretrained("/mnt/data-shared/models/TowerInstruct-7B-v0.2")
model = AutoModelForCausalLM.from_pretrained("/mnt/data-shared/models/TowerInstruct-7B-v0.2")
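
If a script has to run both on this cluster and elsewhere, one option is to prefer the shared copy when it exists and fall back to the model ID otherwise. This is a minimal sketch under that assumption: `resolve_model` and its `shared_root` parameter are hypothetical names, and it assumes shared models are stored under the last component of the model ID, as in the example above.

```python
from pathlib import Path

SHARED_MODELS = "/mnt/data-shared/models"


def resolve_model(model_id, shared_root=SHARED_MODELS):
    """Return the shared-space path for `model_id` if it exists,
    otherwise return `model_id` unchanged."""
    # "Unbabel/TowerInstruct-7B-v0.2" -> "TowerInstruct-7B-v0.2"
    name = model_id.rstrip("/").split("/")[-1]
    candidate = Path(shared_root) / name
    return str(candidate) if candidate.is_dir() else model_id
```

The result can be passed wherever a model ID is accepted, e.g. `AutoTokenizer.from_pretrained(resolve_model("Unbabel/TowerInstruct-7B-v0.2"))`.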

Huggingface cache

We advise you not to modify your Hugging Face cache. Instead, when using a shared resource from Hugging Face, refer to it by its absolute path.

Adding and using a shared dataset

Follow the same steps as for models, but place the dataset inside /mnt/data-shared/datasets/ and load it from there using its absolute path.

Contributing to Shared Directories

Users are encouraged to contribute datasets and models that may benefit others. To contribute:

  1. Review Existing Resources: Check the shared directories to ensure the resource is not already available.
  2. Contribution Request: Submit a detailed request to an admin, including the resource’s description, size, and potential impact on projects.
  3. Approval and Addition: The admin will review the request and, upon approval, add the resource to the shared directory or provide instructions for doing so.
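
The first step above (checking whether a resource already exists) can be sketched as a small search over the two shared directories. This is only an illustration: `find_existing` is a hypothetical helper, and it matches on top-level directory names only, so a resource stored under a different name would not be found.

```python
from pathlib import Path

SHARED_DIRS = ("/mnt/data-shared/models", "/mnt/data-shared/datasets")


def find_existing(name, shared_dirs=SHARED_DIRS):
    """Return top-level entries in the shared directories whose
    name contains `name` (case-insensitive)."""
    needle = name.lower()
    matches = []
    for directory in shared_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # directory not present on this machine
        for entry in root.iterdir():
            if needle in entry.name.lower():
                matches.append(str(entry))
    return sorted(matches)
```

For example, `find_existing("towerinstruct")` lists any matching entries; an empty list suggests the resource has not been added yet.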

Maintenance and Updates

The shared directories are regularly reviewed and updated to ensure they remain organized, relevant, and useful to all users. Users are encouraged to report any issues, outdated resources, or suggestions for new resources to the admins.

Best Practices

  • Regularly check the shared directories for new or updated resources.
  • Use absolute paths to the shared space instead of model IDs when loading shared resources with Hugging Face libraries.
  • Communicate with the admins if you encounter issues or have suggestions for improvement.