Shared space · deep-spin/wiki Wiki
Table of Contents
- Introduction
- Adding a Shared Model (for admins)
- Using a Shared Model
- Adding and Using a Shared Dataset
- Contributing to Shared Directories
- Maintenance and Updates
- Best Practices
Introduction
The shared space is located in /mnt/data-shared (physically in /mnt/scratch-artemis/shared), containing:
- Datasets: /mnt/data-shared/datasets/ - A central repository for shared datasets
- Models: /mnt/data-shared/models/ - Contains machine learning models (e.g., from Hugging Face)
Note that all users have read access to the shared directories, but write access is restricted to admins.
Adding a shared model (for admins)
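- Download the model into /mnt/data-shared/models/. For instance, for a model available on the Hugging Face Hub (git-lfs must be installed, otherwise the clone contains only pointer files instead of the weights):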
cd /mnt/data-shared/models/
sudo git clone https://huggingface.co/Unbabel/TowerInstruct-7B-v0.2
- Remove the leftover git metadata:
sudo rm -rf TowerInstruct-7B-v0.2/.git
- Fix permissions so everyone can read and execute:
sudo chmod -R 775 TowerInstruct-7B-v0.2
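After the steps above, it can be useful to confirm that the permissions really allow every user to read the model. The following is a minimal sketch using only the Python standard library; the function name `find_inaccessible` is hypothetical and not part of any existing tooling on the cluster. It reports files that other users cannot read and directories they cannot read or enter.

```python
import stat
from pathlib import Path


def find_inaccessible(root):
    """Return paths under `root` that other users cannot read (or, for directories, enter)."""
    root = Path(root)
    problems = []
    # Check the root directory itself and everything below it.
    for path in [root, *root.rglob("*")]:
        mode = path.stat().st_mode
        if not mode & stat.S_IROTH:
            # The "others" read bit is missing.
            problems.append(path)
        elif path.is_dir() and not mode & stat.S_IXOTH:
            # Directories also need the "others" execute bit to be traversed.
            problems.append(path)
    return problems
```

Calling `find_inaccessible("/mnt/data-shared/models/TowerInstruct-7B-v0.2")` after the `chmod` step should return an empty list. Note that this sketch follows symlinks and would raise an error on a broken one.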
Using a shared model
If you use Hugging Face transformers or vLLM, you can simply pass the path to the model's directory in the shared space instead of the usual model ID. For example:
- Before, with MODEL ID, you would do this:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
model = AutoModelForCausalLM.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
- Now, with the shared space, you should do this:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("/mnt/data-shared/models/TowerInstruct-7B-v0.2")
model = AutoModelForCausalLM.from_pretrained("/mnt/data-shared/models/TowerInstruct-7B-v0.2")
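If the same script has to run both on machines with the shared space and elsewhere, the switch between path and model ID can be wrapped in a small helper. This is only a sketch: `resolve_model` and `SHARED_MODELS` are hypothetical names, and it assumes each shared model lives in a directory named after the last component of its model ID, which is what the `git clone` step produces.

```python
from pathlib import Path

# Shared model root described on this page.
SHARED_MODELS = Path("/mnt/data-shared/models")


def resolve_model(model_id, shared_root=SHARED_MODELS):
    """Return the shared-space path for `model_id` if present, else `model_id` unchanged."""
    # "Unbabel/TowerInstruct-7B-v0.2" -> "TowerInstruct-7B-v0.2"
    name = model_id.rstrip("/").split("/")[-1]
    candidate = Path(shared_root) / name
    # Only use the shared copy when the directory actually exists.
    return str(candidate) if candidate.is_dir() else model_id
```

The returned string can be passed to `from_pretrained` exactly as in the examples above. When the model is not in the shared space, the helper returns the original model ID, so the library falls back to its usual download behaviour.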
Hugging Face cache
We advise against modifying your Hugging Face cache. Instead, when using a shared Hugging Face resource, refer to it by its absolute path.
Adding and using a shared dataset
Follow the same steps as for models, but place the dataset inside /mnt/data-shared/datasets/.
Contributing to Shared Directories
Users are encouraged to contribute datasets and models that may benefit others. To contribute:
- Review Existing Resources: Check the shared directories to ensure the resource is not already available.
- Contribution Request: Submit a detailed request to an admin, including the resource’s description, size, and potential impact on projects.
- Approval and Addition: The admin will review the request and, upon approval, add the resource to the shared directory or provide instructions for doing so.
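The first two steps above can be partly scripted. The sketch below uses only the standard library and hypothetical helper names (`shared_resource_exists`, `directory_size_bytes`); it assumes resources are stored as top-level directories under the shared roots, as in the model example earlier on this page. One function checks whether a resource is already present, the other measures the size of a local copy to include in the request.

```python
import os
from pathlib import Path


def shared_resource_exists(name, root="/mnt/data-shared/models"):
    """Case-insensitive check for a directory called `name` directly under `root`."""
    root = Path(root)
    if not root.is_dir():
        return False
    return any(p.is_dir() and p.name.lower() == name.lower() for p in root.iterdir())


def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path` (symlinked files are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

For a dataset, pass `root="/mnt/data-shared/datasets"` to the first function. A name match is only a first hint: the same resource may be stored under a different directory name, so a quick manual look at the shared directory is still worthwhile.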
Maintenance and Updates
The shared directories are regularly reviewed and updated to ensure they remain organized, relevant, and useful to all users. Users are encouraged to report any issues, outdated resources, or suggestions for new resources to the admins.
Best Practices
- Regularly check the shared directories for new or updated resources.
- Use absolute paths to the shared space instead of model IDs when you use Hugging Face libraries.
- Communicate with the admins if you encounter issues or have suggestions for improvement.