Hugging Face for Audio: Resources ✨
Audio Transformers course: https://huggingface.co/learn/audio-course/chapter0/introduction#course-structure. It covers the standard audio tasks (ASR, TTS, audio classification), with notes on both using pre-trained models and fine-tuning them. See also Unit 7 for a speaker diarization application.
Using pre-trained models
- With pipelines: https://www.reddit.com/r/MachineLearning/comments/16xshji/d_the_most_complete_audio_ml_toolkit/ (see the pipeline sketch after this list)
- Transformers docs: https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer
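For the pipeline route, a minimal sketch is below. The checkpoint (openai/whisper-small) and the local sample.wav file are assumptions; any ASR model on the Hub works the same way.

```python
# Minimal ASR sketch with the 🤗 Transformers pipeline.
from transformers import pipeline

# The pipeline also covers other audio tasks, e.g. "audio-classification".
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Accepts a local file path, a URL, or a raw numpy array of audio samples.
result = asr("sample.wav")
print(result["text"])
```

Swapping the task string and checkpoint is usually all that's needed to move between audio tasks.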
Training
- Datasets https://huggingface.co/blog/audio-datasets
- Fine-tune Whisper for ASR https://huggingface.co/blog/fine-tune-whisper (see the data-prep sketch after this list)
- Distil-Whisper training code for ASR https://github.com/huggingface/distil-whisper/tree/main/training
- Fine-tune VITS for TTS https://twitter.com/yoachlacombe/status/1735348885369889264
- Fine-tune Wav2Vec2 for audio classification https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification
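As a companion to the datasets guide and the Whisper fine-tuning post above, here is a hedged data-prep sketch. It assumes the Common Voice 11 Hindi split and the column names used in that post (audio, sentence), and that you have authenticated to access the gated dataset; swap in your own dataset, language, and columns as needed.

```python
# Preparing an audio dataset for Whisper fine-tuning (sketch, assumptions as above).
from datasets import load_dataset, Audio
from transformers import WhisperProcessor

dataset = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="hindi", task="transcribe"
)

# Whisper expects 16 kHz audio; resampling happens lazily when rows are accessed.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    # Log-mel spectrogram features for the encoder...
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # ...and tokenised transcriptions as decoder labels.
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare, remove_columns=dataset.column_names)
```

From there, the blog post feeds the processed dataset into Seq2SeqTrainer with a padding data collator and a WER metric.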
Optimisation
- Whisper JAX for ASR https://github.com/sanchit-gandhi/whisper-jax
- Distil-Whisper for ASR https://github.com/huggingface/distil-whisper/tree/main
- Insanely Fast Whisper for ASR https://github.com/Vaibhavs10/insanely-fast-whisper
- Speculative decoding with Whisper for ASR https://huggingface.co/blog/whisper-speculative-decoding (see the sketch after this list)
- Bark for TTS https://huggingface.co/blog/optimizing-bark
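The speculative-decoding entry pairs naturally with Distil-Whisper: a small draft model proposes tokens and the full Whisper model verifies them, so you get the same outputs faster. Below is a hedged sketch; the checkpoint names (openai/whisper-large-v2 and distil-whisper/distil-large-v2) and the sample.wav input are assumptions, and the blog post has the full recipe with further optimisations.

```python
# Speculative decoding sketch: Distil-Whisper drafts, Whisper verifies.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

# The assistant (draft) model must share the main model's tokenizer.
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2"
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
    # assistant_model is forwarded to model.generate() for assisted generation.
    generate_kwargs={"assistant_model": assistant},
)
print(pipe("sample.wav")["text"])
```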
Deployment
- Inference Endpoints: run MusicGen as an API https://huggingface.co/blog/run-musicgen-as-an-api
- Gradio client https://www.gradio.app/docs/client (e.g. for Whisper https://huggingface.co/spaces/hf-audio/whisper-large-v3)
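For the Gradio client route, a minimal sketch of calling the hosted Whisper Space from Python is below. The endpoint name and argument layout are assumptions about that Space's API; run client.view_api() to see the exact signature it exposes (handle_file is available in recent gradio_client releases).

```python
# Calling a hosted Space with the Gradio Python client (pip install gradio_client).
from gradio_client import Client, handle_file

client = Client("hf-audio/whisper-large-v3")
client.view_api()  # prints the available endpoints and their parameters

result = client.predict(
    handle_file("sample.wav"),  # local path or URL to the audio file
    "transcribe",               # task (assumed): "transcribe" or "translate"
    api_name="/predict",        # endpoint name (assumed) — check view_api()
)
print(result)
```

The same client works against any public Space, which makes it an easy way to prototype with models you haven't deployed yourself.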