A Python application that transcribes speech to text using OpenAI's Whisper API or Google's Gemini API, activated by a custom keypress. This macOS-focused tool streamlines the transcription workflow by automatically copying the result to your clipboard.
- 🎹 Activate recording with a customizable key combination
- 🎤 Record audio directly from your microphone
- 🔄 Transcribe speech using OpenAI's Whisper API or Google's Gemini API
- 📋 Automatically paste transcribed text into the active text field
- 🔔 macOS native notifications for operation status
- 🧪 Comprehensive test suite
- macOS (currently not supported on other platforms)
- Python 3.8+
- OpenAI API key
- Google Gemini API key
- Microphone
- PortAudio library (required for PyAudio)
# Install from source
git clone https://github.com/shaabhishek/whisper-transcribe.git
cd whisper-transcribe
pip install -e .
# Clone the repository
git clone https://github.com/shaabhishek/whisper-transcribe.git
cd whisper-transcribe
# Install PortAudio (required for PyAudio)
brew install portaudio
# Install dependencies using uv
uv sync
source .venv/bin/activate
You can set up API keys in several ways. The application supports both OpenAI and Google Gemini APIs for transcription.
- Using a .env file (recommended):
Copy the example environment file and add your API key:
cp .env.example .env
Then edit the .env file and add your preferred API key:
# Choose either OpenAI or Gemini
OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
# Select which service to use (options: "openai" or "gemini")
TRANSCRIPTION_SERVICE=openai
- Using environment variables:
Set up your API keys as environment variables:
# For OpenAI
export OPENAI_API_KEY="your-openai-api-key"
export TRANSCRIPTION_SERVICE="openai"
# OR for Gemini
export GEMINI_API_KEY="your-gemini-api-key"
export TRANSCRIPTION_SERVICE="gemini"
- Using the provided script:
# For OpenAI
./set_api_key.sh openai your-openai-api-key
# OR for Gemini
./set_api_key.sh gemini your-gemini-api-key
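Whichever method you choose, the application ends up reading the same three variables. As a rough illustration of how they fit together, here is a minimal sketch; `resolve_service` and `KEY_VARIABLES` are hypothetical names used for illustration and are not taken from the project's source:

```python
import os
from typing import Mapping, Tuple

# Environment variable holding the API key for each service.
# The variable names come from the setup instructions above.
KEY_VARIABLES = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def resolve_service(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (service, api_key), or raise ValueError if the setup is incomplete.

    Hypothetical helper: the real application may resolve these differently.
    """
    # Fall back to "openai", the documented default for TRANSCRIPTION_SERVICE.
    service = env.get("TRANSCRIPTION_SERVICE", "openai").strip().lower()
    if service not in KEY_VARIABLES:
        raise ValueError(f"Unknown TRANSCRIPTION_SERVICE: {service!r}")

    key_name = KEY_VARIABLES[service]
    api_key = env.get(key_name, "").strip()
    if not api_key:
        raise ValueError(f"{key_name} is not set for service {service!r}")
    return service, api_key
```

Calling `resolve_service()` with no arguments reads the current process environment, which is a quick way to confirm that the key for the selected service is actually visible to Python.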
- Run the application:
speech-transcriber
- Or specify a transcription service:
# Use OpenAI Whisper API
speech-transcriber --service openai
# Use Google Gemini API
speech-transcriber --service gemini
# View all available options
speech-transcriber --help
- Double-press either the left or right Alt key to start recording
- Speak clearly into your microphone
- Double-press either Alt key again to stop recording and start transcription
- The transcribed text will be automatically pasted into the active text field
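The double-press trigger described above can be sketched as a small timing check. This is only an illustration of the idea, not the project's actual implementation: `DoublePressDetector` is a hypothetical class, and in the real tool the `press()` call would presumably be driven by a pynput keyboard listener whenever an Alt key goes down.

```python
import time
from typing import Callable, Optional

DOUBLE_PRESS_INTERVAL = 0.5  # seconds; the documented default


class DoublePressDetector:
    """Report when two presses arrive within `interval` seconds of each other."""

    def __init__(
        self,
        interval: float = DOUBLE_PRESS_INTERVAL,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval = interval
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._last_press: Optional[float] = None

    def press(self) -> bool:
        """Register one key press; return True if it completes a double-press."""
        now = self.clock()
        is_double = (
            self._last_press is not None
            and (now - self._last_press) <= self.interval
        )
        # After a double-press, reset so a third quick press starts a new pair.
        self._last_press = None if is_double else now
        return is_double
```

A caller would toggle recording each time `press()` returns `True`: the first double-press starts recording and the next one stops it and begins transcription.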
This application requires accessibility permissions to monitor keyboard input. When you first run the application, you may need to:
- Open System Preferences/Settings
- Go to Security & Privacy (or Privacy & Security in newer versions)
- Select the Privacy tab
- Click on Accessibility in the left sidebar
- Click the lock icon at the bottom and enter your password to make changes
- Add Terminal (or your Python IDE) to the list of allowed applications
You can modify the following settings in the config.py file:
Setting | Description | Default |
---|---|---|
DOUBLE_PRESS_INTERVAL | Maximum time between Alt key presses to detect as double-press (seconds) | 0.5 |
TRANSCRIPTION_SERVICE | Which API to use for transcription | openai |
WHISPER_MODEL | OpenAI Whisper model to use | whisper-1 |
GEMINI_MODEL | Google Gemini model to use | gemini-pro-vision |
LANGUAGE | Language code for transcription | en |
MAX_RECORDING_TIME | Maximum recording time in seconds | 120 |
The following audio configuration options can be modified in config.py to adjust recording quality:
Setting | Description | Default | Notes |
---|---|---|---|
SAMPLE_RATE | Audio sampling rate in Hz | 16000 | Matched to Whisper's training data¹. Higher values (e.g., 44100, 48000) can provide more audio detail but increase file size. |
CHANNELS | Number of audio channels | 1 (Mono) | Mono is recommended for speech recognition². |
CHUNK_SIZE | Frames per buffer | 1024 | Lower values reduce latency but may cause performance issues. Typical values: 512, 1024, 2048³. |
FORMAT | Audio format | wav | WAV format provides lossless quality for transcription. |
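To make the trade-offs in this table concrete, the following sketch works through the arithmetic for the default settings. It assumes 16-bit PCM samples (2 bytes each), which the table does not state, and the helper names are illustrative only:

```python
import math

# Defaults from the configuration tables.
SAMPLE_RATE = 16000        # Hz
CHANNELS = 1               # mono
CHUNK_SIZE = 1024          # frames per buffer
MAX_RECORDING_TIME = 120   # seconds

BYTES_PER_SAMPLE = 2       # assumption: 16-bit PCM, not stated in the README


def chunk_duration_ms(chunk_size: int = CHUNK_SIZE, sample_rate: int = SAMPLE_RATE) -> float:
    """Audio covered by one buffer, i.e. the minimum buffering latency."""
    return 1000.0 * chunk_size / sample_rate


def chunks_for(seconds: float, chunk_size: int = CHUNK_SIZE, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of buffer reads needed to capture `seconds` of audio."""
    return math.ceil(seconds * sample_rate / chunk_size)


def pcm_bytes(seconds: int, sample_rate: int = SAMPLE_RATE, channels: int = CHANNELS) -> int:
    """Uncompressed audio payload size, excluding the small WAV header."""
    return seconds * sample_rate * channels * BYTES_PER_SAMPLE


print(chunk_duration_ms())                               # 64.0 ms per buffer
print(chunks_for(MAX_RECORDING_TIME))                    # 1875 reads for 120 s
print(pcm_bytes(MAX_RECORDING_TIME))                     # 3840000 bytes (about 3.8 MB)
print(pcm_bytes(MAX_RECORDING_TIME, sample_rate=44100))  # 10584000 bytes at 44.1 kHz
```

Under these assumptions, a full-length recording at 44100 Hz is roughly 2.8 times larger than at the 16000 Hz default, which is the file-size cost the Notes column refers to.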
For the best transcription results, consider these audio optimization tips:
- Sample Rate Considerations:
  - The default is 16000 Hz (Whisper's optimal rate)¹
  - Higher sample rates (e.g., 44100 Hz, CD quality) provide more detail but increase file size and processing time
  - Whisper models were trained on 16000 Hz audio, so this rate is optimal for accuracy
- Background Noise Reduction⁸:
  - Record in a quiet environment when possible
  - Position the microphone closer to the speaker
  - Consider using a directional microphone for noisy environments
- Speech Clarity⁹:
  - Speak at a moderate pace with clear articulation
  - Avoid overlapping speech when possible
  - Maintain consistent volume throughout recording
- Hardware Recommendations¹⁰:
  - External microphones typically provide better quality than built-in laptop/device microphones
  - USB condenser microphones are good affordable options for clear speech capture
  - Headset microphones can help maintain consistent distance from the sound source
This application supports two transcription services:
OpenAI's Whisper API offers several configuration options that affect transcription quality and behavior:
Setting | Description | Default | Available Options |
---|---|---|---|
WHISPER_MODEL | Whisper model to use | whisper-1 | • whisper-1: the standard API model • Larger open-source Whisper checkpoints such as large-v3 also exist; check OpenAI's documentation for whether they are available through the API⁴ |
LANGUAGE | Language code for transcription | en | Any ISO 639-1 language code (e.g., 'en', 'fr', 'de', 'es', 'ja'). Leave empty for auto-detection⁵. |
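As a rough sketch of how these two settings might reach the API, the example below builds the request arguments and omits `language` when it is empty so that auto-detection applies. It assumes the official `openai` Python package (version 1 or later); `build_transcription_kwargs` and `transcribe` are hypothetical helpers, and the project itself may call the API differently.

```python
from typing import Any, BinaryIO, Dict


def build_transcription_kwargs(
    audio_file: BinaryIO,
    model: str = "whisper-1",
    language: str = "en",
) -> Dict[str, Any]:
    """Assemble request arguments; an empty language is left out for auto-detection."""
    kwargs: Dict[str, Any] = {"model": model, "file": audio_file}
    if language:
        kwargs["language"] = language
    return kwargs


def transcribe(path: str, language: str = "en") -> str:
    """Send a recorded WAV file to the transcription endpoint and return the text."""
    # Imported lazily so the helper above works without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            **build_transcription_kwargs(audio_file, language=language)
        )
    return response.text
```

Keeping the argument-building step separate from the network call makes the language handling easy to check without an API key.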
Google's Gemini API provides an alternative for transcription:
Setting | Description | Default | Notes |
---|---|---|---|
GEMINI_MODEL | Gemini model to use | gemini-pro-vision | Used for processing audio content |
LANGUAGE | Language code for transcription | en | Any ISO 639-1 language code to specify the language in the transcription prompt |
Both APIs can transcribe speech, but they differ in focus:
- OpenAI Whisper: purpose-built for speech-to-text, with high accuracy
- Google Gemini: a general-purpose multimodal model that can also handle audio transcription
To select which API to use, set the TRANSCRIPTION_SERVICE value in your .env file or environment variables to either "openai" or "gemini".
The application includes a comprehensive test suite that covers all core components:
# Run all tests
./run_tests.py
# Run a specific test module
python -m unittest tests.test_clipboard
# Run a specific test case
python -m unittest tests.test_clipboard.TestClipboard.test_copy_to_clipboard_success
The tests use mocking to avoid actual hardware access (microphone) and API calls, making them suitable for CI/CD environments.
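The mocking approach can be illustrated with a small self-contained example. The `copy_to_clipboard` function below is a hypothetical stand-in that pipes text to the macOS `pbcopy` command; the project's real clipboard module and tests may look different.

```python
import subprocess
import unittest
from unittest.mock import patch


def copy_to_clipboard(text: str) -> bool:
    """Hypothetical helper: pipe text to macOS `pbcopy`; return True on success."""
    try:
        subprocess.run(["pbcopy"], input=text.encode("utf-8"), check=True)
        return True
    except (OSError, subprocess.CalledProcessError):
        return False


class TestClipboard(unittest.TestCase):
    @patch("subprocess.run")
    def test_copy_to_clipboard_success(self, mock_run):
        # subprocess.run is replaced by a mock, so pbcopy is never executed.
        self.assertTrue(copy_to_clipboard("hello"))
        mock_run.assert_called_once_with(["pbcopy"], input=b"hello", check=True)

    @patch("subprocess.run", side_effect=OSError("pbcopy not found"))
    def test_copy_to_clipboard_failure(self, mock_run):
        # A missing binary is reported as a failed copy instead of an exception.
        self.assertFalse(copy_to_clipboard("hello"))
```

Because `subprocess.run` is patched, these tests pass on any platform, including CI runners that have no clipboard or microphone.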
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Clone your fork
git clone https://github.com/shaabhishek/whisper-transcribe.git
cd whisper-transcribe
# Install development dependencies
pip install -e ".[dev]"
# Run tests
./run_tests.py
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for the Whisper API
- Google for the Gemini API
- PyAudio for audio recording capabilities
- pynput for keyboard monitoring
- Cross-platform support for Windows and Linux
- GUI interface
- Configurable settings via command-line arguments
- Support for additional transcription services
- Custom language model fine-tuning
1. OpenAI Whisper GitHub: Audio Preprocessing - The official Whisper implementation uses 16000 Hz for audio processing.
2. PyAudio Documentation: Channel Configuration - PyAudio stream configuration for audio channels.
3. PyAudio Documentation: Chunk Size Parameters - PyAudio documentation for frame buffer sizes.
4. OpenAI Whisper GitHub: Model Card - Official documentation of Whisper models and their parameters.
5. OpenAI API Documentation: Speech to Text - Official OpenAI API documentation for Whisper transcription.
6. OpenAI Research: Robust Speech Recognition via Large-Scale Weak Supervision - Research paper describing Whisper's development and audio processing.
7. OpenAI GitHub: Whisper Performance and Limitations - Official notes on language-specific performance.
8. Microsoft Research: Automatic Speech Recognition - Best Practices - Research on ASR performance in varying noise conditions.
9. Google Cloud Documentation: Speech-to-Text Best Practices - Recommendations for speech recognition clarity.
10. Audio Engineering Society: Microphone Selection Guide - Professional recommendations for speech recording equipment.