User Story: Integrate Image and PDF OCR Functionality
As a VoxLogAI user, I want to be able to select a processing mode (either “Audio/Video Transcription” or “Document OCR”) and then choose the specific input method relevant to that mode (Upload File, YouTube URL, Image Upload, PDF Upload), So that I can easily extract text from audio recordings, YouTube videos, images, and PDF documents using a single, unified interface.
Acceptance Criteria:
1. Mode Selection UI: * A mode selection mechanism (e.g., visually distinct radio buttons or a segmented control) is present above the input tabs. * The modes are clearly labeled, for example: “Audio/Video Transcription” and “Document OCR”. * One mode is selected by default (recommend “Audio/Video Transcription”). * Selecting a mode visually indicates it’s active.
2. Conditional Tab Display: * When “Audio/Video Transcription” mode is selected, the tabs displayed are “Upload Audio” and “YouTube URL”. The content/functionality of these tabs remains as currently implemented. * When “Document OCR” mode is selected, the tabs displayed are “Upload Image” and “Upload PDF”. The “Upload Audio” and “YouTube URL” tabs are hidden. * Switching modes correctly updates the visible set of tabs.
3. Image Upload Tab:
* When the “Upload Image” tab is active (under “Document OCR” mode):
* An upload zone is displayed, specifically prompting for image files.
* The zone uses an appropriate icon (e.g., fa-file-image
).
* The file input accepts common image formats (image/jpeg
, image/png
, image/webp
, image/heic
, image/heif
).
* A reasonable file size limit (20MB) is enforced, with clear error messages if exceeded.
* Selecting a valid image file updates the UI to show the filename and enables the main action button.
4. PDF Upload Tab:
* When the “Upload PDF” tab is active (under “Document OCR” mode):
* An upload zone is displayed, specifically prompting for PDF files.
* The zone uses an appropriate icon (e.g., fa-file-pdf
).
* The file input accepts the PDF format (application/pdf
).
* A reasonable file size limit (20MB) is enforced, with clear error messages if exceeded.
* Selecting a valid PDF file updates the UI to show the filename and enables the main action button.
5. Conditional Options: * The “Include timestamps” checkbox/toggle is only visible when the “Audio/Video Transcription” mode is active. * The “Include timestamps” option is hidden when the “Document OCR” mode is active.
6. Contextual Action Button: * The main action button’s text changes based on the active mode: * “Audio/Video Transcription” mode: Button text is “Transcribe”. * “Document OCR” mode: Button text is “Extract Text” or “Perform OCR”. * The button’s enabled/disabled state correctly reflects whether valid input (file selected or valid URL entered) exists for the currently active tab within the selected mode.
7. Unified Result Display: * The header for the result area is renamed from “Transcript” to something more general like “Result” or “Extracted Text”. * Text extracted from images or PDFs via the Gemini API is displayed in this result area. * The “Copy to clipboard” and “Start new transcription” (perhaps rename to “Start New”) buttons function correctly for the OCR results.
8. Backend Implementation:
* New Flask endpoints (e.g., /ocr_image
, /ocr_pdf
) are created.
* These endpoints accept POST requests containing the image or PDF file data.
* Backend logic uses the Gemini API to perform OCR on the received image (e.g., using PIL) or PDF (e.g., using types.Part.from_bytes
). Please look at random_tests/gemini_image_ocr.py
and random_tests/gemini_pdf_ocr.py
to check for working examples.
* The backend handles potential errors during file processing or API interaction and returns informative JSON error messages to the frontend.
* File size limits are enforced on the backend as well.
* Temporary file handling for OCR is minimal (ideally process in memory or delete immediately after API call), avoiding the file_id
mapping used for audio.
9. State Management & Usability: * Switching between modes (“Audio/Video” <-> “Document OCR”) clears any selected file or entered URL from the previously active mode’s tabs and resets the result area/status messages. * The UI remains responsive and usable on both desktop and mobile screen sizes. The mode switcher + 2 tabs should fit comfortably on mobile.
10. Dependencies:
* Ensure necessary libraries (like Pillow
for image handling if needed server-side, though Gemini might handle it directly) are added to requirements.txt
.
11. Delete data: * Ensure that no data (image/pdf/audio/etc) stays in the server after operation is done. We do NOT want to keep any user data.