Today I added a full OCR feature to VoxLogAI without writing a single line of code myself. Just AI prompts, a few clicks, and some glue.

Here's how it went down:


🧠 The Setup

VoxLogAI started as a tool for fast, AI-powered audio transcription (YouTube, MP3, etc.). I wanted to expand it to also extract text from PDFs and images, in other words to add OCR support.

Instead of opening an editor, I opened Google AI Studio and started chatting with Gemini (gemini-2.5-pro-exp-03-25).


📜 The Prompt

First, I set up a system prompt to give Gemini context:

You will act as a technical Product Owner and will help the user refining tickets. The original codebase will be passed under <codebase/>.

Then I passed in the code using a tool called Repomix, which flattens a repository into a single XML-structured file.
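
To give a feel for what "flattening" means here, below is a minimal Python sketch that walks a directory and wraps each file in a tag. It is not Repomix's implementation or its real output schema; the `flatten_repo` name, the `<file path="...">` layout, and the suffix filter are all assumptions made for illustration.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def flatten_repo(root: str, suffixes: tuple[str, ...] = (".py", ".md")) -> str:
    """Concatenate matching text files under `root` into one XML-like string.

    Hypothetical helper: the tag layout only resembles what a repo-packing
    tool produces; it is not Repomix's real schema.
    """
    root_path = Path(root)
    parts = ["<codebase>"]
    for path in sorted(root_path.rglob("*")):
        # Skip directories and files whose suffix we did not ask for.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Skip anything that lives under a .git directory.
        if ".git" in path.relative_to(root_path).parts:
            continue
        rel = path.relative_to(root_path).as_posix()
        body = path.read_text(encoding="utf-8", errors="replace")
        # File contents are left unescaped so the model sees the code verbatim.
        parts.append(f"<file path={quoteattr(rel)}>\n{body}\n</file>")
    parts.append("</codebase>")
    return "\n".join(parts)
```

Calling `flatten_repo(".")` from the project root would return one string that can be pasted into the chat as the codebase context.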

I also included working OCR scripts for reference:

  • gemini_image_ocr.py
  • gemini_pdf_ocr.py
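
The scripts themselves are not reproduced in this post. For readers who have not seen one, a single-file OCR call might look roughly like the sketch below. It assumes the `google-genai` Python SDK and a `GEMINI_API_KEY` environment variable; the model name, the prompt text, and the function names are placeholders, not what the original scripts contain.

```python
import mimetypes
import os
from pathlib import Path

# Placeholder instruction; the real scripts may word this differently.
OCR_PROMPT = "Extract all text from this document. Return plain text only."


def guess_mime_type(path: str) -> str:
    """Return a MIME type for the file, e.g. image/png or application/pdf."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {path}")
    return mime


def ocr_file(path: str, model: str = "gemini-2.0-flash") -> str:
    """Send one image or PDF to Gemini and return the extracted text."""
    # Imported lazily so guess_mime_type works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    # Wrap the raw file bytes so they can be sent inline with the prompt.
    part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_mime_type(path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[part, OCR_PROMPT],
    )
    return response.text or ""
```

The same function covers both images and PDFs in this sketch because the only thing that changes between them is the MIME type of the uploaded bytes.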

My prompt:

Please look at gemini_image_ocr.py and gemini_pdf_ocr.py

These are scripts that send an image/pdf to Gemini to perform OCR.

Now, I’d like to incorporate that into my app. But how, UI-wise? At the moment it's only for audio transcription, so I guess we’d need to add a new OCR tab?

What do you think? Brainstorm with me before writing any code.
<example_code>
[the contents of the python scripts]
</example_code>
[Screenshot: initial Gemini prompt for designing the OCR feature in VoxLogAI]
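
For anyone who would rather script this step than paste into a chat window, the prompt layout above can be assembled along these lines. This is a sketch under assumptions: `build_user_message` and its parameters are hypothetical names, and the exact ordering of the parts is a guess based on the prompt shown.

```python
SYSTEM_PROMPT = (
    "You will act as a technical Product Owner and will help the user "
    "refining tickets. The original codebase will be passed under <codebase/>."
)


def build_user_message(
    packed_repo: str,
    question: str,
    examples: dict[str, str],
) -> str:
    """Assemble the packed codebase, the question, and reference scripts."""
    # Label each reference script with its filename so the model can cite it.
    example_blocks = "\n\n".join(
        f"# {name}\n{source}" for name, source in examples.items()
    )
    return (
        f"<codebase>\n{packed_repo}\n</codebase>\n\n"
        f"{question}\n"
        f"<example_code>\n{example_blocks}\n</example_code>"
    )
```

Keeping the system prompt separate from the user message mirrors how AI Studio splits system instructions from the chat turn.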


💡 Gemini’s Design Help

Gemini responded with a few UI/UX options. It initially leaned toward Option 2 (embedding OCR in the existing flow), but after I asked about mobile usage it switched to Option 3: a dedicated OCR tab.

[Screenshot: Gemini’s initial UI suggestions for integrating OCR in VoxLogAI]
[Screenshot: Gemini reconsidering the UI design after mobile usage feedback]
[Screenshot: Gemini’s final suggestion, Option 3: a dedicated OCR tab]

Then I asked it:

“then let's refine ticket for option 3 please”

You can view the full Gemini-generated ticket (Markdown).


🛠️ Claude Does the Code

With the ticket in hand, I switched to Claude.

I asked it to read the ticket and implement the feature. It handled it end-to-end. The result worked out of the box, with just one minor bug:

[Screenshot: Claude implementing the OCR feature based on Gemini’s ticket]

I reported the bug (a minor UI issue) to Claude, and it fixed it:

[Screenshot: Claude fixing a minor UI bug after feedback during implementation]

Then I asked Claude to refactor app.py, which had become bloated, and it did:

"perfect! now, a bit of refactor is needed IMO: we have the transcriber logic in transcriber.py, which i like. shouldn't we have a ocr.py (or wtv name you think is best) and include the logic there too? app.py seems to be doing a lot now. what do you think? i'm open to be challenged"

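The post does not show the resulting module, so the following is only a hedged sketch of the kind of split being asked for: a hypothetical `ocr.py` that owns file-type routing, leaving `app.py` to call a single function. Every name here (`extract_text`, the injected `image_ocr` and `pdf_ocr` callables, the suffix sets) is invented for illustration.

```python
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}
PDF_SUFFIXES = {".pdf"}


def extract_text(
    path: str,
    image_ocr: Callable[[str], str],
    pdf_ocr: Callable[[str], str],
) -> str:
    """Route a file to the right OCR backend based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return image_ocr(path)
    if suffix in PDF_SUFFIXES:
        return pdf_ocr(path)
    # Fail loudly so the UI layer can show a clear error message.
    raise ValueError(f"Unsupported file type: {suffix or path}")
```

Passing the backends in as callables keeps the routing logic testable without any API calls, which is one reasonable way to keep the UI file thin.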

🧼 Gemini Does the PR Review

After Claude was done, I staged the changes. Claude can occasionally overwrite previous changes if the context gets messy — committing often helps preserve state.

Then I saved the staged diff to a file so I could hand it to Gemini for a code review:

git diff --staged > /tmp/git_diff
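
The same step can be scripted. The sketch below captures the staged diff with `subprocess` and wraps it in a review request; the function names and the `<git_diff>` wrapper tag are assumptions for illustration, not part of the workflow described here.

```python
import subprocess


def staged_diff(repo_dir: str = ".") -> str:
    """Return the output of `git diff --staged` for the given repository."""
    result = subprocess.run(
        ["git", "diff", "--staged"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # Raise CalledProcessError if git exits with an error.
    )
    return result.stdout


def build_review_prompt(diff_text: str) -> str:
    """Wrap a diff in a review request; the <git_diff> tag is an assumption."""
    if not diff_text.strip():
        raise ValueError("Nothing staged: the diff is empty")
    return (
        "Please put your senior developer hat on and review this git diff. "
        "Focus on code quality, tech debt, and security vulnerabilities.\n"
        f"<git_diff>\n{diff_text}\n</git_diff>"
    )
```

Running `build_review_prompt(staged_diff())` inside the repository would produce text ready to paste into the review chat.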

Then I went back to Gemini, passed it the git diff and said:

"please now put your Senior developer hat on and review the git diff that implements this feature. focus on code quality, tech debt, security vulnerabilities, etc"

Gemini came back with a bunch of solid code-level suggestions; see Gemini's full code review for the OCR feature (Markdown).


🔁 Claude Finalizes

I passed Gemini’s suggestions back to Claude (literally copy/paste).

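Then I ran a plain git diff, which shows only the changes that are not yet staged (here, Claude's latest fixes):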

git diff > /tmp/git_diff_latest

I sent that to Gemini for a final pass, and it approved.


✅ Shipped.

Changes committed. Feature done. You can see the PR here: feat: Add Image/PDF OCR and Refactor UI Logic

  • 🧾 Cost: $2.97
  • ⏱️ Time: < 1 hour
  • 👨‍💻 Code written manually: 0 lines

🧠 Lessons & Workflow Tips

  • Claude is fantastic for coding — especially in small, clean codebases.
  • Gemini is killer at reasoning, product design, and high-level planning.
  • The combo? 🔥. Design with Gemini, code with Claude.
  • I always pass code as XML because AI models seem to understand structure better that way.
  • With Claude, commit often — it might overwrite things if the context window overflows.

⚡ Verdict

This was a real productivity boost.

Not because AI replaced the thinking — but because it helped me move fast through the grind.

And, more importantly, it was fun!