Problem with Urdu in Kimi K2

Dear Kimi Team,

I hope this message finds you well. I am Salman from Pakistan. I recently tested your latest AI, Kimi K2, for OCR on images containing Urdu text. Unfortunately, the results showed 0% accuracy in recognizing the Urdu picture; it is as if the model does not know the script at all. The same picture is attached to this email so you can test it yourself.

You can view the test results here: Kimi Urdu OCR Test Result

Kimi AI - Kimi K2 Thinking is here

In comparison, I tested the same image using Gemini 2.5 Pro, which successfully and accurately recognized all the Urdu text.

Gemini 2.5 Pro result for the same picture:

https://gemini.google.com/share/c99645acfddc

To support the improvement of Urdu language recognition in your AI, I am willing to share a comprehensive collection of Urdu datasets (approx. 40 GB). Your Kimi K2 team can use it for training or fine-tuning purposes.

Dataset folder download link:

urdu datasets shared from Docume******Heaven - TeraBox

Urdu is spoken by over 100 million people globally, and adding robust support for it would be a significant step forward for your platform’s accessibility and global relevance.

Best regards,

Salman


Dear Salman,

Thank you for reaching out and sharing your experience with the Kimi K2 OCR. I understand that you encountered issues with the recognition of Urdu text in images, and I appreciate your detailed feedback.

  1. Please note that the Kimi K2 model, as stated in the Moonshot AI Open Platform documentation, “Does not support vision functionality”. This is likely the reason for the 0% accuracy you experienced with Urdu text recognition.
  2. You may want to try switching to the Kimi K1.5 model, which has vision capabilities and might provide better results for your OCR needs.
  3. Thank you for your offer to share a comprehensive Urdu dataset. We are indeed working very hard on bringing image input capabilities to K2, and your dataset could be very valuable for improving our model’s performance in recognizing Urdu text.

Bests,
Yu

Dear yuikns,

I have already tested that as well; it is worse than Kimi K2.

It’s clear that no Urdu datasets were used to train your models.

Most open-source models from China (including yours) fail to understand or generate proper Urdu simply because Urdu is underrepresented in their training data, unlike ChatGPT, Gemini, or Claude.

Urdu is spoken by millions of people in Pakistan!

I strongly urge you to include my Urdu datasets (40 GB, big and small, covering handwritten, digital, and audio data) in future model training. Thanks.

Kimi K1.5 result

Kimi - An AI assistant that can reason, analyze, and think deeply

Thank you for testing K1.5 and for the additional context; we truly appreciate it.

Bro yuikns, did you forward my dataset link to your AI training team so they can download and analyze it?

Dear Msalman,

I had already escalated the ticket internally yesterday.

If it’s convenient for you, sharing the datasets via Google Drive, a public torrent, or another direct link would let the team retrieve and analyze them more smoothly.

Bests,
Yu

TeraBox is free to use!
Just follow these steps:

  1. Install the TeraBox app.
  2. Create a free account.
  3. Open my link in the app.
  4. Start downloading the main folder.

You can easily download the main folder on your phone.

There is also an option to save the files to your own account.

I have copied the files to a temporary email account with a temporary password. You can download them from there using my login, with either the desktop or the phone app.

jonedave3 @ gmail .com

Note: For security reasons, the password has been removed and hidden.


Dear yuikns,

Did you download all the Urdu datasets, and is everything all right? I have uploaded some new ones as well, including OCR pictures, so kindly download all of those Urdu datasets too.

I am available to test your model's Urdu performance before public release to verify how well it handles the Urdu language.

Also, if possible, add the open-source dots.ocr 3B model to your AI. I have tested it: it works on digital Urdu pictures but does not work on handwritten ones. Along with those datasets, it would also improve your AI.

Dear yuikns,

Did you manage to download my Urdu datasets?