I hope this message finds you well. I am Salman from Pakistan. I recently tested your latest AI, Kimi K2, for OCR on images containing Urdu text, and unfortunately the results showed 0% accuracy in recognizing the Urdu in the picture; it is as if the model does not know Urdu at all. The same picture is attached to this email so you can test it yourself.
You can view the test results here: Kimi Urdu OCR Test Result
To support the improvement of Urdu language recognition in your AI, I’m willing to share a comprehensive Urdu dataset collection (approx. 40 GB). Your Kimi K2 team can use it for training or fine-tuning purposes.
Urdu is spoken by over 100 million people globally, and adding robust support for it would be a significant step forward for your platform’s accessibility and global relevance.
Thank you for reaching out and sharing your experience with the Kimi K2 OCR. I understand that you encountered issues with the recognition of Urdu text in images, and I appreciate your detailed feedback.
Please note that, as mentioned in the Moonshot AI Open Platform documentation, the Kimi K2 model “Does not support vision functionality”. This is likely the reason for the 0% accuracy you experienced with Urdu text recognition.
You may want to try switching to the Kimi K1.5 model, which has vision capabilities and might provide better results for your OCR needs.
Thank you for your offer to share a comprehensive Urdu dataset. We are indeed working very hard on bringing image input capabilities to K2, and your dataset could be very valuable for improving our model’s performance in recognizing Urdu text.
I have already tested that one as well; it is worse than Kimi K2.
It’s clear that no Urdu datasets were used to train your models.
Most open-source models from China (including yours) fail to understand or generate proper Urdu simply because Urdu is underrepresented in their training data, unlike ChatGPT, Gemini, or Claude.
Urdu is spoken by millions of people in Pakistan!
I strongly urge you to include my Urdu datasets (40 GB, large and small, covering handwritten, digital, and audio data) in future model training. Thanks.
If it’s convenient for you, sharing the datasets via Google Drive, a public torrent, or another direct link would let the team retrieve and analyze them more smoothly.
Did you download all the Urdu datasets, and is everything all right? I have also uploaded some new ones, including OCR pictures and so on; kindly download all of those Urdu datasets.
I am available to test your model's Urdu performance before public release, to verify how well it handles the language.
Also, if possible, add the dots.Ocr 3B open-source model to your AI. I have tested it: it works on digital Urdu pictures but does not work on handwritten ones. Together with those datasets, it would also improve your AI.