I hope this message finds you well. I am Salman from Pakistan. I recently tested your latest AI, Kimi K2, on OCR with images containing Urdu text, and unfortunately the results showed 0% accuracy in recognizing the Urdu picture; it is as if the model does not know the language at all. The same picture is attached to this email so that you can test it yourself.
You can view the test results here: Kimi Urdu OCR Test Result
To support the improvement of Urdu language recognition in your AI, I'm willing to share a comprehensive Urdu dataset collection (approx. 40 GB). Your Kimi K2 team can use it for training or fine-tuning purposes.
Urdu is spoken by over 100 million people globally, and adding robust support for it would be a significant step forward for your platform’s accessibility and global relevance.
Thank you for reaching out and sharing your experience with the Kimi K2 OCR. I understand that you encountered issues with the recognition of Urdu text in images, and I appreciate your detailed feedback.
Please note that, as mentioned in the Moonshot AI Open Platform documentation, the Kimi K2 model “Does not support vision functionality”. This is likely the reason for the 0% accuracy you experienced with Urdu text recognition.
You may want to try switching to the Kimi K1.5 model, which has vision capabilities and might provide better results for your OCR needs.
Thank you for your offer to share a comprehensive Urdu dataset. We are indeed working very hard on bringing image input capabilities to K2, and your dataset could be very valuable for improving our model’s performance in recognizing Urdu text.
I have already tested that as well; it is worse than Kimi K2.
It’s clear that no Urdu datasets were used to train your models.
Most open-source models from China (including yours) fail to understand or generate proper Urdu simply because Urdu is underrepresented in their training data, unlike ChatGPT, Gemini, or Claude.
Urdu is spoken by millions of people in Pakistan!
I strongly urge you to include my Urdu datasets (40 GB; big and small; handwritten, digital, and audio) in future model training. Thanks.
If it’s convenient for you, sharing the datasets via Google Drive, a public torrent, or another direct link would let the team retrieve and analyze them more smoothly.
Did you download all the Urdu datasets, and is everything in order? I have also uploaded some new ones, including OCR pictures. Kindly download all of those Urdu datasets.
I am available to test Urdu performance on your model before public release, to verify how it handles the Urdu language.
Also, if possible, add the open-source dots.Ocr 3b model to your AI. I have tested it: it works on digital Urdu pictures but does not work on handwritten ones. Along with those datasets, it would also improve your AI.
Thank you so much for taking the time to compile and share this 37 GB collection — it is a significant effort and we very much appreciate the contribution. After going through every archive file by file, here is a transparent record of what we were able to incorporate, and the specific reason in each case where we could not.
Datasets we plan to incorporate in a future Urdu OCR training run
(In the table below, “synthetic” means the line images were generated on a computer by rendering text with Urdu fonts — they are not photos or scans of physical printed material.)
File
Sample count
License
Notes
MMU_OCR_21_Urdu_Printed_Text_Corpus.zip
301,623 line images (text-line subset only; full corpus has 602K char/word/line)
4.75 M ligature images (3.7 M train + 1.04 M test)
The GitHub repo (AtiqueUrRehman/qaida) declares the data under CC-BY 4.0 in its README, but the Kaggle re-upload by the same author (atique/qaida-dataset) is tagged CC-BY-NC 4.0 — and NC (NonCommercial) is not equivalent to plain CC-BY. We will follow up with the author to clarify which terms are actually intended; the dataset is not currently in the training queue, but we may be able to fold it in once the upstream license is unambiguous.
OCR-named datasets we did not include in this round
File | Sample count | Notes
 | | The PUCIT distribution page itself does not state a license; a Kaggle re-upload (i191796majid/pucit-ohul-...) tags it CC0 but the re-uploader is not the data owner, so we treat the upstream PUCIT terms as authoritative
Nust_UHWR_dataset.zip | 10,601 real handwritten lines | Kaggle license is "Unknown"
UHaT_Urdu_handwritten_text_dataset.zip (also a copy under small/) | | Kaggle license is explicitly "Unknown" (despite the same author’s MMU-OCR-21 being CC-BY 4.0); we are not treating it as licensed without an explicit statement
 | | Character-level only; not in scope for our line-level pipeline this round
Ink_Insight_Handwritten_Urdu_Letters_Dataset.zip | 2,288 letter images | Character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC0)
UrduMNIST.zip | 50,260 char images | Character-level only; not in scope for our line-level pipeline this round (Kaggle license is also "Unknown")
Urdu_Aphabets_MNIST.zip | MNIST-style CSV | Stored as a CSV pixel matrix (not raw image files) and character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC-BY 3.0)
urdu_alphabets_dataset.zip | small CSV | Alphabet metadata only; no usable training data
UrduDataset.zip | 50,260 char images | Character-level only; not in scope for our line-level pipeline this round
Urdu_Handwritten_Text_Dataset.zip | 273 char images | Character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC-BY 4.0 per the Kaggle uploader saurabhshahane)
Urdu_OCR_Ligature_Images.zip | 77,430 ligatures | Ligature-level only, out of scope for our line-level pipeline this round; the Kaggle license CC-BY-SA 4.0 (share-alike) would also be incompatible with closed-source model training
 | | Ligature-level only, out of scope this round; Kaggle license is CDLA-Sharing-1.0 (share-alike) which would also be incompatible with closed-source training
Urdu_nastaleeq_ligatures_Images_23206.zip | 46,412 ligatures + 2 xls | Ligature-level only, out of scope for our line-level pipeline this round (license itself is fine: CC0)
 | | 3 splits of raw images but no transcriptions at all; usable for text-detection pretraining only (locating text regions), not for OCR (reading the text)
Urdu_Artificial_Text_Text_Detection.zip | 5,426 news screenshots + 5,426 bbox xml | Labels mark where text appears in each image (bounding boxes around text regions) but do not include what the text says; text-detection only, not transcription; the Kaggle license CC-BY-SA 4.0 (share-alike) would also be incompatible with closed-source training
Hand_written_dataset_urdu.zip | 238,650 png images | Top folder is pashto_hand/, so the content appears to be Pashto rather than Urdu
Urdu_Eng_Data.zip | 130K snippets | The labels are English medical terms (Allergy, Amoxil 100mg, blood circulation, …), so the content looks like English handwritten clinical text rather than Urdu OCR
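For readers who want to apply the same kind of screening to their own collections, the two checks that recur in the entries above (license class and annotation granularity) can be sketched roughly as follows. This is only an illustration and not the review process we actually ran: the names `LicenseClass`, `classify_license`, and `in_scope` are hypothetical, and the lookup sets contain only the license tags mentioned in this email.

```python
from enum import Enum


class LicenseClass(Enum):
    PERMISSIVE = "permissive"
    SHARE_ALIKE = "share-alike"
    NON_COMMERCIAL = "non-commercial"
    UNKNOWN = "unknown"


# Hypothetical lookup sets, built only from the license tags named in this email.
PERMISSIVE = {"CC0", "CC-BY 3.0", "CC-BY 4.0"}
SHARE_ALIKE = {"CC-BY-SA 4.0", "CDLA-Sharing-1.0"}
NON_COMMERCIAL = {"CC-BY-NC 4.0"}


def classify_license(tag):
    """Map a license tag to a coarse class; anything unrecognized is UNKNOWN."""
    tag = (tag or "").strip()
    if tag in PERMISSIVE:
        return LicenseClass.PERMISSIVE
    if tag in SHARE_ALIKE:
        return LicenseClass.SHARE_ALIKE
    if tag in NON_COMMERCIAL:
        return LicenseClass.NON_COMMERCIAL
    return LicenseClass.UNKNOWN


def in_scope(tag, granularity):
    """Assumed rule: keep only permissively licensed, line-level data."""
    return classify_license(tag) is LicenseClass.PERMISSIVE and granularity == "line"
```

Under these assumptions, `in_scope("CC-BY 4.0", "line")` is true, while `in_scope("CC0", "ligature")` and `in_scope("Unknown", "line")` are both false, which mirrors the reasoning given for the ligature-level and unknown-license entries.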
Off-topic for OCR (routed to other workstreams)
Category
Files
Note
Speech / ASR with transcript (Automatic Speech Recognition: audio + matching transcript text)
Re-bundles of the small/ subset; same content already present individually.
Same-name copies of Urdu_OCR_Handwriting_Data.zip, UHaT_Urdu_handwritten_text_dataset.zip, Urdu_Text_in_Scene_Image.zip, Urdu_OCR_Ligature_Thickness_Graphs_Real_Dataset.zip, Handwritten_Urdu_Characters_Dataset.zip, Urdu_Speech_Dataset.zip at root vs. in big/ or small/
Identical content stored twice; one canonical copy retained.
XNLI_Multilingual_NLI_urdu_only.zip
Identical to XNLI_18_Langauge_NLI_Dataset_urdu.zip.
shrutilipi_fairseq.zip
Identical to ai4bharat_transc_shrutilipi_fairseq.zip.
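If it is useful when preparing future uploads, byte-identical archives like the ones listed above can be found before sharing by grouping files on a content hash. The sketch below is a minimal illustration using only the Python standard library; the function names `sha256_of` and `find_duplicates` are hypothetical, and it only catches archives that are identical byte for byte (two zips with the same contents but different packaging metadata would not match).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of .zip files under root that share the same SHA-256."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*.zip")):
        by_hash[sha256_of(path)].append(path)
    # Keep only hashes that occur more than once.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates(".")` from the top of the collection would list, for example, a root-level archive together with its copy under big/ or small/, so that only one canonical copy needs to be uploaded.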
Summary
After review, about 312,000 labelled line-level Urdu OCR images from your contribution, all under an explicit permissive license, have been prepared and queued for a future training run (MMU-OCR-21, 80_clean, Urdu_OCR_Handwriting_Data).