Problem with Urdu in Kimi K2

Dear Kimi Team,

I hope this message finds you well. I am Salman from Pakistan. I recently tested your latest AI, Kimi K2, as an OCR tool on images containing Urdu text, and unfortunately the results showed 0% accuracy in recognizing the Urdu in the picture; it’s as if the model doesn’t know the language at all. The same picture is attached to this email so you can test it yourself.

You can view the test results here: Kimi Urdu OCR Test Result

Kimi AI - Kimi K2 Thinking is here

In comparison, I tested the same image using Gemini 2.5 Pro, which successfully and accurately recognized all the Urdu text.

Gemini 2.5 Pro result for the same picture:

https://gemini.google.com/share/c99645acfddc

To support the improvement of Urdu language recognition in your AI, I’m willing to share a comprehensive Urdu dataset collection (approx. 40 GB). Your Kimi K2 team can use it for training or fine-tuning purposes.

Dataset folder link for download:

urdu datasets shared from Docume******Heaven - TeraBox

Urdu is spoken by over 100 million people globally, and adding robust support for it would be a significant step forward for your platform’s accessibility and global relevance.

Best regards,

Salman


Dear Salman,

Thank you for reaching out and sharing your experience with the Kimi K2 OCR. I understand that you encountered issues with the recognition of Urdu text in images, and I appreciate your detailed feedback.

  1. Please note that, according to the Moonshot AI Open Platform documentation, the Kimi K2 model “Does not support vision functionality”. This is likely the reason for the 0% accuracy you experienced with Urdu text recognition.
  2. You may want to try switching to the Kimi K1.5 model, which has vision capabilities and might provide better results for your OCR needs.
  3. Thank you for your offer to share a comprehensive Urdu dataset. We are indeed working very hard on bringing image input capabilities to K2, and your dataset could be very valuable for improving our model’s performance in recognizing Urdu text.

Best,
Yu

Dear yuikns,

I have already tested that one as well; it is worse than Kimi K2.

It’s clear that no Urdu datasets were used to train your models.

Most open-source models from China (including yours) fail to understand or generate proper Urdu simply because Urdu is underrepresented in their training data, unlike ChatGPT, Gemini, or Claude.

Urdu is spoken by millions of people in Pakistan!

I strongly urge you to include my Urdu datasets (40 GB; big and small, handwritten and digital, plus audio) in future model training. Thanks.

Kimi K1.5 result

Kimi - An AI assistant that can reason, analyze, and think deeply

Thank you for testing K1.5 and for the additional context; we truly appreciate it.

Bro yuikns, did you forward my dataset link to your AI training team so they can download and analyze it?

Dear Msalman,

I already escalated the ticket internally yesterday.

If it’s convenient for you, sharing the datasets via Google Drive, a public torrent, or another direct link would let the team retrieve and analyze them more smoothly.

Best,
Yu

TeraBox is free to use!
Just follow these steps:

  1. Install the TeraBox app
  2. Create a free account
  3. Open my link in the app
  4. Start the main folder download

You can easily download the main folder on your phone.

There is also an option to save the files to your own account.

I have also copied the files to a temporary email account with a temporary password; you can download them from there using my login, with the desktop or phone app.

jonedave3 @ gmail .com

Note: For security reasons, the password has been removed and hidden.


Dear yuikns,

Did you download all the Urdu datasets, and was everything fine? I have also uploaded some new ones, including OCR picture datasets; kindly download all of those as well.

I am available to test your model's Urdu performance before public release, to verify how well it handles the language.

Also, if possible, add the open-source dots.ocr 3B model to your AI. I have tested it: it works on digital Urdu pictures but not on handwritten ones. Along with those datasets, it would also improve your AI.

Dear yuikns,

Did you manage to download my Urdu datasets?

Hi Msalman,

Thank you so much for taking the time to compile and share this 37 GB collection — it is a significant effort and we very much appreciate the contribution. After going through every archive file by file, here is a transparent record of what we were able to incorporate, and the specific reason in each case where we could not.

Datasets we plan to incorporate in a future Urdu OCR training run

(In the table below, “synthetic” means the line images were generated on a computer by rendering text with Urdu fonts — they are not photos or scans of physical printed material.)

| File | Sample count | License | Notes |
| --- | --- | --- | --- |
| MMU_OCR_21_Urdu_Printed_Text_Corpus.zip | 301,623 line images (text-line subset only; full corpus has 602K char/word/line) | CC-BY 4.0 | Synthetic printed lines × 3 fonts (Naskh, Nastaliq, Tehreer) |
| 80_clean_handwritten_Urdu_OCR_lines.zip | 78 real handwritten lines | CC0 (per the Kaggle dataset metadata) | Small but cleanly transcribed |
| Urdu_OCR_Handwriting_Data.zip | 10,063 lines | CC0 (per the Kaggle dataset metadata) | Synthetic Nastaliq computer-font renderings (close in style to MMU-Extension-22) |

Combined, this is about 312,000 labelled line-level Urdu OCR images under an explicit permissive license, queued for an upcoming training run.
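As a quick check of the combined figure, the three counts from the table above can be summed. This is only a minimal Python sketch; the variable names are chosen here for illustration.

```python
# Line-image counts as listed in the table above.
line_counts = {
    "MMU_OCR_21_Urdu_Printed_Text_Corpus.zip": 301_623,
    "80_clean_handwritten_Urdu_OCR_lines.zip": 78,
    "Urdu_OCR_Handwriting_Data.zip": 10_063,
}

total = sum(line_counts.values())
rounded = round(total, -3)  # round to the nearest thousand

print(total)    # 311764
print(rounded)  # 312000
```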

License conflict — pending clarification with the author

| File | Sample count | License conflict |
| --- | --- | --- |
| Qaida_Dataset_Large_scale_font_independent_Urdu_text_recognition_data_set.zip | 4.75 M ligature images (3.7 M train + 1.04 M test) | The GitHub repo (AtiqueUrRehman/qaida) declares the data under CC-BY 4.0 in its README, but the Kaggle re-upload by the same author (atique/qaida-dataset) is tagged CC-BY-NC 4.0, and NC (NonCommercial) is not equivalent to plain CC-BY. We will follow up with the author to clarify which terms are actually intended; the dataset is not currently in the training queue, but we may be able to fold it in once the upstream license is unambiguous. |

OCR-named datasets we did not include in this round

| File | Sample count | Reason |
| --- | --- | --- |
| UNHD_Dataset_Urdu_Handwritten_Dataset.zip | 7,341 real handwritten lines | Kaggle license is "Data files © Original Authors": original authors retain copyright, so we cannot use it for model training without an explicit grant |
| PUCIT_OHUL_Handwritten_Urdu_Lines_Dataset_OCR.zip | 7,269 real handwritten lines | The PUCIT distribution page itself does not state a license; a Kaggle re-upload (i191796majid/pucit-ohul-...) tags it CC0 but the re-uploader is not the data owner, so we treat the upstream PUCIT terms as authoritative |
| Nust_UHWR_dataset.zip | 10,601 real handwritten lines | Kaggle license is "Unknown" |
| UHaT_Urdu_handwritten_text_dataset.zip (also a copy under small/) | 41,228 char images | Kaggle license is "Data files © Original Authors"; also character-level only (out of scope for our line-level pipeline this round) |
| MMU_Extension_22_Multi_font_Urdu_printed_text_lines.zip | 245,000 synthetic printed lines × 7 fonts | Kaggle license is explicitly "Unknown" (despite the same author’s MMU-OCR-21 being CC-BY 4.0); we are not treating it as licensed without an explicit statement |
| Handwritten_Urdu_Characters_Dataset.zip (root + small/) | 105,539 char images | Character-level only; not in scope for our line-level pipeline this round |
| Handwritten_Urdu_Characters_Dataset_Digits_only.zip | 7,444 digit images | Character-level only; not in scope for our line-level pipeline this round |
| Ink_Insight_Handwritten_Urdu_Letters_Dataset.zip | 2,288 letter images | Character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC0) |
| UrduMNIST.zip | 50,260 char images | Character-level only; not in scope for our line-level pipeline this round (Kaggle license is also "Unknown") |
| Urdu_Aphabets_MNIST.zip | MNIST-style CSV | Stored as a CSV pixel matrix (not raw image files) and character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC-BY 3.0) |
| urdu_alphabets_dataset.zip | small CSV | Alphabet metadata only; no usable training data |
| UrduDataset.zip | 50,260 char images | Character-level only; not in scope for our line-level pipeline this round |
| Urdu_Handwritten_Text_Dataset.zip | 273 char images | Character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC-BY 4.0 per the Kaggle uploader saurabhshahane) |
| Urdu_OCR_Ligature_Images.zip | 77,430 ligatures | Ligature-level only, out of scope for our line-level pipeline this round; the Kaggle license CC-BY-SA 4.0 (share-alike) would also be incompatible with closed-source model training |
| Urdu_OCR_Ligature_Thickness_Graphs_Real_Dataset.zip (root + small/) | 34,346 ligatures | Ligature-level only, out of scope this round; Kaggle license is CDLA-Sharing-1.0 (share-alike) which would also be incompatible with closed-source training |
| Urdu_nastaleeq_ligatures_Images_23206.zip | 46,412 ligatures + 2 xls | Ligature-level only, out of scope for our line-level pipeline this round (license itself is fine: CC0) |
| Urdu Handwritten Ligature dataset(UHLD).zip, Urdu_Handwritten_Ligature_datasetUHLD.zip, Urdu_Hanfwritten_Ligature_Dataset.zip | 88 ligature pdfs + 26 jpegs (×3 near-copies) | Three Kaggle uploads of similar UHLD content under different licenses (jazz786/...: “Open Database, Contents © Original Authors”; talhaumar/...: MIT); ligature-level data, out of scope for our line-level pipeline this round |
| Urdu_Text_in_Scene_Image.zip (root + small/) | 872 raw scene images | 3 splits of raw images but no transcriptions at all; usable for text-detection pretraining only (locating text regions), not for OCR (reading the text) |
| Urdu_Artificial_Text_Text_Detection.zip | 5,426 news screenshots + 5,426 bbox xml | Labels mark where text appears in each image (bounding boxes around text regions) but do not include what the text says: text-detection only, not transcription; the Kaggle license CC-BY-SA 4.0 (share-alike) would also be incompatible with closed-source training |
| Hand_written_dataset_urdu.zip | 238,650 png images | Top folder is pashto_hand/, so the content appears to be Pashto rather than Urdu |
| Urdu_Eng_Data.zip | 130K snippets | The labels are English medical terms (Allergy, Amoxil 100mg, blood circulation, …), so the content looks like English handwritten clinical text rather than Urdu OCR |
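The reasons in the table above follow a fairly regular pattern (granularity first, then license), which can be sketched roughly as below. This is only an illustrative sketch, not the actual review pipeline: the `Archive` class, the `triage` function, and the two license groupings are hypothetical names inferred from the reasons listed.

```python
from dataclasses import dataclass

# Hypothetical groupings, inferred from the reasons given in the table above.
PERMISSIVE = {"CC0", "CC-BY 4.0", "CC-BY 3.0"}
SHARE_ALIKE = {"CC-BY-SA 4.0", "CDLA-Sharing-1.0"}


@dataclass
class Archive:
    name: str
    license: str      # license string as shown on the hosting page
    granularity: str  # "line", "ligature", or "char"


def triage(archive: Archive) -> str:
    """Return 'include' or a short reason for skipping the archive."""
    # Only line-level data is in scope for this round.
    if archive.granularity != "line":
        return f"skip: {archive.granularity}-level only"
    if archive.license in SHARE_ALIKE:
        return "skip: share-alike license"
    # "Unknown", "Data files © Original Authors", NC variants, etc.
    if archive.license not in PERMISSIVE:
        return "skip: no explicit permissive license"
    return "include"


print(triage(Archive("Urdu_OCR_Handwriting_Data.zip", "CC0", "line")))
print(triage(Archive("Nust_UHWR_dataset.zip", "Unknown", "line")))
print(triage(Archive("UrduMNIST.zip", "Unknown", "char")))
```

Checking granularity before license mirrors how the table words its reasons: out-of-scope sets are skipped even when their license is fine.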

Off-topic for OCR (routed to other workstreams)

| Category | Files | Note |
| --- | --- | --- |
| Speech / ASR with transcript (Automatic Speech Recognition: audio + matching transcript text) | CV_13_Urdu_hasankuzagsr.zip, Urdu_Audio_Clips_with_Transcriptions.zip, Urdu_Data_Set_mrohaankhan.zip, Urdu_audio_dataset_with_transcription_20000_file.zip, voice_data_urdu_hashirabbasi121.zip, Urdu_Speech_To_Text_Dataset.zip, ai4bharat_transc_shrutilipi_fairseq.zip | Valid Urdu ASR corpora (≈ 9 GB), but they do not influence image OCR; routed separately. |
| Speech / no transcript | religious_urdu_nairsaanvi.wav.files.rar, urdu_religious_dataset.zip, Dataset_Language_Classification_audio_urdu_only.zip, archive1.zip, Indian_Languages_Audio_Dataset.zip, san_project_english_and_urdu.zip, Urdu_Emotion_Speech_Dataset_*.zip, urdu_audio_clips.zip, urdu_speech_dataset_2.zip, urdu_data_set.zip, Urdu_Emotion_Dataset.zip, Urdu_Language_Speech_Dataset.zip, Urdu_Speech_Dataset.zip (×2), Urdu_ML_Dataset.zip, Urdu_Voice_Wav_dataset.zip | ≈ 11 GB of audio without ground-truth transcriptions; limited use for ASR fine-tuning. |
| Pure-text NLP (off-topic for image OCR) | xlsum_Urdu_Dataset_news_Summary.zip, XNLI_18_Langauge_NLI_Dataset_urdu.zip, Urdu_Wikipedia_Articles.zip, Urdu_News_Dataset.zip, Urdu_GPT.zip, Urdu_files.zip, Urdu_ngrams.zip, Urdu_Sarcastic_Tweets_Dataset.zip, Urdu_Tweets_Dataset_for_Spam_Detection.zip, Urdu_Authorship_Attribution.zip, Urdu_Named_Entity_Recognition_Dataset.zip, Urdu_Name_Entity_Recognition_Dataset_MK_PUCIT.zip, Urdu_News_Recommendation_System_Data.zip, Urdu_News_Scrapped_Dataset_for_Multiple_Choice.zip, MSCOCO_Urdu.zip, Urdu_Data_Bulk.zip, Urdu_Data_File.zip, Urdu_Dataset_2.zip, urdu_dataset_text.zip, multilingual_sentiment_analysis_dataset_twitter_politcal_parties_Pakistan.zip, old_newspaper_urdu_dataset.zip, The_Holy_Quran.zip, The_Holy_Quran_in_44_Languages.zip, Language_Identification_dataset.zip, Urdu_OCR_Scale_Invariant_Feature_Vectors_MAT_CSV.zip, Urdu_mirfan899.zip | Pure-text data; does not influence image OCR. |
| Roman-Urdu (Latin script, off-topic) | Roman_Urdu_Dataset.zip, Roman_Urdu_Dataset_2.zip, Roman_Urdu_Dataset_3.zip, Roman_Urdu_Sentiments.zip, Roman_Urdu_Sentiment_Analysis.zip, Roman_Urdu_Hate_Speech.zip | Latin transliteration of Urdu; unrelated to native-script OCR. |
| Abusive / sensitive | abusiveDataset_urdu.zip, research_abuseDetection_urdu_dataset.zip | Profanity detection; not used for generative training. |
| Pretrained vectors | Urdu_Word2vec.zip, FastText_Urdu_Vectors.zip, Bert_urdu_dataset_40k.zip | Pre-trained text embeddings / training corpora; limited value for image OCR. |
| Sign language | Urdu_Sign_Language.zip, Pakistan_Sign_Language_Dataset_openPose.zip | Sign-language gesture images / keypoints; unrelated to text OCR. |
| Video | Test_Videos_Videos_clips_in_Urdu.zip | 11 news/podcast clips; no OCR ground truth. |
| Mislabelled language | Yelp_Review_with_Sentiments_and_Features.zip | English Yelp reviews, not Urdu. |
| Fonts | Font_Dataset.zip, arial_font.zip | TTF/OTF fonts; useful for synthesizing more printed Urdu OCR data, kept for that purpose. |

Duplicates and re-bundles (deduplicated)

| File | Reason |
| --- | --- |
| urdu_small_all_data_sets_all_combined1.zip, *combined2.zip, *combined3.zip | Re-bundles of the small/ subset; same content already present individually. |
| Same-name copies of Urdu_OCR_Handwriting_Data.zip, UHaT_Urdu_handwritten_text_dataset.zip, Urdu_Text_in_Scene_Image.zip, Urdu_OCR_Ligature_Thickness_Graphs_Real_Dataset.zip, Handwritten_Urdu_Characters_Dataset.zip, Urdu_Speech_Dataset.zip at root vs. in big/ or small/ | Identical content stored twice; one canonical copy retained. |
| XNLI_Multilingual_NLI_urdu_only.zip | Identical to XNLI_18_Langauge_NLI_Dataset_urdu.zip. |
| shrutilipi_fairseq.zip | Identical to ai4bharat_transc_shrutilipi_fairseq.zip. |
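For anyone who wants to spot byte-identical copies like these in their own download, one common approach is to group archives by content hash. The sketch below is an assumption about how such a check could be done, not a description of the method actually used; `sha256_of` and `find_duplicates` are hypothetical helper names. It would catch exact copies (such as the root vs. small/ duplicates) but not re-bundles whose archive bytes differ.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of .zip files under root whose contents are byte-identical."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*.zip")):
        by_hash[sha256_of(path)].append(path)
    # Keep only hashes that occur more than once.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("downloads"))` on a hypothetical download folder would return one list per set of identical archives, from which a single canonical copy can be kept.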

Summary

From your contribution, after review, about 312,000 labelled line-level Urdu OCR images under an explicit permissive license are prepared and queued for a future training run (MMU-OCR-21, 80_clean, Urdu_OCR_Handwriting_Data).

What is gating the rest is licensing clarity. The real-world handwritten releases (UNHD, PUCIT-OHUL, NUST-UHWR) are typically distributed “for research” on individual request without an open license grant, and several Kaggle uploads (MMU-Extension-22, UHaT, the various character / ligature sets, etc.) ship with Kaggle license "Unknown" or "Data files © Original Authors". We plan to follow up with the original authors / uploaders — including the Qaida author about the GitHub vs Kaggle license discrepancy — to ask about explicit permissive licensing; if any of them confirm, we can fold those datasets in. Until then, we will not incorporate those particular files.

Thank you again for the contribution.
