Problem with Urdu in Kimi K2

Msalman · August 2, 2025, 1:06pm

Dear Kimi Team,

I hope this message finds you well. I am Salman from Pakistan. I recently tested your latest AI, Kimi K2, as an OCR tool on images containing Urdu text, and unfortunately the results showed 0% accuracy in recognizing the Urdu in the picture; it is as if the model does not know the language at all. The same picture is attached to this email so you can test it yourself.

You can view the test results here: Kimi Urdu OCR Test Result

Kimi AI - Kimi K2 Thinking is here

In comparison, I tested the same image using Gemini 2.5 Pro, which successfully and accurately recognized all the Urdu text.

Gemini 2.5 Pro result for the same picture:

https://gemini.google.com/share/c99645acfddc

To support the improvement of Urdu language recognition in your AI, I’m willing to share a comprehensive collection of Urdu datasets (approx. 40 GB). Your Kimi K2 team can use it for training or fine-tuning purposes.

Dataset folder link for download:

urdu datasets shared from Docume******Heaven - TeraBox

Urdu is spoken by over 100 million people globally, and adding robust support for it would be a significant step forward for your platform’s accessibility and global relevance.

Best regards,

Salman

yuikns · August 2, 2025, 2:42pm

Dear Salman,

Thank you for reaching out and sharing your experience with the Kimi K2 OCR. I understand that you encountered issues with the recognition of Urdu text in images, and I appreciate your detailed feedback.

Please note that, according to the Moonshot AI Open Platform documentation, the Kimi K2 model “Does not support vision functionality”. This is likely the reason for the 0% accuracy you experienced with Urdu text recognition.
You may want to try switching to the Kimi K1.5 model, which has vision capabilities and might provide better results for your OCR needs.
Thank you for your offer to share a comprehensive Urdu dataset. We are indeed working very hard on bringing image input capabilities to K2, and your dataset could be very valuable for improving our model’s performance in recognizing Urdu text.

Bests,
Yu

Msalman · August 2, 2025, 6:52pm

Dear yuikns,

I have already tested that one as well; it is worse than Kimi K2.

It’s clear that no Urdu datasets were used to train your models.

Most open-source models from China (including yours) fail to understand or generate proper Urdu simply because Urdu is underrepresented in their training data, unlike ChatGPT, Gemini, or Claude.

Urdu is spoken by millions of people in Pakistan!

I strongly urge you to include my Urdu datasets (40 GB; big and small; handwritten, digital, and audio) in future model training. Thanks.

Kimi K1.5 result

Kimi - An AI assistant that can reason, analyze, and think deeply

yuikns · August 3, 2025, 12:52am

Thank you for testing K1.5 and for the additional context; we truly appreciate it.

Msalman · August 3, 2025, 8:22pm

Bro yuikns, did you forward my datasets link to your AI training team so they can download and analyze it?

yuikns · August 4, 2025, 2:15am

Dear Msalman,

I had already escalated the ticket internally yesterday.

If it’s convenient for you, sharing the datasets via Google Drive, a public torrent, or another direct link would let the team retrieve and analyze them more smoothly.

Bests,
Yu

Msalman · August 4, 2025, 9:25am

Terabox is free to use!
Just follow these steps:

Install the Terabox app
Create a free account
Open my link in the app
Start main folder download

You can easily download the main folder on your phone.

There is also an option to save the files to your own account.

Msalman · August 4, 2025, 4:44pm

I have copied the files to a temporary email account with a temporary password. You can download them from there using my login, with either the desktop or the phone app.

jonedave3 @ gmail .com

Note: For security reasons, the password has been removed and hidden.

Msalman · August 5, 2025, 3:18pm

Dear yuikns,

Did you download all the Urdu datasets, and is everything OK? I have also uploaded some new ones, including OCR picture datasets; kindly download all of those Urdu datasets as well.

I am available to test your model before public release to verify its performance on the Urdu language.

Also, if possible, add the open-source dots.ocr 3B model to your AI. I have tested it: it works on digital Urdu pictures but not on handwritten ones. Along with those datasets, it would also improve your AI.

Msalman · August 19, 2025, 7:27pm

Dear yuikns,

Did you manage to download my Urdu datasets?

liyang · May 6, 2026, 7:22am

Hi Msalman,

Thank you so much for taking the time to compile and share this 37 GB collection — it is a significant effort and we very much appreciate the contribution. After going through every archive file by file, here is a transparent record of what we were able to incorporate, and the specific reason in each case where we could not.

Datasets we plan to incorporate in a future Urdu OCR training run

(In the table below, “synthetic” means the line images were generated on a computer by rendering text with Urdu fonts — they are not photos or scans of physical printed material.)

| File | Sample count | License | Notes |
| --- | --- | --- | --- |
| `MMU_OCR_21_Urdu_Printed_Text_Corpus.zip` | 301,623 line images (text-line subset only; full corpus has 602K char/word/line) | CC-BY 4.0 | Synthetic printed lines × 3 fonts (Naskh, Nastaliq, Tehreer) |
| `80_clean_handwritten_Urdu_OCR_lines.zip` | 78 real handwritten lines | CC0 (per the Kaggle dataset metadata) | Small but cleanly transcribed |
| `Urdu_OCR_Handwriting_Data.zip` | 10,063 lines | CC0 (per the Kaggle dataset metadata) | Synthetic Nastaliq computer-font renderings (close in style to MMU-Extension-22) |

Combined, this is about 312,000 labelled line-level Urdu OCR images under an explicit permissive license, queued for an upcoming training run.
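
The combined figure can be checked by adding up the three sample counts from the table. A minimal Python sketch (the dictionary name `line_counts` is only illustrative):

```python
# Sample counts copied from the table of datasets planned for inclusion.
line_counts = {
    "MMU_OCR_21_Urdu_Printed_Text_Corpus.zip": 301_623,
    "80_clean_handwritten_Urdu_OCR_lines.zip": 78,
    "Urdu_OCR_Handwriting_Data.zip": 10_063,
}

total = sum(line_counts.values())
print(total)             # 311764
print(round(total, -3))  # 312000, i.e. "about 312,000"
```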

License conflict — pending clarification with the author

| File | Sample count | License conflict |
| --- | --- | --- |
| `Qaida_Dataset_Large_scale_font_independent_Urdu_text_recognition_data_set.zip` | 4.75 M ligature images (3.7 M train + 1.04 M test) | The GitHub repo (`AtiqueUrRehman/qaida`) declares the data under CC-BY 4.0 in its README, but the Kaggle re-upload by the same author (`atique/qaida-dataset`) is tagged CC-BY-NC 4.0, and NC (NonCommercial) is not equivalent to plain CC-BY. We will follow up with the author to clarify which terms are actually intended; the dataset is not currently in the training queue, but we may be able to fold it in once the upstream license is unambiguous. |

OCR-named datasets we did not include in this round

| File | Sample count | Reason |
| --- | --- | --- |
| `UNHD_Dataset_Urdu_Handwritten_Dataset.zip` | 7,341 real handwritten lines | Kaggle license is `"Data files © Original Authors"`; original authors retain copyright, so we cannot use it for model training without an explicit grant |
| `PUCIT_OHUL_Handwritten_Urdu_Lines_Dataset_OCR.zip` | 7,269 real handwritten lines | The PUCIT distribution page itself does not state a license; a Kaggle re-upload (`i191796majid/pucit-ohul-...`) tags it CC0 but the re-uploader is not the data owner, so we treat the upstream PUCIT terms as authoritative |
| `Nust_UHWR_dataset.zip` | 10,601 real handwritten lines | Kaggle license is `"Unknown"` |
| `UHaT_Urdu_handwritten_text_dataset.zip` (also a copy under `small/`) | 41,228 char images | Kaggle license is `"Data files © Original Authors"`; also character-level only (out of scope for our line-level pipeline this round) |
| `MMU_Extension_22_Multi_font_Urdu_printed_text_lines.zip` | 245,000 synthetic printed lines × 7 fonts | Kaggle license is explicitly `"Unknown"` (despite the same author’s MMU-OCR-21 being CC-BY 4.0); we are not treating it as licensed without an explicit statement |
| `Handwritten_Urdu_Characters_Dataset.zip` (root + `small/`) | 105,539 char images | Character-level only; not in scope for our line-level pipeline this round |
| `Handwritten_Urdu_Characters_Dataset_Digits_only.zip` | 7,444 digit images | Character-level only; not in scope for our line-level pipeline this round |
| `Ink_Insight_Handwritten_Urdu_Letters_Dataset.zip` | 2,288 letter images | Character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC0) |
| `UrduMNIST.zip` | 50,260 char images | Character-level only; not in scope for our line-level pipeline this round (Kaggle license is also `"Unknown"`) |
| `Urdu_Aphabets_MNIST.zip` | MNIST-style CSV | Stored as a CSV pixel matrix (not raw image files) and character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC-BY 3.0) |
| `urdu_alphabets_dataset.zip` | small CSV | Alphabet metadata only; no usable training data |
| `UrduDataset.zip` | 50,260 char images | Character-level only; not in scope for our line-level pipeline this round |
| `Urdu_Handwritten_Text_Dataset.zip` | 273 char images | Character-level only; not in scope for our line-level pipeline this round (license itself is fine: CC-BY 4.0 per the Kaggle uploader `saurabhshahane`) |
| `Urdu_OCR_Ligature_Images.zip` | 77,430 ligatures | Ligature-level only, out of scope for our line-level pipeline this round; the Kaggle license CC-BY-SA 4.0 (share-alike) would also be incompatible with closed-source model training |
| `Urdu_OCR_Ligature_Thickness_Graphs_Real_Dataset.zip` (root + `small/`) | 34,346 ligatures | Ligature-level only, out of scope this round; Kaggle license is CDLA-Sharing-1.0 (share-alike) which would also be incompatible with closed-source training |
| `Urdu_nastaleeq_ligatures_Images_23206.zip` | 46,412 ligatures + 2 xls | Ligature-level only, out of scope for our line-level pipeline this round (license itself is fine: CC0) |
| `Urdu Handwritten Ligature dataset(UHLD).zip`, `Urdu_Handwritten_Ligature_datasetUHLD.zip`, `Urdu_Hanfwritten_Ligature_Dataset.zip` | 88 ligature pdfs + 26 jpegs (×3 near-copies) | Three Kaggle uploads of similar UHLD content under different licenses (`jazz786/...`: “Open Database, Contents © Original Authors”; `talhaumar/...`: MIT); ligature-level data, out of scope for our line-level pipeline this round |
| `Urdu_Text_in_Scene_Image.zip` (root + `small/`) | 872 raw scene images | 3 splits of raw images but no transcriptions at all; usable for text-detection pretraining only (locating text regions), not for OCR (reading the text) |
| `Urdu_Artificial_Text_Text_Detection.zip` | 5,426 news screenshots + 5,426 bbox xml | Labels mark where text appears in each image (bounding boxes around text regions) but do not include what the text says, so this is text-detection only, not transcription; the Kaggle license CC-BY-SA 4.0 (share-alike) would also be incompatible with closed-source training |
| `Hand_written_dataset_urdu.zip` | 238,650 png images | Top folder is `pashto_hand/`, so the content appears to be Pashto rather than Urdu |
| `Urdu_Eng_Data.zip` | 130K snippets | The labels are English medical terms (Allergy, Amoxil 100mg, blood circulation, …), so the content looks like English handwritten clinical text rather than Urdu OCR |
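
The decisions in the table mostly come down to two checks: whether the license is explicitly permissive, and whether the data is transcribed at line level. The sketch below is one hypothetical way to encode that triage in Python. The `Dataset` class, the `triage` function, and the two license sets are assumptions made for illustration (the sets only list licenses named in this thread), not the team's actual pipeline.

```python
from dataclasses import dataclass

# Illustrative groupings, built only from licenses mentioned in this thread.
PERMISSIVE = {"CC0", "CC-BY 4.0", "CC-BY 3.0"}
SHARE_ALIKE = {"CC-BY-SA 4.0", "CDLA-Sharing-1.0"}


@dataclass
class Dataset:
    name: str
    license: str            # license string as shown on the hosting page
    level: str              # "line", "char", "ligature", ...
    has_transcripts: bool = True


def triage(ds: Dataset) -> str:
    """Return 'queued' or the first reason the dataset is held back."""
    if ds.license in SHARE_ALIKE:
        return "excluded: share-alike license"
    if ds.license not in PERMISSIVE:
        return "excluded: license unclear"
    if not ds.has_transcripts:
        return "excluded: no transcriptions"
    if ds.level != "line":
        return "excluded: not line-level"
    return "queued"


examples = [
    Dataset("MMU_OCR_21_Urdu_Printed_Text_Corpus.zip", "CC-BY 4.0", "line"),
    Dataset("Nust_UHWR_dataset.zip", "Unknown", "line"),
    Dataset("Ink_Insight_Handwritten_Urdu_Letters_Dataset.zip", "CC0", "char"),
    Dataset("Urdu_OCR_Ligature_Images.zip", "CC-BY-SA 4.0", "ligature"),
]
for ds in examples:
    print(f"{ds.name}: {triage(ds)}")
```

Note that the function reports only the first failing check, whereas several rows in the table list more than one reason.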

Off-topic for OCR (routed to other workstreams)

| Category | Files | Note |
| --- | --- | --- |
| Speech / ASR with transcript (Automatic Speech Recognition: audio + matching transcript text) | `CV_13_Urdu_hasankuzagsr.zip`, `Urdu_Audio_Clips_with_Transcriptions.zip`, `Urdu_Data_Set_mrohaankhan.zip`, `Urdu_audio_dataset_with_transcription_20000_file.zip`, `voice_data_urdu_hashirabbasi121.zip`, `Urdu_Speech_To_Text_Dataset.zip`, `ai4bharat_transc_shrutilipi_fairseq.zip` | Valid Urdu ASR corpora (≈ 9 GB), but they do not influence image OCR; routed separately. |
| Speech / no transcript | `religious_urdu_nairsaanvi.wav.files.rar`, `urdu_religious_dataset.zip`, `Dataset_Language_Classification_audio_urdu_only.zip`, `archive1.zip`, `Indian_Languages_Audio_Dataset.zip`, `san_project_english_and_urdu.zip`, `Urdu_Emotion_Speech_Dataset_*.zip`, `urdu_audio_clips.zip`, `urdu_speech_dataset_2.zip`, `urdu_data_set.zip`, `Urdu_Emotion_Dataset.zip`, `Urdu_Language_Speech_Dataset.zip`, `Urdu_Speech_Dataset.zip` (×2), `Urdu_ML_Dataset.zip`, `Urdu_Voice_Wav_dataset.zip` | ≈ 11 GB of audio without ground-truth transcriptions; limited use for ASR fine-tuning. |
| Pure-text NLP (off-topic for image OCR) | `xlsum_Urdu_Dataset_news_Summary.zip`, `XNLI_18_Langauge_NLI_Dataset_urdu.zip`, `Urdu_Wikipedia_Articles.zip`, `Urdu_News_Dataset.zip`, `Urdu_GPT.zip`, `Urdu_files.zip`, `Urdu_ngrams.zip`, `Urdu_Sarcastic_Tweets_Dataset.zip`, `Urdu_Tweets_Dataset_for_Spam_Detection.zip`, `Urdu_Authorship_Attribution.zip`, `Urdu_Named_Entity_Recognition_Dataset.zip`, `Urdu_Name_Entity_Recognition_Dataset_MK_PUCIT.zip`, `Urdu_News_Recommendation_System_Data.zip`, `Urdu_News_Scrapped_Dataset_for_Multiple_Choice.zip`, `MSCOCO_Urdu.zip`, `Urdu_Data_Bulk.zip`, `Urdu_Data_File.zip`, `Urdu_Dataset_2.zip`, `urdu_dataset_text.zip`, `multilingual_sentiment_analysis_dataset_twitter_politcal_parties_Pakistan.zip`, `old_newspaper_urdu_dataset.zip`, `The_Holy_Quran.zip`, `The_Holy_Quran_in_44_Languages.zip`, `Language_Identification_dataset.zip`, `Urdu_OCR_Scale_Invariant_Feature_Vectors_MAT_CSV.zip`, `Urdu_mirfan899.zip` | Pure-text data; does not influence image OCR. |
| Roman-Urdu (Latin script, off-topic) | `Roman_Urdu_Dataset.zip`, `Roman_Urdu_Dataset_2.zip`, `Roman_Urdu_Dataset_3.zip`, `Roman_Urdu_Sentiments.zip`, `Roman_Urdu_Sentiment_Analysis.zip`, `Roman_Urdu_Hate_Speech.zip` | Latin transliteration of Urdu; unrelated to native-script OCR. |
| Abusive / sensitive | `abusiveDataset_urdu.zip`, `research_abuseDetection_urdu_dataset.zip` | Profanity detection; not used for generative training. |
| Pretrained vectors | `Urdu_Word2vec.zip`, `FastText_Urdu_Vectors.zip`, `Bert_urdu_dataset_40k.zip` | Pre-trained text embeddings / training corpora; limited value for image OCR. |
| Sign language | `Urdu_Sign_Language.zip`, `Pakistan_Sign_Language_Dataset_openPose.zip` | Sign-language gesture images / keypoints; unrelated to text OCR. |
| Video | `Test_Videos_Videos_clips_in_Urdu.zip` | 11 news/podcast clips; no OCR ground truth. |
| Mislabelled language | `Yelp_Review_with_Sentiments_and_Features.zip` | English Yelp reviews, not Urdu. |
| Fonts | `Font_Dataset.zip`, `arial_font.zip` | TTF/OTF fonts; useful for synthesizing more printed Urdu OCR data, kept for that purpose. |

Duplicates and re-bundles (deduplicated)

| File | Reason |
| --- | --- |
| `urdu_small_all_data_sets_all_combined1.zip`, `combined2.zip`, `combined3.zip` | Re-bundles of the `small/` subset; same content already present individually. |
| Same-name copies of `Urdu_OCR_Handwriting_Data.zip`, `UHaT_Urdu_handwritten_text_dataset.zip`, `Urdu_Text_in_Scene_Image.zip`, `Urdu_OCR_Ligature_Thickness_Graphs_Real_Dataset.zip`, `Handwritten_Urdu_Characters_Dataset.zip`, `Urdu_Speech_Dataset.zip` at root vs. in `big/` or `small/` | Identical content stored twice; one canonical copy retained. |
| `XNLI_Multilingual_NLI_urdu_only.zip` | Identical to `XNLI_18_Langauge_NLI_Dataset_urdu.zip`. |
| `shrutilipi_fairseq.zip` | Identical to `ai4bharat_transc_shrutilipi_fairseq.zip`. |
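
One simple way to spot byte-identical archives like these is to group files by a content hash. The following is a sketch under that assumption, not a description of how the review was actually done; `sha256_of` and `find_duplicates` are hypothetical helper names. Re-bundles whose bytes differ (for example the `combined*.zip` files) would not be caught this way and would need a comparison of the extracted contents instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group .zip files under root (recursively) that have identical bytes."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*.zip")):
        by_hash[sha256_of(path)].append(path)
    # Keep only hashes shared by more than one file.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("datasets"))` on a local copy of the shared folder (the folder name here is an assumption) would return one list per group of identical archives, from which a single canonical copy can be kept.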

Summary

From your contribution, after review, about 312,000 labelled line-level Urdu OCR images under an explicit permissive license are prepared and queued for a future training run (MMU-OCR-21, 80_clean, Urdu_OCR_Handwriting_Data).

What is gating the rest is licensing clarity. The real-world handwritten releases (UNHD, PUCIT-OHUL, NUST-UHWR) are typically distributed “for research” on individual request without an open license grant, and several Kaggle uploads (MMU-Extension-22, UHaT, the various character / ligature sets, etc.) ship with Kaggle license "Unknown" or "Data files © Original Authors". We plan to follow up with the original authors / uploaders — including the Qaida author about the GitHub vs Kaggle license discrepancy — to ask about explicit permissive licensing; if any of them confirm, we can fold those datasets in. Until then, we will not incorporate those particular files.

Thank you again for the contribution.