When the Model Learns to Read Us

For years, asking a large language model to work in Urdu was an exercise in mild humiliation. The Nastaliq script came back garbled, when it appeared at all; more often, the system silently swapped it for a stiff Arabic Naskh that no Pakistani reader would recognise as their own. Idioms were rendered literally. Anyone who tried to use these tools in their own tongue learned to switch to English, or to give up. The productivity dividend the rest of the world was quietly receiving was being paid out in a language the majority of us do not think in.

That is beginning to shift, and the shift matters more than it appears.

This year, a Pakistani researcher named Taimoor Hassan released Qalb, the largest language model built so far for Urdu, trained on nearly two billion tokens and benchmarked across more than seven international evaluation frameworks. Meta’s XLM-R 2.0 posted an eleven-point improvement in transfer to unseen scripts. The frontier labs are still English-first, but they are no longer English-only. Models are starting to learn our morphology, our right-to-left script, and our literary register. This is not a research footnote but a quiet redistribution of access.

Voice changes this further. The Urdu-capable voice mode that arrived in tools like ChatGPT means that, for the first time, a Pakistani who cannot or does not want to type can ask a question aloud and hear an answer back in their own tongue. In a country where literacy is uneven but spoken Urdu is universal, that changes who can use these tools at all.

Consider what an Urdu-fluent model unlocks. A clerk in a tehsil office summarising case notes. A nurse in a rural BHU charting in Urdu instead of struggling through clinical English. An agricultural extension officer translating a soil report for a Sindhi farmer in real time. None of these professionals has been served by AI until now, because the tools were trained for someone else’s working language. A 2025 Stanford analysis estimated that countries whose primary languages are underrepresented in AI show roughly twenty per cent lower AI usage, attributable to language alone. That gap is not a market opening but a structural exclusion.

Urdu is spoken by over 200 million people. By population, that places it among the world’s top dozen languages. By representation in training data, it has historically sat far below that. The firms with the capability to build serious Urdu models are not the ones with the strongest commercial incentive to do so. A bank in California building the world’s best English assistant does not feel pressure from the absence of Urdu in its evaluation suite.

This is precisely the point at which a state becomes useful. When private firms have little commercial reason to invest in a language, a credible national buyer can change the calculation. The Prime Minister’s commitment, at Indus AI Week, of a billion dollars in artificial intelligence by 2030 is most consequential not as a number, but as a signal that there is a national commitment to the work of making Urdu, and our other regional languages, first-class citizens of the model. Corpus building, Nastaliq-aware tokenisation, evaluation benchmarks for legal and medical Urdu. These are not items a Silicon Valley product roadmap will reach on its own. They are essential to any country that wants its people to be understood by the machines they will increasingly live alongside.

There is a literary side to this that is easy to miss. Every previous shift in medium, from the printing press to broadcast radio to the open web, redrew who counted in public life. Languages early to a medium shaped it; languages late were shaped by it. Urdu’s relationship with print, in the nineteenth century, was a fight against being treated as an afterthought to Persian and English. Some of us have watched smaller versions of this fight more recently, when Nastaliq itself had to be coaxed onto a web that did not yet know how to render it. The same fight is now beginning in a new medium, and the stakes are higher because this medium does not just carry our writing. It interprets it, summarises it, and decides what counts as a competent rendering of it.

There is reason to be cautiously hopeful. Pakistani researchers are doing serious work, indigenous models are being trained, and national investment is, for the first time, oriented toward this question. The next decade will decide whether the language we read poetry in is also the language we instruct our machines in. That choice is being made now, in code and corpus and budget line, and the country has a narrow window in which to make it the right way.

The writer is a civil servant.
