The following are the outputs of the captioning taken during an IGF intervention. Although it is largely accurate, in some cases it may be incomplete or inaccurate due to inaudible passages or transcription errors. It is posted as an aid, but should not be treated as an authoritative record.
***
>> LOUIS DENART: Hello. Can you guys hear me? We will start with the session now. So, please go to channel 1, and then if you can give me a thumbs up if it's working, that will be great. Cool.
Okay. Thank you for joining. Welcome to our panel on the impact of underrepresented languages in AI. My name is Louis Denart, and I will be moderating today. Sorry for the slight delay; we are trying to make up for it now. I am a fellow working on international digital policy with the German Ministry for Digital and Transport.
So, obviously, we heard about the AI divide already earlier today. The issue of representation and diversity is a critical subject for inclusivity in the digital age. AI is a technology that increasingly attempts to model our reality based on training data, but that data often fails to capture reality. This is especially the case for languages.
For example, take the commonly used Common Crawl, a dataset made of nearly everything on the Internet, which is often used to train large language models. Yet nearly half of the data in it is in English, and it leaves out most of the more than 8,000 languages that UNESCO documents worldwide.
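One way to make that skew concrete is to language-identify a sample of crawled pages and tally the shares. The sketch below is not from the session; it uses fastText's publicly released language-ID model, and the three sample pages are illustrative stand-ins for documents drawn from a crawl like Common Crawl.

```python
# Minimal sketch: measuring language skew in a web corpus with
# fastText's 176-language identification model. The `pages` list is a
# stand-in for documents sampled from a crawl such as Common Crawl.
from collections import Counter
import fasttext

# lid.176.ftz is fastText's released language-ID model; download it once
# from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.ftz")

pages = [
    "The quick brown fox jumps over the lazy dog.",        # English
    "Der schnelle braune Fuchs springt über den Hund.",    # German
    "Mbwa mwitu hodari anaruka juu ya mbwa mvivu.",        # Swahili
]

counts = Counter()
for text in pages:
    # fastText expects single-line input; take the top predicted label.
    labels, _probs = model.predict(text.replace("\n", " "))
    counts[labels[0].replace("__label__", "")] += 1

total = sum(counts.values())
for lang, n in counts.most_common():
    print(f"{lang}: {n / total:.0%}")
```

Run over a real sample of crawl records, a tally like this is how the "nearly half English" figure for web-scale corpora is typically produced.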
So, we are going to discuss this topic today. We are joined by two speakers, whom I will now ask to introduce themselves. We start with Nidhi Singh.
So, I give the floor to you to introduce yourself, please.
>> NIDHI SINGH: Yeah. Okay. Hi. Thank you so much for inviting me here today. My name is Nidhi Singh. I am a Project Manager at the Centre for Communication Governance at the National Law University Delhi in India. I work primarily in information technology law and policy, and for about the last five years I have been working on AI governance and AI ethics, focusing on global majority approaches to how AI is being developed, how norms are being formed, and how it is being regulated and governed at the international stage.
>> LOUIS DENART: Thank you so much.
Now I would like to give the floor to Gustavo Ribeiro. I think he has joined us online. Could you please introduce yourself.
>> GUSTAVO FONSECA RIBEIRO: Hello. Good afternoon to everyone in Riyadh. I apologize because I think my camera is not functioning. We were trying to fix it before we started, but I was not able to. I will introduce myself, and then I will try to leave and rejoin so you can see if the camera is working.
But thank you all for joining. I really appreciate your presence here. My name is Gustavo Fonseca Ribeiro. I am a lawyer from Brazil. I hold a master's in public policy and digital technologies from Sciences Po, a university based in Paris, France. And I am also a specialist consultant on AI and digital transformation at UNESCO.
Here at the global IGF, I am speaking in my capacity as one of the youth ambassadors of the Internet Society for the year 2024. So, I am very happy to join this meeting. Yes. Thank you, Louis.
>> LOUIS DENART: Thank you so much, Gustavo, for joining. You can try to rejoin by video. We would love that.
Nidhi, maybe I start with you with the first question. What are the impacts of underrepresentation of languages in AI for human rights and also from a socioeconomic perspective?
>> NIDHI SINGH: Thank you so much for the question. There are a lot of concerns around bias and inclusivity here, but before we get to those, I want to point out that when we look at which languages AI models are being trained on, it is not just that high-resource languages like English are the ones being adopted. Only specific dialects of English are being adopted. Even for native speakers, it is not necessarily the English you speak that goes into the model. So we are not part of the majority in either case. Even if you do speak English, it is not your dialect of English that goes in; only the version of English most commonly present on the Internet is what these models are trained on. In a sense, everybody is being excluded.
When you look at what follows from this, there are a couple of use cases I wanted to bring up before we get into a deeper discussion, because there are very real-world consequences. As generative AI has taken off, universities have started using detection models to check whether students are using generative AI to turn in their homework or to write their papers. As a non-native speaker of English, even if you speak English with a high degree of proficiency, you are far more likely to be flagged for plagiarism, because these tools are developed for native speakers of a certain dialect of the language. That's one thing. The other is AI-driven translation software, which is increasingly used by the state as well, and which does not work well for low-resource languages, so that's another concern. If you do not speak the majority language or the dominant dialect, you are already not part of the majority on the Internet, and as the digital divide grows, language becomes a further barrier.
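To make the first example concrete: many AI-text detectors score a submission by its perplexity under a language model and flag writing that looks too statistically predictable. Below is a minimal sketch of that mechanism, not from the session; the model choice (gpt2) and the threshold value are illustrative assumptions.

```python
# Minimal sketch of a perplexity-based AI-text detector of the kind
# Nidhi describes. Model and threshold are illustrative, not any
# specific product's implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the average
        # next-token cross-entropy over the sequence.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return float(torch.exp(loss))

THRESHOLD = 40.0  # illustrative cutoff

def looks_ai_generated(text: str) -> bool:
    # Low perplexity = "too predictable" to the language model = flagged.
    return perplexity(text) < THRESHOLD
```

Non-native writers who learned a standardized register of English tend to produce lower-perplexity prose, which is one proposed explanation for the higher false-positive rates Nidhi describes.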
So, if generative AI also only generates text in the predominant dialect, there is a chance that in a few years the Internet will be filled with just this one dialect of a few resource-heavy languages, and all of the other languages will increasingly be pushed off the Internet. You can also see that more and more of the content now coming up on the Internet is generative AI content. So as that gets collected for further training, it will just be one language getting repeated, and your native dialect, the way you speak, and your cultural identity on the Internet will slowly be lost. So I think there are a lot of implications when generative AI models, especially ones focused on prompt-based answers, have such a big problem with the languages they have been trained on.
>> LOUIS DENART: Okay. Thank you very much for the answer. I would like to put the same question to Gustavo. Could you please share your view on that?
>> GUSTAVO FONSECA RIBEIRO: Louis, can you repeat the question, please? I was resetting my camera while you asked it.
>> LOUIS DENART: I was asking what are the impacts of underrepresentation of languages in AI from a human rights, and also socioeconomic perspective.
>> GUSTAVO FONSECA RIBEIRO: Of course. That is quite interesting. So, when you think of languages in artificial intelligence, the first thing that comes to mind in terms of human rights is cultural rights. If we look at the international covenants, particularly the International Covenant on Economic, Social and Cultural Rights, there are, broadly speaking, three cultural rights protected under international human rights law. The first is the right of access to culture. The second is the right of a society or a people to guide, to steer, its own scientific progress. And the third relates to intellectual property.
So, in terms of human rights, to understand the impact of the underrepresentation of languages, we have to understand that these technologies, artificial intelligence and the datasets fueling it, are, as you mentioned, primarily being developed in Western settings, let's say, for instance in the United States, or with European languages, with the exception, perhaps, of China in Asia.
So, when these tools are transplanted into other contexts, contexts that speak different languages, they are not going to perform as well. And this does affect those communities in the rights that I have just mentioned. It is going to affect how the scientific community explores this new technology, and it is also going to affect how everyday users of artificial intelligence relate culturally to the outputs of the technology.
And in terms of socioeconomic impacts, I would say we can think of this in two ways, through supply and demand. Demand for AI technology usually exists because it can bring a lot of productivity. But again, if a language is underrepresented, the people who speak that language are not going to reap the same benefits.
If you look at the major language models out there, such as ChatGPT, they perform very well in English. But they perform very poorly, for example, in African languages.
So, that is one socioeconomic benefit that is not going to be reaped.
On the supply side, though, we can think of opportunities, because if there is demand from local communities, there is also an opportunity for local companies to emerge, and we do have some examples of this in Africa. For example, in Ghana you have Ghana NLP; there are over 50 languages in Ghana alone. In South Africa there is Lelapa AI. And another example is the Masakhane Foundation, a pan-African organization also working to advance language inclusion.
So, I would say those are the main impacts of it. Thank you.
>> LOUIS DENART: Thank you, Gustavo. I want to turn now to another aspect of this. Nidhi, what is the role of legal or ethical frameworks in enhancing AI inclusivity? How can they further language-based inclusion, in your opinion?
(Muffled audio)
>> NIDHI SINGH: There are some legal instruments coming up. Even without the legal instruments, countries -- (muffled audio) there are broad-based frameworks within which you can have AI deployment. The UNESCO AI ethics recommendation, even the OECD principles, all of them have something on inclusivity.
Now, how that is to be implemented is actually the interesting question. Inclusivity alone just dictates that you should make the model available in all languages and train it on all languages. That by itself is not helpful, because you need high-quality datasets to train the models, and that requires a significant amount of time and investment.
So, just to give you an example: if you try to use ChatGPT in some of the Indian languages, maybe not the bigger ones but some of the smaller ones, like Assamese or Kannada, which we like to do for fun, it will start speaking in Bollywood dialogue. The easiest data they could find was Bollywood movies, so they trained the large language models on that. You can tick the box that says the model is inclusive, but that doesn't make any sense; the model doesn't actually work. So inclusivity principles make a good framework, but to implement that framework you need a lot more detail attached to it.
Another case study that I want to talk about, and this one has very real legal stakes, is that in the U.S., reportedly around four in ten Afghan asylum cases were rejected in 2023 because of the use of AI-driven translation. Not much effort had gone into how the translation algorithms handled Afghan languages into English, and because of that, asylum applications were being rejected.
And this is the legal consideration: if you are going to use generative AI in such specific but important and critical aspects of public welfare, if you are using it in healthcare, for asylum, for security, for any sort of public welfare benefit, ideally you want to make sure that it works really well across all languages. This is especially true in countries in the global majority, which generally have a large diversity of languages. Like Gustavo said, these models are typically made and trained in the Global North, where they don't have to deal with drastically different languages and so many dialects. What happens in these countries is that the people who know English, and the dominant dialect of English, will probably be able to access these services, and the people who don't will not. That further widens the digital divide.
Even in English-speaking countries, these systems often only recognize the dominant dialect. Dialects like AAVE, and the dialects used by immigrant communities in these countries, don't get recognized as well. Which means when you ask questions of ChatGPT, there is a whole (muffled audio) but if you phrase the question just right you will get a much better answer. That again depends on you knowing the dominant language and understanding how to phrase things very specifically.
So LLMs are largely designed to give you good answers only if you ask in the dominant language. That setup works against you, and changing it requires, at a very fundamental level, a lot of change, investment, and effort.
That might sound like a private concern, but it really is not, considering that generative AI is being used to check for cheating, and on that basis potentially ending somebody's education or career, or branding them as someone who plagiarized. These are uses that should be held to an actual framework with a much higher threshold. This isn't about using ChatGPT to look up jokes or check the weather; these are things with very real consequences. When you are deploying AI in these contexts, especially generative AI, something so dependent on language and culture, it needs additional safeguards in place.
As of right now, I think people are only looking at technical solutions to these problems. I don't think social and legal problems can really be solved by technical solutions alone. So there needs to be a more holistic approach, but some framework does need to be in place.
>> LOUIS DENART: If I can follow up on this. You said, of course, that legal and governance solutions should be at the forefront, but maybe also from a technical perspective: with many languages you also have the problem of data availability. Do you think technical solutions such as synthetic data generation could be a potential way to address this?
>> NIDHI SINGH: Synthetic data generation is something that has gotten a lot of traction over the last couple of years, and even one step further, where you have super-synthetic data, that is, data synthesized from synthetic data. I think those technical solutions work well within certain areas where they have been researched. But at the end of the day, if your base dataset isn't of high quality, if you don't have the base dataset built up and you try to generate synthetic data out of it, maybe the technology will catch up, but right now it exacerbates the problems in the base data.
Another problem is that languages typically don't translate directly, because the way you write a language and the way concepts are explained in Asian languages versus European languages does tend to differ. A lot of these LLMs have some basic idea of what a word means and will start to translate directly, and that doesn't really work.
And also, if you're using synthetic data and your base data wasn't very good quality, you end up with a lot of synthetic data that replicates the same problems, and that can cause further problems down the line.
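A minimal sketch of the feedback loop Nidhi describes, not from the session: a toy unigram "model" is repeatedly refit on text sampled from its own previous output. A word that by chance gets zero counts can never come back, so rare forms, the analogue of low-resource dialects, disappear first. The vocabulary size and Zipf frequencies are illustrative assumptions.

```python
# Toy simulation of training on synthetic data generation after
# generation. Each round we refit word frequencies on the previous
# round's samples and sample a new "corpus" from the fit; the
# distinct-word count can only shrink, and the rare tail goes first.
import numpy as np

rng = np.random.default_rng(42)
vocab = np.array([f"word_{i}" for i in range(500)])
# Zipf-like frequencies: a few dominant forms, a long tail of rare ones.
probs = 1.0 / np.arange(1, 501)
probs /= probs.sum()

corpus = rng.choice(vocab, size=2000, p=probs)  # "real" base corpus
for gen in range(8):
    print(f"generation {gen}: {len(set(corpus))} distinct words survive")
    # Refit an unsmoothed unigram model on the current corpus (a
    # stand-in for a model that never sees the missing forms)...
    words, counts = np.unique(corpus, return_counts=True)
    # ...and sample the next generation's corpus from it.
    corpus = rng.choice(words, size=2000, p=counts / counts.sum())
```

Each run prints a shrinking count of surviving words, a simple analogue of the "one dialect repeated" outcome described earlier.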
This is something you would need a lot of impact assessment for, and a lot of transparency in the model to figure out, which is, of course, something we are struggling with right now. Because the problem with an LLM is that you need a huge amount of data; if you sat down and tried to individually check every single piece of data, we would never be able to build one. These are things that need technical solutions, and synthetic data could be one of them, but right now there is no way to be sure it won't exacerbate the problem.
Unless we are sure of how this works, let's at least not use it in our justice systems or our welfare delivery systems. Unless you put that pause in there, I think the problem is going to keep worsening.
>> LOUIS DENART: Thank you for the thoughtful response.
Gustavo, I would like to come back to you. From a legal perspective, what do you think could be the levers, so to speak, to enable more diverse datasets in AI, maybe also thinking about copyright?
>> GUSTAVO FONSECA RIBEIRO: Yeah, thank you for the question. So, as I mentioned earlier, one of the three affected cultural rights under international human rights law is intellectual property, and I thought this was going to become a longer conversation, which is why I didn't say more about it in my first answer.
Yeah, I think one key area of law that calibrates access to data, not only language data but all types of data, is copyright law. And if we are going there, what is copyright, so that everybody is on the same page? Copyright is, basically, how we assign intellectual property rights to authors. And this applies to datasets as well: when you build a dataset, you can protect it with copyright, you have exclusive use of it, and only you can license it. The same applies to the source material that is used to create datasets, that is, creative works.
So, whenever we are using, for instance, newspapers or, like Nidhi said, Bollywood movies, those are protected by copyright. So, the way we work with the copyright governance of these two types of resources, right, the datasets and the raw materials for the datasets, is going to affect access to data.
But right now, although there is an opportunity for expansion and for innovation, our copyright laws are not necessarily yet adapted to it.
We have seen a handful of lawsuits in the United States, for example between OpenAI and The New York Times, because OpenAI used New York Times content without authorization. Sorry, there is some noise around me.
So this is the context, this is the problem. But there are solutions in terms of copyright. If you look at the European Union, they do have what they call the text and data mining exception. Originally it would not be permissible for an AI company to mine data from these raw materials, from source materials like Bollywood movies or from the Internet, without the authorization of the copyright owner. In the EU there is an exception if you are doing it for noncommercial purposes. That is one way to allow for the progress of science while still limiting how the benefits of these resources are shared.
Second, a development that we might see soon is in the U.S., where they have an exception to copyright called the fair use doctrine, which is not so binary; it is not so certain how it applies. It is applied on a case-by-case basis, based on a set of criteria, such as whether a certain copyrighted material is being used for educational purposes, noncommercial purposes, or research, for example. From some of the lawsuits that we have seen in the U.S., we might see something come out of that, but we have to wait and see.
Another area, and in my opinion this one is underexplored, I would love to see a larger discussion on it: there exists a concept under international intellectual property law of traditional knowledge. It protects traditional knowledge in the sense of knowledge that has been passed from generation to generation, for example in traditional and Indigenous communities. In the late '90s and early 2000s we saw a lot of debate on this when it came to healthcare and traditional medicines, because a lot of big companies were using these traditional medicines, which were invented by communities, without those communities benefiting from it.
And we have yet to see this debate in the context of artificial intelligence and data. Data has become, as they say, the new oil, but we haven't seen stakeholders talking very strongly about how this idea of traditional knowledge relates to this new, very valuable resource.
And just to conclude, another challenge that we have, which is not copyright law but is associated with it, is personality rights. A personality right is, for example, the right to your own likeness, your voice, your image, your face. Other people can only use it if you have given consent. And this is not an economic right; it is a moral right. It is attached to your personality. It belongs to you because you are human, not because you own any property, which is, generally speaking, what happens with copyright. What that means is that it cannot easily be given away or sold in a market, for example.
So, right now we have actually seen some court decisions. For example, in India, the High Court of Bombay has found that when an AI replicates the voice of someone who exists, that is a violation of personality rights. But what we don't know is whether taking someone's voice and using it to train an AI is in itself a violation. What if an AI is trained with it but doesn't necessarily replicate it, right? That is an area we don't know. And in deciding that, the law would also require some adaptation, to either allow for the expansion of data access or to narrow it. That will conclude my remarks. Yeah. Thank you.
>> LOUIS DENART: Very insightful. Thank you, thank you very much.
I would like to pose one last question to both of you. In my opinion, we can observe that this problem of limited language inclusion in AI leads to efforts at the national and regional level, where countries and regions build their own local datasets and models.
Thinking more from the international perspective, what steps can we take together, within international organizations or beyond, to build more diverse AI? Nidhi, maybe you want to start.
>> NIDHI SINGH: Thank you. And you are right, actually. I think a lot of these efforts happen domestically, partly because that is maybe a better place to start, but also because states and governments have a far better incentive.
You don't make AI accessible because it is financially lucrative; it may not always be, depending on the community. And if it is not financially lucrative, then why would something like OpenAI do it? It is usually up to the states.
To give an example, the large language models that have been built in regional languages in India, like (?), are run in partnership with the state. I think in this case it is important to stress that this should not be treated like a favor being done. Inclusivity is essential; access to the Internet is considered an international human right. You are required to make this accessible to everybody.
And like Gustavo said, if you are using public data, there is an implied expectation that you will use it for the public good. It is not just that you may, optionally, do public good with it: you are training on everybody's data, so you are required to do good with it.
I think it would be good for international organizations to bring this under the heading of accessibility and inclusivity, and in that sense apply the same protections that we apply when we are trying to spread the Internet to everybody, generally trying to bring everybody online.
And, yeah, I think to some extent even for private players, depending on how an initiative is structured, there might need to be requirements to include at least some percentage of languages within their training datasets. But all of this will really only be possible once you have transparency and accountability mechanisms in place, because we don't actually know what data is being collected or exactly how they are training on it. So unless you have very solid transparency and accountability mechanisms to see what they are using to train all of these models, I think it will be hard to really push anything through. It is a very interconnected problem: to have inclusivity, you need accountability and transparency. Once you get into implementing it, I imagine a lot of things will get sorted out.
>> LOUIS DENART: Gustavo, would you like to add to that?
>> GUSTAVO FONSECA RIBEIRO: Yeah, of course. What can we do at the international level, and in international organizations in particular? Three things come to mind. The first is the sharing of knowledge and strategies to enhance language inclusion: countries can learn from one another.
For example, in India there is this great initiative called Karya, a nonprofit with a platform that lets data workers and annotators provide information and curate datasets. And once a dataset is sold by Karya, the proceeds are distributed to the workers. So first, we get increased inclusion of Indian languages, and second, the profits go to a socially beneficial purpose: helping data workers, who often don't have the best working conditions in the artificial intelligence supply chain.
This platform model, for example, is open source, so international organizations have the ability to bring knowledge existing in India to other countries in similar situations, for example in Africa, or for Indigenous languages in Latin America.
The second is capacity building. Whatever we do with artificial intelligence, we need data and technology in models, and we often need governance as well, laws that enable its development. But we always need human talent.
Even in the example that I gave with Karya, a platform doesn't exist by itself, right? It needs trained people around it to run it. So I think international organizations can work on capacity building. It has actually been one of the priorities outlined in the Global Digital Compact when it comes to artificial intelligence.
And I would say the third is bringing a better balance of power to conversations on international policy. When we speak of digital divides in particular, there is a clear imbalance in the ability to steer this conversation between the Global North and the global majority. Wealthier countries have more resources to participate in this conversation than low-income countries, for example.
So international organizations also have the ability to provide fora in which these different actors can talk face to face.
Yeah. So I would say these are three potential roads. Thank you.
>> LOUIS DENART: Thank you. With that, I would like to open up this session to questions either from the in-person audience or from online. I think we have an online moderator. So please let us know if there are any questions online. And, yeah, please also state to whom you want to direct your question. We have a mic in person right here. So, if you have a question, please step forward. And, yes, you are free to ask now.
Any questions from the online audience?
>> KATHLEEN SCROGGIN: We don't have any as of now. But if there aren't any, I have one to throw out to the group. There's been lots of talk about the way that we use platforms to gather this data. Either from a technical sense or from, you know, more of just a user interface sense, but are there things that you all think would be beneficial to creating a universal platform for data collection? There have been lots of different ones. How would you see that going? Thanks.
>> NIDHI SINGH: Should I start? A universal platform for data collection is a very interesting idea. I have also heard a lot of conversations about AI commons and data commons, and the principle behind it is quite sound, because basically what you are saying is that you put all the data in one place so that everybody can benefit from it. And I do think that in principle that sounds good. However, and this might be a bit pessimistic of me, realistically data is quite like the new oil. It is very unlikely that anybody who has a lot of data would want to share it in a way where other people can profit from it as well. Meanwhile, labeled data, good, clean data that you can use for training these systems, is currently one of the biggest resources we have, and there are economies built around this kind of data and around data brokers. So I think actually putting that into place would be difficult.
It is also very interesting, and this is something I thought of while Gustavo was speaking: we have talked a lot about copyright, but I think it is quite impressive how much acceptance things like large language models have in our society right now.
When something trawls the Internet to collect information to train a large language model, there is a very good chance that you will end up catching a lot of personal data as well. And personal data protections typically used to be very stringent about what you can train models on. Now we are seeing an increasing trend where, because of the lucrative promise of LLMs, countries are saying that if you posted it on the Internet, using it in a large language model is fine, as long as the output isn't directly harming your privacy, like the Bombay High Court case where someone's voice was directly being used. A lot of these protections are diluted when it comes to training the LLM itself.
In that climate of people prioritizing the economic incentives that you can potentially get from generative AI, I think it is very unlikely that people would agree to pool all of the data on a uniform platform. I hope that will happen, but I think it is unlikely that people would agree to it.
>> GUSTAVO FONSECA RIBEIRO: Yeah, Louis, if I may jump in on the question as well.
>> LOUIS DENART: Yes, Gustavo, please.
>> GUSTAVO FONSECA RIBEIRO: Thanks. I find it a very interesting proposal to think of a data commons, right? I will first speak of a challenge, and then of someone who is trying to do this. The biggest challenge I see with bringing data into the open-source world, with opening data up, is that the technology market is so highly competitive, and data creates comparative advantages, market advantages, right?
It is true that openness generally leads to more access to data, and you often see this advocacy directed at companies; you also see it a lot in the development context: open the data so everybody can have access to it.
But openness can also come with a trade-off in financial sustainability. If you open the data without any restrictions whatsoever, it is not that easy to profit from it, to have revenue from it. And in developing contexts, the socioeconomic security of companies and of the people working in them is a very big priority.
But more importantly, if you want to get people on board with this idea, you have to get either everyone or no one to create a data commons. Because say you get a certain number of companies into the commons, sharing their data for free. The companies that have stayed private, that have not opened their data and are already big, like a Big Tech company, are going to have a financial advantage over whoever opened their data.
And because their model will be more profitable, their model is also going to grow more, which could actually crowd out the commons, the common data pool. So that is the challenge I see. It works a lot like a negative externality: you either get everyone on board, or it is hard to implement.
But the second point is, there is an attempt to implement that kind of thinking in the European Union, through the Data Governance Act. I wish I were the type of lawyer who is an expert in that; I am not. But there is this regulation at the European level, the Data Governance Act, alongside the Data Act, in which they try to create pan-European pools of datasets in certain fields that are valuable, like agriculture, healthcare, and mobility data. And the way they are trying to build that is by creating a compulsory arrangement, so the public sector can buy datasets that are of public value from private entities, and the private entities are mandated to sell them for a reasonable price. That is what the law says, a reasonable price.
So it is somewhat of an attempt to do that while still incorporating the cost structure of developing datasets.
But whether it works or not, well, we will see. Yeah. Thank you.
>> LOUIS DENART: Thank you. So, I think we have two questions from the audience. Maybe we start with the madam in the front. Would you come to the microphone, please, and let me check quickly whether it is on.
>> PARTICIPANT: Can you hear me? Okay. Thank you so much for the great session and presentations.
I guess my question goes to both of you. Nidhi, you spoke, when discussing your recommendations, about the incentive coming from government domestically. And Gustavo, online, also mentioned some of the language-model initiatives in Africa; I think you mentioned Lelapa AI and Masakhane. So my question is: do you see a lot of appetite, especially from governments, to actually support these initiatives to develop our own local languages and have them included in these models? That's the first one.
And the second one is just something I have noticed from a very practical perspective. I am originally from Zimbabwe, and for a learner to graduate and go to university, you must have passed mathematics and English. So not even our local languages are included.
I'm thinking from an incentives perspective: why would I, if I'm a software developer, invest in local languages when they are not useful to the students and learners who are supposed to be using these technologies? Because you know you need English to go to university, all your products should be in English, and you are making use of AI products in English.
So, I don't know how you see it. Maybe you have practical examples from your regions where there is a direct financial incentive from government to support these initiatives. And as well, with the school system, are there going to be any changes where we see more and more local languages actually being integrated, so that to move to the next level you must have passed at least one local language, not just English?
So, those are my two questions. Thank you.
>> NIDHI SINGH: Thank you so much for the question. Those are actually really good questions. For the first one, I think we have seen a lot of appetite for translation into local languages, very specifically from the government, because of the use of translation software in the judiciary. There are maybe 200 languages spoken in India, with about 30 of them officially recognized.
The courts actually work in all of these languages; the government works in a lot of these languages simultaneously, and that means the courts have to as well. At least in the higher judiciary, a lot of the judges work in English, so you need translators and translation, and they are trying to bring in translation software so that, you know, defendants and the accused can understand what is happening in court.
Because of that, there is a lot of appetite for these things to work. The government essentially needs really good, high-quality translation software, which is a large part of the reason why you have so much generative AI work happening in local languages.
Also, in countries with many languages, if the government wants to roll something out, like a chatbot for farmers, which is another thing that is happening, you need to make sure that it is available in the larger languages. If it is not, then you could theoretically go to court and argue a violation of your right to equality, because it is a recognized state language and you are not offering the service in that recognized state language.
Because of this, I think we are seeing a lot more appetite, specifically from the government, and not from a financial incentive point of view, to actually build datasets in these languages, build generative AI, and eventually get to building good machine translation software.
As for the second question, that is a complicated question which doesn't have a legal answer; it is more of a sociological one. I know some states in my country have a three-language formula, where you must study three languages in school, just because of how things work.
I think this is just a general problem for many countries with multiple languages: the Internet, the Internet infrastructure, and now increasingly AI are all really built on one common medium, which happens to be a specific dialect of English. Until you have a lot more inclusivity in the rooms where these decisions are being made, that is unlikely to change.
But I think that is something we are trying to do through these conversations: to say that this actually doesn't make any sense. If you have a chatbot that really only recognizes English and one specific way of accessing the service, then that is not an accessible service. You need more human intervention in all of these processes.
Yeah, I think it is more of a sociological problem: the world generally takes the majority perspective and does not look at minority perspectives. That will really only get fixed when you have more voices in the room talking about their experiences.
>> LOUIS DENART: Okay. With the time in mind, we have roughly one minute left. Gustavo, do you want to add to that quickly?
>> GUSTAVO FONSECA RIBEIRO: I think Nidhi's answer was very good. So, very quickly, on government support: yes, there are examples. In Rwanda, the government has been quite supportive of developing datasets in Kinyarwanda, in partnership with academia and startups. And in Nigeria, you can also see the government supporting the development of a large language model in Nigerian languages.
As for the second question, on education, I would say Nidhi's answer was spot on. I would add that the market is a powerful tool to drive the development of AI, and in many countries that speak many, many languages, this market operates in English. It is true. For example, in Kenya that is the case, and in Uganda as well.
But you also have other purposes, right? Public services, for example, citizens' access to welfare. And research, which doesn't necessarily have to have a commercial purpose. So in those contexts, I would say it is quite relevant. I hope I am touching on the question.
And there was a question in the chat about the opportunities and risks of localization, which goes beyond language; it means contextualizing the model to the local culture as well.
Very quickly, on the opportunities: first, usefulness. People have a demand for solutions in their local language, and those tools work better for them.
And second, cultural preservation; I think there is an opportunity there. As for risks, I would point to the risks associated with AI at large: even if you are doing it in a local language with the local culture embedded, you still have privacy risks, you still have bias risks. Yeah. Thank you. I will give back the floor.
>> LOUIS DENART: So, I saw that there was at least one in-person question left. Since Nidhi will still be here, maybe you can ask it after the session.
I would like to thank our speakers for the very interesting insights, and the audience for the good questions. I wish you a few more good sessions today. Enjoy IGF. Thank you.