Kamusi: Universal Communication among Human Languages
Laboratory: LSIR | Semester / Master | Proposal
We start with a simple premise: it is possible to develop a universal dictionary that can document every word in every language. The data will be comprehensive, easily usable by end users, and will lie at the heart of advanced downstream language technology applications such as speech recognition and automatic translation.
Recent student projects have focused on mobile apps and Facebook games for data collection, the design of a new graph database structure, a system to disambiguate the meanings of terms (including multiword expressions) in source documents for greatly improved translation accuracy, and an app to translate restaurant menus accurately across many languages. Upcoming projects may evolve from previous work, or move into new areas, such as vocabulary tools for paramedics and disaster response, language learning, and integration of our lexical data to improve existing machine translation technology.
We also welcome your creative ideas. Here is a multilingual emoji dictionary / data-crowdsourcing bot for 70+ languages that was developed in just a few weeks in April and May 2016: https://telegram.me/emojiworldbot (requires a free account at http://telegram.org)
Here is a recent article about the project at EPFL: http://actu.epfl.ch/news/a-multilingual-dictionary-accessible-to-all/
And here is one about our inclusion in the White House Big Data Initiative: https://techpresident.com/news/24524/white-house-highlights-big-data-partnerships
Although the premise is simple, there are good reasons that a universal dictionary has never been seriously attempted:
1) Modeling a single human language is devilishly complex, and a universal dictionary multiplies that complexity across roughly 7,000 languages
2) Most linguistic data is stored in human brains, not print dictionaries or electronic databases, and is therefore impossible to access in a systematic way
3) Any system that tried to address the first two challenges would be far too expensive to implement in the real world
We are in the process of overcoming all of these obstacles. LSIR is providing technical leadership in an active collaboration with partners worldwide to develop a system for harvesting, processing, and sharing the entire range of human linguistic knowledge. We have a number of activities that can be tailored for Masters or Semester projects, depending on your particular goals, interests, and skill set.
Project components can be broadly divided into several categories suitable for action:
1) Crowdsourcing – building a scalable, self-regulating system for gathering the things people know about their own languages. This will include significant elements of gamification.
2) Data merging – we have access to data sets with billions of points of linguistic information for as many as 1900 languages, but those sources are each internally complex and do not play nicely as a group.
3) Data feature enhancement – in building a data structure to address the initial complexities of a massively multilingual dictionary, we have exposed new elements of human language documentation that can be woven into the larger information system, such as geo-tagging linguistic forms and chronicling language evolution over time.
4) User interface design – the project is already an active website with users worldwide, with numerous opportunities for design fixes and enhancements to improve the user experience, including the tools for people to make editorial contributions. Knowledge of Drupal is essential for this component.
5) Apps – all of the data generated by the project is now available on the web, but should be available on phones, on tablets, through an offline reader, integrated with computer-assisted translation software, on OLPC machines for students, and more. These apps need to echo the key functionalities of the website, including editing features and data synchronization.
6) Security – we face many security challenges in the management of a website that invites contributions from the public for millions of pages in potentially thousands of languages. How do we prevent spam registrations? How do we identify malicious users? Once we identify a malicious user, how do we remove their work without affecting subsequent contributions to a page from legitimate users? How can we enable spam-proof discussion on every page, when comments can be posted in numerous languages that we cannot read, on numerous pages we do not have the time or personnel to monitor? Effective solutions could be developed as Drupal modules with widespread applications across the web.
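As an illustration of the data-merging component (item 2), the following is a minimal Python sketch of normalizing records from two incompatible source formats into a common lexical-entry schema and deduplicating the result. The field names (`word`, `headword`, `iso`, etc.) and the `Entry` schema are hypothetical, chosen only for illustration; they are not the project's actual data model.

```python
from dataclasses import dataclass

# Hypothetical common schema for a merged lexical entry.
# frozen=True makes instances hashable, so identical entries
# from different sources collapse automatically in a set.
@dataclass(frozen=True)
class Entry:
    lemma: str   # canonical form of the word
    lang: str    # ISO 639-3 language code
    gloss: str   # English gloss, usable as a pivot for alignment

# Two imaginary source formats with incompatible field names.
def from_source_a(rec: dict) -> Entry:
    return Entry(lemma=rec["word"], lang=rec["language_code"],
                 gloss=rec["meaning"])

def from_source_b(rec: dict) -> Entry:
    return Entry(lemma=rec["headword"], lang=rec["iso"],
                 gloss=rec["definition"])

def merge(a_records: list, b_records: list) -> set:
    """Normalize both sources into the common schema and deduplicate."""
    entries = {from_source_a(r) for r in a_records}
    entries |= {from_source_b(r) for r in b_records}
    return entries
```

In practice the hard part is exactly what this sketch glosses over: real sources disagree on segmentation, sense granularity, and language coding, so a merge pipeline needs per-source normalization rules rather than simple field renaming.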
For more information, please refer to the Kamusi project website.
If you have any questions, just drop us an email, or come to our office:
- Martin Benjamin (BC114): firstname.lastname@example.org, email@example.com