Distributed Information Systems Laboratory LSIR

Kamusi: Universal Communication among Human Languages

Project Details

Laboratory : LSIR Semester / Master Proposal


We start with a simple premise: it is possible to develop a universal dictionary that can document every word in every language. The data will be comprehensive, easily usable by end users, and will lie at the heart of advanced downstream language technology applications such as speech recognition and automatic translation.

Recent student projects have focused on mobile apps and Facebook games for data collection, the design of a new graph database structure, a system to disambiguate the meanings of terms (including multiword expressions) in source documents for greatly improved translation accuracy, an app to translate restaurant menus accurately across many languages, and a unique visual dictionary. Upcoming projects may evolve from previous work, or move into new areas, such as vocabulary tools for paramedics and disaster response, language learning, and integration of our lexical data to improve existing machine translation technology.

Our mobile app, developed as a semester project, has a long list of features to add regarding crowdsourcing, UX, geo-location, photography, sound recording, and optimization for use off-network. You can install the latest version for your device at:
iPhone: https://is.gd/PDXJ15
Android: https://is.gd/IyODZl

Another recent lab project is our Facebook bot, which you can try out by sending a message at facebook.com/kamusiproject.

We also welcome your creative ideas. For example, here is a multilingual emoji dictionary and data crowdsourcing bot for 70+ languages that was developed in just a few weeks in April and May 2016: https://telegram.me/emojiworldbot (requires a free account at http://telegram.org)

Here is a recent article about the project at EPFL: http://actu.epfl.ch/news/a-multilingual-dictionary-accessible-to-all/.

And here is one about our inclusion in the White House Big Data Initiative: https://techpresident.com/news/24524/white-house-highlights-big-data-partnerships.

Although the premise is simple, there are good reasons that a universal dictionary has never been seriously attempted:

1) Modeling a single human language is devilishly complex, while a universal dictionary multiplies the modeling complexity by 7,000 unique languages.

2) Most linguistic data is stored in human brains, not print dictionaries or electronic databases, and is therefore impossible to access in a systematic way.

3) Any system that tried to address the first two challenges would be far too expensive to implement in the real world.

We are in the process of overcoming all of these obstacles. LSIR is providing technical leadership in an active collaboration with partners worldwide to develop a system for harvesting, processing, and sharing the entire range of human linguistic knowledge. We have a number of activities that can be tailored for Master's or semester projects, depending on your particular goals, interests, and skill set.

Project components can be broadly divided into several categories suitable for action:

1) Crowdsourcing – building a scalable, self-regulating system for gathering the things people know about their own languages. This will include significant elements of gamification.

2) Data merging – we have access to data sets with millions of points of linguistic information for as many as 10,000 language varieties, but those sources are each internally complex and do not play nicely as a group.

3) Data feature enhancement – in building a data structure to address the initial complexities of a massively multilingual dictionary, we have exposed new elements of human language documentation that can be woven into the larger information system, such as geo-tagging linguistic forms and chronicling language evolution over time.

4) User interface design – the project is already an active website with users worldwide, with numerous opportunities for design fixes and enhancements to improve the user experience, including the tools for people to make editorial contributions. The ability to work with node.js is essential for this component.

5) Apps – we have numerous features to build into our mobile apps, including editing features and data synchronization. We also need to design a method for users of different languages to select the particular portion of the data that they need in order to use our tools offline, for example in rural Africa where network access is difficult and expensive.

6) Security – we face many security challenges in the management of a website that invites contributions from the public for millions of pages in potentially thousands of languages. How do we prevent spam registrations? How do we identify malicious users? Once we identify a malicious user, how do we remove their work without affecting subsequent contributions to a page from legitimate users? How can we enable spam-proof discussion on every page, when comments can be posted in numerous languages that we cannot read, on numerous pages we do not have the time or personnel to monitor?
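As one way to start thinking about the third security question above (removing a malicious user's work without losing later legitimate contributions), here is a minimal sketch in Python. It is not how the Kamusi site currently works. It assumes, purely for illustration, that every page keeps an ordered history of field-level edits; the names Revision, rebuild_entry, and banned_users, and the sample users and values, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """One field-level edit to a dictionary entry (hypothetical model)."""
    user: str
    field: str
    value: str


def rebuild_entry(history, banned_users):
    """Replay the edit history in order, skipping edits by banned users.

    For each field, the last surviving edit wins, so contributions made
    after a malicious edit are kept, and a field last touched by a banned
    user falls back to its most recent legitimate value (or disappears
    if it never had one).
    """
    entry = {}
    for rev in history:
        if rev.user in banned_users:
            continue
        entry[rev.field] = rev.value
    return entry


# Invented sample history for a single entry.
history = [
    Revision("amina", "definition", "a domesticated feline"),
    Revision("spammer", "definition", "BUY CHEAP WATCHES"),
    Revision("spammer", "example", "visit my site"),
    Revision("juma", "example", "The cat sat on the mat."),
    Revision("juma", "part_of_speech", "noun"),
]

clean = rebuild_entry(history, banned_users={"spammer"})
```

In this sketch the definition reverts to the first contributor's text, while the later example and part of speech are untouched. A real project would still need to handle edits that build on malicious text rather than replace it, which this simple replay does not detect.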

For more information, please refer to the Kamusi project website and the project whiteboard.

If you have any questions, just send us an email:

Martin Benjamin: martin@kamusi.org

Contact: Martin Benjamin