Distributed Information Systems Laboratory LSIR

Kamusi: Universal Communication among Human Languages

Project Details

Laboratory: LSIR | Semester / Master | Completed


Update, 18 December 2015

One new project is available: building a system that students can use for language learning, coordinated with the data in the multilingual dictionary. The aim is a functional prototype that can be used online by the end of the spring semester.

We start with a simple premise: it is possible to develop a universal dictionary that can document every word in every language. The data will be comprehensive, easily usable by end users, and will lie at the heart of advanced downstream language technology applications such as speech recognition and automatic translation.

Here is a recent article about the project at EPFL: http://actu.epfl.ch/news/a-multilingual-dictionary-accessible-to-all/

And here is one about our inclusion in the White House Big Data Initiative: https://techpresident.com/news/24524/white-house-highlights-big-data-partnerships

Although the premise is simple, there are good reasons that a universal dictionary has never been seriously attempted:

1) Modeling a single human language is devilishly complex, and a universal dictionary multiplies that complexity across 7000 unique languages

2) Most linguistic data is stored in human brains, not print dictionaries or electronic databases, and is therefore impossible to access in a systematic way

3) Any system that tried to address the first two challenges would be far too expensive to implement in the real world

We are in the process of overcoming all of these obstacles. LSIR is providing technical leadership in an active collaboration with partners worldwide to develop a system for harvesting, processing, and sharing the entire range of human linguistic knowledge. We have a number of activities that can be tailored for Master's or semester projects, depending on your particular goals, interests, and skill set.

Project components can be broadly divided into several categories suitable for action:

1) Crowdsourcing – building a scalable, self-regulating system for gathering the things people know about their own languages. This will include significant elements of gamification.

2) Data merging – we have access to data sets with billions of points of linguistic information for as many as 1900 languages, but those sources are each internally complex and do not play nicely as a group.

3) Data feature enhancement – in building a data structure to address the initial complexities of a massively multilingual dictionary, we have exposed new elements of human language documentation that can be woven into the larger information system, such as geo-tagging linguistic forms and chronicling language evolution over time.

4) User interface design – the project is already an active website with users worldwide, with numerous opportunities for design fixes and enhancements to improve the user experience, including the tools for people to make editorial contributions. Knowledge of Drupal is essential for this component.

5) Apps – all of the data generated by the project is now available on the web, but it should also be available on phones, on tablets, through an offline reader, integrated with computer-assisted translation software, on OLPC machines for students, and more. These apps need to echo the key functionalities of the website, including editing features and data synchronization.

6) Security – we face many security challenges in the management of a website that invites contributions from the public for millions of pages in potentially thousands of languages. How do we prevent spam registrations? How do we identify malicious users? Once we identify a malicious user, how do we remove their work without affecting subsequent contributions to a page from legitimate users? How can we enable spam-proof discussion on every page, when comments can be posted in numerous languages that we cannot read, on numerous pages we do not have the time or personnel to monitor? Effective solutions could be developed as Drupal modules with widespread applications across the web.
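To make one of the security questions above concrete, here is a minimal Python sketch of removing a malicious user's work while keeping later legitimate contributions. It is only an illustration under stated assumptions, not how the Kamusi site or Drupal stores revisions: the `Revision` record, the `rebuild_entry` function, the field names, and the user names are all hypothetical. The assumed model is that each revision records only the fields it changed, so an entry can be rebuilt by replaying its history and skipping the banned authors.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    """One edit to a dictionary entry (hypothetical record layout)."""
    author: str
    changes: dict  # field name -> new value set by this revision


def rebuild_entry(history, banned):
    """Replay revisions in order, skipping those written by banned authors."""
    entry = {}
    for rev in history:
        if rev.author in banned:
            continue  # drop the malicious edit, keep everything around it
        entry.update(rev.changes)
    return entry


# Hypothetical history: a legitimate edit, a spam edit, a later legitimate edit.
history = [
    Revision("amina", {"definition": "a domesticated feline", "pos": "noun"}),
    Revision("spammer42", {"definition": "BUY CHEAP WATCHES"}),
    Revision("juma", {"example": "Paka amelala."}),
]

clean = rebuild_entry(history, banned={"spammer42"})
print(clean)
```

In this sketch the spam definition disappears, the original definition is restored, and the example added afterwards survives. It does not handle the harder case where a legitimate user later edits a field on top of the malicious value; that case would need conflict detection or human review.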

If you have any questions, just drop us an email, or come to our office:

  • Martin Benjamin (BC114): martin.benjamin@epfl.ch, martin@kamusi.org

Possible starting date: as soon as possible.

Contact: Martin Benjamin