Given these clues, can you write 3, 5, 7, and 10 in the
Waorani language? This language, indigenous to Ecuador, is spoken by approximately 1,650 people.
(A) mẽña mẽña mẽña mẽña + mẽña go mẽña = ãẽmãẽmpoke go aroke x 2
(B) aroke2 + mẽña2 = ãẽmãẽmpoke
(C) ãẽmãẽmpoke go aroke2 = mẽña go mẽña x ãẽmãẽmpoke mẽña go mẽña
(D) mẽña x ãẽmãẽmpoke = tipãẽmpoke
Scroll to end for answer
This problem is typical of those given during the
North American Computational Linguistics Olympiad (NACLO), an annual competition to identify high school and middle school students with linguistic talent while acquainting them with the field of computational linguistics. NACLO problems are drawn both from traditional linguistics—covering syntax, semantics, morphology, and phonology—and from computational linguistics, where applications include machine translation, speech recognition, information retrieval, document summarization, and dialog systems.
Unlike the case in most math, chemistry, and physics competitions, NACLO problems are completely self-contained and solvable using logical and algorithmic thinking alone. No prior linguistic training is needed, just an ability to logically think through an interesting and challenging problem. The point is not to test concepts or principles previously learned—high schools rarely offer linguistics—but to present linguistic and computational concepts as logical puzzles that can be solved using the same analytic reasoning and problem-solving skills employed by linguists studying languages and computer scientists structuring problems for computers.
“Most physics and math competitions I’ve entered are very much regurgitating standard forms of problems,” says
Maxim Sigalov (CC’2017), a five-time NACLO competitor. “There’s actually not much creativity, or deep thinking. Whereas in NACLO, every problem has something novel. You’re not performing standard procedures. You have to think and consider the specifics of the problem each time. Which is fun.”
The connection between the problems and computational linguistics may not be immediately apparent and it need not be. Sigalov admits he didn’t immediately draw the connection between NACLO and a particular area of study, nor did he realize that people could make a career of studying and making use of linguistics. “Honestly, it took me a few years into it to understand that this was a field that people were actually studying. I didn’t get that.” He competed because it was fun.
Language is everywhere
One problem stands out for Sigalov. In
“Sk8 Parsr” (Littell, 2009), students are presented with a skateboard videogame in which button presses cause an avatar to perform specific skateboarding moves and tricks. The problem solver is asked to “parse” the incoming sequence of moves into symbols, some of which represent individual moves while others represent combo moves made up of one or more individual moves. The trick is to correctly understand when a move is best represented by the symbol for that move or when a move actually is an element of a combo move and should be collapsed along with other moves into the symbol representing the whole sequence. Moves can be fairly complex, making it a challenge.
On the surface, it doesn’t seem a videogame would fit under the category of linguistics or languages. But it’s just one example of how language is everywhere even in unexpected places. It was only years later, when taking a college course in natural language processing (NLP) at Columbia did Sigalov understand that “Sk8 Parsr” is a simple
shift reduce parser where “words” are buttons, and the “sentences” are combos. In solving “Sk8 Parsr,” NACLO participants are performing parsing, a fundamental first step in linguistics to identify the role played by each word in a sentence.
NACLO is filled with puzzles and concepts seemingly unrelated to language or computational linguistics. Catalan numbers are there in
“One, Two, Tree” (Smith, Gimbel, and Eisner, 2012), which demonstrates the ambiguity potential of compound nouns such as “science fiction writer,” and students in effect use binary bracketing to resolve the ambiguity (the number of parses in which a compound can be grouped is a Catalan number). Many other puzzles are solvable using techniques from combinatorics, though students may not yet have heard of this branch of mathematics.
Levenshtein distance, long used in information theory and computer science to measure distances between sequences, may not be known to high schools students, who nonetheless employ the general concept in
“Nok-nok!” (Fink, 2009) to describe a hypothetical typing tutor for very bad spellers. (All
previous NACLO puzzles are online on the NACLO website.)
The whole point is to expose the fun, challenging aspects of language to students before they graduate high school and encourage them to pursue computational linguistics at the university level. And in Sigalov’s case, NACLO succeeded. A rising junior, he is now majoring in linguistics along with math and computer science—NACLO participants often have multiple interests and skillsets—and will do NLP research this summer at the University of Edinburgh under
Mirella Lapata.
Even for students already interested in linguistics and NLP, NACLO can confirm and refine their choice of study, guiding them toward a university with strong departments in those areas. It’s the reason that Alex Liu, also a NACLO participant, chose Columbia. Liu didn’t know about NACLO until he was a senior in high school and was already interested in NLP to the point he sought out professors willing to allow someone still in high school onto a research project (eventually working with
Kathy McKeown to detect sentiment and label argument structure). In this way, he was exposed early to NLP’s use of machine learning techniques applied to voluminous speech data, which makes it possible to solve real-world speech problems even without having to explicitly apply linguistic principles. NACLO was his introduction to pure linguistics, something he might not have gotten until much later.
A linguistic Olympiad for the computer age
NACLO was co-founded in 2006 by
Dragomir R. Radev, who as a high school student in Bulgaria had competed in linguistics olympiads held over the past 50 years in Russia and Eastern Europe and credited with inspiring hundreds of talented young scholars to choose linguistics as an academic field. While doing graduate study on natural language processing at Columbia, Radev noticed the US lacked a comparable competition. After he became a professor at the University of Michigan, he along with four others—
Lori Levin (Carnegie-Mellon University),
Thomas E. Payne (University of Oregon),
James Pustejovsky (Brandeis University), and Tanya Korelsky (National Science Foundation)— launched a new Olympiad in the field of linguistics. A 2006 planning workshop at Carnegie Mellon University drew two dozen people from within the linguistics and computer science communities, many from Eastern Europe and Russia. Using the model of the
International Linguistic Olympiad (ILO), in existence since 2003 and itself modeled after the earlier linguistic Olympiads, the workshop participants began laying the groundwork for NACLO.
One big change they made, and what differentiates the North American version from the International version, is the inclusion of computational problems. As the amount of speech and text data grows, there is increasing demand for automated tools to make use of this data both to improve speech technologies such as voice-recognition systems and machine translation but also to analyze and mine text from vast amounts of documents. To prepare students for the computational aspects of linguistics today, NACLO organizers designed puzzles to challenge students to structure problems for computers and in the process gain an understanding of pattern recognition, abstraction, generalization, and the need to prune the search space.
Translating computational or linguistics principles into NACLO puzzles is a challenge. It’s not easy and restrictions abound. Algorithmic problems have to be posed in a way that does not give an advantage to those who already know how to program. Because high school students can’t be expected to know much about linguistics, authors of the puzzles have to remove themselves from their knowledge and experience and to forget technical definitions of “phrase” or “noun” or “string” or “function,” and other facts that are second nature to those in the field. It takes a full year for a forty-member “problem committee” made up of linguists, computer scientists, and hobbyists from around the world to dream up and submit close to 40 problems, each of which is reviewed and discussed before being pre-tested by students for difficulty, solvability, and appropriateness. Of the yearly 40 submissions, 16 make it into the competition.
Organizing the competitions also requires time and logistical effort to line up sites, prepare materials (and students), and then finally grade and evaluate student answers. Almost all work is done by volunteers, many of whom are previous NACLO competitors.
Who competes?
The first requirement for students to compete in NACLO is to be aware of its existence. NACLO is still very much a grass-roots effort and has only a limited budget for publicizing the competition. Teachers are often the link between NACLO and students, getting students interested and prepared, and arranging and hosting the competitions at their schools. About half the students, however, find NACLO themselves. Even if their high schools don’t participate, students can attend NACLO at nearby colleges and universities that, like Columbia, host the event for area students.
Positive word of mouth among teachers and students has helped NACLO expand steadily every year, growing from just under 200 students at the 2007 competition to 1706 this year. Unusual for a science Olympiad, about half the participating students are girls, who make up at least 40%, often close to 50%, of participating students.
The NACLO competition follows a two-round format. The first in late January is open to all US and Canadian students, with the top 10% selected from each country to compete in a second, more difficult round held in March. The top eight US winners and the top four Canadian winners then go on to compete at ILO.
The level of competition is high, and the contest questions are designed to be difficult. In the second round, students are given four hours to do eight problems, 30 minutes per problem, so time management counts also. Says Radev, “Every year I think no one will be able to answer most of the questions, yet every year a small number score over 90%. It is incredible how they can solve so many hard problems in such a short amount of time, especially considering that they don’t have any prior schooling in linguistics or computational linguistics.”
This year 1706 students competed in the first round, which was held January 29 at 200 sites and featured syntactic problems from Danish, Greek, Aymara, Old Norse, Japanese, Shan, and Lao as well as problems in inversion transduction grammars, phonotactic constraints, inference of semantic intensity, and other formal and computational concepts. The invitational round on March 12 contained more difficult problems that were taken from Old English, Georgian, Navajo, Malagasy, German, Maxakalí, and Hmong, and included problems dealing with key concepts in computational linguistics such as local ambiguity, finite state transducers, and minimum spelling trees.
After three weeks needed for grading, the winners were named in mid-April. The top eight US students will now form two four-member teams and go on to the ILO this July in Bulgaria. US winners are James Wedgwood of Washington, Conor Stuart-Roe of North Carolina, James Bloxham of Massachusetts, and Kevin Yang of Washington on the first team (Wedgwood and Yang are from the same high school, Lakeside High School, which sent a student last year to the ILO), and Aidan Langston of New York, Julian Gau of New Jersey, Jacqueline Bredenberg of Michigan, and Kevin Li of New Jersey on the second. (Two students—Jacqueline Bredenberg and Aidan Langston—are unable to travel, and their places will be filled by Nilai Sarda and Kevin Li, respectively. This second Kevin Li is from California and was on the 2014 US team, along with James Bloxham.)
The four top-performing Canadian students who will compete at the ILO are James Hyett of Ontario, Eugene Shen of British Columbia, Ben Zhang of Ontario, and Ella Bei of Alberta.
Just as they have for the past nine years, Radev and Lori Levin will coach the two US teams, which have performed impressively from the beginning. The US is especially competitive in the team scoring, having won the team gold for the past four years, and before that, winning or tying for the team gold in every year save one. Individuals have also excelled, with at least one US team member winning a gold medal in all but one year. These results are especially impressive considering that the US team has been competing only since 2007 (and Canada only since 2011).
Radev feels NACLO has made a difference in promoting computational linguistics, encouraging more students to pursue study in both traditional and computational linguistics. Some students have even started linguistics clubs at their schools, a welcome sign given that the number of applications for NLP will continue to grow—sentiment analysis being one new NLP area with applications for business and intelligence.
To help meet the growing demand for computational linguists, Radev and his fellow organizers are looking to expand NACLO by increasing the number of participating students, launching a pilot contest in French-speaking Canada, and recruiting additional sponsors. The NACLO site is also being expanded to add more training materials and additional sample problems to better help students not only learn about traditional and computational linguistics, but to discover their affinity for language, logic, and computation.
Registration for the 2016 competition will start in September. Students looking to compete and teachers wanting to host a NACLO competition should visit the
NACLO site.
NACLO is sponsored by the NSF, the North American chapter of the Association for Computational Linguistics, the Linguistic Society of America, Yahoo!, the University of Michigan, Carnegie Mellon University’s Language Technologies Institute and Gelfand Center for Community Outreach, and Brandeis University, among others.
Answer to question at top:
3 is mẽña go aroke
5 is ãẽmãẽmpoke
7 is ãẽmãẽmpoke go mẽña
10 is tipãẽmpoke
Additional linguistics puzzles, some of which are automatically scored, are found
here.