The WordCorr Project


Broader Implications

Advancing Discovery
Promoting Learning
Broadening Participation
Infrastructure
Dissemination
Implications for Society at Large

Advancing discovery: There are three levels at which WordCorr advances discovery: experienced, student, and guest.

Experienced: Experienced scholars with huge amounts of data will profit from the fact that no data leak out of the system, and that observations expressing part of the investigator's analysis can be attached to any relevant unit. The Residue section holds everything for which the scholar has not yet figured out a proper place in the analysis. As in all science, it is the things that don't quite fit the big picture that are the cracks through which new insights make their way to the inside of one’s mental box. Furthermore, the fact that the scholar no longer has to spend most of the time on data management details, to the detriment of thinking analytically, means that new ideas have a better chance of surfacing. And the ability to set up separate views to follow out the implications of incompatible hypotheses at the same time should lead to good documentation of the reasons for preferring or rejecting alternative analyses.

Student: When the Principal Investigator was studying comparative linguistics, he tended to fall asleep after a certain number of pages of data that began to look all the same. Experience with a prototype of WordCorr's interactive approach suggests that it holds the user's attention with the intensity of a video game -- well enough that the computer needs to pop up "Do you want to take a break now?" after tabulating each entry. This means that the student using a prepared data set is motivated to retrace the original scholar's path of discovery instead of just reading about the conclusions the scholar reached and the controversies along the way. In fact, the student just might notice something the established scholar missed.

Guest: People with no linguistic background are welcome to browse designated data sets, and are shown how to try their hand at making the comparisons, using the guides that one of the graduate assistants is to prepare. Some of them will not only be motivated to play with real language data; they will discover things about language in general they hadn't thought of before. Personal discovery of that kind could lead some to become linguists, and might make some less prejudiced against languages other than their own.


Promoting learning: Graduate students in master's or doctoral programs will find the application invaluable for storing and archiving field data and pulling in data already available from archives or publications. Then as they tabulate what they have collected and form their own hypotheses about the patterns of language divergence that underlie each pattern they find in their tabulations, they will go through what for graduate school is the ultimate learning experience: doing a workmanlike job that actually advances knowledge. Classroom discussions at both graduate and undergraduate levels should be very interesting, since the students will undoubtedly see things the professor has never dealt with. The guest category will also be an avenue of self-directed learning.


Broadening participation: Trends in the kinds of papers accepted for broadly based meetings such as the annual meeting of the Linguistic Society of America suggest that fewer and fewer people have been active in comparative linguistic research over the years. One reason may be that when students of linguistics are exposed to the comparative method, they find it exciting—until it hits them how much picky work is involved and how easy it is to overlook something, at which point syntax begins to look like a better career choice. Knowing that there is a tool that diminishes both obstacles could lead to an increase in participation.

The Internet-based team structure also makes it possible for research teams to form, involving collaboration among many institutions, domestic and foreign. And the Guest status may draw in students and members of the public, including native speakers of some of the languages who are otherwise underrepresented in linguistic scholarship because of geographic or social isolation from mainstream academic institutions, but who nevertheless have something to contribute.


Infrastructure: A database that will do what has been described is straightforward to manage using current information technologies, though a project of similar scope using traditional means of data management would be hopeless.

The WordCorr design actually makes feasible an infrastructure capable of managing the data and analytical work necessary to complete a thorough classification of all the world's languages. Were it to operate on that scale, its data component would hold around 200 megabytes: say 10,000 speech varieties with an average of 1,000 entries per variety, each entry holding a datum that averages 10 segments of 16-bit Unicode. (The figure of 10,000 varieties is based on the 6,800 living languages in the current Ethnologue, www.ethnologue.com, plus revisiting poorly documented dialects of known languages and finding varieties that linguists are still not aware of; Kurebito 2001 is an example of a 1,000-entry data list.) The world-scale tabulation component could well involve 1,000 investigators, each looking at 100 varieties through 10 different views, with some varieties being looked at by more than one investigator and with annotations of 10 bytes for each datum, for a total of 20 gigabytes. The results component would come to a little over 10 gigabytes, assuming 50 protosegments per view, 100 correspondence sets per protosegment, and 100 16-bit Unicode characters in each correspondence set, plus 20 4-byte citations per set. With a 50% overhead for the management component tables and behind-the-scenes linking tables, this would put the worldwide database for comparative linguistics at around 45 gigabytes, a modest size as serious databases go.
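As a rough check on the arithmetic above, the estimate can be reproduced with a short Python sketch. The class, function, and field names are hypothetical and are not part of WordCorr. Two points are assumptions rather than statements from the text: the tabulation cost is taken as 20 bytes per datum per view (the text names only the 10-byte annotation, but 20 bytes reproduces the stated 20 gigabytes, and also the 450 megabytes of the five-year estimate), and each investigator is assumed to keep a separate set of results for each view.

```python
from dataclasses import dataclass

BYTES_PER_CHAR = 2      # 16-bit Unicode, as stated in the text
BYTES_PER_CITATION = 4  # 4-byte citations, as stated in the text


@dataclass
class Scenario:
    """Hypothetical container for the sizing parameters; not part of WordCorr."""
    varieties: int
    entries_per_variety: int
    segments_per_datum: int
    investigators: int
    varieties_per_investigator: int
    views_per_investigator: int
    tabulation_bytes_per_datum: int  # ASSUMPTION: text states only a 10-byte annotation
    protosegments_per_view: int
    sets_per_protosegment: int
    chars_per_set: int
    citations_per_set: int
    overhead_percent: int = 50       # management and linking tables


def estimate_bytes(s: Scenario) -> dict:
    """Return the size in bytes of each component and of the whole database."""
    data = (s.varieties * s.entries_per_variety
            * s.segments_per_datum * BYTES_PER_CHAR)
    tabulation = (s.investigators * s.varieties_per_investigator
                  * s.entries_per_variety * s.views_per_investigator
                  * s.tabulation_bytes_per_datum)
    bytes_per_set = (s.chars_per_set * BYTES_PER_CHAR
                     + s.citations_per_set * BYTES_PER_CITATION)
    # ASSUMPTION: every investigator keeps separate results for every view.
    results = (s.investigators * s.views_per_investigator
               * s.protosegments_per_view * s.sets_per_protosegment
               * bytes_per_set)
    subtotal = data + tabulation + results
    total = subtotal * (100 + s.overhead_percent) // 100
    return {"data": data, "tabulation": tabulation,
            "results": results, "total": total}


# World-scale figures taken from the paragraph above.
world = Scenario(
    varieties=10_000, entries_per_variety=1_000, segments_per_datum=10,
    investigators=1_000, varieties_per_investigator=100,
    views_per_investigator=10, tabulation_bytes_per_datum=20,
    protosegments_per_view=50, sets_per_protosegment=100,
    chars_per_set=100, citations_per_set=20,
)
est = estimate_bytes(world)
for name, size in est.items():
    print(f"{name:>10}: {size / 1e9:6.1f} GB")
```

Under these assumptions the data component comes to 0.2 gigabytes, the tabulation component to 20 gigabytes, and the results component to 14 gigabytes, for about 51 gigabytes with overhead. The figure of around 45 gigabytes follows if the results component is rounded down to 10 gigabytes before the overhead is applied; either way the order of magnitude is the same.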

For this ITR project we plan to wait to see how well WordCorr catches on with the international community of linguists before actually scaling it up to world size. One could guess that it will be on the order of five years before the data component grows to 5,000 varieties averaging 500 entries per variety, requiring 50 megabytes. By that time tabulation activity may reach 300 investigators working on an average of 50 varieties each, in 3 views, giving 450 megabytes. Results would still be around 50 protosegments per view and 100 correspondence sets per protosegment, but only 50 segments per average correspondence set because of the 50-variety scope, and fewer available citations per set, giving another 450 megabytes. With overhead, the actual database in five years would be around 1.5 gigabytes, which would fit on the 1999 model computer this page is being written on with room to spare. Educational use of WordCorr to teach comparative phonology might swell this number to 2 gigabytes; but it is unlikely to strain the resources, because educational users are likely to stick to small collections with relatively few varieties. (Agard's excellent pedagogical presentation of the Romance language family (1984), for example, has 475 entries for 8 varieties; it would be ideal as a data set for educational use with WordCorr.) The important thing is to design for the widest possible usage, but to commit resources only for what is likely in a few years, knowing that with today's database technology it can be scaled up as needed.

The other focus of this ITR project is the standalone database for the individual investigator out in the field collecting and analyzing primary data away from the Internet. This person may have already been working alone for years and may not be in a position to join in team research. One person's data component is not likely to be larger than 100 varieties, and many individual investigators collect only about 300 entries per variety, giving about a megabyte total. The tabulation part will serve just one investigator, not hundreds, for 100 varieties and perhaps 3 views, to give another megabyte. The results part is comparable to the others, averaging 50 protosegments per view, 100 sets per protosegment, 100 segments per set with citations, giving around 2 megabytes. The grand total is under 5 megabytes.


Dissemination: A report on the results of the project will be sent in hard copy to NSF and selected research centers, and disseminated more widely to others via the World Wide Web using this project Web site.


Implications for Society at Large: There is always public interest in how languages have developed and diverged. Together with archaeology and genetics, comparative linguistics is one of our main sources of knowledge about the paths taken by peoples whose history has never been written, and about what may have gone on in times before any history was written.

At the other end of the scale of language relationships, knowing about the ways in which closely related speech varieties can diverge from each other meshes with Agard's typology of sound changes that can inhibit intelligibility (Agard 1984, pp. 41-47; Grimes 1995a, pp. 4-8). Intelligibility among speech varieties is of interest to educators in multilingual or multidialectal areas. For example, when the Principal Investigator was in Saipan consulting with a project on Carolinian languages, he attracted the attention of a high school principal from Pohnpei in the Federated States of Micronesia. The principal was concerned with providing school texts for students on island chains who speak related languages that are unlike enough that their speakers do not understand each other readily unless they have learned the other varieties as second languages (German and Dutch, or Spanish and Catalan, are European examples of the same phenomenon). The Principal Investigator had compiled preliminary comparative tables by traditional paper-and-pencil means and had used them with the STAMP program (Weber et al. 1990) to switch a folk tale from one Carolinian variety to another. When the educator saw that output, he viewed it as a possible solution for his textbook problem.

Making it easy for the general public to try their hand at language comparison via the Internet implementation of WordCorr should have a small societal impact in two different directions. First, it may well attract more people into linguistics. Second, it may allow people to discover for themselves the patterning and beauty of languages that they had previously thought of as primitive or deficient.




For problems or questions regarding this web site, contact khamasak@users.sourceforge.net.
Last updated: Jan 01, 1970


Sponsors:

SourceForge.net; Data House, Inc.; University of Hawaii; NSF; SIL