The WordCorr Project



General Plan

Activities
Roles
How it Works
Database
Preservation and Sharing
Documentation
References

Activities: Information technology and comparative linguistics come together in a computer application tentatively called WordCorr. The name acknowledges the insight of John Wimbish, who in the 1970s developed a widely used DOS program called WordSurv that set up the data of comparative phonology in an innovative way that is perpetuated in WordCorr. WordSurv never carried through to actual comparative phonology; it only tabulated guesses at lexical similarity for pairs of speech varieties, which most comparativists now regard as giving an inadequate picture of language divergence. Nevertheless, a number of comparativists already use WordSurv to store their data, because Wimbish's ideas for visualizing data (Wimbish 1989) made it easy for the linguist to spot similarities and differences in sets of forms. WordCorr also lets comparative linguists import their WordSurv files directly.

Going beyond the Principal Investigator's crude prototype (Grimes 1995a, Chapter 1) to provide a genuinely useful tool for comparative linguistics requires participation by four parties: the Principal Investigator, two graduate assistants in linguistics, a developer of commercial-grade software, and a few field linguists and graduate students in comparative linguistics as testers.

Principal Investigator:

Graduate Research Assistants prepare materials and assist the Principal Investigator in training others in the workshops, thus gaining hands-on experience in explaining language comparison and in training other comparative linguists to use the tools.

Software Developer:

The Principal Investigator will choose the software developer from among for-profit software houses headquartered in Hawai'i that have established good reputations for reliability and interaction with their clients. The developer will operate under a subaward from the main grant to the University of Hawai'i, as required by the NSF 01-149 solicitation; it will be subject to verification by the University sponsored research administrators.

The justification for going the for-profit route is that, in the Principal Investigator's experience, the kind of analytically oriented people that universities attract tend to produce software solutions that they themselves and maybe a few others can use on a specific problem; but the same people (including the PI) tend not to meet industry standards for user-friendly presentation, compatibility with operating systems, good graphic design, accessibility, and efficient and fast interfacing among the user, application logic, and database levels of organization. Commercial application developers, however, generally show expertise in both solutions and interfaces.

Participating Linguists and Graduate Students:

The workshop at the University of Hawai'i will be limited to participants already active in comparative studies and based in Hawai'i, since the application will still be in a tentative form and the workshop will serve as a beta test for it, though it will enhance the research of all the participants as well. The second workshop will be on the Internet and will be open to international participation, on the same basis. The participants in both will need to handle English well, because although internationalized versions of the application are part of the long-term plan, they will not be implemented during this project.


Roles: The functioning of the full scale application revolves around a ranking of roles. Anybody in a role of greater scope (to the left) also has the privileges and capabilities of the roles of more limited scope (to the right):

Developer > Manager > Team Leader > Data Keeper > Investigator or Guest
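The ranking above can be sketched as a simple rank comparison. This is only an illustration under stated assumptions: the numeric ranks, the `ROLE_RANK` table, and the `has_privileges_of` function are hypothetical names, not part of WordCorr. Investigator and Guest share the lowest rank here, so finer restrictions on guests are not modeled.

```python
# Hypothetical rank table: a higher number means greater scope.
# Investigator and Guest share the lowest rank, as in the ranking above.
ROLE_RANK = {
    "Developer": 4,
    "Manager": 3,
    "Team Leader": 2,
    "Data Keeper": 1,
    "Investigator": 0,
    "Guest": 0,
}


def has_privileges_of(actor_role: str, required_role: str) -> bool:
    """Return True if actor_role ranks at or above required_role."""
    return ROLE_RANK[actor_role] >= ROLE_RANK[required_role]


if __name__ == "__main__":
    # A Manager can do whatever a Data Keeper can do, but not the reverse.
    print(has_privileges_of("Manager", "Data Keeper"))
    print(has_privileges_of("Data Keeper", "Manager"))
```

A single comparison keeps the rule "greater scope includes lesser scope" in one place; per-role exceptions would need an additional check on top of it.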

Developer: Writes and maintains the code and the screen layouts. Creates own initial password in the code, and provides an initial password to the manager. The Principal Investigator and the software developer jointly fill this role.

Manager: Maintains the Web site and database, including backups. Pulls extracts from the main database for standalone use in the field, and helps the Data Keeper reintegrate them when the investigator returns. Provides an initial password to the team leader of each project, and shows the team leader how to get the project launched.

Team Leader: The lead scholar on a research project, or the teacher of a class. Establishes the structure of each collection of data used in the project, including the speech varieties covered and the sequence and extent of entries in the word lists of the collection. [Note: "Word list" means different things in linguistics and in information technology. For a linguist, a word list is a list of vocabulary items in common use by speakers of a language, usually elicited face to face and often organized semantically. For information technology, a word list is an alphabetic list of unique character sequences found by a computer bounded by spaces or punctuation marks in a particular document or set of documents.] Provides an initial password for the data keeper and each investigator. Trains the investigators and dialogues with them. If permissions for the team to use published data are required, the team leader is responsible for getting them. The team may be multinational and widely dispersed, or it may consist of one investigator working alone.

Data Keeper: The team leader has the option of taking care of the integrity of the data in each collection personally, or of handing that function over to one or two of the investigators, who are then enrolled as data keepers. The data keeper is responsible for the accuracy and integrity of the unannotated data accessed in common by all the members of the team; other investigators cannot change the data. Investigators annotate the common data in their own way, and may develop multiple annotations of the same data simultaneously in order to follow out different hypotheses or to work with different subsets of the speech varieties.

Investigator: Even the team leader and data keeper spend most of their time acting as investigators because that's where the fun is. All the investigators on a project define their own views of the common project data, assign tags to mark off different groups of data within each entry for comparison, annotate data by aligning comparable segments within each group, specify metatheses and semantic shifts, add comments to the view and its components, decide on the organization of the correspondence sets in their environments, review the results presented by the computer, and revise everything as many times as needed. Investigators may grant another investigator on the team permission to view their annotations and results. Students in a class may work using initial annotations provided by the professor, or may develop their own annotations from scratch.

The computer, for its part, keeps track of all this, displays whatever the investigator needs to see, provides channels for real time dialogue among members of the team, and prepares data summaries to back up each point of the emerging analysis.
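As a rough illustration of the kind of summary the computer could prepare from an investigator's alignments, the sketch below groups aligned segments into correspondence sets and records which entries attest each one. Everything in it is an assumption made for illustration: the forms are invented (they are not data from any real language), `"-"` as a gap marker is an assumed convention, and `tabulate` is a hypothetical name. Environments, metatheses, and semantic shifts are not modeled.

```python
from collections import defaultdict

# Invented example data: for each gloss, one aligned form per speech variety.
# Each form is a list of segments; "-" marks a gap (an assumed convention).
VARIETIES = ("A", "B", "C")
ALIGNED = {
    "stone": {"A": ["p", "a", "t", "a"], "B": ["b", "a", "t", "a"], "C": ["p", "a", "t", "-"]},
    "leaf": {"A": ["p", "i", "k", "u"], "B": ["b", "i", "k", "u"], "C": ["p", "i", "k", "-"]},
}


def tabulate(aligned, varieties):
    """Map each correspondence set to the (gloss, position) pairs that attest it."""
    table = defaultdict(list)
    for gloss, forms in aligned.items():
        # zip(*...) walks the aligned forms column by column.
        columns = zip(*(forms[v] for v in varieties))
        for position, column in enumerate(columns):
            table[column].append((gloss, position))
    return dict(table)


if __name__ == "__main__":
    for segments, evidence in tabulate(ALIGNED, VARIETIES).items():
        print(" : ".join(segments), "attested in", evidence)
```

Keeping the supporting entries alongside each correspondence set is what would let a summary back up each point of the analysis with its evidence.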

Guest: People with curiosity about how languages relate to each other, including students not enrolled in linguistics courses, can sign on as guests, with access to a demonstration collection and a guide to the conceptual help files and tutorials. Using the Internet, they can take a shot at doing comparative linguistics. Guests function as investigators except for not being able to choose among collections. Their analysis remains on the computer for only a limited time. As long as the guest keeps coming back within the time limit, the analysis stays there. If the guest stops returning, the analysis is deleted. Guests are normally not granted chat privileges unless one of the regular investigators volunteers to act as mentor for them, very likely only during certain hours.

Easy access by guests is an invitation to cracking. The extra security needed will be worth the effort if some of the guests are attracted into linguistics.
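The retention rule for guest analyses described above amounts to a sliding time window: each return visit restarts the clock. A minimal sketch follows, assuming a 30-day limit (the source does not state the actual limit); `RETENTION`, `GuestAnalysis`, and `purge_expired` are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumed retention window; the actual time limit is not specified.
RETENTION = timedelta(days=30)


class GuestAnalysis:
    """Tracks when a guest last returned to an analysis."""

    def __init__(self, created: datetime):
        self.last_visit = created

    def touch(self, now: datetime) -> None:
        """Record a return visit, restarting the retention window."""
        self.last_visit = now

    def is_expired(self, now: datetime) -> bool:
        """True once the guest has stayed away longer than the window."""
        return now - self.last_visit > RETENTION


def purge_expired(analyses: dict, now: datetime) -> dict:
    """Return only the guest analyses still inside their retention window."""
    return {guest: a for guest, a in analyses.items() if not a.is_expired(now)}


if __name__ == "__main__":
    start = datetime(2002, 1, 1)
    analyses = {"returning": GuestAnalysis(start), "lapsed": GuestAnalysis(start)}
    analyses["returning"].touch(start + timedelta(days=20))
    print(list(purge_expired(analyses, start + timedelta(days=40))))
```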

Standalone: The standalone configuration, in which the user interface, application logic, and database are all inside a single laptop, is intended for the many linguists who work alone in the field. It does not use the Internet, though the software can be delivered by download from the Internet before the linguist heads out the door. Because the standalone user is in effect Team Leader, Data Keeper, and Investigator rolled into one, it is possible to start with a blank database, define a collection and several views, type in or import data, and do a complete analysis. There are not, however, many areas in the world where some comparative work has not been done already; so most linguists will want to structure their collections to take advantage of word lists that have already been collected, and fill in new word lists having the same structure as existing ones. A standalone configuration could thus be partly set up by extracting word lists from a collection that is already in the database or obtainable from an archive, then fleshed out with new data collected in the field.


How it works: The application gives the linguist control over three main functions, called Data, Tabulate, and Refine, plus an auxiliary function called Manage Views.


Database: Standing behind these functions is a relational database. It has four tightly interlinked parts, called Management, Data, Tabulation, and Results.

There is a technical dilemma that needs to be worked through with the designer as soon as the work begins. The most efficient way to make something as complex as WordCorr available via the Internet is to have most of the intelligence residing on the server computer that manages the master database, then use an ordinary browser for the World Wide Web with no special components to be the interface to all the users. This is the way most bank-by-Internet and airline reservation operations are set up. But if scaling the application down for the standalone mode involves putting a partial copy of the master database on the user's laptop, this may create licensing problems, because the database manager itself, plus a number of component pieces of software that the software developer uses to build the application, operate under license, and it won't do to have unlicensed copies running around the world on standalone laptops.

A second alternative would be to put the intelligence into the laptop (in computerese, "the client side") using only tools such as Visual Basic Studio that produce fully licensed runtime-only software whose internal details are not accessible to any user, and can be protected under a standard agreement not to decompile. This is the same strategy used for common commercial software applications like TurboTax, used to generate and file U.S. income tax returns. There could be dual database interface modules configured either to run over the Internet on a full scale database manager with all the server side bells and whistles, or to run on a simple database system installed in the same computer. A portable extract from the main database would allow standalone users in the field, who at this stage are the ones most interested in using WordCorr and will undoubtedly be the main testers for it, to have exactly the same application at their fingertips as the Internet users have. Runtime-only copies of the application would be downloadable from the Internet by each potential user. The catch is that the serendipity factor by which some potential linguists among the general public discover linguistics while browsing the Internet is lost, because only those who already consider themselves capable of using a comparative linguistics tool are likely to consider downloading one.
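The idea of dual database interface modules can be sketched as one interface with two interchangeable back ends, so that the application logic never needs to know which one it is talking to. The sketch below is illustrative only: `WordStore`, `LocalStore`, `RemoteStore`, `open_store`, and the single-table layout are hypothetical, SQLite stands in for "a simple database system installed in the same computer", and the Internet back end is left as a placeholder.

```python
import sqlite3
from abc import ABC, abstractmethod


class WordStore(ABC):
    """Hypothetical interface the application logic would code against."""

    @abstractmethod
    def add_entry(self, variety: str, gloss: str, form: str) -> None: ...

    @abstractmethod
    def forms_for(self, gloss: str) -> list: ...


class LocalStore(WordStore):
    """Standalone mode: a simple database on the same computer (SQLite here)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS entry (variety TEXT, gloss TEXT, form TEXT)"
        )

    def add_entry(self, variety, gloss, form):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO entry VALUES (?, ?, ?)", (variety, gloss, form)
            )

    def forms_for(self, gloss):
        rows = self.conn.execute(
            "SELECT variety, form FROM entry WHERE gloss = ? ORDER BY variety",
            (gloss,),
        )
        return rows.fetchall()


class RemoteStore(WordStore):
    """Internet mode placeholder: would forward the same calls to a server."""

    def __init__(self, url: str):
        self.url = url

    def add_entry(self, variety, gloss, form):
        raise NotImplementedError("network transport is outside this sketch")

    def forms_for(self, gloss):
        raise NotImplementedError("network transport is outside this sketch")


def open_store(standalone: bool, url: str = "") -> WordStore:
    """Pick the back end once; callers use the WordStore interface only."""
    return LocalStore() if standalone else RemoteStore(url)


if __name__ == "__main__":
    store = open_store(standalone=True)
    store.add_entry("A", "stone", "pata")  # invented form, for illustration
    print(store.forms_for("stone"))
```

Under this arrangement the same application code would run in both configurations, with only the call to `open_store` differing.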

A third option is to concentrate the first part of the project on producing a client side version of WordCorr with a local database on the same laptop, as described under the second option, not only to get something into the hands of linguists as soon as possible but to get feedback on the design from people already doing field work. Then, with a firmer grip on the functioning of the tool, the next part of the project would be devoted to transferring the same application logic to the server side, so that an Internet user would not need to download any special software. This development route could be tricky because of having to make sure that everything works in both configurations, and that information in the database can be passed readily from one configuration to another. It is more expensive, because it is like developing an application and a half. Unless this third path hits unforeseen snags, it is the one the project will follow, because


Preservation and sharing: Two means of ensuring the preservation of data used by each project will be put in place immediately. First, scheduled backups will be made regularly and a copy kept off site. Second, changes in the common data whose owners have agreed to archiving will be transmitted regularly in XML archival form to a major long-term archive of linguistic data in the United States, the Open Language Archives Community (OLAC, http://www.language-archives.org), related to the NSF-sponsored Electronic Metastructure for Endangered Languages Data (EMELD) at the Linguist List Web site http://www.linguistlist.org. The WordCorr project will act as a resource creator in relation to OLAC.

The archive will also be the primary means of sharing data, since the XML form allows the recipient to shape the data received to local requirements. This is even true for data shared between WordCorr collections whose varieties overlap, since different collections are assumed to have different internal layouts.
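To make the XML exchange concrete, here is a minimal round-trip sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`collection`, `entry`, `variety`, `gloss`) are invented for this illustration and are not the WordCorr archival format or any OLAC schema; the forms are invented as well.

```python
import xml.etree.ElementTree as ET


def collection_to_xml(name, entries):
    """Serialize (variety, gloss, form) triples to an XML string.

    Element and attribute names are invented for this sketch.
    """
    root = ET.Element("collection", {"name": name})
    for variety, gloss, form in entries:
        entry = ET.SubElement(root, "entry", {"variety": variety, "gloss": gloss})
        entry.text = form
    return ET.tostring(root, encoding="unicode")


def xml_to_entries(xml_text):
    """Read the triples back, ready to be reshaped to a local layout."""
    root = ET.fromstring(xml_text)
    return [(e.get("variety"), e.get("gloss"), e.text) for e in root.iter("entry")]


if __name__ == "__main__":
    entries = [("A", "stone", "pata"), ("B", "stone", "bata")]
    xml_text = collection_to_xml("demo", entries)
    print(xml_text)
    print(xml_to_entries(xml_text))
```

Because the recipient parses the XML into plain records first, it can map them onto whatever internal layout its own collection uses.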


Documentation: An overview of the WordCorr system similar to what is on this page but more modularized, plus detailed instructions explaining the functions associated with each screen, will be available in both standalone and Internet mode. In addition, the code will be essentially self-documented, with ample indication of intent and implementation strategy incorporated in each procedure so that the implementers of future developments will not have to guess about what they are dealing with.


References:

Agard, Frederick B. A course in Romance linguistics, Volume 2: A diachronic view. Washington DC: Georgetown University Press, 1984.

Campbell, Lyle. Historical Linguistics: An Introduction. Edinburgh: Edinburgh University Press, 1998. Reissued at Cambridge: The MIT Press, 1999.

Frantz, Donald G. 'A PL/1 program to assist the comparative linguist.' Communications of the ACM 13:6.353-356, 1970.

Grimes, Barbara F., ed. Ethnologue : Languages of the World, fourteenth edition. Dallas: Summer Institute of Linguistics, 2000. http://www.ethnologue.com.

Grimes, Joseph E. Language Survey Reference Guide. Dallas: Summer Institute of Linguistics, 1995a. In the LinguaLinks Library under Sociolinguistics, available through http://www.ethnologue.com/lingualinks.asp.

Grimes, Joseph E. 'Language endangerment in the Pacific.' Oceanic Linguistics 34:1.1-12, 1995b.

Grimes, Joseph E. and Frederick B. Agard. 'Linguistic divergence in Romance.' Language 35:598-604, 1959.

Grimes, Joseph E. and Barbara F. Grimes. Ethnologue Language Family Index. Dallas: Summer Institute of Linguistics, 1993, 1996, 2000. http://www.ethnologue.com.

Grimes, Barbara F., Joseph E. Grimes, Malcolm Ross, Charles E. Grimes, and Darrell T. Tryon. 'Listing of Austronesian languages.' In Darrell T. Tryon, ed., Comparative Austronesian Dictionary: An Introduction to Austronesian Studies, Part 1: Fascicle 1, pp. 121-279, 1995c.

Kurebito, Megumi, ed. Comparative Basic Vocabulary of the Chukchee-Kamchatkan Language Family: 1 (Endangered Languages of the Pacific Rim, Series A2-011). Suita, Japan: Osaka Gakuin University, 2001.

Weber, David J., Stephen R. McConnel, H. Andrew Black, and Alan Buseman. STAMP: A Tool for Dialect Adaptation. Dallas: Summer Institute of Linguistics, 1990. [The key phrase has since been changed to the more accurate Related Language Adaptation, a low-level but practical form of machine translation suitable for closely related varieties.]

Wimbish, John S. WordSurv: A Program for Analyzing Language Survey Word Lists (Occasional Publications in Academic Computing, 13). Dallas: Summer Institute of Linguistics, 1989. In the LinguaLinks Library under Sociolinguistics, available through http://www.ethnologue.com/lingualinks.asp.



For problems or questions regarding this web contact khamasak@users.sourceforge.net.


Sponsors: SourceForge.net; Data House, Inc.; University of Hawaii; NSF; SIL