General Plan

The WordCorr Project


Activities
Roles
How it Works
Database
Preservation and Sharing
Documentation
References

Activities: Information technology and comparative linguistics come together in a computer application tentatively called WordCorr. The name acknowledges the insight of John Wimbish, who in the 1970s developed a widely used DOS program called WordSurv that set up the data of comparative phonology in an innovative way that is perpetuated in WordCorr. WordSurv never carried through with actually doing comparative phonology, but only tabulated guesses at lexical similarity for pairs of speech varieties, which most comparativists now regard as giving an inadequate picture of language divergence. Nevertheless, a number of comparativists already use WordSurv to store their data, because Wimbish's ideas for visualizing data (Wimbish 1989) made it easy for the linguist to spot similarities and differences in sets of forms. Hence the name. WordCorr also makes provision for comparative linguists to import their WordSurv files into WordCorr directly.

Going beyond the Principal Investigator's crude prototype (Grimes 1995a Chapter 1) to provide a genuinely useful tool for comparative linguistics requires participation by four parties: the Principal Investigator, two graduate assistants in linguistics, a developer of commercial grade software, and a few field linguists and graduate students in comparative linguistics as testers.

Principal Investigator:

Guides the detailed development of the specifications.
Dialogues with practicing comparativists about features that it may be feasible to include in this version or later versions.
Dialogues with the Open Language Archives Community and with Linguist List's NSF-sponsored Electronic Metastructure for Endangered Languages Data (EMELD) to link to both.
Organizes two workshops in which linguists learn to apply WordCorr to their own data, hopefully as a step on the way to publishable studies.
Disseminates a report on the results of the project to NSF and selected research centers in print, and more widely via the World Wide Web.

Graduate Research Assistants prepare materials and assist the Principal Investigator in training others in the workshops, thus gaining hands-on experience in explaining language comparison and in training other comparative linguists to use the tools.

One assistant works with the Principal Investigator on a conceptual guide to the application, a cross between a mini-textbook in comparative linguistics and a hypertext help facility, designed for both standalone and Internet presentation. Its scope is comparable to that of the first five chapters of Campbell 1999, especially Chapter 5 on method. This assistant must be good at technical writing in clear English, and in return gains familiarity with the teaching side of comparative linguistics, learns firsthand how to plan educational presentations for the Internet and distance learning, and has ample opportunity to analyze comparative data actively.
The other assistant works with the Principal Investigator on preparing data sets that are appropriate for introducing linguists to the application and for using the application as an educational tool. This assistant gains firsthand experience in the comparative analysis of several different language families.

Software Developer:

Adapts and extends the Principal Investigator's preliminary specifications to produce robust and user friendly software.
Makes the application scalable from a team configuration, in which a project leader and colleagues (or a teacher and pupils) work off a common database via the Internet from different locations if necessary, to a standalone configuration in which a database extracted from the main one resides on the same laptop as the application software so that the investigator can work in field locations where there is no Internet access. The standalone version will be addressed first and independent scholars will take part in testing it, so that the more complicated Internet multiuser version can be built around a simpler core that has been shown by experience to work right.
Addresses:
1. Using Unicode for symbols from the International Phonetic Alphabet (IPA).
2. Keyboard conventions that linguists can use readily for inputting and editing IPA data.
3. Accessibility by scholars with disabilities such as impaired visual acuity or color blindness.
4. Inputting and editing data sets for a single speech variety entry by entry, or data sets for multiple varieties such as appear in many published works, as well as importation from existing data sets.
5. Security for project data and each investigator's working files, privacy for individuals and means of protecting each one's intellectual property rights, integrity of the database and the Web pages, and security for all the wide area and wireless networking done among the Principal Investigator, the graduate assistants, the software developer, and the workshop participants.
6. Allowing other investigators on the same team to view one's data annotations and current conclusions and copy some of them with permission, and allowing any or all of them to chat and videoconference via the Internet or a local network such as 802.11b or Ethernet.
7. Using the application on computers that are not the latest and greatest, and whose operating systems are one or two versions behind the one the developer uses. For example, Windows 98 is likely to still be in use for several years by linguists working overseas, while developers are using more advanced operating systems.
8. Exporting data to public archives that are being developed now, most specifically that of the Open Language Archives Community in collaboration with the Linguist List EMELD project. This includes the existing Cornell-SIL-Hawai'i archive and collections currently kept in WordSurv data files by individuals and released by them for archiving.
9. Concurrent development of application help files to be integrated with the conceptual help files a graduate assistant is working on.
10. Eventual internationalization of the application to make its screens and output forms come up in the major languages of scholarship (notes by the investigators on different parts of the analysis do not need to be translated, but can be kept in the investigator's publication language). Internationalization is not a direct goal of this project, but the right hooks need to be placed in the software early on, so that in future years there will be no need to redesign everything in order to achieve internationalization.
11. Helping the Principal Investigator develop a plan for long term management and maintenance of the facility on the Internet under University of Hawai'i supervision beyond the current grant period, to be funded later.

The Principal Investigator will choose the software developer from among for-profit software houses headquartered in Hawai'i that have established good reputations for reliability and interaction with their clients. The developer will operate under a subaward from the main grant to the University of Hawai'i, as required by the NSF 01-149 solicitation; it will be subject to verification by the University sponsored research administrators.

The justification for going the for-profit route is that in the Principal Investigator's experience, the kind of analytically oriented people that universities attract tend to produce software solutions that they themselves and maybe a few others can use on a specific problem; but the same people (including the PI) tend not to meet industry standards for user friendly presentation, compatibility with operating systems, good graphic design, accessibility, and efficient and fast interfacing among the user, application logic, and database levels of organization. Commercial application developers, however, generally show expertise in both solutions and interfaces.

Participating Linguists and Graduate Students:

Attend a two-week workshop put on by the Principal Investigator and the graduate assistants, in which linguists and graduate students in comparative linguistics learn how to use the standalone application, first on data sets prepared for training by the Principal Investigator, and then to augment the way they work with their own comparative data. The first workshop will be held at the University of Hawai'i near the end of the first year of the project, so that the experience gained will be useful to the developers during the second year.
The second workshop will be held on the Internet before the end of the second year, with domestic and international participants working from their home bases and communicating with the Principal Investigator and the graduate assistants, so that the experience gained that way can be incorporated in the project report.
Use updated releases of the application to continue with their own research via the Internet or standalone.

The workshop at the University of Hawai'i will be limited to participants already active in comparative studies and based in Hawai'i, since the application will still be in a tentative form and the workshop will serve as a beta test for it, though it will enhance the research of all the participants as well. The second workshop will be on the Internet and will be open to international participation, on the same basis. The participants in both will need to handle English well, because although internationalized versions of the application are part of the long term plan, they will not be implemented during this project.


Roles: The functioning of the full scale application revolves around a ranking of roles. Anybody in a role of greater scope (to the left) also has the privileges and capabilities of the roles of more limited scope (to the right):

Developer > Manager > Team Leader > Data Keeper > Investigator or Guest
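The ranking above can be pictured as a simple ordered lookup. The following is only a sketch under stated assumptions, not the project's actual design: the numeric ranks and the function name `covers` are invented for illustration, and because Investigator and Guest share the lowest rank, neither is treated as covering the other.

```python
# Hypothetical rank table: a higher number means a role of greater scope.
RANK = {
    "Developer": 4,
    "Manager": 3,
    "Team Leader": 2,
    "Data Keeper": 1,
    "Investigator": 0,
    "Guest": 0,
}


def covers(holder: str, needed: str) -> bool:
    """Return True if someone in role `holder` has the privileges of `needed`.

    A role covers itself and every role of strictly lower rank. Roles that
    share a rank (Investigator and Guest) do not cover each other.
    """
    return holder == needed or RANK[holder] > RANK[needed]


# A Team Leader can do whatever a Data Keeper can, but not the reverse.
leader_can_keep_data = covers("Team Leader", "Data Keeper")
keeper_can_lead = covers("Data Keeper", "Team Leader")
```

Keeping the ranks in one table means a new role could be slotted in without touching the comparison logic.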

Developer: Writes and maintains the code and the screen layouts. Creates own initial password in the code, provides initial password to the manager. The Principal Investigator and the software developer jointly fill this role.

Manager: Maintains the Web site and database, including backups. Pulls extracts from the main database for standalone use in the field, and helps the Data Keeper reintegrate them when the investigator returns. Provides an initial password to the team leader of each project, and shows the team leader how to get the project launched.

Team Leader: The lead scholar on a research project, or the teacher of a class. Establishes the structure of each collection of data used in the project, including the speech varieties covered and the sequence and extent of entries in the word lists of the collection. [Note: "Word list" means different things in linguistics and in information technology. For a linguist, a word list is a list of vocabulary items in common use by speakers of a language, usually elicited face to face and often organized semantically. For information technology, a word list is an alphabetic list of unique character sequences found by a computer bounded by spaces or punctuation marks in a particular document or set of documents.] Provides an initial password for the data keeper and each investigator. Trains the investigators and dialogues with them. If permissions for the team to use published data are required, the team leader is responsible for getting them. The team may be multinational and widely dispersed, or it may consist of one investigator working alone.

Data Keeper: The team leader has the option of taking care of the integrity of the data in each collection personally, or of handing that function over to one or two of the investigators, who are then enrolled as data keepers. The data keeper is responsible for the accuracy and integrity of the unannotated data accessed in common by all the members of the team; other investigators cannot change the data. Investigators annotate the common data in their own way, and may develop multiple annotations of the same data simultaneously in order to follow out different hypotheses or to work with different subsets of the speech varieties.

Investigator: Even the team leader and data keeper spend most of their time acting as investigators because that's where the fun is. All the investigators on a project define their own views of the common project data, assign tags to mark off different groups of data within each entry for comparison, annotate data by aligning comparable segments within each group, specify metatheses and semantic shifts, add comments to the view and its components, decide on the organization of the correspondence sets in their environments, review the results presented by the computer, and revise everything as many times as needed. Investigators may grant another investigator on the team permission to view their annotations and results. Students in a class may work using initial annotations provided by the professor, or may develop their own annotations from scratch.

The computer, for its part, keeps track of all this, displays whatever the investigator needs to see, provides channels for real time dialogue among members of the team, and prepares data summaries to back up each point of the emerging analysis.

Guest: People with curiosity about how languages relate to each other, including students not enrolled in linguistics courses, can sign on as guests, with access to a demonstration collection and a guide to the conceptual help files and tutorials. Using the Internet, they can take a shot at doing comparative linguistics. Guests function as investigators except for not being able to choose among collections. Their analysis remains on the computer for only a limited time. As long as the guest keeps coming back within the time limit, the analysis stays there. If the guest loses interest, the analysis disappears. Guests are normally not granted chat privileges unless one of the regular investigators volunteers to act as mentor for them, very likely only during certain hours.

Easy access by guests is an invitation to cracking. The extra security needed will be worth the effort if some of the guests are attracted into linguistics.

Standalone: The standalone configuration, in which the user interface, application logic, and database are all inside a single laptop, is intended for the many linguists who work alone in the field. It does not use the Internet, though the software can be delivered by download from the Internet before the linguist heads out the door. Because the standalone user is in effect Team Leader, Data Keeper, and Investigator rolled into one, it is possible to start with a blank database, define a collection and several views, type in or import data, and do a complete analysis. There are not, however, many areas in the world where some comparative work has not been done already; so most linguists will want to structure their collections to take advantage of word lists that have already been collected, and fill in new word lists having the same structure as existing ones. A standalone configuration could thus be partly set up by extracting word lists from a collection that is already in the database or obtainable from an archive, then fleshed out with new data collected in the field.


How it works: The application gives the linguist control over three main functions, called Data, Tabulate, and Refine, plus an auxiliary function called Manage Views.

Data covers inputting, editing, annotating, importing and exporting, and inspection of a team's common data, in text form (audio samples are sometimes useful, but are not planned for this project). For each individual in a team the data function allows definition of multiple views of the data, and annotations which are different for each view. The linguist may change the annotations as the analysis progresses. If they change, the application will either rearrange automatically the way the results of tabulation are stored, or if the nature of the change makes that impossible, will help the investigator step through the tabulation again on the new arrangement.
Tabulate takes the data and annotations for the entries and groups in a particular view and from them rapidly generates the correspondence sets that are the primary pieces of evidence in comparative analysis. The linguist specifies an environment and a tentative protosegment for each correspondence set, and WordCorr organizes them accordingly. The linguist may look at the complete register of correspondence sets including the Residue sets, and at the annotated data they are derived from, at any stage of the tabulation.
Refine allows the linguist to change how the results are arranged: the assignment of correspondence sets to clusters, and even to protosegments and environments, from the place where they were registered on tabulation to a more appropriate place. This is important in working with incomplete correspondence sets, which may be indeterminate as to where they fit the analysis best, and in filtering out sets resulting from borrowings or internal analogies by moving them into Residue. It also constructs presentations suitable as an appendix to a comparative monograph, containing a listing as complete as the investigator wants to make it of detailed evidence for each conclusion the linguist has reached, starting with the most convincing evidence.
An auxiliary module called Manage Views allows investigators to define and change their views of the common data. It also allows a Team Leader to set up a project and to define and modify collections of data, each with its own list of entries and coverage of speech varieties, that become available to the whole team.
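As a rough illustration of what Tabulate produces, the sketch below turns one group of aligned forms into correspondence sets, one per position. It assumes the annotation step has already aligned the forms to equal length. The function name, the "-" placeholder for a variety with no comparable form, and the toy forms for varieties A, B, and C are all made up for this example and are not taken from WordCorr itself.

```python
from typing import Dict, List, Optional, Tuple

IGNORE = "-"  # assumed placeholder for a variety with no comparable form


def tabulate_group(
    aligned: Dict[str, Optional[List[str]]],
    order: List[str],
) -> List[Tuple[str, ...]]:
    """Return one correspondence set per aligned position.

    `aligned` maps a variety name to its aligned segments, or to None
    when that variety has no comparable form in this group.
    `order` is the order in which the view presents the varieties.
    """
    lengths = {len(segs) for segs in aligned.values() if segs is not None}
    if len(lengths) > 1:
        raise ValueError("aligned forms must all have the same length")
    if not lengths:
        return []
    width = lengths.pop()
    sets = []
    for pos in range(width):
        row = []
        for variety in order:
            segs = aligned.get(variety)
            row.append(segs[pos] if segs is not None else IGNORE)
        sets.append(tuple(row))
    return sets


# Toy data, invented for illustration only.
group = {"A": ["p", "a", "t", "a"], "B": ["b", "a", "d", "a"], "C": None}
correspondence_sets = tabulate_group(group, order=["A", "B", "C"])
```

Each tuple lists the segments in the view's order, so the first set here pairs the initial segment of A with the initial segment of B and marks C as missing.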


Database: Standing behind these functions is a relational database. It has four tightly interlinked parts, called Management, Data, Tabulation, and Results.

Management tables organize the database as a set of projects. Each project has a team of one or more members associated with it, in the roles already described (a one-person team using WordCorr independently is still treated as a project with the solitary investigator as its Team Leader, Data Keeper, and Investigator; this allows any one-person project to be expanded later to include more people). The Management information relates the Data associated with each team to the views and the Tabulation performed by each member of the team, and to the Results.
Data tables are common to all members of a project. Each project inputs or imports data in the form of one or more collections, each of which holds a set of isomorphic word lists. Each list consists of entries in a particular order that contain vocabulary items for each speech variety covered by the list. ("Variety" is a neutral term that bypasses questions of languages and dialects, because those terms pertain to the results of the investigation more properly than to its input.) Typically all the speech varieties in a collection belong to a single language family; but a single collection is also a useful way to approach an area in which it has not yet been demonstrated whether all the speech varieties are even remotely related or not, in order to highlight the absence of systematically demonstrable relations among some varieties and clear relations among others.
Tabulations are unique to each view of each member of a project. Each member is given a standard view of each collection in the project. This view, always identified as the "Complete" view, contains a list of all the varieties for which that collection contains data. Investigators cannot change the varieties in that master list. Each member can, however, set the order in which the varieties are presented. Each member also sets a threshold, the percentage of varieties in the view for which data must be present in a group in order to make a tabulation. A low threshold allows the investigator to look at the data in a detailed way, while a higher threshold shows only the stronger patterns of correspondence, as is sometimes useful early in an investigation.
The investigator may create views derived from the Complete view by copying it and removing from the copy the varieties that are not relevant to that view, then arranging the remaining varieties in whatever order is convenient. For example, if a project were to compile a collection of word lists representing the more than 1,200 Austronesian languages scattered from Hawai'i and Easter Island to Madagascar, one investigator might work with a view that contains only the languages of Micronesia and another with a view embracing the Austronesian languages of Taiwan.
Before tabulating, the investigator annotates the data for a particular collection and view by tagging the data for each entry, as in WordSurv, for comparability or noncomparability among varieties. Alignment indicates potential dropouts or insertions, notes metatheses and semantic shifts, and stores observations about particular forms or annotations that may be useful later in the analysis.
When the investigator invokes the Tabulate function on the annotated data, WordCorr goes through each data group for every entry and presents the data for the varieties in that group in the view's order. It produces a correspondence set for each position in the data for an entry. Missing data, whether because data collection is incomplete or because some of the forms are tagged as not comparable with each other, are shown by a special Ignore symbol.
Results of tabulations are also unique to each view. The investigator gives each correspondence set an environment symbol that characterizes its relationship to nearby segment types, of a kind linguists conventionally use, such as "V_V" for an intervocalic segment. The investigator tentatively assigns the set to a protosegment that purports to represent a segment at the stage of language history before the varieties in the view diverged from each other.
Different correspondence sets may represent the same protosegment in different environments; different correspondence sets in the same environment belong to different protosegments. The string of protosegment symbols from each position in the tabulation constitutes an initial reconstruction of the possible precursor of the tag group as a whole.
This version of WordCorr deals only with segmental phonology or with suprasegmental features that can be transcribed as if they were segments. Metrical and multitiered comparisons are on the list for a future version. The justification for not starting with them is that most comparative research is still strongly segment oriented, and sometimes involves data that were collected long before other insights into phonology, or even phonology itself, were thought of. The immediate utility of the tool to many comparative linguists makes it reasonable to put off implementing more flexible models until the simpler one is working.
Correspondence sets that differ only in terms of data that are missing or to be ignored, but that occur in the same environment and are assigned to the same protosegment, are put together in a cluster of sets. The sets in a cluster that contain data from many varieties are given greater weight than sets with data from fewer varieties. Each set is given a citation for the entry, tag group, and position that it comes from. When another identical set comes along in the same environment, only the citation of the new set needs to be added to the one already registered. Being able to identify multiple attestations for some sets is useful when it comes to showing the strength of evidence for hypotheses about how divergence may have taken place.
Each view starts out with an unerasable pseudoprotosegment called Residue, to hold clusters the investigator isn't ready to form a conclusion about, or that appear to represent borrowings from other languages or internal analogies and therefore don't fit any systematic pattern. That way no scrap of data can get lost. There is also an unerasable pseudoprotosegment called Process, where segments that arise from processes like epenthesis can go.
The Refine function operates on the tables of results in the database. It allows a correspondence set to be moved into another cluster if its original assignment is indeterminate due to incomplete data. A cluster can be moved into a different protosegment or merged with another cluster whose environment is considered equivalent to the environment of the cluster being merged. The environment of a set or a cluster may be respecified. New protosegments may be added and filled in, and protosegment symbols can be changed. The total results may be examined and printed out, including results that present the most convincing reconstructed forms first as calibrated by a variation on Frantz's criterion (1970) accompanied by the unannotated data from which they were derived.
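The threshold and the clustering of correspondence sets described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than WordCorr's implementation: the class and function names, the "-" spelling of the Ignore symbol, the citation layout (entry number, tag group, position), and the sample sets are all hypothetical. A new set joins the first cluster whose members it is compatible with, which is a simplification.

```python
from collections import defaultdict

IGNORE = "-"  # assumed spelling of the Ignore symbol


def meets_threshold(group_forms, threshold_pct):
    """True if enough varieties in the view have data in this group.

    `group_forms` maps each variety in the view to its form, or to None.
    """
    if not group_forms:
        return False
    present = sum(1 for form in group_forms.values() if form is not None)
    return 100 * present / len(group_forms) >= threshold_pct


def compatible(a, b):
    """True if two sets differ only where one of them has IGNORE."""
    return len(a) == len(b) and all(
        x == y or IGNORE in (x, y) for x, y in zip(a, b)
    )


def weight(cset):
    """Number of varieties that actually contribute data to the set."""
    return sum(1 for seg in cset if seg != IGNORE)


class Register:
    """Clusters of correspondence sets, keyed by protosegment and environment."""

    def __init__(self):
        # (protosegment, environment) -> list of clusters; each cluster
        # maps a correspondence set to the list of its citations.
        self.clusters = defaultdict(list)

    def add(self, proto, env, cset, citation):
        for cluster in self.clusters[(proto, env)]:
            if all(compatible(cset, other) for other in cluster):
                # An identical set only gains a citation; a new
                # compatible set becomes a new member of the cluster.
                cluster.setdefault(cset, []).append(citation)
                return
        self.clusters[(proto, env)].append({cset: [citation]})


# Invented sample sets; citations are (entry, tag group, position).
reg = Register()
reg.add("*p", "V_V", ("p", "p", "b"), (1, "a", 2))
reg.add("*p", "V_V", ("p", "-", "b"), (2, "a", 2))
reg.add("*p", "V_V", ("p", "p", "b"), (3, "a", 2))
reg.add("*p", "V_V", ("f", "p", "b"), (4, "a", 2))
```

In this run the first three sets share one cluster, the repeated set carries two citations, and the fourth set starts a cluster of its own because it conflicts on real data rather than on missing data.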

There is a technical dilemma that needs to be worked through with the designer as soon as the work begins. The most efficient way to make something as complex as WordCorr available via the Internet is to have most of the intelligence residing on the server computer that manages the master database, then use an ordinary browser for the World Wide Web with no special components to be the interface to all the users. This is the way most bank-by-Internet and airline reservation operations are set up. But if scaling the application down for the standalone mode involves putting a partial copy of the master database on the user's laptop, this may create licensing problems, because the database manager itself, plus a number of component pieces of software that the software developer uses to build the application, operate under license, and it won't do to have unlicensed copies running around the world on standalone laptops.

A second alternative would be to put the intelligence into the laptop (in computerese, "the client side") using only tools such as Visual Basic Studio that produce fully licensed runtime-only software whose internal details are not accessible to any user, and can be protected under a standard agreement not to decompile. This is the same strategy used for common commercial software applications like TurboTax, used to generate and file U.S. income tax returns. There could be dual database interface modules configured either to run over the Internet on a full scale database manager with all the server side bells and whistles, or to run on a simple database system installed in the same computer. A portable extract from the main database would allow standalone users in the field, who at this stage are the ones most interested in using WordCorr and will undoubtedly be the main testers for it, to have exactly the same application at their fingertips as the Internet users have. Runtime-only copies of the application would be downloadable from the Internet by each potential user. The catch is that the serendipity factor by which some potential linguists among the general public discover linguistics while browsing the Internet is lost, because only those who already consider themselves capable of using a comparative linguistics tool are likely to consider downloading one.
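The idea of dual database interface modules can be sketched as one interface with interchangeable back ends. The sketch below is an assumption-laden illustration only: the class names, method names, and table layout are invented, SQLite stands in for the "simple database system installed in the same computer", and the server-side module is left as an unimplemented placeholder.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, List


class WordStore(ABC):
    """Hypothetical interface that the application logic is written against."""

    @abstractmethod
    def add_form(self, variety: str, entry: int, form: str) -> None: ...

    @abstractmethod
    def forms_for_entry(self, entry: int) -> Dict[str, str]: ...


class LocalStore(WordStore):
    """Standalone back end: a small database on the same machine."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS form ("
            "variety TEXT, entry INTEGER, form TEXT, "
            "PRIMARY KEY (variety, entry))"
        )

    def add_form(self, variety: str, entry: int, form: str) -> None:
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO form VALUES (?, ?, ?)",
                (variety, entry, form),
            )

    def forms_for_entry(self, entry: int) -> Dict[str, str]:
        rows = self.db.execute(
            "SELECT variety, form FROM form WHERE entry = ?", (entry,)
        )
        return dict(rows.fetchall())


class RemoteStore(WordStore):
    """Placeholder for a module that would talk to the master database."""

    def add_form(self, variety: str, entry: int, form: str) -> None:
        raise NotImplementedError("server protocol not specified here")

    def forms_for_entry(self, entry: int) -> Dict[str, str]:
        raise NotImplementedError("server protocol not specified here")


def varieties_with_data(store: WordStore, entry: int) -> List[str]:
    """Application logic that works the same with either back end."""
    return sorted(store.forms_for_entry(entry))


# Toy forms, invented for illustration only.
store = LocalStore()
store.add_form("A", 1, "pata")
store.add_form("B", 1, "bada")
```

Because `varieties_with_data` sees only the interface, the same application code could run in the field against `LocalStore` and over the Internet against a finished server-side module.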

A third option is to concentrate the first part of the project on producing a client side version of WordCorr with a local database on the same laptop, as described under the second option, not only to get something into the hands of linguists as soon as possible but to get feedback on the design from people already doing field work. Then, with a firmer grip on the functioning of the tool, the next part of the project would be devoted to transferring the same application logic to the server side, so that an Internet user would not need to download any special software. This development route could be tricky because of having to make sure that everything works in both configurations, and that information in the database can be passed readily from one configuration to another. It is more expensive, because it is like developing an application and a half. Unless this third path hits unforeseen snags, it is the one the project will follow, because

It will keep the project from having to ignore the needs of the linguists headed for field work, who need the standalone configuration to do the original discovery work that will most enhance general knowledge of the world's languages and their relationships, and
It will keep the project from having to ignore educational users and guest users curious about linguistics, both of whom rely on only a browser on the client side.


Preservation and sharing: Two means of ensuring the preservation of data used by each project will be put in place immediately. First, scheduled backups will be made regularly and a copy kept off site. Second, changes in the common data whose owners have agreed to archiving will be transmitted regularly in XML archival form to a major long term archive of linguistic data in the United States, the Open Language Archives Community (OLAC, http://www.language-archives.org), related to the NSF-sponsored Electronic Metastructure for Endangered Languages Data (EMELD) at the Linguist List Web site http://www.linguistlist.org. The WordCorr project will act as a resource creator in relation to OLAC.

The archive will also be the primary means of sharing data, since the XML form allows the recipient to shape the data received to local requirements. This is even true for data shared between WordCorr collections whose varieties overlap, since different collections are assumed to have different internal layouts.
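To make the XML archival form concrete, here is a minimal sketch of serializing a small collection of word lists. The element and attribute names (`collection`, `wordlist`, `entry`, `number`, `gloss`) are hypothetical and do not represent an OLAC schema or WordCorr's actual export format; the sample forms are invented.

```python
import xml.etree.ElementTree as ET


def collection_to_xml(name, word_lists):
    """Serialize a collection to an XML string.

    `word_lists` maps a variety name to a dict of
    entry number -> (gloss, form).
    """
    root = ET.Element("collection", name=name)
    for variety, entries in word_lists.items():
        wl = ET.SubElement(root, "wordlist", variety=variety)
        for number, (gloss, form) in sorted(entries.items()):
            entry = ET.SubElement(
                wl, "entry", number=str(number), gloss=gloss
            )
            entry.text = form
    return ET.tostring(root, encoding="unicode")


# Toy collection, invented for illustration only.
xml_text = collection_to_xml(
    "Demo",
    {
        "A": {1: ("hand", "pata"), 2: ("eye", "mata")},
        "B": {1: ("hand", "bada")},
    },
)
```

A recipient could parse the result with any XML tool and reshape it to a different internal layout, which is the property the archive relies on for sharing.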


Documentation: An overview of the WordCorr system similar to what is on this page but more modularized, plus detailed instructions explaining the functions associated with each screen, will be available in both standalone and Internet mode. In addition, the code will be essentially self-documented, with ample indication of intent and implementation strategy incorporated in each procedure so that the implementers of future developments will not have to guess about what they are dealing with.


References:

Agard, Frederick B. A course in Romance linguistics, Volume 2: A diachronic view. Washington DC: Georgetown University Press, 1984.

Campbell, Lyle. Historical Linguistics: An Introduction. Edinburgh: Edinburgh University Press, 1998. Reissued at Cambridge: The MIT Press, 1999.

Frantz, Donald G. 'A PL/1 program to assist the comparative linguist.' Communications of the ACM 13:6.353-356, 1970.

Grimes, Barbara F., ed. Ethnologue: Languages of the World, fourteenth edition. Dallas: Summer Institute of Linguistics, 2000. http://www.ethnologue.com.

Grimes, Joseph E. Language Survey Reference Guide. Dallas: Summer Institute of Linguistics, 1995a. In the LinguaLinks Library under Sociolinguistics, available through http://www.ethnologue.com/lingualinks.asp.

Grimes, Joseph E. 'Language endangerment in the Pacific.' Oceanic Linguistics 34:1.1-12, 1995b.

Grimes, Joseph E. and Frederick B. Agard. 'Linguistic divergence in Romance.' Language 35:598-604, 1959.

Grimes, Joseph E. and Barbara F. Grimes. Ethnologue Language Family Index. Dallas: Summer Institute of Linguistics, 1993, 1996, 2000. http://www.ethnologue.com.

Grimes, Barbara F., Joseph E. Grimes, Malcolm Ross, Charles E. Grimes, and Darrell T. Tryon. 'Listing of Austronesian languages.' In Darrell T. Tryon, ed., Comparative Austronesian Dictionary: An Introduction to Austronesian Studies, Part 1: Fascicle 1, pp. 121-279, 1995c.

Kurebito, Megumi, ed. Comparative Basic Vocabulary of the Chukchee-Kamchatkan Language Family: 1 (Endangered Languages of the Pacific Rim, Series A2-011). Suita, Japan: Osaka Gakuin University, 2001.

Weber, David J., Stephen R. McConnel, H. Andrew Black, and Alan Buseman. STAMP: A Tool for Dialect Adaptation. Dallas: Summer Institute of Linguistics, 1990. [The key phrase has since been changed to the more accurate Related Language Adaptation, a low-level but practical form of machine translation suitable for closely related varieties.]

Wimbish, John S. WordSurv: A Program for Analyzing Language Survey Word Lists (Occasional Publications in Academic Computing, 13). Dallas: Summer Institute of Linguistics, 1989. In the LinguaLinks Library under Sociolinguistics, available through http://www.ethnologue.com/lingualinks.asp.



For problems or questions regarding this web site, contact khamasak@users.sourceforge.net.