TUTORIAL

Lexicon
- Basic searching (the search template)
- Word searches in Nahuatl
- Word searches in English
- Regular expression searches

Character classes
VLN and PRN
Sounds
Selecting specific fields for display
Data cleansing

Grammar
Encyclopedia

Lexicon

Structure

Searching

Basic searching

Basic searches are accomplished by specifying variables in each of the three columns of the search template. Up to five rows may be specified using the logical expressions and and or. Thus one can search for words that begin with cho: and end with ka through the use of two rows joined by and. There are several things to remember in any search:

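The field column lists in text the fields that are searched with any submission. Note that often more than one field is searched. For example, when searching for Ameyaltepec word, three fields are searched: lxa (the lexical headword entry), lxaa (an alternate pronunciation of the headword entry), and lxap (a practical orthography of the headword entry).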

Word searching in Nahuatl

Word searching in English

At this time, given the immense amount of time it would involve, there are no simple glosses or single-word definitions for Nahuatl words. A search for the Nahuatl equivalent of any English word must be conducted in an English sense field (/sea, /seo, /seao) with the logical operator contains word. What this does is look in the various sense fields for a character string as a word (that is, preceded by a space and followed by a space or punctuation). A user could, therefore, search:

English sense—contains word—cry

This will return 5 hits, including

yo:ltepistik : 1 : to be tough of character; to be hard-hearted; to be tenacious; to be able to endure adversity (e.g., a person who does not cry or break down when scolded or beaten, or who shows little tendency to back down when their compassion is appealed to)

Clearly yo:ltepistik is not what most users would expect; it was listed simply because cry is contained in the definition: person who does not cry or break down when scolded or beaten.
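The behavior of contains word can be approximated outside the search engine. What follows is a minimal Python sketch, not the engine's own code: the function name contains_word and the sample sense strings are invented for illustration, and the treatment of a word at the very start or end of a field is an assumption.

```python
import re

def contains_word(sense: str, word: str) -> bool:
    """Return True if `word` occurs in `sense` as a whole word.

    Assumption: a word is bounded on the left by the start of the field
    or a space, and on the right by a space, punctuation, or the end of
    the field. The real search engine may draw the boundaries differently.
    """
    pattern = r"(^| )" + re.escape(word) + r"([ .,;:!?()]|$)"
    return re.search(pattern, sense) is not None

# Invented sense strings, for illustration only.
senses = [
    "to cry",
    "a person who does not cry or break down when scolded",
    "to decry something",   # 'cry' is only part of a longer word
    "crying loudly",        # likewise
]
for s in senses:
    print(contains_word(s, "cry"), "|", s)
```

As in the yo:ltepistik example, the second string is a hit even though cry is incidental to the definition.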

There are, however, reasons for not writing a keyword search or simple English word finder function. The first is simply that of resources. Given that many entries have to be redefined, elaborated, and otherwise checked, the implications of creating a word finder list at this time, with a dictionary in process, are that other tasks, which are probably more urgent, would have to be neglected. A second reason is that many Nahuatl words are incapable of being summarized in English to a degree that would permit searches. Finally, there would be a great chance of leaving basic English words out.

The benefit of the present system is that in searching for any word, users are presented with a more complete semantic domain. One only needs to search for English sense—contains word—happy to see these advantages. Moreover, clever use of the multiple search functions should enable users to limit searches, with a little ingenuity. For example, if one searches for the word order, hundreds of hits are given, since any definition with the phrase in order to or in order that would be pulled up. Thus one could simply search for English sense—contains word—order and English sense—does not contain sequence—in order. This yields 8 hits; by further specifying Part of speech—contains—N only two results appear.
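The logic of joining two rows with and can be sketched in Python. This is only an illustration under stated assumptions: the function name wanted and the sense strings are invented, and the word test is a rough stand-in for the contains word operator.

```python
import re

def wanted(sense: str) -> bool:
    """Row 1: contains the word 'order'. Row 2: does not contain 'in order'."""
    has_order = re.search(r"(^| )order([ .,;:!?()]|$)", sense) is not None
    has_in_order = "in order" in sense
    return has_order and not has_in_order

# Invented sense strings, for illustration only.
senses = [
    "to give an order",
    "to place in order",
    "to do something in order to be seen",
    "orderly",
]
print([s for s in senses if wanted(s)])
```

Note that the second row also discards a definition such as to place in order, so a narrowed search can miss relevant entries.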

Other ways of limiting searches involve placing limits on the size of the sense definition. For example, to find the word for and one can search English sense—contains word—and and English sense—regular expression—^[a-z]{1,15}$, i.e., that the total length of the sense field is between 1 and 15 letter characters.

Regular expression searches

The NLE : Lexicon search engine is based on the submission of regular expression queries to the MySQL database. (A regular expression is a series of symbols used to represent or describe a given string of text.) The regular expression submitted for any query is displayed at the bottom of the search results page. Thus if one submits Ameyaltepec word—begins with—cho:ka, the regexp submitted (and displayed at the foot of the results page) is as follows:

(lxa_REGEXP_'^(%?cho:ka)'_OR_lxa_REGEXP_'%cho:ka[a-zA-Z]*%?'
_OR_lxaa_REGEXP_'^(%?cho:ka)'_OR_lxa_REGEXP_'%cho:ka[a-zA-Z]*%?'
_OR_lxap_REGEXP_'^(%?cho:ka)'_OR_lxa_REGEXP_'%cho:ka[a-zA-Z]*%?'
)__ORDER_BY_alpha

The begins with part of the query is represented by the ^ symbol, which signifies start of line. If the query is changed to Ameyaltepec word—ends with—cho:ka, the regexp submitted (as displayed) is as follows:

(lxa_REGEXP_'(cho:ka)$'_OR_lxa_REGEXP_'%[a-zA-Z]*(cho:ka%?)$'
_OR_lxaa_REGEXP_'(cho:ka)$'_OR_lxa_REGEXP_'%[a-zA-Z]*(cho:ka%?)$'
_OR_lxap_REGEXP_'(cho:ka)$'_OR_lxa_REGEXP_'%[a-zA-Z]*(cho:ka%?)$'
)__ORDER_BY_alpha

In this regexp the $ symbol signifies end of line (though literally it means up to a newline character).
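The way a row of the search template becomes a regular expression can be sketched as follows. This is a simplified Python illustration, not the engine's code: the function build_pattern is hypothetical, the alternative beginning with % that appears in the displayed queries is left out, and the search term is assumed to contain no regular expression symbols. Apart from cho:ka, the sample strings are invented.

```python
import re

def build_pattern(operator: str, term: str) -> str:
    """Turn an operator and a search term into a regular expression."""
    if operator == "begins with":
        return "^(" + term + ")"      # ^ anchors the match at the start
    if operator == "ends with":
        return "(" + term + ")$"      # $ anchors the match at the end
    if operator == "contains":
        return term                   # no anchor: match anywhere
    raise ValueError("unknown operator: " + operator)

words = ["cho:ka", "cho:kaltia", "kicho:ka"]   # sample strings only

for operator in ("begins with", "ends with"):
    pattern = build_pattern(operator, "cho:ka")
    print(operator, pattern, [w for w in words if re.search(pattern, w)])
```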

The search template, therefore, converts each column (e.g., Ameyaltepec word, ends with, cho:ka) into a regexp. The expression Ameyaltepec word is set up to prompt a search in three fields: lxa (the lexical headword entry), lxaa (an alternate pronunciation of the headword entry), and lxap (a practical orthography of the headword entry). The search is actually carried out in fields that have been stripped of diacritics (e.g., accents); however, a corresponding display field which has the diacritics is maintained in the database for online display. Thus the MySQL database (which is how the information is stored) has a field (or column) named lxa, which is the Ameyaltepec headword stripped of diacritics, as well as a field named lxa_d, which is the original field with all the diacritics. The search is on the stripped-down field (lxa), the display is of the original field (lxa_d).
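The split between a stripped search field and a display field can be illustrated with a short Python sketch. The field names lxa and lxa_d come from the description above; the function strip_diacritics, the use of Unicode decomposition, and the sample headwords are assumptions made for the example, and the database may have been prepared quite differently.

```python
import re
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (e.g., accents) but keep the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Invented display forms; \u00f3 is o with an acute accent.
display_forms = ["ch\u00f3:ka", "cho:kaltia", "t\u00f3ka"]

# Each record keeps both versions: lxa for searching, lxa_d for display.
records = [{"lxa": strip_diacritics(d), "lxa_d": d} for d in display_forms]

# Search on the stripped field, show the original field.
for record in records:
    if re.search("^cho:ka", record["lxa"]):
        print(record["lxa_d"])
```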

Some users might want to use regular expressions in their queries. They can do this by selecting the fields to search on in the pulldown menu of the first column of the search template, selecting regular expression from the second column, and then typing in a regular expression in the third column. For example, if users want to search for all words that begin with /t/ or /k/ followed by a long /a:/ they have two options. The first would be to use two rows of the search engine joined by or:

Ameyaltepec word—begins with—ta:

or

Ameyaltepec word—begins with—ka:

However, the same result can be accomplished with a regexp. The user could search:

Ameyaltepec word—regular expression—^[tk]a:

In this case the user-entered regexp might not provide much of an advantage over letting the search engine construct the same query. However, in other cases the possibility of using regular expressions is a powerful tool.
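That the two approaches give the same result can be checked with a small Python sketch. The word list is a set of sample strings chosen only to exercise the pattern, not output from the database.

```python
import re

words = ["ta:tah", "ka:wa", "tepe:tl", "kalli", "a:tl"]   # sample strings only

# Two rows joined by "or": begins with ta: / begins with ka:
two_rows = [w for w in words if w.startswith("ta:") or w.startswith("ka:")]

# One row with a user-entered regular expression.
one_regexp = [w for w in words if re.search(r"^[tk]a:", w)]

print(two_rows)
print(one_regexp)
```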

What follows is a brief explanation of the most important symbols used in regular expressions:

Symbol	Meaning	Example	Explanation
^	begins with	^k	searches for all fields that begin with /k/
$	ends with	k$	searches for fields that end with /k/
*	preceding character occurs zero or more times in a row	^ka*	searches for fields that begin with /k/ followed by zero or more /a/'s
+	preceding character occurs one or more times in a row	^ka+	searches for fields that begin with /k/ followed by one or more /a/'s
?	preceding character may or may not exist	^ka:?	searches for fields that begin with /k/ followed by /a/ that may or may not be long (i.e., may or may not have a colon after it)
{#}	may be used with a number inside to indicate the exact number of repetitions of the preceding character	^ka{2}	searches for fields that begin with /k/ followed by 2 /a/'s
{#,}	may be used with a number inside to indicate at least the number of repetitions of the preceding character	^ka{2,}	searches for fields that begin with /k/ followed by at least 2 /a/'s
{#,#}	may be used with two numbers inside to indicate the range of repetitions of the preceding character	^ka{2,4}	searches for fields that begin with /k/ followed by between 2 and 4 /a/'s
[]	used to match a string that contains any of the characters or digits in the brackets	^k[ie]:	searches for fields that begin with /k/ followed by a long /i:/ or a long /e:/
[^]	used to match for the nonpresence of the characters in the brackets	^k[^ie]	searches for words that begin with /k/ and are not followed by /i/ or /e/
.	matches any single character (letters, digits, and punctuation alike)	^..k	searches for fields whose third character is /k/
-	when used within brackets searches for any character within the range expressed by the characters before and after the dash	^[a-c]	searches for fields that begin with /a/, /b/, or /c/ (this is equivalent to ^[abc] as well as ^(a|b|c))
|	used to express "or"	^(cho:ka|to:ka)	searches for fields that begin with cho:ka or to:ka; note that the expression must be included within parentheses
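Several of these symbols can be tried out with Python's re module, which handles them in the same way for simple patterns like these (the two regular expression dialects differ in other details, so this is only a rough stand-in for the MySQL search). Repetition counts are written with braces, as in the {1,15} example given earlier. The sample strings are invented to exercise the patterns.

```python
import re

samples = ["k", "ka", "kaa", "kaaa", "ka:", "ki:", "ke", "ko", "tak"]

def hits(pattern: str) -> list:
    """Return the sample strings in which `pattern` is found."""
    return [s for s in samples if re.search(pattern, s)]

for pattern in ["^ka*", "^ka+", "^ka{2}", "k$", "^k[ie]:", "^k[^ie]", "^..k"]:
    print(f"{pattern:10} {hits(pattern)}")
```

Note that ^k[^ie] does not find the bare string k: the bracket expression must match some character, so a field consisting of /k/ alone is not returned.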

Examples of regular expression searches

To look for the words melancholy, sad, and unhappy anywhere in the English sense field:

melancholy|sad[^a-z]|unhappy

This searches for any of the three preceding strings, but makes sure that sad is not followed by any other letter. This eliminates not only saddle, but also sadness and sadden.
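The effect of this expression can be tried in Python on a few invented sense strings (they are not taken from the dictionary):

```python
import re

pattern = re.compile(r"melancholy|sad[^a-z]|unhappy")

# Invented sense strings, for illustration only.
senses = [
    "to be sad, to grieve",
    "to saddle a horse",
    "to feel sadness",
    "to become unhappy",
    "to be sad",
]
found = [s for s in senses if pattern.search(s)]
print(found)
```

The last string is not found: [^a-z] must match some character, so sad at the very end of a field is missed by this expression.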

One could, therefore, search for

melanchol|sad([^d]|d[^l])|unhapp[iy]

This finds every string that contains melanchol (and so includes melancholic). It also finds all sequences of sad that are not followed by dl, so that sadness and sadden are found but saddle is not. And, finally, it will find unhappy and unhappiness.

Often one must think in terms of the Nahuatl concepts. For example, brother and sister are not lexically expressed in Nahuatl; rather, there is a single term ikni:wtli (Am) that conveys the sense of sibling. In searching it may often be wise to include a string of related expressions:

brother|sister|sibling

Character classes

Character classes comprise a set of characters that are represented by a single, unique symbol. This enables the user to conduct searches that yield results for a variety of conditions. The database is always queried through a regular expression, but convenient shortcuts may be established by selecting a single symbol to represent the entire class. For example, if one wanted to search for any word that started with a sequence t-vowel-t one would write: ^t[aeiou]:?t as the regular expression. The ^ indicates 'beginning of field,' the [ ] indicate any value included within the brackets (in this case any of the five Nahuatl vowels), the colon is used in the present orthography for vowel length, and the ? indicates an optional preceding character. Finally, the t ends the sequence.

A predefined character class, symbolized by V, has been selected to represent any vowel. This symbol inserts the regular expression [aeiou]:? into any search string. Thus, a user who wants to search for any initial sequence of t-vowel-t could simply write tVt. At present there are 3 predefined character sets that have been hard coded into the program:

C = any consonant

V = any vowel, long or short

S = semi-vowels, i.e., /w/ and /y/
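The substitution of a class symbol by its regular expression can be sketched in Python. The expansion of V is the one given above; the expansion shown for S is an assumption about its form, C is left out because its expansion is not given here, and the function expand and the sample words are invented for the example.

```python
import re

CLASSES = {
    "V": "[aeiou]:?",   # any vowel, long or short (as stated above)
    "S": "[wy]",        # semi-vowels /w/ and /y/ (assumed form)
}

def expand(query: str, classes: dict) -> str:
    """Replace every class symbol in `query` by its regular expression."""
    return "".join(classes.get(ch, ch) for ch in query)

pattern = expand("^tVt", CLASSES)
print(pattern)

for word in ["ta:ta", "teti", "tlat"]:   # sample strings only
    print(word, bool(re.search(pattern, word)))
```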

In addition, users may predefine their own character sets. To do this they must select a character (capital letters are used) and then set it as equivalent to a regular expression. For most classes this will involve simply the symbol, the equal sign, and a series of characters that are to be included in the class. For example, one might want to establish a symbol for front vowel (/i/ and /e/), regardless of length. This would be done within the box in the search form as follows:

F = [ie]:?

In this case it is not necessary to use the parentheses, though one could write

F = ([ie]:?)

Parentheses are necessary, however, when the class lists alternatives separated by the pipe symbol, since they ensure that all the alternatives are kept together as one unit of the regular expression. Thus if one wanted a symbol, e.g., T, to represent all alveolar stops and affricates, one would write

T = (t|tl|ts)
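Why the parentheses matter can be seen by expanding the same query with and without them. This is a Python sketch under assumptions: the functions parse_definition and expand are hypothetical, and the sample words are strings chosen only to show the difference.

```python
import re

def parse_definition(line: str) -> tuple:
    """Split a definition such as 'F = [ie]:?' into (symbol, expansion)."""
    symbol, _, expansion = line.partition("=")
    symbol, expansion = symbol.strip(), expansion.strip()
    if len(symbol) != 1 or not symbol.isupper():
        raise ValueError("a class symbol must be a single capital letter")
    return symbol, expansion

def expand(query: str, classes: dict) -> str:
    """Replace every class symbol in `query` by its regular expression."""
    return "".join(classes.get(ch, ch) for ch in query)

good = dict([parse_definition("T = (t|tl|ts)")])
bad = dict([parse_definition("T = t|tl|ts")])

print(expand("^Ta", good))   # ^(t|tl|ts)a
print(expand("^Ta", bad))    # ^t|tl|tsa  (three unrelated alternatives)

for word in ["tlakatl", "mitla"]:   # sample strings only
    print(word,
          bool(re.search(expand("^Ta", good), word)),
          bool(re.search(expand("^Ta", bad), word)))
```

Without the parentheses the expression means "begins with t, or contains tl, or contains tsa", so a string such as mitla is matched even though it does not begin with an alveolar stop or affricate.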

Note that multiple character classes cannot be enclosed in square brackets. Instead the pipe symbol | should be used within parentheses to establish the possible variables in the regular expression. For example, if one wishes to find all the words that begin with /kw/ in a closed (consonant final) syllable, one cannot use

^kw[aeiou]:?[CS][CS]

rather, one must write

^kw[aeiou]:?(C|S)(C|S)

The reason for the above limitation is that the character classes

VLN and PRN

These two buttons on the side of the search form serve as neutralization switches that are designed to make searches easier for users unfamiliar with the location of vowel length distinctions in Nahuatl or with the specific phonological rules of Ameyaltepec and Oapan.

VLN: By checking this box on any line, vowel length distinctions are "neutralized" for all string searches in the box to the left. The regular expression submitted to the MySQL database has :? inserted after every vowel (even word-final vowels). The display is the same as always, with vowel length displayed. The effect of the VLN function can be seen at the bottom of the results page. If one submits Nahuatl word—begins with—toka with VLN checked, the regular expression submitted to query the relevant fields is ^(%?to:?ka:?).
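The rewriting performed by VLN can be sketched in a few lines of Python. The function name neutralize_vowel_length is hypothetical, the % alternative that appears in the submitted query is left out, and the sample strings are chosen only to show the effect.

```python
import re

def neutralize_vowel_length(term: str) -> str:
    """Make the length mark optional after every vowel in `term`.

    A colon already typed after a vowel is absorbed, so 'toka' and
    'to:ka' give the same result.
    """
    return re.sub(r"([aeiou]):?", r"\1:?", term)

pattern = "^(" + neutralize_vowel_length("toka") + ")"
print(pattern)   # ^(to:?ka:?)

for word in ["toka", "to:ka", "toka:", "teka"]:   # sample strings only
    print(word, bool(re.search(pattern, word)))
```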

Sound files

At present sound files are linked to most headwords. In the future illustrative sentences will also have accompanying files. The headword sound files may be accessed by clicking on one of the two icons after the headwords. The diamond-shaped icon opens an mp3 file; the musical note icon opens a downsampled wave file.

Selecting specific fields for display

If advanced users wish to select specific fields for display they may do so by directly accessing the following website: http://www.ldc.upenn.edu/hyperlex2/nahuatl/main_search.php4?user_lang=english&entry_template=generic

Data cleansing

Cross-referenced or XML-tagged fields can be checked for broken links at http://www.ldc.upenn.edu/hyperlex2/nahuatl/cleanse_form.html

The first row allows a user to check that the contents of a field such as \xvca, which should link to an \lxa field, does in fact link. Since \xvca is used only when the link is valid for the Ameyaltepec headword (\lxa) but not valid for the Oapan headword (\lxo) the cleansing parameters should be set as "xvca matches lxa should not match lxo."
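The kind of check described here can be sketched in Python on a toy data set. Everything except the field names xvca, lxa, and lxo is an assumption: the entries are invented placeholder strings, the function find_bad_links is hypothetical, and the reading of "should not match lxo" is one plausible interpretation, not a description of how the cleansing form works internally.

```python
# Toy entries: each has an Ameyaltepec headword (lxa), an Oapan headword
# (lxo), and a list of cross-references (xvca) that should point to lxa.
entries = [
    {"lxa": "aaa", "lxo": "aab", "xvca": ["bbb"]},
    {"lxa": "bbb", "lxo": "bbc", "xvca": []},
    {"lxa": "ccc", "lxo": "ccd", "xvca": ["zzz"]},   # broken link
    {"lxa": "ddd", "lxo": "dde", "xvca": ["eee"]},   # also an lxo form
    {"lxa": "eee", "lxo": "eee", "xvca": []},
]

def find_bad_links(entries: list) -> list:
    """Report xvca values that match no lxa, or that also match an lxo."""
    lxa_values = {e["lxa"] for e in entries}
    lxo_values = {e["lxo"] for e in entries}
    problems = []
    for e in entries:
        for target in e["xvca"]:
            if target not in lxa_values:
                problems.append((e["lxa"], target, "no matching lxa"))
            elif target in lxo_values:
                problems.append((e["lxa"], target, "also matches lxo"))
    return problems

for problem in find_bad_links(entries):
    print(problem)
```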

To test for valid (and invalid) XML tags, the second row should be used.

Encyclopedia