The linguistic material is represented graphically in two ways in order to meet the opposing demands of fidelity to the sources and easy comparability:
(1) Input version in original transcription
In the VA portal, sources are brought together that come from different disciplinary traditions (Romance studies, German studies, Slavonic studies) and represent different historical stages of dialectological research. Some of the dictionary data were collected at the beginning of the last century (GPSR), others only a few years ago (ALD). For reasons of the history of science it is therefore necessary to respect the original transcription as far as possible. For technical reasons, however, certain conventions cannot be kept unchanged. This is especially true of the vertical combination of base characters ('letters') and diacritical marks, e.g. when a symbol for stress accent is positioned over a symbol for length over a vowel over a symbol for closure. Such conventions are converted into linear sequences of characters in technical transcriptions that are defined separately in each case and use ASCII characters exclusively (so-called Beta code). The Beta encoding can make the most of graphic resemblances between the original diacritic and its ASCII equivalent; these are, to a certain degree, intuitively understandable and therefore mnemonically favourable.
(2) Output version in IPA
Output of the data in a uniform transcription is desirable for comparability and user-friendliness. All Beta codes are therefore converted to IPA characters by specific substitution routines. A few incompatibilities are inevitable in cases where two different base characters in IPA correspond to a single base character that is specified by diacritics in the input transcription. This applies especially to the degrees of vowel height: in the palatal series, the two base characters <i> and <e>, combined with the diacritic closure dot and one or two opening ticks, allow six degrees of vowel height to be depicted. In Beta encoding these vowels are: i – i( – i(( – e? – e – e( – e((. In IPA, there are only four base characters for these vowels: i – ɪ – e – ɛ.
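A substitution routine of this kind can be sketched as a longest-match-first replacement over the Beta code string. The table below is purely hypothetical and serves only to illustrate the mechanism and the loss of distinctions described above; it is not VerbaAlpina's actual conversion table, and the function name `beta_to_ipa` is likewise an assumption.

```python
# Hypothetical mapping for illustration only: the real VerbaAlpina
# substitution rules are not given in this text.
BETA_TO_IPA = {
    "i((": "ɪ",
    "i(": "ɪ",
    "i": "i",
    "e?": "e",
    "e((": "ɛ",
    "e(": "ɛ",
    "e": "e",
}

# Try longer Beta sequences first so that "i((" is not read as "i" + "((".
_KEYS_BY_LENGTH = sorted(BETA_TO_IPA, key=len, reverse=True)


def beta_to_ipa(token: str) -> str:
    """Convert a Beta-coded token to IPA using the table above."""
    output = []
    position = 0
    while position < len(token):
        for key in _KEYS_BY_LENGTH:
            if token.startswith(key, position):
                output.append(BETA_TO_IPA[key])
                position += len(key)
                break
        else:
            # Characters without a rule are passed through unchanged.
            output.append(token[position])
            position += 1
    return "".join(output)
```

Under this assumed table, `beta_to_ipa("i(")` and `beta_to_ipa("i((")` return the same IPA character, which shows in miniature how several input degrees of vowel height can collapse onto one IPA base character.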
(1924ff.): Glossaire des patois de la Suisse romande, Neuchâtel
Goebl, Hans (2012): Atlant linguistich dl ladin dolomitich y di dialec vejins, 2a pert
The typification of the geocoded linguistic data is one of the fundamental aims of VerbaAlpina. In a first step, wherever possible, tokens ('single words') are extracted from the transcribed input data and recorded in the database field of the same name.
VerbaAlpina focuses on the morphological typification of the collected linguistic material. A morphological type is defined by agreement in the following properties: language family, part of speech, simple vs. affixed word, gender, and lexical basic type. The form by which a morphological type is cited is based on the lemmas of selected reference dictionaries (see below).
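The definition by agreement of properties can be illustrated with a small immutable record type, in which two types are equal exactly when all five properties match. This is a minimal sketch: the class name, the field names and the sample values are assumptions made for illustration and do not reflect VerbaAlpina's database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphType:
    """Hypothetical record for a morphological type."""
    family: str      # language family, e.g. "rom", "germ", "slav"
    pos: str         # part of speech
    affixed: bool    # False = simple word, True = affixed word
    gender: str      # grammatical gender, "" if not applicable
    base_type: str   # lexical basic type


# Illustrative values only.
a = MorphType("rom", "noun", False, "f", "malga")
b = MorphType("rom", "noun", False, "f", "malga")
c = MorphType("rom", "noun", True, "m", "malga")

# Equality and hashing follow from the five properties, so
# identical property bundles collapse into one type.
distinct_types = {a, b, c}
```

Because the dataclass is frozen, instances are hashable and can serve as dictionary keys when tokens are grouped by type.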
The unity of all merged morpho-lexical types becomes clear through their assignment to a common lexical basic type, also across language borders. In this way, the following nouns and verbs (which are not described here in detail) can be assigned to a single basic type malga: malga (MOUNTAIN PASTURE, HERD), malgaro (ALPINE DAIRYMAN), malghese (HERDER), immalgare (TO MOVE ON THE MOUNTAIN PASTURE), dismalgare (TO LEAVE THE MOUNTAIN PASTURE). The lexical basic type, however, says nothing about the word history of an individual morpho-lexical type. It must be established case by case whether a type with a Latin-Romance etymon that is attested today in the Germanic or Slovene language area (e.g. Slovene bajta 'simple house') goes back to an old local substratum or to more recent Romance language contact. For this reason, the designation "etymon" is avoided in this context, since in principle it refers to the immediate historical precursor of a word, even though in many cases the lexical basic type does correspond to the etymon of a morpho-lexical type.
The morpho-lexical types form the leading category for the management of linguistic data. They are comparable to the lemmas of lexicography. Using the robust and easily operationalised criteria mentioned above, the four phonetic types barga, bark, margun, bargun with the meaning ALPINE HERDSMEN'S HUT, ALPINE STABLE can, for example, be reduced to three morpho-lexical types:
The assignment of morpho-lexical types to language families (germ., rom., slav.) depends on the respective source. For data from atlases or dictionaries it follows automatically from the respective informants and is recorded accordingly in the database. For data that VerbaAlpina itself collects through crowdsourcing, the informants state the language or dialect they belong to, and this is ideally confirmed quantitatively; the number of confirming informants thereby becomes an instrument of data validation.
Morpho-lexical types are limited to one language family. It has to be decided which form should represent a morpho-lexical type in the search function of the interactive map. For the Germanic and Slavonic language families the answer is fairly simple, as each is represented by only one standardised individual language ('German' [deu] / 'Slovenian' [slo]). The morpho-lexical types can be represented by their standard variants, provided that the type has an equivalent in the standard language. In this way, all corresponding phonetic types of Alemannic and Bavarian that are variants of the standard form 'cheese' can, for example, be retrieved under this standard form. If there is no such standard variant, the lemmas of the major reference dictionaries (Idiotikon, WBÖ) are consulted.
The situation is much more complex for the Romance language family because of its numerous minor languages, some of which are not sufficiently standardised. For pragmatic reasons, the following procedure has been chosen: all morpho-lexical types are represented by the French and Italian standard forms, where these exist. All phonetic types that are variants of beurre/burro 'butter' can be retrieved under these two forms. The reference dictionaries include TLF and Treccani. If only one of the two standard languages has an appropriate variant, only that one is given, as in the case of ricotta (the assignment to Italian is marked by the notation convention -/ricotta). If neither of the two Romance reference languages has a variant of the type, we fall back on an entry in a dialectal reference dictionary, for instance LSI. If there are no reliable entries in dialect dictionaries, VerbaAlpina proposes a basic type along with a graphic representation ('VA').
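The fallback order described for the Romance family can be sketched as a small selection function. The function and parameter names are hypothetical; the rendering of a French-only type as `form/-` is an assumption made by analogy with the attested notation -/ricotta and is not stated in the text.

```python
from typing import Optional


def romance_citation_form(
    french: Optional[str] = None,
    italian: Optional[str] = None,
    dialect_lemma: Optional[str] = None,
    va_proposal: Optional[str] = None,
) -> Optional[str]:
    """Pick the form under which a Romance morpho-lexical type is cited."""
    # 1. Standard forms: French and/or Italian, written as "fr/it".
    if french or italian:
        return f"{french or '-'}/{italian or '-'}"
    # 2. Otherwise an entry from a dialectal reference dictionary.
    if dialect_lemma:
        return dialect_lemma
    # 3. Otherwise the form proposed by VerbaAlpina itself (may be None).
    return va_proposal
```

For example, `romance_citation_form("beurre", "burro")` yields the paired form, while `romance_citation_form(italian="ricotta")` yields the Italian-only notation.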
Phonetic typification of the linguistic material is provided for in the overall concept and in the technical implementation, but it is peripheral and is therefore not carried out consistently. The corresponding category is nevertheless indispensable, primarily because linguistic atlases (e.g. SDS and VALTS) and dictionaries sometimes document phonetic types exclusively. When VerbaAlpina typifies phonetically, the tokens are divided into phonetic types according to criteria of historical phonetics (database field 'phon_typ'). We are examining whether phonetic typification can be automated on the basis of Levenshtein and Soundex algorithms. If automation proves feasible, we will put it into practice.
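To give an idea of what a Levenshtein-based comparison measures, the following minimal sketch computes the standard edit distance between two forms (insertions, deletions and substitutions each cost 1). It is a textbook implementation, not VerbaAlpina's procedure, and how distances would be turned into phonetic types (thresholds, weighting of sound classes) is left open here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


# Pairwise distances for the four phonetic types mentioned in the text.
forms = ["barga", "bark", "margun", "bargun"]
distances = {
    (x, y): levenshtein(x, y)
    for index, x in enumerate(forms)
    for y in forms[index + 1:]
}
```

Here margun and bargun differ by a single substitution, whereas barga and bark are two edits apart; a plain character-based distance of this kind knows nothing about historical phonetics, which is one reason why automation would have to be evaluated carefully.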
Typification (the formation of classes) makes the diversity of the data increasingly clear. The following rule holds: number of tokens > number of phonetic types > number of morpho-lexical types > number of basic types. There can, however, be the extreme case of a single attestation (hapax) that corresponds to one token, one phonetic type and one morpho-lexical type as the only representative of a basic type. It may make sense to filter out such hapax forms in the display.
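The counting rule and the hapax filter can be sketched on a handful of abstract records. Apart from 'phon_typ', which is named in the text, the field names and all values are placeholders invented for illustration.

```python
from collections import Counter

# Levels from finest to coarsest; only "phon_typ" is named in the text,
# the other field names are hypothetical.
LEVELS = ("token", "phon_typ", "morph_typ", "base_typ")

# Abstract placeholder records, not real VerbaAlpina data.
records = [
    {"token": "t1", "phon_typ": "p1", "morph_typ": "m1", "base_typ": "b1"},
    {"token": "t2", "phon_typ": "p1", "morph_typ": "m1", "base_typ": "b1"},
    {"token": "t3", "phon_typ": "p2", "morph_typ": "m1", "base_typ": "b1"},
    {"token": "t4", "phon_typ": "p3", "morph_typ": "m2", "base_typ": "b1"},
    {"token": "t5", "phon_typ": "p4", "morph_typ": "m3", "base_typ": "b2"},
]


def distinct_counts(rows):
    """Number of distinct values at each level, finest level first."""
    return [len({row[level] for row in rows}) for level in LEVELS]


def drop_hapax(rows):
    """Remove records whose basic type is attested only once."""
    attestations = Counter(row["base_typ"] for row in rows)
    return [row for row in rows if attestations[row["base_typ"]] > 1]


counts = distinct_counts(records)
filtered = drop_hapax(records)
```

In this toy data set the counts decrease from level to level, and the single record for b2 is the hapax that `drop_hapax` removes.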