The linguistic material is represented graphically in two ways in order to meet the opposing demands of fidelity to the sources and easy comparability:
(1) Input version in original transcription
The VA portal brings together sources that come from the traditions of different disciplines (Romance studies, German studies, Slavonic studies) and that represent different historical stages of dialectological research. Some of the dictionary data were collected at the beginning of the last century (GPSR), others only a few years ago (ALD). For reasons of the history of science, it is therefore necessary to respect the original transcription to the greatest possible extent. For technical reasons, however, certain conventions cannot be kept unchanged. This is especially true of the vertical combination of base characters ('letters') and diacritical marks, e.g. when a symbol for stress accent is positioned over a symbol for length, over a vowel, over a symbol for closure. These conventions are converted into linear sequences of characters in technical transcriptions that are defined for each source and that use exclusively ASCII characters (so-called Beta code). The beta encoding makes the most of graphic resemblances between the original diacritic and its ASCII equivalent; these are, to a certain degree, intuitively understandable and therefore easy to memorise.
(2) Output version in IPA
Output in a uniform transcription is desirable for reasons of comparability and user-friendliness. Therefore, all Beta codes are converted to IPA characters using specific substitution routines. A few incompatibilities are inevitable in cases where two different base characters in IPA correspond to a single base character that is specified by diacritics in the input transcription. This is especially the case for the degrees of vowel height: in the palatal row, the two base characters <i> and <e>, in combination with the diacritic closure dot and one or two opening ticks, allow six degrees of vowel height to be depicted. In Beta encoding these vowels are the following: i – i( – i(( – e? – e – e( – e((. In IPA, there are only four base characters for these vowels: i – ɪ – e – ɛ.
(auct. Thomas Krefeld – trad. Susanne Oberholzer)
Tags: Linguistics Information technology
Methodology
Transcription rules
We distinguish between base characters and diacritics.
Base characters are located on the baseline. All characters that are not on the baseline are considered diacritics. Purely typographic variations of a base character are also treated as diacritics in the broader sense, e.g. if the base character is displayed smaller than the others.
Base characters
Base characters that exist in the ASCII table are retained (= all Latin characters; not German umlauts!). All other base characters are transcribed by a combination of a letter and a numeral (see table).
Diacritics
Diacritics are always placed after the base character to which they are assigned. If there are several diacritics on one base character, the following sequence must be observed:
- First, diacritics that mark a typographic variation of a base character are written, e.g. if the base character is set higher or lower. These diacritics are shown in yellow in the table.
- Then diacritics below and above the base character are written down from bottom to top. In particular, the diacritics below a base character (marked in green) must always come before those above a base character (marked in blue).
- Finally, diacritics that come after the base character are written, e.g. a length sign or an apostrophe after a base character. These are marked in orange in the table.
Each character used for the transcription of a diacritic may only occur once per base character. There are special rules for the repetition of the same diacritic, e.g. : for two points above a base character or \2 for a double grave accent.
If a diacritic refers to two or more characters, e.g. a͠e, the base characters are placed in square brackets, in this case [ae]~.
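The ordering rules above can be illustrated with a minimal sketch. It assumes that each diacritic is described by its Beta code, a position class and a distance from the base character; the class `Diacritic`, the labels in `RANK` and the function `encode` are hypothetical names chosen for illustration and are not part of the VerbaAlpina tooling.

```python
from dataclasses import dataclass

# Rank of each position class, in the order prescribed by the rules above.
# The label names are assumptions made for this sketch.
RANK = {"typographic": 0, "below": 1, "above": 2, "after": 3}


@dataclass(frozen=True)
class Diacritic:
    code: str          # Beta code of the diacritic, e.g. "?" or "/"
    position: str      # one of the keys of RANK
    distance: int = 1  # 1 = directly next to the base character, 2 = one further out


def sort_key(d: Diacritic) -> tuple[int, int]:
    # "Bottom to top": below the base the outermost mark comes first,
    # above the base the innermost mark comes first.
    vertical = -d.distance if d.position == "below" else d.distance
    return (RANK[d.position], vertical)


def encode(base: str, diacritics: list[Diacritic]) -> str:
    """Return the base character followed by its diacritic codes in rule order."""
    codes = [d.code for d in diacritics]
    if len(codes) != len(set(codes)):
        raise ValueError("each diacritic code may occur only once per base character")
    return base + "".join(d.code for d in sorted(diacritics, key=sort_key))


# Dot under and acute above an o: the lower mark is written first.
print(encode("o", [Diacritic("/", "above"), Diacritic("?", "below")]))
# Macron with an acute stacked on top of it.
print(encode("a", [Diacritic("/", "above", 2), Diacritic("-", "above", 1)]))
```

The two calls reproduce the sequences `o?/` and `a-/` that appear in the placeholder examples of this entry.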
Brackets and Comments
Comments (whether in brackets or not) are placed in angle brackets after the attestation to which they refer, e.g. (m.) → <m.>. If the entire attestation is bracketed, the attestation is transcribed without brackets and the remark "in brackets" is added in angle brackets.
Separators
Possible morphosyntactic variants of an attestation, such as singular and plural forms, are separated by commas; different word forms are separated by semicolons. This corresponds to the representation of the attestations in the atlas AIS. If the attestations are separated by other separators (e.g. / or -) in the source, these must be replaced accordingly by commas and semicolons in the transcription. Any numbering of different variants is omitted.
Typified attestations
If the source contains both a single attestation and an already typified variant for an informant, only the single attestation is transcribed. Only if this is not possible is the typified variant transcribed and marked as "phonetic type" or "morpho-lexical type" via the corresponding selection menu. In contrast to single attestations, types may also contain capital letters; otherwise the same transcription rules apply.
If both single attestations and typified attestations exist for an informant, two separate lines must be created for the transcription, in which the transcripts are marked accordingly as type or as attestation.
Special characters in the source
All characters used as diacritics in the transcription (including numerals) must be masked by prefixing them with two backslashes, e.g. * → \\*, if they appear as original characters. This only applies to characters that are part of the phonetic transcription of the single attestation in the source. For characters that carry a specific meaning, this meaning must instead be written as a remark in angle brackets after the attestation. For example, the character † stands for an obsolete form in the AIS and must be marked with a corresponding remark in angle brackets in the transcription. Brackets are always replaced (see brackets and comments and placeholders).
The following characters from the AIS can simply be omitted: ℗, ○, P, S, +
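The masking rule can be sketched as follows. This is only an illustration under stated assumptions: the set of characters to mask is collected by hand from the diacritics table of this entry, the prefix of two backslashes follows the wording of the rule above, and the function name `mask` is hypothetical.

```python
# Characters that serve as diacritic codes in the tables of this entry,
# digits included. The set is assembled by hand and may be incomplete.
DIACRITIC_CHARS = set("?:()/\\-_~+!%@|&$^\">*'=") | set("0123456789")


def mask(source_text: str, prefix: str = "\\\\") -> str:
    """Prefix every character that doubles as a diacritic code.

    The default prefix is two backslashes, as stated in the rule; it is a
    parameter because the exact convention is an assumption of this sketch.
    """
    return "".join(prefix + ch if ch in DIACRITIC_CHARS else ch for ch in source_text)


print(mask("a*b"))  # the asterisk is original material, so it gets masked
```

Characters that carry a meaning of their own in the source (such as †) are not handled here, because the rule requires them to be turned into remarks rather than masked.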
Placeholders
All forms of placeholders or shortened spellings must be replaced by the character string they represent. If an attestation with comments is split into multiple attestations, the comments must be repeated for each of them. The following table gives some examples:
Attestation from the source | Transcription |
---|---|
u kā́ni; i ~ | u ka-/ni; i ka-/ni |
(Alm)hütte | Almhu:tte; Hu:tte |
(um bé̜l) pašọ́ɳ (selten) | um be(/l pas^o?/n1 <selten>; pas^o?/n1 <selten> |
There is an exception for small phonetic variations in already typified attestations, e.g. the morpho-lexical type "Sänn(e)hütte" can be transcribed as "Sa:nn\(e\)hu:tte".
Transcription not possible
The "vacat" button is used for informants for whom no data is entered in a map. If the transcription of an attestation is problematic (e.g. because it is not possible or unclear according to these rules), the "problem" button is used.
Transcription preview
When you enter the transcription, a preview of how the attestation will look after the reconversion is displayed behind the corresponding text field for comparison purposes. If the text "Not valid" appears, the attestation is transcribed incorrectly and cannot be entered. If individual characters appear highlighted in red in the beta code, this means that the attestation is valid, but the character cannot yet be converted. This is mainly the case with characters that have not yet occurred in this form. In this case, the attestation can be entered as usual.
Base characters
Character | Description | Beta code | Comment |
---|---|---|---|
α | Greek alpha | a1 | |
ɒ | mirror-inverted a | a2 | |
æ | ligature ae | a3 | |
β | Greek beta | b1 | |
ƀ | crossed out b | b2 | |
χ | Greek chi | c1 | |
ҁ | sign for glottal closure | c2 | |
(not shown) | crossed out c | c3 | |
ɕ | | c4 | |
δ | Greek delta | d1 | |
đ | crossed out d | d2 | |
ð | eth | d3 | |
ə | schwa | e1 | |
![]() | tick to the left of the e | e2 | |
ε | Greek epsilon | e3 | |
φ | Greek phi | f1 | |
ƒ | labiodental fortis | f2 | |
ɣ | Greek gamma | g1 | |
![]() | open g on the right | g2 | |
(not shown) | g with bottom line | g3 | |
ʔ | glottal stop | g4 | |
ɥ | | h1 | |
i̷ | i with slanted line | i1 | |
ı | i without dot | i2 | |
ɨ | i with horizontal line | i3 | |
ɪ | | i4 | |
ɟ | | j1 | |
ł | crossed out l | l1 | |
![]() | l with strongly curved line | l2 | |
![]() | l with two curved lines | l3 | |
λ | Greek lambda | l4 | |
ʎ | | l5 | |
ɱ | | m1 | |
ɳ | sign for velar "n" (German: kling) | n1 | |
ŋ | velar nasal | n2 | |
ɲ | | n3 | |
œ | ligature oe | o1 | |
ɔ | open o on the left | o2 | |
ơ | o with tick at the upper right margin | o3 | |
ǫ | o with ogonek | o4 | |
ø | o with diagonal line | o5 | |
ω | Greek omega | o6 | |
π | Greek pi | p1 | |
þ | thorn | p2 | |
ꝗ | q with horizontal line | q1 | |
ʀ | upper-case R at the height of a lower-case letter | r1 | |
ɹ | | r2 | |
ɾ | | r3 | |
ʃ | esh | s1 | |
![]() | s with diagonal stroke left | s2 | |
ʂ | | s3 | |
ϑ | Greek theta | t1 | |
![]() | more strongly curved u | u1 | |
ʊ | | u2 | |
ʒ | ezh | z1 | |
ʑ | | z2 | |
Diacritics
Character | Description | Beta code | Comment | Example |
---|---|---|---|---|
ṣ | dot under base character | ? | | s? |
ė | dot above base character | ?1 | | e?1 |
ä | two dots above base character | : | | a: |
ṳ | two dots under base character | :1 | | u:1 |
o̜ | tick open to the right under base character | ( | | o( |
![]() | two ticks open to the right under base character | (1 | | e(1 |
r͗ | semicircle open to the left (spiritus lenis) above base character | ) | | r) |
o̹ | semicircle open to the left under base character | )1 | | o)1 |
ç | cedilla | )2 | | c(2 |
ó | acute on base character | / | | o/ |
ő | double acute on base character | /2 | | o/2 |
à | grave accent on base character | \ | | a\ |
ȁ | double grave accent on base character | \2 | | a\2 |
![]() | grave accent with dot at the upper end on base character | \3 | | u\3 |
ā | horizontal line above base character | - | minus sign - | a- |
ā̄ | two horizontal lines above base character | -2 | minus sign - | a-2 |
ṉ | horizontal line under base character | _ | underscore _ | n_ |
n͇ | double horizontal line under base character | _1 | | n_1 |
ẽ | tilde ABOVE base character | ~ | | e~ |
![]() | more strongly curved tilde ABOVE base character | ~1 | | |
ḛ | tilde UNDER base character | + | | e+ |
ă | semicircle open to the TOP ABOVE base character | ! | | a! |
ȃ | semicircle open to the BOTTOM ABOVE base character | % | | a% |
a̯ | semicircle open to the BOTTOM UNDER base character | @ | | a@ |
k̮ | semicircle open to the TOP UNDER base character | @1 | | k@1 |
ů | circle ABOVE base character | \| | | u\| |
s̥ | circle UNDER base character | & | | s& |
e̩ | vertical line under base character | $ | | e$ |
ǧ | hacek | ^ | | g^ |
ĝ | circumflex | ^1 | | g^1 |
o̭ | "circumflex" under base character | ^2 | | o^2 |
d̬ | "hacek" under base character | ^3 | | d^3 |
u∞ | infinity symbol above base character | " | | u" |
n͐ | "greater-than symbol" above base character | > | | n> |
a͓ | cross under base character | * | | a* |
a̽ | cross above base character | *1 | | a*1 |
g’ | apostrophe after base character | ' | on the #-key | g' |
aߵ | inverted apostrophe after base character | '1 | on the #-key | a'1 |
gˈ | elevated vertical line after base character | '2 | on the #-key | g'2 |
![]() | tick after base character | = | | k= |
c² | superscript number after base character | \<n>0 | mask number with \ and put 0 after it | c\20 |
aː | IPA length character | :2 | | a:2 |
aˑ | half IPA length character | :3 | | a:3 |
ᵃb | base character above the baseline | 0 | | a0b |
![]() | base character on the baseline, smaller than all other characters | 8 | | n8d |
ᵢn | base character below the baseline | 9 | | i9n |
![]() | upper or lower diacritics in brackets | [<d>] | diacritic in brackets between square brackets | u[:] or e[?] |
aͦ | base character above base character | {<z>} | elevated base character between braces | a{o} |
![]() | base character below base character | {1<z>} | | a{1o} |
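To show how the two tables work together, here is a minimal sketch of a reconversion from Beta code to Unicode. It covers only a handful of codes, and the choice of Unicode combining characters for each diacritic is an assumption of this sketch, as are the names `BASE`, `DIACRITICS` and `decode`; the actual substitution routines of the portal are not documented here.

```python
import unicodedata

# A small excerpt of the base character table (letter + numeral -> character).
BASE = {"a1": "α", "e1": "ə", "n1": "ɳ", "n2": "ŋ", "s1": "ʃ", "z1": "ʒ"}

# A small excerpt of the diacritics table, mapped to Unicode combining marks.
# The mapping to specific code points is an assumption made for illustration.
DIACRITICS = {
    "?": "\u0323",   # dot below
    "?1": "\u0307",  # dot above
    ":": "\u0308",   # two dots above
    ":1": "\u0324",  # two dots below
    "/": "\u0301",   # acute
    "-": "\u0304",   # horizontal line above
    "~": "\u0303",   # tilde above
    "^": "\u030c",   # hacek
    "^1": "\u0302",  # circumflex
}

# Longer codes are tried first so that "?1" is not read as "?" followed by "1".
CODES = sorted(DIACRITICS, key=len, reverse=True)


def decode(beta: str) -> str:
    out = []
    i = 0
    while i < len(beta):
        ch = beta[i]
        if ch == " ":
            out.append(ch)
            i += 1
            continue
        if not (ch.isascii() and ch.isalpha()):
            # Roughly what the preview would report as "Not valid".
            raise ValueError(f"unexpected character {ch!r} at position {i}")
        if beta[i:i + 2] in BASE:      # letter + numeral from the base table
            out.append(BASE[beta[i:i + 2]])
            i += 2
        else:                          # plain ASCII letter is kept as it is
            out.append(ch)
            i += 1
        while True:                    # collect all diacritics of this base
            for code in CODES:
                if beta.startswith(code, i):
                    out.append(DIACRITICS[code])
                    i += len(code)
                    break
            else:
                break
    return "".join(out)


print(decode("pas^o?/n1"))
print(unicodedata.normalize("NFC", decode("Almhu:tte")))
```

A code that is missing from the excerpt (for example `(`) raises a `ValueError` in this sketch, whereas the preview described above distinguishes between invalid input and valid characters that cannot yet be converted.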
Special characters
In principle, these characters are equivalent to base characters, except that they cannot be combined with diacritics.
Character | Description | Beta code | Example |
---|---|---|---|
·e̜kọ́ɳ | A dot, before or after the base character. Higher than the baseline. | .1 | .1e(ko?/n1 |
Special Blanks
(Regular blanks are represented by the character ␣ in this table)
Character | Description | Beta code | Example |
---|---|---|---|
w‿d | blank with curve | {␣} | w{␣}d |
(auct. Stephan Lücke | Florian Zacherl – trad. Christina Mutter)
Tags: Information technology
Typification
The typification of the geocoded linguistic data is one of the central concerns of VerbaAlpina. As a first step, tokens ('single words') are extracted from the input data after transcription and, where possible, registered in the database field of the same name.
VerbaAlpina focuses on the morphological typification of the collected linguistic material. A morphological type is defined by agreement in the following properties: language family – part of speech – single word vs. affixed word – gender – lexical basic type. The form by which the morphological type is cited is based on the lemmas of selected reference dictionaries (see below).
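The definition by agreement of properties can be pictured as a record whose equality is determined by exactly these five fields. The following is a minimal sketch; the class name `MorphType`, the field names and the property values assigned to the two examples are illustrative assumptions, not entries from the VerbaAlpina database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphType:
    language_family: str  # e.g. "roa", "gem", "sla"
    part_of_speech: str
    affixed: bool         # single word (False) vs. affixed word (True)
    gender: str
    basic_type: str       # lexical basic type


# Illustrative values only.
malga = MorphType("roa", "noun", False, "f", "malga")
malga_again = MorphType("roa", "noun", False, "f", "malga")
malgaro = MorphType("roa", "noun", True, "m", "malga")

print(malga == malga_again)                    # same properties, same type
print(malga == malgaro)                        # different morphological types
print(malga.basic_type == malgaro.basic_type)  # but one shared basic type
```

Because the dataclass is frozen, its instances are hashable and can serve directly as dictionary keys when attestations are grouped by type.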
The unity of all merged morpho-lexical types becomes clear through their assignment to a common lexical basic type, also across language borders. In this way, the following nouns and verbs (which are not described here in detail) can be assigned to one single basic type malga: malga (MOUNTAIN PASTURE, HERD), malgaro (ALPINE DAIRYMAN), malghese (HERDER), immalgare (TO MOVE ON THE MOUNTAIN PASTURE), dismalgare (TO LEAVE THE MOUNTAIN PASTURE). The lexical basic type, however, says nothing about the word history of an individual morpho-lexical type. It has to be determined case by case whether a type with a Latin-Romance etymon that is attested today in the Germanic or Slovene language area (e.g. Slovene bajta 'simple house') goes back to an old local substratum or to more recent Romance language contact. For this reason, the designation "etymon" is avoided in this context, as it refers in principle to the immediate historical predecessor of a word, even though the lexical basic type does in many cases correspond to the etymon of a morpho-lexical type.
The morpho-lexical types form the leading category for the management of linguistic data. They are comparable to the lemmas of lexicography. By means of the above-mentioned robust criteria, which can be operationalised well, the four phonetic types barga, bark, margun, bargun with the meaning ALPINE HERDSMEN'S HUT, ALPINE STABLE can, for example, be reduced to three morpho-lexical types:

The assignment of morpho-lexical types to language families (gem., roa., sla.) depends on the respective source. In the case of data from atlases or dictionaries, it follows automatically from the respective informants and is recorded accordingly in the database. In the case of data that VerbaAlpina itself collects through crowdsourcing, the informants state which language/dialect they belong to, and this is ideally confirmed quantitatively; the number of confirming informants thereby becomes an instrument of data validation.
Morpho-lexical types are limited to one language family. It has to be decided by which form a morpho-lexical type should be represented in the search function of the interactive map. For the Germanic and Slavonic language families the answer is quite easy, as each is represented by only one standardised individual language ('German' [deu] / 'Slovenian' [slo]). The morpho-lexical types can be represented by their standard variants, provided that the type has an equivalent in the standard language. In this way, for example, all corresponding phonetic types of Alemannic and Bavarian that are variants of the standard form 'cheese' can be retrieved under this standard form. If there is no such standard variant, the lemmas of the major reference dictionaries (Idiotikon, WBÖ) are consulted.
The situation is much more complex for the Romance language family because of its numerous small languages, some of which are not sufficiently standardised. For pragmatic reasons, the following procedure has been chosen: all morpho-lexical types are represented by the French and Italian standard forms, if these exist. All phonetic types that are variants of beurre/burro 'butter' can be retrieved under these two forms. The reference dictionaries include TLF and Treccani. If only one of the two standard languages has an appropriate variant, only that one is given, as in the case of ricotta (membership of Italian is marked by the notation convention -/ricotta). If neither of the two Romance reference languages has a variant of the type, we fall back on an entry in a dialectal reference dictionary, for instance LSI. If there are no reliable entries in dialect dictionaries, VerbaAlpina proposes a basic type with its own graphic representation ('VA').
The phonetic typification of the linguistic material is provided for in the overall concept and in the technical implementation, but it is peripheral and is therefore not carried out consistently. The corresponding category is indispensable mainly because linguistic atlases (e.g. SDS and VALTS) and dictionaries sometimes document phonetic types exclusively. When VerbaAlpina typifies phonetically, the tokens are divided into phonetic types according to criteria of historical phonetics (database field 'phon_typ'). We are examining whether the phonetic typification can be automated on the basis of Levenshtein and Soundex algorithms. If automation proves feasible, we will put it into practice.
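As an illustration of the kind of measure such an automation could build on, the following is a minimal sketch of the standard Levenshtein edit distance, applied to the four phonetic types mentioned above. It is a textbook implementation, not VerbaAlpina's own procedure, and how distances would be turned into type assignments is left open.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


forms = ["barga", "bark", "margun", "bargun"]
for x in forms:
    print(x, [levenshtein(x, y) for y in forms])
```

A small distance, such as the single substitution between margun and bargun, only indicates formal similarity; whether two forms belong to the same phonetic type remains a question of historical phonetics.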
Typification (the formation of classes) makes the diversity of the data increasingly clear. The following rule holds: number of tokens > number of phonetic types > number of morpho-lexical types > number of basic types. There can, however, be the extreme case of a single attestation (hapax) that corresponds to one token, one phonetic type and one morpho-lexical type as the only representative of a basic type. It may make sense to filter out such hapax forms in the display.
(auct. Thomas Krefeld | Stephan Lücke – trad. Susanne Oberholzer)
Tags: Linguistics