Open standards for documented linguistic knowledge
Language corpora have become foundational infrastructure for linguistics, natural language processing, and contemporary artificial intelligence. The term corpus denotes more than a collection of texts: it implies deliberate selection, structuring, and documentation according to explicit design criteria. Within this context, the Text Encoding Initiative (TEI) Guidelines provide a mature, open, and internationally established framework for the encoding, documentation, and interchange of language corpora.
TEI conceptualizes language corpora as composite texts. Each individual sample retains its integrity as a text, yet it also functions as a component of a larger analytical object. This dual perspective supports statistical representation of linguistic varieties, diachronic analysis, and systematic comparison across datasets. The teiCorpus element embodies this logic: it groups a corpus-level header with a series of individual TEI texts, each carrying its own header, so that multiplicity is represented explicitly while metadata and encoding practices can be shared at the corpus level.
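The composite structure can be sketched in a few lines of Python using the standard library's ElementTree. This is only an illustration under stated assumptions: the helper names (tei, make_header, build_corpus) and the sample titles are hypothetical, and the headers are trimmed to a bare title, whereas a valid TEI header also needs a publication statement and a source description.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag: str) -> str:
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def make_header(title: str) -> ET.Element:
    """Build a trimmed teiHeader holding only a title (not yet valid TEI)."""
    header = ET.Element(tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    return header


def build_corpus(corpus_title: str, samples: dict[str, str]) -> ET.Element:
    """Wrap each sample as a TEI text inside a single teiCorpus element."""
    corpus = ET.Element(tei("teiCorpus"))
    corpus.append(make_header(corpus_title))  # metadata shared by all samples
    for title, content in samples.items():
        text_el = ET.SubElement(corpus, tei("TEI"))
        text_el.append(make_header(title))  # metadata for this sample only
        body = ET.SubElement(ET.SubElement(text_el, tei("text")), tei("body"))
        ET.SubElement(body, tei("p")).text = content
    return corpus


# Hypothetical sample data, used only to show the resulting shape.
corpus = build_corpus(
    "Hypothetical demo corpus",
    {"Sample A": "First sample text.", "Sample B": "Second sample text."},
)
print(ET.tostring(corpus, encoding="unicode"))
```

Each sample remains a complete TEI text with its own header, while the first child of teiCorpus carries what the samples have in common.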
Contextual information is central to this approach. Variables such as the social background of participants, the communicative setting, or the intended domain of use are not peripheral but essential for meaningful interpretation. TEI integrates these dimensions within the header structure, enabling systematic documentation of production conditions, participants, and situational context. The separation between corpus-level and text-level metadata allows common assumptions to be stated once while preserving the ability to describe local variation where necessary.
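One way a processing script might honor this separation is to look for a value in the text's own header first and fall back to the corpus header otherwise. The sketch below assumes that lookup rule as a project convention, which the Guidelines do not mandate in this form. The function name effective_locale, the identifiers t1 and t2, and the locale values are hypothetical, and the headers are cut down to the setting description, so the fragment is not valid TEI as it stands.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
LOCALE_PATH = "tei:teiHeader/tei:profileDesc/tei:settingDesc/tei:setting/tei:locale"

# Trimmed, hypothetical corpus: one shared setting, one local override.
CORPUS_XML = """
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <settingDesc><setting><locale>radio studio</locale></setting></settingDesc>
    </profileDesc>
  </teiHeader>
  <TEI xml:id="t1">
    <teiHeader/>
    <text><body><p>Relies on the corpus-level setting.</p></body></text>
  </TEI>
  <TEI xml:id="t2">
    <teiHeader>
      <profileDesc>
        <settingDesc><setting><locale>street interview</locale></setting></settingDesc>
      </profileDesc>
    </teiHeader>
    <text><body><p>Describes its own setting.</p></body></text>
  </TEI>
</teiCorpus>
"""


def effective_locale(corpus: ET.Element, text: ET.Element):
    """Return the text-level locale, or the corpus-level one if none is given."""
    local = text.find(LOCALE_PATH, NS)
    if local is None:  # explicit None test: childless elements are falsy
        local = corpus.find(LOCALE_PATH, NS)
    return local.text if local is not None else None


corpus = ET.fromstring(CORPUS_XML)
locales = {
    text.get(XML_ID): effective_locale(corpus, text)
    for text in corpus.findall("tei:TEI", NS)
}
print(locales)
```

Here the first text inherits the shared setting and the second reports its own, which mirrors the idea of stating common assumptions once and recording local variation only where it occurs.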
A key innovation lies in the description of texts through situational parameters rather than rigid genre taxonomies. Elements such as channel, purpose, interaction, preparedness, and factuality allow texts to be characterized along several independent dimensions. This enables nuanced comparison across corpora and avoids the analytical limitations of fixed text-type labels. Such flexibility is particularly valuable in multilingual and multimodal research environments.
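To make the idea concrete, the following sketch describes two hypothetical texts with a handful of situational parameters and reports the dimensions on which they differ. Several assumptions apply: the namespace is left out for brevity, a complete textDesc would also describe constitution, derivation, and domain, the attribute values follow the suggested values of the Guidelines as best understood here and should be checked against the current release, and the names profile and differing_dimensions are invented for this example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical situational descriptions (namespace omitted for brevity).
NEWS_BROADCAST = """
<textDesc n="news broadcast">
  <channel mode="s"/>
  <factuality type="fact"/>
  <interaction type="none"/>
  <preparedness type="scripted"/>
  <purpose type="inform" degree="high"/>
</textDesc>
"""

CASUAL_CHAT = """
<textDesc n="casual conversation">
  <channel mode="s"/>
  <factuality type="mixed"/>
  <interaction type="complete"/>
  <preparedness type="none"/>
  <purpose type="entertain" degree="medium"/>
</textDesc>
"""


def profile(text_desc_xml: str) -> dict[str, str]:
    """Map each situational parameter to its main value (type, else mode)."""
    root = ET.fromstring(text_desc_xml.strip())
    return {child.tag: child.get("type") or child.get("mode") for child in root}


def differing_dimensions(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """List the parameters on which two profiles disagree."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


news = profile(NEWS_BROADCAST)
chat = profile(CASUAL_CHAT)
print(differing_dimensions(news, chat))
```

Both texts are spoken, so the channel does not separate them; the difference lies in factuality, interaction, preparedness, and purpose, which is the kind of comparison a single genre label cannot express.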
Linguistic annotation constitutes another critical layer. The TEI Guidelines deliberately avoid prescribing a single annotation theory, instead offering general mechanisms capable of representing a wide range of analytic practices. Part-of-speech tagging, syntactic annotation, discourse relations, and semantic features can all be documented in a transparent and self-describing manner. Crucially, TEI emphasizes the documentation of annotation methods, whether manual, automatic, or hybrid, ensuring that analytical claims remain verifiable.
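As a small illustration of word-level annotation, the sketch below reads a sentence whose tokens carry lemma and part-of-speech attributes and tallies the tags. The sentence, the choice of Universal Dependencies style tags, and the function name tokens are assumptions made for this example; TEI itself does not prescribe a tagset, and how the tags were assigned would still need to be documented in the header.

```python
from collections import Counter
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical annotated sentence; the tag values are an assumed tagset.
SENTENCE_XML = """
<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="the" pos="DET">The</w>
  <w lemma="corpus" pos="NOUN">corpora</w>
  <w lemma="grow" pos="VERB">grow</w>
  <pc>.</pc>
</s>
"""


def tokens(sentence: ET.Element) -> list[tuple[str, str, str]]:
    """Return (form, lemma, pos) for every word element in the sentence."""
    return [
        (w.text, w.get("lemma"), w.get("pos"))
        for w in sentence.findall("tei:w", NS)
    ]


sentence = ET.fromstring(SENTENCE_XML)
annotated = tokens(sentence)
pos_counts = Counter(pos for _, _, pos in annotated)

for form, lemma, pos in annotated:
    print(f"{form}\t{lemma}\t{pos}")
```

Because the annotation lives in attributes on the token, the surface form stays untouched and different analyses can be compared or replaced without altering the text itself.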
For large-scale corpus projects, TEI promotes pragmatic decision-making. Not all features need to be encoded exhaustively. By distinguishing between required, recommended, optional, and deliberately excluded features, corpus designers can balance analytical ambition with economic and organizational constraints. This principle supports long-term sustainability and consistency, which are often the main challenges in national or cross-institutional corpus initiatives.
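A project could record such decisions in a machine-readable policy and check incoming texts against it. The following is a minimal sketch under stated assumptions: the four status names come from the paragraph above, but the particular assignment of elements to statuses, the Status enum, and the check function are hypothetical, and the namespace is omitted for brevity.

```python
from enum import Enum
import xml.etree.ElementTree as ET


class Status(Enum):
    REQUIRED = "required"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"
    EXCLUDED = "excluded"


# Hypothetical project policy: which elements fall under which status.
POLICY = {
    "p": Status.REQUIRED,      # every text must mark paragraphs
    "s": Status.RECOMMENDED,   # sentence segmentation where affordable
    "w": Status.OPTIONAL,      # word-level markup is welcome but not expected
    "hi": Status.EXCLUDED,     # typographic highlighting is deliberately left out
}


def check(text_xml: str, policy: dict[str, Status]) -> dict[str, list[str]]:
    """Compare the elements used in a text with the project policy."""
    used = {el.tag for el in ET.fromstring(text_xml).iter()}

    def names(status: Status, present: bool) -> list[str]:
        return sorted(
            name for name, st in policy.items()
            if st is status and (name in used) == present
        )

    return {
        "missing_required": names(Status.REQUIRED, present=False),
        "missing_recommended": names(Status.RECOMMENDED, present=False),
        "excluded_present": names(Status.EXCLUDED, present=True),
    }


report = check("<body><p>Plain <hi>text</hi>.</p></body>", POLICY)
print(report)
```

In this sample the required paragraph markup is present, the recommended sentence segmentation is missing, and an excluded element has slipped in; optional features never trigger a report, which keeps the check aligned with the project's stated priorities.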
From an open science perspective, the adoption of TEI is not merely a technical choice but a strategic one. Open standards foster interoperability, reproducibility, and collective stewardship of linguistic resources. They enable smaller language communities to participate on equal terms in global research infrastructures and reduce dependency on proprietary formats. In an era where language data increasingly underpin artificial intelligence systems, TEI-based corpora offer a transparent and accountable alternative that aligns with the values of openness, documentation, and public knowledge.