Project Source
Commissioned by:
Principal Investigator: Zhijun Gao
Project Duration: November 1, 2026 - June 1, 2026
Research Tasks
Because natural Tibetan data is scarce, this project explores generating high-quality Tibetan datasets at scale using synthetic data approaches.
Project Objectives
- Synthetic data quality benchmarked against natural data
  - Maintain consistency with natural data in key linguistic features such as vocabulary, syntax, semantics, and discourse, with no significant differences (statistical test p > 0.05; also providing effect size thresholds, such as |d| < 0.2 or KL divergence below a set threshold).
  - Validated through both manual spot-checking and automated quality assessment (including metrics for fluency, grammatical correctness, factual consistency, translation fidelity, etc.).
- Multi-domain data resource pool with balanced distribution
  - Build a Tibetan dataset covering core domains such as science and technology, geography, history and culture, education, and public services.
  - Ensure the sample size in each domain reaches its preset scale, so that long-tail domains are not left severely underrepresented.
- Multi-task data types covering high-frequency use scenarios for Tibetan users
  - Task types covered: Chinese-Tibetan and Tibetan-Chinese translation, summarization and key point extraction, dialogue response, concept explanation, retrieval-based QA generation, writing polishing and rewriting, etc.
- Usability objectives for training and evaluation
  - Synthetic data can be used directly in mainstream training pipelines (cleaned, deduplicated, with annotation fields and complete metadata), and the project yields a reusable data generation and quality control pipeline.
  - Deliver quantifiable gains on downstream benchmark tasks (e.g., translation BLEU/COMET, QA accuracy, and summarization ROUGE improvements reaching preset targets), and provide comparative experiment reports for “natural data only / synthetic data only / mixed data.”
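
The effect-size and KL-divergence thresholds named in the first objective could be checked along the following lines. This is a minimal sketch under stated assumptions, not the project's actual quality-control code: it reads |d| as Cohen's d, uses sentence length as the compared feature, and computes KL divergence over smoothed token frequencies. The function names, the `alpha` smoothing parameter, and the `kl_max=0.1` default are hypothetical; only the |d| < 0.2 threshold comes from the objectives above. The significance test (p > 0.05) is not covered here.

```python
import math
from collections import Counter
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (needs >= 2 values per sample)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled_var == 0:
        # Both samples are constant: no spread to normalise by.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def kl_divergence(p_tokens, q_tokens, alpha=1.0):
    """KL(P || Q) in nats over the joint vocabulary, with add-alpha smoothing."""
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    vocab = set(p_counts) | set(q_counts)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for tok in vocab:
        p = (p_counts[tok] + alpha) / p_total
        q = (q_counts[tok] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def is_consistent(nat_lengths, syn_lengths, nat_tokens, syn_tokens,
                  d_max=0.2, kl_max=0.1):
    """Return (passed, d, kl). kl_max is a hypothetical placeholder threshold."""
    d = cohens_d(nat_lengths, syn_lengths)
    kl = kl_divergence(nat_tokens, syn_tokens)
    return (abs(d) < d_max and kl < kl_max), d, kl


if __name__ == "__main__":
    # Toy placeholder inputs for illustration only, not project data.
    nat_lengths = [12, 15, 11, 14, 13]
    syn_lengths = [13, 14, 12, 15, 12]
    nat_tokens = "a b a c a b".split()
    syn_tokens = "a b b c c a".split()
    passed, d, kl = is_consistent(nat_lengths, syn_lengths, nat_tokens, syn_tokens)
    print(f"passed={passed} d={d:.3f} kl={kl:.4f}")
```

In practice the token lists would come from whatever Tibetan tokenizer the project adopts, and the same comparison could be repeated per feature (vocabulary, syntax, and so on) and per domain.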