Multilingual Customer Support Tickets (Synthetic) 
A fully synthetic dataset for training and evaluating help-desk models: queue, priority, and type classification, plus response-assist pretraining. It was created with our Python Synthetic Data Generator and published on Kaggle.
- Kaggle: Ticket Dataset
- Synthetic Data Generation (planned LGPL)
- Need custom data or the tool? sales@softoft.de
Versions at a glance 
| Version | Languages | Size (relative) | Notes | 
|---|---|---|---|
| v5 | EN, DE | Largest | Latest and most refined taxonomy/balancing; focuses on EN/DE quality. | 
| v4 | EN, DE | Large | Similar to v5 focus; slightly older prompts and distributions. | 
| v3 | EN, DE, + more (FR/ES/PT) | Smaller | Earlier pipeline; more languages but less diverse content overall. | 
Older versions include more languages but are generally smaller and less diverse. The newest versions (v5, v4) emphasize EN/DE quality and scale.
Which version should I use? 
- Training EN/DE production models → start with v5 (or v4 if you need a comparable older set).
- Research across multiple languages → v3 (smaller, but includes more locales).
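Whichever version you pick, you can narrow a file to the locales you need by filtering on the `language` column. The sketch below uses only the standard library; the inline sample rows and the `fr` code are made up for illustration, so check the actual codes in your chosen file first.

```python
import csv
import io

# Tiny inline sample standing in for a real export (rows are made up).
SAMPLE_CSV = """subject,body,language
Login fails,I cannot sign in,en
Rechnung fehlt,Ich habe keine Rechnung erhalten,de
Retour,Je souhaite retourner un article,fr
"""

def filter_by_language(rows, languages):
    """Keep only rows whose `language` value is in `languages` (case-insensitive)."""
    wanted = {code.lower() for code in languages}
    return [
        row for row in rows
        if (row.get("language") or "").strip().lower() in wanted
    ]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
en_de_rows = filter_by_language(rows, ["en", "de"])
print(len(en_de_rows), "of", len(rows), "rows kept")
```

To run this against a real export, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle.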
Files & naming 
You’ll find CSV exports per version (examples):
- dataset-tickets-multi-lang-4-20k.csv
- dataset-tickets-multi-lang3-4k.csv
- dataset-tickets-german_normalized.csv
Schema
Every ticket includes core text plus labels used by Open Ticket AI.
| Column | Description | 
|---|---|
| subject | The customer’s email subject | 
| body | The customer’s email body | 
| answer | The agent’s first answer (AI-generated) | 
| type | Ticket type (e.g., Incident, Request, Problem, …) | 
| queue | Target queue (e.g., Technical Support, Billing) | 
| priority | Priority (e.g., low, medium, high) | 
| language | Ticket language (e.g., en, de, …) | 
| version | Dataset version (metadata) | 
| tag_1, tag_2, … | One or more topical tags (may be null in places) | 
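Because the tag columns can be empty, it often helps to merge them into one list per ticket. The following is a minimal sketch that assumes the columns are named `tag_1`, `tag_2`, and so on, and that a missing tag appears as an empty CSV cell. The sample rows and tag values are invented for the example.

```python
import csv
import io

# Made-up rows; only the column naming follows the schema table.
SAMPLE_CSV = """subject,queue,tag_1,tag_2,tag_3
Account Disruption,Technical Support,Outage,Account,
Invoice question,Billing,Payment,,
"""

def collect_tags(row):
    """Return the non-empty tag_N values of one ticket, ordered by N."""
    tag_columns = sorted(
        (name for name in row if name.startswith("tag_") and name[4:].isdigit()),
        key=lambda name: int(name[4:]),
    )
    return [row[name].strip() for name in tag_columns if (row[name] or "").strip()]

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    print(row["subject"], "->", collect_tags(row))
```

If you load the file with pandas instead, missing tags arrive as NaN rather than empty strings, so the emptiness check would need to change accordingly.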
Snippets from the data 
- de (Incident / Technical Support / high)
  - Subject: Wesentlicher Sicherheitsvorfall
  - Body (excerpt): „…ich möchte einen gravierenden Sicherheitsvorfall melden…“
  - Answer (excerpt): „Vielen Dank für die Meldung…“
- en (Incident / Technical Support / high)
  - Subject: Account Disruption
  - Body (excerpt): “I am writing to report a significant problem with the centralized account…”
  - Answer (excerpt): “We are aware of the outage…”
- en (Request / Returns and Exchanges / medium)
  - Subject: Query About Smart Home System Integration Features
  - Body (excerpt): “I am reaching out to request details about…”
  - Answer (excerpt): “Our products support…”
Intended use & limitations 
Intended:
- Cold-start model training for queue/priority/type
- Class balancing experiments
- Multilingual benchmarking (use v3 if you need FR/ES/PT)
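For the class balancing experiments listed above, one common starting point is inverse-frequency class weights. The sketch below computes `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn documents for `class_weight="balanced"`. The label counts are invented and do not reflect the dataset's real distribution.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up queue labels for illustration only.
queues = ["Technical Support"] * 6 + ["Billing"] * 3 + ["Returns and Exchanges"] * 1
weights = balanced_class_weights(queues)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

Rare classes receive larger weights, so a weighted loss treats each class as roughly equally important.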
Limitations:
- Synthetic distributions may differ from your production traffic. Always validate on a small, anonymized real sample before deployment.
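One way to act on this advice is to compare label shares between the synthetic data and your real sample before training. The sketch below uses total variation distance as the comparison measure; that choice is an assumption of this example, not something the dataset prescribes, and the priority labels are invented.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the total as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def total_variation_distance(labels_a, labels_b):
    """Half the sum of absolute share differences: 0 = identical, 1 = disjoint."""
    shares_a = label_shares(labels_a)
    shares_b = label_shares(labels_b)
    all_labels = set(shares_a) | set(shares_b)
    return 0.5 * sum(
        abs(shares_a.get(label, 0.0) - shares_b.get(label, 0.0))
        for label in all_labels
    )

# Made-up priority labels standing in for a synthetic set and a real sample.
synthetic = ["low"] * 3 + ["medium"] * 4 + ["high"] * 3
real = ["low"] * 6 + ["medium"] * 3 + ["high"] * 1
print(f"TVD = {total_variation_distance(synthetic, real):.2f}")
```

A large distance suggests resampling or reweighting the synthetic data before relying on it for your traffic.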
How to load & quick checks 
```python
import pandas as pd

df = pd.read_csv("dataset-tickets-multi-lang-4-20k.csv")  # or your chosen version

# Basic sanity checks
print(df["language"].value_counts())
print(df["queue"].value_counts().head())

# Prepare simple text for classification
X = (df["subject"].fillna("") + "\n\n" + df["body"].fillna("")).astype(str)
y = df["queue"].astype(str)
```
Relationship to Open Ticket AI 
This dataset mirrors the labels Open Ticket AI predicts on inbound tickets (queue, priority, type, tags). Use it to bootstrap training and evaluation, then deploy your model with Open Ticket AI once you are satisfied with the metrics.
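Which metrics to track is your call. As one hedged example, macro-averaged F1 is often used when queues are imbalanced, because every class counts equally. The sketch below computes it in plain Python on invented labels; it is an illustration and not part of Open Ticket AI.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} over every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0

# Made-up gold labels and predictions for illustration only.
y_true = ["Billing", "Billing", "Technical Support", "Technical Support"]
y_pred = ["Billing", "Technical Support", "Technical Support", "Technical Support"]
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Compute the score on a held-out split, and ideally on a small real sample as well, before deciding to deploy.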
License & citation 
- Dataset: please add your chosen data license here (e.g., CC BY 4.0).
- Generator: planned LGPL. For access or customizations: sales@softoft.de.
Suggested citation:
Bueck, T. (2025). Multilingual Customer Support Tickets (Synthetic). Kaggle Dataset. Generated with the Open Ticket AI Synthetic Data Generator.
Changelog (high level) 
- v5: EN/DE only; largest set; improved taxonomy and balancing.
- v4: EN/DE; large; earlier prompt set.
- v3: Smaller; includes additional languages (FR/ES/PT), earlier pipeline.