
Multilingual Customer Support Tickets (Synthetic)

A fully synthetic dataset for training and evaluating help-desk models on tasks such as queue, priority, and type classification, plus response-assist pretraining. Created with our Python Synthetic Data Generator and published on Kaggle.


Versions at a glance

Dataset version network diagram

| Version | Languages | Size (relative) | Notes |
| --- | --- | --- | --- |
| v5 | EN, DE | Largest | Latest and most refined taxonomy/balancing; focuses on EN/DE quality. |
| v4 | EN, DE | Large | Similar to v5 focus; slightly older prompts and distributions. |
| v3 | EN, DE, + more (FR/ES/PT) | Smaller | Earlier pipeline; more languages but less diverse content overall. |

Older versions include more languages but are generally smaller and less diverse. The newest versions (v5, v4) emphasize EN/DE quality and scale.

Which version should I use?

  • Training EN/DE production models → start with v5 (or v4 if you need a comparable older set).
  • Research across multiple languages → v3 (smaller, but includes more locales).

Files & naming

You’ll find CSV exports per version (examples):

dataset-tickets-multi-lang-4-20k.csv
dataset-tickets-multi-lang3-4k.csv
dataset-tickets-german_normalized.csv

Schema

Every ticket includes core text plus labels used by Open Ticket AI.

| Column | Description |
| --- | --- |
| subject | The customer’s email subject |
| body | The customer’s email body |
| answer | The agent’s first answer (AI-generated) |
| type | Ticket type (e.g., Incident, Request, Problem, …) |
| queue | Target queue (e.g., Technical Support, Billing) |
| priority | Priority (e.g., low, medium, high) |
| language | Ticket language (e.g., en, de, …) |
| version | Dataset version (metadata) |
| tag_1, tag_2, … | One or more topical tags (may be null in places) |
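
Before training, it can help to confirm that an export actually carries these columns and to gather the tags of each row into one list. The following is a minimal sketch using only the standard library. The helper names `missing_columns` and `row_tags` are hypothetical, the sample row is invented for illustration, and individual exports may not contain every column listed above.

```python
import csv
import io

# Column names taken from the schema table; an export may omit some of them.
CORE_COLUMNS = ["subject", "body", "answer", "type", "queue", "priority", "language", "version"]


def missing_columns(fieldnames):
    """Return the core columns that are absent from a CSV header."""
    present = set(fieldnames or [])
    return [col for col in CORE_COLUMNS if col not in present]


def row_tags(row):
    """Collect the non-empty tag_N values of one row, ordered by N."""
    tag_cols = sorted(
        (c for c in row if c.startswith("tag_") and c[4:].isdigit()),
        key=lambda c: int(c[4:]),
    )
    # Empty strings and None both count as "no tag".
    return [row[c].strip() for c in tag_cols if row[c] and row[c].strip()]


# Tiny in-memory sample (invented, not taken from the dataset).
SAMPLE = (
    "subject,body,answer,type,queue,priority,language,version,tag_1,tag_2\n"
    "Login fails,Cannot sign in,We are checking,Incident,Technical Support,high,en,5,Login,\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE))
print(missing_columns(reader.fieldnames))
for row in reader:
    print(row_tags(row))
```

To run the same check on a real export, replace `io.StringIO(SAMPLE)` with an open file handle for the CSV.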

Snippets from the data

  • de (Incident / Technical Support / high)
    Subject: Wesentlicher Sicherheitsvorfall
    Body (excerpt): „…ich möchte einen gravierenden Sicherheitsvorfall melden…“
    Answer (excerpt): „Vielen Dank für die Meldung…“

  • en (Incident / Technical Support / high)
    Subject: Account Disruption
    Body (excerpt): “I am writing to report a significant problem with the centralized account…”
    Answer (excerpt): “We are aware of the outage…”

  • en (Request / Returns and Exchanges / medium)
    Subject: Query About Smart Home System Integration Features
    Body (excerpt): “I am reaching out to request details about…”
    Answer (excerpt): “Our products support…”


Visual tour

Word cloud of ticket subjects

Most used tags

Distributions for queue, priority, language, type


Intended use & limitations

Intended:

  • Cold-start model training for queue/priority/type
  • Class balancing experiments
  • Multilingual benchmarking (use v3 if you need FR/ES/PT)
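
For class balancing experiments, one common starting point is to weight each class inversely to its frequency. The sketch below is an illustration rather than part of the dataset tooling: the function name `inverse_frequency_weights` is hypothetical, and the formula `n_samples / (n_classes * class_count)` is the same heuristic that scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy label list (invented); in practice pass the "queue", "priority", or "type" column.
queues = ["Billing", "Technical Support", "Technical Support", "Technical Support"]
print(inverse_frequency_weights(queues))
```

Rare classes receive weights above 1 and frequent classes below 1, so a loss function that uses these weights pays more attention to under-represented queues.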

Limitations:

  • Synthetic distributions may differ from your production traffic. Always validate on a small, anonymized real sample before deployment.
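
One way to quantify how far the synthetic label distribution sits from a real sample is the total variation distance between the two label distributions. This is a sketch under assumptions: the helper names are hypothetical, the label lists are invented, and the choice of metric is ours rather than something the dataset prescribes.

```python
from collections import Counter


def label_shares(labels):
    """Return the relative frequency of each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(synthetic_labels, real_labels):
    """0.0 means identical label distributions, 1.0 means no overlap at all."""
    p = label_shares(synthetic_labels)
    q = label_shares(real_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Invented example: priorities in the synthetic set vs. a small real sample.
synthetic = ["low", "medium", "high", "high"]
real = ["low", "low", "medium", "high"]
print(total_variation_distance(synthetic, real))
```

A value close to 0 suggests the synthetic label mix resembles your traffic for that field; a large value is a hint to re-weight or re-sample before relying on the synthetic data alone.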

How to load & quick checks

python
import pandas as pd

df = pd.read_csv("dataset-tickets-multi-lang-4-20k.csv")  # or your chosen version

# Basic sanity checks
print(df["language"].value_counts())
print(df["queue"].value_counts().head())

# Prepare simple text for classification
X = (df["subject"].fillna("") + "\n\n" + df["body"].fillna("")).astype(str)
y = df["queue"].astype(str)
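
To evaluate a classifier on the `X` and `y` built above, a held-out split that preserves the queue proportions is usually a sensible first step. The following is a minimal standard-library sketch; `stratified_split` is a hypothetical helper, and the default of 20% test data is an arbitrary assumption.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Sort the indices so both parts keep the original row order.
    train = [(texts[i], labels[i]) for i in sorted(train_idx)]
    test = [(texts[i], labels[i]) for i in sorted(test_idx)]
    return train, test


# Toy data (invented); with pandas, pass X.tolist() and y.tolist().
texts = [f"ticket {i}" for i in range(15)]
labels = ["Technical Support"] * 10 + ["Billing"] * 5
train, test = stratified_split(texts, labels)
print(len(train), len(test))
```

Very small classes can end up with zero test rows because of rounding, so check per-class counts before reporting metrics.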

Relationship to Open Ticket AI

This dataset mirrors the labels Open Ticket AI predicts on inbound tickets (queue, priority, type, tags). Use it to bootstrap training and evaluation; deploy your model with Open Ticket AI once you’re happy with metrics.


License & citation

  • Dataset: please add your chosen data license here (e.g., CC BY 4.0).
  • Generator: planned LGPL. For access or customizations: sales@softoft.de.

Suggested citation:

Bueck, T. (2025). Multilingual Customer Support Tickets (Synthetic). Kaggle Dataset. Generated with the Open Ticket AI Synthetic Data Generator.


Changelog (high level)

  • v5: EN/DE only; largest set; improved taxonomy and balancing.
  • v4: EN/DE; large; earlier prompt set.
  • v3: Smaller; includes additional languages (FR/ES/PT), earlier pipeline.