Idiomator is an AI-powered, multilingual idiom extraction tool. It helps users detect idiomatic expressions in English and Spanish, with more languages coming soon. Whether you're analyzing PDFs, social media posts, or research papers, Idiomator automatically identifies idioms using NLP techniques.
* Idiom detection is hard, even for humans. Our model performs well on common idioms, but accuracy varies by language and text domain, especially for longer multi-word expressions. Alongside our neural model, we use a rule-based fallback system to improve coverage.
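As an illustration of how a rule-based fallback can sit alongside a neural tagger, here is a minimal sketch. It assumes the fallback is a simple lexicon lookup whose matches are kept only where the neural model found nothing. The lexicon contents and the names `IDIOM_LEXICON`, `lexicon_spans`, and `merge_with_fallback` are hypothetical and do not describe Idiomator's actual implementation.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive

# Hypothetical mini-lexicon for illustration; a real one would be far larger.
IDIOM_LEXICON: List[Tuple[str, ...]] = [
    ("kick", "the", "bucket"),
    ("spill", "the", "beans"),
]


def lexicon_spans(tokens: List[str], lexicon: List[Tuple[str, ...]]) -> List[Span]:
    """Return spans where a lexicon entry matches the tokens exactly (case-insensitive)."""
    lowered = [t.lower() for t in tokens]
    spans: List[Span] = []
    for idiom in lexicon:
        n = len(idiom)
        for i in range(len(lowered) - n + 1):
            if tuple(lowered[i:i + n]) == idiom:
                spans.append((i, i + n))
    return spans


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge_with_fallback(neural: List[Span], rule_based: List[Span]) -> List[Span]:
    """Keep every neural span; add rule-based spans only where nothing overlaps."""
    merged = list(neural)
    for span in rule_based:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)


tokens = "Do not spill the beans or kick the bucket".split()
neural_spans = [(2, 5)]  # pretend the neural model only found "spill the beans"
combined = merge_with_fallback(neural_spans, lexicon_spans(tokens, IDIOM_LEXICON))
print(combined)  # [(2, 5), (6, 9)]
```

Exact surface matching like this misses inflected forms such as "kicked the bucket", so a practical fallback would likely need lemmatization or pattern rules.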
One of the biggest barriers in learning — especially language learning — is not knowing what you don't know. Idioms often fall into that blind spot. Idiomator exists to surface that hidden knowledge, tailored to your actual text or materials.
We're working on expanding coverage to more languages, including low-resource ones, and releasing our dataset and benchmark results publicly alongside our upcoming paper. On the modeling side, we're investigating cross-lingual transfer improvements and exploring complementary tools for grammar and syntax analysis.
I’m Shishir Maddineni, a student developer and language enthusiast. I’ve always struggled with idioms while learning languages — there’s no easy way to master them. So I built Idiomator to make that process faster and smarter.
Idiomator is built on a multilingual idiom dataset we created in-house, covering English, Spanish, Hindi, Telugu, and Indonesian. Validation of inter-annotator agreement by native speakers is currently underway. We're releasing the dataset publicly alongside upcoming research from our team. If you'd like early access or want to help validate idioms in your native language, get in touch.
The extractor is powered by IdiomBERT, a multilingual BIO sequence tagger built on mBERT and trained on our own annotated dataset. It identifies idiom spans directly in text in English, Spanish, Hindi, and Telugu. On our held-out test set, the currently deployed model achieves:
| Language | Exact Match | Overlap F1 |
|---|---|---|
| English | 56.3% | 78.7% |
| Spanish | 68.5% | 83.7% |
| Hindi | 43.5% | 78.4% |
| Telugu | 74.2% | 86.7% |
| Overall | 61.7% | 81.5% |
Results are on our held-out test split. A full description of the architecture and training procedure will be available in our upcoming paper.
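To make the two columns concrete, here is a minimal sketch of how BIO tags can be decoded into spans and scored. It assumes plain `B`/`I`/`O` tags, that Exact Match means the predicted span set equals the gold span set for a sentence, and that Overlap F1 is token-level F1 between predicted and gold idiom tokens. These definitions and the function names are illustrative assumptions; the exact metrics behind the table may differ.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect (start, end) token spans from a B/I/O tag sequence."""
    spans: List[Span] = []
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # a new B closes any open span
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:  # stray I without a B: treat it as a span start
                start = i
        else:  # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def exact_match(gold: List[Span], pred: List[Span]) -> bool:
    """Assumed definition: predicted spans are identical to the gold spans."""
    return set(gold) == set(pred)


def overlap_f1(gold: List[Span], pred: List[Span]) -> float:
    """Assumed definition: token-level F1 between gold and predicted idiom tokens."""
    gold_tokens = {i for s, e in gold for i in range(s, e)}
    pred_tokens = {i for s, e in pred for i in range(s, e)}
    if not gold_tokens and not pred_tokens:
        return 1.0  # nothing to find and nothing predicted
    common = len(gold_tokens & pred_tokens)
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Tokens: He kicked the bucket yesterday
gold = bio_to_spans(["O", "B", "I", "I", "O"])  # "kicked the bucket"
pred = bio_to_spans(["O", "O", "B", "I", "O"])  # "the bucket" only
print(exact_match(gold, pred))           # False
print(round(overlap_f1(gold, pred), 2))  # 0.8
```

A partially correct span like this one counts as a miss under Exact Match but still earns credit under Overlap F1, which is consistent with the table's Overlap F1 scores being higher than its Exact Match scores.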
Idiomator's earliest prototypes drew on the ID10M dataset (Tedeschi et al., 2022) and Wiktionary's rich gloss data. The current system is trained on our own multilingual dataset, but we remain grateful to the broader MWE and language-learning communities whose open work made this project possible.
Tedeschi, Martelli, and Navigli (2022). ID10M: Idiom Identification in 10 Languages. Findings of the Association for Computational Linguistics: NAACL 2022.