lexical similarity

Lexical similarity is a measure of how much of their vocabulary two languages share. Ethnologue measures the lexical similarity between numerous pairs of languages based on a Swadesh-like list (that is, looking only at the most common words). It is expressed either as a percentage or as a value between 0 and 1. Ethnologue reckons that a similarity above 85% suggests that two varieties are dialects of a common language, but I think it would be reductive to rely on that alone. Of course it depends on how you define languages vs dialects, but linguists generally draw the line in terms of mutual intelligibility, and more factors influence mutual intelligibility than shared vocabulary alone: phonology and grammar are also huge factors.
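To make the arithmetic concrete, here is a minimal sketch of the calculation, assuming someone has already judged, meaning by meaning, whether the two languages' words count as similar. The function names and the toy judgments are invented for illustration; this is not Ethnologue's actual procedure, only the "share of the list that matches" idea plus the 85% rule of thumb.

```python
def lexical_similarity(judgments):
    """Return the share of compared meanings judged similar, from 0 to 1.

    `judgments` maps each meaning on the word list to True (the two
    languages' words were judged similar) or False. How that judgment is
    made is outside this sketch.
    """
    if not judgments:
        raise ValueError("need at least one compared meaning")
    matches = sum(1 for similar in judgments.values() if similar)
    return matches / len(judgments)


def suggests_same_language(similarity, threshold=0.85):
    """Apply the 'above 85%' rule of thumb to a 0-to-1 similarity score."""
    return similarity > threshold


# Toy, made-up judgments for a five-meaning list (illustrative only).
toy_judgments = {
    "water": True,
    "hand": True,
    "night": True,
    "dog": False,
    "tree": False,
}

score = lexical_similarity(toy_judgments)
print(f"{score:.0%}")                  # the same score as a percentage
print(suggests_same_language(score))   # vocabulary alone; ignores phonology and grammar
```

The same score can be read either way: 0.6 out of 1, or 60%. The threshold check looks only at vocabulary, which is exactly why it should not be relied on by itself.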

At any rate, Ethnologue states that English has a lexical similarity of 60% with German and only 27% with French (barely higher than the 24% it shares with Russian). This might seem surprising, considering that only 25% of English words are Germanic in origin, while over half come from Latin or French. It makes more sense, though, if you remember that Ethnologue was comparing only the most common, everyday words, and that many of English’s borrowings from French are for higher-order, academic concepts. Other studies, which take into account a wider range of vocabulary, place English’s lexical similarity with French at 51–56%.

Most of the Romance languages have pretty high lexical similarities with each other. To run through them quickly:

  • The major outlier, Romanian, shares 77% lexical similarity with the closest major Romance language (Italian) and 71% with the most distant (Spanish).
  • French is 89% similar to Italian, 85% to Catalan, and 75–80% to all the others.
  • In addition to its high similarity to French, Italian is 87% similar to Catalan, 85% to Sardinian, 82% to Spanish, and in the high 70s for the others.
  • Catalan is 85% similar to French, Spanish and Portuguese and 87% similar to Italian.
  • Spanish and Portuguese share 89% lexical similarity with each other.
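
The figures in the list above can be kept in a small symmetric lookup table, which makes it easy to see which pairs clear the 85% mark. This is only a sketch: it includes just the pairs given an exact figure above (the "75–80%" and "high 70s" ranges are left out), and the names `ROMANCE_SIMILARITY`, `similarity` and `pairs_above` are made up for this example.

```python
# Pairwise figures quoted in the list above, as whole percentages.
# A frozenset key makes the lookup order-independent.
ROMANCE_SIMILARITY = {
    frozenset({"Romanian", "Italian"}): 77,
    frozenset({"Romanian", "Spanish"}): 71,
    frozenset({"French", "Italian"}): 89,
    frozenset({"French", "Catalan"}): 85,
    frozenset({"Italian", "Catalan"}): 87,
    frozenset({"Italian", "Sardinian"}): 85,
    frozenset({"Italian", "Spanish"}): 82,
    frozenset({"Catalan", "Spanish"}): 85,
    frozenset({"Catalan", "Portuguese"}): 85,
    frozenset({"Spanish", "Portuguese"}): 89,
}


def similarity(a, b):
    """Return the quoted percentage for a pair, or None if no exact figure was given."""
    return ROMANCE_SIMILARITY.get(frozenset({a, b}))


def pairs_above(threshold=85):
    """List the pairs whose similarity is strictly above the threshold."""
    return sorted(
        tuple(sorted(pair))
        for pair, pct in ROMANCE_SIMILARITY.items()
        if pct > threshold
    )


print(similarity("Portuguese", "Spanish"))  # same result in either order
print(pairs_above())
```

With a strict "above 85%" reading, only French and Italian, Italian and Catalan, and Spanish and Portuguese clear the bar; the pairs sitting at exactly 85% do not.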

While the Slavic languages mostly have extremely similar grammars (except for the southeastern ones), their lexical similarities are generally lower than those among the Romance languages. Some pairs certainly score pretty high (like Ukrainian and Belarusian at 84%, or Czech and Slovak at perhaps over 90%). But for many other pairs, lexical similarities of 60–70% seem more common.