Text Diff Calculator

Compare two texts and find differences. Get word-level, character-level, and line-level similarity scores, count added and removed words, calculate Levenshtein edit distance, and check plagiarism similarity.

Similarity %
Words Added (in B)
Words Removed (from A)
Edit Distance (chars)
Extended: more scenarios, charts, and detailed breakdown
Word Similarity %
Unchanged Words
Added Words (unique to B)
Removed Words (unique to A)
Professional: full parameters and maximum detail

Similarity Scores

Sorensen-Dice Similarity %
Jaccard Similarity %

Word Changes

New Unique Words
Removed Unique Words
Common Unique Words

Plagiarism Check

Plagiarism Indicator

How to Use This Calculator

  1. Paste your original text into Text A and your revised text into Text B.
  2. Read the Similarity % — 100% means identical, 0% means nothing in common.
  3. See how many words were added (in B but not in A) and removed (in A but not in B).
  4. Use the Line Diff tab to compare lists or structured text line by line.
  5. Use the Professional tier for Sorensen-Dice, Jaccard, and plagiarism assessment.
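
The added, removed, and unchanged word counts can be approximated with plain set operations. The sketch below is only an illustration: the `tokenize` rule (lowercase, then keep runs of letters, digits, and apostrophes) and the function names are assumptions, and the calculator's own tokenization may differ.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_changes(text_a, text_b):
    """Return the unique words added in B, removed from A, and unchanged."""
    vocab_a, vocab_b = set(tokenize(text_a)), set(tokenize(text_b))
    return {
        "added": sorted(vocab_b - vocab_a),      # in B but not in A
        "removed": sorted(vocab_a - vocab_b),    # in A but not in B
        "unchanged": sorted(vocab_a & vocab_b),  # in both texts
    }

changes = word_changes("The cat sat on the mat", "The dog sat on the rug")
print(changes["added"])      # ['dog', 'rug']
print(changes["removed"])    # ['cat', 'mat']
print(changes["unchanged"])  # ['on', 'sat', 'the']
```

Because sets are used, a word that appears several times in a text is counted once, which matches the "unique to A" and "unique to B" wording of the result fields.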

Formula

Sorensen-Dice: 2 × |A ∩ B| ÷ (|A| + |B|)
Jaccard: |A ∩ B| ÷ |A ∪ B|
Here A and B are the sets of unique words in each text. Scores are expressed as percentages.

Example

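"The cat sat on the mat" vs "The dog sat on the rug": ignoring case, the unique words are A = {the, cat, sat, on, mat} and B = {the, dog, sat, on, rug}, so Size A = 5 and Size B = 5. Common words = {the, sat, on} = 3, and the union contains 7 words. Dice = 2×3/(5+5) = 60%. Jaccard = 3/7 ≈ 43%.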
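
The worked example can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the `vocabulary` helper (lowercasing plus a simple regex split) and the choice to score two empty texts as 100% are illustrative, not taken from the calculator.

```python
import re

def vocabulary(text):
    """Set of unique lowercase words in the text (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def dice(a, b):
    """Sorensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # same assumption as above
    return len(a & b) / len(a | b)

a = vocabulary("The cat sat on the mat")
b = vocabulary("The dog sat on the rug")
print(f"Dice: {dice(a, b):.0%}")        # Dice: 60%
print(f"Jaccard: {jaccard(a, b):.0%}")  # Jaccard: 43%
```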

Frequently Asked Questions

  • The Sorensen-Dice coefficient measures similarity between two sets as 2|A∩B| / (|A| + |B|). A score of 100% means identical word vocabularies. A score of 0% means no words in common. It is more forgiving than Jaccard similarity for small set differences, making it useful for text comparison.
  • Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. "cat" to "car" has an edit distance of 1 (substitute t→r). It is widely used in spell checkers and fuzzy string matching, and closely related edit-distance methods underlie DNA sequence alignment and the diff tools used by version control systems.
  • As a general guideline: 80–100% similarity typically indicates copied or near-duplicate text. 60–80% suggests substantial overlap worth reviewing. 40–60% may indicate paraphrasing or shared topics. Below 40% usually indicates original work. Academic plagiarism detectors like Turnitin use sophisticated algorithms beyond simple word matching.
  • The Jaccard index is the size of the intersection divided by the size of the union of two word sets: |A∩B| / |A∪B|. Unlike Dice, it does not count the intersection twice, so Jaccard scores are always ≤ Dice scores, and the two agree only at 0% and 100%. Both measure vocabulary overlap without regard to word order or context.
  • Line diff compares texts line by line (or item by item when one entry is on each line). It is useful for comparing code, configuration files, lists, or structured documents. Lines that appear in A but not B are "removed"; lines in B but not A are "added"; lines in both are "unchanged".
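
The edit distance and line diff described in the answers above can be sketched as follows. Both functions are illustrative assumptions rather than the calculator's actual code: `levenshtein` is the standard two-row dynamic-programming formulation, and `line_diff` uses simple set membership, so it ignores line order and repeated lines (unlike an LCS-based tool such as GNU diff).

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]

def line_diff(text_a, text_b):
    """Classify lines as removed, added, or unchanged by set membership."""
    lines_a, lines_b = text_a.splitlines(), text_b.splitlines()
    set_a, set_b = set(lines_a), set(lines_b)
    return {
        "removed": [ln for ln in lines_a if ln not in set_b],
        "added": [ln for ln in lines_b if ln not in set_a],
        "unchanged": [ln for ln in lines_a if ln in set_b],
    }

print(levenshtein("cat", "car"))  # 1
print(line_diff("apple\nbanana\ncherry", "banana\ncherry\ndate"))
```

The two-row version keeps memory proportional to the length of the second string, which matters when comparing long texts character by character.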
