Question 1

What is Sorensen-Dice similarity?

Accepted Answer

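The Sorensen-Dice coefficient measures the similarity between two sets A and B as 2|A∩B| / (|A| + |B|). For text comparison, A and B are the sets of distinct words in each text. A score of 100% means the two texts use identical vocabularies; a score of 0% means they have no words in common. Because the intersection is counted twice, Dice penalizes small set differences less than Jaccard similarity does, which makes it useful for text comparison.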
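The formula can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function name `dice_similarity`, the lowercase-and-split-on-whitespace tokenization, and the choice to score two empty texts as 1.0 are all assumptions made for the example.

```python
def dice_similarity(text_a: str, text_b: str) -> float:
    """Sorensen-Dice coefficient over the distinct words of two texts."""
    # Assumed tokenization: lowercase, split on whitespace, keep distinct words.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())

    # Assumed convention: two empty texts count as identical.
    if not words_a and not words_b:
        return 1.0

    shared = len(words_a & words_b)
    return 2 * shared / (len(words_a) + len(words_b))


score = dice_similarity("the cat sat", "the cat ran")
print(f"{score:.1%}")
```

Here the two texts share 2 of their 3 words each, so the score is 2·2 / (3 + 3).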

Question 2

What is edit distance (Levenshtein distance)?

Accepted Answer

Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. For example, "cat" to "car" has an edit distance of 1 (substitute t→r). It is widely used in spell checkers, fuzzy string matching, and DNA sequence alignment. Diff tools in version control systems rely on closely related edit-script algorithms.
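A common way to compute this is dynamic programming over prefixes of the two strings. The sketch below is one standard formulation that keeps only two rows of the table; the function name `levenshtein` is chosen for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(source) < len(target):
        source, target = target, source

    # previous[j] = distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete from source
                current[j - 1] + 1,                   # insert into source
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("cat", "car"))  # 1
```

Running time is proportional to the product of the two string lengths, which is fine for words and short lines but slow for very long texts.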

Question 3

What similarity % indicates plagiarism?

Accepted Answer

As a general guideline: 80–100% similarity typically indicates copied or near-duplicate text. 60–80% suggests substantial overlap worth reviewing. 40–60% may indicate paraphrasing or shared topics. Below 40% usually indicates original work. Academic plagiarism detectors like Turnitin use sophisticated algorithms beyond simple word matching.
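These bands can be written as a simple lookup. The sketch below is only an illustration of the guideline above: the function name `similarity_band` is hypothetical, and because the ranges as stated share their endpoints, it assumes that a boundary value (for example exactly 80%) falls into the higher band.

```python
def similarity_band(percent: float) -> str:
    """Map a similarity percentage to the guideline bands described above."""
    if not 0 <= percent <= 100:
        raise ValueError("similarity must be between 0 and 100")

    # Assumption: boundary values belong to the higher band.
    if percent >= 80:
        return "copied or near-duplicate text"
    if percent >= 60:
        return "substantial overlap worth reviewing"
    if percent >= 40:
        return "possible paraphrasing or shared topic"
    return "likely original work"


print(similarity_band(72.5))  # substantial overlap worth reviewing
```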

Question 4

What is the Jaccard similarity index?

Accepted Answer

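The Jaccard index is the size of the intersection divided by the size of the union of two word sets: |A∩B| / |A∪B|. Unlike Dice, it does not count the intersection twice, so a Jaccard score is never higher than the Dice score for the same pair of texts; the two agree only at 0% and 100%. They are related by Dice = 2J / (1 + J), where J is the Jaccard score. Both measure vocabulary overlap without regard to word order or context.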
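A minimal Python sketch of the Jaccard index follows, together with the standard identity Dice = 2J / (1 + J) that converts a Jaccard score J into the corresponding Dice score. The names `jaccard_similarity` and `dice_from_jaccard`, the whitespace tokenization, and the treatment of two empty texts as identical are assumptions made for the example.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard index over the distinct words of two texts."""
    # Assumed tokenization: lowercase, split on whitespace, keep distinct words.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())

    union = words_a | words_b
    if not union:
        return 1.0  # assumed convention: two empty texts are identical
    return len(words_a & words_b) / len(union)


def dice_from_jaccard(jaccard: float) -> float:
    """Convert a Jaccard score to the Dice score for the same pair of sets."""
    return 2 * jaccard / (1 + jaccard)


j = jaccard_similarity("the cat sat", "the cat ran")
print(j, dice_from_jaccard(j))
```

For these two texts the intersection has 2 words and the union has 4, so the Jaccard score is 0.5 while the Dice score is about 0.667.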

Question 5

How does line diff work?

Accepted Answer

Line diff compares two texts, A (the original) and B (the revised version), line by line, or item by item when each line holds one entry. It is useful for comparing code, configuration files, lists, or structured documents. Lines that appear in A but not in B are "removed"; lines in B but not in A are "added"; lines in both are "unchanged".
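One way to produce these three labels in Python is with the standard library's `difflib.SequenceMatcher`. This is a sketch using that library's matching algorithm, which is not necessarily the method this calculator uses; the function name `line_diff` and the `(label, line)` output format are assumptions for illustration.

```python
from difflib import SequenceMatcher


def line_diff(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Label each line as 'removed', 'added', or 'unchanged'."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    result = []

    matcher = SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            result += [("unchanged", line) for line in lines_a[a_start:a_end]]
        else:
            # 'delete', 'insert', and 'replace' all reduce to removals from A
            # followed by additions from B (either slice may be empty).
            result += [("removed", line) for line in lines_a[a_start:a_end]]
            result += [("added", line) for line in lines_b[b_start:b_end]]
    return result


for label, line in line_diff("a\nb\nc", "a\nc\nd"):
    print(f"{label:9} {line}")
```

Unlike a plain set comparison, this approach respects line order, so a line that moves to a different position is reported as removed in one place and added in another.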
