How are duplicates identified in DataGroomr

DataGroomr provides three methods for identifying duplicates:

Classic Matching Model – Rule-based matching using Exact and Fuzzy approaches.

Machine Learning Matching Model – AI-assisted or user-trained models.

AI Recommend – Uses generative AI (GenAI) to review duplicates identified by the two previous methods and provide context-aware recommendations (e.g., merge, unmatch, or link).

Note: Normalization (lowercasing, trimming spaces, removing special characters) is applied across all models before comparison.
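As a rough illustration of the normalization steps listed above, here is a minimal Python sketch. The function name `normalize` and the exact definition of "special characters" (anything that is not a letter, digit, underscore, or whitespace) are assumptions made for this example; DataGroomr's actual rules are not documented here and may differ.

```python
import re

def normalize(value):
    """Lowercase, remove special characters, and trim/collapse spaces."""
    lowered = value.lower()
    # Assumption: "special characters" means anything other than
    # letters, digits, underscores, and whitespace.
    cleaned = re.sub(r"[^\w\s]", "", lowered)
    # split() drops leading/trailing whitespace and collapses inner runs.
    return " ".join(cleaned.split())

print(normalize("  ACME,  Inc. "))
print(normalize("O'Brien"))
```

With values normalized this way, "ACME, Inc." and "acme inc" compare as equal before any matching model runs.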

Classic Model

Exact Matching

Compares values character by character.
Best for strict fields such as email, ID, or account numbers.
Fields can be set as required so duplicates are only flagged when these fields match.
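The idea of required fields can be sketched in a few lines of Python. This is only an illustration: the function name `is_exact_duplicate`, the dictionary record layout, and the choice to treat a blank required field as a non-match are all assumptions, not DataGroomr's implementation.

```python
def is_exact_duplicate(a, b, required):
    """Flag a pair only when every required field matches exactly."""
    for field in required:
        left = (a.get(field) or "").strip().lower()
        right = (b.get(field) or "").strip().lower()
        # Assumption: a blank required field never counts as a match.
        if not left or not right or left != right:
            return False
    return True

rec_a = {"email": "jon.smith@example.com", "city": "Boston"}
rec_b = {"email": "Jon.Smith@example.com ", "city": "Cambridge"}
print(is_exact_duplicate(rec_a, rec_b, ["email"]))
print(is_exact_duplicate(rec_a, rec_b, ["email", "city"]))
```

Adding more required fields makes the rule stricter: the pair above matches on email alone but not when city is also required.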

Fuzzy Matching

Calculates similarity scores using algorithms such as edit distance, phonetics, and token overlap.
Records are flagged when scores meet or exceed a defined threshold.
Options include:
- Weighting fields (e.g., Last Name > City).
- Defining synonyms (e.g., Bob = Robert).
- Ignoring common terms (e.g., Inc, LLC).
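The options above can be combined into a single weighted score. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for an edit-distance style similarity; it is not DataGroomr's algorithm. The field names, weights, synonym table, ignored terms, and the 0.85 threshold are hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical configuration, for illustration only.
WEIGHTS = {"last_name": 0.5, "first_name": 0.3, "city": 0.2}
SYNONYMS = {"bob": "robert"}
IGNORED_TERMS = {"inc", "llc"}
THRESHOLD = 0.85

def prepare(value):
    """Lowercase, drop ignored terms, and map synonyms to one form."""
    tokens = value.lower().replace(".", "").replace(",", "").split()
    tokens = [SYNONYMS.get(t, t) for t in tokens if t not in IGNORED_TERMS]
    return " ".join(tokens)

def similarity(a, b):
    """Weighted average of per-field similarity ratios (0.0 to 1.0)."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        left = prepare(a.get(field, ""))
        right = prepare(b.get(field, ""))
        total += weight * SequenceMatcher(None, left, right).ratio()
    return total / sum(WEIGHTS.values())

def is_fuzzy_duplicate(a, b, threshold=THRESHOLD):
    """Flag the pair when the score meets or exceeds the threshold."""
    return similarity(a, b) >= threshold

rec_a = {"first_name": "Bob", "last_name": "Smith", "city": "Boston"}
rec_b = {"first_name": "Robert", "last_name": "Smyth", "city": "Boston"}
print(is_fuzzy_duplicate(rec_a, rec_b))
```

Because Last Name carries the largest weight, a small spelling difference there lowers the score more than the same difference in City would.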

Key Advantages of Classic Matching Models

Simple and quick to implement
Runs well even on large datasets 
Easy to explain, audit, and adjust thresholds
Effective on clean, standardized data

Machine Learning Model

Uses a model trained on your data to assign each candidate pair a confidence score (1–100) indicating how likely the records are to be duplicates.
Training options:
- AI Assistant Training – Uses large language models (LLMs) to auto-train the model based on your data.
- Manual Training – Pairs of potential duplicate records are presented visually to the user, who marks each pair as a duplicate or not.
Effective for large or inconsistent datasets with missing fields, abbreviations, or cross-field patterns (e.g., name + phone + address).

Key Advantages of Machine Learning Models

No need to create or maintain complex rule sets.
Handles incomplete data and variations such as nicknames or abbreviations.
Produces confidence scores to support flexible workflows (e.g., auto-merge vs. manual review).
Reduces false positives and captures duplicates that rule-based methods may miss.
Adapts to your data when retrained.
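To show how a confidence score can drive a workflow such as auto-merge versus manual review, here is a small Python sketch. The function name `route_pair`, the action labels, and the cutoffs of 95 and 70 are hypothetical; suitable values depend on your data and your tolerance for incorrect merges.

```python
def route_pair(confidence, auto_merge_at=95, review_at=70):
    """Map a 1-100 confidence score to a workflow action."""
    if not 1 <= confidence <= 100:
        raise ValueError("confidence must be between 1 and 100")
    if confidence >= auto_merge_at:
        return "auto_merge"      # high confidence: merge without review
    if confidence >= review_at:
        return "manual_review"   # uncertain: queue for a person to check
    return "ignore"              # low confidence: not treated as a duplicate

for score in (98, 80, 40):
    print(score, route_pair(score))
```

Raising `auto_merge_at` sends more pairs to manual review, trading reviewer time for fewer incorrect merges.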

AI Recommend (GenAI)

Uses large language models (LLMs) to understand semantic meaning, context, and structure of records. Instead of comparing raw text or engineered features, GenAI interprets the data like a human would.

Key Advantages of GenAI

Recognizes semantics, for example that “IBM” and “International Business Machines”, or “Jon Smith” and “Jonathan A. Smith”, refer to the same entity even though they share few characters.
Can consider multiple fields together (name + email + address) and infer duplication even when some fields differ.
Can generalize across industries (customers, vendors, products) with minimal retraining.
Benefits from continuous training of the underlying models by LLM providers.
Generates natural language explanations of why two records are considered duplicates.
