DataGroomr provides three methods for identifying duplicates:
Classic Matching Model – Rule-based matching using Exact and Fuzzy approaches.
Machine Learning Matching Model – AI-assisted or user-trained models.
AI Recommend – Uses generative AI (GenAI) to review duplicates identified by the two previous methods and provide context-aware recommendations (e.g., merge, unmatch, or link).
Note: Normalization (lowercasing, trimming spaces, removing special characters) is applied across all models before comparison.
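As a rough illustration of the normalization step described in the note, the sketch below lowercases a value, removes special characters, and trims extra spaces. The function name `normalize` and the exact character rules are assumptions made for this example; they are not DataGroomr's actual implementation.

```python
import re


def normalize(value: str) -> str:
    """Hypothetical normalizer: lowercase, drop special characters, trim spaces."""
    lowered = value.lower()
    # Remove every character that is not a letter, digit, underscore, or whitespace.
    no_special = re.sub(r"[^\w\s]", "", lowered)
    # Collapse runs of whitespace into single spaces and trim both ends.
    return " ".join(no_special.split())


print(normalize("  Acme, Inc. "))  # acme inc
```

After this step, values such as "Acme, Inc." and "acme inc" compare as equal, which is why normalization runs before any of the matching models.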
Classic Model
Exact Matching
Compares values character by character.
Best for strict fields such as email, ID, or account numbers.
Fields can be set as required so duplicates are only flagged when these fields match.
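A minimal sketch of exact matching with required fields might look like the following. The record layout (plain dictionaries), the function name `is_exact_duplicate`, and the choice to treat empty values as non-matching are all assumptions for illustration only.

```python
def is_exact_duplicate(a: dict, b: dict, required_fields: list[str]) -> bool:
    """Flag a pair only when every required field is present and identical."""
    if not required_fields:
        return False  # nothing to compare, so nothing is flagged
    for field in required_fields:
        # Simple inline normalization (assumed): trim and lowercase.
        left = str(a.get(field, "")).strip().lower()
        right = str(b.get(field, "")).strip().lower()
        # An empty value on either side never counts as a match in this sketch.
        if not left or left != right:
            return False
    return True


rec_1 = {"email": "Jane.Doe@example.com", "account_id": "A-1001"}
rec_2 = {"email": "jane.doe@example.com ", "account_id": "A-1001"}
print(is_exact_duplicate(rec_1, rec_2, ["email", "account_id"]))  # True
```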
Fuzzy Matching
Calculates similarity scores using algorithms such as edit distance, phonetics, and token overlap.
Records are flagged when scores meet or exceed a defined threshold.
Options include:
Weighting fields (e.g., Last Name > City).
Defining synonyms (e.g., Bob = Robert).
Ignoring common terms (e.g., Inc, LLC).
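The fuzzy options above can be sketched as a weighted similarity score compared against a threshold. This example uses Python's standard-library `difflib.SequenceMatcher` as a stand-in similarity measure; it is not the algorithm DataGroomr uses, and the weights, synonym table, ignore list, and threshold are hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher

SYNONYMS = {"bob": "robert"}      # hypothetical synonym table
IGNORED_TERMS = {"inc", "llc"}    # hypothetical list of common terms to ignore
WEIGHTS = {"last_name": 0.5, "first_name": 0.3, "city": 0.2}  # hypothetical weights
THRESHOLD = 0.85                  # hypothetical match threshold


def prepare(value: str) -> str:
    """Lowercase, drop ignored terms, and map synonyms to one canonical form."""
    tokens = value.lower().replace(".", "").replace(",", "").split()
    return " ".join(SYNONYMS.get(t, t) for t in tokens if t not in IGNORED_TERMS)


def similarity(a: dict, b: dict) -> float:
    """Weighted average of per-field similarity ratios, each in [0, 1]."""
    score = 0.0
    used_weight = 0.0
    for field, weight in WEIGHTS.items():
        left, right = prepare(a.get(field, "")), prepare(b.get(field, ""))
        if not left or not right:
            continue  # skip fields that are missing on either side
        score += weight * SequenceMatcher(None, left, right).ratio()
        used_weight += weight
    return score / used_weight if used_weight else 0.0


def is_fuzzy_duplicate(a: dict, b: dict, threshold: float = THRESHOLD) -> bool:
    """Flag the pair when the score meets or exceeds the threshold."""
    return similarity(a, b) >= threshold


bob = {"first_name": "Bob", "last_name": "Smith", "city": "Boston"}
robert = {"first_name": "Robert", "last_name": "Smith", "city": "Boston"}
print(is_fuzzy_duplicate(bob, robert))  # True
```

Because the last name carries the largest weight here, a mismatch in that field lowers the score more than a mismatch in the city.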
Key Advantages of Classic Matching Models
- Simple and quick to implement
- Runs well even on large datasets
- Easy to explain, audit, and adjust thresholds
- Effective on clean, standardized data
Machine Learning Model
Uses a model trained on your data to assign each pair of records a confidence score (1–100) indicating how likely the records are to be duplicates.
Training options:
AI Assistant Training – Uses large language models (LLMs) to auto-train the model based on your data.
Manual Training – Pairs of potential duplicate records are presented visually to the user, who identifies which pairs are duplicates.
Effective for large or inconsistent datasets with missing fields, abbreviations, or cross-field patterns (e.g., name + phone + address).
Key Advantages of Machine Learning Models
- No need to create or maintain complex rule sets.
- Handles incomplete data and variations such as nicknames or abbreviations.
- Produces confidence scores to support flexible workflows (e.g., auto-merge vs. manual review).
- Reduces false positives and captures duplicates that rule-based methods may miss.
- Adapts to your data when retrained.
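To show how a confidence score can drive a workflow such as auto-merge versus manual review, here is a small routing sketch. The function name `route_pair`, the cutoff values of 90 and 60, and the action labels are assumptions for this example; the right cutoffs depend on your own data and risk tolerance.

```python
def route_pair(confidence: int, auto_merge_at: int = 90, review_at: int = 60) -> str:
    """Map a 1-100 confidence score to a hypothetical workflow action."""
    if not 1 <= confidence <= 100:
        raise ValueError("confidence must be between 1 and 100")
    if confidence >= auto_merge_at:
        return "auto_merge"      # high confidence: merge without review
    if confidence >= review_at:
        return "manual_review"   # medium confidence: send to a person
    return "no_action"           # low confidence: leave the records alone


for score in (97, 72, 35):
    print(score, route_pair(score))
```

Raising `auto_merge_at` sends more pairs to manual review, trading reviewer time for a lower risk of incorrect merges.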
AI Recommend (GenAI)
Uses large language models (LLMs) to understand semantic meaning, context, and structure of records. Instead of comparing raw text or engineered features, GenAI interprets the data like a human would.
Key Advantages of GenAI
- Recognizes semantic equivalence: “IBM” and “International Business Machines”, or “Jon Smith” and “Jonathan A. Smith”, are understood to refer to the same entity even though they share few characters.
- Can consider multiple fields together (name + email + address) and infer duplication even when some fields differ.
- Can generalize across industries (customers, vendors, products) with minimal retraining.
- Benefits from continuous training by LLM providers.
- Generates natural language explanations of why two records are considered duplicates.