Tracking Record Divergence with Content Hashing

When users duplicate records in a database, tracking whether copies have diverged from their source becomes challenging. A content hash provides a simple mechanism to detect when duplicates no longer match their parent.

The Problem

Consider an email_templates table:

CREATE TABLE email_templates (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  subject VARCHAR(255),
  body TEXT,
  footer TEXT,
  hash VARCHAR(64),
  parent_id INT NULL
);
SQL

User A creates a template (id: 1, parent_id: null). Then User B duplicates this template (id: 2, parent_id: 1). Both records are identical.

User A then updates their template’s subject to “Welcome aboard!”. The parent_id on User B’s record still points to id: 1, but the records no longer match since the subjects are different.

Content Hashing

A hash function converts variable-length input into a fixed-length string. SHA-256 produces a 64-character hexadecimal output. Identical inputs always produce identical hashes; even single character changes produce completely different outputs.

const crypto = require('crypto');

function generateHash(data) {
  return crypto
    .createHash('sha256')
    .update(JSON.stringify(data))
    .digest('hex');
}
JavaScript

Implementation

Whenever a user creates, or updates a record you take all (or some) of that records data and produce a hash based on it which is stored against the record.

async function updateEmailTemplate(id, updates) {
  const hashData = {
    subject: updates.subject,
    body: updates.body,
    footer: updates.footer
  };
  
  const hash = generateHash(hashData);
  
  await db.query(
    'UPDATE email_templates SET subject = ?, body = ?, footer = ?, hash = ? WHERE id = ?',
    [updates.subject, updates.body, updates.footer, hash, id]
  );
}
JavaScript

Detecting Divergence

Now in our example, since we have two records which are linked through id and parent_id we can now determine whether they have diverged from each other by comparing the hashes. Since User A updated the subject heading, their hash will now be different to User B’s copy.

async function hasDiverged(recordId) {
  const record = await db.query(
    'SELECT hash, parent_id FROM email_templates WHERE id = ?',
    [recordId]
  );
  
  if (!record.parent_id) {
    return false; // Not a duplicate
  }
  
  const parent = await db.query(
    'SELECT hash FROM email_templates WHERE id = ?',
    [record.parent_id]
  );
  
  return record.hash !== parent.hash;
}
JavaScript

Considerations

  • Select which fields to include in the hash. Exclude timestamps, usage counts, or other metadata that shouldn’t trigger divergence detection.
  • Store the hash in the database rather than computing it on demand to avoid performance penalties when checking multiple records.
  • For large text fields, consider hashing normalized versions (trimmed, lowercase) to avoid detecting inconsequential changes.

The hash comparison provides O(1) divergence detection without field-by-field comparisons or storing complete record history.

Leave a Reply

Your email address will not be published. Required fields are marked *