JSON Semantic Validator
Bringing Data Validation
Hybrid rules + tiny ML to validate and auto-fix JSON against a schema.
Why I Built This
Every data engineer knows this pain:
You define a perfect JSON Schema.
Someone ships a payload with “yes” instead of true, or “15 Jan 2024” instead of “2024-01-15”.
The pipeline breaks, dashboards go blank, and everyone blames “bad data.”
Traditional validation fails hard: it catches structural errors but not intent.
Humans can see that “twenty five” means 25; a schema can’t.
So the question was:
Could we build a validator that understands what you meant, not just what you typed?
That led to JSON Semantic Validator, a hybrid system where:
Rules ensure structure and determinism.
A small semantic model infers intent and proposes minimal fixes.
Everything is logged, explainable, and re-validated.
Architecture Overview
Rule Layer
Standard JSON Schema validator (jsonschema lib).
Deterministic, repeatable, zero ambiguity.
Produces a list of structured errors.
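As a rough illustration of what "a list of structured errors" can look like, the sketch below runs the jsonschema library's Draft7Validator and flattens each violation into a plain dictionary. The schema, the payload, and the collect_errors helper are assumptions made for this example; they are not taken from the project itself.

```python
from jsonschema import Draft7Validator

# Hypothetical schema, chosen to mirror the examples used in this post.
SCHEMA = {
    "type": "object",
    "properties": {
        "active": {"type": "boolean"},
        "age": {"type": "integer"},
        "status": {"enum": ["pending", "shipped", "cancelled"]},
    },
    "required": ["active", "age", "status"],
}


def collect_errors(instance, schema=SCHEMA):
    """Return one plain dict per schema violation, sorted by field path."""
    validator = Draft7Validator(schema)
    errors = []
    for err in validator.iter_errors(instance):
        errors.append({
            # absolute_path is a sequence of keys/indices leading to the value
            "path": "/".join(str(part) for part in err.absolute_path),
            "keyword": err.validator,  # the failing keyword, e.g. "type" or "enum"
            "value": err.instance,     # the offending value
            "message": err.message,    # human-readable explanation
        })
    return sorted(errors, key=lambda e: (e["path"], e["keyword"]))


# Hypothetical payload containing the kinds of mistakes described above.
payload = {"active": "yes", "age": "twenty five", "status": "pendng"}
errors = collect_errors(payload)
for error in errors:
    print(error["path"], error["keyword"], repr(error["value"]))
```

Each dictionary carries the field path, the failing keyword, and the offending value, which is the kind of context a downstream model would need in order to propose a fix.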
Model Layer
Tiny MiniLM-based semantic classifier fine-tuned on synthetic examples.
Learns patterns like:
“yes” → True
“twenty five” → 25
“15 Jan 2024” → 2024-01-15
“pendng” → “pending”
Exports to ONNX for < 100 ms inference on CPU.
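Once a label has been predicted, each of the patterns above reduces to a small deterministic transformation. The sketch below shows one possible set of such transformations using only the standard library. The names cast_bool, parse_date_iso, and map_enum follow the fix_action labels mentioned in this post, but the function bodies, the accepted spellings, the date formats, the similarity cutoff, and the parse_number_words helper are all assumptions for illustration. This is not the trained classifier; it only shows what applying a predicted fix might involve.

```python
from datetime import datetime
from difflib import get_close_matches

# Assumed spellings; a real system would likely accept more variants.
TRUTHY = {"yes", "y", "true", "1", "on"}
FALSY = {"no", "n", "false", "0", "off"}

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]
UNITS = {word: i for i, word in enumerate(_ONES)}
TENS = {word: 10 * i for i, word in enumerate(_TENS, start=2)}

# Assumed input formats. %b and %B depend on the locale (English names here).
DATE_FORMATS = ("%d %b %Y", "%d %B %Y", "%d/%m/%Y", "%Y-%m-%d")


def cast_bool(value):
    """Map common yes/no spellings to a real boolean."""
    text = str(value).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    raise ValueError(f"cannot read {value!r} as a boolean")


def parse_number_words(value):
    """Convert English number words from zero to ninety-nine into an int."""
    words = str(value).lower().replace("-", " ").split()
    if len(words) == 1 and words[0] in UNITS:
        return UNITS[words[0]]
    if len(words) == 1 and words[0] in TENS:
        return TENS[words[0]]
    if (len(words) == 2 and words[0] in TENS
            and 1 <= UNITS.get(words[1], 0) <= 9):
        return TENS[words[0]] + UNITS[words[1]]
    raise ValueError(f"cannot read {value!r} as a number")


def parse_date_iso(value):
    """Try each known format and return the date as YYYY-MM-DD."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"cannot read {value!r} as a date")


def map_enum(value, allowed, cutoff=0.8):
    """Pick the closest allowed value, or fail if nothing is similar enough."""
    matches = get_close_matches(str(value).lower(), allowed, n=1, cutoff=cutoff)
    if not matches:
        raise ValueError(f"{value!r} is not close to any of {allowed}")
    return matches[0]
```

Keeping the transformations deterministic means the model only has to choose which one applies; the result of applying it stays repeatable and easy to log.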
Hybrid Pipeline
Run rule validation.
If rules fail → collect error contexts.
Model predicts fix_action (e.g. cast_bool, parse_date_iso, map_enum).
Apply fixes → re-validate.
Return rule errors, model predictions, and corrected JSON.
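The control flow of these steps can be sketched in a few lines. In the version below the validator and the predictor are passed in as plain callables, so the flow can be read without the real components: stub_validate stands in for the JSON Schema validator and stub_predict stands in for the semantic model. Those two stubs, the validate_and_fix function, the flat single-level payload, and the cast_int action are all hypothetical and exist only for this illustration.

```python
def cast_bool(value):
    """Hypothetical fixer: map yes/no spellings to a boolean."""
    table = {"yes": True, "true": True, "no": False, "false": False}
    try:
        return table[str(value).strip().lower()]
    except KeyError:
        raise ValueError(f"cannot read {value!r} as a boolean") from None


def cast_int(value):
    """Hypothetical fixer: int() raises ValueError on non-numeric text."""
    return int(str(value).strip())


FIXERS = {"cast_bool": cast_bool, "cast_int": cast_int}
EXPECTED_TYPES = {"active": bool, "age": int}  # stand-in for a schema


def stub_validate(payload):
    """Stand-in for the rule layer: report fields with the wrong type."""
    return [
        {"path": key, "expected": expected.__name__, "value": payload[key]}
        for key, expected in EXPECTED_TYPES.items()
        if key in payload and type(payload[key]) is not expected
    ]


def stub_predict(error):
    """Stand-in for the model layer: choose a fix_action for one error."""
    return {"bool": "cast_bool", "int": "cast_int"}.get(error["expected"])


def validate_and_fix(payload, validate, predict_fix, fixers):
    rule_errors = validate(payload)          # 1. run rule validation
    corrected = dict(payload)                # never mutate the caller's data
    predictions = []
    for error in rule_errors:                # 2. one error context at a time
        key = error["path"]
        action = predict_fix(error)          # 3. predicted fix_action
        fixer = fixers.get(action)
        if key not in corrected or fixer is None:
            continue
        try:
            new_value = fixer(corrected[key])
        except ValueError:
            continue                         # leave the value for a human
        predictions.append({"path": key, "action": action,
                            "before": corrected[key], "after": new_value})
        corrected[key] = new_value           # 4. apply the fix
    remaining = validate(corrected)          # 5. re-validate
    return {"rule_errors": rule_errors, "predictions": predictions,
            "corrected": corrected, "remaining_errors": remaining}


result = validate_and_fix({"active": "yes", "age": "25", "name": "Ada"},
                          stub_validate, stub_predict, FIXERS)
print(result["corrected"])
print(result["predictions"])
```

Re-validating after the fixes means a bad prediction cannot slip through silently: any value the fixer could not repair still shows up in remaining_errors.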
Why Hybrid Wins
Deterministic when it matters, semantic when it counts.
Rule-based systems provide structure and explainability. Models add context, generalization, and adaptivity.
Together, they reduce maintenance overhead and cut false negatives.
Hybrid = fast, interpretable, resilient.
Closing Thoughts
Every part of a DataOps pipeline, from validators and ETL transforms to type checkers and schema enforcers, will eventually embed small semantic models like this. They won’t replace rules; they’ll complete them. Rules provide structure, models offer meaning, and together they foster resilience.
This is what Andrej Karpathy’s Software 3.0 looks like for DataOps: human logic and machine-learned semantics working side by side, each doing what it does best. Rules ensure determinism, models provide context, and hybrid systems deliver both.
The hybrid mindset isn’t just about validation; it’s about engineering with understanding, combining rules for control and models for interpretation to build systems that are both reliable and adaptive. That’s not AI eating software; it’s software learning to reason.