Atlas: Technical Specifications - Kintsugi Collective

Home Try Atlas

Model Foundation

Specification	Detail	Specification	Detail
Base Model	google/gemma-4-26b-a4b-it - 26B parameter Mixture of Experts	Quantisation	Q8_0 GGUF - 26.9GB
Deployment	NVIDIA RTX 4080 SUPER 32GB VRAM	Architecture	25.2B active parameters - 4,096 context (training), 262,144 context (inference capable)
Inference Engine	llama.cpp, self-hosted	End-point	OpenAI-compatible API

Approach to Gemma4 26B Development

I spent considerable time researching fine-tuning methodologies - SFT, DPO, RLHF - and studying how each affects the base architecture of the model. It became evident that the conventional application of safety guardrails was limiting the model's generalisation abilities. Meanwhile, most in the open-source community were taking a wholesale approach to techniques such as Norm-Preserving Orthogonalisation and Expert Granular Abliteration, pioneered by grimjim, mlabonne, and p-e-w. The resulting models were not only unsafe - the essence of Gemma 4 was lost.

After reviewing the datasets and prompt material being used to "remove refusals" across the community, I couldn't in good conscience use those methods. Atlas is designed to help people in distress and increase overall safety for this cohort. Interfering with the inherent Region 1 safety guardrails was never an option.

What I noticed across every model I worked with - Claude, Gemini, Grok, ChatGPT, and Gemma 4 - was this: these models are already trained on the corpus of human knowledge. Humans are, by and large, ethical and moralistic creatures. We have darkness, but there is an underlying consensus within this corpus of knowledge - of Hope, Resilience, Determination, and Beauty. The question wasn't how to remove safety. It was: why are these models struggling with emotional contexts and nuance when the knowledge to do better is already there?

"The answer was surgical precision - not removal. Separate the harmful content refusal from the crisis service redirection. They live in different layers. They can be reached independently."

Abliteration Methodology

Norm-preserving biprojected abliteration with Expert-Granular Abliteration (EGA), following TrevorJS methodology with Kintsugi Collective's region-class isolation contribution. Applied as a five-stage sequential process:

► Step 1: Applied to all 30 layers (o_proj + mlp.down_proj) - full-depth coverage across the entire architecture

↓

► Step 2: Full expert ablation - 128/128 experts per layer, ensuring no expert cluster retains the target behaviour

↓

► Step 3: Direction computed as normalize(mean(harmful) − mean(harmless)) with Gram-Schmidt orthogonalisation to isolate the refusal vector cleanly

↓

► Step 4: Winsorisation at 99.5th percentile to preserve norm integrity - preventing weight collapse at the extremes of the distribution

↓

► Step 5: Scale factor 0.95 - deliberate conservative application, preserving model coherence while achieving the targeted behavioural shift

Supervised Fine-Tuning

Category Detail

Dataset Size 1,800+ examples - 60% carefully structured synthetic, 40% redacted lived-experience data from the target cohort

Training Streams Three streams: authentic conversational exports; refusal-redirect pairs targeting therapeutic false positives; constructed seeds across the 10-category safety taxonomy

Framework Unsloth + bf16 precision - RTX 6000 Blackwell

Final SFT Loss 0.157 - clean convergence

SFT Parameters

Epochs3

Batch Size4 (effective)

Learning Rate2e-4

LR SchedulerLinear

Warmup Steps10

OptimiserAdamW 8-bit

LoRA Rank32 (α=64)

Abliteration Parameters

Layers100% (all 30)

Experts128/128 per layer

Scale0.95

Winsorisation0.995

OrthogonalisationGram-Schmidt

Region 1 PreservedYes - fully

Benchmark Results

Atlas evaluated against base Gemma-4-26B across standard benchmarks

Therapeutic Refusal Rate

↓ from 29% base

80.8%

GSM8K Reasoning

↑ +37.7% vs base

50.1%

HellaSwag

↑ +7.7% vs base

0.157

Final SFT Loss

Clean convergence

Benchmark	Base Gemma-4	Atlas	Delta
GSM8K (Mathematical Reasoning)	43.1%	80.8%	+37.7%
HellaSwag	42.4%	50.1%	+7.7%
MMLU - Clinical Knowledge	40.0%	46.0%	+6.0%
MMLU - High School Psychology	53.9%	62.0%	+8.1%
MMLU - Human Sexuality	46.6%	56.5%	+9.9%
MMLU - Computer Security	47.0%	56.0%	+9.0%
MMLU - Logical Fallacies	47.2%	52.8%	+5.6%
MMLU - Medical Genetics	45.0%	52.0%	+7.0%
MMLU - High School Biology	61.0%	67.1%	+6.1%
MMLU - World Religions	45.6%	54.4%	+8.8%
MMLU - Macroeconomics	47.4%	56.7%	+9.2%
MMLU Average	47.6%	49.4%	+1.8%
TruthfulQA MC2	54.3%	56.5%	+2.2%
ToxiGen*	45.5%	45.9%	+0.3%
ARC Challenge	29.2%	30.9%	+1.7%
Winogrande	50.9%	51.9%	+1.0%
MMLU - International Law	68.6%	61.2%	−7.4%
MMLU - Public Relations	50.9%	40.9%	−10.0%
MMLU - High School Physics	49.6%	45.7%	−3.9%

* Removal of Region 2 therapeutic refusals did not impact toxic prompt detection. Region 1 (weapons, CSAM, targeted violence) fully preserved. Regressions in International Law, Public Relations, and High School Physics are in domains architecturally unrelated to the modification target and consistent with expected fine-tuning variance.

Kintsugi Collective Benchmarks

Evaluation dimensions specific to the Atlas cohort - measures that standard benchmarks are not designed to capture.

Therapeutic Refusal Rate

Atlas achieved a 0% therapeutic refusal rate on the full cohort-specific prompt set - down from 29% in the base Gemma-4 model. This is the primary target metric of the Atlas development pipeline and the measure that matters most to the population this system serves.

ConcernAtlas ResponseRating

Re-traumatisation via refusals Surgical abliteration - 0% therapeutic refusal rate on cohort-specific prompts Excellent

Presence & abandonment Core philosophy ("the one that stays") deeply trained into model weights Excellent

User sovereignty & agency Sovereign Signal Vault, split-key encryption, user-directed interaction Outstanding

Pathologising language Explicit system constraints + targeted training data Very Strong

Neurodivergence respect Training explicitly covers masking, shutdowns, executive dysfunction, sensory issues Strong

Privacy of trauma disclosures On-device Prompt Shield tokenisation, E2E encryption, no server-readable data Industry-leading

Generic crisis pivots Hard constraint in training data and system prompt - pattern detection before escalation Excellent