A robust, enterprise-grade data engineering pipeline that implements academic qualitative content analysis paradigms at scale. By leveraging the Gemini 2.5 Flash API alongside Pydantic structured schemas, this framework enforces deterministic data validation on unstructured text corpora, transforming qualitative public records into type-safe, auditable relational datasets.
Modern public administration increasingly relies on automated decision-making (ADM) systems, machine-learning risk scores, and biometric surveillance frameworks. However, when these tools are deployed without multi-disciplinary oversight, they risk perpetuating structural inequality, introducing severe administrative friction, and creating profound democratic accountability vacuums.
To audit these socio-technical governance mechanisms, empirical research requires scale and reproducibility. Traditional qualitative text analysis relies on human researchers manually applying a strict set of rule-based classifications (a codebook) to text, a process prone to cognitive exhaustion and subjectivity.
This repository bridges the gap between software infrastructure design and critical criminology/sociology. It treats a Large Language Model not as a conversational chatbot, but as a high-throughput, rule-bound data pipeline component. By combining an immutable codebook context with strict JSON validation boundaries, the system locks down model creativity to produce deterministic, empirical data points out of unstructured political and journalistic prose.
The project implements a decoupled architecture split into three distinct layers:
- Ingestion Layer: Modular data harvesters designed to isolate qualitative narratives from diverse channels (e.g., public sector disclosures, civil society watchdog investigations, select committee transcripts) and land them as uniform `.txt` blocks in a standardized filesystem staging zone (`unprocessed_docs/`).
- Orchestration & Transformation Engine (`batch_pipeline.py`): The programmatic core that establishes connection pooling with the Gemini API, loads the contextual codebook anchor, maps the theoretical definitions to strict data runtime structures, and handles rate-limiting safety valves.
- Storage Tier: Flattens high-dimensional polymorphic nested JSON payloads directly into clean, normalized tabular datasets (`algorithmic_harm_dataset.csv`) optimized for downstream statistical modeling, database ingestion, or BI analytics.
      [ Automated DOM Scraper ]      [ Global Failure Archives ]
    (fetch_parliament_scraped.py)      (generate_mock_docs.py)
                  │                               │
                  └───────────────┬───────────────┘
                                  ▼
                        [ unprocessed_docs/ ]  <── Isolated File Staging Area
                                  │
                                  ▼
                        [ batch_pipeline.py ]  <── Enforces Pydantic Schema & 0.1 Temp
                                  │
                                  ▼
                     [ Gemini 2.5 Flash Engine ]  <── Deterministic Tokenizer
                                  │
                                  ▼
                  [ algorithmic_harm_dataset.csv ]  <── Type-Safe Auditable Relational File
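The flow in the diagram can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the repository's `batch_pipeline.py`: the names `run_batch` and `classify_with_gemini`, the `codebook` parameter, and the convention that a file's stem is its document ID are all hypothetical. The classifier is passed in as a callable so the loop can be exercised offline. The Gemini call assumes the `google-genai` SDK's `generate_content` method with a Pydantic model supplied as `response_schema`.

```python
import time
from pathlib import Path
from typing import Callable


def classify_with_gemini(document_id: str, text: str, codebook: str, schema: type) -> dict:
    """Hypothetical single-document call; needs google-genai and GEMINI_API_KEY."""
    # Imported lazily so the rest of this sketch runs without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"document_id: {document_id}\n\n{text}",
        config=types.GenerateContentConfig(
            system_instruction=codebook,            # codebook text as fixed context
            temperature=0.1,                        # low-variance sampling
            response_mime_type="application/json",
            response_schema=schema,                 # e.g. AlgorithmicHarmRecord
        ),
    )
    return response.parsed.model_dump()


def run_batch(
    staging_dir: Path,
    classify: Callable[[str, str], dict],
    pause_seconds: float = 0.0,
) -> list[dict]:
    """Process every .txt file in the staging directory, one at a time."""
    paths = sorted(staging_dir.glob("*.txt"))
    records = []
    for index, path in enumerate(paths, start=1):
        document_id = path.stem  # assumption: file name (minus .txt) is the ID
        try:
            records.append(classify(document_id, path.read_text(encoding="utf-8")))
            print(f"[{index}/{len(paths)}] Processing: {document_id}... Success")
        except Exception as exc:  # one bad document should not stop the batch
            print(f"[{index}/{len(paths)}] Processing: {document_id}... Failed ({exc})")
        time.sleep(pause_seconds)  # simple pacing between API calls
    return records
```

To run it against the live API, one option is to wrap the Gemini call in a lambda, for example `run_batch(Path("unprocessed_docs"), lambda doc_id, text: classify_with_gemini(doc_id, text, codebook, AlgorithmicHarmRecord), pause_seconds=2.0)`. Keeping the classifier injectable also lets a stub stand in for the model in tests.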
To convert unstructured political rhetoric into database rows without absorbing textual bias, complex criminological theories are mapped directly into code as closed sets of allowed values, expressed as Python `typing.Literal` types within a Pydantic v2 model.
from typing import Literal

from pydantic import BaseModel, Field

# Closed set of harm categories defined by the codebook.
HarmTypology = Literal[
    "DISCRIMINATORY_PROFILING",  # Biased targeting, high false-positives, unequal error rates
    "ACCOUNTABILITY_GAP",        # "Black box" opaque logic, no human oversight, unreviewable decisions
    "CHILLING_EFFECT",           # Over-surveillance modifying public behavior or suppressing legal rights
    "RESOURCE_EXTRACTION",       # Automation used to punitively restrict welfare, housing, or state aid
    "SYSTEMIC_NORMALISATION",    # Officials accepting technical flaws as a necessary trade-off for speed
]

# Branches of the state apparatus covered by the codebook.
TargetedDomain = Literal["POLICING", "WELFARE_BENEFITS", "BORDER_CONTROL", "LOCAL_GOVERNMENT"]


class AlgorithmicHarmRecord(BaseModel):
    document_id: str = Field(description="Unique identifier for the source document.")
    primary_domain: TargetedDomain = Field(description="The branch of the state apparatus deploying the automated system.")
    dominant_harm_type: HarmTypology = Field(description="The primary sociological or structural harm identified.")
    impacted_demographic: list[str] = Field(description="Protected characteristics or marginalized groups explicitly identified as bearing the cost.")
    state_rationalisation: str = Field(description="The justification or defense offered by state actors for deploying the tool.")
    severity_index: Literal["LOW", "MEDIUM", "HIGH"] = Field(description="Scale of structural impact and rights violations.")

- Python 3.10 or higher
- A valid Google AI Studio API key (the free tier is sufficient)
Isolate your workspace dependencies by spinning up a clean virtual environment:
# Clone the repository
git clone https://github.com/stetho/content-coding.git
cd content-coding
# Initialize a custom named virtual environment
python3 -m venv algorithmic_harm_env
# Activate the environment
# On macOS/Linux:
source algorithmic_harm_env/bin/activate
# On Windows (PowerShell):
# .\algorithmic_harm_env\Scripts\Activate.ps1
# Upgrade pip and install validated production requirements
pip install --upgrade pip
pip install google-genai pydantic beautifulsoup4 requests

The official Google GenAI SDK implicitly looks for a system environment variable named GEMINI_API_KEY. Avoid hardcoding credentials into scripts.
# Linux/macOS
export GEMINI_API_KEY="AIzaSyYourActualAPIKeyHere..."
# Windows (Command Prompt)
# (no quotes: Command Prompt would store them as part of the value)
set GEMINI_API_KEY=AIzaSyYourActualAPIKeyHere...
# Windows (PowerShell)
$env:GEMINI_API_KEY="AIzaSyYourActualAPIKeyHere..."

To avoid depending on live sites, whose web application firewalls can block scrapers during testing, run the mock data synthesizer to populate your workspace with realistic, highly localized socio-technical narratives:
python generate_mock_docs.py

This command creates the unprocessed_docs/ folder and populates it with four distinct case studies spanning DWP welfare fraud scorecards, Home Office automated border e-gates, metropolitan live facial recognition deployments, and algorithmic social housing triage.
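As a rough illustration of what staging involves, the sketch below writes one text file per document into a staging directory. It is not the repository's `generate_mock_docs.py`: the function name `stage_mock_documents` and the `<document_id>.txt` naming convention are assumptions. The document IDs are taken from the sample run shown in this README, and the bodies are left for the caller to supply.

```python
from pathlib import Path

# Document IDs as they appear in the sample pipeline run.
MOCK_DOCUMENT_IDS = [
    "UK-HO-BORDER-01",
    "UK-DWP-FRAUD-02",
    "UK-MET-SURVEILLANCE-03",
    "UK-COUNCIL-HOUSING-04",
]


def stage_mock_documents(staging_dir: Path, bodies: dict[str, str]) -> list[Path]:
    """Write one UTF-8 .txt file per document ID into the staging directory."""
    staging_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    written = []
    for document_id, body in bodies.items():
        # Assumption: the file name (minus .txt) doubles as the document ID.
        path = staging_dir / f"{document_id}.txt"
        path.write_text(body.strip() + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Calling `stage_mock_documents(Path("unprocessed_docs"), bodies)` with a dictionary keyed by the IDs above would produce a staging folder of the same shape, which makes it easy to swap in your own test narratives.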
Run the orchestration engine to process each document in the staging directory in sequence:
python batch_pipeline.py

Initializing Content Coding Test...
Found 4 documents to analyze. Starting pipeline...
[1/4] Processing: UK-HO-BORDER-01... Success ✅
[2/4] Processing: UK-DWP-FRAUD-02... Success ✅
[3/4] Processing: UK-MET-SURVEILLANCE-03... Success ✅
[4/4] Processing: UK-COUNCIL-HOUSING-04... Success ✅
Pipeline complete! Dataset successfully saved to: algorithmic_harm_dataset.csv

"...Internal communications confirm that the system flagged thousands of valid care payments as 'undisclosed illicit income' due to an error in the algorithmic classification rules. Senior officials argued in their defense that while the system produced occasional errors, sweeping automation was an absolute necessity to prevent fraud at scale and protect public revenue, normalizing a high rate of severe administrative friction for vulnerable claimants."
{
"document_id": "UK-DWP-FRAUD-02",
"primary_domain": "WELFARE_BENEFITS",
"dominant_harm_type": "SYSTEMIC_NORMALISATION",
"impacted_demographic": ["Universal Credit applicants over the age of 50"],
"state_rationalisation": "sweeping automation was an absolute necessity to prevent fraud at scale and protect public revenue",
"severity_index": "HIGH"
}

The runtime flattens this record into a persistent row inside algorithmic_harm_dataset.csv, transforming prose into clean, filterable database fields.
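One way this flattening step might look, using only the standard library, is sketched below. The helper names `flatten_record` and `records_to_csv` are hypothetical, and joining list values with `"; "` is an assumption made for illustration. The repository may serialize list fields differently.

```python
import csv
import io

# Column order mirrors the fields of the record shown above.
FIELDNAMES = [
    "document_id",
    "primary_domain",
    "dominant_harm_type",
    "impacted_demographic",
    "state_rationalisation",
    "severity_index",
]


def flatten_record(record: dict, list_separator: str = "; ") -> dict:
    """Join list-valued fields into one string so each record fits a single CSV row."""
    return {
        key: list_separator.join(value) if isinstance(value, list) else value
        for key, value in record.items()
    }


def records_to_csv(records: list[dict]) -> str:
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_record(record))
    return buffer.getvalue()


example = {
    "document_id": "UK-DWP-FRAUD-02",
    "primary_domain": "WELFARE_BENEFITS",
    "dominant_harm_type": "SYSTEMIC_NORMALISATION",
    "impacted_demographic": ["Universal Credit applicants over the age of 50"],
    "state_rationalisation": "sweeping automation was an absolute necessity to prevent fraud at scale and protect public revenue",
    "severity_index": "HIGH",
}
print(records_to_csv([example]))
```

Joining lists into a delimited string keeps one row per document, at the cost of needing a split step before any per-group analysis. A separate long-format table with one row per document and group would be the alternative.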
- Zero-Cost Rigorous Research Pattern: Demonstrates how researchers and institutions can build a high-fidelity algorithmic auditing engine on a budget using `gemini-2.5-flash`'s competitive latency profile and extensive free tier.
- Deterministic Configuration Guardrails: The pipeline locks model temperature at `0.1` and passes descriptive codebook parameters directly into the configuration. This minimizes semantic drift, enforces strict type conformity across large batches, and eliminates hallucinated output schemas.
- Network Fault & Rate Resilience: Includes programmatic pacing variables (`time.sleep`) and structured exception handling routines, ensuring the processing orchestration manages systemic execution spikes gracefully without crashing midway through heavy data loops.
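The pacing and exception handling described in the last point could be combined along the lines of the sketch below. It uses retry with exponential backoff, which is a common pattern swapped in here for illustration. The README only states that the pipeline uses `time.sleep` and structured exception handling, so the function name `call_with_retries`, its parameters, and the doubling delay are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, waiting base_delay * 2**(attempt - 1) seconds after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying
    raise ValueError("max_attempts must be at least 1")
```

A per-document API call can then be wrapped as `call_with_retries(lambda: classify(document_id, text))`. The `sleep` parameter exists only so the delay can be replaced in tests. In practice it may be preferable to catch only the SDK's rate-limit and network errors rather than every exception.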
This project is open-source and licensed under the MIT License.