Amar Prakash Pandey - ᕦ(ò_óˇ)ᕤ

Functors and Monads: Wrapping Your Data (Not Your Head) Around Them

banner

Why these words sound scarier than they are

If you’ve ever Googled “what is a monad”, you’ve probably landed on a Stack Overflow answer that says something like “a monad is just a monoid in the category of endofunctors” and immediately closed the tab.

Fair.

Here’s the thing: you don’t need category theory to use functors and monads. You need a pipeline, some data, and a small conceptual shift in how you think about wrapping transformations.

I came to these patterns through functional Python — after working through callables, generics, and partial application — and the moment they clicked, I couldn’t unsee them in every data pipeline I’d ever written.


The core idea: a box that knows how to transform itself

Both functors and monads are about one thing: a value inside a container that knows how to apply a function without you having to unwrap it manually every time.

In data engineering, you do this constantly. You have a DataFrame, a list of records, an S3 path wrapped in metadata, a schema alongside its data. You don’t just pass the raw value around — you pass the value plus its context.

Functors and monads are the pattern that makes working with that context composable.


Functors: map without leaving the box

A functor is any container that implements a map method — one that applies a function to the value inside and returns a new container of the same shape.

The simplest data engineering example: a column transformation pipeline.

from typing import TypeVar, Generic, Callable

T = TypeVar('T')
U = TypeVar('U')

class Column(Generic[T]):
    def __init__(self, values: list[T]):
        self.values = values

    def map(self, f: Callable[[T], U]) -> "Column[U]":
        return Column([f(v) for v in self.values])

    def __repr__(self):
        return f"Column({self.values})"

Now you can chain transformations without ever breaking out of the container:

from functools import partial

def cast_to_float(v: str) -> float:
    return float(v)

def clamp(v: float, min_val: float, max_val: float) -> float:
    return max(min_val, min(max_val, v))

clamp_0_100 = partial(clamp, min_val=0.0, max_val=100.0)

raw_scores = Column(["85.3", "102.1", "-4.5", "91.0"])

result = (
    raw_scores
    .map(cast_to_float)
    .map(clamp_0_100)
)

print(result)  # Column([85.3, 100.0, 0.0, 91.0])

Each .map call produces a new Column. The shape is preserved. The transformations are named and composable. You’re not mutating state or writing a loop that accumulates into a new list every time.

This is exactly how you’d think about it in Spark or pandas — df.withColumn(...) is a functor. The DataFrame wraps the data, and the transformation stays inside the wrapper.

The functor laws (briefly, because they matter in practice)

A proper functor follows two rules:

  1. Identity: column.map(lambda x: x) should return the same values — no side effects from map itself.
  2. Composition: column.map(f).map(g) should equal column.map(lambda x: g(f(x))) — chaining maps is the same as composing functions first.

These aren’t just theoretical. If your map mutates state or has side effects, you’ll spend an afternoon debugging a pipeline where the order of operations silently matters.


Where functors break down

Functors are great until your transformation function also returns a container.

def parse_record(raw: str) -> Column[str]:
    fields = raw.strip().split(",")
    return Column(fields)

raw_lines = Column(["a,b,c", "d,e,f"])
result = raw_lines.map(parse_record)
# Column([Column(['a', 'b', 'c']), Column(['d', 'e', 'f'])])

You asked for a Column[str], you got a Column[Column[str]]. Nested containers. This is where monads come in.


Monads: map + flatten, composable context

A monad extends a functor with one extra operation: bind (also called flat_map or >>=).

Bind does what map does, but when the function returns a wrapped value, it flattens it — so you don’t end up with nested containers.

class Column(Generic[T]):
    def __init__(self, values: list[T]):
        self.values = values

    def map(self, f: Callable[[T], U]) -> "Column[U]":
        return Column([f(v) for v in self.values])

    def bind(self, f: Callable[[T], "Column[U]"]) -> "Column[U]":
        result = []
        for v in self.values:
            result.extend(f(v).values)
        return Column(result)

    def __repr__(self):
        return f"Column({self.values})"

Now:

def parse_record(raw: str) -> Column[str]:
    fields = raw.strip().split(",")
    return Column(fields)

raw_lines = Column(["a,b,c", "d,e,f"])
result = raw_lines.bind(parse_record)
print(result)  # Column(['a', 'b', 'c', 'd', 'e', 'f'])

Flat. No nesting. The monad handled the unwrapping.


A more realistic data engineering example: pipeline with error context

Here’s where monads genuinely earn their keep in a data pipeline. Consider processing rows that might fail validation — and you want to carry what happened alongside the data.

from typing import TypeVar, Generic, Callable, Optional
from dataclasses import dataclass, field

T = TypeVar('T')
U = TypeVar('U')

@dataclass
class PipelineResult(Generic[T]):
    value: T
    errors: list[str] = field(default_factory=list)
    dropped: int = 0

    def map(self, f: Callable[[T], T]) -> "PipelineResult[T]":
        return PipelineResult(f(self.value), self.errors, self.dropped)

    def bind(self, f: Callable[[T], "PipelineResult[T]"]) -> "PipelineResult[T]":
        result = f(self.value)
        return PipelineResult(
            result.value,
            self.errors + result.errors,
            self.dropped + result.dropped
        )

    def __repr__(self):
        return f"PipelineResult(rows={len(self.value)}, errors={self.errors}, dropped={self.dropped})"

Now each transformation step can pass data forward and accumulate error context — without try/except noise scattered across every function:

def remove_nulls(records: list[dict]) -> PipelineResult[list[dict]]:
    clean = [r for r in records if r.get("value") is not None]
    dropped = len(records) - len(clean)
    errors = [f"Dropped {dropped} null rows"] if dropped else []
    return PipelineResult(clean, errors, dropped)

def remove_negatives(records: list[dict]) -> PipelineResult[list[dict]]:
    clean = [r for r in records if r["value"] >= 0]
    dropped = len(records) - len(clean)
    errors = [f"Dropped {dropped} negative rows"] if dropped else []
    return PipelineResult(clean, errors, dropped)

def scale_values(records: list[dict]) -> list[dict]:
    return [{**r, "value": r["value"] * 100} for r in records]

raw_data = [
    {"id": 1, "value": 0.85},
    {"id": 2, "value": None},
    {"id": 3, "value": -0.1},
    {"id": 4, "value": 0.92},
]

result = (
    PipelineResult(raw_data)
    .bind(remove_nulls)
    .bind(remove_negatives)
    .map(scale_values)
)

print(result)
# PipelineResult(rows=2, errors=['Dropped 1 null rows', 'Dropped 1 negative rows'], dropped=2)
print(result.value)
# [{'id': 1, 'value': 85.0}, {'id': 4, 'value': 92.0}]

The data flows forward. The errors accumulate alongside. No global state, no mutable error list passed around by reference, no lost context.

This is essentially the Writer monad pattern — and it maps directly to what Spark lineage, dbt test results, or any audit log system does conceptually.


Functor vs Monad: the one-line summary

Container Transformation returns
Functor map a plain value
Monad bind another wrapped value (then flattened)

Use a functor when your transformations are simple value → value functions.

Use a monad when your transformations carry context — errors, logs, optional results, or metadata — and you need that context to compose cleanly across steps.


A closing thought

The reason functors and monads feel abstract is that most introductions teach them through toy examples like Maybe[int] or wrapping a single number. In data engineering, you’re already working with wrapped values all the time — DataFrames, result sets, schema-annotated records, nullable columns.

The pattern isn’t new to you. The vocabulary is.

Once I stopped asking “what is a functor” and started asking “what is the thing I’m wrapping, and what do I want to preserve across transformations”, it became a tool I reached for on purpose.


Lastly, thank you for reading this post. For more awesome posts, you can explore my other articles here, and follow me on Github — amarlearning.

#python #functional-programming #clean-code #refactoring #software-design

Comments