Functors and Monads: Wrapping Your Data (Not Your Head) Around Them
Why these words sound scarier than they are
If you’ve ever Googled “what is a monad”, you’ve probably landed on a Stack Overflow answer that says something like “a monad is just a monoid in the category of endofunctors” and immediately closed the tab.
Fair.
Here’s the thing: you don’t need category theory to use functors and monads. You need a pipeline, some data, and a small conceptual shift in how you think about wrapping transformations.
I came to these patterns through functional Python — after working through callables, generics, and partial application — and the moment they clicked, I couldn’t unsee them in every data pipeline I’d ever written.
The core idea: a box that knows how to transform itself
Both functors and monads are about one thing: a value inside a container that knows how to apply a function without you having to unwrap it manually every time.
In data engineering, you do this constantly. You have a DataFrame, a list of records, an S3 path wrapped in metadata, a schema alongside its data. You don’t just pass the raw value around — you pass the value plus its context.
Functors and monads are the pattern that makes working with that context composable.
Functors: map without leaving the box
A functor is any container that implements a map method — one that applies a function to the value inside and returns a new container of the same shape.
The simplest data engineering example: a column transformation pipeline.
from typing import TypeVar, Generic, Callable

T = TypeVar('T')
U = TypeVar('U')

class Column(Generic[T]):
    def __init__(self, values: list[T]):
        self.values = values

    def map(self, f: Callable[[T], U]) -> "Column[U]":
        return Column([f(v) for v in self.values])

    def __repr__(self):
        return f"Column({self.values})"
Now you can chain transformations without ever breaking out of the container:
from functools import partial

def cast_to_float(v: str) -> float:
    return float(v)

def clamp(v: float, min_val: float, max_val: float) -> float:
    return max(min_val, min(max_val, v))

clamp_0_100 = partial(clamp, min_val=0.0, max_val=100.0)

raw_scores = Column(["85.3", "102.1", "-4.5", "91.0"])
result = (
    raw_scores
    .map(cast_to_float)
    .map(clamp_0_100)
)
print(result)  # Column([85.3, 100.0, 0.0, 91.0])
Each .map call produces a new Column. The shape is preserved. The transformations are named and composable. You’re not mutating state or writing a loop that accumulates into a new list every time.
This is the same way you'd think about it in Spark or pandas: df.withColumn(...) behaves like a functor's map. The DataFrame wraps the data, and the transformation stays inside the wrapper.
The functor laws (briefly, because they matter in practice)
A proper functor follows two rules:
- Identity: column.map(lambda x: x) should return the same values, with no side effects from map itself.
- Composition: column.map(f).map(g) should equal column.map(lambda x: g(f(x))); chaining maps is the same as composing functions first.
These aren’t just theoretical. If your map mutates state or has side effects, you’ll spend an afternoon debugging a pipeline where the order of operations silently matters.
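Both laws are easy to spot-check on the Column class from earlier. The snippet below is a minimal sketch, not a proof: it repeats Column so it runs on its own, and strip_text and to_int are hypothetical helpers invented for the illustration.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Column(Generic[T]):
    """Same Column as above, repeated so this snippet is self-contained."""

    def __init__(self, values: list[T]):
        self.values = values

    def map(self, f: Callable[[T], U]) -> "Column[U]":
        return Column([f(v) for v in self.values])


def strip_text(v: str) -> str:  # hypothetical helper
    return v.strip()


def to_int(v: str) -> int:  # hypothetical helper
    return int(v)


column = Column([" 1", "2 ", " 3 "])

# Identity: mapping the identity function leaves the values unchanged.
identity_holds = column.map(lambda x: x).values == column.values

# Composition: two chained maps equal one map over the composed function.
chained = column.map(strip_text).map(to_int)
composed = column.map(lambda x: to_int(strip_text(x)))
composition_holds = chained.values == composed.values

print(identity_holds, composition_holds)  # True True
```

A check like this only covers the inputs you feed it, but it is cheap enough to keep as a unit test next to any custom container.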
Where functors break down
Functors are great until your transformation function also returns a container.
def parse_record(raw: str) -> Column[str]:
    fields = raw.strip().split(",")
    return Column(fields)

raw_lines = Column(["a,b,c", "d,e,f"])
result = raw_lines.map(parse_record)
# Column([Column(['a', 'b', 'c']), Column(['d', 'e', 'f'])])
You asked for a Column[str], you got a Column[Column[str]]. Nested containers. This is where monads come in.
Monads: map + flatten, composable context
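A monad extends a functor with one extra operation: bind (also called flat_map or >>=). Formally it also needs a way to wrap a plain value, which the constructor covers in these examples.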
Bind does what map does, but when the function returns a wrapped value, it flattens it — so you don’t end up with nested containers.
class Column(Generic[T]):
    def __init__(self, values: list[T]):
        self.values = values

    def map(self, f: Callable[[T], U]) -> "Column[U]":
        return Column([f(v) for v in self.values])

    def bind(self, f: Callable[[T], "Column[U]"]) -> "Column[U]":
        result = []
        for v in self.values:
            result.extend(f(v).values)
        return Column(result)

    def __repr__(self):
        return f"Column({self.values})"
Now:
def parse_record(raw: str) -> Column[str]:
    fields = raw.strip().split(",")
    return Column(fields)

raw_lines = Column(["a,b,c", "d,e,f"])
result = raw_lines.bind(parse_record)
print(result)  # Column(['a', 'b', 'c', 'd', 'e', 'f'])
Flat. No nesting. The monad handled the unwrapping.
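To make the "map + flatten" reading concrete, here is a rough sketch that writes bind literally as a map followed by a flatten. The flatten method is hypothetical and not part of the Column class above; it assumes every element is itself a Column and removes exactly one level of nesting.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Column(Generic[T]):
    def __init__(self, values: list[T]):
        self.values = values

    def map(self, f: Callable[[T], U]) -> "Column[U]":
        return Column([f(v) for v in self.values])

    def flatten(self: "Column[Column[U]]") -> "Column[U]":
        # Hypothetical method: assumes each element is a Column
        # and concatenates their values into one flat Column.
        return Column([inner for col in self.values for inner in col.values])

    def bind(self, f: Callable[[T], "Column[U]"]) -> "Column[U]":
        # bind written as its two parts: map first, then flatten.
        return self.map(f).flatten()


def parse_record(raw: str) -> Column[str]:
    return Column(raw.strip().split(","))


raw_lines = Column(["a,b,c", "d,e,f"])
nested = raw_lines.map(parse_record)   # Column of Columns
flat = raw_lines.bind(parse_record)    # one flat Column

print(flat.values)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

The loop-based bind shown earlier does the same work in a single pass; this version just keeps the two steps visible.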
A more realistic data engineering example: pipeline with error context
Here’s where monads genuinely earn their keep in a data pipeline. Consider processing rows that might fail validation — and you want to carry what happened alongside the data.
from typing import TypeVar, Generic, Callable
from dataclasses import dataclass, field

T = TypeVar('T')
U = TypeVar('U')

@dataclass
class PipelineResult(Generic[T]):
    value: T
    errors: list[str] = field(default_factory=list)
    dropped: int = 0

    def map(self, f: Callable[[T], U]) -> "PipelineResult[U]":
        # Transform the value; carry the accumulated context forward unchanged.
        return PipelineResult(f(self.value), self.errors, self.dropped)

    def bind(self, f: Callable[[T], "PipelineResult[U]"]) -> "PipelineResult[U]":
        # Run the step, then merge its errors and drop count into ours.
        result = f(self.value)
        return PipelineResult(
            result.value,
            self.errors + result.errors,
            self.dropped + result.dropped
        )

    def __repr__(self):
        # Assumes value is a sized collection, such as a list of records.
        return f"PipelineResult(rows={len(self.value)}, errors={self.errors}, dropped={self.dropped})"
Now each transformation step can pass data forward and accumulate error context — without try/except noise scattered across every function:
def remove_nulls(records: list[dict]) -> PipelineResult[list[dict]]:
    clean = [r for r in records if r.get("value") is not None]
    dropped = len(records) - len(clean)
    errors = [f"Dropped {dropped} null rows"] if dropped else []
    return PipelineResult(clean, errors, dropped)

def remove_negatives(records: list[dict]) -> PipelineResult[list[dict]]:
    clean = [r for r in records if r["value"] >= 0]
    dropped = len(records) - len(clean)
    errors = [f"Dropped {dropped} negative rows"] if dropped else []
    return PipelineResult(clean, errors, dropped)

def scale_values(records: list[dict]) -> list[dict]:
    return [{**r, "value": r["value"] * 100} for r in records]

raw_data = [
    {"id": 1, "value": 0.85},
    {"id": 2, "value": None},
    {"id": 3, "value": -0.1},
    {"id": 4, "value": 0.92},
]

result = (
    PipelineResult(raw_data)
    .bind(remove_nulls)
    .bind(remove_negatives)
    .map(scale_values)
)
print(result)
# PipelineResult(rows=2, errors=['Dropped 1 null rows', 'Dropped 1 negative rows'], dropped=2)
print(result.value)
# [{'id': 1, 'value': 85.0}, {'id': 4, 'value': 92.0}]
The data flows forward. The errors accumulate alongside. No global state, no mutable error list passed around by reference, no lost context.
This is essentially the Writer monad pattern — and it maps directly to what Spark lineage, dbt test results, or any audit log system does conceptually.
Functor vs Monad: the one-line summary
| Container | Method | Transformation returns |
|---|---|---|
| Functor | map | a plain value |
| Monad | bind | another wrapped value (then flattened) |
Use a functor when your transformations are simple value → value functions.
Use a monad when your transformations carry context — errors, logs, optional results, or metadata — and you need that context to compose cleanly across steps.
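One way to see that composition pay off: once every step has the same shape, a whole pipeline can be folded over a list of steps. The sketch below is an illustration under assumptions, not a prescribed API. It uses a trimmed copy of PipelineResult with bind only, and lift, run_steps, drop_empty, and upper_all are hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass
class PipelineResult(Generic[T]):
    """Trimmed copy of the class above (bind only) so this runs on its own."""

    value: T
    errors: list[str] = field(default_factory=list)
    dropped: int = 0

    def bind(self, f: Callable[[T], "PipelineResult[T]"]) -> "PipelineResult[T]":
        result = f(self.value)
        return PipelineResult(
            result.value,
            self.errors + result.errors,
            self.dropped + result.dropped,
        )


def lift(f: Callable[[T], T]) -> Callable[[T], PipelineResult[T]]:
    # Hypothetical helper: turn a plain value -> value function
    # into a step that bind accepts, with no errors and no drops.
    return lambda value: PipelineResult(f(value))


def run_steps(start: PipelineResult[T], steps: list) -> PipelineResult[T]:
    # Hypothetical helper: bind each step in order, left to right.
    return reduce(lambda acc, step: acc.bind(step), steps, start)


def drop_empty(names: list[str]) -> PipelineResult[list[str]]:
    clean = [n for n in names if n]
    dropped = len(names) - len(clean)
    errors = [f"Dropped {dropped} empty names"] if dropped else []
    return PipelineResult(clean, errors, dropped)


def upper_all(names: list[str]) -> list[str]:
    return [n.upper() for n in names]


result = run_steps(
    PipelineResult(["ada", "", "grace"]),
    [drop_empty, lift(upper_all)],
)
print(result.value)    # ['ADA', 'GRACE']
print(result.errors)   # ['Dropped 1 empty names']
print(result.dropped)  # 1
```

Here lift plays the role that map plays on the full class: it lets a simple value → value function sit in the same list as the steps that carry context.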
A closing thought
The reason functors and monads feel abstract is that most introductions teach them through toy examples like Maybe[int] or wrapping a single number. In data engineering, you’re already working with wrapped values all the time — DataFrames, result sets, schema-annotated records, nullable columns.
The pattern isn’t new to you. The vocabulary is.
Once I stopped asking “what is a functor” and started asking “what is the thing I’m wrapping, and what do I want to preserve across transformations”, it became a tool I reached for on purpose.
Lastly, thank you for reading this post. For more awesome posts, you can explore my other articles here, and follow me on GitHub: amarlearning.
#python #functional-programming #clean-code #refactoring #software-design