Function Library
DataChain ships built-in functions in dc.func that are translated to SQL and run inside the Query Engine. No Python code runs per row at runtime, so they execute at the speed of the underlying warehouse.
All functions are accessed through dc.func after import datachain as dc.
Distance Functions
Array distance functions for vector search and analytics:
import datachain as dc
chain.mutate(
    cos_dist=dc.func.cosine_distance("embedding", target_embedding),
    euc_dist=dc.func.euclidean_distance("embedding", target_embedding),
)
cosine_distance(column, reference): cosine distance between two vectors
euclidean_distance(column, reference): Euclidean distance
l2_distance(column, reference): L2 (squared Euclidean) distance
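For readers who want the arithmetic behind these names, here is a minimal pure-Python sketch of the textbook definitions of cosine and Euclidean distance. It assumes the engine follows these conventional formulas; the helper names `cosine_distance_py` and `euclidean_distance_py` are hypothetical and are not part of DataChain.

```python
import math

def cosine_distance_py(a, b):
    """Textbook cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance_py(a, b):
    """Textbook Euclidean distance: square root of summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Orthogonal vectors have cosine distance 1; parallel vectors have 0.
print(cosine_distance_py([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance_py([1.0, 0.0], [2.0, 0.0]))
# The classic 3-4-5 triangle.
print(euclidean_distance_py([0.0, 0.0], [3.0, 4.0]))
```

Cosine distance ignores vector magnitude, which is why it is the usual choice for comparing embeddings, while Euclidean distance is sensitive to scale.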
Aggregate Functions
Standard SQL aggregates, usable in group_by:
import datachain as dc
chain.group_by(
    avg_size=dc.func.avg("file.size"),
    total_size=dc.func.sum("file.size"),
    count=dc.func.count(),
    partition_by="column.category",
)
count(): number of rows
sum(col): total of a numeric column
avg(col): arithmetic mean
min(col), max(col): minimum/maximum value
collect(col): gather values into a list
concat(col): concatenate string values
any_value(col): arbitrary value from the group
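As a rough illustration of what a grouped aggregate computes, the sketch below partitions rows by a key and reduces each partition to one summary row in plain Python. The sample rows and the helper `group_stats` are hypothetical; in DataChain the same work is done by the engine in SQL.

```python
from collections import defaultdict

# Hypothetical rows standing in for a chain with a category and a file size.
rows = [
    {"category": "img", "size": 100},
    {"category": "img", "size": 300},
    {"category": "txt", "size": 50},
]

def group_stats(rows):
    """Group sizes by category, then reduce each group to one summary row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row["size"])
    return {
        category: {
            "count": len(sizes),
            "total_size": sum(sizes),
            "avg_size": sum(sizes) / len(sizes),
            "min_size": min(sizes),
            "max_size": max(sizes),
        }
        for category, sizes in groups.items()
    }

stats = group_stats(rows)
print(stats["img"])
print(stats["txt"])
```

Each group yields exactly one output row, which is the key difference from the window functions described later.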
Window Functions
Partitioned analytics without leaving the engine:
import datachain as dc
w = dc.func.window(partition_by="category", order_by="created_at")
chain.mutate(
    row_num=dc.func.row_number().over(w),
    running_rank=dc.func.rank().over(w),
    first_path=dc.func.first("file.path").over(w),
)
window(partition_by=, order_by=): creates a window spec (both required)
rank(): rank with gaps for ties
dense_rank(): rank without gaps
row_number(): sequential row number within partition
first(col): first value in the partition
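The difference between row_number, rank, and dense_rank is easiest to see on a partition that contains ties. The following is a small pure-Python sketch of the standard SQL semantics for a single, already ordered partition; `ranking_columns` is a hypothetical helper written for illustration only.

```python
def ranking_columns(ordered_values):
    """Return (row_number, rank, dense_rank) for each value in one ordered partition."""
    result = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered_values, start=1):
        if value != previous:
            rank = row_number   # rank jumps to the current position, leaving gaps
            dense_rank += 1     # dense_rank advances by exactly one
            previous = value
        result.append((row_number, rank, dense_rank))
    return result

for row in ranking_columns([10, 10, 20]):
    print(row)
```

With the tied values 10 and 10, both rows share rank 1; the next row gets rank 3 but dense_rank 2, while row_number simply counts 1, 2, 3.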
Path Functions
Work natively with storage paths via dc.func.path.*:
import datachain as dc
chain.mutate(
    ext=dc.func.path.file_ext("file.path"),
    stem=dc.func.path.file_stem("file.path"),
    filename=dc.func.path.name("file.path"),
    parent=dc.func.path.parent("file.path"),
)
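To make the four path components concrete, here is a rough analogy using Python's standard posixpath module on a single path string. This is only a sketch: the helper `describe_path` is hypothetical, and returning the extension without its leading dot is an assumption about the engine's behavior that this page does not state.

```python
import posixpath

def describe_path(path):
    """Split a storage path into parent, name, stem, and extension."""
    name = posixpath.basename(path)
    stem, ext = posixpath.splitext(name)
    return {
        "parent": posixpath.dirname(path),
        "name": name,
        "stem": stem,
        # Assumption: the extension is reported without the leading dot.
        "ext": ext.lstrip("."),
    }

print(describe_path("images/cats/tabby.jpg"))
```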
Conditional Functions
SQL-style branching:
import datachain as dc
chain.mutate(
    label=dc.func.case(
        (dc.C("score") > 0.9, "high"),
        (dc.C("score") > 0.5, "medium"),
        else_="low",
    ),
    status=dc.func.ifelse(dc.func.isnone("result"), "pending", "done"),
)
case((cond, val), ..., else_=): multi-branch conditional
ifelse(cond, true_val, false_val): two-branch conditional
isnone(col): null check
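A multi-branch conditional follows the usual SQL CASE rule: branches are checked in order and the first true condition wins. The sketch below mimics that rule for a single row in plain Python; `case_py`, `label_for`, and `status_for` are hypothetical helpers, not DataChain functions.

```python
def case_py(*branches, else_=None):
    """Return the value of the first branch whose condition is true, else the default."""
    for condition, value in branches:
        if condition:
            return value
    return else_

def label_for(score):
    # Branch order matters: 0.95 satisfies both conditions but matches "high" first.
    return case_py((score > 0.9, "high"), (score > 0.5, "medium"), else_="low")

def status_for(result):
    # Two-branch conditional combined with a null check.
    return "pending" if result is None else "done"

print(label_for(0.95), label_for(0.7), label_for(0.2))
print(status_for(None), status_for(42))
```

Because evaluation stops at the first match, conditions should be listed from most to least specific.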
String Functions
String operations via dc.func.string.*:
length(col): string length
split(col, sep): split on separator
replace(col, old, new): substring replacement
regexp_replace(col, pattern, replacement): regex-based replacement
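As a loose point of comparison, the four operations correspond to familiar Python string operations applied to one value. This is a sketch on a hypothetical sample path; the regular expression dialect of the SQL backend may differ from Python's re module, so treat the pattern behavior as an assumption.

```python
import re

# Hypothetical sample value standing in for one row of a string column.
path = "data/2024/report_v2.csv"

length = len(path)                            # string length
parts = path.split("/")                       # split on separator
moved = path.replace("data/", "archive/")     # substring replacement
masked = re.sub(r"\d+", "N", path)            # regex-based replacement

print(length)
print(parts)
print(moved)
print(masked)
```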
Numeric / Bit Functions
bit_and(col_a, col_b), bit_or, bit_xor: bitwise operations
bit_hamming_distance(col_a, col_b): Hamming distance between bit vectors
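The Hamming distance between two bit vectors is the number of positions where their bits differ, which equals the number of set bits in their XOR. A minimal sketch for non-negative integers follows; `bit_hamming_distance_py` is a hypothetical name used for illustration.

```python
def bit_hamming_distance_py(a, b):
    """Count differing bit positions of two non-negative integers via XOR and popcount."""
    return bin(a ^ b).count("1")

a, b = 0b1011, 0b0010
print(a & b)                          # bitwise AND
print(a | b)                          # bitwise OR
print(a ^ b)                          # bitwise XOR
print(bit_hamming_distance_py(a, b))  # number of differing bits
```

This is a common way to compare fixed-width binary fingerprints such as perceptual hashes, where a small distance suggests near-duplicates.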
Hash Functions (ClickHouse only)
sip_hash_64(col): SipHash-2-4 producing a 64-bit integer
int_hash_64(col): integer hash producing a 64-bit integer