Data Engineering Wiki

A wiki that organizes concepts and insights extracted from Data Engineering Weekly articles.

Topics

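Topic	Description
AI in Data Engineering	How AI/LLMs are changing data engineering
Data Infrastructure Patterns	Recurring architecture patterns in large-scale data systems
Data Reliability and Trust	Strategies for ensuring data accuracy, consistency, and trust across pipelines
LLM in Production	Production engineering patterns for LLM training, serving, evaluation, and retrieval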

Concept Wiki (Concept Pages)

Page	Keywords	One-line description
AI Agent	LLM, autonomous, tool calling, eval	LLM-based system that autonomously carries out multi-step tasks
AI Self-Serve Analytics	NL2SQL, self-serve, PM	Pattern in which non-technical users query data directly with AI
Ad Tech Data Infrastructure	CTV, targeting, attribution	Data pipelines for ad targeting, delivery, and measurement
Change Data Capture	CDC, Debezium, WAL, binlog	Pattern that detects and propagates only the changes made in a source DB
Context Engineering	context window, anchoring, memory	Techniques for managing the LLM context window efficiently
Data Scientist Role in AI Era	eval, experiment, data modeling	Why data science fundamentals remain essential in the LLM era
DuckDB	OLAP, embedded, transpilation	Embedded analytical DB optimized for local execution
Feature Store	ML feature, online/offline, serving	Infrastructure that centralizes storage, serving, and management of ML features
LLM Evaluation	eval, metrics, DSPy, pytest	Framework for systematically measuring LLM system quality
LLM-as-Judge	relevance, scoring, automation	Technique that uses an LLM as an evaluator to make judgments at scale
MCP (Model Context Protocol)	protocol, tool calling, registry	Standard protocol for AI agents to access external tools
Medallion Architecture	Bronze, Silver, Gold, lakehouse	Pattern that progressively refines data through three layers
ML Ranking Systems	MMoE, Bayesian, multi-objective	ML ranking systems that optimize multiple objectives
Multimodal Search	video, HNSW, cross-modal	Systems that search across multiple modalities in a unified way
Query Optimization	query plan, statistics, routing	Query planning and routing to reduce query cost
Real-Time Stream Processing	streaming, watermark, Flink, Spark	Systems for processing event streams in real time
Semantic Layer	metrics, business meaning, ECL	Layer that attaches business meaning to data
Spark at Scale	Kubernetes, shuffle, FinOps	Architecture and optimization for operating Spark at scale
Transactional Outbox Pattern	outbox, exactly-once, SQLite	Pattern that guarantees atomicity between a DB transaction and message publishing
Knowledge Representation	ontology, taxonomy, context graph	Framework for giving data structured meaning
Data Governance	PII, compliance, LLM detection	Framework for managing data security, privacy, and compliance
Catalog-Managed Tables	Delta Lake, Iceberg, Unity Catalog, lakehouse	Using the catalog as the authoritative system for table identity, discovery, and access control
Spot Instance Management	Spot, Karpenter, Spark, cost, reliability	Pattern for balancing Spot instance cost savings against interruption risk
Generative Recommender Systems	autoregressive, sequence, RoPE, negative sampling	Recommender systems that process user behavior sequences with autoregressive models
Columnar Execution Engine	Velox, Gluten, SIMD, vectorized, C++	C++-based vectorized execution engine that bypasses JVM overhead
Database Concurrency Control	Blink-tree, B-tree, latch, SMO, PostgreSQL	Locking strategies and algorithms for operating highly concurrent DB indexes
Semi-Structured Data	Parquet Variant, JSON, shredding, offset	Patterns for efficiently storing and querying flexible-schema data
Distributed Systems Reliability	ClickHouse, quota, silent failure, monitoring	Patterns for preventing silent failures and resource exhaustion in distributed systems
RAG	Graph RAG, LAD-RAG, hybrid search	Pattern that compensates for the limits of an LLM's knowledge with external document retrieval
LLM Fine-Tuning	SFT, LoRA, QLoRA, post-training	Techniques for further training a pretrained LLM on domain data
Data Quality and Validation	canary, layered validation, AI query	Framework for validating accuracy and consistency at each stage of a data pipeline
Data Contracts	schema, quality, SLA, semantic	Agreement that specifies structure, quality, and SLA between producers and consumers
Object Storage Evolution	S3, Files, Tables, Vectors, stage-and-commit	Evolution of S3 into a multimodal data platform
A-B Testing and Experimentation	A/B test, power, pre-registration, quality	Ensuring quality in the design, execution, and decision-making of large-scale experiments
Data Engineering FinOps	Spot, S3 shuffle, DuckDB, cost attribution	Strategies for optimizing compute and storage costs of data infrastructure
Silent Failures and Data Integrity	silent failure, canary, exactly-once, monitoring	Detecting and preventing silent failures in which data is lost without errors
Data Mesh and Federation	federation, domain ownership, pointer, ACL	Architecture that splits a monolithic DWH by domain
Schema Evolution	DDL, versioning, backward compatibility, Iceberg	Patterns for safely changing production schemas
Distributed SQL Engine Operations	Trino, Gateway, routing, workload isolation	Routing and operations management for multi-cluster SQL engines

Concept Map

AI Agent ──── MCP (Model Context Protocol)
  │  │
  │  └── Context Engineering ──── Semantic Layer
  │         │
  │         └── Data Scientist Role in AI Era
  │
  ├── LLM-as-Judge ──── LLM Evaluation
  │
  └── AI Self-Serve Analytics ──── Semantic Layer (Context Layer)

Real-Time Stream Processing ──── Change Data Capture
  │                                    │
  ├── Feature Store              Transactional Outbox Pattern
  │
  └── Spark at Scale ──── Query Optimization ──── DuckDB

Medallion Architecture ──── Semantic Layer ──── Knowledge Representation
                                                      │
                                               Context Engineering

ML Ranking Systems ──── Ad Tech Data Infrastructure
  │
  └── Generative Recommender Systems

Multimodal Search ──── ML Ranking Systems
  │
  └── Generative Recommender Systems

Data Governance ──── Data Contracts
  │                      │
  └── Data Mesh and Federation
                         │
                   Catalog-Managed Tables ──── Semi-Structured Data ──── Schema Evolution
                     │
                     └── Distributed Systems Reliability

Columnar Execution Engine ──── Spark at Scale

Database Concurrency Control ──── Query Optimization

Spot Instance Management ──── Distributed Systems Reliability
  │
  └── Data Engineering FinOps ──── DuckDB

Silent Failures and Data Integrity ──── Distributed Systems Reliability
  │
  └── Data Quality and Validation

Distributed SQL Engine Operations ──── Query Optimization

RAG ──── AI Agent
  │         │
  └── Knowledge Representation
            │
      Context Engineering

LLM Fine-Tuning ──── LLM Evaluation
  │
  └── Generative Recommender Systems

Data Quality and Validation ──── Data Contracts
  │                                    │
  ├── Distributed Systems Reliability  └── Semantic Layer
  │
  └── A-B Testing and Experimentation

Object Storage Evolution ──── Catalog-Managed Tables
                                │
                          Spark at Scale

Last updated: 2026-04-14 | 95 articles | 39 wiki pages | 4 topics
