Core Data Engine — Tech Prep

Senior Kotlin/Java Engineer · Synthesized.io (confirmed) · Core Data Engine · via NEWHR

What the company does

Synthesized.io — confirmed

Product: E2E Test Data Management platform (TDK — Test Data Kit). Three core features: data masking (replace PII with realistic fake data), synthetic data generation (create production-like data from scratch), database subsetting (extract consistent smaller copy of production DB).

Clients: Deutsche Bank, UBS, European Commission, Accenture, Clarity AI. Enterprise-grade, used in banking, healthcare, telecom.

Funding: Series A. Investors: IQ Capital, Seedcamp, Deutsche Bank. Offices: London (Shoreditch), New York, Alpharetta GA.

Supported databases: PostgreSQL, MySQL, Oracle, SQL Server, SAP HANA, DB2, MariaDB, Salesforce, Snowflake, BigQuery, Redshift. ~15 total. Deployed via Docker/Kubernetes/CLI. CI/CD integration (Jenkins, GitLab, GitHub Actions).

Config approach: "Data as Code" — YAML config files define masking rules, generation schemas, subsetting criteria. Deterministic, version-controlled, auditable.

Your role (core team): Build the data engine that connects to customer databases, reads schemas via JDBC, analyzes FK relationships, and executes masking/generation/subsetting. Handle: TB-scale data volumes, 500+ table schemas, cyclical FK references, cross-database type mapping, JVM performance optimization.

Key docs concepts to know: Transformers (masking/generation functions), Virtual Foreign Keys (user-defined FKs not in schema), CycleResolutionStrategy (handling circular references), Workers (parallel processing units), PII auto-detection.

Requirements mapping

Strong match

STRONG Java backend — 4 years Java (Amazon, Alfa-Bank). Kotlin at Amazon.

STRONG Concurrency & multithreading — Moscow Exchange trading (microsecond latency), Amazon distributed systems.

STRONG High-load / low-latency — Moscow Exchange (trading system architecture), Alfa-Bank (trading infrastructure).

STRONG Relational databases — 15 years SQL at Alfa-Bank. PostgreSQL, MS SQL Server, Firebird.

STRONG Large data volumes — Moscow Exchange market data, Alfa-Bank trading history.

STRONG Algorithms & data structures — 25 years + interview prep with 120 quiz questions.

STRONG Open-ended research problems — System Architect role at Moscow Exchange.

Areas to prepare

PREP JVM performance tuning — GC tuning, profiling, memory optimization. You've done it but need to articulate specific techniques.

PREP Database internals — how PostgreSQL/MySQL store data, MVCC, query planning, JDBC internals. Beyond just writing SQL.

PREP Data masking/synthetic data concepts — referential integrity, format-preserving encryption, statistical distributions.

NICE ClickHouse — column-oriented database. Read basics for the interview.

Expected interview process

1. HR screening (NEWHR) — Russian, 30 min. Background, motivation, salary, English check.

2. Technical screening — 60-90 min. JVM internals, concurrency, database knowledge, algorithm problem.

3. System design / Deep dive — "Design a data masking engine for a 500-table PostgreSQL database." Architecture, performance, trade-offs.

4. Culture fit / Founder interview — Startup values, ownership, ambiguity tolerance.

JVM Performance — key topic for this role

The job says: "Improve JVM performance: concurrency, memory usage, latency, throughput"

How would you profile a JVM application that's slow?

Step 1: Define "slow" — high latency (p99 > threshold)? Low throughput? High CPU? High memory? GC pauses?

Step 2: Choose the right profiler:

CPU profiling: async-profiler (sampling, low overhead, production-safe) or JFR (Java Flight Recorder, built into JVM). Find which methods consume CPU time. Look for: hot loops, inefficient algorithms, excessive object creation.

Memory profiling: JFR + jcmd for allocation profiling. heap dump + Eclipse MAT for leak analysis. Look for: large object retention, growing collections, unclosed resources.

GC analysis: enable GC logs (-Xlog:gc*), analyze with GCViewer or GCEasy. Look for: long STW pauses, frequent Full GC, promotion failures.

Latency: distributed tracing (OpenTelemetry) for per-request breakdown. Which service/DB call is the bottleneck?
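JFR can also be started from code around a specific workload — a minimal sketch using the `jdk.jfr` API (JDK 11+); the output path is illustrative:

```kotlin
import jdk.jfr.Recording
import java.nio.file.Files
import java.nio.file.Path

// Minimal sketch: wrap a workload in a JDK Flight Recorder recording (JDK 11+).
fun recordWorkload(out: Path, workload: () -> Unit) {
    val recording = Recording()
    recording.start()
    try {
        workload()
    } finally {
        recording.stop()
        recording.dump(out)  // inspect the .jfr file in JDK Mission Control or `jfr print`
    }
}
```

Open the dump in JDK Mission Control to see CPU samples, allocation events, and GC pauses side by side.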

Step 3: Typical findings for a data engine:

Problem: Processing 100M rows takes 30 minutes
Profile shows:
  40% — GC pauses (too many temporary objects)
  30% — JDBC ResultSet processing (row-by-row instead of batch)
  20% — String allocations (building SQL strings)
  10% — actual data transformation logic

Fixes:
  1. Object pooling / flyweight for repeated values
  2. Batch JDBC fetches (fetchSize = 10000)
  3. StringBuilder / prepared statements
  4. Off-heap buffers for large datasets

How would you reduce GC pressure in a data-heavy application?

The problem: processing millions of rows creates millions of temporary objects. GC spends more time collecting than the app spends processing.

Technique 1: Reduce allocations

// BAD — new Strings per row:
while (rs.next()) {
    val masked = "***" + rs.getString("email").substringAfter("@")
    // Creates: substring String, concatenation String, intermediate CharArray
    // 10M rows = 30M String objects = GC nightmare
}

// GOOD — reuse one StringBuilder:
val sb = StringBuilder(256)
while (rs.next()) {
    val email = rs.getString("email")
    sb.clear()
    sb.append("***").append(email, email.indexOf('@'), email.length)
    val masked = sb.toString()  // one allocation per row (plus the driver's getString)
}

Technique 2: Primitive arrays instead of boxed collections

// BAD: List<Int> = List<Integer> on JVM. Each Integer is a 16-byte object.
val ids: List<Int> = rows.map { it.getInt("id") }  // 10M Integer objects

// GOOD: IntArray = int[] on JVM. No boxing.
val ids = IntArray(rowCount)
var i = 0
while (rs.next()) ids[i++] = rs.getInt("id")  // 10M * 4 bytes = 40MB

Technique 3: Off-heap memory (DirectByteBuffer)

// Store large datasets outside GC-managed heap:
val buffer = ByteBuffer.allocateDirect(1024 * 1024 * 100)  // 100MB off-heap
// GC doesn't scan this memory. No pauses.
// Must manage lifecycle manually (like C/C++ malloc/free)

Technique 4: Object pooling

// Reuse objects instead of creating new ones (single-threaded pool shown):
val rowPool = ArrayDeque<MutableRow>(10_000)
fun getRow(): MutableRow = rowPool.removeFirstOrNull() ?: MutableRow()
fun returnRow(row: MutableRow) { row.clear(); rowPool.addLast(row) }

Technique 5: Choose the right GC

For a data engine processing batch data: G1 GC with tuned region size. For latency-sensitive API layer: ZGC. For maximum throughput batch processing: Parallel GC.

What is JDBC fetchSize and why does it matter for large datasets?

Default behavior: PostgreSQL JDBC loads the ENTIRE result set into memory before returning the first row. Query returns 50M rows = 50M rows in JVM heap = OutOfMemoryError.

// DEFAULT — loads everything into memory:
val stmt = connection.createStatement()
val rs = stmt.executeQuery("SELECT * FROM big_table")  // 50M rows in memory!

// WITH fetchSize — streaming, N rows at a time:
connection.autoCommit = false  // required for PostgreSQL streaming
val stmt = connection.createStatement()
stmt.fetchSize = 10000         // fetch 10,000 rows at a time
val rs = stmt.executeQuery("SELECT * FROM big_table")
// Only 10,000 rows in memory. Next batch fetched when needed.

Why autoCommit = false is required: PostgreSQL uses server-side cursors for streaming. Cursors only work within a transaction. With autoCommit = true, each statement is its own transaction = no cursor = full result loaded.

MySQL difference: MySQL uses ResultSet.TYPE_FORWARD_ONLY + fetchSize = Integer.MIN_VALUE for streaming. Different driver, different API.
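That MySQL setup can be sketched as follows (Connector/J conventions; the connection itself is assumed to come from elsewhere):

```kotlin
import java.sql.Connection
import java.sql.ResultSet
import java.sql.Statement

// Sketch: prepare a statement for MySQL streaming (MySQL Connector/J behavior).
fun mysqlStreamingStatement(conn: Connection): Statement {
    val stmt = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY,    // required: you can't scroll a stream
        ResultSet.CONCUR_READ_ONLY
    )
    stmt.fetchSize = Integer.MIN_VALUE  // Connector/J sentinel value = row-by-row streaming
    return stmt
}
```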

For a data engine this is CRITICAL: you're scanning entire databases (hundreds of GB). Without streaming, you can't even start processing. fetchSize controls memory usage vs network round trips — tune it based on row size and available heap.

What JVM flags would you use for a data processing service?

# Memory:
-Xms4g -Xmx4g              # fixed heap, no resize overhead
-XX:MaxDirectMemorySize=2g  # for off-heap buffers

# GC (for batch processing — throughput over latency):
-XX:+UseG1GC                # or -XX:+UseParallelGC for pure batch
-XX:MaxGCPauseMillis=200    # G1 target pause time
-XX:G1HeapRegionSize=16m    # larger regions for large heap
-XX:InitiatingHeapOccupancyPercent=45  # trigger concurrent GC earlier

# GC logging:
-Xlog:gc*:file=gc.log:time,level,tags:filecount=10,filesize=10m

# Performance:
-XX:+UseStringDeduplication  # G1 deduplicates identical strings in heap
-XX:+OptimizeStringConcat    # optimize StringBuilder chains
-XX:+UseCompressedOops       # 32-bit object references in <32GB heap

# Diagnostics:
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heapdump.hprof
-XX:StartFlightRecording=filename=profile.jfr  # JFR profiling (JDK 11+; -XX:+FlightRecorder no longer needed)

For data engine specifically: -XX:+UseStringDeduplication is huge — if you're processing database dumps, many column values are repeated strings (country names, statuses, types). G1 deduplicates them in memory automatically.

Concurrency — Java-level, not just coroutines

This role emphasizes raw Java concurrency more than Kotlin coroutines

How would you parallelize processing of a 100M row table?

Approach 1: Partition by primary key ranges

// Split into N chunks by ID range:
val totalRows = 100_000_000
val chunkSize = 1_000_000  // 1M rows per chunk
val chunks = (0 until totalRows step chunkSize).map { offset ->
    "SELECT * FROM users WHERE id BETWEEN $offset AND ${offset + chunkSize - 1}"
}

// Process chunks in parallel:
val executor = Executors.newFixedThreadPool(8)
val futures = chunks.map { sql ->
    executor.submit<ProcessingResult> {
        processChunk(sql)  // each chunk on its own JDBC connection
    }
}
val results = futures.map { it.get() }  // wait for all

Approach 2: Producer-consumer with BlockingQueue

// One reader thread (fast I/O):
val POISON_PILL = Row.sentinel()  // any dedicated marker instance
val queue = ArrayBlockingQueue<Row>(10_000)
val reader = Thread {
    val rs = stmt.executeQuery("SELECT * FROM users")
    while (rs.next()) {
        queue.put(Row.from(rs))  // blocks if queue full (backpressure!)
    }
    repeat(8) { queue.put(POISON_PILL) }  // one pill per consumer, or the others block forever
}

// Multiple processor threads (CPU-intensive):
val processors = (1..8).map { Thread {
    while (true) {
        val row = queue.take()          // blocks if queue empty
        if (row === POISON_PILL) break
        maskAndWrite(row)
    }
}}

Approach 3: Kotlin coroutines with Channel (modern)

val channel = Channel<Row>(capacity = 10_000)

// Producer:
launch(Dispatchers.IO) {
    while (resultSet.next()) channel.send(Row.from(resultSet))
    channel.close()
}

// N consumers:
repeat(8) {
    launch(Dispatchers.Default) {
        for (row in channel) { maskAndWrite(row) }
    }
}

Key consideration: each approach needs its own JDBC connection for reading/writing. Connection pool size must match parallelism level. Database might be the bottleneck, not the JVM.

What is the ForkJoinPool and when would you use it?

ForkJoinPool is designed for divide-and-conquer parallelism with work-stealing. Each thread has a deque of tasks. When a thread finishes its tasks, it steals from other threads' deques.

// Recursive task — process a large dataset by splitting:
class MaskTask(val data: List<Row>, val threshold: Int = 1000) 
    : RecursiveAction() {
    
    override fun compute() {
        if (data.size <= threshold) {
            // Small enough — process directly
            data.forEach { maskRow(it) }
        } else {
            // Split in half, process both in parallel
            val mid = data.size / 2
            val left = MaskTask(data.subList(0, mid))
            val right = MaskTask(data.subList(mid, data.size))
            invokeAll(left, right)  // fork both, join when done
        }
    }
}

ForkJoinPool(8).invoke(MaskTask(allRows))

When to use: CPU-bound parallel processing of large in-memory datasets. Not for I/O. Java parallel streams use ForkJoinPool internally.

Work-stealing advantage: if one thread's partition finishes early (data is unevenly sized), it steals tasks from busy threads. Automatic load balancing. Regular ThreadPool doesn't do this.

How do you handle concurrent writes to the same table from multiple threads?

Option 1: Batch inserts per thread, no conflicts

// Each thread writes to a DIFFERENT temporary table, merge at end:
Thread 1 → INSERT INTO masked_users_part1 ...
Thread 2 → INSERT INTO masked_users_part2 ...
...
// Final: INSERT INTO masked_users SELECT * FROM masked_users_part1 UNION ALL ...

Option 2: Single writer thread with batching

// Multiple reader/processor threads → Channel → one writer thread
// Writer batches inserts for efficiency:
val batch = mutableListOf<Row>()
for (row in channel) {
    batch.add(row)
    if (batch.size >= 5000) {
        executeBatch(batch)  // JDBC batch insert: 5000 rows at once
        batch.clear()
    }
}
if (batch.isNotEmpty()) executeBatch(batch)  // don't drop the final partial batch

Option 3: COPY command (PostgreSQL fastest path)

// PostgreSQL COPY is 5-10x faster than INSERT:
val copyManager = CopyManager(connection as BaseConnection)
val copyIn = copyManager.copyIn("COPY masked_users FROM STDIN WITH CSV")
for (row in maskedRows) {
    val line = row.toCsvLine().toByteArray()  // toCsvLine(): your CSV serialization
    copyIn.writeToCopy(line, 0, line.size)
}
copyIn.endCopy()
// No SQL parsing, no transaction per row. Pure bulk loading.

For a data engine: COPY is the standard approach for bulk loading into PostgreSQL. MySQL has LOAD DATA INFILE (similar concept). The key insight: INSERT is optimized for OLTP (small transactions). COPY/LOAD DATA is optimized for ETL (bulk loading).

Database Internals — beyond writing SQL

"Work deeply with relational databases, add support for new databases"

How does PostgreSQL store data on disk?

Pages: PostgreSQL stores data in 8KB pages (blocks). Each table is a file on disk, divided into 8KB pages. Each page contains row data (tuples) + free space + item pointers.

Tuples: each row is a tuple with a header (23 bytes: xmin/xmax transaction IDs for MVCC, null bitmap) + actual data. Rows are not sorted — new rows go into any page with free space.

TOAST: values larger than ~2KB are compressed and/or stored in a separate TOAST table. So a TEXT column with 1MB of data doesn't bloat the main table pages.

Indexes: separate files. BTree index: sorted tree of (key, pointer-to-heap-tuple). Index scan: traverse tree → find pointer → fetch tuple from heap page.

Why this matters for data masking:

When you mask a column, you're rewriting tuples. If the masked value is a different size (shorter email, longer name), the tuple size changes. PostgreSQL handles this with HOT (Heap-Only Tuple) updates when possible, but large-scale masking often means rewriting the entire table — essentially CREATE TABLE masked AS SELECT ... FROM original.

What is MVCC and why does it matter?

MVCC (Multi-Version Concurrency Control): readers don't block writers, writers don't block readers. Each transaction sees a consistent snapshot of the data.

How PostgreSQL implements it: each row has xmin (transaction that created it) and xmax (transaction that deleted/updated it). An UPDATE creates a NEW tuple (new xmin) and marks the old one as deleted (sets xmax). Old versions remain on disk until VACUUM removes them.

Why this matters for a data engine:

If you're reading a 10GB table while production writes are happening, MVCC guarantees you see a consistent snapshot — no dirty reads, no torn rows. But: long-running reads prevent VACUUM from cleaning old versions → table bloat. For TB-scale reads, this is a real concern. Solution: read from a replica, or use pg_dump with snapshot isolation.

MySQL difference: InnoDB uses MVCC with undo logs. Old versions stored in undo tablespace, not inline. Different performance characteristics for long-running reads.

How would you read an entire database schema programmatically?

JDBC DatabaseMetaData:

val meta = connection.metaData

// All tables:
val tables = meta.getTables(catalog, schema, "%", arrayOf("TABLE"))
while (tables.next()) {
    val tableName = tables.getString("TABLE_NAME")
}

// Columns for each table:
val columns = meta.getColumns(catalog, schema, tableName, "%")
while (columns.next()) {
    val colName = columns.getString("COLUMN_NAME")
    val colType = columns.getString("TYPE_NAME")  // varchar, int4, etc.
    val nullable = columns.getString("IS_NULLABLE")
}

// Foreign keys — critical for referential integrity:
val fks = meta.getImportedKeys(catalog, schema, tableName)
while (fks.next()) {
    val fkColumn = fks.getString("FKCOLUMN_NAME")
    val pkTable = fks.getString("PKTABLE_NAME")
    val pkColumn = fks.getString("PKCOLUMN_NAME")
    // users.department_id → departments.id
}

// Primary keys, indexes, unique constraints similarly

PostgreSQL information_schema (alternative):

SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns 
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;

For data masking: you need the full schema graph — tables, columns, types, foreign keys, unique constraints, check constraints. Foreign keys tell you: if you mask users.email, you must also mask orders.customer_email if it references the same data. Referential integrity must be preserved.

What is ClickHouse and how does it differ from PostgreSQL?

PostgreSQL = row-oriented (OLTP): stores all columns of a row together. Fast for: SELECT * FROM users WHERE id = 123 (fetch one complete row). Slow for: SELECT AVG(age) FROM users (must read every row's entire data to extract one column).

ClickHouse = column-oriented (OLAP): stores each column separately. Fast for: SELECT AVG(age) FROM users (reads only the age column — much less I/O). Slow for: SELECT * FROM users WHERE id = 123 (must reassemble row from many column files).

ClickHouse specifics: designed for analytical queries on huge datasets (billions of rows). UPDATE/DELETE exist only as heavyweight asynchronous mutations (ALTER TABLE ... UPDATE) — the engine is effectively append-only with merge-on-read. No full ACID transactions. Massively parallel query execution. Per-column compression (similar values compress well).

For the data engine: adding "support for ClickHouse" means understanding that you can't UPDATE rows in place (no masking via UPDATE). You must read data → mask → write to a new table. Also: ClickHouse has different data types, different SQL dialect, different JDBC driver behavior.

Algorithms for large datasets

"Strong CS fundamentals: algorithms and data structures"

How would you find all foreign key dependencies in a database?

This is a graph problem. Tables = nodes. Foreign keys = directed edges. You need topological sort to determine processing order (parent tables first).

// Build dependency graph:
data class Table(val name: String)
data class ForeignKey(val from: Table, val to: Table)  // from = child, to = parent

// Edge parent -> child, so the sort emits parents before children:
val graph = mutableMapOf<Table, MutableList<Table>>()  // adjacency list

for (fk in allForeignKeys) {
    graph.getOrPut(fk.to) { mutableListOf() }.add(fk.from)
}

// Topological sort (Kahn's algorithm):
fun topologicalSort(tables: Set<Table>, graph: Map<Table, List<Table>>): List<Table> {
    // Count incoming edges (number of parents) per table:
    val inDegree = tables.associateWith { 0 }.toMutableMap()
    for (children in graph.values)
        for (child in children) inDegree[child] = inDegree.getValue(child) + 1

    // Start from tables with no parents:
    val queue: Queue<Table> = ArrayDeque(tables.filter { inDegree[it] == 0 })
    val sorted = mutableListOf<Table>()
    while (queue.isNotEmpty()) {
        val node = queue.poll()
        sorted.add(node)
        for (child in graph[node].orEmpty()) {
            inDegree[child] = inDegree.getValue(child) - 1
            if (inDegree[child] == 0) queue.add(child)
        }
    }
    // sorted.size < tables.size => a cycle among the remaining tables
    return sorted  // process tables in this order (parents first)
}

Why topological sort matters: if orders has FK to users, you must mask users first, then orders — so masked user IDs are consistent across tables.

Cycle detection: circular FKs are rare but possible. If topological sort doesn't include all tables → cycle exists. Handle by temporarily dropping constraints, processing, re-adding.

How would you generate unique masked emails for 100M users?

Requirements: each original email maps to a unique masked email. Same original → same masked (deterministic for referential integrity). Format looks realistic. Can't reverse to original.

// Approach: Format-Preserving Hash
val SALT = "per-project-secret"  // without the salt, the mapping could be brute-forced from known emails

fun maskEmail(original: String): String {
    // Deterministic hash — same input always gives same output:
    val hash = MessageDigest.getInstance("SHA-256")
        .digest("$SALT:$original".toByteArray())

    // First 8 bytes rendered base-36 for a readable local part: "a7x2k9m1..."
    val maskedLocal = BigInteger(1, hash.copyOf(8)).toString(36)
    val maskedDomain = "example.com"               // or hash the domain too

    return "$maskedLocal@$maskedDomain"
}

// original: alex@gmail.com → masked: a7x2k9m1@example.com
// Same input always gives same output
// Different inputs give different outputs (collision probability ~1/2^64)

Performance for 100M emails: SHA-256 is ~500ns per hash. 100M * 500ns = 50 seconds. Parallelizable across 8 cores → ~6 seconds. Memory: only one email in memory at a time (streaming).

Uniqueness: collisions on the full SHA-256 output are astronomically unlikely, but truncating to 8 bytes (64 bits) gives a birthday-bound probability of roughly 10^-4 of at least one collision among 100M items — small, but not negligible. If absolute uniqueness is required: check each hash against a HashSet of 64-bit keys (100M Longs ≈ 800MB plus set overhead) and re-derive on collision, or use a Bloom filter for an approximate pre-check.
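The HashSet check can be sketched like this — a hypothetical helper, assuming each distinct email is processed exactly once (cache the original→masked mapping if inputs repeat):

```kotlin
import java.security.MessageDigest

// Hypothetical sketch: absolute uniqueness via a seen-set of 64-bit keys.
fun uniqueMaskedLocal(original: String, salt: String, seen: MutableSet<Long>): String {
    var attempt = 0
    while (true) {
        val digest = MessageDigest.getInstance("SHA-256")
            .digest("$salt:$original:$attempt".toByteArray())
        // Fold the first 8 bytes into one Long key:
        var key = 0L
        for (i in 0 until 8) key = (key shl 8) or (digest[i].toLong() and 0xFF)
        if (seen.add(key)) return java.lang.Long.toUnsignedString(key, 36)
        attempt++  // rare 64-bit collision — derive a different value
        // Caveat: retries make the result order-dependent, so process emails in a
        // stable order to keep the mapping deterministic across runs.
    }
}
```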

How would you create a consistent subset of a database?

Problem: production database has 500 tables, 10TB. Developer needs a 100GB subset for testing that's consistent (no broken foreign keys).

Algorithm — walk the FK graph from seed:

// 1. User selects "seed": I want 1000 users from the 'users' table
// 2. Engine walks FK graph:
//    users(1000 rows) 
//      → orders (all orders for these 1000 users)
//      → order_items (all items for these orders)
//      → products (all products referenced by order_items)
//      → categories (all categories for these products)
//    Recursively follow all FKs

// 3. BFS/DFS traversal:
fun collectSubset(seedTable: String, seedIds: Set<Long>): Map<String, Set<Long>> {
    val collected = mutableMapOf<String, MutableSet<Long>>()
    val queue: Queue<Pair<String, Set<Long>>> = ArrayDeque()
    queue.add(seedTable to seedIds)
    
    while (queue.isNotEmpty()) {
        val (table, ids) = queue.poll()
        // Only follow IDs we haven't seen for this table (also prevents FK cycles looping):
        val newIds = ids - collected.getOrPut(table) { mutableSetOf() }
        if (newIds.isEmpty()) continue
        collected.getValue(table).addAll(newIds)
        
        // Find all tables that reference this table:
        for (fk in foreignKeysReferencingTable(table)) {
            queue.add(fk.childTable to fetchChildIds(fk, newIds))
        }
        
        // Find all tables this table references:
        for (fk in foreignKeysFromTable(table)) {
            queue.add(fk.parentTable to fetchReferencedIds(fk, newIds))
        }
    }
    return collected  // table → set of IDs to include
}

Challenges: cycle detection (table A references B, B references A), size control (subset might grow exponentially through many-to-many relationships), performance (hundreds of queries to resolve FKs).

Domain Knowledge — Data Masking & Synthetic Data

Know the problem space even if you haven't built it before

What is data masking and what types exist?

Data masking = replacing sensitive data with realistic but fake data. So developers/testers can work with production-like data without accessing real PII.

Types:

Static masking: create a masked COPY of the database. Original untouched, copy has fake data. Developers access the copy. Most common for testing environments.

Dynamic masking: mask data on-the-fly at query time. Same database, different views for different users. DBA sees real data, developer sees masked. More complex, used for production access control.

Masking techniques by data type:

Email:    alex.budanov@gmail.com → user_7a3f@masked.com
Phone:    +7 702 365 6813       → +7 702 XXX XXXX
Name:     Alexander Budanov     → James Wilson (from name dictionary)
SSN:      123-45-6789           → 987-65-4321 (format-preserving)
Address:  123 Main St           → 456 Oak Ave (from address generator)
Date:     1985-03-15            → shift by random offset (preserve age distribution)
Salary:   $150,000              → $147,832 (noise within ±10%)

Key requirements: referential integrity (same person_id in all tables maps to same masked identity), format preservation (masked phone looks like a phone), statistical distribution preservation (age distribution in masked data resembles original), deterministic (same input → same output across multiple runs).
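The date-shift row above can be made deterministic per person — a hypothetical sketch deriving the offset from a stable key (e.g. person_id) so that all of one person's dates shift together and intervals are preserved:

```kotlin
import java.security.MessageDigest
import java.time.LocalDate

// Hypothetical sketch: deterministic date shift per person key.
fun shiftDate(date: LocalDate, personKey: String, salt: String = "s3cret", maxDays: Long = 365): LocalDate {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest("$salt:$personKey".toByteArray())
    // Fold the first 8 hash bytes into a Long:
    var k = 0L
    for (i in 0 until 8) k = (k shl 8) or (digest[i].toLong() and 0xFF)
    val offset = Math.floorMod(k, 2 * maxDays + 1) - maxDays  // uniform in [-maxDays, +maxDays]
    return date.plusDays(offset)
}
```

Because the offset depends only on the person key, a birth date and an account-opened date for the same person shift by the same amount — ages and durations survive masking.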

What is format-preserving encryption (FPE)?

FPE encrypts data while preserving its format. A 16-digit credit card number encrypts to another 16-digit number. A 10-character name encrypts to another 10-character name.

Unlike regular encryption (AES produces random-looking bytes), FPE output looks like valid data. Applications that validate format (length, character set, check digits) still work with FPE-encrypted data.

Standards: FF1 and FF3-1 (NIST SP 800-38G). Based on AES internally but with Feistel network structure adapted for small domains.

Use case in data masking: mask a credit card number so it still passes Luhn check. Mask a phone number so it still has correct country code format. The masked data is deterministic (same key + same input = same output) enabling referential integrity across tables.
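To make the Feistel idea concrete — a toy digit-string cipher, explicitly NOT the NIST FF1/FF3-1 construction (no tweak, ad-hoc round function), just an illustration that format can survive encryption:

```kotlin
import java.math.BigInteger
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Toy Feistel network over digit strings — illustrates the FPE idea only.
fun toyFpeDigits(plain: String, key: ByteArray, rounds: Int = 10): String {
    require(plain.length >= 2 && plain.all { it.isDigit() })
    var left = plain.substring(0, plain.length / 2)
    var right = plain.substring(plain.length / 2)
    val mac = Mac.getInstance("HmacSHA256").apply { init(SecretKeySpec(key, "HmacSHA256")) }
    repeat(rounds) { round ->
        val digest = mac.doFinal("$round:$right".toByteArray())
        val mod = BigInteger.TEN.pow(left.length)
        // Round function: add a keyed pseudo-random number to the left half, mod 10^n:
        val newLeft = BigInteger(left).add(BigInteger(1, digest)).mod(mod)
            .toString().padStart(left.length, '0')
        left = right.also { right = newLeft }  // swap halves
    }
    return left + right
}
```

Each round is an invertible mod-10^n addition, so the whole thing is a bijection on same-length digit strings: distinct inputs always produce distinct same-format outputs.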

What is synthetic data generation?

Synthetic data = entirely generated data that has the same statistical properties as real data but contains no real information.

Approaches:

Rule-based: define rules per column. "Name = random from dictionary, Age = normal distribution(35, 10), Email = name + random domain." Simple, fast, predictable.

Statistical: analyze real data distributions and correlations, then generate new data matching those distributions. Preserves: column distributions, correlations between columns, cardinality. Doesn't preserve: individual records.

AI/ML-based: train a generative model (GAN, VAE, diffusion) on real data. Model learns patterns and generates new data. Most realistic but: risk of memorizing real records, harder to guarantee privacy, slower.

For the company's product: likely a combination. Rule-based for simple columns (names, emails), statistical for numerical data (preserving distributions), schema-aware for maintaining FK relationships and constraints.
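A minimal rule-based sketch (one rule per column, a fixed seed for determinism; the names and distribution parameters are illustrative):

```kotlin
import java.util.Random

// Hypothetical rule-based generator: each column has a rule; the seeded
// Random keeps output deterministic across runs.
data class Person(val name: String, val age: Int, val email: String)

val firstNames = listOf("James", "Maria", "Wei", "Aisha", "Olga")
val lastNames = listOf("Wilson", "Garcia", "Chen", "Khan", "Ivanova")

fun generatePeople(count: Int, seed: Long = 42L): List<Person> {
    val rnd = Random(seed)
    return List(count) {
        val name = "${firstNames[rnd.nextInt(firstNames.size)]} ${lastNames[rnd.nextInt(lastNames.size)]}"
        // Age ~ Normal(35, 10), clamped to a plausible range:
        val age = (35 + rnd.nextGaussian() * 10).toInt().coerceIn(18, 90)
        val email = name.lowercase().replace(" ", ".") + "@example.com"
        Person(name, age, email)
    }
}
```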

Key Phrases — for HR and technical rounds

About yourself (English)

"Senior Software Engineer, 25 years of experience. 15 years at Alfa-Bank building trading infrastructure — high-performance systems working with large datasets and relational databases. System Architect at Moscow Exchange — designed trading system architecture, optimized latency and throughput. Amazon — distributed systems with Java and Kotlin on AWS. My core strengths: JVM performance, concurrency, deep database knowledge, and the ability to solve ambiguous engineering problems."

Why this role

"Two things excite me. First, the deep technical challenge — building a core data engine that works with database internals, not just CRUD APIs. I've spent my career at the JVM-meets-database intersection: trading systems, exchange infrastructure, financial data processing. Second, the data masking problem requires understanding referential integrity, schema graphs, and data distributions at a deep level — my 15 years of banking SQL gives me that foundation."

Java/Kotlin experience

"My primary background is C#/.NET — 15+ years. Java — about 4 years at Alfa-Bank and Amazon. Kotlin — used at Amazon. But at the senior level, the language is the easy part. JVM internals — GC tuning, concurrency, profiling — that's the same whether I write Java or Kotlin. The patterns are identical: threading, memory management, JDBC, database internals. I'm productive in Kotlin and actively deepening my skills."

Concurrency experience

"At Moscow Exchange — trading system where microsecond latency mattered. Optimized Hazelcast distributed data structures, designed WebSocket protocol for market data streaming. At Amazon — distributed systems processing events at scale using Kafka and DynamoDB. I understand the full spectrum: from low-level CAS operations and lock-free data structures to high-level patterns like producer-consumer, fork-join, and coroutines."

Database experience

"15 years at Alfa-Bank working with financial databases — query optimization, indexing strategies, transaction management. At Moscow Exchange — Firebird database for trading data. I understand database internals: how PostgreSQL stores data in pages, MVCC, WAL, VACUUM. For this role — connecting to different databases, reading schemas, understanding FK relationships, processing data efficiently — this is exactly my background."

Large data volumes

"At Moscow Exchange — processing real-time market data from all trading instruments simultaneously. At Alfa-Bank — 15 years of trading history, portfolio calculations across thousands of positions. I understand the techniques: streaming with JDBC fetchSize, batch processing, parallel chunk processing, off-heap memory for large datasets, choosing the right GC for the workload."

Research / ambiguous problems

"As System Architect at Moscow Exchange, I regularly faced problems with no ready-made solution — designing custom protocols, optimizing specific bottlenecks, choosing between competing approaches. I'm comfortable with research-heavy work: reading database documentation, profiling to find the real bottleneck, prototyping solutions, and iterating based on measurements."

What NOT to overemphasize

This is NOT a microservices/fintech role. Don't lead with: Kafka, Spring Boot, REST APIs, Docker/K8s, event-driven architecture. These are secondary. Lead with: JVM performance, concurrency, database internals, algorithms, large-scale data processing.

Coding Tasks — practice for technical interview

Problems that mirror real Synthesized.io core engine challenges. Write in Kotlin/Java.

Task 1: Schema Graph — Topological Sort (30 min)

ALGORITHMS GRAPH

Problem: Given a list of tables and foreign key relationships, determine the order to process tables so that parent tables are always processed before children. Detect cycles.

schema_sort.kt — starter code
data class ForeignKey(
    val fromTable: String,   // child table
    val fromColumn: String,
    val toTable: String,     // parent table
    val toColumn: String
)

/**
 * Return tables in topological order (parents first).
 * Throw IllegalStateException if cycle detected.
 *
 * Example:
 *   tables = ["orders", "users", "order_items", "products"]
 *   fks = [
 *     FK(orders.user_id -> users.id),
 *     FK(order_items.order_id -> orders.id),
 *     FK(order_items.product_id -> products.id)
 *   ]
 *   Result: ["users", "products", "orders", "order_items"]
 *   (users and products first — no dependencies. Then orders. Then order_items.)
 */
fun topologicalSort(tables: List<String>, fks: List<ForeignKey>): List<String> {
    // Your code here — Kahn's algorithm (BFS with in-degree)
    TODO()
}

// Bonus: what if there IS a cycle?
// e.g. employees.manager_id -> employees.id (self-reference)
// How would Synthesized handle this? (Hint: CycleResolutionStrategy in their docs)

Task 2: Deterministic Email Masking (20 min)

DATA MASKING HASHING

Problem: Implement a masking function that replaces emails with fake ones. Must be deterministic (same input = same output for referential integrity across tables) and irreversible.

email_masker.kt
import java.security.MessageDigest

/**
 * Mask an email deterministically.
 * Same input always produces the same output.
 * Different inputs produce different outputs.
 * Output looks like a valid email.
 *
 * Example:
 *   mask("alex@gmail.com")    -> "a7x2k9@masked.io"
 *   mask("alex@gmail.com")    -> "a7x2k9@masked.io"  (same!)
 *   mask("maria@yahoo.com")   -> "m3p8q1@masked.io"  (different)
 *
 * Requirements:
 *   - Use SHA-256 with a salt for irreversibility
 *   - Local part: 6 alphanumeric chars derived from hash
 *   - Domain: always "masked.io"
 *   - Must handle null input (return null)
 */
fun maskEmail(email: String?, salt: String = "s3cret"): String? {
    // Your code here
    TODO()
}

// Bonus: How to ensure uniqueness for 100M emails?
// Bonus: What about format-preserving encryption (FPE)?
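One possible sketch (design choices are mine and flagged in comments; this is not Synthesized's actual transformer):

```kotlin
import java.security.MessageDigest

// Salted SHA-256: deterministic per (salt, input), and one-way, so the
// original email cannot be recovered from the output.
fun maskEmailSketch(email: String?, salt: String = "s3cret"): String? {
    if (email == null) return null
    val digest = MessageDigest.getInstance("SHA-256")
        .digest((salt + email.lowercase()).toByteArray())  // lowercasing is a design choice
    val alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    // Map the first 6 digest bytes into the alphabet. The modulo introduces
    // slight bias: fine for masking, not for cryptographic key material.
    val local = digest.take(6)
        .joinToString("") { alphabet[(it.toInt() and 0xFF) % alphabet.length].toString() }
    return "$local@masked.io"
}
```

On the uniqueness bonus: 6 alphanumeric chars give 36^6 ≈ 2.2 billion values, so 100M emails will produce birthday collisions; a longer local part, or a collision-checked mapping table, is the usual fix.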

Task 3: Consistent Database Subset (40 min)

GRAPH BFS DATABASE CORE FEATURE

Problem: Given a database schema with FKs, extract a consistent subset starting from seed rows. All FK references must be satisfied in the subset.

subsetter.kt — this is a core Synthesized feature
data class Table(val name: String)
data class FK(val child: String, val childCol: String, val parent: String, val parentCol: String)

/**
 * Given seed rows from a starting table, find ALL rows across
 * ALL tables that must be included for referential integrity.
 *
 * Example schema:
 *   users(id, name, email)
 *   orders(id, user_id -> users.id, total)
 *   order_items(id, order_id -> orders.id, product_id -> products.id)
 *   products(id, name, price)
 *
 * Seed: users where id IN (1, 2)
 * Result:
 *   users: {1, 2}
 *   orders: {10, 11, 12}       (all orders for users 1,2)
 *   order_items: {100..108}    (all items for those orders)
 *   products: {5, 7, 12}       (all products referenced by those items)
 *
 * Algorithm: BFS from seed, following FKs in BOTH directions:
 *   - Forward (parent to child): users -> orders -> order_items
 *   - Backward (child to parent): order_items -> products
 */
fun collectSubset(
    seedTable: String,
    seedIds: Set<Long>,
    fks: List<FK>,
    fetchRelatedIds: (table: String, column: String, ids: Set<Long>) -> Set<Long>
): Map<String, Set<Long>> {
    // Your code here — BFS through FK graph
    TODO()
}

// Think about:
// 1. How to avoid infinite loops with circular FKs?
// 2. What if the subset grows exponentially? (100 users -> 10M order_items)
// 3. How to parallelize this for performance?
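A BFS sketch of the idea. One deviation from the starter signature, which is my assumption: the single `fetchRelatedIds` callback is split in two, because the two FK directions answer different SQL queries:

```kotlin
data class FK(val child: String, val childCol: String, val parent: String, val parentCol: String)

fun collectSubsetSketch(
    seedTable: String,
    seedIds: Set<Long>,
    fks: List<FK>,
    // "SELECT pk FROM table WHERE column IN (ids)": child rows pointing at given parents
    fetchChildren: (table: String, column: String, parentIds: Set<Long>) -> Set<Long>,
    // "SELECT column FROM table WHERE pk IN (ids)": parents the given child rows reference
    fetchReferenced: (table: String, column: String, childIds: Set<Long>) -> Set<Long>
): Map<String, Set<Long>> {
    val collected = mutableMapOf(seedTable to seedIds.toMutableSet())
    val frontier = ArrayDeque(listOf(seedTable to seedIds))

    // Only never-seen ids are enqueued: this is what terminates the walk
    // even when the FK graph contains cycles.
    fun enqueue(table: String, ids: Set<Long>) {
        val seen = collected.getOrPut(table) { mutableSetOf() }
        val fresh = ids - seen
        if (fresh.isNotEmpty()) {
            seen += fresh
            frontier.add(table to fresh)
        }
    }

    while (frontier.isNotEmpty()) {
        val (table, ids) = frontier.removeFirst()
        for (fk in fks) {
            if (fk.parent == table)   // forward: parent -> children
                enqueue(fk.child, fetchChildren(fk.child, fk.childCol, ids))
            if (fk.child == table)    // backward: child -> parents
                enqueue(fk.parent, fetchReferenced(fk.child, fk.childCol, ids))
        }
    }
    return collected
}
```

Batching whole frontiers per table (rather than one id at a time) keeps the number of round-trip queries proportional to BFS depth, which is the first lever against the exponential-growth question above.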

Task 4: Parallel Table Processor with Backpressure (30 min)

CONCURRENCY JVM PERFORMANCE

Problem: Process rows from a large table using producer-consumer pattern. One reader thread streams from DB, N worker threads transform data. Backpressure when workers are slow.

parallel_processor.kt
import java.sql.Connection
import java.util.concurrent.*

/**
 * Read all rows from a table, transform each row, write to output.
 * Requirements:
 *   - Single reader thread (a JDBC Connection and its ResultSets are not thread-safe)
 *   - N worker threads for CPU-intensive transformation
 *   - Backpressure: if workers are slow, reader pauses (bounded queue)
 *   - Batch writes for output (every 5000 rows)
 *   - Track progress: log every 100,000 rows processed
 *
 * Think about:
 *   - Why fetchSize matters (hint: the PostgreSQL driver buffers the ENTIRE result set in memory unless fetchSize is set and autocommit is off!)
 *   - How to handle worker exceptions without losing data
 *   - Graceful shutdown (SIGTERM while processing 10M rows)
 *   - GC pressure: reuse objects? StringBuilder? IntArray vs List?
 */
fun processTable(
    sourceConn: Connection,
    targetConn: Connection,
    tableName: String,
    workerCount: Int = 8,
    transform: (row: Map<String, Any?>) -> Map<String, Any?>
) {
    // Your code here — BlockingQueue or Channel based
    TODO()
}

// Variant: implement the same with Kotlin coroutines + Channel
// Compare: which approach is simpler? Which uses less memory?
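A sketch of the backpressure mechanics only: I replace the JDBC reader with a Sequence and the batch writer with a caller-supplied sink so the queue pattern stands out. The sink is called from multiple workers, so it must be thread-safe; a production version also needs worker exception handling.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicLong

@Suppress("UNCHECKED_CAST")
fun <T : Any, R> processWithBackpressure(
    source: Sequence<T>,
    workerCount: Int,
    queueCapacity: Int,
    transform: (T) -> R,
    sink: (R) -> Unit
) {
    val queue = ArrayBlockingQueue<Any>(queueCapacity)  // bounded queue = backpressure
    val poison = Any()                                  // shutdown sentinel
    val pool = Executors.newFixedThreadPool(workerCount)
    val processed = AtomicLong()
    repeat(workerCount) {
        pool.execute {
            while (true) {
                val item = queue.take()
                if (item === poison) { queue.put(poison); break }  // pass sentinel on
                sink(transform(item as T))
                val n = processed.incrementAndGet()
                if (n % 100_000 == 0L) println("processed $n rows")  // progress tracking
            }
        }
    }
    // Single producer: put() blocks when the queue is full, pausing the
    // reader until workers catch up. That pause is the backpressure.
    for (row in source) queue.put(row)
    queue.put(poison)
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)
}
```

The re-offered poison sentinel lets one sentinel shut down all N workers; with coroutines, closing a `Channel` replaces both the sentinel and the bounded-queue blocking.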

Task 5: JDBC Schema Reader (25 min)

DATABASE INTERNALS JDBC

Problem: Read complete schema information from a database using JDBC DatabaseMetaData. Build an in-memory model of tables, columns, types, PKs, and FKs.

schema_reader.kt — real Synthesized.io work
import java.sql.Connection

data class Column(val name: String, val type: String, val nullable: Boolean, val size: Int)
data class PrimaryKey(val columns: List<String>)
data class ForeignKey(
    val name: String,
    val columns: List<String>,
    val referencedTable: String,
    val referencedColumns: List<String>
)
data class TableSchema(
    val name: String,
    val columns: List<Column>,
    val primaryKey: PrimaryKey?,
    val foreignKeys: List<ForeignKey>
)

/**
 * Read all tables from a database schema using JDBC metadata API.
 * Return a map of table_name -> TableSchema.
 *
 * Use: connection.metaData.getTables(), .getColumns(), 
 *      .getPrimaryKeys(), .getImportedKeys()
 *
 * Handle: composite PKs (multiple columns), composite FKs,
 *         nullable columns, different SQL types.
 */
fun readSchema(conn: Connection, schemaName: String = "public"): Map<String, TableSchema> {
    // Your code here
    TODO()
}

// Think about:
// 1. Different databases return different type names (int4 vs INTEGER vs INT)
// 2. Some databases don't support getImportedKeys (need information_schema fallback)
// 3. Virtual FKs — relationships not declared in DB but known to users

Task 6: Data Type Transformer Registry (20 min)

DESIGN PATTERNS KOTLIN

Problem: Design a transformer registry — user configures masking/generation rules per column type (email, phone, name, date, number), engine applies the right transformer to each column.

transformer_registry.kt — OOP/functional design
/**
 * Design a transformer system:
 *   - Each transformer handles one type of data (email, phone, name...)
 *   - User configures via YAML which transformer to use per column
 *   - Transformers must be deterministic (same input -> same output)
 *   - New transformers can be added without modifying existing code
 *
 * Example YAML config (Synthesized's actual format):
 *   tables:
 *     users:
 *       columns:
 *         email:    { transformer: "email_masker" }
 *         phone:    { transformer: "phone_masker", format: "+X XXX XXX XXXX" }
 *         name:     { transformer: "name_generator", locale: "en" }
 *         salary:   { transformer: "numeric_noise", variance: 0.1 }
 *         birth_date: { transformer: "date_shift", max_days: 30 }
 */
interface Transformer { // open (not sealed): new transformers can come from other modules/plugins
    fun transform(value: Any?, config: Map<String, Any>): Any?
}

class EmailMasker(private val salt: String) : Transformer {
    override fun transform(value: Any?, config: Map<String, Any>): Any? {
        // implement deterministic email masking
        TODO()
    }
}

class TransformerRegistry {
    // Register and look up transformers by name;
    // apply the right transformer to each column based on config.
    private val transformers = HashMap<String, Transformer>()

    fun register(name: String, transformer: Transformer): Unit = TODO()
    fun transformColumn(table: String, column: String, value: Any?): Any? = TODO()
}

// Think about:
// 1. How to make this extensible (new transformer types via plugins?)
// 2. Thread safety — transformers called from multiple threads
// 3. How does Synthesized handle cross-table consistency?
//    (same email in users.email and orders.contact_email must mask identically)
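A minimal sketch of how the registry could be wired. The names and the immutable-map design are my choices; I use an open interface so implementations can live in other modules (plugins), and transformers hold only immutable state, which makes the registry safe to call from many threads:

```kotlin
import java.security.MessageDigest

interface Transformer {
    fun transform(value: Any?, config: Map<String, Any>): Any?
}

// Deterministic: output depends only on the value and the shared salt.
class EmailMasker(private val salt: String) : Transformer {
    override fun transform(value: Any?, config: Map<String, Any>): Any? {
        val email = value as? String ?: return value   // pass non-strings through
        val digest = MessageDigest.getInstance("SHA-256")
            .digest((salt + email).toByteArray())
        val alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
        val local = digest.take(6)
            .joinToString("") { alphabet[(it.toInt() and 0xFF) % alphabet.length].toString() }
        return "$local@masked.io"
    }
}

// Built once from the parsed YAML config; immutable, so lookups need no locking.
class TransformerRegistry(private val byName: Map<String, Transformer>) {
    fun transform(name: String, value: Any?, config: Map<String, Any> = emptyMap()): Any? =
        (byName[name] ?: error("unknown transformer: $name")).transform(value, config)
}
```

Because EmailMasker keys only on the value and one shared salt, `users.email` and `orders.contact_email` mask to the same output, which is one plausible answer to the cross-table consistency question above.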

Approach for the technical interview

Don't memorize solutions — understand the patterns. They might ask a variation. The key signals they're looking for:

1. Graph thinking: FK relationships form a DAG. Topological sort, cycle detection, BFS/DFS traversal. You dealt with dependency graphs in trading systems.

2. Performance awareness: fetchSize, batch writes, COPY, off-heap memory, GC pressure. Mention these proactively — it shows you've worked with large data.

3. Database internals: JDBC metadata API, differences between PostgreSQL/MySQL/Oracle type systems, MVCC, streaming vs buffered results.

4. Concurrency patterns: producer-consumer, bounded queues, thread safety, graceful shutdown. Raw Java preferred over coroutines for this role.

5. Domain knowledge: referential integrity in masking, deterministic transformations, consistent subsetting. Shows you understand the business problem.